A program is CPU bound if it would go faster if the CPU were faster, i.e. it spends the majority of its time simply using the CPU (doing calculations). A program that computes new digits of π will typically be CPU-bound; it's just crunching numbers.
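For concreteness, here is a toy CPU-bound computation, as a minimal sketch (my illustration, not code from the original answer): approximating π with the Leibniz series. Nearly all the time goes into arithmetic, so a faster CPU directly helps.

    // CPU-bound toy: approximate pi with the Leibniz series.
    // 1 - 1/3 + 1/5 - 1/7 + ... converges to pi / 4.
    #include <cstdio>

    int main(void) {
        double sum = 0.0, sign = 1.0;
        for (long i = 0; i < 500000000L; i++) {
            sum += sign / (2.0 * i + 1.0);
            sign = -sign;
        }
        printf("pi ~= %.9f\n", 4.0 * sum);
        return 0;
    }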


A program is I/O bound if it would go faster if the I/O subsystem was faster. Which exact I/O system is meant can vary; I typically associate it with disk, but of course networking or communication in general is common too. A program that looks through a huge file for some data might become I/O bound, since the bottleneck is then the reading of the data from disk (actually, this example is perhaps kind of old-fashioned these days with hundreds of MB/s coming in from SSDs).
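And a minimal sketch of such an I/O-bound task (my illustration, not code from the original answer): counting the lines of a file, an example that also appears below. One comparison per byte is trivial work, so the disk read dominates the runtime.

    // I/O-bound toy: count the lines of the file given on the command line.
    #include <algorithm>
    #include <fstream>
    #include <iostream>
    #include <iterator>

    int main(int argc, char **argv) {
        if (argc < 2) return 1;
        std::ifstream file(argv[1], std::ios::binary);
        // One comparison per byte: almost no CPU work relative to the read.
        auto lines = std::count(std::istreambuf_iterator<char>(file),
                                std::istreambuf_iterator<char>(), '\n');
        std::cout << lines << "\n";
        return 0;
    }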

CPU bound means the rate at which the process progresses is limited by the speed of the CPU. A task that performs calculations on a small set of numbers, for example multiplying small matrices, is likely to be CPU bound.

I/O bound means the rate at which a process progresses is limited by the speed of the I/O subsystem. A task that processes data from disk, for example counting the number of lines in a file, is likely to be I/O bound.

Memory bound means the rate at which a process progresses is limited by the amount of memory available and the speed of memory access. A task that processes large amounts of in-memory data, for example multiplying large matrices, is likely to be memory bound.

Cache bound means the rate at which a process progresses is limited by the amount and speed of the cache available. A task that processes more data than fits in the cache will be cache bound.

I/O bound would be slower than memory bound would be slower than cache bound would be slower than CPU bound.

The solution to being I/O bound isn't necessarily to get more memory. In some situations, the access algorithm could be designed around the I/O, memory, or cache limitations. See Cache Oblivious Algorithms.

Multi-threading is where it tends to matter the most

In this answer, I will investigate one important use case for distinguishing between CPU- and IO-bound work: writing multi-threaded code.

RAM I/O bound example: Vector Sum

Consider a program that sums all the values of a single vector:

    #define SIZE 1000000000
    unsigned int is[SIZE];
    unsigned int sum = 0;
    size_t i = 0;
    for (i = 0; i < SIZE; i++)
        sum += is[i];

Parallelizing that by splitting the array equally across each of your cores is of limited usefulness on common modern desktops.

For example, on my Ubuntu 19.04, Lenovo ThinkPad P51 laptop with CPU: Intel Core i7-7820HQ (4 cores / 8 threads), RAM: 2x Samsung M471A2K43BB1-CRC (2x 16GiB), I get results like this:

[Plot: runtime of the vector sum benchmark as a function of the number of threads.]

Plot data.

Note however that there is a lot of variance between runs. But I can't increase the array size much further because I'm already at 8GiB, and I'm not in the mood for statistics across multiple runs today. This seemed however like a typical run after doing many manual runs.

Benchmark code:

POSIX C pthread source code used in the graph.

And here is a C++ version that produces analogous results.
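Since those sources are external links, here is a minimal sketch of the idea in C++ with std::thread (my reconstruction, not the exact benchmarked code): each thread sums a private slice, then the partial sums are combined.

    // Build with e.g.: g++ -O3 -std=c++11 -pthread sum.cpp
    #include <cstdio>
    #include <cstdlib>
    #include <thread>
    #include <vector>

    int main(int argc, char **argv) {
        size_t nthreads = (argc > 1) ? std::strtoull(argv[1], NULL, 0) : 1;
        std::vector<unsigned int> is(100000000, 1);
        std::vector<unsigned int> partial(nthreads);
        std::vector<std::thread> threads;
        size_t chunk = is.size() / nthreads;
        for (size_t t = 0; t < nthreads; t++) {
            threads.emplace_back([&, t]() {
                // Sum into a local variable to avoid false sharing on partial.
                unsigned int local = 0;
                size_t end = (t == nthreads - 1) ? is.size() : (t + 1) * chunk;
                for (size_t i = t * chunk; i < end; i++)
                    local += is[i];
                partial[t] = local;
            });
        }
        unsigned int sum = 0;
        for (size_t t = 0; t < nthreads; t++) {
            threads[t].join();
            sum += partial[t];
        }
        printf("%u\n", sum);
        return 0;
    }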

Plot script.

I don"t know sufficient computer style to totally explain the form of the curve, however one point is clear: the computation go not become 8x much faster as naively expected as result of me making use of all my 8 threads! For part reason, 2 and also 3 threads was the optimum, and also adding an ext just provides things lot slower.

Compare this to CPU bound work, which actually does get 8 times faster: What do 'real', 'user' and 'sys' mean in the output of time(1)?

The reason is that all processors share a single memory bus linking to RAM:

    CPU 1   --\    Bus    +-----+
    CPU 2   ---\__________| RAM |
    ...     ---/          +-----+
    CPU N   --/

so the memory bus quickly becomes the bottleneck, not the CPU.

This happens because adding two numbers takes a single CPU cycle, while memory reads take about 100 CPU cycles on 2016 hardware.

So the CPU work done per byte of input data is too small, and we call this an IO-bound process.

The only way to speed up the computation further would be to speed up individual memory accesses with new memory hardware, e.g. multi-channel memory.

Upgrading to a faster CPU clock, for example, would not be very useful.

Other examples

Matrix multiplication is CPU-bound on RAM and GPUs. The input contains:

2 * N**2

numbers, but:

N ** 3

multiplications are done, and that is enough for parallelization to be worth it for practical large N.
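An illustrative naive N x N matrix multiplication (my sketch, not code from the original answer) makes the arithmetic intensity concrete: the triple loop does N**3 multiply-adds over only 2 * N**2 inputs, so the work per input byte grows linearly with N.

    #include <cstddef>
    #include <vector>

    void matmul(const std::vector<double> &a, const std::vector<double> &b,
                std::vector<double> &c, size_t n) {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++) {
                double acc = 0.0;
                for (size_t k = 0; k < n; k++)  // N**3 iterations in total
                    acc += a[i * n + k] * b[k * n + j];
                c[i * n + j] = acc;
            }
    }

    int main() {
        size_t n = 256;
        std::vector<double> a(n * n, 1.0), b(n * n, 1.0), c(n * n);
        matmul(a, b, c, n);
        return c[0] == double(n) ? 0 : 1;
    }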

This is why parallel CPU matrix multiplication libraries like the following exist:

http://www.netlib.org/scalapack/pblas_qref.html
http://icl.cs.utk.edu/magma/software/

Cache usage makes a big difference to the speed of implementations. See for example this didactic GPU comparison example.

See also:

Why can GPU do matrix multiplication faster than CPU?
BLAS equivalent of a LAPACK function for GPUs

Networking is the prototypical IO-bound example.

Even when we send a single byte of data, it still takes a long time to reach its destination.

Parallelizing small network requests like HTTP requests can offer huge performance gains.

If the network is already at full capacity (e.g. downloading a torrent), parallelization can still improve latency (e.g. you can load a web page "at the same time").
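The parallelization pattern looks like this sketch, where fetch_url() is a hypothetical stand-in for a real HTTP client (here faked with a sleep so the sketch runs). Because each request mostly waits on the network, issuing them concurrently overlaps the latencies instead of paying them in sequence.

    #include <chrono>
    #include <future>
    #include <string>
    #include <thread>
    #include <vector>

    // Hypothetical stand-in: fake ~100 ms of network latency per request.
    // A real version would do an HTTP GET with whatever client you use.
    std::string fetch_url(const std::string &url) {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        return "response from " + url;
    }

    int main() {
        std::vector<std::string> urls = {"a.example", "b.example", "c.example"};
        std::vector<std::future<std::string>> futures;
        for (const auto &url : urls)
            // std::launch::async forces one thread per request.
            futures.push_back(std::async(std::launch::async, fetch_url, url));
        for (auto &f : futures)
            f.get(); // total wait is ~one latency, not the sum of all three
        return 0;
    }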

A dummy C++ CPU bound process that takes a number and crunches it a lot:

serial
parallel
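In that spirit, a minimal sketch (my illustration; the exact linked code is not reproduced here): each thread "crunches" one number with pure arithmetic and no shared data, so wall time drops roughly linearly with core count.

    #include <cstdint>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Pure arithmetic, no memory traffic: iterate a 64-bit LCG many times.
    uint64_t crunch(uint64_t x) {
        for (uint64_t i = 0; i < 1000000000ULL; i++)
            x = x * 6364136223846793005ULL + 1442695040888963407ULL;
        return x;
    }

    int main() {
        unsigned nthreads = std::thread::hardware_concurrency();
        if (nthreads == 0) nthreads = 1;
        std::vector<uint64_t> out(nthreads);
        std::vector<std::thread> threads;
        // Parallel version: every thread crunches independently.
        for (unsigned t = 0; t < nthreads; t++)
            threads.emplace_back([&out, t] { out[t] = crunch(t); });
        for (auto &th : threads)
            th.join();
        for (unsigned t = 0; t < nthreads; t++)
            printf("%llu\n", (unsigned long long)out[t]);
        return 0;
    }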

Sorting appears to be CPU bound based on the following experiment: Are C++17 Parallel Algorithms implemented already? which showed a 4x performance improvement for parallel sort, but I would like to have a more theoretical confirmation as well. A minimal form of that parallel sort is sketched below.
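    // C++17 parallel sort in minimal form (assuming a conforming standard
    // library; with GCC/libstdc++ you typically need -std=c++17 and -ltbb).
    #include <algorithm>
    #include <execution>
    #include <random>
    #include <vector>

    int main() {
        std::vector<double> v(100000000);
        std::mt19937_64 gen(42);
        std::uniform_real_distribution<double> dist(0.0, 1.0);
        for (auto &x : v)
            x = dist(gen);
        // std::execution::par allows the implementation to use many threads.
        std::sort(std::execution::par, v.begin(), v.end());
        return 0;
    }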

The well-known CoreMark benchmark from EEMBC explicitly checks how well a suite of problems scales; its sample benchmark results show this clearly.

How to find out if you are CPU or IO bound

Non-RAM IO bound like disk, network: ps aux, then check if CPU% / 100 < number of threads. If yes, you are IO bound, e.g. blocking reads are just waiting for data and the scheduler is skipping the process. Then use further tools like sudo iotop to decide which IO is the problem exactly.

Or, if execution is quick and you parametrize the number of threads, you can see easily from time that performance improves as the number of threads increases for CPU bound work: What do 'real', 'user' and 'sys' mean in the output of time(1)?

RAM-IO bound: harder to tell, as RAM wait time is included in CPU% measurements; see also:

How to check if app is cpu-bound or memory-bound?
https://askubuntu.com/questions/1540/how-can-i-find-out-if-a-process-is-cpu-memory-or-disk-bound

Some options:

Intel Advisor Roofline (non-free): https://software.intel.com/en-us/articles/intel-advisor-roofline (archive): "A Roofline chart is a visual representation of application performance in relation to hardware limitations, including memory bandwidth and computational peaks."

GPUs

GPUs have an IO bottleneck when you first transfer the input data from the regular CPU-readable RAM to the GPU.

Therefore, GPUs can only be better than CPUs for CPU bound applications.

Once the data is transferred to the GPU however, it can operate on those bytes faster than the CPU can, because the GPU:

has more data localization than most CPU systems, and so data can be accessed faster for some cores than others

exploits data parallelism and sacrifices latency by simply skipping over any data that is not ready to be operated on immediately.

Since the GPU has to operate on large parallel input data, it is better to just skip to the next data that might be available instead of waiting for the current data to become available and blocking all other operations like the CPU mostly does.

Therefore the GPU can be much faster than a CPU if your application:

can be highly parallelized: different chunks of data can be treated independently from one another at the same time
requires a large enough number of operations per input byte (unlike e.g. vector addition which does one addition per byte only)
there is a large number of input bytes

These design choices originally targeted the application of 3D rendering, whose main steps are as shown at What are shaders in OpenGL and what do we need them for?

vertex shader: multiplying a bunch of 1x4 vectors by a 4x4 matrix
fragment shader: calculating the color of each pixel of a triangle based on its relative position within the triangle

and so we conclude that those applications are CPU-bound.
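An illustrative CPU-side analogue of the vertex shader step (my sketch, not shader code from the linked question): transform many vertices by one 4x4 matrix. Every vertex is independent, which is exactly the shape of work GPUs are built for.

    #include <array>
    #include <vector>

    using Vec4 = std::array<float, 4>;
    using Mat4 = std::array<float, 16>; // row-major 4x4

    Vec4 transform(const Mat4 &m, const Vec4 &v) {
        Vec4 out{};
        for (int r = 0; r < 4; r++)
            for (int c = 0; c < 4; c++)
                out[r] += m[r * 4 + c] * v[c];
        return out;
    }

    int main() {
        Mat4 mvp{}; // identity matrix
        for (int i = 0; i < 4; i++)
            mvp[i * 4 + i] = 1.0f;
        std::vector<Vec4> vertices(1000000, Vec4{1.0f, 2.0f, 3.0f, 1.0f});
        // Each iteration is independent: trivially parallel per vertex.
        for (auto &v : vertices)
            v = transform(mvp, v);
        return 0;
    }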

With the advent of programmable GPGPU, we can observe many GPGPU applications that serve as examples of CPU bound operations:

Image processing with GLSL shaders?


Local image processing operations such as a blur filter are highly parallel in nature.

Is it possible to build a heatmap from point data at 60 times per second?

Plotting of heatmap graphs if the plotted function is complex enough.


https://www.youtube.com/watch?v=fE0P6H8eK4I "Real-Time Fluid Dynamics: CPU vs GPU" by Jesús Martín Berlanga

Solving partial differential equations such as the Navier-Stokes equations of fluid dynamics:

highly parallel in nature, because each point only interacts with its neighbours
there tend to be enough operations per byte

See also:

Why are we still using CPUs instead of GPUs?
What are GPUs bad at?
https://www.youtube.com/watch?v=_cyVDoyI6NE "CPU vs GPU (What's the Difference?) - Computerphile"

CPython Global Interpreter Lock (GIL)

As a quick case study, I want to point out the Python Global Interpreter Lock (GIL): What is the global interpreter lock (GIL) in CPython?

This CPython implementation detail prevents multiple Python threads from efficiently running CPU-bound work in parallel. The CPython docs say:

CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing or concurrent.futures.ProcessPoolExecutor. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.


Therefore, here we have a case where threads are not suitable for CPU-bound work, but are suitable for I/O-bound work.