What is the best CPU and GPU combination?

GPGPU: What is stopping the convergence of GPU and CPU?

  • A GPU contains multiple (~8-16) compute units. A compute unit contains many (~32) processing elements, each with its own registers. Each element is, roughly, a thread, but all elements within the same unit must execute the same code (SIMD). It follows that a GPU can simulate a CPU by using a single processing element in a single compute unit. This is hugely inefficient (no parallelism), but otherwise OK. So why can't a GPU run an operating system, using a single compute unit / processing element as a CPU? Part of the answer is how memory access is managed. A CPU uses a lot of logic to predict, prefetch and pipeline data, while a GPU does not (instead, it simply switches to another thread while it waits). But if that is the only important difference, then convergence doesn't seem so hard - "just" a matter of bolting the memory management from a CPU onto a GPU and enabling it in certain contexts. So what other problems exist? Alternative ways of asking this question (I believe!): Why did Larrabee fail (i.e. technically, why is this a hard problem)? Why does Nvidia need ARM cores for Denver rather than just using a single compute unit / processing element as described above? Why are CPUs and GPUs two different classes of things, rather than points on a continuous sliding scale that describes the degree of parallelism required in a processor? Why not have a GPU whose compute units contain 1, 2, 4, 8, ... processing elements, with progressively less prediction / pipelining logic as the parallelism increases?
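
To pin down the terminology, here is a rough, illustrative mapping of the question's "compute units" and "processing elements" onto CUDA. This is my own sketch with a made-up kernel name; the <<<1, 1>>> launch is the "GPU pretending to be a CPU" case described in the question:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Rough mapping of the question's terms onto CUDA: a "compute unit" is a
// streaming multiprocessor, a "processing element" is a lane running one
// thread, and all 32 lanes of a warp execute the same instruction (SIMD).
__global__ void who_am_i() {
    printf("block %d, thread %d (warp lane %d)\n",
           blockIdx.x, threadIdx.x, threadIdx.x % 32);
}

int main() {
    who_am_i<<<1, 1>>>();   // one processing element in one compute unit: "CPU mode"
    cudaDeviceSynchronize();

    who_am_i<<<4, 64>>>();  // 4 blocks x 64 threads: the hardware used as intended
    cudaDeviceSynchronize();
    return 0;
}
```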

  • Answer:

    In many ways GPUs currently exist at a level of software development closer to what we had in the 1960s and 1970s, when very talented developers focused huge amounts of resources on optimizing their code for specific hardware architectures. They could then achieve stellar results (by the measure of the day). Eventually, people started to depend more and more on compilers to do that for them, and started to assume that the hardware was "fast enough" to mask inefficiencies in their code. GPUs do not have such capabilities.

    Next, the reason GPUs are so much "faster" than traditional CISC architectures is that they have absolved themselves of a lot of things that a regular developer assumes exist, even if she doesn't know it: things like cache coherency, branch prediction, and out-of-order execution. This is similar to the point above. GPUs are fast because they take simple architectures and stamp hundreds of them onto a chip. In some odd little way, it's like the ideas behind Danny Hillis' paper and the Connection Machine (CM-1/CM-2), where thousands of simple CPUs run in parallel. The problem is, people discovered it's really hard to write "general purpose" applications on that kind of platform. Sure, it can blow away other systems at certain things, like CFD or weather simulations, but that's not what most applications are. In addition, GPUs, by their very nature, often have very rigid constraints on code size and, to keep them fed, a locality-of-reference requirement in the way data is managed.

    So the near future, to me, if I were a "betting man", is a tighter coupling of the two pieces: a continuation of the current CPU architectures with a massively parallel (by comparison) adjunct GPU - a form of asymmetric multiprocessing that has been popular at various points in the past. You continue to use the general-purpose CPU for most code, and the GPU for what it's good at. The start of this is witnessed in things like Apple's Core Image and compositing engine, which offload a lot of the heavy lifting onto the GPU but not the overall coordination, or Grand Central Dispatch, where you write code and the OS then determines where best to run it. All architectures come with trade-offs, so no one architecture is ever going to solve all problems optimally. A renaissance of architectures is just what this industry needs, not just another single architecture.
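
As a concrete illustration of the asymmetric "CPU coordinates, GPU does the heavy lifting" pattern described above, here is a minimal CUDA sketch. The kernel, buffer sizes and pixel operation are hypothetical stand-ins, not anything from Core Image or Grand Central Dispatch:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Data-parallel "heavy lifting": brighten every pixel of a frame.
__global__ void brighten(unsigned char *pixels, int n, int delta) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        pixels[i] = (unsigned char)min(255, pixels[i] + delta);
}

int main() {
    const int n = 1920 * 1080 * 4;                 // one RGBA frame
    unsigned char *host = (unsigned char *)malloc(n);
    unsigned char *dev;
    memset(host, 100, n);                          // dummy image data
    cudaMalloc(&dev, n);

    // The CPU keeps the overall coordination: deciding what to process,
    // when, and with which parameters. Only the per-pixel loop is shipped out.
    cudaMemcpy(dev, host, n, cudaMemcpyHostToDevice);
    brighten<<<(n + 255) / 256, 256>>>(dev, n, 16);
    cudaMemcpy(host, dev, n, cudaMemcpyDeviceToHost);

    printf("pixel[0] = %d\n", host[0]);            // 116
    cudaFree(dev);
    free(host);
    return 0;
}
```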

Christopher Petrilli at Quora

Other answers

This is a really good question, which has a number of answers depending on where you're looking from. As you rightly say, if all we needed was to bolt memory-protection functionality onto a GPU, then doing that is not at all hard from a hardware perspective. In fact, some of this has already been done! For a couple of generations now, GPUs have supported virtual memory and have even had translation lookaside buffers (TLBs). So the question is, why hasn't this happened yet? Two important issues I can think of are: (1) software and platform support and (2) poor single-threaded performance.

Even if we had a GPU that could support modern OSes, you'd still have a lot of work to do writing all the processor-specific parts of your OS. And then you'd have to deal with all the I/O-related issues, interfaces to peripherals and such. The second issue is that when run in single-threaded mode, current-generation GPUs are just too slow. They don't have branch prediction, superscalar execution or out-of-order execution, and their caches are primitive. These techniques are all essential for building high-performance single-threaded processors. Also, their clock rates are in the 600 MHz - 1 GHz range, so they can't make up for the lack of these features with a faster clock. These are the reasons why NVIDIA needs ARM cores: the platform and software are already up and running for ARM cores, which also deliver very reasonable performance.

While we'd like to have a tunable spectrum between highly parallel architectures like GPUs and single-thread performance monsters like CPUs, engineering such a processor is a hard research problem. One reason is that when you're building hardware you have to put everything you need down on silicon, and once you do that you pay the cost of having those things whether you use them or not. Logic you put down but don't use still consumes area (increasing the cost of your chip), can affect frequency and hence performance because it may make certain logic paths longer, and may still consume power. These reasons, and the fact that we don't really know how to build such a processor, are probably why we haven't seen something like this yet.

A final issue is CMOS process technology. GPUs are typically made on high-density processes, which sacrifice performance to gain area so that they can pack a lot of SIMDs into their chips. CPUs are built on high-performance processes, which sacrifice area and power to gain frequency. Putting these two types of processes together is also a hard problem.
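
A crude way to see the single-threaded-performance point is to time the same kernel launched with one thread and with many. This is an illustrative microbenchmark of my own, not something from the answer; absolute numbers will vary enormously by GPU:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sum of squares with a grid-stride loop, so the same kernel works for any
// launch shape. In single-thread mode the GPU has no branch prediction,
// no out-of-order execution and a modest clock to fall back on.
__global__ void sum_squares(const float *x, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float acc = 0.0f;
    for (; i < n; i += stride)
        acc += x[i] * x[i];
    atomicAdd(out, acc);
}

static float time_launch(int blocks, int threads, const float *x, float *out, int n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaMemset(out, 0, sizeof(float));
    cudaEventRecord(start);
    sum_squares<<<blocks, threads>>>(x, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1 << 22;
    float *x, *out;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&out, sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));

    printf("1 thread    : %.2f ms\n", time_launch(1, 1, x, out, n));
    printf("64K threads : %.2f ms\n", time_launch(256, 256, x, out, n));

    cudaFree(x);
    cudaFree(out);
    return 0;
}
```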

Pramod Subramanyan

Actual processing is a tiny component of modern CPUs. This was really brought home to me recently when looking at the layout of a small, relatively simple mobile CPU core and noticing that the two multipliers took up only about one tenth of the floorplan. If you are as old as I am, you will remember CPUs that occupied an entire chip and did not include an integer multiplier because the size would have made the cost prohibitive. And now we have a CPU so small it's typically not even the main processor in a phone, in which two multipliers are just a small component.

The cost and complexity of a CPU is not in the processing units, but in the logic that ensures the processing units are fully utilized all the time. A modern CPU is designed to extract the absolute maximum possible degree of parallel processing from one or more inherently serial streams of instructions, some of which may include large and unpredictable delays fetching data from storage. A GPU is a different beast. It has far more instruction streams, and they are much better structured for parallel processing, so far more of the GPU is dedicated to actual processing than in a CPU. Most issues that introduce complexity into a CPU design, by requiring it to schedule around pauses, can be resolved in a GPU by simply working on a different problem for a while.

You could in theory combine the instruction-reordering capability of the CPU with the ability to process inherently parallel workloads. However, since the actual processing part is such a small component of a CPU, it's not clear what you would gain. The resulting device would still be about the same size as a CPU and GPU combined, and not as fast. If you remove the reordering logic from the CPU, the performance drops.

Regarding Larrabee: the goal of Larrabee was to replace the GPU with something that understood Intel's instruction set, because Intel sort of believed Nvidia's idea that GPUs might be useful in general-purpose computing. It was never intended to replace the CPU. It did not actually fail so much as get re-purposed: Xeon Phi has been through several iterations and is being used in supercomputers, although nowhere else that I know of. As is fairly evident, outside of graphics and certain kinds of scientific computing, GPUs are not that useful.
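
The "work on a different problem for a while" point can be sketched in CUDA: a kernel whose threads mostly wait on dependent memory loads gets its stalls hidden only if there are enough other warps for the scheduler to switch to. The kernel and access pattern below are hypothetical, purely for illustration:

```cuda
#include <cuda_runtime.h>

// Each thread follows a chain of dependent global loads (lots of stall time).
// The GPU "schedules around pauses" not with out-of-order logic but by running
// other warps while this one waits, so throughput depends on how many warps
// it has to switch between.
__global__ void pointer_chase(const int *next, int *out, int hops) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int p = i;
    for (int h = 0; h < hops; ++h)
        p = next[p];              // each load depends on the previous one
    out[i] = p;
}

int main() {
    const int n = 1 << 20;        // enough entries for every thread we launch
    int *next, *out;
    cudaMallocManaged(&next, n * sizeof(int));
    cudaMallocManaged(&out,  n * sizeof(int));
    for (int i = 0; i < n; ++i)
        next[i] = (i * 9973 + 1) % n;   // scattered but in-range jumps

    // Few warps in flight: little to switch to while loads are outstanding.
    pointer_chase<<<2, 64>>>(next, out, 1000);
    cudaDeviceSynchronize();

    // Many warps in flight: the scheduler can almost always find runnable work.
    pointer_chase<<<1024, 256>>>(next, out, 1000);
    cudaDeviceSynchronize();

    cudaFree(next);
    cudaFree(out);
    return 0;
}
```

The wide launch issues roughly 2,000 times more total work than the narrow one, yet on most GPUs it takes nowhere near 2,000 times longer, because the extra warps largely fill the time the narrow launch spends stalled.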

Simon Kinahan

In the past few years, there has been an increasing amount of focus on expanding the usage scenarios for GPGPU. Most of the reasons people have mentioned above are correct and relevant. However, to be able to run a full-fledged operating system, GPUs need a little more support.

Precise exception handling (a consistent exception state): whenever an exception occurs during execution of a program, the hardware must guarantee that all instructions before a certain point have executed and none after that point have. The exception can then be handled and execution restarted from that point. Such a state with respect to individual instructions is extremely hard to attain in SIMD architectures: we would need a lot of control signals to each SIMD unit to enforce sequential ordering, and these are costly in terms of power, area and/or performance. This is also the reason it is not possible to support I/O on a GPU, since I/O devices raise interrupts which need to be handled appropriately.

Efficient context switching: supporting demand paging (page faults) requires you to tolerate long latencies. CPUs switch context and execute a different process in the meantime. This is tougher to implement on a GPU for two reasons. The first is supporting a consistent exception state, as mentioned above. The second is that we need to save the registers (which form architectural state) to memory when we do a context switch. While this is easy on a CPU, which has a few tens of registers, it is a monumental task on a GPU with several thousand registers - a register file often bigger than the GPU's L1 cache. (It needs that many registers to feed its huge number of SIMD cores.) An efficient context-switch mechanism is therefore necessary before GPUs can be used more generally.

In addition, a GPU is not good at general tasks for the various reasons already mentioned. It has no branch prediction (which needs support for rolling back in case of a misprediction) and no out-of-order execution (costly in terms of power and area, negating the GPU's advantage over the CPU in that regard). GPUs now have support for cache coherency, but it is not optimized for performance; most applications ported to GPUs are embarrassingly parallel and don't communicate across threads.

Cost: even if the above problems were solved, no company would be willing to make such a hypothetical GPU with lots of CPU-like cores, for the single most important reason that they can't make money out of it. Every company out there is in it to make money, and die cost rises steeply with area, so you won't find one. However, as some have noted, we already have heterogeneous chips which feature both a CPU and a GPU on the same die; this is the direction in which the industry is heading, with ever tighter integration.

Software support: operating systems and software are written against a particular ISA. Although applications are slowly adopting OpenCL/CUDA, porting an operating system to run on these GPUs would be a humongous task. Note that OpenCL/CUDA themselves usually rely on a runtime which runs on the CPU.

Answering some of your questions: Larrabee was a lofty attempt by Intel to do software-based game rendering. They assumed that, given enough processing power and efficient software, such a chip would be able to compete with a GPU; this is how games were rendered before Nvidia and ATI produced custom hardware for 3D graphics. It is generally known that custom hardware does a specific task much more efficiently in terms of power and performance. Intel believed they could somehow do it with a number of x86 cores and a good software layer to render games. To cut to the chase, they failed. However, they realised they could still launch the product for the HPC market as an alternative to GPGPU, since it supports x86, which is a boon to the industry. They have yet to launch one, and the verdict is still out. The rest of your questions are essentially answered by the points above: Nvidia needs ARM cores for running operating systems (otherwise it would have to emulate the ARM ISA on the GPU to run an OS). Rather than striving for a continuous scale of complexity, the industry is moving toward a heterogeneous world, which makes sense for various reasons.
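
The SIMD issue behind the precise-exception point above (all lanes of a warp sharing one instruction stream) shows up most plainly as branch divergence, which a tiny CUDA sketch can illustrate. The kernel below is hypothetical and only for illustration:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// All 32 lanes of a warp share one instruction stream, so when lanes take
// different sides of a branch the hardware runs both paths with the inactive
// lanes masked off. This is also why a precise, per-thread "point of
// exception" is awkward to define: the lanes of a unit are never each at
// their own independent point in the program.
__global__ void divergent(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] % 2 == 0)
        out[i] = in[i] * 3;   // even lanes active here, odd lanes masked...
    else
        out[i] = in[i] + 7;   // ...then odd lanes active, even lanes masked
}

int main() {
    const int n = 1 << 16;
    int *in, *out;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&out, n * sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = i;    // even/odd alternate within every warp

    divergent<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[0]=%d out[1]=%d\n", out[0], out[1]);   // 0 and 8

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```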

Bhargava Reddy

What is stopping the convergence of the pickup truck and the sports car? While these vehicles nominally serve the same purpose, special adaptations for one purpose harm the other. It is the same way with GPUs and CPUs. There are "APUs" which integrate the GPU and CPU on the same chip, but there are inherent tradeoffs in the design.

Phillip Remaker

The short answer is: GPUs are designed to tolerate latency in the presence of high concurrency; we want CPUs to be tuned to minimize latency in the presence of normally quite modest concurrency.

I hate to argue with the premises of your question, but a couple of nits. For instance, it's more appropriate in the CUDA model to think of a "block" as being equivalent to a CPU's "thread". (A CUDA multiprocessor is best thought of as a ~32-wide SIMD core with predication.) Another nit to pick is that Larrabee didn't fail: it's used in the fastest computer in the world, among others, with a fairly aggressive follow-on about to arrive.

I think the main answer is that a hardware design must inevitably be specific to its intended use case. Almost everything about a GPU design is poorly suited to functioning as a CPU, and vice versa. This isn't just a single parameter (your "degree of parallelism"), since the target use affects the design profoundly. GPUs don't push clock speed, because plentiful, regular concurrency rewards a slow-and-wide design. There's no reason to make GPU registers fast, because things can just get there when they get there (there are lots of other SIMD operations in flight). Even GPU RAM is at most a few centimetres away over controlled traces, so it runs at higher clocks but is fixed in size (CPU memory is usually narrower and lower-clocked, but, lacking oceans of concurrency, often underutilized).

I think the better question is whether these domains can mix and hybridize. AMD seems to have a plan in mind to do just that, with low-latency CPU-like units essentially as peers of higher-latency, wider GPU-like units. I don't see any reason this can't work (AMD's issue is always execution...). I haven't seen talk from Intel about this kind of thing (and Knights Landing seems pretty firmly focused on HPC rather than graphics). Nvidia, although it mostly rules the current GPGPU world, doesn't have an obviously great story about the glorious CPU-GPU hybrid future (betting on POWER doesn't seem incredibly safe...).
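
Here is the "block is roughly a CPU thread" analogy in code, as a sketch of my own: each block owns one chunk of the input (the unit of work a CPU thread would normally own) and its threads cooperate on it through shared memory. Names and sizes are made up:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each block takes one chunk of the input and its 256 threads (8 warps of the
// ~32-wide SIMD core) cooperate on it through shared memory, the way one CPU
// thread would own one chunk in a multithreaded program.
__global__ void block_sum(const float *x, float *partial, int chunk) {
    __shared__ float buf[256];
    const float *my_chunk = x + blockIdx.x * chunk;

    float acc = 0.0f;
    for (int i = threadIdx.x; i < chunk; i += blockDim.x)
        acc += my_chunk[i];
    buf[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction within the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = buf[0];
}

int main() {
    const int blocks = 1024, chunk = 4096, n = blocks * chunk;
    float *x, *partial;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&partial, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    block_sum<<<blocks, 256>>>(x, partial, chunk);
    cudaDeviceSynchronize();

    float total = 0.0f;
    for (int b = 0; b < blocks; ++b) total += partial[b];   // host finishes the job
    printf("total = %.0f (expected %d)\n", total, n);

    cudaFree(x);
    cudaFree(partial);
    return 0;
}
```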

Mark Hahn

Servers don't need GPUs, because modern CPUs have their own (vector) processors: AltiVec, MMX, SSE. These are fully integrated into the CPU and thus do not have the nasty latency and bandwidth issues one has in communicating with GPUs. This is the primary reason you won't see total convergence of CPUs and GPUs - why pay extra for hardware that does you no good in a key market?

GPUs are special-purpose processors. So are the DMA and protocol engines in I/O devices (heck, GPUs started as blitters; see http://en.wikipedia.org/wiki/Blitter ). There are special-purpose processors all over our computer systems these days, because silicon is incredibly cheap. However, it's usually a bad idea to try to use a special-purpose processor for something other than what it was designed for, because: while it's doing that other thing, it's not doing what it was designed for; it won't do the other thing as well as the main CPU does; and if what you're hoping to gain is parallelism, why not just add more CPUs?
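
For the point about vector units already integrated into the CPU, here is a host-only sketch (no GPU, no driver, no PCIe copy) doing a SAXPY with SSE intrinsics. In practice you would usually write the plain loop and let the compiler auto-vectorize it; the intrinsics are spelled out only to make the SSE lanes visible, and the function name is made up:

```cuda
#include <xmmintrin.h>   // SSE intrinsics: this is purely host-side code
#include <cstdio>

// The CPU's own vector unit works in place on data that is already in host
// memory, with none of the latency and bandwidth cost of shipping it to a GPU.
void saxpy_sse(int n, float a, const float *x, float *y) {
    __m128 va = _mm_set1_ps(a);
    int i = 0;
    for (; i + 4 <= n; i += 4) {                  // 4 floats per SSE register
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
    }
    for (; i < n; ++i)                            // scalar tail
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x = new float[n], *y = new float[n];
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy_sse(n, 2.0f, x, y);
    printf("y[0] = %f\n", y[0]);                  // 4.0

    delete[] x;
    delete[] y;
    return 0;
}
```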

Erik Fair

You are right that the general-purpose aspects of a GPU could be developed further. But while 512 cores and gigaflops of power sounds very cool, it's really not terribly useful in the real world.

If you write software, it's hard enough to code for a single thread. When you start adding additional threads for loading and background work, it gets much harder. So a well-engineered application might have a main thread and a couple of background threads; any more would be wasted. There's no reasonable engineering solution that throws more cores at a single task and makes it faster - you need more depth, not more width.

There are a couple of exceptions where vast parallelism is beneficial. But when we encounter this sort of shallow-but-wide task, we already do use GPGPU.

Glyn Williams

See CUDA. But don't expect the GPU and CPU to "merge" – they are very different: the GPU is very specialized, while the CPU must stay very general – two opposite and valid ideals. On modern GUI systems, you need both simultaneously.

Mark Janssen

I cannot comment on the hardware challenges, but have the following to add from our use case. We experimented with GPGPUs a while ago for our server application. Fortunately our stack had a number of algorithms that map very well to SIMD. While server-class CPUs support SIMD, the GPUs have some nice properties like very low latency to local memory and much wider SIMD. But in the end we decided not to use them.

The main reason we decided not to use them is the lack of open software stacks. Typically you write the GPU part of the code in OpenCL (or CUDA, etc.) and then link that to your main program, which runs on the host. The connection is all handled by GPGPU drivers that are closed source. Our customers value our product and service because we take complete responsibility for the appliance, and having closed-source components - especially such critical ones - was just not acceptable. These GPGPU drivers have significant functionality in them, including a virtual machine, a scheduler and so on. Another problem is finding engineers who know how to program (implement, tune and, most importantly, debug) in OpenCL. The tools were also not as good as what is available on a typical Linux/Unix box, and the programs need to be very aware of data placement (register memory and local memory vs. GDDR memory, etc.).

The performance reason is that in order to benefit from GPGPUs, you need to keep the C/D ratio (computation over cost of data transfer) high. All the GPGPU devices we found on the market were PCIe devices. An x16 Gen2 PCIe device can perhaps do no more than 5-6 GB/s; add actual computation work on the card and the effective number will be lower than that. In our case, which is a streaming workload, the C/D ratio proved too low to justify the complexity.

I would like to see wider SIMD and lower-latency memory support added to server CPUs. That way we get most of the benefits without reams of closed-source software stacks. That said, there are workloads that map well to current GPGPUs - but they need a fairly high C/D.
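
The C/D argument can be checked with a small CUDA timing harness like the sketch below: time the PCIe copies against the kernel's compute time. The kernel is a made-up stand-in for whatever SIMD-friendly algorithm a real stack would run, and the buffer size and iteration count are arbitrary:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Stand-in workload: a few dozen FMAs per element. Raise `iters` to model a
// more compute-intensive algorithm (higher C), lower it for a streaming one.
__global__ void work(float *d, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = d[i];
    for (int k = 0; k < iters; ++k)
        v = v * 1.0001f + 0.5f;
    d[i] = v;
}

static float elapsed_ms(cudaEvent_t a, cudaEvent_t b) {
    float ms;
    cudaEventElapsedTime(&ms, a, b);
    return ms;
}

int main() {
    const int n = 1 << 26;                        // 256 MB of floats
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes), *d;
    cudaMalloc(&d, bytes);

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventCreate(&t2); cudaEventCreate(&t3);

    cudaEventRecord(t0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // D: data transfer in
    cudaEventRecord(t1);
    work<<<(n + 255) / 256, 256>>>(d, n, 64);          // C: computation
    cudaEventRecord(t2);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // D: data transfer out
    cudaEventRecord(t3);
    cudaEventSynchronize(t3);

    float copy_ms = elapsed_ms(t0, t1) + elapsed_ms(t2, t3);
    float compute_ms = elapsed_ms(t1, t2);
    printf("copy %.1f ms, compute %.1f ms, C/D = %.2f\n",
           copy_ms, compute_ms, compute_ms / copy_ms);

    free(h);
    cudaFree(d);
    return 0;
}
```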

Nitin Muppalaneni
