Why not make one big CPU core?
I don't understand why CPU manufacturers make multi-core chips. The scaling of multiple cores is terrible: it is highly application-specific, and I'm sure you can point to a particular program or piece of code that runs great on many cores, but most of the time the scaling is garbage.
It's a waste of silicon die space and a waste of energy.
Games, for example, almost never use more than 4 cores. Science and engineering simulation software like Ansys or Fluent is priced by how many cores the PC it runs on has, so you pay more because you have more cores, but the benefit of more cores becomes really poor past 16 cores. Yet you have these 64-core workstations... it's a waste of money and energy; better to buy a 1500 W heater for the winter, much cheaper.
Why don't they make a CPU with just one big core? I think if they made a one-core equivalent of an 8-core CPU, that one core would have an 800% increase in IPC, so you would get the full performance in all programs, not just those optimized for multiple cores. More IPC increases performance everywhere; it's a reliable and simple way to increase performance. Multiple cores increase performance only in a limited number of programs, and the scaling is horrible and unreliable.
cpu
asked 20 hours ago by wav scientist; edited 8 hours ago by gattsbr
– Tom Carpenter (20 hours ago): How would you propose making 1 core equivalent to 8? You can't just bump up the clock frequency.
– Colin (19 hours ago): Like 8 different operations at once? ...
– Tom Carpenter (19 hours ago): What "more stuff" do you want to do? You either have to go parallel to do "more stuff" (i.e. more CPU cores), go to a more complex instruction set (more transistors, yes, but also more heat and slower), or increase the clock speed (power scales with the square of frequency).
– winny (19 hours ago): Distance from one edge of a big CPU to the other is a big issue. If you need to transfer data between them synchronously, that will set the limit for your clock speed. Take a look at AMD's HyperTransport for details on how complex this is.
– JRE (19 hours ago): If you consolidate everything into one core, then try to do multiple independent things at once, you will have "multi core by detail" - each opcode and its instructions act like virtual cores, but with shared parts that slow you down. Simpler and cleaner - and faster - by far to formalize the split into cores and assign tasks to them as needed. A single big core would be a bigger waste than multicores.
8 Answers
The problem lies with the assumption that CPU manufacturers can just add more transistors to make a single CPU core more powerful without consequence.
To make a CPU do more, you have to plan what doing more entails. There are really three options:
Make the core run at a higher clock frequency - The trouble with this is we are already hitting the limitations of what we can do.
Power usage and hence thermal dissipation increases with the square of frequency - if you double the frequency, you nominally quadruple the power dissipation.
Interconnects and transistors also have propagation delays due to the non-ideal nature of the world. You can't just increase the number of transistors and expect to be able to run at the same clock frequency.
We are also limited by external hardware - mainly RAM. To make the CPU faster, you have to increase the memory bandwidth, by either running it faster, or increasing the data bus width.
Add more complex instructions - Instead of running faster, we can add a richer instruction set - common tasks like encryption etc. can be hardened into the silicon. Rather than taking many clock cycles to calculate in software, we instead have hardware acceleration.
This is already being done on Complex Instruction Set (CISC) processors. See things like SSE2, SSE3. A single CPU core today is far far more powerful than a CPU core from even 10 years ago even if run at the same clock frequency.
The trouble is, as you add more complicated instructions, you add more complexity and make the chip bigger. As a direct result the CPU gets slower - the achievable clock frequencies drop as propagation delays rise.
These complex instructions also don't help you with simple tasks. You can't harden every possible use case, so inevitably large parts of the software you are running will not benefit from new instructions, and in fact will be harmed by the resulting clock rate reduction.
You can also make the data bus widths larger to process more data at once, however again this makes the CPU larger and you hit a tradeoff between throughput gained through larger data buses and the clock rate dropping. If you only have small data (e.g. 32-bit integers), having a 256-bit CPU doesn't really help you.
Make the CPU more parallel - Rather than trying to do one thing faster, instead do multiple things at the same time. If the task you are doing lends itself to operating on several things at a time, then you want either a single CPU that can perform multiple calculations per instruction (Single Instruction Multiple Data (SIMD)), or having multiple CPUs that can each perform one calculation.
This is one of the key drivers for multi-core CPUs. If you have multiple programs running, or can split your single program into multiple tasks, then having multiple CPU cores allows you to do more things at once.
Because the individual CPU cores are effectively separate blocks (barring caches and memory interfaces), each individual core is smaller than the equivalent single monolithic core. Because the core is more compact, propagation delays reduce, and you can run each core faster.
As to whether a single program can benefit from having multiple cores, that is entirely down to what that program is doing, and how it was written.
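As a minimal illustration of the SIMD option above (a sketch assuming an x86 CPU with AVX support and a compiler flag such as -mavx; the arrays and values are arbitrary), a single instruction below adds eight 32-bit floats at once instead of looping over them one at a time:

```c
#include <immintrin.h>  // AVX intrinsics
#include <stdio.h>

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float c[8];

    __m256 va = _mm256_loadu_ps(a);    // load 8 floats
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb); // one instruction, 8 additions
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);         // 11 22 33 44 55 66 77 88
    printf("\n");
    return 0;
}
```

The same eight additions done with scalar code would take eight separate add instructions; whether that matters in practice depends on whether the data layout lets the compiler or programmer keep the vector units fed.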
answered 19 hours ago by Tom Carpenter; edited 19 hours ago
– penguin359 (9 hours ago): The Pentium MMX used the P5 microarchitecture, but it's not called Pentium 5. It was followed by the P6 microarchitecture and then, later, the Pentium II or Pentium 2.
– alex.forencich (7 hours ago): To your point 2: modern chips are actually capable of running much faster than they do, except if they did they would melt. So the real practical limit is thermal. Because of this, adding new instructions, even complex ones, doesn't really decrease the clock speed, as the clock speed is already thermally limited.
– Peter Cordes (7 hours ago): Having 256-bit busses between L2 and L1d cache does help, and is done in practice, or 512-bit on Intel since Haswell (2013). See also How can cache be that fast? for a diagram of Sandybridge. Note that even back then it has a 32-byte wide ring bus connecting all the cores and the mem controllers. Moving cache lines faster mostly helps with workloads with poor locality that touch a cache line but don't stop to process all of it. Or with SIMD, which can keep up even when processing all bytes.
– Peter Cordes (7 hours ago): Modern Skylake-server chips have 64-byte SIMD vectors; wider load/store paths allow you to process more 32-bit integers in parallel. A single 256-bit integer is rarely useful, but wider SIMD can be. SIMD isn't exclusively CISC. All high-performance modern ISAs have SIMD extensions: IBM POWER, AArch64 (again very RISCy); MIPS has optional SIMD extensions.
– Davislor (6 hours ago): @PeterCordes All the things he listed as examples of "CISC" were really examples of SIMD. Classic CISC architectures did not have them (other than Cray supercomputers), whereas modern RISC chips do.
Data dependency
It's fairly easy to add more instructions per clock by making a chip "wider" - this has been the "SIMD" approach. The problem is that this doesn't help most use cases.
There are roughly two types of workload, independent and dependent. An example of an independent workload might be "given two sequences of numbers A1, A2, A3... and B1, B2,... etc, calculate (A1+B1) and (A2+B2) etc." This kind of workload is seen in computer graphics, audio processing, machine learning, and so on. Quite a lot of this has been given to GPUs, which are designed especially to handle it.
A dependent workload might be "Given A, add 5 to it and look that up in a table. Take the result and add 16 to it. Look that up in a different table."
The advantage of the independent workload is that it can be split into lots of different parts, so more transistors helps with that. For dependent workloads, this doesn't help at all - more transistors can only make it slower. If you have to get a value from memory, that's a disaster for speed. A signal has to be sent out across the motherboard, travelling sub-lightspeed, the DRAM has to charge up a row and wait for the result, then send it all the way back. This takes tens of nanoseconds. Then, having done a simple calculation, you have to send off for the next one.
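A minimal C sketch of the two workload shapes (the function names and the "+5 / +16" constants just mirror the example above and are purely illustrative): the first loop's iterations are independent and can be spread across SIMD lanes or cores, while the second is a serial chain where each step needs the previous result before it can start.

```c
#include <stddef.h>

// Independent workload: every iteration stands alone, so the compiler,
// a SIMD unit, or extra cores can all work on different i at once.
void add_arrays(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

// Dependent workload: each table lookup feeds the next one, so no amount
// of extra hardware can start the second lookup before the first finishes.
unsigned chase(const unsigned *table1, const unsigned *table2, unsigned a) {
    unsigned x = table1[a + 5];
    return table2[x + 16];
}
```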
Power management
Spare cores are turned off most of the time. In fact, on quite a lot of processors, you can't run all the cores all of the time without the thing catching fire, so the system will turn them off or downclock them for you.
Rewriting the software is the only way forwards
The hardware can't automatically convert dependent workloads into independent workloads. Neither can software. But a programmer who's prepared to redesign their system to take advantage of lots of cores just might.
– Peter Cordes (7 hours ago): Citation needed for "can't run all the cores at the same time". Unless you consider the single-core max turbo clock speed to be the "real" clock speed of the CPU. In the classic sense (before we hit the power wall and clock speed was limited by critical path propagation delays), yes, that's true, but in the modern world it makes more sense to look at the baseline clock speed as what can be sustained with all cores active running heavy workloads. Anything higher than that is gravy you can opportunistically use as power / thermal limits allow (e.g. Intel's Turbo).
– Peter Cordes (6 hours ago): But in terms of power, even a single core's max clock is limited by thermals more so than propagation delays (although probably the pipeline stage boundaries are selected so you're close to that limit at the target max turbo). And voltage is a variable too: worse power but shorter gate delays. So anyway, it doesn't make sense to consider the single-core max turbo as something you "should" be able to run all cores at, because that limit already comes from power.
In addition to the other answers, there is another element: chip yields. A modern processor has several billion transistors in it, and each and every one of those transistors has to work perfectly in order for the whole chip to function properly.
By making multi-core processors, you can cleanly partition groups of transistors. If a defect exists in one of the cores, you can disable that core and sell the chip at a reduced price according to the number of functioning cores. Likewise, you can also assemble systems out of validated components, as in an SMP system.
For virtually every CPU you buy, it started life being made to be a top-end premium model for that processor line. What you end up with depends on which portions of that chip are working incorrectly and get disabled. Intel doesn't make any i3 processors: they are all defective i7s, with all the features that separate the product lines disabled because they failed testing. However, the portions that are still working are still useful and can be sold for much cheaper. Anything worse becomes keychain trinkets.
And defects are not uncommon. Perfectly creating those billions of transistors is not an easy task. If you have no opportunities to selectively use portions of a given chip, the price of the result is going to go up, real fast.
With just a single über processor, manufacturing is all or nothing, resulting in a much more wasteful process. For some devices, like image sensors for scientific or military purposes, where you need a huge sensor and it all has to work, the costs of those devices are so enormous only state-level budgets can afford them.
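A back-of-the-envelope sketch of why partitioning helps, using the classic Poisson yield model (yield = exp(-defect density × area)); the defect density, die area, and binning threshold below are made-up numbers purely for illustration:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double defects_per_mm2 = 0.002;   // hypothetical defect density
    double die_area_mm2    = 400.0;   // hypothetical large die
    int    cores           = 8;

    // One monolithic core: a single defect anywhere kills the whole die.
    double yield_monolithic = exp(-defects_per_mm2 * die_area_mm2);

    // Eight cores: a defect only kills the core it lands in, so a die with
    // a few bad cores can still be sold as a lower-end part.
    double core_area  = die_area_mm2 / cores;
    double core_yield = exp(-defects_per_mm2 * core_area);
    double all_8_good = pow(core_yield, cores);
    double at_least_6 = 0.0;
    for (int good = 6; good <= cores; good++) {
        // binomial probability of exactly `good` working cores
        double comb = tgamma(cores + 1) /
                      (tgamma(good + 1) * tgamma(cores - good + 1));
        at_least_6 += comb * pow(core_yield, good) *
                      pow(1.0 - core_yield, cores - good);
    }

    printf("fully working monolithic die: %.1f%%\n", 100 * yield_monolithic);
    printf("all 8 cores working:          %.1f%%\n", 100 * all_8_good);
    printf("at least 6 of 8 cores usable: %.1f%%\n", 100 * at_least_6);
    return 0;
}
```

Under this model the probability that all 8 cores work is mathematically the same as the monolithic yield (about 45% with these numbers); the win is that roughly 96% of dies have at least 6 working cores and can still be sold as something.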
– Peter Cordes (6 hours ago): If/when yields improve and are producing more fully-working chips than the market demands, vendors usually start fusing off some of the cores/cache and/or binning them at a lower frequency SKU, instead of adjusting the price structure to make the high-end chips relatively cheaper. With GPUs / graphics cards you used to be able to unlock disabled shader units on some cards with a firmware hack, to see if you got lucky and got a card where they were only disabled for market segmentation, not actual defects.
– Peter Cordes (6 hours ago): Intel has manufactured dual-core dies for some of their chips. With all their ULV (ultralow voltage) mobile SKUs being dual-core, there weren't enough defective quad-cores, and the smaller die area (especially with a cut-down iGPU as well) gives more working dual-core chips per wafer than fusing off quad-core dies. en.wikichip.org/wiki/intel/microarchitectures/… has die shots of Sandybridge: 131 mm² dual-core + GT1 graphics, vs. 149 mm² dual-core + GT2 graphics, vs. 216 mm² quad + GT2. There's still room for defects in cache etc.
– Peter Cordes (6 hours ago): And (some) defects in part of an FMA unit can presumably be handled by fusing it off and selling it as a Celeron or Pentium chip (no AVX, so only 128-bit vectors). Even modern Skylake or Coffee Lake Pentium chips lack AVX. The SIMD FMA units make up a decent fraction of a core (and run many SIMD ops other than FP math, including integer mul and integer shift), so I wouldn't be surprised if the 2x 256-bit FMA units can be mapped to 2x 128-bit using whichever 2 chunks are still working. With Skylake Xeon, there are even SKUs with reduced AVX512 FMA throughput (only 1 working 512-bit FMA).
Wow, I can tell this is being asked by someone under 30!
Going back in time, processors weren't able to run that fast. As a result, if you wanted to do more processing then you needed more processors. This could be with a maths coprocessor, or it could simply be with more of the same processor. The best example of this is the Inmos Transputer from the 80s, which was specifically designed for massively parallel processing with multiple processors plugged together. The whole concept hinged on the assumption that there was no better way to increase processing power than to add processors.
Trouble is, that assumption was temporarily incorrect. You can also get more processing power by making one processor do more calculations. Intel and AMD found ways to push clock speeds ever higher, and as you say, it's way easier to keep everything on one processor. The result was that until the mid 2000s, the fast single-core processor owned the market. Inmos died a death in the early 90s, and all their experience died with them.
The good times had to end though. Once clock speeds got up to GHz there really wasn't scope for going further. And back we went to multiple cores again. If you genuinely can't get faster, more cores is the answer. As you say though, it isn't always easy to use those cores effectively. We're a lot better these days, but we're still some way off making it as easy as the Transputer did.
Of course there are other options for improvement as well - you could be more efficient instead. SIMD and similar instruction sets get more processing done for the same number of clock ticks. DDR gets your data into and out of the processor faster. It all helps. But when it comes to processing, we're back to the 80s and multiple cores again.
– wav scientist (6 hours ago): How did you know? Yes, I am 27 lol
– Connor Wolf (2 hours ago): @wavscientist - A lot of this stuff was explored in the 80s. There didn't use to be the current x86/ARM monoculture. There used to be lots of different processor architectures that explored different ways to get more performance out of the same clock rate.
You point out that a lot of software doesn't use more than (x) cores. But this is entirely a limitation placed by the designers of that software. Home PCs having multiple cores is still new(ish), and designing multi-threaded software is also more difficult with traditional APIs and languages.
Your PC is also not just running that 1 program. It is doing a whole bunch of other things that can be put onto less active cores so your primary software isn't getting interrupted by them as much.
It's not currently possible to just increase the speed of a single core to match the throughput of 8 cores. More speed is likely going to have to come from new architecture.
As more cores become commonly available and APIs are designed with that assumption, programmers will start commonly using more cores. Efforts to make multi-threaded designs easier to create are ongoing. If you asked this question in a few years, you would probably be saying "My games only commonly use 32 cores, so why does my CPU have 256?".
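As a small illustration of what "designing multi-threaded software" means in practice, here is a minimal POSIX-threads sketch in C that splits a summation across a fixed number of worker threads (the thread count and array size are arbitrary; real code would pick them based on the machine and worry about load balance and false sharing):

```c
#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4
#define N_ITEMS   1000000

static double data[N_ITEMS];

struct chunk { size_t begin, end; double partial; };

static void *sum_chunk(void *arg) {
    struct chunk *c = arg;
    double s = 0.0;
    for (size_t i = c->begin; i < c->end; i++)
        s += data[i];
    c->partial = s;              // each thread writes only its own struct
    return NULL;
}

int main(void) {
    for (size_t i = 0; i < N_ITEMS; i++)
        data[i] = 1.0;

    pthread_t tid[N_THREADS];
    struct chunk chunks[N_THREADS];
    size_t step = N_ITEMS / N_THREADS;

    for (int t = 0; t < N_THREADS; t++) {
        chunks[t].begin = t * step;
        chunks[t].end   = (t == N_THREADS - 1) ? N_ITEMS : (t + 1) * step;
        pthread_create(&tid[t], NULL, sum_chunk, &chunks[t]);
    }

    double total = 0.0;
    for (int t = 0; t < N_THREADS; t++) {
        pthread_join(tid[t], NULL);  // wait, then combine partial results
        total += chunks[t].partial;
    }
    printf("sum = %.0f\n", total);   // expected: 1000000
    return 0;
}
```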
– Peter Cordes (6 hours ago): The difference between 1 vs. multiple cores is huge in terms of getting software to take advantage. Most algorithms and programs are serial. e.g. Donald Knuth has said that multi-core CPUs look like HW designers are "trying to pass the blame for the future demise of Moore's Law to the software writers by giving us machines that work faster only on a few key benchmarks!"
– Peter Cordes (6 hours ago): Unfortunately nobody has yet come up with a way to make a single wide/fast core run a single-threaded program anywhere near as fast as we can get efficiently-parallel code to run across multiple cores. But fortunately CPU designers realize that single-threaded performance is still critical and make each individual core much bigger and more powerful than it would be if they were going for pure throughput on parallel problems. (Compare a Skylake (4-wide) or Ryzen (5-wide) vs. a core of a Xeon Phi (Knight's Landing / Knight's Mill, based on Silvermont + AVX512), which is 2-wide with limited OoO exec.)
– Peter Cordes (6 hours ago): Anyway yes, having at least 2 cores is often helpful for a multitasking OS, but pre-emptive multi-tasking on a single core that was 4x or 8x as fast as a current CPU would be pretty good. For many interactive use-cases that would be much better, if it were possible to build at all / with the same power budget. (Dual core does help reduce context-switch costs when multiple tasks want CPU time, though.)
– hekete (13 mins ago): All true, but historically multi-core was more expensive. There wasn't a lot of reason to design parallel algorithms outside of science applications. There is a lot of room for parallelization, even in algorithms that require a mostly serial execution. But current generation IPC isn't great and is easy to mess up, which generally results in bugs that are really hard to find and fix. Of course a 4x faster CPU would be amazing (but you would still want multiple cores).
Good question, or at least one with an interesting answer. Summary:
- The cost of multiple cores scales close to linearly.
- The cost of widening the pipeline scales quadratically.
- There are serious diminishing IPC returns from widening the pipeline beyond 3 or 4-wide, even with out-of-order execution to find the ILP. Branch misses and cache misses are hard.
Costs are in die area (manufacturing cost) and/or power (which indirectly limits frequency).
Donald Knuth said in a 2008 interview
I might as well flame a bit about my personal unhappiness with the current trend toward multicore architecture. To me, it looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks!
Yes, if we could have miracle single-core CPUs with 8x the throughput on real programs, we'd probably still be using them. Maybe dual core to reduce context-switch costs when multiple programs are running; pre-emptive multitasking interrupting the massive out-of-order machinery such a CPU would require would probably hurt even more than it does now.
Or physically it would be single core (simple cache hierarchy) but support SMT (e.g. Intel's HyperThreading) so software could use it as 8 logical cores that dynamically compete for throughput resources. Or when only 1 thread is running / not stalled, it would get the full benefit.
So you'd use multiple threads when that was actually easier/natural (e.g. separate processes running at once), or for easily-parallelized problems with dependency chains that would prevent maxing out the IPC of this beast.
But unfortunately it's wishful thinking on Knuth's part to hope that multi-core CPUs will ever stop being a thing at this point.
I think if they made a 1 core equivalent of an 8 core CPU, that one core would have 800% increase in IPC so you would get the full performance in all programs, not just those that are optimized for multiple cores.
Yes, that's true. If it was possible to build such a CPU at all, it would be very amazing. But I think it's literally impossible on the same semiconductor manufacturing process (i.e. same quality / efficiency of transistors). It's certainly not possible with the same power budget and die area as an 8-core CPU, even though you'd save on logic to glue cores together, and wouldn't need as much space for per-core private caches.
Even if you allow frequency increases (since the real criterion is work per second, not work per clock), making even a 2x faster CPU would be a huge challenge.
If it were possible at anywhere near the same power and die-area budget (thus manufacturing cost) to build such a CPU, yes CPU vendors would already be building them that way.
See the "More Cores or Wider Cores?" section of Modern Microprocessors: A 90-Minute Guide! for the necessary background to understand this answer; it starts simply with how in-order pipelined CPUs work, then superscalar, and then explains how we hit the power wall right around the P4 era, leading to a fundamental shift in scaling with smaller transistors: going for higher ILP instead of frequency.
Making a pipeline wider (max instructions per clock) typically scales in cost as width-squared. That cost is measured in die area and/or power: wider parallel dependency checking (hazard detection), a wider out-of-order scheduler to find ready instructions to run, and more read/write ports on your register file and cache if you want to run instructions other than nop - especially if you have 3-input instructions like FMA or add-with-carry (2 registers + flags).
There are also diminishing IPC returns for making CPUs wider; most workloads have limited small-scale / short-range ILP (Instruction-Level Parallelism) for CPUs to exploit, so making the core wider doesn't increase IPC (instructions per clock) if IPC is already limited to less than the width of the core by dependency chains, branch misses, cache misses, or other stalls. Sure you'd get a speedup in some unrolled loops with independent iterations, but that's not what most code spends most of its time doing. Compare/branch instructions make up 20% of the instruction mix in "typical" code, IIRC. (I think I've read numbers from 15 to 25% for various data sets.)
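A tiny C sketch of that ILP limit (the loop is made up for illustration): in the first version every addition depends on the previous one, so a core that could issue 8 adds per clock still completes roughly one add per add latency; the second version exposes four independent chains the hardware can overlap, which is exactly the kind of small-scale, short-range ILP an out-of-order core can exploit.

```c
// One long dependency chain: iteration i+1 cannot start before i finishes.
double sum_serial(const double *a, long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += a[i];               // s -> s -> s ... a serial chain
    return s;
}

// Four independent chains: the out-of-order core (or the compiler) can keep
// several additions in flight at once, limited by throughput, not latency.
double sum_unrolled(const double *a, long n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    long i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)           // leftover elements
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```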
Also, a cache miss that stalls all dependent instructions (and then everything once ROB capacity is reached) costs more for a wider CPU. (The opportunity cost of leaving more execution units idle; more potential work not getting done.) Or a branch miss similarly causes a bubble.
To get 8x the IPC, we'd need at least an 8x improvement in branch-prediction accuracy and in cache hit rates. But cache hit rates don't scale well with cache capacity past a certain point for most workloads. And HW prefetching is smart, but can't be that smart. And at 8x the IPC, the branch predictors need to produce 8x as many predictions per cycle as well as having them be more accurate.
Current techniques for building out-of-order execution CPUs can only find ILP over short ranges. For example, Skylake's ROB size is 224 fused-domain uops, and its scheduler for non-executed uops holds 97 unfused-domain uops. See "Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths" for a case where scheduler size is the limiting factor in extracting ILP from 2 long chains of instructions if they get too long, and/or see this more general and introductory answer.
So finding ILP between two separate long loops is not something we can do with hardware. Dynamic binary-recompilation for loop fusion could be possible in some cases, but hard and not something CPUs can really do unless they go the Transmeta Crusoe route. (x86 emulation layer on top of a different internal ISA; in that case VLIW). But standard modern x86 designs with uop caches and powerful decoders aren't easy to beat for most code.
And outside of x86, all ISAs still in use are relatively easy to decode, so there's no motivation for dynamic-recompilation other than long-distance optimizations. TL:DR: hoping for magic compilers that can expose more ILP to the hardware didn't work out for Itanium IA-64, and is unlikely to work for a super-wide CPU for any existing ISA with a serial model of execution.
If you did have a super-wide CPU, you'd definitely want it to support SMT so you can keep it fed with work to do by running multiple low-ILP threads.
Since Skylake is currently 4 uops wide (and achieves a real IPC of 2 to 3 uops per clock, or even closer to 4 in high-throughput code), a hypothetical 8x wider CPU would be 32-wide!
Being able to carve that back into 8 or 16 logical CPUs that dynamically share those execution resources would be fantastic: non-stalled threads get all the front-end bandwidth and back-end throughput.
But with 8 separate cores, when a thread stalls there's nothing else to keep the execution units fed; the other threads don't benefit.
Execution is often bursty: it stalls waiting for a cache miss load, then once that arrives many instructions in parallel can use that result. With a super-wide CPU, that burst can go faster, and it can actually help with SMT.
But we can't have magical super-wide CPUs
So to gain throughput we instead have to expose parallelism to the hardware in the form of thread-level parallelism. Generally compilers aren't great at knowing when/how to use threads, other than for simple cases like very big loops (OpenMP, or gcc's -ftree-parallelize-loops). It still takes human cleverness to rework code to efficiently get useful work done in parallel, because inter-thread communication is expensive, and so is thread startup.
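For the "simple cases like very big loops" mentioned above, here is a minimal sketch of what OpenMP looks like in C (assuming a compiler invoked with -fopenmp; the array size is arbitrary). The pragma is the programmer explicitly promising the compiler that the iterations are independent - information the hardware cannot discover on its own across a whole loop.

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N], b[N];

int main(void) {
    double total = 0.0;

    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    // The programmer asserts the iterations are independent; OpenMP then
    // splits them across however many cores are available and combines the
    // per-thread partial sums via the reduction clause.
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < N; i++)
        total += a[i] * b[i];

    printf("dot product = %.0f (threads available: %d)\n",
           total, omp_get_max_threads());
    return 0;
}
```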
TLP is coarse-grained parallelism, unlike the fine-grained ILP within a single thread of execution which HW can exploit.
BTW, all this is orthogonal to SIMD. Getting more work done per instruction always helps, if it's possible for your problem.
Let me draw an analogy:
If you have a monkey typing away at a typewriter, and you want more typing to get done, you can give the monkey coffee, typing lessons, and perhaps make threats to get it to work faster, but there comes a point where the monkey will be typing at maximum capacity.
So if you want to get more typing done, you have to get more monkeys.
– EvilSnack (new contributor)
Multicore designs aren't usually multiscalar, and multiscalar cores aren't multicore. It would be almost ideal to find a multiscalar architecture running at several megahertz, but in general its bridges would not be consumer-enabled but costly, so the tendency is multicore programming at lower clock rates rather than short instructions at high clock speeds. Multiple-instruction cores are cheaper and easier to command, which is why it's a bad idea to have multiscalar architectures at several gigahertz.
– machtur (new contributor)
Your Answer
StackExchange.ifUsing("editor", function ()
return StackExchange.using("schematics", function ()
StackExchange.schematics.init();
);
, "cicuitlab");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "135"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2felectronics.stackexchange.com%2fquestions%2f443186%2fwhy-not-make-one-big-cpu-core%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
8 Answers
8
active
oldest
votes
8 Answers
8
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
The problem lies with the assumption that CPU manufacturers can just add more transistors to make a single CPU core more powerful without consequence.
To make a CPU do more, you have to plan what doing more entails. There are really three options:
Make the core run at a higher clock frequency - The trouble with this is we are already hitting the limitations of what we can do.
Power usage and hence thermal dissipation increases to the square or frequency - if you double the frequency you nominally quadrouple the power dissipation.
Interconnects and transistors also have propagation delays due to the non-ideal nature of the world. You can't just increase the number of transistors and expect to be able to run at the same clock frequency.
We are also limited by external hardware - mainly RAM. To make the CPU faster, you have to increase the memory bandwidth, by either running it faster, or increasing the data bus width.
Add more complex instructions - Instead of running faster, we can add a more rich instruction set - common tasks like encryption etc. can be hardened into the silicon. Rather than taking many clock cycles to calculate in software, we instead have hardware accelleration.
This is already being done on Complex Instruction Set (CISC) processors. See things like SSE2, SSE3. A single CPU core today is far far more powerful than a CPU core from even 10 years ago even if run at the same clock frequency.
The trouble is, as you add more complicated instructions, you add more complexity and make the chip gets bigger. As a direct result the CPU gets slower - the acheivable clock frequencies drop as propagation delays rise.
These complex instructions also don't help you with simple tasks. You can't harden every possible use case, so inevitably large parts of the software you are running will not benefit from new instructions, and in fact will be harmed by the resulting clock rate reduction.
You can also make the data bus widths larger to process more data at once, however again this makes the CPU larger and you hit a tradeoff between throughput gained through larger data buses and the clock rate dropping. If you only have small data (e.g. 32-bit integers), having a 256-bit CPU doesn't really help you.
Make the CPU more parallel - Rather than trying to do one thing faster, instead do multiple things at the same time. If the task you are doing lends itself to operating on several things at a time, then you want either a single CPU that can perform multiple calculations per instruction (Single Instruction Multiple Data (SIMD)), or having multiple CPUs that can each perform one calculation.
This is one of the key drivers for multi-core CPUs. If you have multiple programs running, or can split your single program into multiple task, then having multiple CPU cores allows you to do more things at once.
Because the individual CPU cores are effectively seperate blocks (barring caches and memory interfaces), each individual core is smaller than the equivalent single monolithic core. Because the core is more compact, propagation delays reduce, and you can run each core faster.
As to whether a single program can benefit from having multiple cores, that is entirely down to what that program is doing, and how it was written.
$endgroup$
2
$begingroup$
The Pentium MMX used the P5 microarchitecture, but it's not called Pentium 5. It was followed by the P6 microarchitecture and then, later, the Pentium II or Pentium 2.
$endgroup$
– penguin359
9 hours ago
2
$begingroup$
To your point 2: modern chips are actually capable of running much faster than they do, except if they did they would melt. So the real practical limit is thermal. Because of this, adding new instructions, even complex ones, doesn't really decrease the clock speed as the clock speed is already thermally limited.
$endgroup$
– alex.forencich
7 hours ago
1
$begingroup$
Having 256-bit busses between L2 and L1d cache does help, and is done in practice, or or 512-bit on Intel since Haswell (2013). See also How can cache be that fast? for a diagram of Sandybridge. Note that even back then it has a 32-byte wide ring bus connecting all the cores and the mem controllers. Moving cache lines faster mostly helps with workloads with poor locality that touch a cache line but don't stop to process all of it. Or with SIMD which can keep up even when processing all bytes.
$endgroup$
– Peter Cordes
7 hours ago
1
$begingroup$
Modern Skylake-server chips have 64-byte SIMD vectors; wider load/store paths allow you to process more 32-bit integers in parallel. A single 256-bit integer is rarely useful, but wider SIMD can be. SIMD isn't exclusively CISC. All high-performance modern ISAs have SIMD extensions; IBM POWER, AArch64 (again very RISCy), MIPS has optional SIMD extensions.
$endgroup$
– Peter Cordes
7 hours ago
2
$begingroup$
@PeterCordes All the things he listed as examples of “CISC” were really examples of SIMD. Classic CISC architectures did not have them (other than Cray supercomputers), whereas modern RISC chips do.
$endgroup$
– Davislor
6 hours ago
|
show 10 more comments
$begingroup$
The problem lies with the assumption that CPU manufacturers can just add more transistors to make a single CPU core more powerful without consequence.
To make a CPU do more, you have to plan what doing more entails. There are really three options:
Make the core run at a higher clock frequency - The trouble with this is we are already hitting the limitations of what we can do.
Power usage and hence thermal dissipation increases to the square or frequency - if you double the frequency you nominally quadrouple the power dissipation.
Interconnects and transistors also have propagation delays due to the non-ideal nature of the world. You can't just increase the number of transistors and expect to be able to run at the same clock frequency.
We are also limited by external hardware - mainly RAM. To make the CPU faster, you have to increase the memory bandwidth, by either running it faster, or increasing the data bus width.
Add more complex instructions - Instead of running faster, we can add a more rich instruction set - common tasks like encryption etc. can be hardened into the silicon. Rather than taking many clock cycles to calculate in software, we instead have hardware accelleration.
This is already being done on Complex Instruction Set (CISC) processors. See things like SSE2, SSE3. A single CPU core today is far far more powerful than a CPU core from even 10 years ago even if run at the same clock frequency.
The trouble is, as you add more complicated instructions, you add more complexity and make the chip gets bigger. As a direct result the CPU gets slower - the acheivable clock frequencies drop as propagation delays rise.
These complex instructions also don't help you with simple tasks. You can't harden every possible use case, so inevitably large parts of the software you are running will not benefit from new instructions, and in fact will be harmed by the resulting clock rate reduction.
You can also make the data bus widths larger to process more data at once, however again this makes the CPU larger and you hit a tradeoff between throughput gained through larger data buses and the clock rate dropping. If you only have small data (e.g. 32-bit integers), having a 256-bit CPU doesn't really help you.
Make the CPU more parallel - Rather than trying to do one thing faster, instead do multiple things at the same time. If the task you are doing lends itself to operating on several things at a time, then you want either a single CPU that can perform multiple calculations per instruction (Single Instruction Multiple Data (SIMD)), or having multiple CPUs that can each perform one calculation.
This is one of the key drivers for multi-core CPUs. If you have multiple programs running, or can split your single program into multiple task, then having multiple CPU cores allows you to do more things at once.
Because the individual CPU cores are effectively seperate blocks (barring caches and memory interfaces), each individual core is smaller than the equivalent single monolithic core. Because the core is more compact, propagation delays reduce, and you can run each core faster.
As to whether a single program can benefit from having multiple cores, that is entirely down to what that program is doing, and how it was written.
$endgroup$
2
$begingroup$
The Pentium MMX used the P5 microarchitecture, but it's not called Pentium 5. It was followed by the P6 microarchitecture and then, later, the Pentium II or Pentium 2.
$endgroup$
– penguin359
9 hours ago
2
$begingroup$
To your point 2: modern chips are actually capable of running much faster than they do, except if they did they would melt. So the real practical limit is thermal. Because of this, adding new instructions, even complex ones, doesn't really decrease the clock speed as the clock speed is already thermally limited.
$endgroup$
– alex.forencich
7 hours ago
1
$begingroup$
Having 256-bit busses between L2 and L1d cache does help, and is done in practice, or or 512-bit on Intel since Haswell (2013). See also How can cache be that fast? for a diagram of Sandybridge. Note that even back then it has a 32-byte wide ring bus connecting all the cores and the mem controllers. Moving cache lines faster mostly helps with workloads with poor locality that touch a cache line but don't stop to process all of it. Or with SIMD which can keep up even when processing all bytes.
$endgroup$
– Peter Cordes
7 hours ago
1
$begingroup$
Modern Skylake-server chips have 64-byte SIMD vectors; wider load/store paths allow you to process more 32-bit integers in parallel. A single 256-bit integer is rarely useful, but wider SIMD can be. SIMD isn't exclusively CISC. All high-performance modern ISAs have SIMD extensions; IBM POWER, AArch64 (again very RISCy), MIPS has optional SIMD extensions.
$endgroup$
– Peter Cordes
7 hours ago
2
$begingroup$
@PeterCordes All the things he listed as examples of “CISC” were really examples of SIMD. Classic CISC architectures did not have them (other than Cray supercomputers), whereas modern RISC chips do.
$endgroup$
– Davislor
6 hours ago
|
show 10 more comments
$begingroup$
The problem lies with the assumption that CPU manufacturers can just add more transistors to make a single CPU core more powerful without consequence.
To make a CPU do more, you have to plan what doing more entails. There are really three options:
Make the core run at a higher clock frequency - The trouble with this is we are already hitting the limitations of what we can do.
Power usage and hence thermal dissipation increases to the square or frequency - if you double the frequency you nominally quadrouple the power dissipation.
Interconnects and transistors also have propagation delays due to the non-ideal nature of the world. You can't just increase the number of transistors and expect to be able to run at the same clock frequency.
We are also limited by external hardware - mainly RAM. To make the CPU faster, you have to increase the memory bandwidth, by either running it faster, or increasing the data bus width.
Add more complex instructions - Instead of running faster, we can add a more rich instruction set - common tasks like encryption etc. can be hardened into the silicon. Rather than taking many clock cycles to calculate in software, we instead have hardware accelleration.
This is already being done on Complex Instruction Set (CISC) processors. See things like SSE2, SSE3. A single CPU core today is far far more powerful than a CPU core from even 10 years ago even if run at the same clock frequency.
The trouble is, as you add more complicated instructions, you add more complexity and make the chip gets bigger. As a direct result the CPU gets slower - the acheivable clock frequencies drop as propagation delays rise.
These complex instructions also don't help you with simple tasks. You can't harden every possible use case, so inevitably large parts of the software you are running will not benefit from new instructions, and in fact will be harmed by the resulting clock rate reduction.
You can also make the data bus widths larger to process more data at once, however again this makes the CPU larger and you hit a tradeoff between throughput gained through larger data buses and the clock rate dropping. If you only have small data (e.g. 32-bit integers), having a 256-bit CPU doesn't really help you.
Make the CPU more parallel - Rather than trying to do one thing faster, instead do multiple things at the same time. If the task you are doing lends itself to operating on several things at a time, then you want either a single CPU that can perform multiple calculations per instruction (Single Instruction Multiple Data (SIMD)), or having multiple CPUs that can each perform one calculation.
This is one of the key drivers for multi-core CPUs. If you have multiple programs running, or can split your single program into multiple task, then having multiple CPU cores allows you to do more things at once.
Because the individual CPU cores are effectively seperate blocks (barring caches and memory interfaces), each individual core is smaller than the equivalent single monolithic core. Because the core is more compact, propagation delays reduce, and you can run each core faster.
As to whether a single program can benefit from having multiple cores, that is entirely down to what that program is doing, and how it was written.
$endgroup$
edited 19 hours ago
answered 19 hours ago
Tom Carpenter
$begingroup$
The Pentium MMX used the P5 microarchitecture, but it's not called Pentium 5. It was followed by the P6 microarchitecture and then, later, the Pentium II or Pentium 2.
$endgroup$
– penguin359
9 hours ago
$begingroup$
To your point 2: modern chips are actually capable of running much faster than they do, except if they did they would melt. So the real practical limit is thermal. Because of this, adding new instructions, even complex ones, doesn't really decrease the clock speed as the clock speed is already thermally limited.
$endgroup$
– alex.forencich
7 hours ago
$begingroup$
Having 256-bit busses between L2 and L1d cache does help, and is done in practice, or 512-bit on Intel since Haswell (2013). See also How can cache be that fast? for a diagram of Sandybridge. Note that even back then it had a 32-byte wide ring bus connecting all the cores and the mem controllers. Moving cache lines faster mostly helps with workloads with poor locality that touch a cache line but don't stop to process all of it. Or with SIMD which can keep up even when processing all bytes.
$endgroup$
– Peter Cordes
7 hours ago
$begingroup$
Modern Skylake-server chips have 64-byte SIMD vectors; wider load/store paths allow you to process more 32-bit integers in parallel. A single 256-bit integer is rarely useful, but wider SIMD can be. SIMD isn't exclusively CISC. All high-performance modern ISAs have SIMD extensions: IBM POWER and AArch64 (again very RISCy) have them, and MIPS has optional SIMD extensions.
$endgroup$
– Peter Cordes
7 hours ago
$begingroup$
@PeterCordes All the things he listed as examples of “CISC” were really examples of SIMD. Classic CISC architectures did not have them (other than Cray supercomputers), whereas modern RISC chips do.
$endgroup$
– Davislor
6 hours ago
Data dependency
It's fairly easy to add more instructions per clock by making a chip "wider" - this has been the "SIMD" approach. The problem is that this doesn't help most use cases.
There are roughly two types of workload, independent and dependent. An example of an independent workload might be "given two sequences of numbers A1, A2, A3... and B1, B2,... etc, calculate (A1+B1) and (A2+B2) etc." This kind of workload is seen in computer graphics, audio processing, machine learning, and so on. Quite a lot of this has been given to GPUs, which are designed especially to handle it.
A dependent workload might be "Given A, add 5 to it and look that up in a table. Take the result and add 16 to it. Look that up in a different table."
The advantage of the independent workload is that it can be split into lots of different parts, so more transistors help with that. For dependent workloads, this doesn't help at all - more transistors can only make it slower. If you have to get a value from memory, that's a disaster for speed. A signal has to be sent out across the motherboard, travelling sub-lightspeed, the DRAM has to charge up a row and wait for the result, then send it all the way back. This takes tens of nanoseconds. Then, having done a simple calculation, you have to send off for the next one.
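A minimal sketch of the two workload shapes described above (illustrative code with invented names, not part of the original answer):

```cpp
#include <cstddef>

// Independent: each (A[i] + B[i]) can be computed separately, so SIMD lanes,
// extra cores, or a GPU can all work on different elements at once.
void independent(const int* A, const int* B, int* C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        C[i] = A[i] + B[i];
}

// Dependent: the second lookup cannot even be issued until the first load has
// come back from memory, so extra transistors mostly sit idle.
int dependent(int a, const int* table1, const int* table2) {
    int x = table1[a + 5];   // "add 5 to it and look that up in a table"
    int y = table2[x + 16];  // "take the result and add 16, look that up"
    return y;
}
```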
Power management
Spare cores are turned off most of the time. In fact, on quite a lot of processors, you can't run all the cores all of the time without the thing catching fire, so the system will turn them off or downclock them for you.
Rewriting the software is the only way forwards
The hardware can't automatically convert dependent workloads into independent workloads. Neither can software. But a programmer who's prepared to redesign their system to take advantage of lots of cores just might.
answered 18 hours ago
pjc50
$begingroup$
Citation needed for "can't run all the cores at the same time". Unless you consider the single-core max turbo clock speed to be the "real" clock speed of the CPU. In the classic sense (before we hit the power wall and clock speed was limited by critical path propagation delays), yes that's true, but in the modern world it makes more sense to look at the baseline clock speed as what can be sustained with all cores active running heavy workloads. Anything higher than that is gravy you can opportunistically use as power / thermal limits allow. (e.g. Intel's Turbo).
$endgroup$
– Peter Cordes
7 hours ago
$begingroup$
But in terms of power, even a single core's max clock is limited by thermals more so than by propagation delays (although probably the pipeline stage boundaries are selected so you're close to that limit at the target max turbo). And voltage is a variable too: worse power but shorter gate delays. So anyway, it doesn't make sense to consider the single-core max turbo as something you "should" be able to run all cores at, because that limit already comes from power.
$endgroup$
– Peter Cordes
6 hours ago
In addition to the other answers, there is another element: chip yields. A modern processor has several billion transistors in it, and each and every one of those transistors has to work perfectly in order for the whole chip to function properly.
By making multi-core processors, you can cleanly partition groups of transistors. If a defect exists in one of the cores, you can disable that core and sell the chip at a reduced price according to the number of functioning cores. Likewise, you can also assemble systems out of validated components, as in an SMP system.
Virtually every CPU you buy started life as a top-end premium model for its processor line. What you end up with depends on which portions of that chip are working incorrectly and have been disabled. Intel doesn't make any i3 processors: they are all defective i7s, with all the features that separate the product lines disabled because they failed testing. However, the portions that are still working are still useful and can be sold for much cheaper. Anything worse becomes keychain trinkets.
And defects are not uncommon. Perfectly creating those billions of transistors is not an easy task. If you have no opportunities to selectively use portions of a given chip, the price of the result is going to go up, real fast.
With just a single über processor, manufacturing is all or nothing, resulting in a much more wasteful process. For some devices, like image sensors for scientific or military purposes, where you need a huge sensor and it all has to work, the costs of those devices are so enormous only state-level budgets can afford them.
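To put rough numbers on this, here is a back-of-the-envelope sketch using the classic Poisson yield model; the defect density and die area below are invented for illustration and are not figures from the answer:

```cpp
#include <cmath>
#include <cstdio>

// n-choose-k as a double, good enough for small n.
static double binom(int n, int k) {
    double r = 1.0;
    for (int i = 1; i <= k; ++i) r = r * (n - k + i) / i;
    return r;
}

int main() {
    const double D         = 0.2;               // assumed defects per cm^2
    const double die_area  = 4.0;               // assumed 400 mm^2 die
    const int    cores     = 8;
    const double core_area = die_area / cores;  // uncore ignored for simplicity

    // Poisson model: a block of area A is defect-free with probability exp(-D*A).
    double monolithic = std::exp(-D * die_area);   // every transistor must work
    double core_ok    = std::exp(-D * core_area);  // one core is defect-free

    // Chip is sellable if at least 6 of the 8 cores work (8-, 7- or 6-core SKU).
    double sellable = 0.0;
    for (int k = 6; k <= cores; ++k)
        sellable += binom(cores, k) * std::pow(core_ok, k)
                                    * std::pow(1.0 - core_ok, cores - k);

    std::printf("all-or-nothing yield : %.1f%%\n", 100.0 * monolithic);
    std::printf("sellable (>= 6 cores): %.1f%%\n", 100.0 * sellable);
}
```

With these made-up numbers the all-or-nothing die comes out working less than half the time, while a chip sellable as a 6-, 7- or 8-core part comes out around 96% of the time.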
answered 10 hours ago
whatsisname
$begingroup$
If/when yields improve and produce more fully-working chips than the market demands, vendors usually start fusing off some of the cores/cache and/or binning them as lower-frequency SKUs, instead of adjusting the price structure to make the high-end chips relatively cheaper. With GPUs / graphics cards you used to be able to unlock disabled shader units on some cards with a firmware hack, to see if you got lucky and got a card where they were only disabled for market segmentation, not actual defects.
$endgroup$
– Peter Cordes
6 hours ago
$begingroup$
Intel has manufactured dual-core dies for some of their chips. With all their ULV (ultralow voltage) mobile SKUs being dual-core, there weren't enough defective quad-cores, and the smaller die area (especially with a cut-down iGPU as well) gives more working dual-core chips per wafer than fusing off quad-core dies. en.wikichip.org/wiki/intel/microarchitectures/… has die-shots of Sandybridge: 131 mm² dual-core + GT1 graphics, vs. 149 mm² dual-core + GT2 graphics, vs. 216 mm² quad + GT2. There's still room for defects in cache etc.
$endgroup$
– Peter Cordes
6 hours ago
$begingroup$
And (some) defects in part of an FMA unit can presumably be handled by fusing it off and selling it as a Celeron or Pentium chip (no AVX, so only 128-bit vectors.) Even modern Skylake or Coffee Lake Pentium chips lack AVX. The SIMD FMA units make up a decent fraction of a core (and run many SIMD ops other than FP math, including integer mul and integer shift), so I wouldn't be surprised if the 2x 256-bit FMA units can be mapped to 2x 128-bit using whichever 2 chunks are still working. With Skylake Xeon, there are even SKUs with reduced AVX512 FMA throughput (only 1 working 512-bit FMA)
$endgroup$
– Peter Cordes
6 hours ago
Wow, I can tell this is being asked by someone under 30!
Going back in time, processors weren't able to run that fast. As a result, if you wanted to do more processing then you needed more processors. This could be with a maths coprocessor, or it could simply be with more of the same processor. The best example of this is the Inmos Transputer from the 80s, which was specifically designed for massively parallel processing with multiple processors plugged together. The whole concept hinged on the assumption that there was no better way to increase processing power than to add processors.
Trouble is, that assumption was temporarily incorrect. You can also get more processing power by making one processor do more calculations. Intel and AMD found ways to push clock speeds ever higher, and as you say, it's way easier to keep everything on one processor. The result was that until the mid 2000s, the fast single-core processor owned the market. Inmos died a death in the early 90s, and all their experience died with them.
The good times had to end though. Once clock speeds got up to GHz there really wasn't scope for going further. And back we went to multiple cores again. If you genuinely can't get faster, more cores is the answer. As you say though, it isn't always easy to use those cores effectively. We're a lot better these days, but we're still some way off making it as easy as the Transputer did.
Of course there are other options for improvement as well - you could be more efficient instead. SIMD and similar instruction sets get more processing done for the same number of clock ticks. DDR gets your data into and out of the processor faster. It all helps. But when it comes to processing, we're back to the 80s and multiple cores again.
edited 11 hours ago
answered 11 hours ago
Graham
$begingroup$
How did you know? Yes, I am 27 lol
$endgroup$
– wav scientist
6 hours ago
$begingroup$
@wavscientist - A lot of this stuff was explored in the 80s. There didn't use to be the current x86/ARM monoculture. There used to be lots of different processor architectures that explored different ways to get more performance out of the same clock-rate.
$endgroup$
– Connor Wolf
2 hours ago
You point out that a lot software doesn't use more than (x) cores. But this is entirely a limitation placed by the designers of that software. Home PCs having multiple cores is still new(ish) and designing multi-threaded software is also more difficult with traditional APIs and languages.
Your PC is also not just running that 1 program. It is doing a whole bunch of other things that can be put onto less active cores so your primary software isn't getting interrupted by them as much.
It's not currently possible to just increase the speed of a single core to match the throughput of 8 cores. More speed is likely going to have to come from new architecture.
As more cores are commonly available and APIs are designed with that assumption, programmers will start commonly using more cores. Efforts to make multi-threaded designs easier to make are on going. If you asked this question in a few years you would probably being saying "My games only commonly use 32 cores, so why does my CPU have 256?".
answered 18 hours ago
hekete
53916
2
$begingroup$
The difference between 1 vs. multiple cores is huge in terms of getting software to take advantage. Most algorithms and programs are serial. e.g. Donald Knuth has said that multi-core CPUs look like HW designers are "trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks!"
$endgroup$
– Peter Cordes
6 hours ago
$begingroup$
Unfortunately nobody has yet come up with a way to make a single wide/fast core run a single-threaded program anywhere near as fast as we can get efficiently-parallel code to run across multiple cores. But fortunately CPU designers realize that single-threaded performance is still critical and make each individual core much bigger and more powerful than it would be if they were going for pure throughput on parallel problems. (Compare a Skylake (4-wide) or Ryzen (5-wide) core vs. a core of a Xeon Phi (Knight's Landing / Knight's Mill, based on Silvermont + AVX512), which is 2-wide with limited OoO exec.)
$endgroup$
– Peter Cordes
6 hours ago
1
$begingroup$
Anyway yes, having at least 2 cores is often helpful for a multitasking OS, but pre-emptive multi-tasking on a single core that was 4x or 8x as fast as a current CPU would be pretty good. For many interactive use-cases that would be much better, if it were possible to build at all / with the same power budget. (Dual core does help reduce context-switch costs when multiple tasks want CPU time, though.)
$endgroup$
– Peter Cordes
6 hours ago
$begingroup$
All true, but historically multi-core was more expensive. There wasn't a lot of reason to design parallel algorithms outside of science applications. There is a lot of room for parallelization, even in algorithms that require a mostly serial execution. But current-generation IPC isn't great and is easy to mess up, which generally results in bugs that are really hard to find and fix. Of course a 4x faster CPU would be amazing (but you would still want multiple cores).
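As a hypothetical illustration of how easy those bugs are to write (my sketch, not part of the comment): two threads bumping a plain shared counter is a data race that usually "works" in testing and then loses updates under load; std::atomic is the minimal fix.

```cpp
#include <atomic>
#include <iostream>
#include <thread>

int main() {
    // A plain `int counter = 0;` incremented from both threads would be a
    // data race (undefined behaviour) and typically prints less than 2000000.
    std::atomic<int> counter{0};

    auto work = [&] {
        for (int i = 0; i < 1000000; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);
    };

    std::thread a(work), b(work);
    a.join();
    b.join();

    std::cout << counter.load() << '\n';   // reliably 2000000 with std::atomic
}
```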
$endgroup$
– hekete
13 mins ago
$begingroup$
Good question, or at least one with an interesting answer. Summary:
- The cost of multiple cores scales close to linearly
- The cost of widening the pipeline scales quadratically
- There are serious diminishing IPC returns from just widening the pipeline beyond 3 or 4-wide, even with out-of-order execution to find the ILP. Branch misses and cache misses are hard.
Costs are in die area (manufacturing cost) and/or power (which indirectly limits frequency).
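In rough symbols (my notation, not the answer's), for $n$ cores of fixed width versus a single core of issue width $w$:

$$\text{cost}(n\ \text{cores}) \propto n, \qquad \text{cost}(\text{one } w\text{-wide core}) \propto w^2$$

so replacing 8 narrow cores (total cost proportional to 8) with one 8-wide core (cost proportional to 64) is roughly an 8x increase in cost for the same nominal issue width, before even considering the IPC limits below.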
Donald Knuth said in a 2008 interview
I might as well flame a bit about my personal unhappiness with the current trend toward multicore architecture. To me, it looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks!
Yes, if we could have miracle single-core CPUs with 8x the throughput on real programs, we'd probably still be using them. Maybe dual core to reduce context-switch costs when multiple programs are running; pre-emptive multitasking interrupting the massive out-of-order machinery such a CPU would require would probably hurt even more than it does now.
Or physically it would be single core (simple cache hierarchy) but support SMT (e.g. Intel's HyperThreading) so software could use it as 8 logical cores that dynamically compete for throughput resources. Or when only 1 thread is running / not stalled, it would get the full benefit.
So you'd use multiple threads when that was actually easier/natural (e.g. separate processes running at once), or for easily-parallelized problems with dependency chains that would prevent maxing out the IPC of this beast.
But unfortunately it's wishful thinking on Knuth's part that multi-core CPUs will ever stop being a thing at this point.
I think if they made a 1 core equivalent of an 8 core CPU, that one core would have 800% increase in IPC so you would get the full performance in all programs, not just those that are optimized for multiple cores.
Yes, that's true. If it was possible to build such a CPU at all, it would be very amazing. But I think it's literally impossible on the same semiconductor manufacturing process (i.e. same quality / efficiency of transistors). It's certainly not possible with the same power budget and die area as an 8-core CPU, even though you'd save on logic to glue cores together, and wouldn't need as much space for per-core private caches.
Even if you allow frequency increases (since the real criterion is work per second, not work per clock), making even a 2x faster CPU would be a huge challenge.
If it were possible at anywhere near the same power and die-area budget (thus manufacturing cost) to build such a CPU, yes CPU vendors would already be building them that way.
See the More Cores or Wider Cores? section of Modern Microprocessors: A 90-Minute Guide! for the necessary background to understand this answer; it starts simply with how in-order pipelined CPUs work, then goes superscalar. It then explains how we hit the power wall right around the P4 era, leading to a fundamental shift in how smaller transistors are used: going for higher ILP instead of higher frequency.
Making a pipeline wider (max instructions per clock) typically scales in cost as width-squared. That cost is measured in die area and/or power, for wider parallel dependency checking (hazard detection), a wider out-of-order scheduler to find ready instructions to run, and more read/write ports on your register file and cache if you want to run instructions other than nop. Especially if you have 3-input instructions like FMA or add-with-carry (2 registers + flags).
There are also diminishing IPC returns for making CPUs wider; most workloads have limited small-scale / short-range ILP (Instruction-Level Parallelism) for CPUs to exploit, so making the core wider doesn't increase IPC (instructions per clock) if IPC is already limited to less than the width of the core by dependency chains, branch misses, cache misses, or other stalls. Sure you'd get a speedup in some unrolled loops with independent iterations, but that's not what most code spends most of its time doing. Compare/branch instructions make up 20% of the instruction mix in "typical" code, IIRC. (I think I've read numbers from 15 to 25% for various data sets.)
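To make the dependency-chain point concrete, here is a made-up illustration (not from the answer): the first loop is one long serial chain, so it is limited by FP-add latency no matter how wide the core is, while the second exposes four independent chains that a wide out-of-order core can overlap.

```cpp
#include <cstddef>

// One long serial dependency chain: each add must wait for the previous
// result, so throughput is bounded by FP-add latency, not by core width.
double sum_serial(const double* a, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

// Four independent chains: a wide out-of-order core (or the vectorizer)
// can keep several adders busy in parallel, then combine at the end.
double sum_unrolled(const double* a, std::size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i)
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

Most real code looks more like the branchy, pointer-chasing middle ground than like either extreme, which is exactly the point about limited short-range ILP.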
Also, a cache miss that stalls all dependent instructions (and then everything once ROB capacity is reached) costs more for a wider CPU. (The opportunity cost of leaving more execution units idle; more potential work not getting done.) Or a branch miss similarly causes a bubble.
To get 8x the IPC, we'd need at least an 8x improvement in branch-prediction accuracy and in cache hit rates. But cache hit rates don't scale well with cache capacity past a certain point for most workloads. And HW prefetching is smart, but can't be that smart. And at 8x the IPC, the branch predictors need to produce 8x as many predictions per cycle as well as having them be more accurate.
Current techniques for building out-of-order execution CPUs can only find ILP over short ranges. For example, Skylake's ROB size is 224 fused-domain uops, and its scheduler holds 97 unfused-domain uops waiting to execute. See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for a case where scheduler size is the limiting factor in extracting ILP from 2 long chains of instructions, if they get too long. And/or see this more general and introductory answer.
So finding ILP between two separate long loops is not something we can do with hardware. Dynamic binary-recompilation for loop fusion could be possible in some cases, but hard and not something CPUs can really do unless they go the Transmeta Crusoe route. (x86 emulation layer on top of a different internal ISA; in that case VLIW). But standard modern x86 designs with uop caches and powerful decoders aren't easy to beat for most code.
And outside of x86, all ISAs still in use are relatively easy to decode, so there's no motivation for dynamic-recompilation other than long-distance optimizations. TL:DR: hoping for magic compilers that can expose more ILP to the hardware didn't work out for Itanium IA-64, and is unlikely to work for a super-wide CPU for any existing ISA with a serial model of execution.
If you did have a super-wide CPU, you'd definitely want it to support SMT so you can keep it fed with work to do by running multiple low-ILP threads.
Since Skylake is currently 4 uops wide (and achieves a real IPC of 2 to 3 uops per clock, or even closer to 4 in high-throughput code), a hypothetical 8x wider CPU would be 32-wide!
Being able to carve that back into 8 or 16 logical CPUs that dynamically share those execution resources would be fantastic: non-stalled threads get all the front-end bandwidth and back-end throughput.
But with 8 separate cores, when a thread stalls there's nothing else to keep the execution units fed; the other threads don't benefit.
Execution is often bursty: it stalls waiting for a cache miss load, then once that arrives many instructions in parallel can use that result. With a super-wide CPU, that burst can go faster, and it can actually help with SMT.
But we can't have magical super-wide CPUs
So to gain throughput we instead have to expose parallelism to the hardware in the form of thread-level parallelism. Generally compilers aren't great at knowing when/how to use threads, other than for simple cases like very big loops (OpenMP, or gcc's -ftree-parallelize-loops). It still takes human cleverness to rework code to efficiently get useful work done in parallel, because inter-thread communication is expensive, and so is thread startup.
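For the easy case mentioned above, here is a minimal OpenMP sketch (my illustration; assumes compilation with something like gcc -O3 -fopenmp) that exposes a big loop's iterations as thread-level parallelism:

```cpp
#include <cstddef>
#include <vector>

// Iterations are independent, so OpenMP can hand chunks of the index range
// to different cores; the reduction clause combines the per-thread sums.
double dot(const std::vector<double>& x, const std::vector<double>& y) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(x.size()); ++i)
        sum += x[i] * y[i];
    return sum;
}
```

For a loop this cheap, the thread-startup and communication costs warned about above can easily eat the speedup; TLP pays off on coarser chunks of work.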
TLP is coarse-grained parallelism, unlike the fine-grained ILP within a single thread of execution which HW can exploit.
BTW, all this is orthogonal to SIMD. Getting more work done per instruction always helps, if it's possible for your problem.
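As a small SIMD illustration (mine, not the answer's; assumes an x86 CPU with AVX and a compiler flag like -mavx), each instruction here does four double-precision adds at once:

```cpp
#include <immintrin.h>
#include <cstddef>

// Adds two arrays of doubles, 4 elements per AVX instruction.
// Assumes n is a multiple of 4 to keep the example short.
void add_arrays(const double* a, const double* b, double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);
        __m256d vb = _mm256_loadu_pd(b + i);
        _mm256_storeu_pd(out + i, _mm256_add_pd(va, vb));
    }
}
```

Modern compilers will often auto-vectorize a plain loop like this at -O3 anyway; the intrinsics just make the "more work per instruction" point explicit.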
$endgroup$
edited 2 hours ago
answered 2 hours ago
Peter Cordes
51649
$begingroup$
Let me draw an analogy:
If you have a monkey typing away at a typewriter, and you want more typing to get done, you can give the monkey coffee, typing lessons, and perhaps make threats to get it to work faster, but there comes a point where the monkey will be typing at maximum capacity.
So if you want to get more typing done, you have to get more monkeys.
$endgroup$
answered 2 hours ago
EvilSnack
1211
$begingroup$
Multicores aren't usually multiscalar, and multiscalar cores aren't multicores. It would be sort of perfect to find a multiscalar architecture running at several megahertz, but in general its bridges would not be consumer-enabled but costly, so the tendency is multicore programming at lower clock speeds rather than short instructions at high clock speeds. Multiple-instruction cores are cheaper and easier to command; that's why it's a bad idea to have a multiscalar architecture at several gigahertz.
$endgroup$
answered 1 hour ago
machtur
112
7
$begingroup$
How would you propose making 1 core equivalent to 8? You can't just bump up the clock frequency.
$endgroup$
– Tom Carpenter
20 hours ago
9
$begingroup$
Like 8 different operations at once? ...
$endgroup$
– Colin
19 hours ago
6
$begingroup$
What "more stuff" do you want to do? You either have to go parallel to do "more stuff" (i.e. more CPU cores), go more complex instruction set (more transistors yes, but also more heat and slower), or increase the clock speed (power scales to the square of frequency)
$endgroup$
– Tom Carpenter
19 hours ago
4
$begingroup$
Distance from one edge of a big CPU to the other is a big issue. If you need to transfer data across it synchronously, that will set the limit for your clock speed. Take a look at AMD's HyperTransport for details on how complex this is.
$endgroup$
– winny
19 hours ago
10
$begingroup$
If you consolidate everything into one core, then try to do multiple independent things at once, you will have "multi-core by detail" - each opcode and its instructions act like virtual cores, but with shared parts that slow you down. It is simpler and cleaner - and faster - by far to formalize the split into cores, and assign tasks to them as needed. A single big core would be a bigger waste than multiple cores.
$endgroup$
– JRE
19 hours ago