Why not make one big CPU core?




I don't understand why CPU manufacturers make multi-core chips. The scaling of multiple cores is horrible and highly application-specific. I'm sure you can point to a particular program or piece of code that runs great on many cores, but most of the time the scaling is garbage. It's a waste of silicon die space and a waste of energy.

Games, for example, almost never use more than 4 cores. Science and engineering simulations like Ansys or Fluent are priced by how many cores the PC they run on has, so you pay more because you have more cores, but the benefit of more cores becomes really poor past 16 cores. Yet you have these 64-core workstations... it's a waste of money and energy; better to buy a 1500 W heater for the winter, much cheaper.

Why don't they make a CPU with just one big core? I think if they made a 1-core equivalent of an 8-core CPU, that one core would have an 800% increase in IPC, so you would get the full performance in all programs, not just those that are optimized for multiple cores. More IPC increases performance everywhere; it's a reliable and simple way to increase performance. Multiple cores increase performance only in a limited number of programs, and the scaling is horrible and unreliable.
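For reference, the diminishing returns described above are what Amdahl's law predicts whenever only part of a program can run in parallel. A minimal sketch in C (the 90% parallel fraction is just an assumed figure for illustration, not a measurement of any particular workload):

    #include <stdio.h>

    /* Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n),
     * where p is the parallelizable fraction of the work
     * and n is the number of cores. */
    static double amdahl(double p, int n)
    {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void)
    {
        double p = 0.90;   /* assumed: 90% of the work can be parallelized */
        int cores[] = {1, 2, 4, 8, 16, 64};

        for (size_t i = 0; i < sizeof cores / sizeof cores[0]; i++)
            printf("%2d cores -> %.2fx speedup\n", cores[i], amdahl(p, cores[i]));
        return 0;
    }

With these numbers, 16 cores give about 6.4x and 64 cores only about 8.8x, while the upper bound for p = 0.9 is 10x no matter how many cores you add.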










cpu

asked 20 hours ago by wav scientist, edited 8 hours ago by gattsbr







  • How would you propose making 1 core equivalent to 8? You can't just bump up the clock frequency. – Tom Carpenter, 20 hours ago

  • Like 8 different operations at once? ... – Colin, 19 hours ago

  • What "more stuff" do you want to do? You either have to go parallel to do "more stuff" (i.e. more CPU cores), go to a more complex instruction set (more transistors, yes, but also more heat and slower), or increase the clock speed (power scales with the square of frequency). – Tom Carpenter, 19 hours ago

  • Distance from one edge of a big CPU to the other is a big issue. If you need to transfer data between them synchronously, that will set the limit for your clock speed. Take a look at AMD's HyperTransport for details on how complex this is. – winny, 19 hours ago

  • If you consolidate everything into one core, then try to do multiple independent things at once, you will have "multi-core by detail" - each opcode and its instructions act like virtual cores, but with shared parts that slow you down. It is simpler and cleaner - and faster by far - to formalize the split into cores and assign tasks to them as needed. A single big core would be a bigger waste than multiple cores. – JRE, 19 hours ago















































8 Answers



















The problem lies with the assumption that CPU manufacturers can just add more transistors to make a single CPU core more powerful without consequence.

To make a CPU do more, you have to plan what doing more entails. There are really three options:

  1. Make the core run at a higher clock frequency - The trouble with this is we are already hitting the limitations of what we can do.

    Power usage, and hence thermal dissipation, increases with the square of frequency - if you double the frequency you nominally quadruple the power dissipation.

    Interconnects and transistors also have propagation delays due to the non-ideal nature of the world. You can't just increase the number of transistors and expect to be able to run at the same clock frequency.

    We are also limited by external hardware - mainly RAM. To make the CPU faster, you have to increase the memory bandwidth, either by running it faster or by increasing the data bus width.

  2. Add more complex instructions - Instead of running faster, we can add a richer instruction set - common tasks like encryption etc. can be hardened into the silicon. Rather than taking many clock cycles to calculate in software, we instead have hardware acceleration.

    This is already being done on Complex Instruction Set (CISC) processors. See things like SSE2, SSE3. A single CPU core today is far, far more powerful than a CPU core from even 10 years ago, even if run at the same clock frequency.

    The trouble is, as you add more complicated instructions, you add more complexity and make the chip bigger. As a direct result the CPU gets slower - the achievable clock frequencies drop as propagation delays rise.

    These complex instructions also don't help you with simple tasks. You can't harden every possible use case, so inevitably large parts of the software you are running will not benefit from new instructions, and in fact will be harmed by the resulting clock rate reduction.

    You can also make the data bus widths larger to process more data at once; however, again this makes the CPU larger, and you hit a tradeoff between the throughput gained through larger data buses and the clock rate dropping. If you only have small data (e.g. 32-bit integers), having a 256-bit CPU doesn't really help you.

  3. Make the CPU more parallel - Rather than trying to do one thing faster, instead do multiple things at the same time. If the task you are doing lends itself to operating on several things at a time, then you want either a single CPU that can perform multiple calculations per instruction (Single Instruction Multiple Data, SIMD - see the short sketch after this list), or multiple CPUs that can each perform one calculation.

    This is one of the key drivers for multi-core CPUs. If you have multiple programs running, or can split your single program into multiple tasks, then having multiple CPU cores allows you to do more things at once.

    Because the individual CPU cores are effectively separate blocks (barring caches and memory interfaces), each individual core is smaller than the equivalent single monolithic core. Because the core is more compact, propagation delays reduce, and you can run each core faster.

    As to whether a single program can benefit from having multiple cores, that is entirely down to what that program is doing, and how it was written.
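To make the SIMD idea in option 3 concrete, here is a minimal sketch using the SSE2 intrinsics mentioned under option 2 (a packed add of four 32-bit integers with a single instruction; the values are arbitrary and this is only an illustration, not anything tied to a specific CPU):

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdio.h>

    int main(void)
    {
        int a[4] = {1, 2, 3, 4};
        int b[4] = {10, 20, 30, 40};
        int c[4];

        /* One SSE2 instruction adds four 32-bit integers at once
         * instead of looping over them one by one. */
        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        __m128i vc = _mm_add_epi32(va, vb);
        _mm_storeu_si128((__m128i *)c, vc);

        printf("%d %d %d %d\n", c[0], c[1], c[2], c[3]);  /* 11 22 33 44 */
        return 0;
    }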















  • The Pentium MMX used the P5 microarchitecture, but it's not called Pentium 5. It was followed by the P6 microarchitecture and then, later, the Pentium II or Pentium 2. – penguin359, 9 hours ago

  • To your point 2: modern chips are actually capable of running much faster than they do, except if they did they would melt. So the real practical limit is thermal. Because of this, adding new instructions, even complex ones, doesn't really decrease the clock speed, as the clock speed is already thermally limited. – alex.forencich, 7 hours ago

  • Having 256-bit buses between L2 and L1d cache does help, and is done in practice, or 512-bit on Intel since Haswell (2013). See also How can cache be that fast? for a diagram of Sandybridge. Note that even back then it has a 32-byte wide ring bus connecting all the cores and the mem controllers. Moving cache lines faster mostly helps with workloads with poor locality that touch a cache line but don't stop to process all of it. Or with SIMD, which can keep up even when processing all bytes. – Peter Cordes, 7 hours ago

  • Modern Skylake-server chips have 64-byte SIMD vectors; wider load/store paths allow you to process more 32-bit integers in parallel. A single 256-bit integer is rarely useful, but wider SIMD can be. SIMD isn't exclusively CISC. All high-performance modern ISAs have SIMD extensions: IBM POWER, AArch64 (again very RISCy), and MIPS has optional SIMD extensions. – Peter Cordes, 7 hours ago

  • @PeterCordes All the things he listed as examples of “CISC” were really examples of SIMD. Classic CISC architectures did not have them (other than Cray supercomputers), whereas modern RISC chips do. – Davislor, 6 hours ago




















Data dependency



It's fairly easy to add more instructions per clock by making a chip "wider" - this has been the "SIMD" approach. The problem is that this doesn't help most use cases.



There are roughly two types of workload, independent and dependent. An example of an independent workload might be "given two sequences of numbers A1, A2, A3... and B1, B2,... etc, calculate (A1+B1) and (A2+B2) etc." This kind of workload is seen in computer graphics, audio processing, machine learning, and so on. Quite a lot of this has been given to GPUs, which are designed especially to handle it.



A dependent workload might be "Given A, add 5 to it and look that up in a table. Take the result and add 16 to it. Look that up in a different table."



The advantage of the independent workload is that it can be split into lots of different parts, so more transistors helps with that. For dependent workloads, this doesn't help at all - more transistors can only make it slower. If you have to get a value from memory, that's a disaster for speed. A signal has to be sent out across the motherboard, travelling sub-lightspeed, the DRAM has to charge up a row and wait for the result, then send it all the way back. This takes tens of nanoseconds. Then, having done a simple calculation, you have to send off for the next one.
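A rough sketch in C of the two kinds of workload described above (the array and table names are placeholders, and bounds are assumed to be in range for the sketch):

    #include <stddef.h>

    /* Independent workload: every iteration stands alone, so a wider core
     * (SIMD) or more cores can all make progress at once. */
    void add_arrays(const float *a, const float *b, float *out, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = a[i] + b[i];      /* no iteration depends on another */
    }

    /* Dependent workload: each step needs the previous result (chained
     * table lookups), so extra transistors cannot start step 2 early. */
    int chained_lookup(const int *table1, const int *table2, int a)
    {
        int x = table1[a + 5];         /* must finish before the next line */
        int y = table2[x + 16];        /* serialized on x */
        return y;
    }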



Power management



Spare cores are turned off most of the time. In fact, on quite a lot of processors, you can't run all the cores all of the time without the thing catching fire, so the system will turn them off or downclock them for you.



Rewriting the software is the only way forwards



The hardware can't automatically convert dependent workloads into independent workloads. Neither can software. But a programmer who's prepared to redesign their system to take advantage of lots of cores just might.














  • Citation needed for "can't run all the cores at the same time". Unless you consider the single-core max turbo clock speed to be the "real" clock speed of the CPU. In the classic sense (before we hit the power wall and clock speed was limited by critical-path propagation delays), yes that's true, but in the modern world it makes more sense to look at the baseline clock speed as what can be sustained with all cores active running heavy workloads. Anything higher than that is gravy you can opportunistically use as power / thermal limits allow (e.g. Intel's Turbo). – Peter Cordes, 7 hours ago

  • But in terms of power, even a single core's max clock is limited by thermals more so than propagation delays (although probably the pipeline stage boundaries are selected so you're close to that limit at the target max turbo). And voltage is a variable too: worse power but shorter gate delays. So anyway, it doesn't make sense to consider the single-core max turbo as something you "should" be able to run all cores at, because that limit already comes from power. – Peter Cordes, 6 hours ago


















$begingroup$

In addition to the other answers, there is another element: chip yields. A modern processor has several billion transistors in it, and each and every one of those transistors has to work perfectly in order for the whole chip to function properly.

By making multi-core processors, you can cleanly partition groups of transistors. If a defect exists in one of the cores, you can disable that core and sell the chip at a reduced price according to the number of functioning cores. Likewise, you can also assemble systems out of validated components, as in an SMP system.

For virtually every CPU you buy, it started life being made to be a top-end premium model for that processor line. What you end up with depends on which portions of that chip are working incorrectly and get disabled. Intel doesn't make any i3 processors: they are all defective i7s, with all the features that separate the product lines disabled because they failed testing. However, the portions that are still working are still useful and can be sold for much cheaper. Anything worse becomes keychain trinkets.

And defects are not uncommon. Perfectly creating those billions of transistors is not an easy task. If you have no opportunities to selectively use portions of a given chip, the price of the result is going to go up, real fast.

With just a single über processor, manufacturing is all or nothing, resulting in a much more wasteful process. For some devices, like image sensors for scientific or military purposes, where you need a huge sensor and it all has to work, the costs of those devices are so enormous that only state-level budgets can afford them.
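To put rough numbers on this, here is a sketch using the simple Poisson yield model Y = exp(-area x defect density). The die area and defect density below are made-up illustrative values, not real process data:

    #include <math.h>
    #include <stdio.h>

    /* Poisson yield model: P(region has no defect) = exp(-area * D0),
     * with area in cm^2 and D0 the defect density in defects/cm^2. */
    int main(void)
    {
        double die_area = 4.0;   /* cm^2, a large hypothetical die */
        double d0       = 0.5;   /* assumed defects per cm^2       */
        int    cores    = 8;

        double monolithic_yield = exp(-die_area * d0);
        double core_yield       = exp(-(die_area / cores) * d0);
        double expected_cores   = cores * core_yield;

        printf("Fully working monolithic die: %.1f%% of dies\n",
               100.0 * monolithic_yield);
        printf("Per-core yield:               %.1f%% (~%.1f of %d cores usable)\n",
               100.0 * core_yield, expected_cores, cores);
        return 0;   /* build with: gcc yield.c -lm */
    }

With these numbers only about 13.5% of monolithic dies are perfect, but each small core works about 77.9% of the time, so most dies can still be sold as a part with fewer cores enabled.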














  • If/when yields improve and are producing more fully-working chips than the market demands, vendors usually start fusing off some of the cores/cache and/or binning them at a lower-frequency SKU, instead of adjusting the price structure to make the high-end chips relatively cheaper. With GPUs / graphics cards you used to be able to unlock disabled shader units on some cards with a firmware hack, to see if you got lucky and got a card where they were only disabled for market segmentation, not actual defects. – Peter Cordes, 6 hours ago

  • Intel has manufactured dual-core dies for some of their chips. With all their ULV (ultra-low voltage) mobile SKUs being dual-core, there weren't enough defective quad-cores, and the smaller die area (especially with a cut-down iGPU as well) gives more working dual-core chips per wafer than fusing off quad-core dies. en.wikichip.org/wiki/intel/microarchitectures/… has die shots of Sandybridge: 131 mm² dual-core + GT1 graphics, vs. 149 mm² dual-core + GT2 graphics, vs. 216 mm² quad + GT2. There's still room for defects in cache etc. – Peter Cordes, 6 hours ago

  • And (some) defects in part of an FMA unit can presumably be handled by fusing it off and selling it as a Celeron or Pentium chip (no AVX, so only 128-bit vectors). Even modern Skylake or Coffee Lake Pentium chips lack AVX. The SIMD FMA units make up a decent fraction of a core (and run many SIMD ops other than FP math, including integer mul and integer shift), so I wouldn't be surprised if the 2x 256-bit FMA units can be mapped to 2x 128-bit using whichever 2 chunks are still working. With Skylake Xeon, there are even SKUs with reduced AVX512 FMA throughput (only 1 working 512-bit FMA). – Peter Cordes, 6 hours ago



















Wow, I can tell this is being asked by someone under 30!



Going back in time, processors weren't able to run that fast. As a result, if you wanted to do more processing then you needed more processors. This could be with a maths coprocessor, or it could simply be with more of the same processor. The best example of this is the Inmos Transputer from the 80s, which was specifically designed for massively parallel processing with multiple processors plugged together. The whole concept hinged on the assumption that there was no better way to increase processing power than to add processors.



Trouble is, that assumption was temporarily incorrect. You can also get more processing power by making one processor do more calculations. Intel and AMD found ways to push clock speeds ever higher, and as you say, it's way easier to keep everything on one processor. The result was that until the mid 2000s, the fast single-core processor owned the market. Inmos died a death in the early 90s, and all their experience died with them.



The good times had to end though. Once clock speeds got up to GHz there really wasn't scope for going further. And back we went to multiple cores again. If you genuinely can't get faster, more cores is the answer. As you say though, it isn't always easy to use those cores effectively. We're a lot better these days, but we're still some way off making it as easy as the Transputer did.



Of course there are other options for improvement as well - you could be more efficient instead. SIMD and similar instruction sets get more processing done for the same number of clock ticks. DDR gets your data into and out of the processor faster. It all helps. But when it comes to processing, we're back to the 80s and multiple cores again.


















  • How did you know? Yes, I am 27 lol – wav scientist, 6 hours ago

  • @wavscientist - A lot of this stuff was explored in the 80s. There didn't use to be the current x86/ARM monoculture. There used to be lots of different processor architectures that explored different ways to get more performance out of the same clock rate. – Connor Wolf, 2 hours ago



















You point out that a lot of software doesn't use more than (x) cores. But this is entirely a limitation placed by the designers of that software. Home PCs having multiple cores is still new(ish), and designing multi-threaded software is also more difficult with traditional APIs and languages.

Your PC is also not just running that one program. It is doing a whole bunch of other things that can be put onto less active cores, so your primary software isn't interrupted by them as much.

It's not currently possible to just increase the speed of a single core to match the throughput of 8 cores. More speed is likely going to have to come from new architectures.

As more cores become commonly available and APIs are designed with that assumption, programmers will start commonly using more cores. Efforts to make multi-threaded designs easier to create are ongoing. If you asked this question in a few years, you would probably be saying "My games only commonly use 32 cores, so why does my CPU have 256?".
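As a small sketch of what "using more cores" asks of the programmer with a traditional API (POSIX threads here; the sum-of-an-array task and the 4-thread split are arbitrary choices for illustration), note how much bookkeeping falls on the programmer:

    #include <pthread.h>
    #include <stdio.h>

    #define N_THREADS 4
    #define N         (1 << 20)

    static double data[N];
    static double partial[N_THREADS];

    /* Each worker sums its own slice of the array. */
    static void *worker(void *arg)
    {
        long id = (long)arg;
        size_t begin = (size_t)id * (N / N_THREADS);
        size_t end   = begin + (N / N_THREADS);
        double sum = 0.0;
        for (size_t i = begin; i < end; i++)
            sum += data[i];
        partial[id] = sum;
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[N_THREADS];

        for (size_t i = 0; i < N; i++)
            data[i] = 1.0;

        /* The programmer must split the work, launch the threads,
         * wait for them, and combine the results by hand. */
        for (long t = 0; t < N_THREADS; t++)
            pthread_create(&threads[t], NULL, worker, (void *)t);

        double total = 0.0;
        for (long t = 0; t < N_THREADS; t++) {
            pthread_join(threads[t], NULL);
            total += partial[t];
        }
        printf("sum = %.0f\n", total);   /* 1048576 */
        return 0;   /* build with: gcc threads.c -pthread */
    }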














  • The difference between 1 vs. multiple cores is huge in terms of getting software to take advantage. Most algorithms and programs are serial. e.g. Donald Knuth has said that multi-core CPUs look like HW designers are "trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks!" – Peter Cordes, 6 hours ago

  • Unfortunately nobody has yet come up with a way to make a single wide/fast core run a single-threaded program anywhere near as fast as we can get efficiently-parallel code to run across multiple cores. But fortunately CPU designers realize that single-threaded performance is still critical and make each individual core much bigger and more powerful than it would be if they were going for pure throughput on parallel problems. (Compare a Skylake (4-wide) or Ryzen (5-wide) vs. a core of a Xeon Phi (Knight's Landing / Knight's Mill, based on Silvermont + AVX512), which is 2-wide with limited OoO exec.) – Peter Cordes, 6 hours ago

  • Anyway yes, having at least 2 cores is often helpful for a multitasking OS, but pre-emptive multi-tasking on a single core that was 4x or 8x as fast as a current CPU would be pretty good. For many interactive use-cases that would be much better, if it were possible to build at all / with the same power budget. (Dual core does help reduce context-switch costs when multiple tasks want CPU time, though.) – Peter Cordes, 6 hours ago

  • All true, but historically multi-core was more expensive. There wasn't a lot of reason to design parallel algorithms outside of science applications. There is a lot of room for parallelization, even in algorithms that require a mostly serial execution. But current-generation IPC isn't great and is easy to mess up, which generally results in bugs that are really hard to find and fix. Of course a 4x faster CPU would be amazing (but you would still want multiple cores). – hekete, 13 mins ago



















Good question, or at least one with an interesting answer. Summary:



  • The cost of multiple cores scales close to linearly

  • The cost of widening the pipeline scales quadratically

  • Serious diminishing IPC returns from just widening the pipeline beyond 3 or 4-wide, even with out-of-order execution to find the ILP. Branch misses and cache misses are hard.

Costs are in die area (manufacturing cost) and/or power (which indirectly limits frequency).




Donald Knuth said in a 2008 interview




I might as well flame a bit about my personal unhappiness with the current trend toward multicore architecture. To me, it looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks!




Yes, if we could have miracle single-core CPUs with 8x the throughput on real programs, we'd probably still be using them. Maybe dual core to reduce context-switch costs when multiple programs are running; pre-emptive multitasking interrupting the massive out-of-order machinery such a CPU would require would probably hurt even more than it does now.



Or physically it would be single core (simple cache hierarchy) but support SMT (e.g. Intel's HyperThreading) so software could use it as 8 logical cores that dynamically compete for throughput resources. Or when only 1 thread is running / not stalled, it would get the full benefit.



So you'd use multiple threads when that was actually easier/natural (e.g. separate processes running at once), or for easily-parallelized problems with dependency chains that would prevent maxing out the IPC of this beast.



But unfortunately it's wishful thinking on Knuth's part that multi-core CPUs will ever stop being a thing at this point.





I think if they made a 1 core equivalent of an 8 core CPU, that one core would have 800% increase in IPC so you would get the full performance in all programs, not just those that are optimized for multiple cores.




Yes, that's true. If it was possible to build such a CPU at all, it would be very amazing. But I think it's literally impossible on the same semiconductor manufacturing process (i.e. same quality / efficiency of transistors). It's certainly not possible with the same power budget and die area as an 8-core CPU, even though you'd save on logic to glue cores together, and wouldn't need as much space for per-core private caches.



Even if you allow frequency increases (since the real criterion is work per second, not work per clock), making even a 2x faster CPU would be a huge challenge.



If it were possible at anywhere near the same power and die-area budget (thus manufacturing cost) to build such a CPU, yes CPU vendors would already be building them that way.



See the More Cores or Wider Cores? section of Modern Microprocessors: A 90-Minute Guide! for the necessary background to understand this answer; it starts simply with how in-order pipelined CPUs work, then superscalar. It then explains how we hit the power wall right around the P4 era, leading to a fundamental shift in scaling with smaller transistors: going for higher ILP instead of frequency.



Making a pipeline wider (max instructions per clock) typically scales in cost as width-squared. That cost is measured in die area and/or power, for wider parallel dependency checking (hazard detection), and a wider out-of-order scheduler to find ready instructions to run. And more read / write ports on your register file and cache if you want to run instructions other than nop. Especially if you have 3-input instructions like FMA or add-with-carry (2 registers + flags).



There are also diminishing IPC returns for making CPUs wider; most workloads have limited small-scale / short-range ILP (Instruction-Level Parallelism) for CPUs to exploit, so making the core wider doesn't increase IPC (instructions per clock) if IPC is already limited to less than the width of the core by dependency chains, branch misses, cache misses, or other stalls. Sure you'd get a speedup in some unrolled loops with independent iterations, but that's not what most code spends most of its time doing. Compare/branch instructions make up 20% of the instruction mix in "typical" code, IIRC. (I think I've read numbers from 15 to 25% for various data sets.)



Also, a cache miss that stalls all dependent instructions (and then everything once ROB capacity is reached) costs more for a wider CPU. (The opportunity cost of leaving more execution units idle; more potential work not getting done.) Or a branch miss similarly causes a bubble.



To get 8x the IPC, we'd need at least an 8x improvement in branch-prediction accuracy and in cache hit rates. But cache hit rates don't scale well with cache capacity past a certain point for most workloads. And HW prefetching is smart, but can't be that smart. And at 8x the IPC, the branch predictors need to produce 8x as many predictions per cycle as well as having them be more accurate.




Current techniques for building out-of-order execution CPUs can only find ILP over short ranges. For example, Skylake's ROB size is 224 fused-domain uops, and its scheduler for non-executed uops holds 97 unfused-domain uops. See "Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths" for a case where scheduler size is the limiting factor in extracting ILP from 2 long chains of instructions, if they get too long. And/or see this more general and introductory answer.



So finding ILP between two separate long loops is not something we can do with hardware. Dynamic binary-recompilation for loop fusion could be possible in some cases, but hard and not something CPUs can really do unless they go the Transmeta Crusoe route. (x86 emulation layer on top of a different internal ISA; in that case VLIW). But standard modern x86 designs with uop caches and powerful decoders aren't easy to beat for most code.



And outside of x86, all ISAs still in use are relatively easy to decode, so there's no motivation for dynamic-recompilation other than long-distance optimizations. TL:DR: hoping for magic compilers that can expose more ILP to the hardware didn't work out for Itanium IA-64, and is unlikely to work for a super-wide CPU for any existing ISA with a serial model of execution.




If you did have a super-wide CPU, you'd definitely want it to support SMT so you can keep it fed with work to do by running multiple low-ILP threads.



Since Skylake is currently 4 uops wide (and achieves a real IPC of 2 to 3 uops per clock, or even closer to 4 in high-throughput code), a hypothetical 8x wider CPU would be 32-wide!



Being able to carve that back into 8 or 16 logical CPUs that dynamically share those execution resources would be fantastic: non-stalled threads get all the front-end bandwidth and back-end throughput.



But with 8 separate cores, when a thread stalls there's nothing else to keep the execution units fed; the other threads don't benefit.



Execution is often bursty: it stalls waiting for a cache miss load, then once that arrives many instructions in parallel can use that result. With a super-wide CPU, that burst can go faster, and it can actually help with SMT.




But we can't have magical super-wide CPUs



So to gain throughput we instead have to expose parallelism to the hardware in the form of thread-level parallelism. Generally compilers aren't great at knowing when/how to use threads, other than for simple cases like very big loops. (OpenMP, or gcc's -ftree-parallelize-loops). It still takes human cleverness to rework code to efficiently get useful work done in parallel, because inter-thread communication is expensive, and so is thread startup.
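As a minimal sketch of the "simple case" mentioned above - a very big, embarrassingly parallel loop handed to OpenMP with a single directive (the array and sizes are arbitrary; assumes a compiler with OpenMP support, e.g. gcc -fopenmp):

    #include <omp.h>
    #include <stdio.h>

    /* Coarse-grained TLP the "easy" way: independent iterations that
     * OpenMP can split across however many cores the machine has. */
    int main(void)
    {
        enum { N = 1 << 22 };
        static double a[N];
        double sum = 0.0;

        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < N; i++) {
            a[i] = 0.5 * (double)i;   /* each iteration stands alone */
            sum += a[i];
        }

        printf("ran on up to %d threads, sum = %.0f\n",
               omp_get_max_threads(), sum);
        return 0;   /* build with: gcc -O2 -fopenmp tlp.c */
    }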



TLP is coarse-grained parallelism, unlike the fine-grained ILP within a single thread of execution which HW can exploit.




BTW, all this is orthogonal to SIMD. Getting more work done per instruction always helps, if it's possible for your problem.



























    Let me draw an analogy:



    If you have a monkey typing away at a typewriter, and you want more typing to get done, you can give the monkey coffee, typing lessons, and perhaps make threats to get it to work faster, but there comes a point where the monkey will be typing at maximum capacity.



    So if you want to get more typing done, you have to get more monkeys.



























Multi-core CPUs aren't usually multiscalar, and multiscalar cores aren't multi-core. It would be sort of perfect to find a multiscalar architecture running at several megahertz, but in general its bridges would not be consumer-friendly, just costly, so the tendency is multi-core programming at lower clock speeds rather than short instructions at high clock speeds. Multiple-instruction cores are cheaper and easier to command; that's why it's a bad idea to have a multiscalar architecture at several gigahertz.



















        Your Answer






        StackExchange.ifUsing("editor", function ()
        return StackExchange.using("schematics", function ()
        StackExchange.schematics.init();
        );
        , "cicuitlab");

        StackExchange.ready(function()
        var channelOptions =
        tags: "".split(" "),
        id: "135"
        ;
        initTagRenderer("".split(" "), "".split(" "), channelOptions);

        StackExchange.using("externalEditor", function()
        // Have to fire editor after snippets, if snippets enabled
        if (StackExchange.settings.snippets.snippetsEnabled)
        StackExchange.using("snippets", function()
        createEditor();
        );

        else
        createEditor();

        );

        function createEditor()
        StackExchange.prepareEditor(
        heartbeatType: 'answer',
        autoActivateHeartbeat: false,
        convertImagesToLinks: false,
        noModals: true,
        showLowRepImageUploadWarning: true,
        reputationToPostImages: null,
        bindNavPrevention: true,
        postfix: "",
        imageUploader:
        brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
        contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
        allowUrls: true
        ,
        onDemand: true,
        discardSelector: ".discard-answer"
        ,immediatelyShowMarkdownHelp:true
        );



        );













        draft saved

        draft discarded


















        StackExchange.ready(
        function ()
        StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2felectronics.stackexchange.com%2fquestions%2f443186%2fwhy-not-make-one-big-cpu-core%23new-answer', 'question_page');

        );

        Post as a guest















        Required, but never shown

























        8 Answers
        8






        active

        oldest

        votes








        8 Answers
        8






        active

        oldest

        votes









        active

        oldest

        votes






        active

        oldest

        votes









        32












        $begingroup$

        The problem lies with the assumption that CPU manufacturers can just add more transistors to make a single CPU core more powerful without consequence.



        To make a CPU do more, you have to plan what doing more entails. There are really three options:




        1. Make the core run at a higher clock frequency - The trouble with this is we are already hitting the limitations of what we can do.



          Power usage and hence thermal dissipation increases to the square or frequency - if you double the frequency you nominally quadrouple the power dissipation.



          Interconnects and transistors also have propagation delays due to the non-ideal nature of the world. You can't just increase the number of transistors and expect to be able to run at the same clock frequency.



          We are also limited by external hardware - mainly RAM. To make the CPU faster, you have to increase the memory bandwidth, by either running it faster, or increasing the data bus width.







        1. Add more complex instructions - Instead of running faster, we can add a more rich instruction set - common tasks like encryption etc. can be hardened into the silicon. Rather than taking many clock cycles to calculate in software, we instead have hardware accelleration.



          This is already being done on Complex Instruction Set (CISC) processors. See things like SSE2, SSE3. A single CPU core today is far far more powerful than a CPU core from even 10 years ago even if run at the same clock frequency.



          The trouble is, as you add more complicated instructions, you add more complexity and make the chip gets bigger. As a direct result the CPU gets slower - the acheivable clock frequencies drop as propagation delays rise.



          These complex instructions also don't help you with simple tasks. You can't harden every possible use case, so inevitably large parts of the software you are running will not benefit from new instructions, and in fact will be harmed by the resulting clock rate reduction.



          You can also make the data bus widths larger to process more data at once, however again this makes the CPU larger and you hit a tradeoff between throughput gained through larger data buses and the clock rate dropping. If you only have small data (e.g. 32-bit integers), having a 256-bit CPU doesn't really help you.







        1. Make the CPU more parallel - Rather than trying to do one thing faster, instead do multiple things at the same time. If the task you are doing lends itself to operating on several things at a time, then you want either a single CPU that can perform multiple calculations per instruction (Single Instruction Multiple Data (SIMD)), or having multiple CPUs that can each perform one calculation.



          This is one of the key drivers for multi-core CPUs. If you have multiple programs running, or can split your single program into multiple task, then having multiple CPU cores allows you to do more things at once.



          Because the individual CPU cores are effectively seperate blocks (barring caches and memory interfaces), each individual core is smaller than the equivalent single monolithic core. Because the core is more compact, propagation delays reduce, and you can run each core faster.



          As to whether a single program can benefit from having multiple cores, that is entirely down to what that program is doing, and how it was written.







        share|improve this answer











        $endgroup$








        • 2




          $begingroup$
          The Pentium MMX used the P5 microarchitecture, but it's not called Pentium 5. It was followed by the P6 microarchitecture and then, later, the Pentium II or Pentium 2.
          $endgroup$
          – penguin359
          9 hours ago






        • 2




          $begingroup$
          To your point 2: modern chips are actually capable of running much faster than they do, except if they did they would melt. So the real practical limit is thermal. Because of this, adding new instructions, even complex ones, doesn't really decrease the clock speed as the clock speed is already thermally limited.
          $endgroup$
          – alex.forencich
          7 hours ago






        • 1




          $begingroup$
          Having 256-bit busses between L2 and L1d cache does help, and is done in practice, or or 512-bit on Intel since Haswell (2013). See also How can cache be that fast? for a diagram of Sandybridge. Note that even back then it has a 32-byte wide ring bus connecting all the cores and the mem controllers. Moving cache lines faster mostly helps with workloads with poor locality that touch a cache line but don't stop to process all of it. Or with SIMD which can keep up even when processing all bytes.
          $endgroup$
          – Peter Cordes
          7 hours ago






        • 1




          $begingroup$
          Modern Skylake-server chips have 64-byte SIMD vectors; wider load/store paths allow you to process more 32-bit integers in parallel. A single 256-bit integer is rarely useful, but wider SIMD can be. SIMD isn't exclusively CISC. All high-performance modern ISAs have SIMD extensions; IBM POWER, AArch64 (again very RISCy), MIPS has optional SIMD extensions.
          $endgroup$
          – Peter Cordes
          7 hours ago






        • 2




          $begingroup$
          @PeterCordes All the things he listed as examples of “CISC” were really examples of SIMD. Classic CISC architectures did not have them (other than Cray supercomputers), whereas modern RISC chips do.
          $endgroup$
          – Davislor
          6 hours ago
















        32












        $begingroup$

        The problem lies with the assumption that CPU manufacturers can just add more transistors to make a single CPU core more powerful without consequence.



        To make a CPU do more, you have to plan what doing more entails. There are really three options:




        1. Make the core run at a higher clock frequency - The trouble with this is we are already hitting the limitations of what we can do.



          Power usage and hence thermal dissipation increases to the square or frequency - if you double the frequency you nominally quadrouple the power dissipation.



          Interconnects and transistors also have propagation delays due to the non-ideal nature of the world. You can't just increase the number of transistors and expect to be able to run at the same clock frequency.



          We are also limited by external hardware - mainly RAM. To make the CPU faster, you have to increase the memory bandwidth, by either running it faster, or increasing the data bus width.







        1. Add more complex instructions - Instead of running faster, we can add a more rich instruction set - common tasks like encryption etc. can be hardened into the silicon. Rather than taking many clock cycles to calculate in software, we instead have hardware accelleration.



          This is already being done on Complex Instruction Set (CISC) processors. See things like SSE2, SSE3. A single CPU core today is far far more powerful than a CPU core from even 10 years ago even if run at the same clock frequency.



          The trouble is, as you add more complicated instructions, you add more complexity and make the chip gets bigger. As a direct result the CPU gets slower - the acheivable clock frequencies drop as propagation delays rise.



          These complex instructions also don't help you with simple tasks. You can't harden every possible use case, so inevitably large parts of the software you are running will not benefit from new instructions, and in fact will be harmed by the resulting clock rate reduction.



          You can also make the data bus widths larger to process more data at once, however again this makes the CPU larger and you hit a tradeoff between throughput gained through larger data buses and the clock rate dropping. If you only have small data (e.g. 32-bit integers), having a 256-bit CPU doesn't really help you.







        1. Make the CPU more parallel - Rather than trying to do one thing faster, instead do multiple things at the same time. If the task you are doing lends itself to operating on several things at a time, then you want either a single CPU that can perform multiple calculations per instruction (Single Instruction Multiple Data (SIMD)), or having multiple CPUs that can each perform one calculation.



          This is one of the key drivers for multi-core CPUs. If you have multiple programs running, or can split your single program into multiple task, then having multiple CPU cores allows you to do more things at once.



          Because the individual CPU cores are effectively seperate blocks (barring caches and memory interfaces), each individual core is smaller than the equivalent single monolithic core. Because the core is more compact, propagation delays reduce, and you can run each core faster.



          As to whether a single program can benefit from having multiple cores, that is entirely down to what that program is doing, and how it was written.







        share|improve this answer











        $endgroup$








          The Pentium MMX used the P5 microarchitecture, but it's not called Pentium 5. It was followed by the P6 microarchitecture and then, later, the Pentium II or Pentium 2.
          – penguin359, 9 hours ago

          To your point 2: modern chips are actually capable of running much faster than they do, except that if they did they would melt. So the real practical limit is thermal. Because of this, adding new instructions, even complex ones, doesn't really decrease the clock speed, as the clock speed is already thermally limited.
          – alex.forencich, 7 hours ago

          Having 256-bit busses between L2 and L1d cache does help, and is done in practice, or 512-bit on Intel since Haswell (2013). See also How can cache be that fast? for a diagram of Sandybridge. Note that even back then it had a 32-byte wide ring bus connecting all the cores and the mem controllers. Moving cache lines faster mostly helps with workloads with poor locality that touch a cache line but don't stop to process all of it. Or with SIMD, which can keep up even when processing all bytes.
          – Peter Cordes, 7 hours ago

          Modern Skylake-server chips have 64-byte SIMD vectors; wider load/store paths allow you to process more 32-bit integers in parallel. A single 256-bit integer is rarely useful, but wider SIMD can be. And SIMD isn't exclusively CISC: all high-performance modern ISAs have SIMD extensions - IBM POWER, AArch64 (again very RISCy), and MIPS has optional SIMD extensions.
          – Peter Cordes, 7 hours ago

          @PeterCordes All the things he listed as examples of “CISC” were really examples of SIMD. Classic CISC architectures did not have them (other than Cray supercomputers), whereas modern RISC chips do.
          – Davislor, 6 hours ago















        Data dependency



        It's fairly easy to add more instructions per clock by making a chip "wider" - this has been the "SIMD" approach. The problem is that this doesn't help most use cases.



        There are roughly two types of workload, independent and dependent. An example of an independent workload might be "given two sequences of numbers A1, A2, A3... and B1, B2,... etc, calculate (A1+B1) and (A2+B2) etc." This kind of workload is seen in computer graphics, audio processing, machine learning, and so on. Quite a lot of this has been given to GPUs, which are designed especially to handle it.



        A dependent workload might be "Given A, add 5 to it and look that up in a table. Take the result and add 16 to it. Look that up in a different table."



        The advantage of the independent workload is that it can be split into lots of different parts, so more transistors help with that. For dependent workloads, this doesn't help at all - more transistors can only make it slower. If you have to get a value from memory, that's a disaster for speed. A signal has to be sent out across the motherboard, travelling sub-lightspeed, the DRAM has to charge up a row and wait for the result, then send it all the way back. This takes tens of nanoseconds. Then, having done a simple calculation, you have to send off for the next one.
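        To illustrate the two shapes in code, here is a small C sketch (my own, with made-up arrays and tables - not from the answer). The first loop's iterations are completely independent, so SIMD units, extra cores or a GPU can all work on it at once; the second is a chain in which each step must wait for the previous table lookup to finish.

            /* Independent vs. dependent workloads - illustrative sketch only. */
            #include <stdio.h>

            #define N 8

            int main(void)
            {
                int A[N] = {1, 2, 3, 4, 5, 6, 7, 8};
                int B[N] = {10, 20, 30, 40, 50, 60, 70, 80};
                int C[N];
                int table1[256], table2[256];

                for (int i = 0; i < 256; i++) {          /* made-up lookup tables */
                    table1[i] = (i * 7)  & 255;
                    table2[i] = (i * 13) & 255;
                }

                /* Independent: every (A[i] + B[i]) stands alone, so the hardware
                   can compute many of them in parallel. */
                for (int i = 0; i < N; i++)
                    C[i] = A[i] + B[i];

                /* Dependent: each lookup needs the previous result before it can
                   even start, so extra execution units cannot speed it up. */
                int x = A[0];
                for (int i = 0; i < N; i++) {
                    x = table1[(x + 5)  & 255];   /* add 5, look it up        */
                    x = table2[(x + 16) & 255];   /* add 16, look that up too */
                }

                printf("independent: %d %d %d ...  dependent chain: %d\n",
                       C[0], C[1], C[2], x);
                return 0;
            }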



        Power management



        Spare cores are turned off most of the time. In fact, on quite a lot of processors, you can't run all the cores all of the time without the thing catching fire, so the system will turn them off or downclock them for you.



        Rewriting the software is the only way forwards



        The hardware can't automatically convert dependent workloads into independent workloads. Neither can software. But a programmer who's prepared to redesign their system to take advantage of lots of cores just might.
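        As a sketch of what that redesign can look like, here is a small POSIX-threads example (mine, not the answer's) that splits an independent summation across two cores. It assumes a POSIX system and compiling with the -pthread flag; the data is made up.

            /* Splitting independent work across two cores - illustrative sketch. */
            #include <pthread.h>
            #include <stdio.h>

            #define N 1000000
            static double data[N];

            struct slice { int begin, end; double partial; };

            static void *sum_slice(void *arg)
            {
                struct slice *s = arg;
                s->partial = 0.0;
                for (int i = s->begin; i < s->end; i++)
                    s->partial += data[i];
                return NULL;
            }

            int main(void)
            {
                for (int i = 0; i < N; i++)
                    data[i] = 1.0;

                /* Each thread (and hence each core) sums its own half. */
                struct slice s[2] = { {0, N / 2, 0.0}, {N / 2, N, 0.0} };
                pthread_t t[2];

                for (int i = 0; i < 2; i++)
                    pthread_create(&t[i], NULL, sum_slice, &s[i]);
                for (int i = 0; i < 2; i++)
                    pthread_join(t[i], NULL);

                printf("total = %.0f\n", s[0].partial + s[1].partial);
                return 0;
            }

        The dependent table-lookup chain sketched earlier cannot be split this way; that is exactly the redesign problem the answer is describing.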






        – pjc50, answered 18 hours ago








          Citation needed for "can't run all the cores at the same time". Unless you consider the single-core max turbo clock speed to be the "real" clock speed of the CPU. In the classic sense (before we hit the power wall and clock speed was limited by critical path propagation delays), yes that's true, but in the modern world it makes more sense to look at the baseline clock speed as what can be sustained with all cores active running heavy workloads. Anything higher than that is gravy you can opportunistically use as power / thermal limits allow (e.g. Intel's Turbo).
          – Peter Cordes, 7 hours ago

          But in terms of power, even a single core's max clock is limited by thermals more so than propagation delays (although probably the pipeline stage boundaries are selected so you're close to that limit at the target max turbo). And voltage is a variable too: worse power but shorter gate delays. So anyway, it doesn't make sense to consider the single-core max turbo as something you "should" be able to run all cores at, because that limit already comes from power.
          – Peter Cordes, 6 hours ago
















        In addition to the other answers, there is another element: chip yields. A modern processor has several billion transistors in it, and each and every one of those transistors has to work perfectly in order for the whole chip to function properly.



        By making multi-core processors, you can cleanly partition groups of transistors. If a defect exists in one of the cores, you can disable that core, and sell the chip at a reduced price according to the number of functioning cores. Likewise, you can also assemble systems out of validated components, as in an SMP system.



        For virtually every CPU you buy, it started life being made to be a top-end premium model for that processor line. What you end up with depends on which portions of that chip are working incorrectly and have been disabled. Intel doesn't make any i3 processors: they are all defective i7s, with the features that separate the product lines disabled because they failed testing. However, the portions that are still working are still useful and can be sold for much cheaper. Anything worse becomes a keychain trinket.



        And defects are not uncommon. Perfectly creating those billions of transistors is not an easy task. If you have no opportunities to selectively use portions of a given chip, the price of the result is going to go up, real fast.



        With just a single über processor, manufacturing is all or nothing, resulting in a much more wasteful process. For some devices, like image sensors for scientific or military purposes, where you need a huge sensor and it all has to work, the costs of those devices are so enormous only state-level budgets can afford them.
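        To put rough numbers on this, here is a back-of-the-envelope C sketch using the common Poisson yield approximation, yield ≈ exp(-defect density × die area). The model and every figure in it are my own illustrative assumptions, not something stated in the answer.

            /* Rough die-yield comparison - illustrative sketch only (link with -lm). */
            #include <math.h>
            #include <stdio.h>

            int main(void)
            {
                double defects_per_mm2 = 0.002;   /* assumed defect density       */
                double small_core_mm2  = 50.0;    /* one core of a quad-core die  */
                double big_core_mm2    = 200.0;   /* one monolithic "uber" core   */

                double y_small = exp(-defects_per_mm2 * small_core_mm2);
                double y_big   = exp(-defects_per_mm2 * big_core_mm2);

                /* A quad-core die can still be sold as a 3-, 2- or 1-core part when
                   a core is defective; one big core has no such fallback. */
                double all4_good   = pow(y_small, 4.0);
                double atleast1_ok = 1.0 - pow(1.0 - y_small, 4.0);

                printf("one small core works:            %.1f%%\n", 100.0 * y_small);
                printf("one big core works:              %.1f%%\n", 100.0 * y_big);
                printf("quad die, all 4 cores good:      %.1f%%\n", 100.0 * all4_good);
                printf("quad die, at least 1 core good:  %.1f%%\n", 100.0 * atleast1_ok);
                return 0;
            }

        With these made-up numbers, a die the size of four cores only works about two thirds of the time - roughly as often as a quad-core die has all four cores intact - but the quad-core die can still be salvaged and sold as a cut-down part almost every time at least one core survives.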






        – whatsisname, answered 10 hours ago








          If/when yields improve and are producing more fully-working chips than the market demands, vendors usually start fusing off some of the cores/cache and/or binning them at lower frequency SKUs, instead of adjusting the price structure to make the high-end chips relatively cheaper. With GPUs / graphics cards you used to be able to unlock disabled shader units on some cards with a firmware hack, to see if you got lucky and got a card where they were only disabled for market segmentation, not actual defects.
          – Peter Cordes, 6 hours ago

          Intel has manufactured dual-core dies for some of their chips. With all their ULV (ultralow voltage) mobile SKUs being dual-core, there weren't enough defective quad-cores, and the smaller die area (especially with a cut-down iGPU as well) gives more working dual-core chips per wafer than fusing off quad-core dies. en.wikichip.org/wiki/intel/microarchitectures/… has die-shots of Sandybridge: 131 mm² dual-core + GT1 graphics, vs. 149 mm² dual-core + GT2 graphics, vs. 216 mm² quad + GT2. There's still room for defects in cache etc.
          – Peter Cordes, 6 hours ago

          And (some) defects in part of an FMA unit can presumably be handled by fusing it off and selling it as a Celeron or Pentium chip (no AVX, so only 128-bit vectors). Even modern Skylake or Coffee Lake Pentium chips lack AVX. The SIMD FMA units make up a decent fraction of a core (and run many SIMD ops other than FP math, including integer mul and integer shift), so I wouldn't be surprised if the 2x 256-bit FMA units can be mapped to 2x 128-bit using whichever 2 chunks are still working. With Skylake Xeon, there are even SKUs with reduced AVX512 FMA throughput (only 1 working 512-bit FMA).
          – Peter Cordes, 6 hours ago
















        Wow, I can tell this is being asked by someone under 30!



        Going back in time, processors weren't able to run that fast. As a result, if you wanted to do more processing then you needed more processors. This could be with a maths coprocessor, or it could simply be with more of the same processor. The best example of this is the Inmos Transputer from the 80s, which was specifically designed for massively parallel processing with multiple processors plugged together. The whole concept hinged on the assumption that there was no better way to increase processing power than to add processors.



        Trouble is, that assumption was temporarily incorrect. You can also get more processing power by making one processor do more calculations. Intel and AMD found ways to push clock speeds ever higher, and as you say, it's way easier to keep everything on one processor. The result was that until the mid 2000s, the fast single-core processor owned the market. Inmos died a death in the early 90s, and all their experience died with them.



        The good times had to end though. Once clock speeds got up to GHz there really wasn't scope for going further. And back we went to multiple cores again. If you genuinely can't get faster, more cores is the answer. As you say though, it isn't always easy to use those cores effectively. We're a lot better these days, but we're still some way off making it as easy as the Transputer did.



        Of course there are other options for improvement as well - you could be more efficient instead. SIMD and similar instruction sets get more processing done for the same number of clock ticks. DDR gets your data into and out of the processor faster. It all helps. But when it comes to processing, we're back to the 80s and multiple cores again.


















          How did you know? Yes, I am 27 lol
          – wav scientist, 6 hours ago

          @wavscientist - A lot of this stuff was explored in the 80s. There didn't use to be the current x86/ARM monoculture. There used to be lots of different processor architectures that explored different ways to get more performance out of the same clock-rate.
          – Connor Wolf, 2 hours ago















        6












        $begingroup$

        Wow, I can tell this is being asked by someone under 30!



        Going back in time, processors weren't able to run that fast. As a result, if you wanted to do more processing then you needed more processors. This could be with a maths coprocessor, or it could simply be with more of the same processor. The best example of this is the Inmos Transputer from the 80s, which was specifically designed for massively parallel processing with multiple processors plugged together. The whole concept hinged on the assumption that there was no better way to increase processing power than to add processors.



        Trouble is, that assumption was temporarily incorrect. You can also get more processing power by making one processor do more calculations. Intel and AMD found ways to push clock speeds ever higher, and as you say, it's way easier to keep everything on one processor. The result was that until the mid 2000s, the fast single-core processor owned the market. Inmos died a death in the early 90s, and all their experience died with them.



        The good times had to end though. Once clock speeds got up to GHz there really wasn't scope for going further. And back we went to multiple cores again. If you genuinely can't get faster, more cores is the answer. As you say though, it isn't always easy to use those cores effectively. We're a lot better these days, but we're still some way off making it as easy as the Transputer did.



        Of course there are other options for improvement as well - you could be more efficient instead. SIMD and similar instruction sets get more processing done for the same number of clock ticks. DDR gets your data into and out of the processor faster. It all helps. But when it comes to processing, we're back to the 80s and multiple cores again.






        share|improve this answer











        $endgroup$












        • $begingroup$
          How did you know? Yes, I am 27 lol
          $endgroup$
          – wav scientist
          6 hours ago






        • 4




          $begingroup$
          @wavscientist - A lot of this stuff was explored in the 80s. There didn't use to be the current x86/ARM monoculture. There used to be lots of different processor architectures that explored different ways to get more performance out of the same clock-rate.
          $endgroup$
          – Connor Wolf
          2 hours ago













        6












        6








        6





        $begingroup$

        Wow, I can tell this is being asked by someone under 30!



        Going back in time, processors weren't able to run that fast. As a result, if you wanted to do more processing then you needed more processors. This could be with a maths coprocessor, or it could simply be with more of the same processor. The best example of this is the Inmos Transputer from the 80s, which was specifically designed for massively parallel processing with multiple processors plugged together. The whole concept hinged on the assumption that there was no better way to increase processing power than to add processors.



        Trouble is, that assumption was temporarily incorrect. You can also get more processing power by making one processor do more calculations. Intel and AMD found ways to push clock speeds ever higher, and as you say, it's way easier to keep everything on one processor. The result was that until the mid 2000s, the fast single-core processor owned the market. Inmos died a death in the early 90s, and all their experience died with them.



        The good times had to end though. Once clock speeds got up to GHz there really wasn't scope for going further. And back we went to multiple cores again. If you genuinely can't get faster, more cores is the answer. As you say though, it isn't always easy to use those cores effectively. We're a lot better these days, but we're still some way off making it as easy as the Transputer did.



        Of course there are other options for improvement as well - you could be more efficient instead. SIMD and similar instruction sets get more processing done for the same number of clock ticks. DDR gets your data into and out of the processor faster. It all helps. But when it comes to processing, we're back to the 80s and multiple cores again.






        share|improve this answer











        $endgroup$



        Wow, I can tell this is being asked by someone under 30!



        Going back in time, processors weren't able to run that fast. As a result, if you wanted to do more processing then you needed more processors. This could be with a maths coprocessor, or it could simply be with more of the same processor. The best example of this is the Inmos Transputer from the 80s, which was specifically designed for massively parallel processing with multiple processors plugged together. The whole concept hinged on the assumption that there was no better way to increase processing power than to add processors.



        Trouble is, that assumption was temporarily incorrect. You can also get more processing power by making one processor do more calculations. Intel and AMD found ways to push clock speeds ever higher, and as you say, it's way easier to keep everything on one processor. The result was that until the mid 2000s, the fast single-core processor owned the market. Inmos died a death in the early 90s, and all their experience died with them.



        The good times had to end though. Once clock speeds got up to GHz there really wasn't scope for going further. And back we went to multiple cores again. If you genuinely can't get faster, more cores is the answer. As you say though, it isn't always easy to use those cores effectively. We're a lot better these days, but we're still some way off making it as easy as the Transputer did.



        Of course there are other options for improvement as well - you could be more efficient instead. SIMD and similar instruction sets get more processing done for the same number of clock ticks. DDR gets your data into and out of the processor faster. It all helps. But when it comes to processing, we're back to the 80s and multiple cores again.







        share|improve this answer














        share|improve this answer



        share|improve this answer








        answered 11 hours ago by Graham (edited 11 hours ago)
        • How did you know? Yes, I am 27 lol – wav scientist (6 hours ago)

        • @wavscientist - A lot of this stuff was explored in the 80s. There didn't use to be the current x86/ARM monoculture. There used to be lots of different processor architectures that explored different ways to get more performance out of the same clock-rate. – Connor Wolf (2 hours ago)
















        You point out that a lot of software doesn't use more than (x) cores. But this is entirely a limitation placed by the designers of that software. Home PCs having multiple cores is still new(ish), and designing multi-threaded software is also more difficult with traditional APIs and languages.



        Your PC is also not just running that 1 program. It is doing a whole bunch of other things that can be put onto less active cores so your primary software isn't getting interrupted by them as much.



        It's not currently possible to just increase the speed of a single core to match the throughput of 8 cores. More speed is likely going to have to come from new architecture.



        As more cores are commonly available and APIs are designed with that assumption, programmers will start commonly using more cores. Efforts to make multi-threaded designs easier to write are ongoing. If you asked this question in a few years, you would probably be saying "My games only commonly use 32 cores, so why does my CPU have 256?".
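        As a rough illustration of what using more cores looks like from the software side, here's a minimal sketch (my own, not part of the original answer) that splits independent work across however many hardware threads the machine reports, using plain C++ std::thread; the work itself and the iteration counts are made up.

            // Minimal multi-threading sketch (illustrative only).
            // Build with, e.g.: g++ -O2 -pthread threads.cpp
            #include <cstdio>
            #include <thread>
            #include <vector>

            int main() {
                // Ask the runtime how many hardware threads (logical cores) exist.
                unsigned n = std::thread::hardware_concurrency();
                if (n == 0) n = 1;  // the call is allowed to return 0 if it can't tell

                std::vector<long long> partial(n, 0);
                std::vector<std::thread> workers;

                // Each thread writes only its own slot, so no locks are needed.
                for (unsigned t = 0; t < n; ++t)
                    workers.emplace_back([&partial, t] {
                        for (long long i = 0; i < 10000000; ++i)
                            partial[t] += 1;
                    });

                for (auto& w : workers) w.join();

                long long total = 0;
                for (long long p : partial) total += p;
                std::printf("threads used: %u, total: %lld\n", n, total);
            }

        Even this toy version shows why it's harder than single-threaded code: you have to split the work yourself, avoid sharing mutable state (or lock it), and join everything at the end.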






        answered 18 hours ago by hekete







        • The difference between 1 vs. multiple cores is huge in terms of getting software to take advantage. Most algorithms and programs are serial. e.g. Donald Knuth has said that multi-core CPUs look like HW designers are "trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks!" – Peter Cordes (6 hours ago)

        • Unfortunately nobody has yet come up with a way to make a single wide/fast core run a single-threaded program anywhere near as fast as we can get efficiently-parallel code to run across multiple cores. But fortunately CPU designers realize that single-threaded performance is still critical and make each individual core much bigger and more powerful than it would be if they were going for pure throughput on parallel problems. (Compare a Skylake (4-wide) or Ryzen (5-wide) vs. a core of a Xeon Phi (Knight's Landing / Knight's Mill based on Silvermont + AVX512) (2-wide and limited OoO exec).) – Peter Cordes (6 hours ago)

        • Anyway yes, having at least 2 cores is often helpful for a multitasking OS, but pre-emptive multi-tasking on a single core that was 4x or 8x as fast as a current CPU would be pretty good. For many interactive use-cases that would be much better, if it were possible to build at all / with the same power budget. (Dual core does help reduce context-switch costs when multiple tasks want CPU time, though.) – Peter Cordes (6 hours ago)

        • All true, but historically multi-core was more expensive. There wasn't a lot of reason to design parallel algorithms outside of science applications. There is a lot of room for parallelization, even in algorithms that require a mostly serial execution. But current generation IPC isn't great and is easy to mess up, which generally results in bugs that are really hard to find and fix. Of course a 4x faster CPU would be amazing (but you would still want multiple cores). – hekete (13 mins ago)












        Good question, or at least one with an interesting answer. Summary:



        • The cost of multiple cores scales close to linearly

        • The cost of widening the pipeline scales quadratically

        • Serious diminishing IPC returns from just widening the pipeline beyond 3 or 4-wide, even with out-of-order execution to find the ILP. Branch misses and cache misses are hard.

        Costs are in die area (manufacturing cost) and/or power (which indirectly limits frequency).




        Donald Knuth said in a 2008 interview




        I might as well flame a bit about my personal unhappiness with the current trend toward multicore architecture. To me, it looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks!




        Yes, if we could have miracle single-core CPUs with 8x the throughput on real programs, we'd probably still be using them. Maybe dual core to reduce context-switch costs when multiple programs are running; pre-emptive multitasking interrupting the massive out-of-order machinery such a CPU would require would probably hurt even more than it does now.



        Or physically it would be single core (simple cache hierarchy) but support SMT (e.g. Intel's HyperThreading) so software could use it as 8 logical cores that dynamically compete for throughput resources. Or when only 1 thread is running / not stalled, it would get the full benefit.



        So you'd use multiple threads when that was actually easier/natural (e.g. separate processes running at once), or for easily-parallelized problems with dependency chains that would prevent maxing out the IPC of this beast.



        But unfortunately it's wishful thinking on Knuth's part that multi-core CPUs will ever stop being a thing at this point.





        I think if they made a 1 core equivalent of an 8 core CPU, that one core would have 800% increase in IPC so you would get the full performance in all programs, not just those that are optimized for multiple cores.




        Yes, that's true. If it was possible to build such a CPU at all, it would be very amazing. But I think it's literally impossible on the same semiconductor manufacturing process (i.e. same quality / efficiency of transistors). It's certainly not possible with the same power budget and die area as an 8-core CPU, even though you'd save on logic to glue cores together, and wouldn't need as much space for per-core private caches.



        Even if you allow frequency increases (since the real criterion is work per second, not work per clock), making even a 2x faster CPU would be a huge challenge.



        If it were possible at anywhere near the same power and die-area budget (thus manufacturing cost) to build such a CPU, yes CPU vendors would already be building them that way.



        See the "More Cores or Wider Cores?" section of Modern Microprocessors: A 90-Minute Guide! for the necessary background to understand this answer; it starts simple with how in-order pipelined CPUs work, then superscalar. Then it explains how we hit the power wall right around the P4 era, leading to a fundamental shift in scaling with smaller transistors, going for higher ILP instead of frequency.



        Making a pipeline wider (max instructions per clock) typically scales in cost as width-squared. That cost is measured in die area and/or power, for wider parallel dependency checking (hazard detection), and a wider out-of-order scheduler to find ready instructions to run. And more read / write ports on your register file and cache if you want to run instructions other than nop. Especially if you have 3-input instructions like FMA or add-with-carry (2 registers + flags).
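        To put rough numbers on that quadratic scaling (my own back-of-the-envelope illustration, not from the original answer): if the hazard-detection and bypass logic has to compare each of W results produced per cycle against each of W instructions being issued, a 4-wide core needs on the order of 4 × 4 = 16 such comparison paths, while an 8x-wider, 32-wide core needs about 32 × 32 = 1024 of them, i.e. roughly 64 times the area and power in that logic for 8 times the nominal width.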



        There are also diminishing IPC returns for making CPUs wider; most workloads have limited small-scale / short-range ILP (Instruction-Level Parallelism) for CPUs to exploit, so making the core wider doesn't increase IPC (instructions per clock) if IPC is already limited to less than the width of the core by dependency chains, branch misses, cache misses, or other stalls. Sure you'd get a speedup in some unrolled loops with independent iterations, but that's not what most code spends most of its time doing. Compare/branch instructions make up 20% of the instruction mix in "typical" code, IIRC. (I think I've read numbers from 15 to 25% for various data sets.)
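        Here's a minimal sketch (my own, not part of the original answer) of that dependency-chain limit: both loops below do the same number of floating-point additions, but the first forms one serial chain where every add waits for the previous result, while the second keeps four independent accumulators that an out-of-order core can overlap, so only the second benefits from a wider machine.

            // Dependency-chain sketch (illustrative only).
            // Build with, e.g.: g++ -O2 chains.cpp  (no -ffast-math, so the
            // compiler can't re-associate the serial FP chain for us)
            #include <cstdio>

            int main() {
                const int N = 8 * 1000 * 1000;

                // One long dependency chain: each add needs the previous sum,
                // so throughput is capped at one add per add-latency regardless
                // of how many execution units the core has.
                float serial = 0.0f;
                for (int i = 0; i < N; ++i)
                    serial += 1.0f;

                // Four independent chains: up to four adds can be in flight at
                // once, so extra width/execution units actually help here.
                float a0 = 0, a1 = 0, a2 = 0, a3 = 0;
                for (int i = 0; i < N; i += 4) {
                    a0 += 1.0f;
                    a1 += 1.0f;
                    a2 += 1.0f;
                    a3 += 1.0f;
                }

                std::printf("%f %f\n", serial, a0 + a1 + a2 + a3);
            }

        Timing the two loops (with std::chrono or perf) on a typical out-of-order x86 core usually shows the multi-accumulator version running several times faster, even though both execute the same number of adds.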



        Also, a cache miss that stalls all dependent instructions (and then everything once ROB capacity is reached) costs more for a wider CPU. (The opportunity cost of leaving more execution units idle; more potential work not getting done.) Or a branch miss similarly causes a bubble.



        To get 8x the IPC, we'd need at least an 8x improvement in branch-prediction accuracy and in cache hit rates. But cache hit rates don't scale well with cache capacity past a certain point for most workloads. And HW prefetching is smart, but can't be that smart. And at 8x the IPC, the branch predictors need to produce 8x as many predictions per cycle as well as having them be more accurate.




        Current techniques for building out-of-order execution CPUs can only find ILP over short ranges. For example, Skylake's ROB size is 224 fused-domain uops, and its scheduler for not-yet-executed uops holds 97 unfused-domain uops. See "Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths" for a case where scheduler size is the limiting factor in extracting ILP from 2 long chains of instructions, if they get too long. And/or see this more general and introductory answer.



        So finding ILP between two separate long loops is not something we can do with hardware. Dynamic binary-recompilation for loop fusion could be possible in some cases, but hard and not something CPUs can really do unless they go the Transmeta Crusoe route. (x86 emulation layer on top of a different internal ISA; in that case VLIW). But standard modern x86 designs with uop caches and powerful decoders aren't easy to beat for most code.



        And outside of x86, all ISAs still in use are relatively easy to decode, so there's no motivation for dynamic-recompilation other than long-distance optimizations. TL;DR: hoping for magic compilers that can expose more ILP to the hardware didn't work out for Itanium IA-64, and is unlikely to work for a super-wide CPU for any existing ISA with a serial model of execution.




        If you did have a super-wide CPU, you'd definitely want it to support SMT so you can keep it fed with work to do by running multiple low-ILP threads.



        Since Skylake is currently 4 uops wide (and achieves a real IPC of 2 to 3 uops per clock, or even closer to 4 in high-throughput code), a hypothetical 8x wider CPU would be 32-wide!



        Being able to carve that back into 8 or 16 logical CPUs that dynamically share those execution resources would be fantastic: non-stalled threads get all the front-end bandwidth and back-end throughput.



        But with 8 separate cores, when a thread stalls there's nothing else to keep the execution units fed; the other threads don't benefit.



        Execution is often bursty: it stalls waiting for a cache miss load, then once that arrives many instructions in parallel can use that result. With a super-wide CPU, that burst can go faster, and it can actually help with SMT.




        But we can't have magical super-wide CPUs



        So to gain throughput we instead have to expose parallelism to the hardware in the form of thread-level parallelism. Generally compilers aren't great at knowing when/how to use threads, other than for simple cases like very big loops. (OpenMP, or gcc's -ftree-parallelize-loops). It still takes human cleverness to rework code to efficiently get useful work done in parallel, because inter-thread communication is expensive, and so is thread startup.
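        For what that "very big loop" case looks like in practice, here's a minimal OpenMP sketch (my own, not part of the original answer; the array size is arbitrary). Each thread sums its own chunk of the array, and the reduction combines the per-thread partial sums at the end, so inter-thread communication is limited to that final combine.

            // Coarse-grained TLP sketch (illustrative only).
            // Build with, e.g.: g++ -O2 -fopenmp sum.cpp
            #include <cstdio>
            #include <vector>

            int main() {
                std::vector<double> v(1 << 24, 1.0);  // ~16 million elements
                double sum = 0.0;

                // The pragma splits the iteration space across available cores;
                // each thread keeps a private partial sum, combined at the end.
                #pragma omp parallel for reduction(+ : sum)
                for (long i = 0; i < (long)v.size(); ++i)
                    sum += v[i];

                std::printf("sum = %.1f\n", sum);
            }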



        TLP is coarse-grained parallelism, unlike the fine-grained ILP within a single thread of execution which HW can exploit.




        BTW, all this is orthogonal to SIMD. Getting more work done per instruction always helps, if it's possible for your problem.






        share|improve this answer











        $endgroup$

















          3












          $begingroup$

          Good question, or at least one with an interesting answer. Summary:



          • The cost of multiple cores scale close to linearly

          • The cost of widening the pipeline scales quadratically

          • Serious diminishing IPC returns from just widening the pipeline beyond 3 or 4-wide, even with out-of-order execution to find the ILP. Branch misses and cache misses are hard.

          Costs are in die-area, (manufacturing cost) and/or power (which indirectly limits frequency).




          Donald Knuth said in a 2008 interview




          I might as well flame a bit about my personal unhappiness with the current trend toward multicore architecture. To me, it looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks!




          Yes, if we could have miracle single-core CPUs with 8x the throughput on real programs, we'd probably still be using them. Maybe dual core to reduce context-switch costs when multiple programs are running; pre-emptive multitasking interrupting the massive out-of-order machinery such a CPU would require would probably hurt even more than it does now.



          Or physically it would be single core (simple cache hierarchy) but support SMT (e.g. Intel's HyperThreading) so software could use it as 8 logical cores that dynamically compete for throughput resources. Or when only 1 thread is running / not stalled, it would get the full benefit.



          So you'd use multiple threads when that was actually easier/natural (e.g. separate processes running at once), or for easily-parallelized problems with dependency chains that would prevent maxing out the IPC of this beast.



          But unfortunately it's wishful thinking on Knuth's part that multi-core CPUs will ever stop being a thing at this point.





          I think if they made a 1 core equivalent of an 8 core CPU, that one core would have 800% increase in IPC so you would get the full performance in all programs, not just those that are optimized for multiple cores.




          Yes, that's true. If it was possible to build such a CPU at all, it would be very amazing. But I think it's literally impossible on the same semiconductor manufacturing process (i.e. same quality / efficiency of transistors). It's certainly not possible with the same power budget and die area as an 8-core CPU, even though you'd save on logic to glue cores together, and wouldn't need as much space for per-core private caches.



          Even if you allow frequency increases (since the real criterion is work per second, not work per clock), making even a 2x faster CPU would be a huge challenge.



          If it were possible at anywhere near the same power and die-area budget (thus manufacturing cost) to build such a CPU, yes CPU vendors would already be building them that way.



          See the More Cores or Wider Cores? section of in Modern Microprocessors
          A 90-Minute Guide!
          for the necessary background to understand this answer; it starts simple with how in-order pipelined CPUs work, then superscalar. Then explains how we hit the power-wall right around the P4 era, leading to a fundamental shift in scaling with smaller transistors, and going for higher ILP instead of frequency.



          Making a pipeline wider (max instructions per clock) typically scales in cost as width-squared. That cost is measured in die area and/or power, for wider parallel dependency checking (hazard detection), and a wider out-of-order scheduler to find ready instructions to run. And more read / write ports on your register file and cache if you want to run instructions other than nop. Especially if you have 3-input instructions like FMA or add-with-carry (2 registers + flags).



          There are also diminishing IPC returns for making CPUs wider; most workloads have limited small-scale / short-range ILP (Instruction-Level Parallelism) for CPUs to exploit, so making the core wider doesn't increase IPC (instructions per clock) if IPC is already limited to less than the width of the core by dependency chains, branch misses, cache misses, or other stalls. Sure you'd get a speedup in some unrolled loops with independent iterations, but that's not what most code spends most of its time doing. Compare/branch instructions make up 20% of the instruction mix in "typical" code, IIRC. (I think I've read numbers from 15 to 25% for various data sets.)



          Also, a cache miss that stalls all dependent instructions (and then everything once ROB capacity is reached) costs more for a wider CPU. (The opportunity cost of leaving more execution units idle; more potential work not getting done.) Or a branch miss similarly causes a bubble.



          To get 8x the IPC, we'd need at least an 8x improvement in branch-prediction accuracy and in cache hit rates. But cache hit rates don't scale well with cache capacity past a certain point for most workloads. And HW prefetching is smart, but can't be that smart. And at 8x the IPC, the branch predictors need to produce 8x as many predictions per cycle as well as having them be more accurate.




          Current techniques for building out-of-order execution CPUs can only find ILP over short ranges. For example, Skylake's ROB size is 224 fused-domain uops, scheduler for non-executed uops is 97 unfused-domain. See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for a case where scheduler size is the limiting factor in extracting ILP from 2 long chains of instructions, if they get too long. And/or see this more general and introductory answer).



          So finding ILP between two separate long loops is not something we can do with hardware. Dynamic binary-recompilation for loop fusion could be possible in some cases, but hard and not something CPUs can really do unless they go the Transmeta Crusoe route. (x86 emulation layer on top of a different internal ISA; in that case VLIW). But standard modern x86 designs with uop caches and powerful decoders aren't easy to beat for most code.



          And outside of x86, all ISAs still in use are relatively easy to decode, so there's no motivation for dynamic-recompilation other than long-distance optimizations. TL:DR: hoping for magic compilers that can expose more ILP to the hardware didn't work out for Itanium IA-64, and is unlikely to work for a super-wide CPU for any existing ISA with a serial model of execution.




          If you did have a super-wide CPU, you'd definitely want it to support SMT so you can keep it fed with work to do by running multiple low-ILP threads.



          Since Skylake is currently 4 uops wide (and achieves a real IPC of 2 to 3 uops per clock, or even closer to 4 in high-throughput code), a hypothetical 8x wider CPU would be 32-wide!



          Being able to carve that back into 8 or 16 logical CPUs that dynamically share those execution resources would be fantastic: non-stalled threads get all the front-end bandwidth and back-end throughput.



          But with 8 separate cores, when a thread stalls there's nothing else to keep the execution units fed; the other threads don't benefit.



          Execution is often bursty: it stalls waiting for a cache miss load, then once that arrives many instructions in parallel can use that result. With a super-wide CPU, that burst can go faster, and it can actually help with SMT.




          But we can't have magical super-wide CPUs



          So to gain throughput we instead have to expose parallelism to the hardware in the form of thread-level parallelism. Generally compilers aren't great at knowing when/how to use threads, other than for simple cases like very big loops. (OpenMP, or gcc's -ftree-parallelize-loops). It still takes human cleverness to rework code to efficiently get useful work done in parallel, because inter-thread communication is expensive, and so is thread startup.



          TLP is coarse-grained parallelism, unlike the fine-grained ILP within a single thread of execution which HW can exploit.




          BTW, all this is orthogonal to SIMD. Getting more work done per instruction always helps, if it's possible for your problem.






          share|improve this answer











          $endgroup$















            3












            3








            3





            $begingroup$

            Good question, or at least one with an interesting answer. Summary:



            • The cost of multiple cores scale close to linearly

            • The cost of widening the pipeline scales quadratically

            • Serious diminishing IPC returns from just widening the pipeline beyond 3 or 4-wide, even with out-of-order execution to find the ILP. Branch misses and cache misses are hard.

            Costs are in die-area, (manufacturing cost) and/or power (which indirectly limits frequency).




            Donald Knuth said in a 2008 interview




            I might as well flame a bit about my personal unhappiness with the current trend toward multicore architecture. To me, it looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks!




            Yes, if we could have miracle single-core CPUs with 8x the throughput on real programs, we'd probably still be using them. Maybe dual core to reduce context-switch costs when multiple programs are running; pre-emptive multitasking interrupting the massive out-of-order machinery such a CPU would require would probably hurt even more than it does now.



            Or physically it would be single core (simple cache hierarchy) but support SMT (e.g. Intel's HyperThreading) so software could use it as 8 logical cores that dynamically compete for throughput resources. Or when only 1 thread is running / not stalled, it would get the full benefit.



            So you'd use multiple threads when that was actually easier/natural (e.g. separate processes running at once), or for easily-parallelized problems with dependency chains that would prevent maxing out the IPC of this beast.



            But unfortunately it's wishful thinking on Knuth's part that multi-core CPUs will ever stop being a thing at this point.





            I think if they made a 1 core equivalent of an 8 core CPU, that one core would have 800% increase in IPC so you would get the full performance in all programs, not just those that are optimized for multiple cores.




            Yes, that's true. If it was possible to build such a CPU at all, it would be very amazing. But I think it's literally impossible on the same semiconductor manufacturing process (i.e. same quality / efficiency of transistors). It's certainly not possible with the same power budget and die area as an 8-core CPU, even though you'd save on logic to glue cores together, and wouldn't need as much space for per-core private caches.



            Even if you allow frequency increases (since the real criterion is work per second, not work per clock), making even a 2x faster CPU would be a huge challenge.



            If it were possible at anywhere near the same power and die-area budget (thus manufacturing cost) to build such a CPU, yes CPU vendors would already be building them that way.



            See the More Cores or Wider Cores? section of in Modern Microprocessors
            A 90-Minute Guide!
            for the necessary background to understand this answer; it starts simple with how in-order pipelined CPUs work, then superscalar. Then explains how we hit the power-wall right around the P4 era, leading to a fundamental shift in scaling with smaller transistors, and going for higher ILP instead of frequency.



            Making a pipeline wider (max instructions per clock) typically scales in cost as width-squared. That cost is measured in die area and/or power, for wider parallel dependency checking (hazard detection), and a wider out-of-order scheduler to find ready instructions to run. And more read / write ports on your register file and cache if you want to run instructions other than nop. Especially if you have 3-input instructions like FMA or add-with-carry (2 registers + flags).



            There are also diminishing IPC returns for making CPUs wider; most workloads have limited small-scale / short-range ILP (Instruction-Level Parallelism) for CPUs to exploit, so making the core wider doesn't increase IPC (instructions per clock) if IPC is already limited to less than the width of the core by dependency chains, branch misses, cache misses, or other stalls. Sure you'd get a speedup in some unrolled loops with independent iterations, but that's not what most code spends most of its time doing. Compare/branch instructions make up 20% of the instruction mix in "typical" code, IIRC. (I think I've read numbers from 15 to 25% for various data sets.)



            Also, a cache miss that stalls all dependent instructions (and then everything once ROB capacity is reached) costs more for a wider CPU. (The opportunity cost of leaving more execution units idle; more potential work not getting done.) Or a branch miss similarly causes a bubble.



            To get 8x the IPC, we'd need at least an 8x improvement in branch-prediction accuracy and in cache hit rates. But cache hit rates don't scale well with cache capacity past a certain point for most workloads. And HW prefetching is smart, but can't be that smart. And at 8x the IPC, the branch predictors need to produce 8x as many predictions per cycle as well as having them be more accurate.




            Current techniques for building out-of-order execution CPUs can only find ILP over short ranges. For example, Skylake's ROB size is 224 fused-domain uops, scheduler for non-executed uops is 97 unfused-domain. See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for a case where scheduler size is the limiting factor in extracting ILP from 2 long chains of instructions, if they get too long. And/or see this more general and introductory answer).



            So finding ILP between two separate long loops is not something we can do with hardware. Dynamic binary-recompilation for loop fusion could be possible in some cases, but hard and not something CPUs can really do unless they go the Transmeta Crusoe route. (x86 emulation layer on top of a different internal ISA; in that case VLIW). But standard modern x86 designs with uop caches and powerful decoders aren't easy to beat for most code.



            And outside of x86, all ISAs still in use are relatively easy to decode, so there's no motivation for dynamic-recompilation other than long-distance optimizations. TL:DR: hoping for magic compilers that can expose more ILP to the hardware didn't work out for Itanium IA-64, and is unlikely to work for a super-wide CPU for any existing ISA with a serial model of execution.




            If you did have a super-wide CPU, you'd definitely want it to support SMT so you can keep it fed with work to do by running multiple low-ILP threads.



            Since Skylake is currently 4 uops wide (and achieves a real IPC of 2 to 3 uops per clock, or even closer to 4 in high-throughput code), a hypothetical 8x wider CPU would be 32-wide!



            Being able to carve that back into 8 or 16 logical CPUs that dynamically share those execution resources would be fantastic: non-stalled threads get all the front-end bandwidth and back-end throughput.



            But with 8 separate cores, when a thread stalls there's nothing else to keep the execution units fed; the other threads don't benefit.



            Execution is often bursty: it stalls waiting for a cache miss load, then once that arrives many instructions in parallel can use that result. With a super-wide CPU, that burst can go faster, and it can actually help with SMT.




            But we can't have magical super-wide CPUs



            So to gain throughput we instead have to expose parallelism to the hardware in the form of thread-level parallelism. Generally compilers aren't great at knowing when/how to use threads, other than for simple cases like very big loops. (OpenMP, or gcc's -ftree-parallelize-loops). It still takes human cleverness to rework code to efficiently get useful work done in parallel, because inter-thread communication is expensive, and so is thread startup.



            TLP is coarse-grained parallelism, unlike the fine-grained ILP within a single thread of execution which HW can exploit.




            BTW, all this is orthogonal to SIMD. Getting more work done per instruction always helps, if it's possible for your problem.






            share|improve this answer











            $endgroup$



            Good question, or at least one with an interesting answer. Summary:



            • The cost of multiple cores scale close to linearly

            • The cost of widening the pipeline scales quadratically

            • Serious diminishing IPC returns from just widening the pipeline beyond 3 or 4-wide, even with out-of-order execution to find the ILP. Branch misses and cache misses are hard.

            Costs are in die-area, (manufacturing cost) and/or power (which indirectly limits frequency).




            Donald Knuth said in a 2008 interview




            I might as well flame a bit about my personal unhappiness with the current trend toward multicore architecture. To me, it looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks!




            Yes, if we could have miracle single-core CPUs with 8x the throughput on real programs, we'd probably still be using them. Maybe dual core to reduce context-switch costs when multiple programs are running; pre-emptive multitasking interrupting the massive out-of-order machinery such a CPU would require would probably hurt even more than it does now.



            Or physically it would be single core (simple cache hierarchy) but support SMT (e.g. Intel's HyperThreading) so software could use it as 8 logical cores that dynamically compete for throughput resources. Or when only 1 thread is running / not stalled, it would get the full benefit.



            So you'd use multiple threads when that was actually easier/natural (e.g. separate processes running at once), or for easily-parallelized problems with dependency chains that would prevent maxing out the IPC of this beast.



            But unfortunately it's wishful thinking on Knuth's part that multi-core CPUs will ever stop being a thing at this point.





            I think if they made a 1 core equivalent of an 8 core CPU, that one core would have 800% increase in IPC so you would get the full performance in all programs, not just those that are optimized for multiple cores.




            Yes, that's true. If it was possible to build such a CPU at all, it would be very amazing. But I think it's literally impossible on the same semiconductor manufacturing process (i.e. same quality / efficiency of transistors). It's certainly not possible with the same power budget and die area as an 8-core CPU, even though you'd save on logic to glue cores together, and wouldn't need as much space for per-core private caches.



            Even if you allow frequency increases (since the real criterion is work per second, not work per clock), making even a 2x faster CPU would be a huge challenge.



            If it were possible at anywhere near the same power and die-area budget (thus manufacturing cost) to build such a CPU, yes CPU vendors would already be building them that way.



            See the More Cores or Wider Cores? section of in Modern Microprocessors
            A 90-Minute Guide!
            for the necessary background to understand this answer; it starts simple with how in-order pipelined CPUs work, then superscalar. Then explains how we hit the power-wall right around the P4 era, leading to a fundamental shift in scaling with smaller transistors, and going for higher ILP instead of frequency.



            Making a pipeline wider (max instructions per clock) typically scales in cost as width-squared. That cost is measured in die area and/or power, for wider parallel dependency checking (hazard detection), and a wider out-of-order scheduler to find ready instructions to run. And more read / write ports on your register file and cache if you want to run instructions other than nop. Especially if you have 3-input instructions like FMA or add-with-carry (2 registers + flags).



            There are also diminishing IPC returns for making CPUs wider; most workloads have limited small-scale / short-range ILP (Instruction-Level Parallelism) for CPUs to exploit, so making the core wider doesn't increase IPC (instructions per clock) if IPC is already limited to less than the width of the core by dependency chains, branch misses, cache misses, or other stalls. Sure you'd get a speedup in some unrolled loops with independent iterations, but that's not what most code spends most of its time doing. Compare/branch instructions make up 20% of the instruction mix in "typical" code, IIRC. (I think I've read numbers from 15 to 25% for various data sets.)



            Also, a cache miss that stalls all dependent instructions (and then everything once ROB capacity is reached) costs more for a wider CPU. (The opportunity cost of leaving more execution units idle; more potential work not getting done.) Or a branch miss similarly causes a bubble.



            To get 8x the IPC, we'd need at least an 8x improvement in branch-prediction accuracy and in cache hit rates. But cache hit rates don't scale well with cache capacity past a certain point for most workloads. And HW prefetching is smart, but can't be that smart. And at 8x the IPC, the branch predictors need to produce 8x as many predictions per cycle as well as having them be more accurate.




            Current techniques for building out-of-order execution CPUs can only find ILP over short ranges. For example, Skylake's ROB size is 224 fused-domain uops, scheduler for non-executed uops is 97 unfused-domain. See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for a case where scheduler size is the limiting factor in extracting ILP from 2 long chains of instructions, if they get too long. And/or see this more general and introductory answer).



            So finding ILP between two separate long loops is not something we can do with hardware. Dynamic binary-recompilation for loop fusion could be possible in some cases, but hard and not something CPUs can really do unless they go the Transmeta Crusoe route. (x86 emulation layer on top of a different internal ISA; in that case VLIW). But standard modern x86 designs with uop caches and powerful decoders aren't easy to beat for most code.



            And outside of x86, all ISAs still in use are relatively easy to decode, so there's no motivation for dynamic-recompilation other than long-distance optimizations. TL:DR: hoping for magic compilers that can expose more ILP to the hardware didn't work out for Itanium IA-64, and is unlikely to work for a super-wide CPU for any existing ISA with a serial model of execution.




            If you did have a super-wide CPU, you'd definitely want it to support SMT so you can keep it fed with work to do by running multiple low-ILP threads.



            Since Skylake is currently 4 uops wide (and achieves a real IPC of 2 to 3 uops per clock, or even closer to 4 in high-throughput code), a hypothetical 8x wider CPU would be 32-wide!



            Being able to carve that back into 8 or 16 logical CPUs that dynamically share those execution resources would be fantastic: non-stalled threads get all the front-end bandwidth and back-end throughput.



            But with 8 separate cores, when a thread stalls there's nothing else to keep the execution units fed; the other threads don't benefit.



            Execution is often bursty: it stalls waiting for a cache miss load, then once that arrives many instructions in parallel can use that result. With a super-wide CPU, that burst can go faster, and it can actually help with SMT.
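
            A toy throughput comparison (Python, illustrative probabilities and ILP numbers, nothing measured) between eight separate 4-wide cores and one shared 32-wide SMT core, with threads alternating between stalls and bursts of ready work:

```python
import random
random.seed(0)

THREADS, CYCLES = 8, 10_000
WIDTH_SMALL, WIDTH_BIG = 4, 32
P_RUNNABLE = 0.5      # chance a thread isn't stalled on a miss this cycle
BURST_ILP = 12        # ready uops a runnable thread could issue this cycle

separate = shared = 0
for _ in range(CYCLES):
    ready = [BURST_ILP if random.random() < P_RUNNABLE else 0
             for _ in range(THREADS)]
    separate += sum(min(WIDTH_SMALL, r) for r in ready)   # each thread capped by its own core
    shared += min(WIDTH_BIG, sum(ready))                  # runnable threads share one wide back-end

print(f"8 separate 4-wide cores: {separate / CYCLES:5.1f} uops/cycle")
print(f"1 shared 32-wide core  : {shared / CYCLES:5.1f} uops/cycle")
# With bursty per-thread ILP, the shared wide core keeps far more of its
# issue slots busy than the fixed per-core partitioning can.
```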




            But we can't have magical super-wide CPUs



            So to gain throughput we instead have to expose parallelism to the hardware in the form of thread-level parallelism. Generally compilers aren't great at knowing when/how to use threads, other than for simple cases like very big loops. (OpenMP, or gcc's -ftree-parallelize-loops). It still takes human cleverness to rework code to efficiently get useful work done in parallel, because inter-thread communication is expensive, and so is thread startup.
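
            As a minimal example of exposing that parallelism by hand (Python; the chunk count and the workload are arbitrary), the sketch below splits one big independent loop across worker processes. The coarse chunking is what keeps startup and communication costs from eating the gain.

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(bounds):
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

if __name__ == "__main__":
    N, WORKERS = 10_000_000, 8
    step = N // WORKERS
    chunks = [(i * step, N if i == WORKERS - 1 else (i + 1) * step)
              for i in range(WORKERS)]
    with ProcessPoolExecutor(max_workers=WORKERS) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total)
# One chunk per worker amortizes startup and result-passing; a chunk per
# loop iteration would spend more time coordinating than computing.
```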



            TLP is coarse-grained parallelism, unlike the fine-grained ILP within a single thread of execution which HW can exploit.




            BTW, all this is orthogonal to SIMD. Getting more work done per instruction always helps, if it's possible for your problem.
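
            A small illustration of that point (Python with NumPy; NumPy's vectorized kernels generally use SIMD where the hardware supports it): the same million multiplies done as one array operation versus an element-at-a-time loop.

```python
import numpy as np
import timeit

a = np.random.rand(1_000_000).astype(np.float32)
b = np.random.rand(1_000_000).astype(np.float32)

scalar = timeit.timeit(lambda: [x * y for x, y in zip(a, b)], number=1)
vector = timeit.timeit(lambda: a * b, number=1)
print(f"per-element loop: {scalar:.4f} s   vectorized: {vector:.4f} s")
```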







            answered 2 hours ago by Peter Cordes (edited 2 hours ago)
                Let me draw an analogy:



                If you have a monkey typing away at a typewriter, and you want more typing to get done, you can give the monkey coffee, typing lessons, and perhaps make threats to get it to work faster, but there comes a point where the monkey will be typing at maximum capacity.



                So if you want to get more typing done, you have to get more monkeys.






                answered 2 hours ago by EvilSnack (new contributor)
                        Multi-core chips aren't usually multiscalar (very wide issue), and multiscalar cores aren't the same thing as multi-core. A multiscalar architecture running at several gigahertz would be ideal in principle, but the bridges (interconnect) it would need would be too costly for consumer parts. So the tendency is toward multi-core programming at lower clock speeds rather than wide-issue execution at high clock speeds. Multiple simpler cores are cheaper and easier to control, which is why a multiscalar architecture at several gigahertz is a bad idea.






                        answered 1 hour ago by machtur (new contributor)
