Many workloads fall into one of two camps:
- Poorly scalable across cores: almost everything runs in a single thread
- Highly scalable across cores: the work spreads easily over many threads
The number of workloads in between, which scale to, say, 4 or 8 cores but no further, is simply too small to matter.
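One way to see why that middle ground is so thin is Amdahl's law (my illustration, not from the original text): with a serial fraction s, the speedup on n cores is 1/(s + (1-s)/n). Unless s is very small, the curve flattens after a handful of cores; once s is small, it keeps scaling across dozens. A minimal sketch:

```python
# Amdahl's law: speedup on n cores for a workload with serial fraction s.
def amdahl_speedup(s: float, n: int) -> float:
    return 1.0 / (s + (1.0 - s) / n)

# Serial fractions chosen to illustrate the two camps and the thin middle.
for s in (0.50, 0.10, 0.01):
    speedups = [amdahl_speedup(s, n) for n in (1, 4, 8, 16, 32)]
    print(f"s={s:.2f}: " + ", ".join(f"{x:.1f}x" for x in speedups))

# s=0.50: 1.0x, 1.6x, 1.8x, 1.9x, 1.9x   <- saturates almost immediately
# s=0.10: 1.0x, 3.1x, 4.7x, 6.4x, 7.8x   <- the thin middle ground
# s=0.01: 1.0x, 3.9x, 7.5x, 13.9x, 24.4x <- keeps scaling across many cores
```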
For poorly scalable workloads, you want a few really fast cores. How many depends mainly on how many non-scalable workloads you want to run at the same time. For a game that might be 3 to 6, often fewer.
Then you have the highly scalable workloads. You can, of course, make them faster by cramming as many large, fast cores as possible onto a chip. But that is not the most efficient approach, because large cores take up disproportionately more area to extract the last bits of performance. And area is scarce on a chip; every mm² costs money.
Looking at SPEC2017 scores for Alder Lake, we see that the P-core scores 8.14 for integer and 14.16 for floating point. The E-core scores 5.25 for integer (65% of the P-core) and 7.66 for floating point (54%). Roughly half to two thirds of the performance, depending of course on the workload.
However, the area occupied by an E-core is much smaller. I can't find the exact sizes at the moment, but let's say 4 E-cores fit in the same area as 1 P-core. Together, those 4 cores deliver (with perfect scaling) 2.6x the integer or 2.2x the floating-point performance in the same area as 1 P-core!
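To make the area math explicit, here is a small back-of-the-envelope calculation; the 4-to-1 area ratio is an assumption, and perfect scaling is of course optimistic:

```python
# Back-of-the-envelope: throughput per P-core-sized area, assuming 4 E-cores
# fit in the footprint of 1 P-core and scaling across cores is perfect.
P_INT, P_FP = 8.14, 14.16  # Alder Lake P-core SPEC2017 scores (int, fp)
E_INT, E_FP = 5.25, 7.66   # Alder Lake E-core SPEC2017 scores (int, fp)
E_CORES_PER_P_AREA = 4     # assumed area ratio, not a measured figure

print(f"E-core vs P-core: int {E_INT / P_INT:.1%}, fp {E_FP / P_FP:.1%}")
print(f"4 E-cores vs 1 P-core: int {E_CORES_PER_P_AREA * E_INT / P_INT:.1f}x, "
      f"fp {E_CORES_PER_P_AREA * E_FP / P_FP:.1f}x")

# E-core vs P-core: int 64.5%, fp 54.1%
# 4 E-cores vs 1 P-core: int 2.6x, fp 2.2x
```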
So in the future I expect to see a lot of chips with 4 to 8 performance cores for single-thread performance, and tens of efficiency cores for multi-threaded performance. And not only does Intel seem to be aware of this; AMD is also said to be working on dense Zen 4 cores (Zen 4c).