Hyperthreading Terminology and Analogy Debates
A significant portion of the discussion centers on the correct and understandable terminology for describing modern CPU capabilities, particularly concerning Hyperthreading (HT) or Simultaneous Multi-Threading (SMT). Users debate the accuracy of terms like "cores" versus "threads" and engage in analogical reasoning to explain HT to a non-technical audience.
- tgma expresses frustration with the non-standard terminology, stating, "The way they refer to cores in their system is confusing and non-standard. The author talks about a 5900X as a 24 core machine and discusses as if there are 24 cores, 12 of which are piggybacking on the other 12. In reality, there are 24 hyperthreads that are pretty much pairwise symmetric that execute on top of 12 cores with two sets of instruction pipeline sharing same underlying functional units." (A sketch for inspecting this core/thread layout on Linux follows this list.)
- saghm's analogy of "2-ply toilet paper," used to explain hyperthreading to his brother, is praised for its accessibility. He describes it as "You don't quite have 24 distinct things, but you have 12 that are roughly twice as useful as the individual ones, although you can't really separate them and expect them to work right." BobbyTables2 approvingly extends the toilet-paper analogy: "Especially when it comes to those advertisements '6 large rolls == 18 normal rolls'. Sure it might be thicker but nobody wipes their butt with 1/3 a square…"
- nayuki offers a "chefs in a kitchen" analogy: "Putting two chefs in the same kitchen doesn't let you cook twice the amount of food in the same amount of time, because sometimes the two chefs need to use the same resource at the same time - e.g. sink, counter space, oven. But, the additional chef does improve the utilization of the kitchen equipment, leaving fewer things unused."
- BrendanLong, the author, acknowledges the feedback and updates the terminology: "Thanks for the feedback. I think you're right, so I changed a bunch of references and updated the description of the processor to 12 core / 24 thread. In some cases, I still think "cores" is the right terminology though, since my OS (confusingly) reports utilization as-if I had 24 cores."
- sroussey questions the fundamental neatness of these terms: "Eh, what’s a thread really? It’s a term for us humans. The difference between two threads and one core or two cores with shared resources? Nothing is really all that neat and clean."
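As a concrete companion to tgma's description, the sibling structure is directly visible in Linux's sysfs topology files. The following is a minimal sketch, assuming a Linux system (the paths are Linux-specific), that groups logical CPUs by physical core:

```python
# Minimal sketch (Linux-only): group logical CPUs by physical core using the
# standard sysfs topology files, making the core/thread distinction explicit.
import glob

groups = {}
for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"):
    cpu = int(path.split("/")[-3].removeprefix("cpu"))  # ".../cpu7/topology/..." -> 7
    with open(path) as f:
        siblings = f.read().strip()  # e.g. "0,12": this CPU and its SMT sibling
    groups.setdefault(siblings, []).append(cpu)

print(f"{sum(len(g) for g in groups.values())} logical CPUs on {len(groups)} physical cores")
for siblings, cpus in sorted(groups.items()):
    print(f"  siblings {siblings}: logical CPUs {sorted(cpus)}")
```

On a 5900X this should report 24 logical CPUs on 12 physical cores, matching the corrected 12 core / 24 thread description.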
Performance Benefits and Drawbacks of Hyperthreading
A major theme revolves around when and why Hyperthreading provides a performance benefit, with many users sharing experiences where disabling HT improved performance, while others highlight specific workloads where it excels. The nuance of workload dependency and resource contention is frequently mentioned.
- hinkley expresses skepticism about HT's consistent performance benefits: "How many times has hyperthreading been an actual performance benefit in processors? I cannot count how many times an article has come out saying you'll get better performance out of your [CPU] by turning off hyperthreading in the BIOS. It's gotta be at least 2 out of every 3 chip generations going back to the original implementation, where you're better off without it than with."
- tgma counters that workload dependency is key: "It has a lot to do with your workload as well as if not moreso than the chip architecture. The primary trade-off is the cache utilization when executing two sets of instruction streams."
- hinkley adds thermal throttling as another factor: "That's likely the primary factor, but then there's thermal throttling as well. You can't run all of the logic units flat out on a bunch of models of CPU."
- gruez disputes the thermal throttling argument, suggesting that disabling HT would be counterproductive if thermal budget exists: "Disabling SMT likely saves negligible amount of power, but disables any performance to be gained from the other thread. If there's thermal budget available, it's better to spend it by shoving more work onto the second thread than to leave it disabled."
- twoodfin provides examples of where HT is beneficial: "operational database systems (many users/connections, unpredictable access patterns) are beneficiaries of modern hyperthreading. I’m familiar with one such system where the throughput benefit is ~15%... IBM’s POWER would have been discontinued a decade ago were it not for transactional database systems, and that architecture is heavily invested in SMT, up to 8-way(!)"
- hinkley brings up software licensing as a reason to disable HT: "Yeah that was another thing. You run Oracle you gotta turn that shit off in the BIOS otherwise you're getting charged 2x for 20% more performance." wmf corrects this, stating, "AFAIK Oracle does not charge extra for SMT."
- ckozlowski reflects on the historical context of HT, linking it to Intel's Pentium 4 and its long pipelines: "Intel brought about Hyperthreading on Northwood and later Pentium 4s as a way to help with issues in its long pipeline... For a 30+ stage pipeline, that's a lot of wasted clock cycles. So hyper-threading was a way to recoup some of those losses. I recall reading at the time that it was a 'latency hiding technique'."
- justsomehnguy elaborates on the Pentium 4 era: "But the GHz race led to the monstrosity of 3.06GHz CPUs where the improvement in speed didn't quite translate to the improvement in performance. And while the Northwood fared well (especially considering the disaster of Willamette) GHz/performance wise, the Prescott didn't, and mostly showed the same performance in non-SSE/cache bound tasks... so Intel needed to push the GHz further which required a longer pipeline and brought even more penalty on a prediction miss."
- tom_ questions the necessity of HT if the core itself is efficient: "Why do they need so many threads? This really feels like they just designed the cpu poorly, in that it can't extract enough parallelism out of the instruction stream already. (Intel and AMD stopped at 2! Apparently more wasn't worth it for them. Presumably because the cpu was doing enough of the right thing already.)"
- wmf suggests that IBM's SMT8 on POWER processors might be for "low-IPC spaghetti code."
- TristanBall speculates that licensing (both per-core and PVU) and different cost-benefit analyses for high-end systems influence IBM's approach: "I suspect part of it is licensing games, both in the sense of 'avoiding per core license limits' which absolutely matters when your DB is costing a million bucks, and also in the 'enable the highest PVU score per chassis' for ibm's own license farming. Power systems tend not to be under the same budget constraints as intel, whether that's money, power, heat, whatever, so the cost benefit of adding more sub-core processing for incremental gains is likely different too."
- esseph points to differing approaches by Intel and AMD: "Intel vs AMD, you'll get a different answer on the hyperthreading question."
- loeg argues that disabling HT is often a workaround for poor scheduler or application design: "HT provides a significant benefit to many workloads. The use cases that benefit from actually disabling HT are likely working around pessimal OS scheduler or application thread use. (After all, even with it enabled, you're free to not use the sibling cores.) Otherwise, it is an overgeneralization to say that disabling it will benefit arbitrary workloads." (A per-process pinning sketch follows this list.)
- robocat adds additional reasons for disabling HT: "Other benefits: per-CPU software licencing sometimes, and security on servers that share CPU with multiple clients."
- mkbosmans highlights that HPC workloads are often memory or vector execution port bound, making them poor candidates for SMT: "Especially in HPC there are lots of workloads that do not benefit from SMT. Such workloads are almost always bottlenecked on either memory bandwidth or vector execution ports. These are exactly the resources that are shared between the sibling threads."
- bee_rider mentions the Xeon Phi's 4 threads per core, noting its niche nature and reliance on specific programming models.
- c2h5oh notes architectural differences: "HT performance varies wildly between CPU architectures and workloads. e.g. AMD implementation, especially in later Zen cores, is closer to a performance of a full thread than you'd see in Intel CPUs. Provided you are not memory bandwidth starved."
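loeg's observation that software is "free to not use the sibling cores" can be made concrete without a BIOS change. Below is a minimal sketch, again assuming Linux: it pins the current process to one logical CPU per physical core, approximating "SMT off" for that process alone (on recent kernels, writing `off` to /sys/devices/system/cpu/smt/control disables SMT system-wide instead):

```python
# Minimal sketch (Linux-only): emulate "SMT off" for one process by pinning it
# to a single logical CPU per physical core, instead of disabling
# hyperthreading machine-wide in the BIOS.
import glob
import os

groups = {}
for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"):
    cpu = int(path.split("/")[-3].removeprefix("cpu"))
    with open(path) as f:
        groups.setdefault(f.read().strip(), []).append(cpu)

one_per_core = {min(cpus) for cpus in groups.values()}  # first sibling of each group
os.sched_setaffinity(0, one_per_core)                   # 0 = the current process
print(f"restricted to {len(one_per_core)} of {os.cpu_count()} logical CPUs")
```

This also makes A/B comparisons straightforward: run the same workload with and without the pinning and compare throughput and latency, rather than assuming SMT helps or hurts.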
CPU Utilization as a Misleading Metric
A prominent theme is the inadequacy and misleading nature of standard CPU utilization metrics (%CPU) for truly understanding system performance, especially concerning latency and resource contention. Users discuss how %CPU often doesn't reflect actual useful work and can be obscured by factors like waiting, thermal throttling, and oversubscription.
- jiggawatts expresses concern about the overreliance on throughput metrics: "I've noticed an overreliance on throughput as measured during 100% load as the performance metric, which has resulted in hardware vendors "optimising to the test" at the expense of other, arguably more important metrics. For example: single-user latency when the server is just 50% loaded."
- twoodfin clarifies that for the particular system they're familiar with, the throughput benefits extend into the 50-70% utilization band, where "p99 latency is not stressed."
- hinkley points out the trade-off between throughput and latency: "Throughput and latency are usually at odds with each other."
- eklitzke emphasizes that actionable metrics are key: "What most people care about is some combination of latency and utilization. As a very rough rule of thumb, for many workloads you can get up to about 80% CPU utilization before you start seeing serious impacts on workload latency. Beyond that you can increase utilization but you start seeing your workload latency suffer from all of the effects you mentioned."
- judge123 recalls once having to explain a server at 60% utilization to a manager, and shares a relatable sentiment: "I wish I had this article back then!"
- hinkley advocates the use of queueing theory for understanding utilization limits: "Up to a hair over 60% utilization the queuing delays on any work queue remain essentially negligible. At 70 they become noticeable, and at 80% they've doubled. And then it just turns into a shitshow from there on." (A short M/M/1 calculation following this list illustrates this curve.)
- BrendanLong stresses the importance of granular measurement periods for SLOs: "A big part of this is that CPU utilization metrics are frequently averaged over a long period of time (like a minute), but if your SLO is 100 ms, what you care about is whether there's any ~100 ms period where CPU utilization is at 100%." (A sampling sketch at the end of this list shows one way to measure this on Linux.)
- kccqzy agrees, emphasizing even more granular measurement: "If your SLO is 100 ms you need far more granular measurement periods than that. You should measure the p99 or p100 utilization for every 5-ms interval or so."
- PaulKeeble highlights that HT's practical gains are modest and latency increases: "they are typically actually in practice only going to bring 15-30% if the job works well with it and their use will double the latency."
- pama draws a parallel to GPU utilization, where % utilization can be highly misleading and 100% can mean vastly different performance levels.
- BrendanLong suggests measuring MIPS against theoretical maximums as an alternative, but notes the difficulty in defining and measuring MIPS accurately.
- kristopolous and bionsystem share anecdotes of their insights being dismissed in job interviews and other contexts.
- N_Lens confirms the experience of quick spikes to 100% utilization: "Anytime CPU% goes over 50-60% suddenly it'll spike to 100% rather quickly, and the app/service is unusable. Learned to scale earlier than first thought."
- CCs criticizes using tools like `stress-ng` for benchmarking, stating it's designed to max out components rather than simulate realistic application behavior.
- 0xbadcafebee advocates for application-specific performance testing: "The benchmark is basically application performance testing, which is the most accurate representation you can get. Test the specific app(s) your server is running, with real-world data/scenarios, and keep cranking up the requests, until the server falls over."
- dragontamer explains that %CPU can be high even when threads are blocked on I/O or `memcpy`, and that HT can help by allowing a CPU-bound thread and a memory-bound thread to run concurrently.
- smallstepforman points out that idle time due to latency is still counted as busy: "Any latency (eg. wait for memory) is still calculated as busy core."
- kqr defends using "semi-crappy indicators" like %CPU if they are the best available, citing their utility in capacity planning with queueing theory, recommending a conservative 40% threshold.
- mayama suggests a combination of %CPU and `loadavg` for a more complete picture, as `loadavg` can indicate I/O waits.
- zekrioca notes that the article's findings on utilization and queueing theory applications are not new.
- therealdrag0 acknowledges the practical utility of imperfect metrics.
- timzaman questions the relevance of the topic given its perceived basic nature, which therealdrag0 attributes to the wide range of knowledge in the industry.
- steventhedev deems %CPU "misleading at best, and should largely be considered harmful," preferring system load.
- mustache_kimono references Brendan Gregg's seminal work on CPU utilization being "wrong," highlighting that %CPU measures all busy states, including waiting, while IPC measures useful work. (A perf-based IPC sketch follows this list.)
- 4gotunameagain playfully asks why CPU utilization concerns seem to be a distinctly "Brendan" obsession.
- ChaoPrayaWave treats CPU usage as a hint and prioritizes response times and queue lengths.
- kunley dismisses the article as "vibecode" with a misleading title.
- swiftcoder shares an anecdote about debating CPU utilization with management and the concept of "busy waiting."
- HPsquared notes that GPU utilization in Task Manager also appears misleading regarding power consumption.
- Aissen points out that the `matrixprod` benchmark may not be universally applicable and that HT can still benefit it.
- bob1029 suggests power consumption as a more telling metric, especially for high-core-count processors, noting his experience with ML workloads where utilization and power/noise are decoupled.
- fennecfoxy defines %util in terms of idle cycles across logical cores.
- fuzzfactor offers detailed instructions on using Windows Task Manager and Resource Monitor for more insightful performance analysis.
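On the queueing rules of thumb from hinkley and kqr: the shape of those numbers falls out of even the simplest model. The following is a minimal sketch assuming an M/M/1 queue (random arrivals, a single server), where the mean wait, in units of the mean service time, is ρ/(1−ρ):

```python
# Minimal sketch: mean queueing delay in an M/M/1 queue, measured in units of
# the mean service time, is rho / (1 - rho), where rho is utilization. The
# non-linear knee is why capacity rules of thumb cluster around 40-80%.
for rho in (0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95):
    print(f"utilization {rho:4.0%}: average wait = {rho / (1 - rho):5.2f}x service time")
```

Real systems are not M/M/1 (a multi-core host is closer to M/M/c, and arrivals are rarely Poisson), so the exact values are directional; the durable point is that delay grows non-linearly and explodes past roughly 80%.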
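BrendanLong's and kccqzy's point about averaging windows is also actionable with a short sampler. Here is a minimal sketch, assuming Linux and its /proc/stat counters, that flags short windows of saturation which a one-minute average would hide; note that /proc/stat advances in scheduler ticks (commonly 10 ms), which bounds how fine the sampling can usefully get:

```python
# Minimal sketch (Linux-only): sample whole-machine CPU utilization from
# /proc/stat over short windows to catch brief saturation. Tick granularity
# (~10 ms) limits how short the window can usefully be.
import time

def cpu_times():
    with open("/proc/stat") as f:
        fields = [int(x) for x in f.readline().split()[1:]]
    return fields[3] + fields[4], sum(fields)  # (idle + iowait, total)

INTERVAL = 0.05  # 50 ms windows; pick something near your SLO, not one minute
prev_idle, prev_total = cpu_times()
while True:
    time.sleep(INTERVAL)
    idle, total = cpu_times()
    d_idle, d_total = idle - prev_idle, total - prev_total
    if d_total and 1 - d_idle / d_total > 0.99:
        print(f"{time.time():.3f}: CPU saturated for a full {INTERVAL * 1000:.0f} ms window")
    prev_idle, prev_total = idle, total
```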
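Finally, mustache_kimono's point about IPC versus %CPU can be checked on Linux with the `perf` tool. The following minimal sketch assumes `perf` is installed and that you have permission to read system-wide counters (kernel.perf_event_paranoid may need lowering); the output parsing is a best-effort assumption, since perf's text format varies between versions:

```python
# Minimal sketch (Linux, needs `perf` and system-wide counter permissions):
# measure instructions per cycle for one second. Per Brendan Gregg's argument,
# %CPU counts stalled cycles as busy; IPC distinguishes useful work.
import re
import subprocess

out = subprocess.run(
    ["perf", "stat", "-a", "-e", "instructions,cycles", "sleep", "1"],
    capture_output=True, text=True,
).stderr  # perf stat prints its report on stderr

counts = {}
for event in ("instructions", "cycles"):
    m = re.search(r"([\d,]+)\s+" + event, out)
    if m:
        counts[event] = int(m.group(1).replace(",", ""))

if len(counts) == 2 and counts["cycles"]:
    print(f"system-wide IPC over 1s: {counts['instructions'] / counts['cycles']:.2f}")
else:
    print("could not parse perf output; check permissions (perf_event_paranoid)")
```

High %CPU with low IPC suggests cores that are mostly waiting on memory, which is exactly the situation dragontamer and smallstepforman describe.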