Detect count of logical cores instead of physical #2071
Hi @mako2580, thanks for posting the issue. I managed to reproduce your results on my machine.
Unfortunately, the same improvement isn't there for higher levels that use more CPU.
I'm not an expert on this topic, but my understanding is that hyper-threading only helps when the running program doesn't fully saturate your cores (i.e. when it's bound by something other than the CPU). This should be the case when compressing large files (which should be I/O bound). Hence the speed improvement at level 10 but not at level 19 isn't terribly surprising to me. I wasn't around when the decision to use the number of physical cores was made, so maybe @Cyan4973, @terrelln or @felixhandte can provide some context here? Trying different options (not just -T) for specific host machines/data is something zstd users should do to extract the most utility out of the library. As for selecting a good default: I suppose if we can show that -T(#logical_cores) is never worse than -T(#physical_cores) but sometimes quite a bit better, that would be a compelling enough reason to switch?
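For anyone who wants to run this comparison themselves, here is a minimal sketch using zstd's built-in benchmark mode (-b); the file name and the thread counts are placeholders to adapt to the machine under test:

```sh
FILE=testdata.bin              # any large, representative input file
for t in 2 4; do               # e.g. physical vs logical core count
    # -b19: benchmark at level 19, -i5: measure for at least 5 seconds
    zstd -b19 -i5 -T"$t" "$FILE"
done
```

Comparing the reported MB/s of the two runs shows whether the extra logical threads help at that level.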
Retested at compression level 19, and I'm getting the expected results: -T#logical_cores is about 15 % faster than -T#physical_cores. What type of storage are you using for the test? I was testing in a ramdisk to avoid being I/O-bound as much as possible; that might be the reason why you are getting almost the same results even at high compression levels.
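To reproduce this kind of ramdisk setup (so the comparison is CPU-bound rather than I/O-bound), something along these lines should work on Linux; the mount point, size, file name and thread counts are assumptions:

```sh
sudo mount -t tmpfs -o size=4G tmpfs /mnt/ramdisk   # RAM-backed filesystem
cp testdata.bin /mnt/ramdisk/
cd /mnt/ramdisk
# discard the compressed output so only the read side touches storage at all
time zstd -19 -T2 -c testdata.bin > /dev/null       # physical-core count
time zstd -19 -T4 -c testdata.bin > /dev/null       # logical-core count
```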
This is a difficult topic. Basically, the idea that one can use hyper-threading to improve performance has some truth, but it depends on multiple factors, and it's less true for zstd.

The main idea is that modern OoO cores have multiple execution units in each core (and multiple cores per CPU; don't confuse those 2 levels). These execution units are able to issue instructions in parallel, as long as they are unrelated. At its peak, an Intel CPU can issue up to 4 micro-instructions per cycle. Of course, no one reaches this theoretical limit, and any code featuring > 3 micro-instructions per cycle is considered very well optimized. In practice, most code is not that optimized, and it's common for many applications to linger in the ~0.5 instructions per clock territory. That's because the primary performance limitation nowadays is the memory wall. Fetching some random data in memory costs hundreds of cycles, and the core just stalls, starving for data, before being able to proceed with further calculation. Modern CPUs can re-order instructions to take advantage of memory fetch stalls, but there is a limit, both to CPU and program capabilities. At some point, the program cannot proceed forward, because there is absolutely nothing else to do, except waiting for the one critical piece of data that every future action depends on. That is where hyper-threading comes in: it lets a second hardware thread tap into those otherwise idle execution units. The cost for this capability is not null: higher processing requirements for the CPU, on top of sharing essential core resources, such as L1 cache, virtual registers, or the micro-op cache. So it had better be useful.

Now let's talk about zstd. Its fast modes are designed to keep their working data within cache, so the hot loop sustains a comparatively high number of micro-instructions per cycle, which leaves little spare capacity for a sibling hyper-thread. That being said, this is actually an "ideal" situation. In practice, one can't completely avoid fetching data out of cache as the compression level becomes higher, because higher levels employ larger search structures that no longer fit in cache. So, as the average number of micro-instructions per cycle becomes lower, there is more room for other threads to tap into unused resources. Hence, at higher compression levels, I'm not surprised to see that hyper-threading is actually beneficial.

Now, that would be one reason to use logical cores. But there are other considerations to factor in. Primarily, the "system" within which zstd runs also has other tasks to serve (background threads, other processes), and each additional compression thread consumes additional memory. Therefore, due to these additional constraints, it felt "safer", or more conservative, to select a default which preserves system resources for other tasks, especially system-critical ones.

Now, it's not that using more threads is forbidden: at any rate, if someone wants more threads, they can request so. It's more about "what's a good default", and I'm sure this can be debated at length.

In the future, if this is an important topic, we can probably pour more resources into it, to provide a more complete and satisfactory answer, with maybe more performance and more control. But there is only so much resource we can spend at any moment in time, and these days we have settled for a "good enough" middle ground by selecting this "number of physical cores" policy. Maybe something that can still be improved upon in some future.
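One way to observe this effect directly is to look at the instructions-per-cycle counter while compressing at different levels. The sketch below assumes Linux with perf installed; the file name is a placeholder:

```sh
# perf prints its report on stderr, so only zstd's compressed output is discarded
perf stat -e cycles,instructions zstd -3  -T1 -c testdata.bin > /dev/null
perf stat -e cycles,instructions zstd -19 -T1 -c testdata.bin > /dev/null
```

A noticeably lower instructions-per-cycle figure at level 19 indicates execution slots left idle by cache misses, which is exactly the headroom a sibling hyper-thread can use.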
Thanks for the detailed answer. You agree that using the number of logical cores can bring a performance improvement. Not a big one, but it's something. The only problem you see is that there will be no room left for the OS (background threads and free memory). That's a good reason, but I think it is not as critical as it might seem.

First one - free memory: you will still run out of memory if you have a CPU with many physical cores and not enough RAM. You are not solving the problem of RAM consumption, you are just moving the line further. And if you do run out of memory, then the next time you run zstd you just specify -T#number_of_threads_you_have_ram_for; but if you have enough RAM, you get maximum performance even on the default.

And the second one - OS background tasks: I think that might only be a problem if you set a high process priority, but that's not the default case. Even if I wanted to be highly paranoid, I would rather reserve a constant number of threads for background tasks - for example, use (#logical_threads - 1) for -T0. Because on systems with a higher number of cores, leaving half of the cores unused is too much for background tasks, IMHO.
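As an illustration, the (#logical_threads - 1) reservation can already be expressed from the shell today (nproc reports online logical processors; the file name and level are placeholders):

```sh
# use every logical processor except one, kept free for background tasks
zstd -19 -T"$(( $(nproc) - 1 ))" -c testdata.bin > testdata.bin.zst
```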
But that's the job of the CPU scheduler. My Linux system does not become unresponsive or sluggish when zstd is using all logical processors; instead, the overall CPU consumption of zstd is lowered when other programs also need to consume CPU. Obviously, I don't mean to say that all systems will handle 100% CPU consumption gracefully. What I mean is that if the user knows it works well for them, there's nothing bad in allowing them to enable this behaviour systematically. In my case, I was trying to avoid hard-coding the thread count when invoking zstd from scripts. What you say makes a lot of sense, and I do not believe the default behaviour of -T0 needs to change.
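In the same spirit, if the concern is starving other tasks, one option is to keep all logical processors but lower the process priority so the scheduler arbitrates. A sketch, assuming Linux and GNU coreutils; the file name and level are placeholders:

```sh
# all logical cores, but at the lowest scheduling priority
nice -n 19 zstd -19 -T"$(nproc)" -c testdata.bin > testdata.bin.zst
```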
I think that adding a separate option is acceptable, but in that case it should be called out explicitly in the documentation.
I'm not comfortable with changing the default for the time being. Note that anyone can take advantage of hyper-threading, but it must be requested explicitly; it's just not the "default".
Yann, I support your decision, but can we have an argument that explicitly requests using the count of logical cores, without the caller having to know the exact number of cores? As I said before, currently in Bash I can do something like -T$(nproc).
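For reference, both counts can be obtained from the shell without hard-coding anything. The snippet below is Linux-specific (nproc from coreutils, lscpu from util-linux), and the file name and level are placeholders:

```sh
LOGICAL=$(nproc)                                                    # online logical processors
PHYSICAL=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)   # unique (core,socket) pairs
zstd -19 -T"$LOGICAL" -c testdata.bin > testdata.bin.zst
```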
Which command name would make sense?
Can't …?
Not with current code. At …
I guess there are two ways to do this: …
Describe the bug
Parameter -T0 detects physical cores, but I think it should detect logical cores instead, to take advantage of hyper-threading.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
zstd should use all logical cores to speed up compression.
Screenshots and charts
As you can see from the data below, -T0 behaves the same as -T2, which is the number of physical cores, as expected from the source code. Only at the default compression level 3 is using the number of physical cores faster than using the number of logical cores; at any other compression level the situation changes, and with higher compression levels the speed-up is more visible.
Compression level 2:
Compression level 3:
Compression level 4:
Compression level 10:
Desktop: