Enable fully threaded IO for installs #1876
Conversation
What does this mean in terms of max open handles to close? Is that limited to the number of hardware threads too?
Nope. I will get to that. I'd like some HPC folk to test this as-is, though.
I assume you'll be recommending they build a version to trial from your branch, then? If you want it merged before then, we should open an issue to block 1.19.0, to ensure we don't forget to get the rlimits or similar in place should we not back this out.
Also, on the off-chance that we can work out why appveyor is squiffing, if you rebase onto master, we have
I'd like to test this but I'm not really sure how. I've cloned @rbtcollins' bug-1866 branch, but when I build, the only binary I get is
@saethlin The binary is called … Once you've done that it should have installed into the usual place (…). Once you're done with your testing you can simply do …
What I normally do is build it with `cargo build --release`, then cp
No improvement, ~5 second regression but that's within measurement uncertainty here.
Thanks for testing!
info: syncing channel updates for 'nightly-x86_64-unknown-linux-gnu'
info: latest update on 2019-05-26, rust version 1.36.0-nightly (f49269398 2019-05-25)
info: downloading component 'rustc'
86.4 MiB / 86.4 MiB (100 %) 24.4 MiB/s in 3s ETA: 0s
info: downloading component 'rust-std'
60.7 MiB / 60.7 MiB (100 %) 23.5 MiB/s in 2s ETA: 0s
info: downloading component 'cargo'
info: downloading component 'rust-docs'
info: installing component 'rustc'
86.4 MiB / 86.4 MiB (100 %) 8.8 MiB/s in 9s ETA: 0s
info: installing component 'rust-std'
60.7 MiB / 60.7 MiB (100 %) 9.1 MiB/s in 7s ETA: 0s
info: installing component 'cargo'
info: installing component 'rust-docs'
11.0 MiB / 11.0 MiB (100 %) 443.2 KiB/s in 2m 3s ETA: 0s
Force-pushed from 12c60c8 to c3e1732
@saethlin could you please give this a spin on your NFS home dir? It's not adaptively throttled or tuned yet, but it should do a much better job of latency hiding - or at least I hope it will. I will be adding the requisite instrumentation to add those features soon, but if this doesn't actually trample your server - if it's safe - then I'd like to see about getting the work merged incrementally. If it does have a negative impact, then we need to wait for the improved IO work.
Looks like a substantial improvement, and I can't detect any trampling. Nice!
Sweet, down from 1m 58s. Me likey!
I'm going to add more instrumentation and so forth before we merge this, but you're welcome to use this branch for now :)
I wonder if "WslFs" might have something to do with this? More info
@rbtcollins Are you close to wanting this to be properly reviewed? I don't see changes since your comment about adding instrumentation.
My calculations say 800ms+ of that 15s is in mkdir today, so we might reasonably be able to get to 15s before we look at parallel decode of components, and/or streaming decode of components. I'll worry about that, and the refactoring to support it, when the time comes. For now, I think the win is sufficiently large and clear that I'm going to just beat things into a somewhat happier shape and then ask for review for merging.
But I do think I need to detect large latency spikes, otherwise we can make things worse by dispatching IO into e.g. overloaded old disks or whatever. So hopefully this week I'll wrap that up.
Force-pushed from 9fa59ae to 5e314d7
Set RUSTUP_CLOSE_THREADS=disabled to force single threaded IO, or to a specific number if desired for testing/tuning. This may improve rust-lang#1867, but has no impact on rust-lang#1866 due to the coarse lock around the fd-handle table inside WSL.
Generalises the threaded IO closing to be fully threaded disk IO.
Adds a new environment variable to control tracing. Tracing is manual and can be expanded. We can probably get better systemic tracing from a suite of platform specific tools over time, but for now this was low hanging fruit that allows pretty good insight without too much code intrusion. Runtime overhead when disabled is extremely low (I can unpack rust-docs on Linux in 1s).
I don't think so: the plan9 fs stuff is super interesting, but that's for interop with Windows-mounted dirs as opposed to working with the WSL-owned dirs, isn't it?
This was only needed because we were not controlling the permissions accurately during unpack, which can now be corrected.
Saw the request for testing, here are a couple of data points :)
MacBook Pro (15-inch, 2018)
stable rustup on macbook pro
threadedio rustup on macbook pro
Results: Ran this a couple of times with similar results; small improvements on
On a shared shell server
stable rustup on a shell server
threadedio rustup on a shell server
Results: Due to the nature of its use case, the shared shell server has limited RAM availability per user and slow IO (HDD-based storage & lots of users). Interestingly enough, on five attempts,
Awesome, thank you.
I presume this is ulimit-based? Is it a hard limit or soft? (e.g. could rustup raise it to a higher limit if we made a setrlimit call?) Could you please try this branch with
Also if you can share some details:
Assuming IO thread disabling worked, I can put a check for that into the heuristic about when to use threads. And/or use a smaller stack per thread - currently that isn't tuned, so if this is many users with tight limits but lots of cores, it may also be that we need to consider the RAM footprint in assessing the thread count. Another thing we could do is handle memory allocation errors by letting IO complete and retrying; though I'd rather not be bouncing right off the limits, since that often means you're already into swap, and the point here is to be avoiding slowness, not causing it :). Lastly, are you able to build a Rust hello-world project with cargo in release mode on that shell server?
Some testing on Windows 10 here (1903 18362). Compiling the … Compiler warning at
On a tiny aarch64 device, 4 cores with 1GB memory, writing to an SD card. Anecdote: while compiling the branch, I experimented with cooling... I ended up sticking the entire board in a jar of oil, and that let me finish the build. Stable (1:39)
Simple run (1:32)
IO threads disabled (2:06)
IO threads = 1 (1:50)
IO threads = 2 (1:32)
IO threads = 3 (1:54)
IO threads = 4 (1:30)
@passcod woah, you don't have
Not available for this platform, I guess?
On Windows 10:
On WSL (Ubuntu):
@rbtcollins About Rustup output
Edit: Using Windows 1903 (on Bootcamp on a MacBook Pro)
@Coder-256 I wish it were :) - Just coincidentally, the MS Defender team have released signature updates about now hinting to Defender how to deal with rustup under WSL similarly to rustup on Windows (scan on read rather than scan on close). This has massively reduced the duration of the CloseHandle syscall, and that improves performance the same as creating a manual Defender exclusion did. Current rustup in my not-upgraded WSLv1:
This patch:
Three days ago, nothing else changed except defender signature updates...
Here's more info:
2x Intel 6-Core Xeon 3.07 GHz, 24 cores total listed in …
Here's a bunch of logs with various thread number settings, some successful, some with crashes:
TL;DR: with the flags (thread count <= 18), I was able to install nightly with this branch's
Yes - I built rustup from this branch on the server too. Additionally, just in case:
@Walther great, thank you; the speedup is likely the removal of post-unpack chmodding: instead we set the mode appropriately during unpack, which on systems with pagecache pressure avoids paging the dentries and inodes back in after the unpack. (Some page-in is required to move things into place, but many subtrees are moved as units, so it is still quite a saving.) Well, we certainly aren't hitting 2G of memory, so hmm, I'm not sure how to auto-detect a reasonable threshold. Those last 6 threads seem to cause the issue; that raises memory pressure two-fold: more concurrent IO requests with their buffers on the heap - so 6 more threads is up to 6 potentially large files (you are crashing on a 160MB file) - as well as 6× the thread stack size, which I'm guessing defaults to 8MB, so 48MB. Can you do a test and alloc in a loop until you get a failure - how much RAM does a process get?
Haven't dealt with straight memory allocation in Rust yet, so here's a short naive snippet I used:
With a handful of runs, it seemed to stall at the same spot in the loop.
@Walther can you try with 24 threads and RUST_MIN_STACK=1048576? Similarly for 2097152. I have some future work planned to dynamically add/remove threads which may help here, but setting a thread stack size explicitly may be a simple first step. You only have 21MB per CPU, and if that limit is per process rather than per process group… Hmm, it's interesting that the RSS rlimit isn't set but we're still hitting a limit. Do you know if it's cgroups or something else?
No luck with those parameters. This seems fairly odd, considering the compilation of
Edit (re: Discord chat): with min-stack 1M, it worked with 19 threads (new record! :D)
A few tiny niggles, which I'm mostly happy to receive pushback on, otherwise 👍 This looks awesome.
```rust
item.finish = precise_time_s();
}

#[allow(unused_variables)]
```
Rather than unused variables (which I assume is `mode` on Windows) could we not check `mode` against 0644/0755 and return an error if the mode is bad, as part of our tar verification plans?
Latest Rust can do attributes on parameters - yay - but no, that's for later; push back here if you disagree. I can call it `_mode` instead if you prefer to have it narrow; alex didn't like that in tar-rs.
I'm fine for it to stay like this for now.
```rust
println!("{} deferred IO operations", prev_files);
}
let buf: Vec<u8> = vec![0; prev_files];
assert!(32767 > prev_files);
```
This magic number here -- could it be a named constant just to make it clearer what it's up to?
A comment is better.
Okay seems fair to me for now. Clear what to do if the assert fires, so good comment.
As an aside, I think the appveyor failure is spurious -- appveyor has been getting more and more unreliable. In comparison, my tests on Travis/Windows have been 100% good. (I broke Travis once, but otherwise no strange link errors etc.)
When directories complete, start writing the files/dirs within that directory that have been decompressed already. Avoids a stat() + create_dir_all() in the main thread permitting more concurrent IO dispatch in exchange for memory pressure.
Heavily deferred directories can lead to very large amounts of RAM holding pending IOs, and under low memory limits this caused failures. Reduce the stack size for IO worker threads to 1MB, as the default of 8MB on Linux is wasteful for IO worker threads.
See the individual commits for what this patch does, but in short: on all platforms it will use up to CPU-count threads during unpacking, will no longer do a separate pass to set permissions, and may use more memory. Uninstall hasn't been optimised yet (and upgrades perform uninstalls, so the instructions below separate the steps for better reporting of results).
I'd like to get some wider testing than we've had so far. I can supply test binaries if needed, but it's very easy to test with:
Let us know how this goes. I'm particularly interested in hearing about any regressions that would make it unsuitable for merging.
This is known not to solve WSL performance issues: #1866.
It is thought to help with the NFS issues (#1867) and should further help Windows installation times (#1540).