Log disk I/O #135
If a job is not computing, it's either descheduled or in I/O wait, but ideally we want to distinguish disk from tty from network, and really-ideally also distinguish the different interfaces or devices. On an HPC node with 128 cores there can be many jobs running at the same time, and this is especially true of login and interactive nodes. So it's not quite enough to account for whole-system I/O wait (even if that might be better than nothing). But all that said, there's no way to say objectively that "there's too much I/O wait" if a job has threads that can make progress while other threads are waiting. "Too much" is relative to an expectation; even on a superfast disk there will be some I/O wait. One measure that might make sense is average wait (or better, time) per I/O operation; that removes sonar/Jobanalyzer from the business of judging whether something is slow or fast, waiting or busy. An I/O operation count would also be helpful. Of course, going down that path one could imagine a distribution of timings by count, but I don't expect the kernel keeps that around.
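Most of the raw numbers mentioned here are already exposed per process by Linux procfs. The following is a minimal sketch (not sonar code) of how one could sample read/write syscall counts from /proc/&lt;pid&gt;/io and accumulated block-I/O delay (delayacct_blkio_ticks, field 42 of /proc/&lt;pid&gt;/stat) to derive an average delay per I/O operation; the delay field may read as zero unless the kernel has task delay accounting enabled.

```rust
// Sketch: per-process I/O counters and block-I/O delay from procfs.
// Assumes Linux. delayacct_blkio_ticks may be 0 unless delay accounting is
// enabled (CONFIG_TASK_DELAY_ACCT and, on newer kernels, the `delayacct`
// boot option or kernel.task_delayacct=1). Reading another user's /proc
// entries may also require privileges.
use std::fs;

fn read_io_counters(pid: u32) -> Option<(u64, u64)> {
    // /proc/<pid>/io: syscr/syscw are read/write syscall counts.
    let text = fs::read_to_string(format!("/proc/{}/io", pid)).ok()?;
    let mut syscr = 0u64;
    let mut syscw = 0u64;
    for line in text.lines() {
        if let Some(v) = line.strip_prefix("syscr: ") {
            syscr = v.trim().parse().ok()?;
        } else if let Some(v) = line.strip_prefix("syscw: ") {
            syscw = v.trim().parse().ok()?;
        }
    }
    Some((syscr, syscw))
}

fn read_blkio_ticks(pid: u32) -> Option<u64> {
    // Field 42 of /proc/<pid>/stat is delayacct_blkio_ticks (clock ticks).
    // The comm field (2) may contain spaces, so split only after the ')'.
    let text = fs::read_to_string(format!("/proc/{}/stat", pid)).ok()?;
    let rest = &text[text.rfind(')')? + 2..]; // rest starts at field 3
    rest.split_whitespace().nth(42 - 3)?.parse().ok()
}

fn main() {
    let pid = std::process::id();
    if let (Some((syscr, syscw)), Some(ticks)) = (read_io_counters(pid), read_blkio_ticks(pid)) {
        let ops = (syscr + syscw).max(1);
        println!(
            "syscr={} syscw={} blkio_ticks={} avg_ticks_per_op={:.3}",
            syscr,
            syscw,
            ticks,
            ticks as f64 / ops as f64
        );
    }
}
```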
But would sonar then make regular, well-defined reads and writes and measure how long they take?
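For concreteness, such a probe might look like the sketch below. This is only an assumption about what "regular, well-defined reads" could mean, not an existing sonar feature; the target path and read size are placeholders, and a naive version mostly measures page-cache latency unless the cache is cold or bypassed.

```rust
// Hypothetical timed-read probe, not sonar code.
use std::fs::File;
use std::io::Read;
use std::time::{Duration, Instant};

/// Time a single small read from `path`.
fn probe_read_latency(path: &str, bytes: usize) -> std::io::Result<Duration> {
    let mut buf = vec![0u8; bytes];
    let mut f = File::open(path)?;
    let start = Instant::now();
    let _n = f.read(&mut buf)?; // reads up to `bytes` bytes
    Ok(start.elapsed())
}

fn main() -> std::io::Result<()> {
    // Placeholder target; in practice a file on the mount we want to sample.
    let d = probe_read_latency("/etc/hostname", 64)?;
    println!("probe read took {:?}", d);
    Ok(())
}
```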
I've been looking at this but not commenting, apparently. It looks like waiting for disk writes is not really a thing; they happen in the background. So (for disk) it's mostly about waiting for reads, and not just explicit reads but also page-ins from mapped executables and mapped files. I believe htop presents some data about this, and the first order of business is to dig into that (documentation, code) to see if it leads anywhere.
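One counter that is cheap to get at and relates directly to page-ins is the per-process major-fault count (majflt, field 12 of /proc/&lt;pid&gt;/stat), which counts faults that had to go to the backing store. A hedged sketch, again assuming Linux procfs and not describing what htop actually does:

```rust
// Sketch: major page faults as one visible trace of read/page-in waits.
use std::fs;

/// Major page faults for `pid` (field 12, `majflt`, of /proc/<pid>/stat).
fn read_majflt(pid: u32) -> Option<u64> {
    let text = fs::read_to_string(format!("/proc/{}/stat", pid)).ok()?;
    // comm (field 2) may contain spaces; fields after ')' start at field 3.
    let rest = &text[text.rfind(')')? + 2..];
    rest.split_whitespace().nth(12 - 3)?.parse().ok()
}

fn main() {
    let pid = std::process::id();
    println!("majflt for pid {}: {:?}", pid, read_majflt(pid));
}
```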
This recipe produces the desired results on my Ubuntu 22 (Linux 6.5) laptop, but it does not work on a Saga login node (Linux 5.14): I get the "Avg" display but not the detailed breakdown. Given how old that post is, it's probably how the kernel is configured, not its version, that is the issue.
I think this is the most important issue we can work on in the near future, though I don't know how good we can make it. It's easy to become I/O bound without knowing it except by observing low CPU/GPU utilization, which is a weak signal. Having something actionable (I/O wait, especially read wait) would be an immensely useful first piece of information. I have a support case right now where an ML model reads 2.5e6 small files from a network disk, and I'm fairly sure it's I/O bound, but actually proving that with the tools available on the nodes has turned out to be difficult.
Thinking ... I would like to first learn how to identify an I/O bottleneck without Sonar. With …
An I/O bottleneck (mostly reads) could probably be defined as having a process that could run if there were input but is waiting, while there are cores that are idle. I'm not sure how to measure that precisely. The problem in the support case I mentioned turned out to be that the job was underprovisioned with CPU, so the I/O workers got input fast enough but could not be scheduled to run. In some sense you could observe that as all CPUs being busy all the time. (Moving the data to local scratch evened out the profile somewhat, because input came from a fast disk and not the network, but it did not solve the problem because the job was not actually I/O bound.) We had an offline conversation about I/O monitoring the other day and I'm still exploring that, but to a significant extent it appears to be device-specific and not something we can build generic monitoring for.
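A rough, system-wide version of that definition (assuming Linux procfs; a per-job version would have to filter pids by cgroup or Slurm job) would be to count tasks in uninterruptible sleep, state 'D', which usually means blocked on I/O, and compare that with CPU idle time. A sketch:

```rust
// Sketch of the heuristic above: tasks blocked in 'D' state while CPUs idle.
use std::fs;

/// Count processes currently in uninterruptible sleep (state 'D').
fn count_d_state() -> usize {
    let mut n = 0;
    if let Ok(entries) = fs::read_dir("/proc") {
        for entry in entries.flatten() {
            let name = entry.file_name();
            let pid = match name.to_str().and_then(|s| s.parse::<u32>().ok()) {
                Some(p) => p,
                None => continue, // not a process directory
            };
            if let Ok(stat) = fs::read_to_string(format!("/proc/{}/stat", pid)) {
                if let Some(i) = stat.rfind(')') {
                    if stat[i + 1..].trim_start().starts_with('D') {
                        n += 1;
                    }
                }
            }
        }
    }
    n
}

/// Idle fraction of total CPU time since boot (aggregate "cpu" line of /proc/stat).
/// A real monitor would sample twice and diff to get a rate over an interval.
fn idle_fraction() -> Option<f64> {
    let text = fs::read_to_string("/proc/stat").ok()?;
    let fields: Vec<u64> = text
        .lines()
        .next()?
        .split_whitespace()
        .skip(1) // skip the "cpu" label
        .filter_map(|f| f.parse().ok())
        .collect();
    let total: u64 = fields.iter().sum();
    if total == 0 {
        return None;
    }
    Some(*fields.get(3)? as f64 / total as f64) // 4th value is idle time
}

fn main() {
    println!(
        "tasks in D state: {}, CPU idle fraction since boot: {:?}",
        count_d_state(),
        idle_fraction()
    );
}
```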
The use case here is jobs that are "unexpectedly slow"; we want to know whether this is because they are I/O bound or are held up by slow I/O. For example, on interactive nodes (login nodes, Fox int* nodes, UiO ML nodes) memory can be oversubscribed and the system can be paging, or there can be a shared disk that is being hammered and is holding up progress (the latter seems to be an issue on Saga login nodes, which are deadly slow even though very little computation actually happens there).
As for #67, let's try to collect data if we can, and see if we can't surface it in some sensible way in Jobanalyzer.
Also see NAICNO/Jobanalyzer#399.