Log disk I/O #135
If a job is not computing, it's either descheduled or in I/O wait, but ideally we want to distinguish disk from tty from network, and really-ideally also distinguish the different interfaces or devices. On an HPC node with 128 cores there can be many jobs running at the same time, and this is especially true of login and interactive nodes. So it's not quite enough to account for whole-system I/O wait (even if that might be better than nothing). But all that said, there's no way to say objectively that "there's too much I/O wait" if a job has threads that can make progress while other threads are waiting. "Too much" is relative to an expectation; even on a superfast disk there will be some I/O wait. One measure that might make sense is average wait (or better, time) per I/O operation; that removes sonar/Jobanalyzer from the business of judging whether something is slow or fast, waiting or busy. An I/O operation count would also be helpful. Of course, going down that path one could imagine a distribution of timings by count, but I don't expect the kernel keeps that around.
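Most of the raw numbers mentioned here are already exposed per process by Linux procfs. The following is a minimal sketch (not sonar code) of how one could sample read/write syscall counts from /proc/&lt;pid&gt;/io and accumulated block-I/O delay (delayacct_blkio_ticks, field 42 of /proc/&lt;pid&gt;/stat) to derive an average delay per I/O operation; the delay field may read as zero unless the kernel has task delay accounting enabled.

```rust
// Sketch: per-process I/O counters and block-I/O delay from procfs.
// Assumes Linux. delayacct_blkio_ticks may be 0 unless delay accounting is
// enabled (CONFIG_TASK_DELAY_ACCT and, on newer kernels, the `delayacct`
// boot option or kernel.task_delayacct=1). Reading another user's /proc
// entries may also require privileges.
use std::fs;

fn read_io_counters(pid: u32) -> Option<(u64, u64)> {
    // /proc/<pid>/io: syscr/syscw are read/write syscall counts.
    let text = fs::read_to_string(format!("/proc/{}/io", pid)).ok()?;
    let mut syscr = 0u64;
    let mut syscw = 0u64;
    for line in text.lines() {
        if let Some(v) = line.strip_prefix("syscr: ") {
            syscr = v.trim().parse().ok()?;
        } else if let Some(v) = line.strip_prefix("syscw: ") {
            syscw = v.trim().parse().ok()?;
        }
    }
    Some((syscr, syscw))
}

fn read_blkio_ticks(pid: u32) -> Option<u64> {
    // Field 42 of /proc/<pid>/stat is delayacct_blkio_ticks (clock ticks).
    // The comm field (2) may contain spaces, so split only after the ')'.
    let text = fs::read_to_string(format!("/proc/{}/stat", pid)).ok()?;
    let rest = &text[text.rfind(')')? + 2..]; // rest starts at field 3
    rest.split_whitespace().nth(42 - 3)?.parse().ok()
}

fn main() {
    let pid = std::process::id();
    if let (Some((syscr, syscw)), Some(ticks)) = (read_io_counters(pid), read_blkio_ticks(pid)) {
        let ops = (syscr + syscw).max(1);
        println!(
            "syscr={} syscw={} blkio_ticks={} avg_ticks_per_op={:.3}",
            syscr,
            syscw,
            ticks,
            ticks as f64 / ops as f64
        );
    }
}
```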
But would sonar then make regular, well-defined reads and writes and measure how long they take?
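For concreteness, such a probe might look like the sketch below. This is only an assumption about what "regular, well-defined reads" could mean, not an existing sonar feature; the target path and read size are placeholders, and a naive version mostly measures page-cache latency unless the cache is cold or bypassed.

```rust
// Hypothetical timed-read probe, not sonar code.
use std::fs::File;
use std::io::Read;
use std::time::{Duration, Instant};

/// Time a single small read from `path`.
fn probe_read_latency(path: &str, bytes: usize) -> std::io::Result<Duration> {
    let mut buf = vec![0u8; bytes];
    let mut f = File::open(path)?;
    let start = Instant::now();
    let _n = f.read(&mut buf)?; // reads up to `bytes` bytes
    Ok(start.elapsed())
}

fn main() -> std::io::Result<()> {
    // Placeholder target; in practice a file on the mount we want to sample.
    let d = probe_read_latency("/etc/hostname", 64)?;
    println!("probe read took {:?}", d);
    Ok(())
}
```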
I've been looking at this but not commenting, apparently. It looks like waiting for disk writes is not really a thing; they happen in the background. So (for disk) it's mostly about waiting for reads, and not just explicit reads but also page-ins from mapped executables and mapped files. I believe htop presents some data about this, and the first order of business is to dig into that (documentation, code) to see if it leads anywhere.
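One counter that is cheap to get at and relates directly to page-ins is the per-process major-fault count (majflt, field 12 of /proc/&lt;pid&gt;/stat), which counts faults that had to go to the backing store. A hedged sketch, again assuming Linux procfs and not describing what htop actually does:

```rust
// Sketch: major page faults as one visible trace of read/page-in waits.
use std::fs;

/// Major page faults for `pid` (field 12, `majflt`, of /proc/<pid>/stat).
fn read_majflt(pid: u32) -> Option<u64> {
    let text = fs::read_to_string(format!("/proc/{}/stat", pid)).ok()?;
    // comm (field 2) may contain spaces; fields after ')' start at field 3.
    let rest = &text[text.rfind(')')? + 2..];
    rest.split_whitespace().nth(12 - 3)?.parse().ok()
}

fn main() {
    let pid = std::process::id();
    println!("majflt for pid {}: {:?}", pid, read_majflt(pid));
}
```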
This recipe produces the desired results on my Ubuntu 22 (Linux 6.5) laptop, but it does not work on a Saga login node (Linux 5.14): I get the "Avg" display but not the detailed breakdown. Given how old that post is, it's probably how the kernel is configured, not its version, that is the issue.
I think this is the most important issue we can work on in the near future, though I don't know how good we can make it. It's easy to become I/O bound without knowing it except by observing low CPU/GPU utilization, which is a weak signal. Having something actionable (I/O wait, especially read wait) would be an immensely useful first piece of information. I have a support case right now where an ML model reads 2.5e6 small files from a network disk, and I'm fairly sure it's I/O bound, but actually proving that with the tools available on the nodes has turned out to be difficult.
Thinking ... I would like to first learn how to identify an I/O bottleneck without Sonar. With …
An I/O bottleneck (mostly reads) could probably be defined as having a process that could run if there were input but is waiting, while there are cores that are idle. I'm not sure how to measure that precisely. The problem in the support case I mentioned turned out to be that the job was underprovisioned with CPU, so the I/O workers got input fast enough but could not be scheduled to run. In some sense you could observe that as all CPUs being busy all the time. (Moving the data to local scratch evened out the profile somewhat, because input came from a fast disk and not the network, but it did not solve the problem because the job was not actually I/O bound.) We had an offline conversation about I/O monitoring the other day and I'm still exploring that, but to a significant extent it appears to be device-specific and not something we can build generic monitoring for.
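A rough, system-wide version of that definition (assuming Linux procfs; a per-job version would have to filter pids by cgroup or Slurm job) would be to count tasks in uninterruptible sleep, state 'D', which usually means blocked on I/O, and compare that with CPU idle time. A sketch:

```rust
// Sketch of the heuristic above: tasks blocked in 'D' state while CPUs idle.
use std::fs;

/// Count processes currently in uninterruptible sleep (state 'D').
fn count_d_state() -> usize {
    let mut n = 0;
    if let Ok(entries) = fs::read_dir("/proc") {
        for entry in entries.flatten() {
            let name = entry.file_name();
            let pid = match name.to_str().and_then(|s| s.parse::<u32>().ok()) {
                Some(p) => p,
                None => continue, // not a process directory
            };
            if let Ok(stat) = fs::read_to_string(format!("/proc/{}/stat", pid)) {
                if let Some(i) = stat.rfind(')') {
                    if stat[i + 1..].trim_start().starts_with('D') {
                        n += 1;
                    }
                }
            }
        }
    }
    n
}

/// Idle fraction of total CPU time since boot (aggregate "cpu" line of /proc/stat).
/// A real monitor would sample twice and diff to get a rate over an interval.
fn idle_fraction() -> Option<f64> {
    let text = fs::read_to_string("/proc/stat").ok()?;
    let fields: Vec<u64> = text
        .lines()
        .next()?
        .split_whitespace()
        .skip(1) // skip the "cpu" label
        .filter_map(|f| f.parse().ok())
        .collect();
    let total: u64 = fields.iter().sum();
    if total == 0 {
        return None;
    }
    Some(*fields.get(3)? as f64 / total as f64) // 4th value is idle time
}

fn main() {
    println!(
        "tasks in D state: {}, CPU idle fraction since boot: {:?}",
        count_d_state(),
        idle_fraction()
    );
}
```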
The use case here is jobs that are "unexpectedly slow"; we want to know whether this is because they are I/O bound or are held up by slow I/O. For example, on interactive nodes (login nodes, Fox int* nodes, UiO ML nodes) memory can be oversubscribed and the system can be paging, or there can be a shared disk that is being hammered and is holding up progress (the latter seems to be an issue on Saga login nodes, which are deadly slow even though very little computation actually happens there).
As for #67, let's try to collect data if we can, and see if we can't surface it in some sensible way in Jobanalyzer.
Also see NAICNO/Jobanalyzer#399.