IO Tracing #194

Open · 2 tasks
cvonelm opened this issue Oct 8, 2021 · 7 comments

cvonelm commented Oct 8, 2021

  • find suitable data sources
  • find out how to map them to otf2xx IO records.

Focus on block level I/O and file level I/O first.

cvonelm self-assigned this Oct 8, 2021

cvonelm commented Oct 13, 2021

block level I/O

Block level I/O can be traced using these two tracepoints:

  • block:block_rq_insert: triggered when a request is inserted into the request queue
  • block:block_rq_complete: triggered when a request completes

Writing read-begin and read-end events from these tracepoints in process mode looks relatively easy.

However, we cannot use begin/end records in system mode, because there is no total order of block I/O issues and completions, so writing each event as a sample is the best we can do.

biosnoop from the bcc toolkit uses kprobes instead of tracepoints, but as far as I can see the kprobes are not that different from the tracepoints above.

For now I think we should stick with tracepoints, because they don't require setup with perf probe, but we should keep kprobes in mind if we come across a place where the tracepoint approach misses critical information.
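To make the tracepoint approach concrete, here is a minimal C sketch of how one of the block tracepoints could be opened through perf_event_open. The tracefs path, the helper names, and the bare-bones error handling are assumptions of this sketch, not lo2s code; on older systems the id file lives under /sys/kernel/debug/tracing instead.

```c
// Minimal sketch: open block:block_rq_insert as a perf tracepoint event,
// system-wide on one CPU. The ring-buffer mmap/read loop is omitted.
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

// Read the tracepoint id from tracefs (assumed mounted at /sys/kernel/tracing).
static long tracepoint_id(const char *category, const char *name)
{
    char path[256];
    snprintf(path, sizeof(path), "/sys/kernel/tracing/events/%s/%s/id",
             category, name);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    long id = -1;
    if (fscanf(f, "%ld", &id) != 1)
        id = -1;
    fclose(f);
    return id;
}

// Hypothetical helper: open the block_rq_insert tracepoint on one CPU.
static int open_block_insert_event(int cpu)
{
    long id = tracepoint_id("block", "block_rq_insert");
    if (id < 0)
        return -1;

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_TRACEPOINT;
    attr.config = (uint64_t)id;
    attr.sample_period = 1;                          // record every event
    attr.sample_type = PERF_SAMPLE_TIME | PERF_SAMPLE_RAW;
    attr.disabled = 1;

    // pid = -1 with a fixed cpu -> system-wide monitoring on that CPU
    return (int)syscall(SYS_perf_event_open, &attr, -1, cpu, -1, 0);
}

int main(void)
{
    int fd = open_block_insert_event(0);
    if (fd < 0) {
        perror("perf_event_open");  // typically needs root or a relaxed perf_event_paranoid
        return 1;
    }
    printf("block:block_rq_insert opened on CPU 0, fd = %d\n", fd);
    close(fd);
    return 0;
}
```

The PERF_SAMPLE_RAW payload of each sample then carries the tracepoint's fields as described in its format file, which is what would have to be translated into OTF2 I/O samples.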


cvonelm commented Oct 15, 2021

file level I/O

File Level I/O should be traceable by using:

  • syscalls:sys_enter_open
  • syscalls:sys_(enter|exit)_read
  • syscalls:sys_(enter|exit)_write
  • syscalls:sys_exit_close

At least it would be nice if it worked like this, because these seem to be the only file level tracepoints I've seen. However, open, read, write, and close definitely don't cover the whole zoo of file level operations, and we will probably miss a bunch (mmap? the dozen variants of those syscalls like openat/writev ...? And things that bypass the classical POSIX interface altogether).

Alternatively, there is kprobe-based tracing directly at the virtual file system layer.

This would use kprobes on vfs_open/vfs_read/vfs_write/vfs_close.

But as I already said in the comment above, decent kprobe support in lo2s might be a real pain. It also runs into the issue that some information is only available as pointers into kernel memory, like a char *filename, which we cannot access from lo2s unless we copy the data into user space using BPF (BPF has access to kernel memory). Then again, going through the trouble of setting up BPF just to copy some memory out of the kernel sounds like using an ICBM to kill a fly. The other option, mmap()ing /dev/mem and accessing kernel memory that way, sounds like the mother of all hacks.
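One thing worth checking before going down the kprobe/BPF road: the syscall tracepoints appear to record only the raw syscall arguments, so a filename shows up as a pointer value in the PERF_SAMPLE_RAW payload and the string itself would have to be resolved separately in any case. The tracepoint's format file spells out exactly which fields we get. A tiny C sketch that just dumps it (the tracefs path and the function name are assumptions of the sketch):

```c
// Minimal sketch: print the fields a tracepoint exposes, e.g. for
// syscalls:sys_enter_openat. These are exactly the fields that land in
// the PERF_SAMPLE_RAW payload of a perf sample for this tracepoint.
#include <stdio.h>

static int dump_tracepoint_format(const char *category, const char *name)
{
    char path[256];
    snprintf(path, sizeof(path), "/sys/kernel/tracing/events/%s/%s/format",
             category, name);

    FILE *f = fopen(path, "r");
    if (!f) {
        perror(path);
        return -1;
    }

    char line[512];
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);  // one "field: ...; offset: ...; size: ...;" per line

    fclose(f);
    return 0;
}

int main(void)
{
    return dump_tracepoint_format("syscalls", "sys_enter_openat");
}
```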


bmario commented Oct 15, 2021

A detailed overview of the storage stack in Linux:

https://www.thomas-krenn.com/en/wiki/Linux_Storage_Stack_Diagram


cvonelm commented Oct 20, 2021

(Information based on "Understanding the Linux Kernel", which is based on kernel 2.6, and the kernel source code for 5.something)

Is there an advantage to tracing vfs_open/vfs_read/... over just tracing the syscalls?

No. The only thing vfs_open/vfs_read etc. apparently do is wrap the open/read/... syscalls.

Is there a generic layer below vfs_open/vfs_read without cache effects?

No. The only thing vfs_read does is look up which filesystem handles the file and then delegate the call down to the filesystem-specific read(). Caching is handled entirely by the filesystem drivers (which makes sense, because not all filesystems need caching; procfs, for example, is in memory anyway and contains dynamically generated content).
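For illustration, a tiny user-space model of that delegation (this is not kernel code; all struct and function names are invented for the sketch): the "VFS layer" is nothing but a dispatch through the file's operations table, and anything like caching lives behind the filesystem-specific read.

```c
// Toy model of the vfs_read() delegation described above -- illustration
// only, all names are invented for this sketch.
#include <stddef.h>
#include <stdio.h>

struct file_model;

struct file_operations_model {
    // filesystem-specific read, e.g. what ext4 or procfs would provide
    long (*read)(struct file_model *file, char *buf, size_t count);
};

struct file_model {
    const struct file_operations_model *f_op;  // set when the file is opened
    const char *name;
};

// The "VFS layer": look up the handler and delegate, nothing else.
static long vfs_read_model(struct file_model *file, char *buf, size_t count)
{
    if (!file->f_op || !file->f_op->read)
        return -1;
    return file->f_op->read(file, buf, count);
}

// A fake procfs-style read that generates its content on the fly (no cache).
static long procfs_read_model(struct file_model *file, char *buf, size_t count)
{
    return snprintf(buf, count, "dynamically generated content for %s\n",
                    file->name);
}

int main(void)
{
    static const struct file_operations_model procfs_ops = { .read = procfs_read_model };
    struct file_model f = { .f_op = &procfs_ops, .name = "/proc/example" };

    char buf[128];
    long n = vfs_read_model(&f, buf, sizeof(buf));
    if (n > 0)
        fwrite(buf, 1, (size_t)n, stdout);
    return 0;
}
```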


cvonelm commented Oct 22, 2021

fs stack

I hope this is a half-way legible representation of what I've learned about the fs stack this week.

The arrow labeled "Probe Here?", which is the point at which the fs-dependent readpage() operation is called in generic_file_buffered_read(), would be the place where we could learn whether a read on a disk-based filesystem* triggered an actual read from disk or was served entirely from the page cache.

The problem is that, while generic_file_read_iter() is very mature and stable code that rarely changes, hard-coding a specific offset inside generic_file_read_iter() still seems like something that breaks very easily.

Instrumenting the readpage() functions of the different filesystems probably does not work either, because readpage() is used both by reads that had cache misses and by readahead, and we are probably only interested in real cache misses, not in readahead doing its work.

*If the disk-based fs actually uses generic_file_read_iter(), which almost all, but not all, do.
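To make the "Probe Here?" point concrete, here is a very rough model of the control flow of the buffered read path (all names invented for the sketch; the real code lives in generic_file_buffered_read() and changes between kernel versions): in this simplified picture the fs-specific readpage() is only reached on a page cache miss.

```c
// Very rough model of the buffered read control flow described above --
// illustration only, all names invented. The "Probe Here?" arrow from the
// diagram corresponds to the readpage_model() call below.
#include <stdbool.h>
#include <stdio.h>

#define NPAGES 8

static bool page_cached[NPAGES];   // toy page cache, index = page number

// Stands in for the fs-specific readpage(): real disk I/O would happen here.
static void readpage_model(unsigned long index)
{
    printf("page %lu: cache miss -> actual read from disk\n", index);
    page_cached[index] = true;     // the page is cached afterwards
}

static void buffered_read_model(unsigned long first, unsigned long last)
{
    for (unsigned long index = first; index <= last && index < NPAGES; index++) {
        if (page_cached[index]) {
            printf("page %lu: served from the page cache\n", index);
            continue;
        }
        // <-- "Probe Here?": only page cache misses get this far
        readpage_model(index);
    }
}

int main(void)
{
    page_cached[0] = page_cached[1] = true;   // pretend pages 0 and 1 are cached
    buffered_read_model(0, 3);
    return 0;
}
```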

@blastmaster

  1. block I/O via tracepoints
  2. sys_* syscalls via tracepoints


bmario commented May 11, 2022

file level I/O

File Level I/O should be traceable by using:

  • syscalls:sys_enter_open
  • syscalls:sys_(enter|exit)_read
  • syscalls:sys_(enter|exit)_write
  • syscalls:sys_exit_close

Actually, it seems like nobody is using the open syscall anymore; everything uses openat instead.

perf record -e syscalls:sys_enter_open -e syscalls:sys_enter_openat -e syscalls:sys_enter_open_by_handle_at -e syscalls:sys_enter_mq_open -e syscalls:sys_enter_fsopen -a

0 syscalls:sys_enter_open
661K syscalls:sys_enter_openat
0 syscalls:sys_enter_open_by_handle_at
0 syscalls:sys_enter_mq_open
0 syscalls:sys_enter_fsopen
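So lo2s would have to subscribe to the whole open* family rather than plain open. A small sketch that discovers the variants the running kernel actually exposes by globbing the tracefs events directory (path and program structure are assumptions of the sketch):

```c
// Sketch: list every sys_enter_open* tracepoint the running kernel exposes,
// so openat/openat2/open_by_handle_at/... are not missed.
// Assumes tracefs is mounted at /sys/kernel/tracing.
#include <glob.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    glob_t g;
    if (glob("/sys/kernel/tracing/events/syscalls/sys_enter_open*",
             0, NULL, &g) != 0) {
        fprintf(stderr, "no open-family tracepoints found (is tracefs mounted?)\n");
        return 1;
    }

    for (size_t i = 0; i < g.gl_pathc; i++) {
        // print just the event name, e.g. "syscalls:sys_enter_openat"
        const char *name = strrchr(g.gl_pathv[i], '/');
        printf("syscalls:%s\n", name ? name + 1 : g.gl_pathv[i]);
    }

    globfree(&g);
    return 0;
}
```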
