IO Tracing #194

Open · 2 tasks
cvonelm opened this issue Oct 8, 2021 · 7 comments

cvonelm commented Oct 8, 2021

  • find suitable data sources
  • find out how to map them to otf2xx IO records.

Focus on block level I/O and file level I/O first.

cvonelm self-assigned this Oct 8, 2021

cvonelm commented Oct 13, 2021

block level I/O

Block level I/O can be traced using these two tracepoints:

  • block:block_rq_insert: triggered when a request is inserted into the request queue
  • block:block_rq_complete: triggered when a request completes

Writing read-begin and read-end events from these tracepoints in process mode looks relatively easy.

However, we cannot use begin/end records in system mode, because there is no total order of block I/O issues and completions, so writing each event as a sample is the best we can do.

biosnoop from the bcc toolkit uses kprobes instead of tracepoints, but as far as I can see the kprobes are not that different from the tracepoints above.

For now I think we should stick with tracepoints, because they don't require setup with perf probe, but we should keep kprobes in mind if we come across a place where the tracepoint approach misses critical information.
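To make the tracepoint approach concrete, here is a minimal C sketch of how one of the block tracepoints could be opened through perf_event_open. The tracefs path, the helper names, and the bare-bones error handling are assumptions of this sketch, not lo2s code; on older systems the id file lives under /sys/kernel/debug/tracing instead.

```c
// Minimal sketch: open block:block_rq_insert as a perf tracepoint event,
// system-wide on one CPU. The ring-buffer mmap/read loop is omitted.
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

// Read the tracepoint id from tracefs (assumed mounted at /sys/kernel/tracing).
static long tracepoint_id(const char *category, const char *name)
{
    char path[256];
    snprintf(path, sizeof(path), "/sys/kernel/tracing/events/%s/%s/id",
             category, name);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    long id = -1;
    if (fscanf(f, "%ld", &id) != 1)
        id = -1;
    fclose(f);
    return id;
}

// Hypothetical helper: open the block_rq_insert tracepoint on one CPU.
static int open_block_insert_event(int cpu)
{
    long id = tracepoint_id("block", "block_rq_insert");
    if (id < 0)
        return -1;

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_TRACEPOINT;
    attr.config = (uint64_t)id;
    attr.sample_period = 1;                          // record every event
    attr.sample_type = PERF_SAMPLE_TIME | PERF_SAMPLE_RAW;
    attr.disabled = 1;

    // pid = -1 with a fixed cpu -> system-wide monitoring on that CPU
    return (int)syscall(SYS_perf_event_open, &attr, -1, cpu, -1, 0);
}

int main(void)
{
    int fd = open_block_insert_event(0);
    if (fd < 0) {
        perror("perf_event_open");  // typically needs root or a relaxed perf_event_paranoid
        return 1;
    }
    printf("block:block_rq_insert opened on CPU 0, fd = %d\n", fd);
    close(fd);
    return 0;
}
```

The PERF_SAMPLE_RAW payload of each sample then carries the tracepoint's fields as described in its format file, which is what would have to be translated into OTF2 I/O samples.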


cvonelm commented Oct 15, 2021

file level I/O

File Level I/O should be traceable by using:

  • syscalls:sys_enter_open
  • syscalls:sys_(enter|exit)_read
  • syscalls:sys_(enter|exit)_write
  • syscalls:sys_exit_close

At least it would be nice if it worked like this, because these seem to be the only file level tracepoints I've seen. However, open, read, write, and close definitely don't cover the whole zoo of file level operations, and we will probably miss a bunch (mmap? the dozen variants of those syscalls like openat/writev ...? And things that bypass the classical POSIX interface altogether).

Alternatively, there is kprobe-based tracing directly at the virtual file system layer.

This would use kprobes on vfs_open/vfs_read/vfs_write/vfs_close.

But as I already said in the comment above, decent kprobe support in lo2s might be a real pain. It also runs into the issue that some information is only available as pointers into kernel memory, like a char *filename, which we cannot access from lo2s unless we copy the data into user space using BPF (BPF has access to kernel memory). Then again, going through the trouble of setting up BPF just to copy some memory out of the kernel sounds like using an ICBM to kill a fly. The other option, mmap()ing /dev/mem and accessing kernel memory that way, sounds like the mother of all hacks.
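One thing worth checking before going down the kprobe/BPF road: the syscall tracepoints appear to record only the raw syscall arguments, so a filename shows up as a pointer value in the PERF_SAMPLE_RAW payload and the string itself would have to be resolved separately in any case. The tracepoint's format file spells out exactly which fields we get. A tiny C sketch that just dumps it (the tracefs path and the function name are assumptions of the sketch):

```c
// Minimal sketch: print the fields a tracepoint exposes, e.g. for
// syscalls:sys_enter_openat. These are exactly the fields that land in
// the PERF_SAMPLE_RAW payload of a perf sample for this tracepoint.
#include <stdio.h>

static int dump_tracepoint_format(const char *category, const char *name)
{
    char path[256];
    snprintf(path, sizeof(path), "/sys/kernel/tracing/events/%s/%s/format",
             category, name);

    FILE *f = fopen(path, "r");
    if (!f) {
        perror(path);
        return -1;
    }

    char line[512];
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);  // one "field: ...; offset: ...; size: ...;" per line

    fclose(f);
    return 0;
}

int main(void)
{
    return dump_tracepoint_format("syscalls", "sys_enter_openat");
}
```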


bmario commented Oct 15, 2021

A detailed overview of the storage stack in Linux:

https://www.thomas-krenn.com/en/wiki/Linux_Storage_Stack_Diagram


cvonelm commented Oct 20, 2021

(Information based on "Understanding the Linux Kernel", which is based on kernel 2.6, and the kernel source code for 5.something)

Is there an advantage to tracing vfs_open/vfs_read/... over just tracing the syscalls?

No. The only thing vfs_open/vfs_read etc. apparently do is wrap the open/read/... syscalls.

Is there a generic layer below vfs_open/vfs_read without cache effects?

No. The only thing vfs_read does is look up which filesystem handles the file and then delegate the call down to the filesystem-specific read(). Caching is handled entirely by the filesystem drivers (which makes sense, because not all filesystems need caching; procfs, for example, is in memory anyway and contains dynamically generated content).
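For illustration, a tiny user-space model of that delegation (this is not kernel code; all struct and function names are invented for the sketch): the "VFS layer" is nothing but a dispatch through the file's operations table, and anything like caching lives behind the filesystem-specific read.

```c
// Toy model of the vfs_read() delegation described above -- illustration
// only, all names are invented for this sketch.
#include <stddef.h>
#include <stdio.h>

struct file_model;

struct file_operations_model {
    // filesystem-specific read, e.g. what ext4 or procfs would provide
    long (*read)(struct file_model *file, char *buf, size_t count);
};

struct file_model {
    const struct file_operations_model *f_op;  // set when the file is opened
    const char *name;
};

// The "VFS layer": look up the handler and delegate, nothing else.
static long vfs_read_model(struct file_model *file, char *buf, size_t count)
{
    if (!file->f_op || !file->f_op->read)
        return -1;
    return file->f_op->read(file, buf, count);
}

// A fake procfs-style read that generates its content on the fly (no cache).
static long procfs_read_model(struct file_model *file, char *buf, size_t count)
{
    return snprintf(buf, count, "dynamically generated content for %s\n",
                    file->name);
}

int main(void)
{
    static const struct file_operations_model procfs_ops = { .read = procfs_read_model };
    struct file_model f = { .f_op = &procfs_ops, .name = "/proc/example" };

    char buf[128];
    long n = vfs_read_model(&f, buf, sizeof(buf));
    if (n > 0)
        fwrite(buf, 1, (size_t)n, stdout);
    return 0;
}
```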


cvonelm commented Oct 22, 2021

fs stack

I hope this is a half-way legible representation of what I've learned about the fs stack this week.

The arrow labeled "Probe Here?", which is the point at which the fs-dependent readpage() operation is called in generic_file_buffered_read(), would be the place where we could learn whether a read on a disk-based filesystem* triggered an actual read from disk or was served entirely from the page cache.

The problem is that, while generic_file_read_iter() is very mature and stable code that rarely changes, hard-coding a specific offset inside generic_file_read_iter() still seems like something that breaks very easily.

Instrumenting the readpage() functions of the different filesystems probably does not work either, because readpage() is used both by reads that had cache misses and by readahead, and we are probably only interested in real cache misses, not in readahead doing its work.

*If the disk-based fs actually uses generic_file_read_iter(), which almost all, but not all, do.
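To make the "Probe Here?" point concrete, here is a very rough model of the control flow of the buffered read path (all names invented for the sketch; the real code lives in generic_file_buffered_read() and changes between kernel versions): in this simplified picture the fs-specific readpage() is only reached on a page cache miss.

```c
// Very rough model of the buffered read control flow described above --
// illustration only, all names invented. The "Probe Here?" arrow from the
// diagram corresponds to the readpage_model() call below.
#include <stdbool.h>
#include <stdio.h>

#define NPAGES 8

static bool page_cached[NPAGES];   // toy page cache, index = page number

// Stands in for the fs-specific readpage(): real disk I/O would happen here.
static void readpage_model(unsigned long index)
{
    printf("page %lu: cache miss -> actual read from disk\n", index);
    page_cached[index] = true;     // the page is cached afterwards
}

static void buffered_read_model(unsigned long first, unsigned long last)
{
    for (unsigned long index = first; index <= last && index < NPAGES; index++) {
        if (page_cached[index]) {
            printf("page %lu: served from the page cache\n", index);
            continue;
        }
        // <-- "Probe Here?": only page cache misses get this far
        readpage_model(index);
    }
}

int main(void)
{
    page_cached[0] = page_cached[1] = true;   // pretend pages 0 and 1 are cached
    buffered_read_model(0, 3);
    return 0;
}
```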

@blastmaster

  1. block I/O via tracepoints
  2. sys_* syscalls via tracepoints


bmario commented May 11, 2022

file level I/O

File Level I/O should be traceable by using:

  • syscalls:sys_enter_open
  • syscalls:sys_(enter|exit)_read
  • syscalls:sys_(enter|exit)_write
  • syscalls:sys_exit_close

Actually, it seems like nobody is using the open syscall anymore; everything uses openat instead.

perf record -e syscalls:sys_enter_open -e syscalls:sys_enter_openat -e syscalls:sys_enter_open_by_handle_at -e syscalls:sys_enter_mq_open -e syscalls:sys_enter_fsopen -a

0 syscalls:sys_enter_open
661K syscalls:sys_enter_openat
0 syscalls:sys_enter_open_by_handle_at
0 syscalls:sys_enter_mq_open
0 syscalls:sys_enter_fsopen
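So lo2s would have to subscribe to the whole open* family rather than plain open. A small sketch that discovers the variants the running kernel actually exposes by globbing the tracefs events directory (path and program structure are assumptions of the sketch):

```c
// Sketch: list every sys_enter_open* tracepoint the running kernel exposes,
// so openat/openat2/open_by_handle_at/... are not missed.
// Assumes tracefs is mounted at /sys/kernel/tracing.
#include <glob.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    glob_t g;
    if (glob("/sys/kernel/tracing/events/syscalls/sys_enter_open*",
             0, NULL, &g) != 0) {
        fprintf(stderr, "no open-family tracepoints found (is tracefs mounted?)\n");
        return 1;
    }

    for (size_t i = 0; i < g.gl_pathc; i++) {
        // print just the event name, e.g. "syscalls:sys_enter_openat"
        const char *name = strrchr(g.gl_pathv[i], '/');
        printf("syscalls:%s\n", name ? name + 1 : g.gl_pathv[i]);
    }

    globfree(&g);
    return 0;
}
```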
