Add zero-compromise directory iteration #457

Merged: 10 commits merged into bytecodealliance:main on Nov 23, 2022
Conversation

@SUPERCILEX (Contributor) commented Nov 21, 2022

Closes #451

Background reading:

Notes

Based on the above background reading, the following assumptions appear to be sound:

  • The kernel will align the dirents it returns, and they will be contiguous in the buffer.
  • Seeking to garbage offsets is safe because the kernel will not return partial dirents. In fact, I've discovered that d_off is actually a cookie you can use to seek to the next dirent; it has no relation to byte offsets. (See the parsing sketch below for how these assumptions get used.)
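To make those assumptions concrete, here's a minimal, hand-rolled sketch of walking a getdents64 buffer. It assumes the kernel's linux_dirent64 layout (u64 d_ino, i64 d_off, u16 d_reclen, u8 d_type, then a NUL-terminated d_name) and is purely illustrative, not the code this PR adds:

    use std::ffi::CStr;

    /// Walk the records the kernel wrote into `buf[..len]`.
    ///
    /// Relies on the assumptions above: records are contiguous and never
    /// partial, and each record's d_reclen covers the whole entry (name
    /// included), so adding it lands exactly on the start of the next record.
    unsafe fn for_each_entry(buf: &[u8], len: usize, mut f: impl FnMut(&CStr)) {
        const NAME_OFFSET: usize = 8 + 8 + 2 + 1; // d_name starts at byte 19
        let mut offset = 0;
        while offset < len {
            let record = buf.as_ptr().add(offset);
            // d_reclen sits at byte 16 of each record.
            let d_reclen = u16::from_ne_bytes([*record.add(16), *record.add(17)]);
            f(CStr::from_ptr(record.add(NAME_OFFSET).cast()));
            offset += usize::from(d_reclen);
        }
    }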

Benchmarks

Benchmark 1: ./nix-raw /tmp/ftzz-test
  Time (mean ± σ):     197.4 ms ±   5.2 ms    [User: 5.8 ms, System: 190.1 ms]
  Range (min … max):   191.2 ms … 208.3 ms    15 runs
 
Benchmark 2: ./rustix-raw /tmp/ftzz-test
  Time (mean ± σ):     190.9 ms ±   4.1 ms    [User: 2.4 ms, System: 188.0 ms]
  Range (min … max):   186.7 ms … 202.1 ms    15 runs
 
Benchmark 3: ./nix /tmp/ftzz-test
  Time (mean ± σ):     228.7 ms ±   7.2 ms    [User: 39.9 ms, System: 188.7 ms]
  Range (min … max):   223.0 ms … 250.2 ms    13 runs
 
Benchmark 4: ./rustix /tmp/ftzz-test
  Time (mean ± σ):     246.3 ms ±   6.9 ms    [User: 40.9 ms, System: 201.8 ms]
  Range (min … max):   235.5 ms … 256.5 ms    11 runs
 
Benchmark 5: ./stdlib /tmp/ftzz-test
  Time (mean ± σ):     237.6 ms ±   5.2 ms    [User: 56.0 ms, System: 180.0 ms]
  Range (min … max):   232.0 ms … 246.8 ms    12 runs
 
Summary
  './rustix-raw /tmp/ftzz-test' ran
    1.03 ± 0.04 times faster than './nix-raw /tmp/ftzz-test'
    1.20 ± 0.05 times faster than './nix /tmp/ftzz-test'
    1.24 ± 0.04 times faster than './stdlib /tmp/ftzz-test'
    1.29 ± 0.05 times faster than './rustix /tmp/ftzz-test'

Optimal buf size analysis

I set up benchmarks using power-of-two-sized buffers (each rustixN binary below uses a 2^N-byte buffer). 2^13 seemed optimal, and it's also what the stdlib uses in its BufWriter and BufReader implementations.

Benchmark 1: ./rustix5 /tmp/ftzz-1
  Time (mean ± σ):     345.7 ms ±   2.8 ms    [User: 35.5 ms, System: 309.9 ms]
  Range (min … max):   343.4 ms … 353.1 ms    10 runs
 
Benchmark 2: ./rustix6 /tmp/ftzz-1
  Time (mean ± σ):     258.5 ms ±   1.9 ms    [User: 24.9 ms, System: 233.4 ms]
  Range (min … max):   255.3 ms … 260.8 ms    11 runs
 
Benchmark 3: ./rustix7 /tmp/ftzz-1
  Time (mean ± σ):     217.8 ms ±   3.4 ms    [User: 17.4 ms, System: 200.2 ms]
  Range (min … max):   214.6 ms … 228.4 ms    13 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Benchmark 4: ./rustix8 /tmp/ftzz-1
  Time (mean ± σ):     194.1 ms ±   0.9 ms    [User: 15.9 ms, System: 178.1 ms]
  Range (min … max):   192.9 ms … 196.1 ms    15 runs
 
Benchmark 5: ./rustix9 /tmp/ftzz-1
  Time (mean ± σ):     181.3 ms ±   0.8 ms    [User: 13.9 ms, System: 167.2 ms]
  Range (min … max):   179.0 ms … 182.7 ms    16 runs
 
Benchmark 6: ./rustix10 /tmp/ftzz-1
  Time (mean ± σ):     175.1 ms ±   1.6 ms    [User: 12.4 ms, System: 162.6 ms]
  Range (min … max):   171.0 ms … 177.6 ms    17 runs
 
Benchmark 7: ./rustix11 /tmp/ftzz-1
  Time (mean ± σ):     170.3 ms ±   2.2 ms    [User: 12.8 ms, System: 157.3 ms]
  Range (min … max):   167.6 ms … 173.1 ms    17 runs
 
Benchmark 8: ./rustix12 /tmp/ftzz-1
  Time (mean ± σ):     170.4 ms ±   0.6 ms    [User: 9.1 ms, System: 161.3 ms]
  Range (min … max):   170.0 ms … 172.1 ms    17 runs
 
Benchmark 9: ./rustix13 /tmp/ftzz-1
  Time (mean ± σ):     169.2 ms ±   1.8 ms    [User: 8.8 ms, System: 160.2 ms]
  Range (min … max):   165.4 ms … 173.1 ms    17 runs
 
Benchmark 10: ./rustix15 /tmp/ftzz-1
  Time (mean ± σ):     167.9 ms ±   2.2 ms    [User: 11.2 ms, System: 156.7 ms]
  Range (min … max):   164.7 ms … 172.5 ms    17 runs
 
Benchmark 11: ./rustix20 /tmp/ftzz-1
  Time (mean ± σ):     168.4 ms ±   1.0 ms    [User: 9.5 ms, System: 158.8 ms]
  Range (min … max):   167.7 ms … 171.8 ms    17 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Summary
  './rustix15 /tmp/ftzz-1' ran
    1.00 ± 0.01 times faster than './rustix20 /tmp/ftzz-1'
    1.01 ± 0.02 times faster than './rustix13 /tmp/ftzz-1'
    1.01 ± 0.02 times faster than './rustix11 /tmp/ftzz-1'
    1.01 ± 0.01 times faster than './rustix12 /tmp/ftzz-1'
    1.04 ± 0.02 times faster than './rustix10 /tmp/ftzz-1'
    1.08 ± 0.01 times faster than './rustix9 /tmp/ftzz-1'
    1.16 ± 0.02 times faster than './rustix8 /tmp/ftzz-1'
    1.30 ± 0.03 times faster than './rustix7 /tmp/ftzz-1'
    1.54 ± 0.02 times faster than './rustix6 /tmp/ftzz-1'
    2.06 ± 0.03 times faster than './rustix5 /tmp/ftzz-1'
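For context, here's how an 8192-byte buffer would typically be provided by calling code that reuses a Vec (my illustration, not code from this PR):

    use std::mem::MaybeUninit;

    fn reusable_buffer() {
        // A cached Vec whose uninitialized spare capacity serves as the
        // 2^13-byte read buffer that won the benchmarks above.
        let mut buf: Vec<u8> = Vec::with_capacity(8192);
        let spare: &mut [MaybeUninit<u8>] = buf.spare_capacity_mut();
        let _ = spare;
    }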

@sunfishcode (Member) left a comment

Thanks for working on this! I still need to read through the main implementation code another time, but here are a few initial review comments.

///
/// let fd = openat(cwd(), ".", OFlags::RDONLY | OFlags::DIRECTORY, Mode::empty()).unwrap();
///
/// let mut buf = [MaybeUninit::uninit(); 2048];
@sunfishcode (Member):

Can this example use DIR_BUF_LEN?

@SUPERCILEX (Contributor, Author):

Before making docs changes, I want to make sure we're on the same page since I think we're coming at this from different angles. I was trying to demo the edge cases while assuming that people would know to use Vec::spare_capacity_mut for everyday life, but that's probably a bad assumption.

Here's how I'd like people to use the API:

  • When using recursion, people should either use a vec or the stack. If using the stack, I'd lean towards a smaller value to minimize wasted space and skip stack probes, hence the 2048. If using the heap, then we can afford to waste more space, hence using 8192. So maybe this suggests having a constant is actually a bad idea? Instead we could tell people to "Use a buffer size of at least NAME_MAX+24 bytes. We suggest 2048 for stack allocated buffers and 8192 for heap allocated buffers." In practice, almost all file systems use 255 or smaller as their limit: https://en.wikipedia.org/wiki/Comparison_of_file_systems#Limits. Reiser appears to be the only deviant.
  • When not using recursion, people would ideally have a cached vec somewhere that they re-use.

Can we guarantee that DIR_BUF_LEN is long enough to support any path that the host OS supports (considering NAME_MAX)?

I don't think so unfortunately, which is why I'm leaning towards removing the constant. There are basically no guarantees on NAME_MAX: https://www.gnu.org/software/libc/manual/html_node/Limits-for-Files.html. Wikipedia has the current file system limits, but nothing prevents a future file system from removing the limit entirely for example.

Going back to the docs, what about having a simple heap example, a simple stack example, and then a production-ready buffer resizing example?
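For the simple heap example, I'm picturing something along these lines (a sketch: `RawDir` is my placeholder name for the iterator type, and only the `new(fd, buf)` constructor visible in the diff is assumed):

    use rustix::fs::{cwd, openat, Mode, OFlags};

    fn list_current_dir() {
        let fd = openat(cwd(), ".", OFlags::RDONLY | OFlags::DIRECTORY, Mode::empty()).unwrap();

        // Reusable heap buffer: 8192 bytes of spare capacity, per the
        // buffer-size benchmarks in the PR description.
        let mut buf: Vec<u8> = Vec::with_capacity(8192);
        let dir = RawDir::new(fd, buf.spare_capacity_mut());
        // ... iterate over `dir`'s entries here ...
        let _ = dir;
    }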

@sunfishcode (Member):

Going back to the docs, what about having a simple heap example, a simple stack example, and then a production-ready buffer resizing example?

That sounds good. I imagine you could even skip the simple heap example if you wanted to. I imagine the vast majority of Rust code will continue using std::fs::ReadDir to read directories, or walkdir or so, so I imagine the main audience for using this API directly will be people doing low-level optimization work.

For the stack example, I'm not comfortable encouraging people to use fixed-size buffers if OSes don't impose a limit. If we're not confident enough to assume a NAME_MAX exists, it feels awkward to suggest that users bake in numbers like 2048. Could we instead give guidance like, "only use this approach for reading directories with known layouts where all entries have names shorter than X" or so?

@SUPERCILEX (Contributor, Author):

Could we instead give some guidance like, "only use this approach for reading directories with known layouts where all entries have names less than X" or so?

I think that's reasonable; I added clarifications to the simple stack and heap examples. Happy to remove either if you'd rather not have the example at all.
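Concretely, the stack example now carries a caveat along these lines (illustrative only; it mirrors the 2048-byte buffer from the doc excerpt above):

    use std::mem::MaybeUninit;

    // Only reach for a fixed-size stack buffer when the directory's contents
    // are known and every entry name is comfortably shorter than the buffer
    // (minus the ~24 bytes of dirent header per entry).
    fn stack_buffer() -> [MaybeUninit<u8>; 2048] {
        [MaybeUninit::uninit(); 2048]
    }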

@sunfishcode (Member) left a comment

Cool, thanks for working on this!

I notice this is implemented only for the linux_raw backend. I think that's fine for now, though in case you're interested, a next step here is implementing it for the libc backend, using libc::syscall to call getdents.

/// buf.reserve(new_capacity);
/// }
/// ```
pub fn new(fd: Fd, buf: &'buf mut [MaybeUninit<u8>]) -> Self {
@sunfishcode (Member):

Do you know if getdents requires the buffer to be aligned at all? If so, we should document that, and either change the type here or potentially make this unsafe.

@SUPERCILEX (Contributor, Author):

Based on some experimentation, no:

        // Hand the syscall a slice that starts 3 bytes into the allocation,
        // so the buffer is deliberately not 8-byte aligned.
        slice::from_raw_parts_mut(
            buf.as_mut_ptr().add(3) as *mut MaybeUninit<u8>,
            buf.capacity() - 3,
        )

That still works.

And this popped up in a search (microsoft/WSL#1769): "Anyway it isn't an ABI requirement the pointer be 8 byte aligned"

So I'm fairly confident the buffer doesn't have to be aligned. I'd also expect the syscall to fail if it did require alignment, which means this isn't a safety issue.

@SUPERCILEX (Contributor, Author) commented Nov 22, 2022

I notice this is implemented only for the linux_raw backend. I think that's fine for now, though in case you're interested, a next step here is implementing it for the libc backend, using libc::syscall to call getdents.

Done, I wasn't sure if using libc::syscall was ok, but sounds like it is. :)
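For reference, the heart of it is just a thin wrapper over the raw syscall, roughly like this (a sketch of the shape, not necessarily the exact code in the PR):

    use std::mem::MaybeUninit;
    use std::os::unix::io::RawFd;

    /// Fill `buf` with linux_dirent64 records. Returns the number of bytes
    /// written, or -1 on error (with errno set). Linux and Android only.
    #[cfg(any(target_os = "android", target_os = "linux"))]
    unsafe fn getdents64(fd: RawFd, buf: &mut [MaybeUninit<u8>]) -> isize {
        libc::syscall(libc::SYS_getdents64, fd, buf.as_mut_ptr(), buf.len()) as isize
    }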

@SUPERCILEX (Contributor, Author):

Looks like I'm going to need help figuring out the right cfgs. Are there target_os values other than linux that support getdents64? I'm not sure where to look this up.

@sunfishcode (Member):

The main other target_os that supports getdents64 is "android". So any(target_os = "android", target_os = "linux") should be good for this.

@SUPERCILEX (Contributor, Author):

Thanks!

@SUPERCILEX (Contributor, Author):

Hmmm, looks like x86_64-unknown-linux-gnux32 is using 32-bit inodes and offsets even though it's supposed to be linux_dirent64. That doesn't make much sense; I'm not quite sure what to do.

@sunfishcode (Member):

Oh, hrm. The linux_raw_sys bindings have the wrong type for x32. That might be pretty tricky to sort out.

Maybe what we could do here is just special-case x32. When #[cfg(all(target_arch = "x86_64", target_pointer_width = "32"))] is true, use a locally-defined linux_dirent64; otherwise use the linux_dirent64 defined in linux_raw_sys. That's not super pretty, but maybe it'd be good enough for now?
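Concretely, the local definition could look something like this (a sketch assuming the kernel's linux_dirent64 layout; field names follow the C definition):

    /// Fallback binding used only on x32, where the linux_raw_sys type
    /// currently has 32-bit inode and offset fields.
    #[cfg(all(target_arch = "x86_64", target_pointer_width = "32"))]
    #[allow(non_camel_case_types)]
    #[repr(C)]
    struct linux_dirent64 {
        d_ino: u64,
        d_off: i64,
        d_reclen: u16,
        d_type: u8,
        // d_name: the NUL-terminated name follows immediately after d_type.
    }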

@SUPERCILEX (Contributor, Author):

That's a little sad, but done.

@sunfishcode (Member):

Thinking about this more, I believe I now have a better solution for x32: sunfishcode/linux-raw-sys#36

It's just a matter of putting special cases in the right place 😅.

@sunfishcode (Member):

Ok, that patch is now in linux-raw-sys 0.1.3. Could you try updating that and trying this patch without the special case for x86?

@SUPERCILEX (Contributor, Author) commented Nov 22, 2022

Sweet, that's much nicer. Thanks!

@sunfishcode (Member):

Looks good, thanks!

@sunfishcode merged commit 1b06142 into bytecodealliance:main on Nov 23, 2022
@SUPERCILEX (Contributor, Author):

Woohoo, thanks!

@SUPERCILEX deleted the dents branch on December 28, 2022
Successfully merging this pull request may close: Openness to a low-level getdents64 directory iterator? (#451)