Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache information about file type #135

Merged
merged 3 commits into from
Jan 2, 2024
Merged

Conversation

Kobzol
Copy link
Contributor

@Kobzol Kobzol commented Dec 12, 2023

This commit adds a cache that remembers whether a given path is a file or a directory, based on the results of std::fs::read_dir. This reduces the number of executed syscalls and improves the performance of the library.

Here is a simple benchmark that uses glob to find the amount of Rust files in the tests directory of a rustc checkout.

fn main() {
    let count = glob::glob("<rustc-root>/tests/**/*.rs")
        .unwrap()
        .count();
    println!("File count: {count}");
}

Results on my PC (approximately 19k Rust files are in that directory):

Version Syscall count statx syscall count Time
Before 41586 34468 ~130ms
After 7131 11 ~70ms

Syscalls were measured with strace <program> 2> out.txt && cat out.txt | wc -l and time was measured using hyperfine.

Fixes: #79

This pull request was created in cooperation with students of the Rust course on the VSB-TUO university.

src/lib.rs Outdated Show resolved Hide resolved
This commit adds a cache that remembers whether a given path is a file or a directory, based on the results of `std::fs::read_dir`. This reduces the number of executed syscalls and improves the performance of the library.
src/lib.rs Show resolved Hide resolved
tests/glob-std.rs Outdated Show resolved Hide resolved
@the8472 the8472 merged commit 4172399 into rust-lang:master Jan 2, 2024
12 checks passed
@Kobzol Kobzol deleted the cache-dir branch January 2, 2024 22:35
osiewicz added a commit to osiewicz/glob that referenced this pull request Apr 27, 2024
Background:
While working with cargo, I've noticed that it takes ~30s to cargo clean -p with large enough target directory (~200GB).
With a profiler, it turned out that most of the time was spent retrieving paths for removal in https://github.com/rust-lang/cargo/blob/eee4ea2f5a5fa1ae184a44675315548ec932a15c/src/cargo/ops/cargo_clean.rs#L319 (and not actually removing the files).

Change description:
In call to .sort_by, we repetitively parse the paths to obtain file names for comparison. This commit caches file names in PathWrapper object, akin to rust-lang#135 that did so for dir info.

For my use case, a cargo build using that branch takes ~14s to clean files instead of previous 30s (I've measured against main branch of this directory, to account for changes made since 0.3.1). Still not ideal, but hey, we're shaving 50% of time off for a bit heavier memory use.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Excessive stat syscalls on linux
2 participants