Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option to improve perf by avoiding fs.stat calls #24

Closed
es128 opened this issue Apr 20, 2015 · 8 comments
Closed

Option to improve perf by avoiding fs.stat calls #24

es128 opened this issue Apr 20, 2015 · 8 comments

Comments

@es128
Copy link
Collaborator

es128 commented Apr 20, 2015

readdirp can take a long time to process a very large file tree. I'm interested in exploring ways to improve performance for this use case, and one way to do that would be by avoiding the stat calls to every entry.

Instead, it should be more performant to just call readdir on each entry and handle the error to tell the difference between files and dirs - perhaps even making the assumption that entries with names that appear to include a file extension are not directories when this mode is employed.

Opening this issue just to declare my intent to do some work along these lines.

@thlorenz if you have objections to adding this sort of option to readdirp please let me know and I'll consider other possibilities.

@thlorenz
Copy link
Collaborator

No, provide an option, add tests, etc. and I'll happily merge this.
However the convenience of resolving stats is probably the major thing that readdirp adds on top of the glob package aside from the streaming.
Any reason you wouldn't just use glob for these cases?

@es128
Copy link
Collaborator Author

es128 commented Apr 21, 2015

I hadn't considered glob before, but still the reason I'd want this here is to keep the same interface I'm using in chokidar (using the streaming api) with a simple option to switch the stats on or off, which would correspond with chokidar's alwaysStat option. I utilize readdirp's filters and depth options that I wouldn't want to have to reinvent by incorporating glob (would rather reinvent some of readdirp's internals I guess, heh)

Thank you for the suggestion, though!

@thlorenz
Copy link
Collaborator

Ok, cool waiting for the PR ... :)

@es128
Copy link
Collaborator Author

es128 commented Apr 21, 2015

Just FYI, node-glob does do stat calls on the paths it collects, it just doesn't expose that data. So I doubt I'd see much of a performance improvement by using it for the cases I'm targeting.

@thlorenz
Copy link
Collaborator

Interesting, I had no clue about that. I guess it does so to figure out links and stuff.

So readdirp could become faster than glob then -- that'd be awesome!
We need benchmarks!

@thlorenz
Copy link
Collaborator

Oh, but btw if we don't do stat calls how are we gonna know if we need to recurse into (a directory) or not (a file)?

@thlorenz
Copy link
Collaborator

Ah just reread:

making the assumption that entries with names that appear to include a file extension are not directories when this mode is employed

That's really hairy since lots of files have no extension -- especially bash scripts. So you'd end up catching a bunch of errors but you still may be faster -- we'll see.

@es128
Copy link
Collaborator Author

es128 commented Apr 21, 2015

Yes, expecting that. The idea is that catching the error on a bad readdir call is faster than stat.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants