Too slow with thousands of files #207
Comments
Hey, thanks for the kind words! I'll let @thibaudgg answer you properly but just a quick answer: yes, I think it'd be very easy to add an option to disable the MD5 stuff!
Thanks, at the moment it looks like that's the MAIN bottleneck. People have seen that the inotify and fsevents events come in quite fast (not surprised), but the time from that to my callback being called can be many seconds, with 100% CPU in between... For Vagrant, it'd be easier if we could just say "let it all through".
Hey @mitchellh, can you confirm that MD5 comparison is the main bottleneck? It should be skipped on non-Darwin systems. I improved the logic recently (db6dcc2); are you using the latest version?
@mitchellh - I'm currently refactoring Listen, which may also mean a few performance-related fixes (e.g. even on Linux, all the watched directories are unnecessarily scanned on startup). And there's the built-in "interactive" delay to pile up changes. Listen was designed for interactivity (piling up events and reducing them), not performance (blocking, listening and quickly returning every change), but with a few tweaks it should perform just as well as the adapters. In short: Listen was built to reduce similar/related changes, not to assure low latency between changes. Also, if Listen is to be a trigger for rsync, then you're right: it's more than reasonable to disable MD5 completely, along with change recording. I have a list of issues I'm working from, but any +1s and I'll prioritize the rsync support ... ;)
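To make the "piling up events and reducing them" idea concrete, here is a minimal Ruby sketch. It is not Listen's implementation: the `EventBatcher` class, its method names and the 0.25 second latency are all hypothetical, chosen only to illustrate how a latency window trades responsiveness for fewer, de-duplicated callbacks.

```ruby
require "set"

# Hypothetical helper (not part of Listen's API): collects raw change events
# and hands them over as one de-duplicated batch once a latency window passed.
class EventBatcher
  def initialize(latency:)
    @latency = latency      # seconds to wait before a batch may be flushed
    @pending = Set.new      # unique paths seen in the current window
    @window_start = nil     # time of the first event in the current window
  end

  # Record one raw event. `now` is injectable so the sketch is easy to test.
  def push(path, now: Time.now)
    @window_start ||= now
    @pending << path
  end

  # Return the sorted batch if the window has elapsed, otherwise nil.
  def flush(now: Time.now)
    return nil if @window_start.nil? || now - @window_start < @latency
    batch = @pending.to_a.sort
    @pending.clear
    @window_start = nil
    batch
  end
end

t0 = Time.at(0)
batcher = EventBatcher.new(latency: 0.25)   # assumed value, for illustration
batcher.push("app/a.rb", now: t0)
batcher.push("app/a.rb", now: t0 + 0.05)    # duplicate save in the same window
batcher.push("app/b.rb", now: t0 + 0.10)
early = batcher.flush(now: t0 + 0.20)       # window still open
batch = batcher.flush(now: t0 + 0.30)       # window elapsed
```

A caller that cares about latency rather than interactivity (the rsync case) would want the window as small as possible, or no batching at all.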
Turns out I missed out something important... Listen is "correct", while the functionality Vagrant needs is "detecting parent directory changes" and then letting Rsync work things out. Consider this case (from one of Listen's acceptance tests):
inotify
Inotify will ONLY generate the following events: (I'll call this rsync mode or dumb mode from now on)
listen
While listen "makes up" the ADDITIONAL events: (I'll call this correct mode or smart mode from now on, because there's no system event saying foo changed - you can only discover that by saving a snapshot of
So here's the difference (flow):
What Vagrant needs: change happens -> notify on directory -> call rsync to sync target
What Listen does: change happens -> compare with "db" -> compile list of actual IMPLICITLY changed paths
If ... when DISK SPACE and IO is concerned, because listen operates on ONE directory (and a snapshot in memory), while RSYNC works on TWO directories.
conclusions
So the best performance-related changes in Listen I can think of are:
So overall, the vagrant-gatling-rsync plugin may be better suited than listen will ever be, especially since Windows isn't a first choice for performance (portability is a strong point of Let me know what you think.
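The two flows described in this comment can be sketched in Ruby as follows. This is an illustration under assumptions, not Listen's code: `dirs_to_sync`, `snapshot` and `diff` are hypothetical names, and the file names are made up for the demo. The first function is all "rsync mode" needs, while "smart mode" has to keep a snapshot and compare against it.

```ruby
require "tmpdir"
require "fileutils"

# "rsync mode": reduce raw event paths to the directories that need syncing.
def dirs_to_sync(event_paths)
  event_paths.map { |p| File.dirname(p) }.uniq.sort
end

# "smart mode", step 1: snapshot a tree as { relative path => mtime }.
def snapshot(root)
  Dir.glob("**/*", base: root).each_with_object({}) do |rel, rec|
    full = File.join(root, rel)
    rec[rel] = File.mtime(full) if File.file?(full)
  end
end

# "smart mode", step 2: diff two snapshots to find what actually changed.
def diff(before, after)
  {
    added:    (after.keys - before.keys).sort,
    removed:  (before.keys - after.keys).sort,
    modified: (before.keys & after.keys).reject { |k| before[k] == after[k] }.sort
  }
end

changes = nil
Dir.mktmpdir do |root|
  FileUtils.mkdir_p(File.join(root, "dir"))
  File.write(File.join(root, "dir", "foo"), "1")
  File.write(File.join(root, "dir", "bar"), "1")
  before = snapshot(root)

  File.delete(File.join(root, "dir", "bar"))
  File.write(File.join(root, "dir", "baz"), "1")
  # Force a distinct mtime so the demo does not depend on timer resolution.
  File.utime(Time.now, Time.now + 10, File.join(root, "dir", "foo"))

  changes = diff(before, snapshot(root))
end

sync = dirs_to_sync(["dir/foo", "dir/bar", "dir/baz"])
```

The snapshot costs memory and a full scan per watched tree, which is the overhead the rsync use case does not need, since rsync does its own comparison between the two directories.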
I also ran into a similar issue. OS X 10.9.3, 1.86GHz Core 2 Duo, Listen 2.7.5. I would like to monitor a directory with ≈50k files in it (multiple code projects in git repos). It takes about 10 minutes of CPU time (not actual time) to warm up before it starts processing the callback in a reasonable amount of time. While it was warming up, events were delayed for minutes, some events were never reported, and the ruby process was consuming >80% CPU. Once it finished warming up, it processes events in a timely fashion. I see that it is using 140MB of "Real Memory", ranking in at the third fattest process currently running on my laptop. I didn't expect my 30 line rake task to be so heavy!
Thanks for the numbers, @whitehat101 (I use Linux - the situation is quite different). Since I'm reworking Listen to make it "faster" and "more lightweight" for different use cases, I have a few questions:
Some comments:
During that time listen is making an internal snapshot of the directories to later be able to detect complex changes that the OS (fsevent) doesn't report.
The snapshotting does take a while and until it's completed, changes cannot be reliably detected. What makes this worse is on OSX (and Windows) the mtime of files is rounded to seconds, so Listen uses MD5 to distinguish between actual changes (to avoid e.g. running unit tests multiple times, when a file was reported as changed, but the content didn't change).
For a Ruby process with an internal snapshot of file information for 50k files ... that's really not bad. Again - let me know what your use case is exactly, so I can make sure the next version of Listen will be as tweakable as possible to avoid the slowdown.
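As a rough Ruby illustration of why the MD5 step exists when mtime only has one-second resolution: the sketch below treats a differing whole-second mtime as a change and falls back to the content hash when the mtime is identical. The `Entry` struct and the `entry_for` / `changed?` helpers are hypothetical, and the exact decision order is an assumption for illustration, not Listen's actual logic.

```ruby
require "digest/md5"
require "tmpdir"

# Hypothetical record entry: what was remembered about a file last time.
Entry = Struct.new(:mtime_sec, :md5)

def entry_for(path)
  Entry.new(File.mtime(path).to_i, Digest::MD5.file(path).hexdigest)
end

# Decide whether `path` changed since `old` was recorded.
# With one-second mtime granularity an equal mtime proves nothing,
# so the content hash acts as the tie-breaker.
def changed?(path, old)
  return true if File.mtime(path).to_i != old.mtime_sec   # cheap check first
  Digest::MD5.file(path).hexdigest != old.md5             # expensive fallback
end

same = edited = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, "spec.rb")
  File.write(path, "v1")
  pinned = Time.at(1_400_000_000)
  File.utime(pinned, pinned, path)
  old = entry_for(path)

  same = changed?(path, old)          # untouched file

  File.write(path, "v2")
  File.utime(pinned, pinned, path)    # simulate an edit within the same second
  edited = changed?(path, old)
end
```

Hashing every file is what makes this expensive on large trees, which is why an option to skip it (accepting some false positives) is attractive for the rsync use case.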
I'm trying to tackle the dreaded TimeTracking problem. Whose project did I work on, when, and how long. I set Listen to watch my ~/Code directory, which contains active projects and inactive projects. When something changes, I'm currently just inserting the time, what project, what file, and what happened ("added", "deleted", ...) into a database to be processed later.
Not exactly, but it would be convenient. I got the 50k number from I'd rather not prune or refactor my personal code folder, but at work it would be reasonable to keep billable/active projects separate. At home, it would be fun to know that I hacked on that one obscure github project for three hours after not touching it for months without having to move project folders around or start a new watcher process -- zero-config would be nice (and it's impossible to forget to enable something if there is nothing to be enabled in the first place).
Yes.
Both. I'm tempted to say that I don't actually care about the content of the files, and that an occasional false-positive-modified would be acceptable, but then I started imagining a quirky editor that forced a save every five minutes as long as the editor is open. I want to monitor user activity, but a file system likely can't tell if a user hit a key or if a program is just doing odd things.
Knowing that dir and dir2 were changed is enough.
Modified is sufficient. Knowing that something has happened in the project is essential. What happened is just fun to know. I haven't read enough of your project's code to know what I'm talking about, but it seems like lazy snapshotting could provide a significant performance boost to startup time and memory usage. Don't build any snapshots until the OS reports activity. Assume the first report is genuine, and then create a snapshot to use for future comparisons. Then, my active projects could have reliable changes reported, and my inactive projects would not consume (as many) resources. I'm curious, what kinds of things might cause a file to be reported as changed, but not actually change? I can only think of the
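The lazy snapshotting suggestion above could look roughly like this in Ruby. It is only a sketch of the idea as proposed in this comment, not something Listen does: the `LazyRecord` class and the `:assumed_changed` marker are hypothetical, and the scan is deliberately non-recursive to keep the example short.

```ruby
require "tmpdir"

# Hypothetical sketch of lazy snapshotting: nothing is scanned at startup.
class LazyRecord
  def initialize
    @snapshots = {}   # dir => { file name => mtime }, filled on demand
  end

  def snapshotted?(dir)
    @snapshots.key?(dir)
  end

  # Called when the OS reports activity in `dir`.
  # First report: trust it, take a snapshot, return :assumed_changed.
  # Later reports: return the sorted names that differ from the snapshot.
  def on_event(dir)
    current = scan(dir)
    previous = @snapshots[dir]
    @snapshots[dir] = current
    return :assumed_changed if previous.nil?
    (current.keys | previous.keys).reject { |n| current[n] == previous[n] }.sort
  end

  private

  # Non-recursive scan of the plain files directly inside `dir`.
  def scan(dir)
    Dir.children(dir).each_with_object({}) do |name, rec|
      full = File.join(dir, name)
      rec[name] = File.mtime(full) if File.file?(full)
    end
  end
end

first = second = nil
record = LazyRecord.new
Dir.mktmpdir do |dir|
  File.write(File.join(dir, "a.rb"), "1")
  first = record.on_event(dir)     # no snapshot yet: assume the report is genuine
  File.write(File.join(dir, "b.rb"), "1")
  second = record.on_event(dir)    # snapshot exists: compute a real diff
end
```

With this approach, directories that never see activity never get scanned, so inactive projects cost no startup time and no memory. The trade-off is that the first report per directory cannot say which file changed.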
TL;DR - help get rb-fsevent tracking file changes (and not just dirs) ... and I'll happily throw the OSX / old fsevent-specific crutches out of Listen.
Welcome to the club ... Personally, I take my zhistory file (with timestamps) and "compile it" with my eyeballs, then log what I did to another app for stats. But I digress...
Makes sense to watch everything then.
Great, because, I'm planning to get rid of the distinction between "added" and "removed" and "modified" completely, because I can't find a use case that justifies keeping it.
That's fine on Linux, but on MacOS the rb-fsevent gem can only track changes to directories, so if you change a file ... you need some way to work out which file was modified (that's why snapshots are needed: to work out which file(s) actually changed, and the one-second mtime granularity makes it even worse). There's probably an option to watch files, except AFAIK the current rb-fsevent doesn't support that. So, if someone can get rb-fsevent to report changes to files (and not just directories) ... I'd love to drop some of the current workarounds (because both Win* and Linux* report changes to files nicely). That means the best thing to do would be to get rb-fsevent to handle files (because essentially Listen adds tracking of file changes on top of fsevent; on other platforms it may seem almost pointless to use Listen for performance-demanding use cases, because of how full-featured WDM and rb-inotify are). [ Case in point - some people would probably just use inotify-tools binaries and a few shell scripts for what you need on Linux ] There are probably also other, more exotic options for OSX, like putting your files on a Linux VM, mounting the image on the host and/or sending file changes over TCP to Listen (check out the Listen README) ...
E.g. on Mac when a file changes, the current (?) rb-fsevent says a dir changed ... so potentially ANYTHING inside the directory could have changed (e.g. removed, added, moved into that dir unmodified, etc.), so listen marks everything inside as changed (and recursively too), then compares stuff to work out which file was actually the one that triggered the chain reaction... I'm currently rewriting Listen to be more easily configurable, so you can tweak it to get what you want. Can't give a deadline though (it's in my "spare" time).
rb-fsevent does support file events with the
Then it's all a matter of getting the I have no means of implementing this (other than blindly) and testing (Linux only), so a PR would be nice. (Otherwise all I can actually do is end up breaking the current OSX support without being able to test it.) Instead, here's a recipe to get it working with Listen:
For this to work properly (skip comparing with the record and support editor moves/renames, etc.), it should do for files what the Linux adapter (linux.rb) does, which is:
And ideally, there should be unit tests properly stubbing/mocking the rb-fsevent objects, so that OSX functionality is covered. If the events are properly handled, I can help implement the "top half" (i.e. if there's a test case with rb-fsevent objects stubbed that I can run on Linux that fails, I can fix it). Currently I have no idea what events rb-fsevent generates for every tested scenario regarding files - so at the very, very least I need test cases I can run on Linux and that fail on Travis (when they're broken). Also, I have no idea if this will actually work better and more effectively than the current implementation.
TL;DR; - in 2.7.6 there's an undocumented Since the 2.7.6 release, there are 3 changes related to this issue:
notes
The Record building (now during startup) is VERY slow (mostly because of heavy fiber/task switching). E.g. 15,000 files/dirs on Linux means 35 seconds (and that's with no hashing). This will likely be improved in the next release, but it requires heavy re-refactoring (ongoing) and then lots of testing. I'm also planning to allow the Record to be skipped completely, which makes sense for "rsync-based" use cases (monitoring only dir changes). Listen needs to be drastically reorganized for this to happen, though, because its whole idea is based on the opposite: watching for file changes and ignoring dir changes.
Fixed in v2.7.7. Feel free to reopen this if there are still any performance issues.
💟
+1
Hi! First, thanks for Listen, it is a great API and library for file watching. We recently integrated the listen gem for the rsync functionality in Vagrant.
However, we have many users that have upwards of 20,000 to 60,000 files and listen just is far too slow for this. We're considering moving away from listen but before we make that choice I wanted to ask if there was a way we can improve performance?
See: hashicorp/vagrant#3249
One person mentioned that Listen does an MD5 of every file or something thereabouts. Perhaps you can expose an option to not do this at the expense of maybe some false positives?