'Errno 24 - Too many open files' on dvc push #2473
Comments
Hi @ChrisHowlin ! Thank you for reporting this! 🙂 Could you post the full log, please? It is 100 files, right? Not 100K? Just making sure I understand you correctly. If it is 100, we might be leaking fds 🙁 Mind also trying dvc version 0.56.0 to see if that works? We've had some changes to dvc since that version that I suspect might be the cause of the leaks. Thanks!
I checked again, and the number of files being pushed was actually 79 (definitely not thousands 😄). I don't have the full logs from that run. I don't have access to the environment until Monday, but I might be able to reproduce over the weekend.
Hi @ChrisHowlin ! Were you able to reproduce it again? 🙂
Just for the record: we run tests with millions of files and are not hitting the Linux open-fd limit (which is 4K), so I don't think we are seriously leaking fds. But with a limit of 256 on mac, ~70 files, and 4*NCPU workers, we might indeed be pushing it a bit. Using a lower number of jobs might help.
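To make that arithmetic concrete, here is a back-of-the-envelope sketch (the 10-connections-per-transfer figure comes from the boto defaults discussed further down in this thread; the exact numbers are illustrative):

```python
import os

# Assumptions: macOS default of 256 fds per process, DVC's default of
# 4 workers per CPU, and (per the boto discussion below) up to
# 10 concurrent connections per S3 transfer.
ULIMIT = 256
jobs = 4 * (os.cpu_count() or 1)        # e.g. 16 on a 4-core MacBook
connections_per_transfer = 10

# Worst case: every worker holds its source file open plus its
# S3 connections at the same time.
worst_case = jobs * (1 + connections_per_transfer)
print(f"{worst_case} fds worst case vs a limit of {ULIMIT}")  # 176 vs 256
```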
I have got some time to look at this again today and am going through the steps to try to reproduce with 0.56.0. One observation so far: the error appears right at the start of the push, before I see any progress on the upload bars. Also, tracking the lsof count for the dvc push process shows the number of open files decreasing over time. I wonder if the problem is not an fd leak, but rather that too many threads open files and sockets to S3 at once, beyond the limit the OS defines for the process (default 256)? I will provide some more logs on the different versions when I have them.
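For anyone wanting to repeat that measurement, here is a small hypothetical watcher script, equivalent in spirit to polling `lsof -p <pid> | wc -l` (psutil and the helper name are additions for illustration, not part of dvc):

```python
import sys
import time

import psutil  # third-party dependency, assumed installed


def watch_fds(pid: int, interval: float = 1.0) -> None:
    # Poll the open-fd count of a running process (POSIX only),
    # the programmatic equivalent of `lsof -p <pid> | wc -l`.
    proc = psutil.Process(pid)
    while proc.is_running():
        print(time.strftime("%H:%M:%S"), proc.num_fds())
        time.sleep(interval)


if __name__ == "__main__":
    watch_fds(int(sys.argv[1]))  # pass the pid of `dvc push`
```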
I have managed to reproduce on 0.56.0 and have also identified minimal-ish reproduction instructions using test data:
Repro instructions
Repro log
Some observations:
@ChrisHowlin Sorry for the late response, and thank you for the investigation! 🙂 So to summarize: while we take a closer look at this, a workaround for you would be to either increase ulimit or lower the number of jobs. For the record: I am able to reproduce with your script on my mac.
I can confirm the above workarounds. To summarise (for anyone else experiencing this), I was able to work around the issue either by raising the limit:
ulimit -n 4096
or by lowering the number of parallel jobs passed to dvc push.
@efiop, are we leaking some file descriptors? Is this expected on macOS? (We don't have anything related to this in the docs.)
@MrOutis Probably, haven't looked into this yet. Mac has a really low ulimit for opened fds compared to an ordinary linux distro, and that very often gets in the way, as I've noted above. I would take a look at this first and then look into modifying the docs.
Is this error still valid? Maybe there is a similar problem to the one we had in #2600?
@ChrisHowlin Could you try the newest version of dvc (0.68.1) and see if you are still able to reproduce this issue? @pared Very similar, indeed! Sure, please go ahead!
@efiop, @ChrisHowlin it seems the problem is similar to the one in #2600; I will take a closer look.
@pared setting ulimit to 25 might be way too low in general; we need to be careful about that. What I did for #2600 was monitor the number of opened fds from /proc/$PID/, and it was pretty apparent that some stuff was not being released in time. Might want to look into caching the boto session as a quick experiment.
Probably related:
Even when setting
It's not a bug, it's a feature 🙂 So the thing is that we easily reach the open-descriptors limit on mac. It is unnoticeable on Linux because its default limit is four times bigger than mac's. The reason we reach the limit is the default transfer configuration for S3: when we do not provide our own config, boto falls back to its defaults (defined here), which allow up to 10 concurrent threads per transfer. I think we should introduce a default upload config in the S3 remote.
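For reference, this is roughly what passing a restricted transfer config to boto3 looks like (a sketch against boto3's public API; where exactly dvc would plug it in is not shown here, and the bucket/key names are placeholders):

```python
import boto3
from boto3.s3.transfer import TransferConfig

# boto3's default TransferConfig allows up to 10 concurrent threads
# per transfer call (max_concurrency=10, use_threads=True).
default_cfg = TransferConfig()

# A restricted config keeps each upload on the calling thread, so the
# number of open connections stays proportional to DVC's job count.
single_threaded = TransferConfig(use_threads=False)

s3 = boto3.client("s3")
# the bucket and key names below are placeholders for illustration
s3.upload_file("data.bin", "my-bucket", "data/data.bin",
               Config=single_threaded)
```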
@pared Great investigation!
Wouldn't that slow down the uploading of big multipart files?
Also, one more note:
Great point! But anything less than what mac has is an unlikely scenario, and we shouldn't worry about it, as lots of other stuff will break anyway.
Reading through the boto docs: it seems to me that it will for sure be slower, but only because we restrict the number of jobs to X rather than 10X. Do you think we should run some performance tests to check whether uploading a few big files is faster with the default s3 settings than with a single thread per file?
I also think so, and that would affect every dvc user, even on linux, which is absolutely no bueno: it would slow down big-file upload (and probably download) for everyone. So what we have here is a tradeoff between the "large files" and "large number of files" scenarios, which are both core scenarios for us. If choosing between lowering
I bet a ramen that it will be slower for large files 🙂
Ok, so let's sum up what is going on.
Problem: for different remotes,
Possible solutions:
I think the proper solution would be the second one, though there are a few things to consider:
Also, some notes:
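One possible shape for such remote-specific job defaults, as a hypothetical sketch (the class and attribute names are illustrative, not dvc's actual code):

```python
import os

class RemoteBase:
    # transfers are I/O bound, so oversubscribe the CPUs by default
    JOBS = 4 * (os.cpu_count() or 1)


class RemoteS3(RemoteBase):
    # each boto transfer may open up to 10 threads/connections, so
    # scale the worker count down to stay under low fd limits (mac: 256)
    JOBS = max(1, RemoteBase.JOBS // 10)
```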
Related:
We can adjust that. BTW, there are similar scenarios with other resources like SSH/SFTP sessions. Right now we set conservative defaults, which makes dvc slower for many users without them even knowing it, which in turn makes the perception of dvc worse.
Ok, after discussion with @efiop I would like to get back to the implementation idea of this one, as it seems we imagined it differently. Here are some notes:
Maybe we could simply set JOBS for S3 to be
But how do we know when it is too many? :) When we hit "Too many files open"? In that case, how is this different from point 3? EDIT: ah, got it, you mean that it is closer to 1, so we would print the warning if we are over
Sounds good to me 🙂
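A sketch of how a ulimit-derived default for S3 jobs, plus the over-limit warning discussed just above, could look (the heuristic and the reserve value are assumptions, not dvc's actual formula):

```python
import resource

THREADS_PER_TRANSFER = 10  # boto's default, per the discussion above
RESERVE = 32               # fds kept back for dvc itself; an assumption


def safe_s3_jobs() -> int:
    # fit the worker count under the soft open-file limit
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return max(1, (soft - RESERVE) // THREADS_PER_TRANSFER)


def check_jobs(requested: int) -> None:
    # warn (rather than fail) when the requested jobs exceed the estimate
    limit = safe_s3_jobs()
    if requested > limit:
        print(f"WARNING: --jobs {requested} may exceed the open-file "
              f"limit; consider --jobs {limit} or raising `ulimit -n`.")
```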
I think we should decide whether we want dvc to be a little bit aggressive in order to be fast. If yes, we should implement automatic degrading with a WARNING and a HINT, and set the default high (a sketch follows below). If not, we might catch the error, stop, and show a hint to either reduce jobs or increase ulimit. I am in favor of being aggressive here. Reasons:
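A minimal sketch of what that aggressive auto-degrading could look like (the function name and the halving policy are assumptions; a real implementation inside dvc's upload code would resume rather than restart the transfer):

```python
import errno
from concurrent.futures import ThreadPoolExecutor


def push_with_degradation(files, upload, jobs):
    # Start with a high worker count and halve it whenever the OS
    # reports EMFILE (errno 24, "Too many open files") instead of
    # failing hard; unrelated errors are re-raised.
    while True:
        try:
            with ThreadPoolExecutor(max_workers=jobs) as pool:
                list(pool.map(upload, files))  # propagate worker errors
            return
        except OSError as exc:
            if exc.errno != errno.EMFILE or jobs == 1:
                raise
            jobs = max(1, jobs // 2)
            print(f"WARNING: hit the open-file limit; "
                  f"retrying with {jobs} jobs")
```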
In the future we might apply this approach to other situations. Network speed and stability might vary (i.e. a user connecting from different locations), available SSH/SFTP connections might vary (taken by another user), the number of available sockets might vary (opened by some other process), etc. We should adapt and work as fast as possible in the current situation; asking a user to reconfigure things each time is tiresome. Going this way we might drop some config/cmdline options in the future, like --jobs.
@Suor Great points! Though estimating the maximum tolerances might be tricky, as once we reach them, other stuff might break (maybe even in the handling code) :) The "error and hint" approach is simpler to get right quickly. Maybe let's do the latter and keep discussing the former?
Not sure we will ever be able to get rid of --jobs.
I think the "aggressive" approach is the only way to go. If we throw an error and a hint to make the user change the number of jobs, that will result in many angry users getting a "fix number of jobs" error after a few minutes of successful upload. That is definitely not a friendly user experience.
Also, @Suor, when we talk about auto degrading, we mean in the case of the default jobs number, right? Or do you think it should also be applied when the user passes --jobs explicitly?
@pared ideally we won't have --jobs at all. A good example is dynamic chunk size in
Version information
Description
When pushing a directory of ~100 DVC-tracked files to S3, I observe an Errno 24 error from the dvc process.
It looks like dvc is trying to open more files than the OS allows. Checking the file handles for the dvc process, I get:
Looking at the OS limits, a process is limited to having 256 open files.
A workaround for this is to increase the max files per process to a larger number (say 4096) by running something like
ulimit -n 4096
, but I wonder if the ideal solution is for DVC to work within the OS-configured limits by default?
Edit: updated wording of workaround.
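For completeness, the same limit can also be raised from inside a Python process via the standard library, up to the hard limit; a sketch of the in-process equivalent of the command above, not something dvc itself does:

```python
import resource

# Raise the soft open-file limit to 4096 (capped at the hard limit),
# the in-process equivalent of running `ulimit -n 4096` in the shell.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (min(4096, hard), hard))
```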