remote: s3: adjust jobs number basing on file descriptors number #2866
Conversation
sorry guys, missed the discussion in the ticket. @pared could you please summarize the changes, show a gif of how it would look? My 2c: it feels a bit like we are over-architecting this if we try to analyze the system in advance. Why don't we just handle this gracefully and show a message in the logs/on screen that "you are hitting the limit blah-blah..."?
@shcheklein the whole implementation is a result of our discussion in the original issue.
The problem here is that we cannot say with 100% certainty that this error will occur; it depends heavily on the size of the particular files and the speed of the network. We know that this error might occur when the number of jobs exceeds some (estimable) number. However, a user might be aware of that and purposefully increase the number of jobs, knowing that for some reason (a super-fast network) it does not apply to them. So what I want to introduce in this change is: …
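The kind of fd-limit-based estimate under discussion could be sketched like this. This is a minimal illustration only: `FDS_PER_JOB` and `RESERVED_FDS` are made-up numbers, not measured DVC values, and `max_safe_jobs` is a hypothetical helper, not part of DVC.

```python
import resource

# Hypothetical numbers for illustration only -- not DVC's actual values.
FDS_PER_JOB = 4      # assumed file descriptors consumed per transfer job
RESERVED_FDS = 64    # headroom for logs, sockets, config files, etc.

def max_safe_jobs():
    """Estimate an upper bound for --jobs from the soft RLIMIT_NOFILE."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return max(1, (soft - RESERVED_FDS) // FDS_PER_JOB)
```

The point of contention in the thread is exactly whether such an estimate is reliable enough to act on automatically, or whether a user with a fast network should be allowed to exceed it.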
I don't like this approach. This is: …

So I would prefer dvc to retry on errors and adjust jobs automatically. Ideally we won't need …
@Suor I don't like the retrying approach, because you cannot determine when, or whether, the operation will fail. So it is possible that someone will get an error after a few minutes of execution, and I think that would be really bad, because the user would need to reduce the number of jobs or increase the fd limit, retry, and hope for the best. That could be frustrating if faced a few times.
@pared thanks for the clarifications! Even though I tend to agree with @Suor that our future direction should be aggressive (do not fail, but adjust dynamically), I think we should move forward with this (especially since it was discussed by a few members of the team) and move @Suor's general suggestion to a separate discussion to approach holistically. @efiop just to clarify (since I missed the discussion), what's your take on this?
@Suor ulimit is per-session; your torrents (aye-aye, captain Alex!) and backups are in a different session, so they won't affect your dvc pull.
Agreed on these two. The warning can be replaced with a post-crash one (see the part about it below). I would probably put the emphasis on "increase your ulimit!", since mac's default ulimit is just ridiculous, and I was running into it constantly until I adjusted my ulimits.
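For reference, checking and raising the limit is a one-liner in the shell; the value below is illustrative, and the change only lasts for the current session:

```shell
# Show the current soft limit on open file descriptors
ulimit -n

# Raise it for the current shell session only (macOS often defaults to 256)
ulimit -n 4096
```

The per-session behavior is exactly why the torrents/backups in another session don't interfere.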
Ideally yes, but that is a different ticket. With ulimit it is hard to retry, as you can't really be sure it won't fail again, even after 10 minutes of wait time. I haven't tested it, so I'm not 100% sure whether that would be suitable for us; I guess it depends on where we are going to catch that exception. @pared We've agreed to definitely catch that exception in …
@Suor @shcheklein @efiop
Note: I think I did not initially understand @Suor's point of view, and imagined it as the following workflow: …
@pared It still feels like catching the error and printing a message to increase the ulimit brings more value here. I'm worried about automatic adjustment like this because it involves a lot of heuristics. Maybe it is indeed the only way, but it feels like, if I were the user, dvc telling me "increase your ulimit" would be more useful than the warning and the auto-throttle. Especially since we are talking mostly about mac, where the ulimit for files is simply ridiculously low, so it makes total sense to just have the user increase it.
dvc/remote/base.py
Outdated
```python
    self, from_info, to_info, exception, operation
):
    if isinstance(exception, OSError) and exception.errno == 24:
        raise TooManyOpenFilesError(exception)
```
I think it is better to handle this in main.py, as it might raise elsewhere too, e.g. even on status.
So what I mean is that in main.py we have a few try/except blocks for every CLI command. I would add an except clause there for this and just print a proper error. What do you think?
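A rough sketch of what such an except clause might look like. The names here are illustrative, not DVC's actual main.py; the only factual anchor is that errno 24 is `EMFILE` ("Too many open files"):

```python
import errno
import sys

def run_cmd(cmd):
    """Hypothetical CLI wrapper: translate fd exhaustion into a user hint."""
    try:
        return cmd()
    except OSError as exc:
        if exc.errno == errno.EMFILE:  # errno 24: "Too many open files"
            sys.stderr.write(
                "ERROR: too many open files, please increase your ulimit\n"
            )
            return 1  # exit code, matching the usual CLI error convention
        raise
```

Catching at this level only works if lower layers actually let the `OSError` propagate, which is the complication raised next in the thread.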
Ok, as @pared noted, this is not as trivial, because we catch Exception in upload/download, so it won't hit main.py...
Btw, this changes the flow from what we had before: previously it would catch, return 1, and not re-raise. Not sure about the consequences.
dvc/exceptions.py
Outdated
```python
    "Operation failed due to too many open file descriptors. reduce "
    "the number of jobs or increase open file descriptors limit to "
    "prevent this.",
```
We should put the emphasis on "increase your ulimit", as that is the most reasonable approach. Also, it is better if we explain how to do it or refer to a doc.
It seems that handling exceptions coming from parallel execution of tasks is causing some confusion.
That brings me to the conclusion that we might need to introduce some kind of error handler to which we could delegate error-handling logic. @iterative/engineering What do you think about this idea? [EDIT]
That way we preserve the "best effort" approach (download/upload as much as you can) and we can still fail in cases that we know will not yield a useful result (…)
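One way the delegated handler could be shaped; this is a minimal sketch with hypothetical names (`TransferErrorHandler`, `register_fatal` are not DVC API), keeping errors recoverable by default and letting registered predicates abort the whole transfer:

```python
import errno

class TransferErrorHandler:
    """Hypothetical sketch: delegate per-file error decisions to one place."""

    def __init__(self):
        self._fatal = []

    def register_fatal(self, predicate):
        """Mark errors matching `predicate` as fatal for the whole operation."""
        self._fatal.append(predicate)

    def handle(self, exc):
        """Return True to skip the failed file; raise to abort everything."""
        if any(pred(exc) for pred in self._fatal):
            raise exc
        return True  # recoverable by default: keep transferring other files

handler = TransferErrorHandler()
# fd exhaustion will keep failing for every file, so abort instead of
# skipping file after file
handler.register_fatal(
    lambda e: isinstance(e, OSError) and e.errno == errno.EMFILE
)
```

This keeps the "best effort" default while making the known-hopeless cases (like hitting the fd limit) fail fast.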
@pared So for example, …
@efiop yes, that's what I have in mind. I also can't come up with common errors for all remotes, but handling "unspecified" exceptions is common to all classes (raise …)
@pared Should be quite straightforward. E.g. for s3 it is ClientError; for other remotes it is also something specific that we can get straight from the docs. There will be a learning curve, but I think that is reasonable for the sake of proper error handling instead of the current lazy "except Exception" (my fault originally).
@efiop I agree that there will be a learning curve, though I believe that the current way of handling exceptions also requires some insight. |
@pared Agreed. Ok, so from a quick google for upload/download: …
@efiop I would actually leave errors as recoverable by default, and implement logic for those that we think should fail the whole operation; that seems consistent with the logic we applied until this issue came up.
@pared Right, that is basically what you are doing right now. Makes sense. Seems like we agree that the throttling magic is too fragile, so maybe let's repurpose this PR to only properly handle the ulimit issue? If so, one thing I would do is not create that exception, but rather catch it as an OSError in main.py and …
dvc/main.py
Outdated
```python
if os.name != "nt":
    solution_hint = (
        ", increase open file descriptors limit with `ulimit`"
    )
```
The command is not full, and it is different for mac and linux. Windows also has ways to increase the limit. Without the hint this message is useless, as it just duplicates what the exception says.
Also, we might consider just referring to our docs (e.g. in the user guide) and explaining nicely what is going on, how to solve it on different systems, and why increasing ulimit is better for dvc.
tests/func/test_remote.py
Outdated
```python
with pytest.raises(OSError) as e:
    dvc.push()
assert e is too_many_open_files_error
```
why not just `assert e.errno == 24`?
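One detail worth noting here: `pytest.raises` yields an `ExceptionInfo` wrapper, not the exception itself, so the errno check would have to go through `.value`. A self-contained sketch (the `_push` stand-in is hypothetical, mimicking `dvc.push()` hitting the fd limit):

```python
import errno
import pytest

def _push():
    # stand-in for dvc.push() running out of file descriptors
    raise OSError(errno.EMFILE, "Too many open files")

def test_push_hits_fd_limit():
    with pytest.raises(OSError) as exc_info:
        _push()
    # the raised exception itself lives on .value
    assert exc_info.value.errno == errno.EMFILE  # i.e. 24
```

This also explains why the original `assert e is too_many_open_files_error` can never hold: `e` is the `ExceptionInfo`, not the `OSError` instance.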
tests/func/test_remote.py
Outdated
```python
tmp_dir.dvc_gen({"file": "file content"})

too_many_open_files_error = OSError()
mocker.patch.object(too_many_open_files_error, "errno", 24)
```
You can just `exc.errno = 24`, no need to mock. Or am I missing something?
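Indeed, `errno` is a plain writable attribute, so no mocker is needed. One caveat worth keeping in mind: the two-argument `OSError` constructor fills `errno` in, while the one-argument form leaves it as `None`:

```python
import errno

# Setting the attribute directly works -- no mocker needed
exc = OSError()
exc.errno = 24
assert exc.errno == 24

# The two-argument constructor fills errno for you
exc2 = OSError(errno.EMFILE, "Too many open files")
assert exc2.errno == 24

# Caveat: the one-argument form does NOT set errno
assert OSError(24).errno is None
```

So a `side_effect` built with `OSError(24)` alone would slip past an `exception.errno == 24` check.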
tests/func/test_remote.py
Outdated
```python
too_many_open_files_error = OSError()
mocker.patch.object(too_many_open_files_error, "errno", 24)
mocker.patch.object(
    RemoteLOCAL, "_upload", side_effect=too_many_open_files_error
```
I'm pretty sure you can just:

```diff
-    RemoteLOCAL, "_upload", side_effect=too_many_open_files_error
+    RemoteLOCAL, "_upload", side_effect=OSError(24)
```
dvc/main.py
Outdated
```python
    "too many open files error, for solution, please refer to our "
    "user guide"
)
```
It would look like "ERROR: too many open files error..." which is odd. How about:

> too many open files, please increase your `ulimit`. Refer to https://dvc.org/doc/user-guide/troubleshooting for more info.

This way there is a placeholder non-verbose solution and a referral to the docs.
Could skip the `troubleshooting` part of the URL for now though. Not ideal, but at least we won't promise anything here.
Isn't `ulimit` unix-specific? That message would be ambiguous for windows users.
@pared Yes, but it is generic enough that it gets the point across. A more elaborate explanation will be in the linked doc. Your original message was similar but lacked the doc link.
Co-Authored-By: Ruslan Kuprieiev <kupruser@gmail.com>
For the record: the mac pkg is failing because of unrelated issues with ruby.
- [ ] Have you followed the guidelines in the Contributing to DVC list?
- [ ] Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.
- [ ] Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings recommendatory and don't expect everything to be addressed. Please review them carefully and fix those that actually improve code or fix bugs.

Thank you for the contribution - we'll try to review it as soon as possible.
Fixes #2473