
remote: s3: adjust jobs number basing on file descriptors number #2866

Merged: 8 commits merged into iterative:master on Dec 17, 2019

Conversation

@pared (Contributor) commented Nov 29, 2019

  • ❗ Have you followed the guidelines in the Contributing to DVC list?

  • πŸ“– Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.

  • ❌ Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings recommendatory and don't expect everything to be addressed. Please review them carefully and fix those that actually improve code or fix bugs.

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

Fixes #2473

@pared marked this pull request as ready for review on December 6, 2019 12:18
@pared changed the title from "[WIP] remote: s3: adjust jobs number basing on file descriptors number" to "remote: s3: adjust jobs number basing on file descriptors number" on Dec 6, 2019
@pared requested a review from efiop on December 6, 2019 12:19
@pared requested a review from Suor on December 9, 2019 15:53
@pared (Contributor, Author) commented Dec 9, 2019

Note:
If we agree that this is how we want to handle s3, I will have to prepare a change to the push and pull docs.

@shcheklein (Member) commented:
Sorry guys, I missed the discussion in the ticket. @pared could you please summarize the changes and show a gif of how it would look?

My 2 cents: it feels a bit like we are over-architecting this if we try to analyze the system in advance. Why don't we just handle this gracefully and show a message in the logs/on screen saying "you are hitting the limit blah-blah ..."?

@pared (Contributor, Author) commented Dec 10, 2019

@shcheklein the whole implementation is a result of our discussion in the original issue.
We wanted to achieve two goals:

  1. Prevent "Too many open files" from occurring.
  2. Keep the default jobs number, which is pretty high (16), to make the default experience better.

The problem is that we cannot say with 100% certainty that this error will occur; it depends heavily on the size of the particular files and the speed of the network. We know the error might occur when the number of jobs exceeds some (estimable) threshold. However, the user might be aware of that and purposefully increase the number of jobs, knowing that for some reason (e.g. a super-fast network) it does not apply to them. So what I want to introduce in this change is (a rough sketch of the limit-based estimate follows the list below):

  1. Automatically decrease the default number of jobs if the user did not provide one, just to make their life easier. Example:
    https://asciinema.org/a/286748
    (I added a print showing that it is actually decreasing)

  2. If everything seems fine, just use the default jobs number:
    https://asciinema.org/a/286749
    (this print will not be there)

  3. If the user defines jobs explicitly, warn them:
    https://asciinema.org/a/286750
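(Not part of the PR itself: a minimal sketch of how such a limit-based estimate could work on POSIX systems. The FDS_PER_JOB factor, the headroom value, and the function name are illustrative assumptions, not the values or code used in this change.)

    import resource  # POSIX-only; Windows would need a different mechanism

    # Assumption for illustration: each transfer job may keep a handful of
    # descriptors open at once (source file, sockets, temporary files).
    FDS_PER_JOB = 4
    DEFAULT_JOBS = 16


    def suggested_jobs(requested_jobs=None):
        """Pick a jobs count that should stay under the open-file limit."""
        soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
        # Leave headroom for descriptors dvc itself keeps open (state db, logs, ...).
        estimated_max = max(1, (soft_limit - 32) // FDS_PER_JOB)

        if requested_jobs is None:
            # No explicit --jobs: quietly lower the default if the limit is low.
            return min(DEFAULT_JOBS, estimated_max)
        # Explicit --jobs: respect it; the caller may only warn if it looks too high.
        return requested_jobs

Whether the default actually gets lowered depends on the chosen per-job factor and the system's soft limit, which is exactly the estimation uncertainty discussed above.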

@Suor (Contributor) commented Dec 10, 2019

I don't like this approach. It is:

  • fragile: if something is happening in the background (like a backup or torrents), the user will still get "too many open files"
  • it bothers the user with warnings
  • it still stops on errors, which makes it unreliable to run automatically unless you use a conservative jobs value.

So I would prefer dvc to retry on errors and adjust jobs automatically. Ideally we wouldn't need the jobs param at all.

@pared (Contributor, Author) commented Dec 10, 2019

@Suor I don't like the retrying approach, because you cannot determine when or whether the operation will fail. So it is possible that someone will get an error after a few minutes of execution. I think that would be really bad, because the user would need to reduce the number of jobs or increase the fd limit, retry, and hope for the best. That could be frustrating if faced a few times.

@shcheklein (Member) commented:

@pared thanks for the clarifications! Even though I tend to agree with @Suor that our future direction should be more aggressive (not failing, but adjusting dynamically), I think we should move forward with this (especially since it was discussed by a few members of the team) and move @Suor's general suggestion to a separate discussion to approach it holistically.

@efiop just to clarify (since I missed the discussion), what's your take on this?

@efiop (Contributor) commented Dec 11, 2019

> fragile: if something is happening in the background (like a backup or torrents), the user will still get "too many open files"

@Suor ulimit is per-session, your torrents (aye-aye, captain Alex! βš“οΈ 🚒 πŸ”« ☠️) and backups are in a different session, so they won't affect your dvc pull.

> it bothers the user with warnings
> it still stops on errors, which makes it unreliable to run automatically unless you use a conservative jobs value.

Agreed on these two. The warning can be replaced with a post-crash one (see the part about it below). I would probably put the emphasis on "increase your ulimit!!", since mac's default ulimit is just ridiculous and I was running into it constantly until I adjusted my ulimits.

> So I would prefer dvc to retry on errors and adjust jobs automatically. Ideally we wouldn't need the jobs param at all.

Ideally yes, but that is a different ticket. With ulimit it is hard to retry, as you can't really be sure it won't fail again, even after 10 minutes of waiting. I haven't tested it, so I'm not 100% sure whether that would be suitable for us; I guess it depends on where we are going to catch that exception.

@pared We've agreed to definitely catch that exception in main.py and show a meaningful warning to start with, right? I don't think anyone has any objections about that part, which should come as a separate PR.

dvc/remote/s3.py (outdated review thread, resolved)
@pared (Contributor, Author) commented Dec 11, 2019

@Suor @shcheklein @efiop
Ok, so to sum it up:

  1. Let's create a new issue where we will handle OSError 24 and post-fail communication, and maybe automatic adjustment of jobs after a failure.
  2. Modify the current change so that it does not log warnings, but keep the adjustment of the default jobs number.

How does that sound?

@pared (Contributor, Author) commented Dec 11, 2019

Note: I think I did not initially understand @Suor's point of view, and imagined it as the following workflow:
push -> fail -> push again (manually).
I was wrong here, right? In the end, we should let it fail and retry (automatically) upon the "Too many open files" error. That should not penalize the whole operation, because a push/pull operation as a whole is not an all-or-nothing operation; some of the files will still be pushed/pulled.

@Suor (Contributor) commented Dec 11, 2019

@pared the retry thing can definitely be separated and might need to be handled along with retries for other network-related reasons.

We can merge this as is or remove the warning; both are fine with me.

@pared yes, I meant automatic retries.

@efiop (Contributor) commented Dec 12, 2019

@pared It still feels like catching the error and printing a message to increase the ulimit brings more value here. I'm worried about automatic adjustment like this because it involves a lot of heuristics. Maybe it is indeed the only way to do it, but it feels like if I were the user, dvc telling me "increase your ulimit" would be more useful than the warning and the auto-throttle. Especially since we are mostly talking about mac, where the ulimit for open files is simply ridiculously low, so it makes total sense to just make the user increase it.

@pared force-pushed the 2473 branch 2 times, most recently from 5544120 to 3d3747f, on December 16, 2019 07:24
Review thread on the following code:

        self, from_info, to_info, exception, operation
    ):
        if isinstance(exception, OSError) and exception.errno == 24:
            raise TooManyOpenFilesError(exception)
Contributor:

I think it is better to handle this in main.py, as it might be raised elsewhere too, e.g. even on status.

Contributor:

So what I mean is that in main.py we have a few try/excepts around every CLI command. I would put an except there for this and just print a proper error. What do you think?

Contributor:

Ok, as @pared noted, this is not that trivial, because we catch Exception in upload/download, so it won't hit main.py...

Contributor:

Btw, this changes the flow from what we had before: previously it would catch, return 1, and not re-raise. Not sure about the consequences.

Comment on lines 352 to 354:

    "Operation failed due to too many open file descriptors. reduce "
    "the number of jobs or increase open file descriptors limit to "
    "prevent this.",
Contributor:

We should put the emphasis on "increase your ulimit", as that is the most reasonable approach. Also, it is better if we tell the user how to do it or refer to a doc.

@pared (Contributor, Author) commented Dec 16, 2019

It seems that handling exceptions coming from the parallel execution of tasks is causing some confusion.

  1. We have an ugly try/except Exception in _download_file and in upload.
  2. Until now we have thrown a generic [Download/Upload]Error after all operations if some of them failed.
  3. I think that on "Too many open files" we could raise the exception immediately.
  4. Trying to introduce separate "Too many open files" error handling in RemoteBASE caused me to scatter the error-handling logic between RemoteBASE and RemoteLOCAL, which obfuscates what is actually happening with the errors.

That brings me to the conclusion that we might need to introduce some kind of error handler to which we could delegate the error-handling logic.

@iterative/engineering What do you think about this idea?

[EDIT]
Here is how I see that handler working:

  1. RemoteBASE has some kind of BaseHandler that can take care of generic errors.
  2. Errors from the particular download/upload tasks would be submitted to it one by one.
  3. The handler would decide how severe the error is; if it is a problem like "Too many open files", it raises immediately.
  4. If the error is not severe, we just pass it down the line to BaseHandler, which in the end can fail with an Upload/DownloadError.

That way we preserve the "best effort" approach (download/upload as much as you can), and we can still fail in cases that we know will not yield a useful result ("Too many open files" means the fd limit is saturated, and probably all following operations will fail).
Also, we get the error-handling logic in one place, which is more readable. Each remote can have its own handler, so instead of writing lots of isinstance(exc, {Exception}) checks, we create an {X}Handler and just use it before BaseHandler (since the base handler, in the end, fails with a generic [Upload/Download]Error).
Another benefit is that exceptions could get passed up to main in their original form instead of an ambiguous [Upload/Download]Error.
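(A rough sketch of the delegation idea described above, not code from this PR; all class names and the generic exception below are hypothetical.)

    class TransferError(Exception):
        """Placeholder for the generic [Upload/Download]Error."""


    class BaseErrorHandler:
        """Fallback handler: 'best effort', fail with a generic error at the end."""

        def __init__(self):
            self.failed = 0

        def handle(self, exc):
            # Unspecified errors are non-fatal: count them, let other workers go on.
            self.failed += 1

        def finalize(self, operation):
            if self.failed:
                raise TransferError(
                    "{} files failed to {}".format(self.failed, operation)
                )


    class S3ErrorHandler(BaseErrorHandler):
        """Remote-specific handler consulted before the base one."""

        def handle(self, exc):
            # Severe: the fd limit is saturated, all following transfers will most
            # likely fail too, so stop the whole operation immediately.
            if isinstance(exc, OSError) and exc.errno == 24:
                raise exc
            # Anything else: delegate to the generic "best effort" behaviour.
            super().handle(exc)

A worker would call handler.handle(exc) inside its except block, and the caller would call handler.finalize("upload") once all tasks have finished.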

@efiop (Contributor) commented Dec 16, 2019

@pared So, for example, _download in RemoteS3 would catch its specific boto3.ClientError and maybe some other non-critical stuff so that other workers can keep working, while download in RemoteBASE would catch some generic issues (not sure which ones off the top of my head), and main.py would catch OSError for the ulimit issue, right? That way critical errors could propagate up and stop everything, while non-critical ones wouldn't interrupt other workers.

@pared (Contributor, Author) commented Dec 16, 2019

@efiop yes, that's what I have in mind.

I also can't come up with common errors for all remotes, but handling "unspecified" exceptions is common to all classes (raise [Upload/Download]Error after all operations are performed).

@efiop (Contributor) commented Dec 16, 2019

@pared Should be quite straightforward. E.g. for s3 it is ClientError; for other remotes it is also something specific that we can get straight from the docs. There will be a learning curve, but I think that is reasonable for the sake of proper error handling instead of the current lazy "except Exception" (my fault originally).

@pared (Contributor, Author) commented Dec 16, 2019

@efiop I agree that there will be a learning curve, though I believe that the current way of handling exceptions also requires some insight.

@efiop (Contributor) commented Dec 16, 2019

@pared Agreed. Ok, so from a quick google for upload/download:
s3 - ClientError
azure - ClientException
gs - ClientError
etc.
So it looks like this is pretty standardized across those clouds, which is nice. As a first step, it would be reasonable to count those as recoverable (meaning they would be logged and return 1, the same as we do right now with all exceptions in workers) and consider the rest non-recoverable; that way we could catch the ulimit error in main.py, as discussed earlier. What do you think?
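(For illustration only: a sketch of that first step with a per-remote table of recoverable exception types. The helper and the mapping are hypothetical; only botocore's ClientError is a real class here.)

    import logging

    from botocore.exceptions import ClientError  # s3; other remotes have their own

    logger = logging.getLogger(__name__)

    # Illustrative mapping; a real implementation would more likely declare this
    # as a class attribute on each Remote* class.
    RECOVERABLE_EXCEPTIONS = {
        "s3": (ClientError,),
        # "azure": (ClientException,), "gs": (ClientError,), ...
    }


    def transfer_one(remote_name, transfer_func, from_path, to_path):
        """Run a single transfer task inside a worker.

        Recoverable failures are logged and reported as 1 (same as today);
        anything else (e.g. OSError 24) propagates and stops the operation.
        """
        try:
            transfer_func(from_path, to_path)
            return 0
        except RECOVERABLE_EXCEPTIONS.get(remote_name, ()):
            logger.exception("failed to transfer '%s' to '%s'", from_path, to_path)
            return 1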

@pared (Contributor, Author) commented Dec 16, 2019

@efiop I would actually leave errors as recoverable by default, and implement logic only for those that we think should fail the whole operation; that seems in line with the logic we applied until this issue came up.

@efiop (Contributor) commented Dec 16, 2019

@pared Right, that is basically what you are doing right now. Makes sense. It seems like we agree that the throttling magic is too fragile, so maybe let's repurpose this PR to only properly handle the ulimit issue? If so, one thing I would do is not create that exception, but rather catch it as an OSError in main.py and logger.error() it with that message instead. It doesn't look like we really need a special exception for it.
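(A minimal sketch of what such a catch in main.py might look like; run_command is a placeholder, not dvc's real command dispatch, and errno.EMFILE is the symbolic name for error 24.)

    import errno
    import logging

    logger = logging.getLogger("dvc")


    def run_command(argv=None):
        # Placeholder standing in for dvc's actual command dispatch; here it just
        # simulates hitting the open-file limit.
        raise OSError(errno.EMFILE, "Too many open files")


    def main(argv=None):
        try:
            return run_command(argv)
        except OSError as exc:
            if exc.errno == errno.EMFILE:
                logger.error("too many open files, please increase your `ulimit`")
                return 1
            raise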

dvc/main.py (outdated), comment on lines 70 to 73:

    if os.name != "nt":
        solution_hint = (
            ", increase open file descriptors limit with `ulimit`"
        )
Contributor:

The command is not complete, and it is different for mac and linux 🙂 Windows also has ways to increase the limit. Without the hint this message is useless, as it just duplicates what the exception says.

Contributor:

Also, we might consider just referring to our docs (e.g. the user guide) and explaining nicely what is going on, how to solve it on different systems, and why increasing the ulimit is better for dvc.


    with pytest.raises(OSError) as e:
        dvc.push()
    assert e is too_many_open_files_error
Contributor:
why not just assert e.errno == 24?

    tmp_dir.dvc_gen({"file": "file content"})

    too_many_open_files_error = OSError()
    mocker.patch.object(too_many_open_files_error, "errno", 24)
Contributor:

You can just do exc.errno = 24, no need to mock. Or am I missing something?

    too_many_open_files_error = OSError()
    mocker.patch.object(too_many_open_files_error, "errno", 24)
    mocker.patch.object(
        RemoteLOCAL, "_upload", side_effect=too_many_open_files_error
Contributor:

I'm pretty sure you can just:

Suggested change:
-    RemoteLOCAL, "_upload", side_effect=too_many_open_files_error
+    RemoteLOCAL, "_upload", side_effect=OSError(24)

dvc/main.py (review thread, resolved)
dvc/main.py (outdated), comment on lines 70 to 71:

    "too many open files error, for solution, please refer to our "
    "user guide"
    )
Contributor:

It would look like "ERROR: too many open files error ...", which is odd. How about:

too many open files, please increase your ulimit. Refer to https://dvc.org/doc/user-guide/troubleshooting for more info.

This way there is a placeholder non-verbose solution and a referral to the docs.

Contributor:

We could skip the troubleshooting part of the URL for now, though. Not ideal, but at least we won't promise anything here 🙂

Contributor (Author):

Isn't ulimit Unix-specific? That message would be ambiguous for Windows users.

Contributor:

@pared Yes, but it is generic enough that it gets the point across. A more elaborate explanation will be in the linked doc. Your original message was similar but lacked the doc link.

dvc/main.py (outdated review thread, resolved)
Co-Authored-By: Ruslan Kuprieiev <kupruser@gmail.com>
@efiop (Contributor) commented Dec 17, 2019

For the record: the mac pkg build is failing because of unrelated issues with ruby.

@efiop merged commit 85b1721 into iterative:master on Dec 17, 2019
@pared deleted the 2473 branch on March 24, 2020 09:35
Successfully merging this pull request may close these issues:

'Errno 24 - Too many open files' on dvc push (#2473)

4 participants