
remote: s3: adjust jobs number basing on file descriptors number #2866

Merged: 8 commits merged into iterative:master on Dec 17, 2019

Conversation

@pared (Contributor) commented Nov 29, 2019

  • ❗ Have you followed the guidelines in the Contributing to DVC list?

  • πŸ“– Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.

  • ❌ Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings recommendatory and don't expect everything to be addressed. Please review them carefully and fix those that actually improve code or fix bugs.

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

Fixes #2473

@pared marked this pull request as ready for review on December 6, 2019 12:18
@pared changed the title from "[WIP] remote: s3: adjust jobs number basing on file descriptors number" to "remote: s3: adjust jobs number basing on file descriptors number" on Dec 6, 2019
@pared requested a review from efiop on December 6, 2019 12:19
@pared requested a review from Suor on December 9, 2019 15:53
@pared (Contributor, Author) commented Dec 9, 2019

Note:
If we agree that this is how we want to handle s3, I will have to prepare a change to the push and pull docs.

@shcheklein (Member) commented:
Sorry guys, I missed the discussion in the ticket. @pared could you please summarize the changes and show a gif of how it would look?

My 2 cents: it feels a bit like we are over-architecting this if we try to analyze the system in advance. Why don't we just handle this gracefully and show a message in the logs/on screen saying "you are hitting the limit blah-blah ..."?

@pared (Contributor, Author) commented Dec 10, 2019

@shcheklein the whole implementation is a result of our discussion in the original issue.
We wanted to achieve two goals:

  1. Prevent "Too many open files" from occurring.
  2. Keep the default jobs number, which is pretty high (16), to make the default experience better.

The problem is that we cannot say with 100% certainty that this error will occur; it depends heavily on the size of the particular files and the speed of the network. We know the error might occur when the number of jobs exceeds some (estimable) threshold. However, the user might be aware of that and purposefully increase the number of jobs, knowing that for some reason (e.g. a super-fast network) it does not apply to them. So what I want to introduce in this change is (a rough sketch of the limit-based estimate follows the list below):

  1. Automatically decrease the default number of jobs if the user did not provide one, just to make their life easier. Example:
    https://asciinema.org/a/286748
    (I added a print showing that it is actually decreasing)

  2. If everything seems fine, just use the default jobs number:
    https://asciinema.org/a/286749
    (this print will not be there)

  3. If the user defines jobs explicitly, warn them:
    https://asciinema.org/a/286750
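(Not part of the PR itself: a minimal sketch of how such a limit-based estimate could work on POSIX systems. The FDS_PER_JOB factor, the headroom value, and the function name are illustrative assumptions, not the values or code used in this change.)

    import resource  # POSIX-only; Windows would need a different mechanism

    # Assumption for illustration: each transfer job may keep a handful of
    # descriptors open at once (source file, sockets, temporary files).
    FDS_PER_JOB = 4
    DEFAULT_JOBS = 16


    def suggested_jobs(requested_jobs=None):
        """Pick a jobs count that should stay under the open-file limit."""
        soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
        # Leave headroom for descriptors dvc itself keeps open (state db, logs, ...).
        estimated_max = max(1, (soft_limit - 32) // FDS_PER_JOB)

        if requested_jobs is None:
            # No explicit --jobs: quietly lower the default if the limit is low.
            return min(DEFAULT_JOBS, estimated_max)
        # Explicit --jobs: respect it; the caller may only warn if it looks too high.
        return requested_jobs

Whether the default actually gets lowered depends on the chosen per-job factor and the system's soft limit, which is exactly the estimation uncertainty discussed above.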

@Suor (Contributor) commented Dec 10, 2019

I don't like this approach. It is:

  • fragile: if something is happening in the background (like a backup or torrents), the user will still get "too many open files"
  • it bothers the user with warnings
  • it still stops on errors, which makes it unreliable to run automatically unless you use a conservative jobs value.

So I would prefer dvc to retry on errors and adjust jobs automatically. Ideally we wouldn't need the jobs param at all.

@pared (Contributor, Author) commented Dec 10, 2019

@Suor I don't like the retrying approach, because you cannot determine when or whether the operation will fail. So it is possible that someone will get an error after a few minutes of execution. I think that would be really bad, because the user would need to reduce the number of jobs or increase the fd limit, retry, and hope for the best. That could be frustrating if faced a few times.

@shcheklein (Member) commented:

@pared thanks for the clarifications! Even though I tend to agree with @Suor that our future direction should be more aggressive (not failing, but adjusting dynamically), I think we should move forward with this (especially since it was discussed by a few members of the team) and move @Suor's general suggestion to a separate discussion to approach it holistically.

@efiop just to clarify (since I missed the discussion), what's your take on this?

@efiop (Contributor) commented Dec 11, 2019

> fragile: if something is happening in the background (like a backup or torrents), the user will still get "too many open files"

@Suor ulimit is per-session, your torrents (aye-aye, captain Alex! βš“οΈ 🚒 πŸ”« ☠️) and backups are in a different session, so they won't affect your dvc pull.

> it bothers the user with warnings
> it still stops on errors, which makes it unreliable to run automatically unless you use a conservative jobs value.

Agreed on these two. The warning can be replaced with a post-crash one (see the part about it below). I would probably put the emphasis on "increase your ulimit!!", since mac's default ulimit is just ridiculous and I was running into it constantly until I adjusted my ulimits.

> So I would prefer dvc to retry on errors and adjust jobs automatically. Ideally we wouldn't need the jobs param at all.

Ideally yes, but that is a different ticket. With ulimit it is hard to retry, as you can't really be sure it won't fail again, even after 10 minutes of waiting. I haven't tested it, so I'm not 100% sure whether that would be suitable for us; I guess it depends on where we are going to catch that exception.

@pared We've agreed to definitely catch that exception in main.py and show a meaningful warning to start with, right? I don't think anyone has any objections about that part, which should come as a separate PR.

dvc/remote/s3.py (outdated review thread, resolved)
@pared (Contributor, Author) commented Dec 11, 2019

@Suor @shcheklein @efiop
Ok, so to sum it up:

  1. Let's create a new issue where we will handle OSError 24 and post-fail communication, and maybe automatic adjustment of jobs after a failure.
  2. Modify the current change so that it does not log warnings, but keep the adjustment of the default jobs number.

How does that sound?

@pared (Contributor, Author) commented Dec 11, 2019

Note: I think I did not initially understand @Suor's point of view, and imagined it as the following workflow:
push -> fail -> push again (manually).
I was wrong here, right? In the end, we should let it fail and retry (automatically) upon the "Too many open files" error. That should not penalize the whole operation, because a push/pull operation as a whole is not an all-or-nothing operation; some of the files will still be pushed/pulled.

@Suor (Contributor) commented Dec 11, 2019

@pared the retry thing can definitely be separated and might need to be handled along with retries for other network-related reasons.

We can merge this as is or remove the warning; both are fine with me.

@pared yes, I meant automatic retries.

@efiop (Contributor) commented Dec 12, 2019

@pared It still feels like catching the error and printing a message to increase the ulimit brings more value here. I'm worried about automatic adjustment like this because it involves a lot of heuristics. Maybe it is indeed the only way to do it, but it feels like if I were the user, dvc telling me "increase your ulimit" would be more useful than the warning and the auto-throttle. Especially since we are mostly talking about mac, where the ulimit for open files is simply ridiculously low, so it makes total sense to just make the user increase it.

@pared force-pushed the 2473 branch 2 times, most recently from 5544120 to 3d3747f, on December 16, 2019 07:24
Review thread on the following code:

        self, from_info, to_info, exception, operation
    ):
        if isinstance(exception, OSError) and exception.errno == 24:
            raise TooManyOpenFilesError(exception)
Contributor:

I think it is better to handle this in main.py, as it might be raised elsewhere too, e.g. even on status.

Contributor:

So what I mean is that in main.py we have a few try/excepts around every CLI command. I would put an except there for this and just print a proper error. What do you think?

Contributor:

Ok, as @pared noted, this is not that trivial, because we catch Exception in upload/download, so it won't hit main.py...

Contributor:

Btw, this changes the flow from what we had before: previously it would catch, return 1, and not re-raise. Not sure about the consequences.

Comment on lines 352 to 354:

    "Operation failed due to too many open file descriptors. reduce "
    "the number of jobs or increase open file descriptors limit to "
    "prevent this.",
Contributor:

We should put the emphasis on "increase your ulimit", as that is the most reasonable approach. Also, it is better if we tell the user how to do it or refer to a doc.

@pared (Contributor, Author) commented Dec 16, 2019

It seems that handling exceptions coming from the parallel execution of tasks is causing some confusion.

  1. We have an ugly try/except Exception in _download_file and in upload.
  2. Until now we have thrown a generic [Download/Upload]Error after all operations if some of them failed.
  3. I think that on "Too many open files" we could raise the exception immediately.
  4. Trying to introduce separate "Too many open files" error handling in RemoteBASE caused me to scatter the error-handling logic between RemoteBASE and RemoteLOCAL, which obfuscates what is actually happening with the errors.

That brings me to the conclusion that we might need to introduce some kind of error handler to which we could delegate the error-handling logic.

@iterative/engineering What do you think about this idea?

[EDIT]
Here is how I see that handler working:

  1. RemoteBASE has some kind of BaseHandler that can take care of generic errors.
  2. Errors from the particular download/upload tasks would be submitted to it one by one.
  3. The handler would decide how severe the error is; if it is a problem like "Too many open files", it raises immediately.
  4. If the error is not severe, we just pass it down the line to BaseHandler, which in the end can fail with an Upload/DownloadError.

That way we preserve the "best effort" approach (download/upload as much as you can), and we can still fail in cases that we know will not yield a useful result ("Too many open files" means the fd limit is saturated, and probably all following operations will fail).
Also, we get the error-handling logic in one place, which is more readable. Each remote can have its own handler, so instead of writing lots of isinstance(exc, {Exception}) checks, we create an {X}Handler and just use it before BaseHandler (since the base handler, in the end, fails with a generic [Upload/Download]Error).
Another benefit is that exceptions could get passed up to main in their original form instead of an ambiguous [Upload/Download]Error.
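(A rough sketch of the delegation idea described above, not code from this PR; all class names and the generic exception below are hypothetical.)

    class TransferError(Exception):
        """Placeholder for the generic [Upload/Download]Error."""


    class BaseErrorHandler:
        """Fallback handler: 'best effort', fail with a generic error at the end."""

        def __init__(self):
            self.failed = 0

        def handle(self, exc):
            # Unspecified errors are non-fatal: count them, let other workers go on.
            self.failed += 1

        def finalize(self, operation):
            if self.failed:
                raise TransferError(
                    "{} files failed to {}".format(self.failed, operation)
                )


    class S3ErrorHandler(BaseErrorHandler):
        """Remote-specific handler consulted before the base one."""

        def handle(self, exc):
            # Severe: the fd limit is saturated, all following transfers will most
            # likely fail too, so stop the whole operation immediately.
            if isinstance(exc, OSError) and exc.errno == 24:
                raise exc
            # Anything else: delegate to the generic "best effort" behaviour.
            super().handle(exc)

A worker would call handler.handle(exc) inside its except block, and the caller would call handler.finalize("upload") once all tasks have finished.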

@efiop (Contributor) commented Dec 16, 2019

@pared So, for example, _download in RemoteS3 would catch its specific boto3.ClientError and maybe some other non-critical stuff so that other workers can keep working, while download in RemoteBASE would catch some generic issues (not sure which ones off the top of my head), and main.py would catch OSError for the ulimit issue, right? That way critical errors could propagate up and stop everything, while non-critical ones wouldn't interrupt other workers.

@pared (Contributor, Author) commented Dec 16, 2019

@efiop yes, that's what I have in mind.

I also can't come up with common errors for all remotes, but handling "unspecified" exceptions is common to all classes (raise [Upload/Download]Error after all operations are performed).

@efiop (Contributor) commented Dec 16, 2019

@pared Should be quite straightforward. E.g. for s3 it is ClientError; for other remotes it is also something specific that we can get straight from the docs. There will be a learning curve, but I think that is reasonable for the sake of proper error handling instead of the current lazy "except Exception" (my fault originally).

@pared (Contributor, Author) commented Dec 16, 2019

@efiop I agree that there will be a learning curve, though I believe that the current way of handling exceptions also requires some insight.

@efiop (Contributor) commented Dec 16, 2019

@pared Agreed. Ok, so from a quick google for upload/download:
s3 - ClientError
azure - ClientException
gs - ClientError
etc.
So it looks like this is pretty standardized across those clouds, which is nice. As a first step, it would be reasonable to count those as recoverable (meaning they would be logged and return 1, the same as we do right now with all exceptions in workers) and consider the rest non-recoverable; that way we could catch the ulimit error in main.py, as discussed earlier. What do you think?
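(For illustration only: a sketch of that first step with a per-remote table of recoverable exception types. The helper and the mapping are hypothetical; only botocore's ClientError is a real class here.)

    import logging

    from botocore.exceptions import ClientError  # s3; other remotes have their own

    logger = logging.getLogger(__name__)

    # Illustrative mapping; a real implementation would more likely declare this
    # as a class attribute on each Remote* class.
    RECOVERABLE_EXCEPTIONS = {
        "s3": (ClientError,),
        # "azure": (ClientException,), "gs": (ClientError,), ...
    }


    def transfer_one(remote_name, transfer_func, from_path, to_path):
        """Run a single transfer task inside a worker.

        Recoverable failures are logged and reported as 1 (same as today);
        anything else (e.g. OSError 24) propagates and stops the operation.
        """
        try:
            transfer_func(from_path, to_path)
            return 0
        except RECOVERABLE_EXCEPTIONS.get(remote_name, ()):
            logger.exception("failed to transfer '%s' to '%s'", from_path, to_path)
            return 1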

@pared (Contributor, Author) commented Dec 16, 2019

@efiop I would actually leave errors as recoverable by default, and implement logic only for those that we think should fail the whole operation; that seems in line with the logic we applied until this issue came up.

@efiop (Contributor) commented Dec 16, 2019

@pared Right, that is basically what you are doing right now. Makes sense. It seems like we agree that the throttling magic is too fragile, so maybe let's repurpose this PR to only properly handle the ulimit issue? If so, one thing I would do is not create that exception, but rather catch it as an OSError in main.py and logger.error() it with that message instead. It doesn't look like we really need a special exception for it.
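(A minimal sketch of what such a catch in main.py might look like; run_command is a placeholder, not dvc's real command dispatch, and errno.EMFILE is the symbolic name for error 24.)

    import errno
    import logging

    logger = logging.getLogger("dvc")


    def run_command(argv=None):
        # Placeholder standing in for dvc's actual command dispatch; here it just
        # simulates hitting the open-file limit.
        raise OSError(errno.EMFILE, "Too many open files")


    def main(argv=None):
        try:
            return run_command(argv)
        except OSError as exc:
            if exc.errno == errno.EMFILE:
                logger.error("too many open files, please increase your `ulimit`")
                return 1
            raise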

dvc/main.py (outdated), comment on lines 70 to 73:

    if os.name != "nt":
        solution_hint = (
            ", increase open file descriptors limit with `ulimit`"
        )
Contributor:

The command is not complete, and it is different for mac and linux 🙂 Windows also has ways to increase the limit. Without the hint this message is useless, as it just duplicates what the exception says.

Contributor:

Also, we might consider just referring to our docs (e.g. the user guide) and explaining nicely what is going on, how to solve it on different systems, and why increasing the ulimit is better for dvc.


    with pytest.raises(OSError) as e:
        dvc.push()
    assert e is too_many_open_files_error
Contributor:
why not just assert e.errno == 24?

    tmp_dir.dvc_gen({"file": "file content"})

    too_many_open_files_error = OSError()
    mocker.patch.object(too_many_open_files_error, "errno", 24)
Contributor:

You can just do exc.errno = 24, no need to mock. Or am I missing something?

    too_many_open_files_error = OSError()
    mocker.patch.object(too_many_open_files_error, "errno", 24)
    mocker.patch.object(
        RemoteLOCAL, "_upload", side_effect=too_many_open_files_error
Contributor:

I'm pretty sure you can just:

Suggested change:
-    RemoteLOCAL, "_upload", side_effect=too_many_open_files_error
+    RemoteLOCAL, "_upload", side_effect=OSError(24)

dvc/main.py (review thread, resolved)
dvc/main.py (outdated), comment on lines 70 to 71:

    "too many open files error, for solution, please refer to our "
    "user guide"
    )
Contributor:

It would look like "ERROR: too many open files error ...", which is odd. How about:

too many open files, please increase your ulimit. Refer to https://dvc.org/doc/user-guide/troubleshooting for more info.

This way there is a placeholder non-verbose solution and a referral to the docs.

Contributor:

We could skip the troubleshooting part of the URL for now, though. Not ideal, but at least we won't promise anything here 🙂

Contributor (Author):

Isn't ulimit Unix-specific? That message would be ambiguous for Windows users.

Contributor:

@pared Yes, but it is generic enough that it gets the point across. A more elaborate explanation will be in the linked doc. Your original message was similar but lacked the doc link.

dvc/main.py (outdated review thread, resolved)
Co-Authored-By: Ruslan Kuprieiev <kupruser@gmail.com>
@efiop (Contributor) commented Dec 17, 2019

For the record: the mac pkg build is failing because of unrelated issues with ruby.

@efiop merged commit 85b1721 into iterative:master on Dec 17, 2019
@pared deleted the 2473 branch on March 24, 2020 09:35
Successfully merging this pull request may close these issues:

'Errno 24 - Too many open files' on dvc push (#2473)

4 participants