push/pull: fails to finish querying the remote (azure blob) #7337
Comments
Some new info:
So it seems that the connection is indeed closed from the server side. Why that happens I have no idea, because from our end it looks like dvc is simply sending requests to the blob storage continuously in order to fully list it.

I have also tested whether pulling data from the remote gives any trouble on my machine, as my colleague has problems with that as well. It seems to work on my end, but I spotted a strange difference when going through the logs I included below. Here the estimated remote size is ~500k, which is less than half of the estimated size my colleague got when pushing (see the logs in my previous comment). Even so, dvc decided to query the 59k images in this data set separately instead of querying via traverse (for some reason this is not visible in the log, but I saw the progress bar counting up to 59k before the files began downloading).

This all seems very weird to me: with more files to query and a smaller estimated remote size (compared to the push from my previous comment), it would only be logical for dvc to choose the traverse method here as well. What's more, the query for 59k separate files was over in seconds on my machine, while querying via traverse took minutes when I tried to push a data set earlier. With all this I just don't understand why dvc chooses to query via traverse when doing a push. Could this be part of the reason we hit a timeout, i.e. that the estimated cost of querying the remote is somehow wrong?
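For what it's worth, here is a rough sketch of the kind of size-based trade-off such a heuristic could be making. The function name, constants and cost model are purely illustrative assumptions, not DVC's actual implementation. With a naive request-count model like this, traverse looks far cheaper for these numbers, even though in practice the 59k individual existence checks finished in seconds; if the real heuristic reasons along similar lines, that mismatch between request counts and actual request cost could explain the surprising choice.

    # Illustrative only: the function name, constants and cost model below are
    # assumptions for the sake of discussion, not DVC's actual code.
    def choose_query_method(n_hashes_to_check: int,
                            estimated_remote_size: int,
                            jobs: int = 8,
                            list_page_size: int = 5000) -> str:
        # object_exists: one existence request per hash, spread over `jobs` workers
        exists_requests = n_hashes_to_check / jobs
        # traverse: list the whole remote, one request per page of results
        traverse_requests = estimated_remote_size / list_page_size
        return "traverse" if traverse_requests < exists_requests else "object_exists"

    # Numbers from the logs discussed above: ~59k hashes to check, ~500k estimated objects.
    print(choose_query_method(59_000, 500_000))  # a naive request count favours "traverse"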
Hi @rubenpraets, what's the CPU count on the machine you're running push/pull from? You can check with: python -c "from multiprocessing import cpu_count; print(cpu_count())"
From my colleague's machine (the one having trouble) we get 8.
I'm working on a project that stores a very big DVC cache in S3. What might help is reducing the job count with the -j parameter, or pushing one stage at a time, but this is of course no reliable answer, sorry.
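Purely as an illustration of that suggestion (the job count of 4 is an arbitrary example, not a recommended value):

    dvc push -j 4
    dvc pull -j 4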
Hi @rubenpraets, sorry for getting back to you so late. I've had a look at the issue, and I think it might be related to the azure python sdk that dvc uses under the hood. I found a related issue here (Azure/azure-sdk-for-python#17974), but it seems it has already been solved.
@dtrifiro Thanks for the pointers, I will look into them shortly. Even though the error might be caused by underlying libraries, I do want to stress that your heuristic for choosing between querying via object_exists or traverse seems to be off as well, at least for azure. It seems like in this specific situation object_exists is many times faster than traverse (which is chosen here), and the long-running traverse connections are what indirectly cause the errors. You might want to look into this as well; it would greatly improve the usability of dvc.
closing as stale |
Bug Report
Description
When trying to push a new directory structure containing relatively many small files (e.g. 18k files of ~50 kB) to our azure blob storage remote (containing ~1.2 million files), the operation quite reliably fails while querying the remote for existing hashes. See below for a full stack trace.
On my colleague's PC, the error always happens after, say, 6-8 minutes of happily querying away, when the operation is nearly finished. Occasionally such a push succeeds, but that is maybe once in every 10 tries, if not less often.
On my own PC I have had the same problem in the past (albeit less reliably), but right now I am unable to reproduce the issue. The query step also went a little faster on my machine, completing in about 5 minutes.
My colleague reports the same problem when trying to pull data from the remote, where it likewise fails while querying for hashes.
Reproduce
Probably hard to reproduce, but say that ./data is a directory with a lot of new files to push to a remote containing many more files already:
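A minimal sketch of the commands, assuming the Azure remote is already configured as the default remote:

    dvc add data
    dvc push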
Expected
The new data is successfully pushed to the remote.
Environment information
Output of dvc doctor:

Additional Information (if any):
We have already tried upgrading the libraries involved in the stack trace to the newest versions, to no avail. If anything, this has led to more errors being printed about unavailable link types and the like, but I don't immediately think those are related to the problem above.
On the aiohttp GitHub there are a number of issues regarding the same error we get, aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed. In particular, the last part of the discussion at aio-libs/aiohttp#3904 caught my eye, as there is mention of both azure blob storage and connections that are closed after 5 minutes. The supposed fix is to add the Connection: keep-alive header to your requests, so I went digging. We altered the aiohttp code to add this header to all requests. At first glance this seemed to have solved the problem, but sadly it is back again. I don't know if it would be possible to do this in a cleaner way from the dvc code (a sketch of a session-level approach is included further below), but I'm willing to try if it works.

As a last resort I told my colleague to push new files in smaller chunks. This triggers dvc to query single hashes from the remote (via object_exists) instead of listing it entirely (via traverse), which seems to work for now. I'm not sure, though, whether this approach would run into the same problem if it had to query a larger number of files and went over the 5 minute mark.
Anyway, my colleague is now forced to upload data in chunks of ~2000 files, which is really not practical considering that he routinely needs to upload tens of thousands of files. I hope you can help us out, but if not, we would appreciate any pointers as to where we might find help.
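Regarding the Connection: keep-alive workaround mentioned above, here is a minimal sketch of how such a header can be set at the aiohttp session level instead of patching aiohttp internals. Whether a header set this way actually reaches the requests issued by the azure sdk depends on how that sdk builds its transport, so treat this purely as an illustration.

    import asyncio
    import aiohttp

    # Minimal sketch, not DVC's actual code: setting the header once on the
    # session makes every request carry Connection: keep-alive.
    async def head_blob(url: str) -> int:
        async with aiohttp.ClientSession(
            headers={"Connection": "keep-alive"}
        ) as session:
            async with session.head(url) as resp:
                return resp.status

    # Example call (placeholder URL):
    # asyncio.run(head_blob("https://<account>.blob.core.windows.net/<container>/<blob>"))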
Verbose output from a dvc push (after updating libraries):