Pull extremely slow on ~400GB of data with hot DVC cache #3261
Comments
Looking into this further, it appears DVC is wrongly treating the binary files as text files (via istextfile.py). This leads to the huge amount of |
It was a known issue, but I didn't follow up on it 😞 Take a look at #1970 (comment) |
Ouch, our heuristic is similar to git's and only checks the first N bytes to determine whether a file is binary or text. Related: #992. That being said, current |
Ok, got it, I made a mistake in the implementation:
[EDIT] |
@kevlar1818 do you set up a fresh Docker container each time you use this repo? Or is it your development environment, which persists between uses? I think we should consider trusting the cache, or provide an option allowing the user to do that. |
@kevlar1818 please rerun |
@pared Rerunning pull on the same repo (not a new clone) seems to produce the same runtime as the first pull. The hope is that the repo can be cloned in a new docker container whenever a developer wants to work on the project. So I'd bin our use case in your first description above. That being said, all developers share the |
@kevlar1818 And could you show prof results for that one too, please? |
@efiop I'll have to re-run with profiling enabled on Monday. |
@kevlar1818 One more thing, could you please show |
|
@kevlar1818 Oh, sorry, forgot to mention that it would be great to |
|
Thanks @kevlar1818 ! Btw, looks like one more line is missing from that output. The one for But now that we see that it is nfs, I suspect that it is probably the cause here. Need to check that maybe |
@kevlar1818 Also, what is the remote that you are |
@kevlar1818 Btw, what we could also do here is have a video call together and debug on the spot. It should be much faster for everyone. 🙂 Would that work for you? If it would, please ping me through the email (see it in my github profile) and we'll arrange something ASAP. |
@efiop That was the entire output of
The remote is a MinIO (S3) instance on the same LAN. |
@efiop As for a video call, let me get back to you on that. I'd have to carve out some time at work. |
I have a similar issue, even on recent DVC (after #3200 was merged).
In my use case, I'd happily make the cache read-only if that offers additional guarantees against corruption, as this is a CI environment and nothing is expected to alter the data. |
btw from the OP
from what I can see in the profiler results, (related: #3060) |
@kevlar1818 Really sorry for this taking so long, we have a lot on our plate these days. Thanks to amazing research by @pared, I've created #3472, which does some tricks with read-only permissions so that it can trust any cache. Please consider trying it out with
and let us know if that helps with the issue or not. Thanks for the feedback! |
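To illustrate the read-only idea mentioned above, here is a rough sketch of how a cache check could skip re-hashing files that nobody is allowed to write to. This is not the implementation in #3472; the function names `is_read_only` and `cache_entry_is_valid`, and the rule itself, are assumptions made for the example.

```python
import hashlib
import os
import stat
import tempfile


def md5_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files are never loaded whole."""
    md5 = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def is_read_only(path):
    """True if no write bit is set for user, group, or others."""
    mode = os.stat(path).st_mode
    return not mode & (stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH)


def cache_entry_is_valid(path, expected_md5):
    """Hypothetical rule: trust read-only entries, re-hash writable ones."""
    if is_read_only(path):
        return True  # assumed unmodified since it was protected
    return md5_of(path) == expected_md5


# Demo on a temporary file standing in for a cache entry.
fd, entry = tempfile.mkstemp()
with os.fdopen(fd, "wb") as fobj:
    fobj.write(b"cached data")
checksum = md5_of(entry)
os.chmod(entry, 0o444)  # protect the entry
print(cache_entry_is_valid(entry, checksum))
os.chmod(entry, 0o644)  # restore write access so cleanup works everywhere
os.remove(entry)
```

The trade-off in a scheme like this is that the permission bit only signals intent: anything that changes the mode, edits the file, and restores the mode would go unnoticed, which is why it suits environments such as CI where nothing is expected to alter the data.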
* dvc: use protected mode by default
  Fixes #3261
  Fixes #2041
* unprotect: adjust help message
  https://github.com/iterative/dvc.org/pull/1058/files/21aab371a487acf6f6e6201b29cd832e7c55ed23#r393324668
@kevlar1818 The fix was released in 0.90, please give it a try and let us know if that fixed the issue. Thanks for the feedback and your initial PR! 🙏 |
OS: Docker image based off of tensorflow/tensorflow:latest-gpu-py3, with Python 3.7.5 as the system Python.
Initial setup:
DVC cache configuration:
Please note that the DVC cache is hot. In other words, most if not all files for dvc fetch are present and up-to-date at /ssd/...
Make a fresh clone and profile dvc pull:
This dvc pull, uninstrumented, usually takes 40+ minutes with a hot DVC cache.
Count the number of DVC-tracked files (symlinks, see the above config) and the total size of the repo:
Looking at the dvc_pull.prof (in KCachegrind) suggests that the bottleneck is the checksum process. The file_md5 and dos2unix functions in utils/__init__.py appear particularly costly.
Is this a known issue? Would the primary authors of DVC entertain a more performant version of file_md5 (perhaps written in C/C++ and without TQDM integration)?
dvc_pull_prof.zip