Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pull extrememly slow on ~400GB of data with hot DVC cache #3261

Closed
kevin-hanselman opened this issue Jan 30, 2020 · 37 comments · Fixed by #3472
Closed

Pull extrememly slow on ~400GB of data with hot DVC cache #3261

kevin-hanselman opened this issue Jan 30, 2020 · 37 comments · Fixed by #3472
Assignees
Labels
bug Did we break something? p1-important Important, aka current backlog of things to do research

Comments

@kevin-hanselman
Copy link

kevin-hanselman commented Jan 30, 2020

OS: Docker image based off of tensorflow/tensorflow:latest-gpu-py3, with Python 3.7.5 as the system Python.

Initial setup:

# python --version
Python 3.7.5
# python -m venv .profiling_venv
# source .profiling_venv/bin/activate
# pip install dvc yappi
...
# dvc --version
0.82.6

DVC cache configuration:

# dvc config cache.dir
ERROR: configuration error - config file error: section 'cache' doesn't exist
# dvc config --local cache.dir
ERROR: configuration error - config file error: section 'cache' doesn't exist
# dvc config --global cache.dir
ERROR: configuration error - config file error: section 'cache' doesn't exist
# dvc config --system cache.dir
/ssd/.../dvc/cache
# cat /etc/xdg/dvc/config
[cache]
dir = /ssd/.../dvc/cache
protected = true
type = "hardlink,symlink"

Please note that the DVC cache is hot. In other words, most if not all files for dvc fetch are present and up-to-date at /ssd/....

Make a fresh clone and profile dvc pull:

# git clone ssh://... repo
...
# cd repo
# yappi -f callgrind -o dvc_pull.prof -s $(which dvc) pull
...

This dvc pull, uninstrumented, usually takes 40+ minutes with a hot DVC cache.

Count the number of DVC-tracked files (symlinks, see the above config) and the total size of the repo:

# find . -type l | wc -l
29003
# du -shL .
403G    .

Looking at the dvc_pull.prof (in KCachegrind) suggests that the bottleneck is the checksum process. The file_md5 and dos2unix functions in utils/__init__.py appear particularly costly.

Is this a known issue? Would the primary authors of DVC entertain a more performant version of file_md5 (perhaps written in C/C++ and without TQDM integration)?

dvc_pull_prof.zip

Screenshot_2020-01-30_14-02-20

@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Jan 30, 2020
@kevin-hanselman
Copy link
Author

Looking into this further, it appears DVC is wrongly treating the binary files as text files (via istextfile.py). This leads to the huge amount of dos2unix calls mentioned above. In my case, the files have large text headers followed by binary data.

@ghost
Copy link

ghost commented Jan 30, 2020

Is this a known issue?

It was a known issue, but I didn't follow up on it 😞

Take a look at #1970 (comment)

@efiop
Copy link
Contributor

efiop commented Jan 30, 2020

In my case, the files have large text headers followed by binary data.

Ouch, our heuristic is similar to git's and only checks first N bytes to determine if this is a binary or a text file. Related #992 .

That being said, current dvc pull should trust hashes downloaded from remote, so I wonder if there is a bug here somewhere. Need to check, might be related to unpacked dir or something. @pared maybe you have any thoughts about it?

@efiop efiop added the bug Did we break something? label Jan 30, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Jan 30, 2020
@pared
Copy link
Contributor

pared commented Jan 31, 2020

@efiop seems like a bug to me, dos2unix should not be called at all as it's invoked only by file_md5, which supposedly should not happen after #3200. Looking into that.

@pared
Copy link
Contributor

pared commented Jan 31, 2020

Ok, got it, I made a mistake in the implementation:

  • trusting remote cache occurs only when we are actually pulling from the cache
  • in case when we already have a cache (@kevlar1818 has up-to-date cache) we do not trigger
    files download, which was triggering inserting downloaded files checksums into database, I need to fix fetch: trust remote checksums #3200 to make sure that proper checksums are also inserted in case when we do have cache.

[EDIT]
actually that is not that easy, as the calculating checksum of cache files is part of verifying that cache is not corrupted, we need to consider that.

@pared
Copy link
Contributor

pared commented Jan 31, 2020

@kevlar1818 do you set up fresh docker container for each time you use this repo? Or is it kind of your development environment and it will persist?
If the latter, long checkout should not occur again, if your use case is the first one, it will be a recurrent issue.

I think we should consider trusting the cache or provide an option allowing the user to do that.

@efiop efiop added p0-critical Critical issue. Needs to be fixed ASAP. and removed p0-critical Critical issue. Needs to be fixed ASAP. labels Jan 31, 2020
@pared
Copy link
Contributor

pared commented Jan 31, 2020

@kevlar1818 please rerun pull, and tell us if its also taking so much time

@kevin-hanselman
Copy link
Author

kevin-hanselman commented Jan 31, 2020

@pared Rerunning pull on the same repo (not a new clone) seems to produce the same runtime as the first pull.

The hope is that the repo can be cloned in a new docker container whenever a developer wants to work on the project. So I'd bin our use case in your first description above. That being said, all developers share the /ssd cache.

@efiop
Copy link
Contributor

efiop commented Jan 31, 2020

@kevlar1818 And could you show prof results for that one too, please?

@kevin-hanselman
Copy link
Author

@efiop I'll have to re-run with profiling enabled on Monday.

@kevin-hanselman
Copy link
Author

dvc_pull2_prof.zip

Screenshot_2020-02-03_13-19-24

@kevin-hanselman
Copy link
Author

@efiop @pared @MrOutis Please see above for the requested profiling of a second dvc pull

@efiop
Copy link
Contributor

efiop commented Feb 4, 2020

@kevlar1818 One more thing, could you please show dvc version output?

@kevin-hanselman
Copy link
Author

@efiop

# dvc version
DVC version: 0.82.6
Python version: 3.7.5
Platform: Linux-3.10.0-957.10.1.el7.x86_64-x86_64-with-Ubuntu-18.04-bionic
Binary: False
Package: pip

@efiop
Copy link
Contributor

efiop commented Feb 5, 2020

@kevlar1818 Oh, sorry, forgot to mention that it would be great to pip install psutil before that and then run dvc version inside the repository, that way it will provide FS info that might be useful here. Could you please do that and show the output once again? Sorry for the inconvenience.

@kevin-hanselman
Copy link
Author

kevin-hanselman commented Feb 10, 2020

@efiop

# pip install psutil
...
Installing collected packages: psutil
Successfully installed psutil-5.6.7
# dvc version
DVC version: 0.82.6
Python version: 3.7.5
Platform: Linux-3.10.0-957.10.1.el7.x86_64-x86_64-with-Ubuntu-18.04-bionic
Binary: False
Package: pip
Filesystem type (workspace): ('nfs', '...')

@efiop
Copy link
Contributor

efiop commented Feb 11, 2020

Thanks @kevlar1818 !

Btw, looks like one more line is missing from that output. The one for cache.

But now that we see that it is nfs, I suspect that it is probably the cause here. Need to check that maybe inode is broken or something like that. Could you please show full log for dvc pull -v?

@efiop
Copy link
Contributor

efiop commented Feb 11, 2020

@kevlar1818 Also, what is the remote that you are dvc pull-ing from? I don't see it in the config.

@efiop
Copy link
Contributor

efiop commented Feb 11, 2020

@kevlar1818 Btw, what we could also do here is have a video call together and debug on the spot. It should be much faster for everyone. 🙂 Would that work for you? If it would, please ping me through the email (see it in my github profile) and we'll arrange something ASAP.

@kevin-hanselman
Copy link
Author

kevin-hanselman commented Feb 11, 2020

Btw, looks like one more line is missing from that output. The one for cache.

@efiop That was the entire output of dvc version. As I noted in my initial bug report, my cache is located at /ssd/... which is a locally mounted ext4 FS.

Also, what is the remote that you are dvc pull-ing from? I don't see it in the config.

The remote is a MinIO (S3) instance on the same LAN.

@kevin-hanselman
Copy link
Author

@efiop As for a video call, let me get back to you on that. I'd have to carve out some time at work.

@luiscarox
Copy link

I have a similar issue, even on recent DVC (after #3200 was merged).

  • fresh git pull
  • dvc cache is a shared location
  • dvc pull takes ages doing md5 checksums, even though all the data is in the cache and as far as I can tell, it doesnt need to get it from the actual remote.
  • subsequent dvc pull calls dont take long at all (seconds rather than several minutes)

in my use case, I'd happily make the cache read only if that offers additional guarantees against corruption, as this is a CI environment and nothing is expected to alter data

@casperdcl
Copy link
Contributor

casperdcl commented Feb 12, 2020

btw from the OP

perhaps written in C/C++ and without TQDM integration

from what I can see in the profiler results, tqdm is relatively negligible but let me know if it's still considered unacceptably high.

(related: #3060)

@efiop efiop self-assigned this Mar 10, 2020
@efiop
Copy link
Contributor

efiop commented Mar 11, 2020

@kevlar1818 Really sorry for this taking so long, we have a lot on our plate these days. Thanks to amazing research by @pared , I've created #3472 that does some tricks with read-only permissions, so that it could trust any cache. Please consider trying it out with

pip install git+https://github.com/efiop/dvc@3261

and let us know if that helps with the issue or not. Thanks for the feedback!

efiop added a commit to efiop/dvc that referenced this issue Mar 17, 2020
efiop added a commit to efiop/dvc that referenced this issue Mar 17, 2020
efiop added a commit that referenced this issue Mar 17, 2020
@efiop
Copy link
Contributor

efiop commented Mar 18, 2020

@kevlar1818 The fix was released in 0.90 , please give it a try and let us know if that fixed the issue. Thanks for the feedback and your initial PR! 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Did we break something? p1-important Important, aka current backlog of things to do research
Projects
None yet
7 participants