Faster data transfer from remote #3407

Closed
vernt opened this issue Feb 25, 2020 · 3 comments
Labels
p2-medium: Medium priority, should be done, but less important
performance: improvement over resource / time consuming tasks

Comments

@vernt
Contributor

vernt commented Feb 25, 2020

Currently, transferring data from an SSH remote uses Paramiko. On my files I observe an 18x improvement in transfer rate when using rsync instead of dvc pull. Perhaps there are other options too. It would be nice to use fast tools when they are available.
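For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of timing two transfer commands from Python. The helper name `time_command` and the host and paths in the usage comments are placeholders, not taken from this issue. Note that dvc pull also computes checksums and links files into the workspace, so the two numbers are not measuring exactly the same work.

```python
import subprocess
import time


def time_command(argv):
    """Run a command to completion and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    # check=True raises CalledProcessError if the command exits non-zero,
    # so a failed transfer is never reported as a fast one.
    subprocess.run(argv, check=True)
    return time.perf_counter() - start


# Hypothetical usage (placeholder host and paths, adjust to your remote):
#   dvc_seconds = time_command(["dvc", "pull"])
#   rsync_seconds = time_command(["rsync", "-a", "user@host:/path/to/cache/", "local-cache/"])
#   print(f"dvc pull: {dvc_seconds:.1f}s, rsync: {rsync_seconds:.1f}s")
```

Run each command against an empty local cache so that neither benefits from files that were already downloaded.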

@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Feb 25, 2020
@shcheklein
Member

@vernt have you tried increasing the number of jobs and checksum jobs (the --jobs option of dvc pull, and probably dvc config core.checksum_jobs)?

Could you please share the output of dvc pull -v with the latest DVC version, so that we can see how long each operation takes (calculating md5s vs. downloading vs. linking files to the workspace)?

It would be great to see the profiler results as well:

python -m cProfile -o dvc-pull.prof -m dvc pull -j
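Once the command above has written dvc-pull.prof, the file can be inspected with the standard-library pstats module. The following is a minimal sketch; the helper name `summarize_profile` and the default limit are illustrative choices, not part of DVC.

```python
import io
import pstats


def summarize_profile(prof_path, limit=15):
    """Return a text report of the `limit` most expensive calls, sorted by cumulative time."""
    buf = io.StringIO()
    # Send the report to an in-memory buffer instead of stdout.
    stats = pstats.Stats(prof_path, stream=buf)
    # strip_dirs() shortens file paths; "cumulative" sorts by time spent
    # in a function including everything it calls.
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return buf.getvalue()


# Usage, assuming the profile was written to dvc-pull.prof:
#   print(summarize_profile("dvc-pull.prof"))
```

Sorting by cumulative time tends to show quickly whether the run is dominated by checksum calculation, the download itself, or linking files into the workspace.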

@efiop efiop added p2-medium Medium priority, should be done, but less important performance improvement over resource / time consuming tasks labels Feb 26, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Feb 26, 2020
@jorgeorpinel
Contributor

Just for reference, here's some previous exploration along these lines: #2330

@jorgeorpinel jorgeorpinel added awaiting response we are waiting for your reply, please respond! :) and removed awaiting response we are waiting for your reply, please respond! :) labels Feb 26, 2020
@efiop
Contributor

efiop commented Oct 8, 2021

We've now migrated from paramiko to asyncssh (see https://github.com/iterative/sshfs), which is up to 4 times faster than the old approach (#6064 (comment)). We plan on pushing it even further in the future.

@efiop efiop closed this as completed Oct 8, 2021