
Add parallelism to checksum calculation for pulling multiple large files #3416

Closed
Ykid opened this issue Feb 27, 2020 · 22 comments
Labels: bug, p1-important, research

Ykid commented Feb 27, 2020

When checking out, we do it sequentially, stage after stage; that is why the checksum calculations are sequential.

It would be better if multiple stages (which might correspond to large files) could perform their checksum calculations at the same time.
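For illustration only, a minimal sketch of the idea using plain hashlib and a thread pool; the paths and the file_md5 helper are made up for the example and are not DVC internals:

import hashlib
from concurrent.futures import ThreadPoolExecutor

def file_md5(path, chunk_size=1024 * 1024):
    # Hash the file in 1 MiB chunks so large files never have to fit in memory.
    md5 = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

# Hash the outputs of several stages at the same time instead of one
# stage after another (the paths are placeholders).
paths = ["A", "B", "C"]
with ThreadPoolExecutor(max_workers=len(paths)) as pool:
    checksums = dict(zip(paths, pool.map(file_md5, paths)))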

triage-new-issues bot added the triage label Feb 27, 2020
Ykid changed the title from "Add parallelism to pulling multiple large files" to "Add parallelism to checksum calculation for pulling multiple large files" Feb 27, 2020
Ykid (Author) commented Feb 27, 2020

@pared

efiop (Contributor) commented Feb 28, 2020

@pared Do you remember the context? Aren't we trusting pulled checksums these days? Or is it about the unpacked dir case?

efiop added the awaiting response label Feb 28, 2020
triage-new-issues bot removed the triage label Feb 28, 2020
pared (Contributor) commented Mar 3, 2020

@efiop @Ykid Sorry for the delay,
Here is the start of our conversation:
https://discordapp.com/channels/485586884165107732/563406153334128681/682124179885260804

@Ykid has a repository with a few big files (A, B, C), each having its own stage (A.dvc, B.dvc, C.dvc). @Ykid noticed that checksum calculations on dvc pull take a lot of time.

So, I see 2 potential problems:

  1. The files were on NFS, so reading them and feeding them to the hash function might be taking a lot of time.
  2. We do parallelize checksum computations, but not at the stage level. What I mean by that is that when checking out here we do it stage by stage, so we will have:
  • checkout A.dvc
  • checkout B.dvc

and so on.
And since checksum calculation happens down the road of stage.checkout, our checksum calculations will be sequential (because stage checkouts are sequential); see the sketch below.

So I would vote for changing the name of the issue to "Parallelize stage checkouts".
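For illustration (not DVC's actual code), a rough sketch of sequential vs. parallel stage checkout, assuming a hypothetical stage.checkout() interface:

from concurrent.futures import ThreadPoolExecutor

def checkout_sequential(stages):
    # Roughly what happens today: stages are checked out one after another,
    # so any checksum work inside stage.checkout() runs sequentially too.
    for stage in stages:
        stage.checkout()

def checkout_parallel(stages, jobs=4):
    # The proposal: run independent stage checkouts concurrently, so the
    # checksum calculations for A.dvc, B.dvc and C.dvc can overlap.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        list(pool.map(lambda stage: stage.checkout(), stages))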

pared pinned this issue Mar 3, 2020
pared self-assigned this Mar 3, 2020
pared added the research label and removed the awaiting response label Mar 3, 2020
pared (Contributor) commented Mar 3, 2020

@Ykid
What version of dvc are you using?

pared added the awaiting response label Mar 3, 2020
pared unpinned this issue Mar 3, 2020
Ykid (Author) commented Mar 7, 2020

@pared

DVC version: 0.86.5
Python version: 3.7.6
Platform: Linux-5.0.0-37-generic-x86_64-with-debian-buster-sid
Binary: False
Package: pip
Cache: reflink - not supported, hardlink - supported, symlink - supported
Filesystem type (cache directory): ('ext4', '/dev/nvme0n1p2')
Filesystem type (workspace): ('ext4', '/dev/nvme0n1p2')

efiop (Contributor) commented Mar 9, 2020

@pared Not sure I follow, what hashes are calculated on checkout? For the files we've downloaded? But we should've trusted those, and when checking out we trust the cache hashes, right?

pared (Contributor) commented Mar 10, 2020

@efiop That's right, when pulling we should trust the checksums by default.
@Ykid did your configuration change in any way since we talked on Discord?
It used to look like this:

[cache]
    type = symlink
    protected = true
[core]
    remote = default
    checksum_jobs = 24
['remote "default"']
    url = /path/to/nfs/mount

Ykid (Author) commented Mar 11, 2020

@pared I didn't change my config. What's the discrepancy you spotted?

pared (Contributor) commented Mar 11, 2020

@Ykid with the version that you are using, dvc should not even be calculating checksums when pulling. I need to try to reproduce your problem.

pared (Contributor) commented Mar 23, 2020

@Ykid @efiop This seems like a bug:
I can confirm that when pulling, md5 is calculated. Reproduction script:

#!/bin/bash

rm -rf repo storage git_repo clone
mkdir repo storage git_repo

main=$(pwd)

# bare repo to act as the git remote
pushd git_repo
git init --quiet --bare
popd

pushd repo

git init --quiet
git remote add origin $main/git_repo
dvc init -q

# local directory remote for the dvc cache
dvc remote add -q -d str $main/storage

git add -A
git commit -am "init"

# two 1 MiB files, each tracked by its own stage
dd if=/dev/urandom of=data1 bs=1M count=1 &> /dev/null
dvc add -q data1

dd if=/dev/urandom of=data2 bs=1M count=1 &> /dev/null
dvc add -q data2

git add -A
git commit -am "add data"
dvc push -q
git push origin master
popd

# fresh clone: pulling here triggers the md5 calculation
git clone git_repo clone
pushd clone
dvc pull

@pared pared added bug Did we break something? p1-important Important, aka current backlog of things to do and removed awaiting response we are waiting for your reply, please respond! :) labels Mar 23, 2020
pared (Contributor) commented Mar 23, 2020

Interestingly, I am unable to write a test reproducing this behaviour:

from os import fspath  # assumed import; the original snippet did not show where fspath comes from


def test_trust_remote_checksums(tmp_dir, tmp_path_factory, erepo_dir, mocker):
    with erepo_dir.chdir():
        erepo_dir.dvc_gen({"file": "file content"}, commit="add file")

    from dvc.scm import Git

    Git.clone(fspath(erepo_dir), fspath(tmp_dir))

    from dvc.repo import Repo

    dvc = Repo(fspath(tmp_dir))

    md5_spy = mocker.spy(dvc.cache.local, "get_file_checksum")

    dvc.pull()

    assert md5_spy.call_count == 0

This one passes; I think we should include this task in the next sprint.

pared (Contributor) commented May 6, 2020

Well, my test has not been working due to #2888; I will fix it as part of this issue.
The reason why my bash script results in md5 calculation while the test does not is that erepo_dir has been using @efiop's optimization from #3472: the cache file was protected. Thanks to that, when we were gathering dvc/remote/local._status, cache_exists did not have to calculate the checksum (because the file in the cache was read-only). In my bash script, on the other hand, I was using a remote storage (a local one) that does not protect its content by default. Then, when I clone the git repo and pull the data, DVC goes through cache_exists -> changed_cache_file, which recalculates the md5, because files on the local remote are not protected.
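A simplified sketch of that optimization, with made-up names (this is not the actual dvc/remote/local code): a read-only (protected) cache file is trusted as-is, while an unprotected file falls back to re-hashing.

import os
import stat

def changed_cache_file_sketch(path, expected_md5, compute_md5):
    # Hypothetical: decide whether a cache file needs re-hashing.
    mode = os.stat(path).st_mode
    is_protected = not (mode & (stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
    if is_protected:
        return False  # trusted as unchanged, no md5 calculation needed
    # Unprotected files (e.g. an unprotected local remote) fall back to
    # recomputing the md5, which is exactly the slow path described above.
    return compute_md5(path) != expected_md5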

pared (Contributor) commented May 20, 2020

So I would say that the issue we have here is how we treat LocalRemote:
Why do we calculate the md5 of the obtained files?
When pulling, we check the files' status, which in turn calls cache_exists. For LocalRemote, we iterate over the desired checksums and call changed_cache_file in a sequential manner.

So, to help with slow checksum calculation, we could introduce parallelization here (see the sketch below).

Another thing is that for other remotes (besides ssh) we do not trigger checksum calculation; it is assumed that the checksum on the remote is the right one. I guess we do not extend that trust to the local remote so that we can verify that no cache corruption has occurred, and that is why we calculate the checksum.
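A hedged sketch of what that parallelization could look like, assuming a changed_cache_file(checksum) callable is available; this is illustrative only and ignores the state-DB complication raised in the next comment.

from concurrent.futures import ThreadPoolExecutor

def missing_or_changed(checksums, changed_cache_file, jobs=8):
    # Instead of iterating over the desired checksums one by one, check
    # them concurrently and keep only the ones whose cache file is
    # missing or has changed.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        results = pool.map(changed_cache_file, checksums)
        return [c for c, changed in zip(checksums, results) if changed]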

pared (Contributor) commented May 21, 2020

Problem: changed_cache_file for the local remote uses state; therefore, SQLite is preventing us from parallelizing cache_exists for LocalRemote.

shcheklein (Member) commented:
@pared SQLite should be thread-safe and usable from multiple threads, no?
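For context, a small self-contained sketch of how Python's sqlite3 behaves here (the table and helper are made up, not DVC's state code): a connection is bound to its creating thread unless check_same_thread=False is passed, and even then concurrent writes should be serialized.

import sqlite3
import threading

# By default sqlite3 raises "SQLite objects created in a thread can only be
# used in that same thread"; check_same_thread=False lifts that check, but
# writes still have to be serialized (here with a plain lock).
conn = sqlite3.connect("state.db", check_same_thread=False)
conn.execute("CREATE TABLE IF NOT EXISTS state (path TEXT PRIMARY KEY, md5 TEXT)")
write_lock = threading.Lock()

def record_checksum(path, md5):
    with write_lock:
        conn.execute(
            "INSERT OR REPLACE INTO state (path, md5) VALUES (?, ?)", (path, md5)
        )
        conn.commit()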

efiop (Contributor) commented May 21, 2020

@pared Does the local remote preserve the protected mode when uploading? If it doesn't, maybe we should do that, and then trust the cache files on pull if they are still protected?
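A minimal sketch of that suggestion, with hypothetical helper names: protect the file when it lands in the local remote, and on pull trust anything that is still read-only (the same check as in the earlier sketch).

import os
import shutil
import stat

def upload_to_local_remote(src, dst):
    # Copy the file into the local remote and then drop the write bits,
    # mirroring how a protected cache file looks.
    shutil.copyfile(src, dst)
    os.chmod(dst, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

def still_protected(path):
    # On pull, a file that is still read-only could be trusted without
    # recomputing its md5.
    return not (os.stat(path).st_mode & stat.S_IWUSR)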

pared (Contributor) commented May 22, 2020

@shcheklein Maybe we are not initializing it properly? I get errors specifically pointing to the fact that we are accessing the DB from multiple threads.

@efiop The local remote does not preserve the mode. I think that could be a proper way to solve it: utilize the cache optimization and trust a protected local remote. I will prepare a PR for that.

shcheklein (Member) commented:
(quoting @pared) Maybe we are not initializing it properly? I get errors specifically pointing to the fact that we are accessing the DB from multiple threads.

To be honest, I don't know. We need to check the docs and do some research, but of course only if it's a bottleneck for solving this particular issue.

pared (Contributor) commented May 22, 2020

@Ykid sorry it took so long; you should no longer experience file checksum recalculation for the local remote.
As for parallelization, this issue needs further research.

efiop (Contributor) commented May 23, 2020

State not working in parallel mode is a well-known thing; it is just the way it is currently set up. We will be looking at hash optimizations separately. Closing this issue for now, since it is mitigated by #3858.

efiop closed this as completed May 23, 2020
efiop (Contributor) commented Aug 28, 2020

For the record: this was probably caused by #4485, preparing a fix. CC @pared

pared (Contributor) commented Aug 31, 2020

@efiop I am not sure they are related. This one was about cloning and pulling the project, not importing.
