
Add parallelism to checksum calculation for pulling multiple large files #3416

Closed
Ykid opened this issue Feb 27, 2020 · 22 comments
Labels: bug, p1-important, research

Ykid commented Feb 27, 2020

When checking out, we do it sequentially, stage after stage; that is why the checksum calculations are sequential.

It would be better if multiple stages (which might correspond to large files) could perform their checksum calculations at the same time.
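For illustration only, a minimal sketch of the idea using plain hashlib and a thread pool; the paths and the file_md5 helper are made up for the example and are not DVC internals:

import hashlib
from concurrent.futures import ThreadPoolExecutor

def file_md5(path, chunk_size=1024 * 1024):
    # Hash the file in 1 MiB chunks so large files never have to fit in memory.
    md5 = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

# Hash the outputs of several stages at the same time instead of one
# stage after another (the paths are placeholders).
paths = ["A", "B", "C"]
with ThreadPoolExecutor(max_workers=len(paths)) as pool:
    checksums = dict(zip(paths, pool.map(file_md5, paths)))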

triage-new-issues bot added the triage label Feb 27, 2020
Ykid changed the title from "Add parallelism to pulling multiple large files" to "Add parallelism to checksum calculation for pulling multiple large files" Feb 27, 2020
Ykid (Author) commented Feb 27, 2020

@pared

efiop (Contributor) commented Feb 28, 2020

@pared Do you remember the context? Aren't we trusting pulled checksums these days? Or is it about the unpacked dir case?

efiop added the awaiting response label Feb 28, 2020
triage-new-issues bot removed the triage label Feb 28, 2020
pared (Contributor) commented Mar 3, 2020

@efiop @Ykid Sorry for the delay,
Here is the start of our conversation:
https://discordapp.com/channels/485586884165107732/563406153334128681/682124179885260804

@Ykid has a repository with a few big files (A, B, C), each having its own stage (A.dvc, B.dvc, C.dvc). @Ykid noticed that checksum calculations on dvc pull take a lot of time.

So, I see 2 potential problems:

  1. The files were on NFS, so reading them and feeding them to the hash function might be taking a lot of time.
  2. We do parallelize checksum computations, but not at the stage level. What I mean by that is that when checking out here we do it stage by stage, so we will have:
  • checkout A.dvc
  • checkout B.dvc

and so on.
And since checksum calculation happens down the road of stage.checkout, our checksum calculations will be sequential (because stage checkouts are sequential); see the sketch below.

So I would vote for changing the name of the issue to "Parallelize stage checkouts".
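For illustration (not DVC's actual code), a rough sketch of sequential vs. parallel stage checkout, assuming a hypothetical stage.checkout() interface:

from concurrent.futures import ThreadPoolExecutor

def checkout_sequential(stages):
    # Roughly what happens today: stages are checked out one after another,
    # so any checksum work inside stage.checkout() runs sequentially too.
    for stage in stages:
        stage.checkout()

def checkout_parallel(stages, jobs=4):
    # The proposal: run independent stage checkouts concurrently, so the
    # checksum calculations for A.dvc, B.dvc and C.dvc can overlap.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        list(pool.map(lambda stage: stage.checkout(), stages))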

pared pinned this issue Mar 3, 2020
pared self-assigned this Mar 3, 2020
pared added the research label and removed the awaiting response label Mar 3, 2020
pared (Contributor) commented Mar 3, 2020

@Ykid
What version of dvc are you using?

pared added the awaiting response label Mar 3, 2020
pared unpinned this issue Mar 3, 2020
Ykid (Author) commented Mar 7, 2020

@pared

DVC version: 0.86.5
Python version: 3.7.6
Platform: Linux-5.0.0-37-generic-x86_64-with-debian-buster-sid
Binary: False
Package: pip
Cache: reflink - not supported, hardlink - supported, symlink - supported
Filesystem type (cache directory): ('ext4', '/dev/nvme0n1p2')
Filesystem type (workspace): ('ext4', '/dev/nvme0n1p2')

efiop (Contributor) commented Mar 9, 2020

@pared Not sure I follow, what hashes are calculated on checkout? For the files we've downloaded? But we should've trusted those, and when checking out we trust the cache hashes, right?

pared (Contributor) commented Mar 10, 2020

@efiop That's right, when pulling we should trust the checksums by default.
@Ykid did your configuration change in any way since we talked on Discord?
It used to look like this:

[cache]
    type = symlink
    protected = true
[core]
    remote = default
    checksum_jobs = 24
['remote "default"']
    url = /path/to/nfs/mount

Ykid (Author) commented Mar 11, 2020

@pared I didn't change my config. What's the discrepancy you spotted?

pared (Contributor) commented Mar 11, 2020

@Ykid with the version that you are using, dvc should not even be calculating checksums when pulling. I need to try to reproduce your problem.

pared (Contributor) commented Mar 23, 2020

@Ykid @efiop This seems like a bug:
I can confirm that when pulling, md5 is calculated. Reproduction script:

#!/bin/bash

rm -rf repo storage git_repo clone
mkdir repo storage git_repo

main=$(pwd)

# bare repo to act as the git remote
pushd git_repo
git init --quiet --bare
popd

pushd repo

git init --quiet
git remote add origin $main/git_repo
dvc init -q

# local directory remote for the dvc cache
dvc remote add -q -d str $main/storage

git add -A
git commit -am "init"

# two 1 MiB files, each tracked by its own stage
dd if=/dev/urandom of=data1 bs=1M count=1 &> /dev/null
dvc add -q data1

dd if=/dev/urandom of=data2 bs=1M count=1 &> /dev/null
dvc add -q data2

git add -A
git commit -am "add data"
dvc push -q
git push origin master
popd

# fresh clone: pulling here triggers the md5 calculation
git clone git_repo clone
pushd clone
dvc pull

@pared pared added bug Did we break something? p1-important Important, aka current backlog of things to do and removed awaiting response we are waiting for your reply, please respond! :) labels Mar 23, 2020
pared (Contributor) commented Mar 23, 2020

Interestingly, I am unable to write a test reproducing this behaviour:

from os import fspath  # assumed import; the original snippet did not show where fspath comes from


def test_trust_remote_checksums(tmp_dir, tmp_path_factory, erepo_dir, mocker):
    with erepo_dir.chdir():
        erepo_dir.dvc_gen({"file": "file content"}, commit="add file")

    from dvc.scm import Git

    Git.clone(fspath(erepo_dir), fspath(tmp_dir))

    from dvc.repo import Repo

    dvc = Repo(fspath(tmp_dir))

    md5_spy = mocker.spy(dvc.cache.local, "get_file_checksum")

    dvc.pull()

    assert md5_spy.call_count == 0

This one passes; I think we should include this task in the next sprint.

pared (Contributor) commented May 6, 2020

Well, my test has not been working due to #2888; I will fix it as part of this issue.
The reason why my bash script results in md5 calculation while the test does not is that erepo_dir has been using @efiop's optimization from #3472: the cache file was protected. Thanks to that, when we were gathering dvc/remote/local._status, cache_exists did not have to calculate the checksum (because the file in the cache was read-only). In my bash script, on the other hand, I was using a remote storage (a local one) that does not protect its content by default. Then, when I clone the git repo and pull the data, DVC goes through cache_exists -> changed_cache_file, which recalculates the md5, because files on the local remote are not protected.
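A simplified sketch of that optimization, with made-up names (this is not the actual dvc/remote/local code): a read-only (protected) cache file is trusted as-is, while an unprotected file falls back to re-hashing.

import os
import stat

def changed_cache_file_sketch(path, expected_md5, compute_md5):
    # Hypothetical: decide whether a cache file needs re-hashing.
    mode = os.stat(path).st_mode
    is_protected = not (mode & (stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
    if is_protected:
        return False  # trusted as unchanged, no md5 calculation needed
    # Unprotected files (e.g. an unprotected local remote) fall back to
    # recomputing the md5, which is exactly the slow path described above.
    return compute_md5(path) != expected_md5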

pared (Contributor) commented May 20, 2020

So I would say that the issue we have here is how we treat LocalRemote:
Why do we calculate the md5 of the obtained files?
When pulling, we check the files' status, which in turn calls cache_exists. For LocalRemote, we iterate over the desired checksums and call changed_cache_file in a sequential manner.

So, to help with slow checksum calculation, we could introduce parallelization here (see the sketch below).

Another thing is that for other remotes (besides ssh) we do not trigger checksum calculation; it is assumed that the checksum on the remote is the right one. I guess we do not extend that trust to the local remote so that we can verify that no cache corruption has occurred, and that is why we calculate the checksum.
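A hedged sketch of what that parallelization could look like, assuming a changed_cache_file(checksum) callable is available; this is illustrative only and ignores the state-DB complication raised in the next comment.

from concurrent.futures import ThreadPoolExecutor

def missing_or_changed(checksums, changed_cache_file, jobs=8):
    # Instead of iterating over the desired checksums one by one, check
    # them concurrently and keep only the ones whose cache file is
    # missing or has changed.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        results = pool.map(changed_cache_file, checksums)
        return [c for c, changed in zip(checksums, results) if changed]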

pared (Contributor) commented May 21, 2020

Problem: changed_cache_file for the local remote uses state; therefore, SQLite is preventing us from parallelizing cache_exists for LocalRemote.

shcheklein (Member) commented:
@pared SQLite should be thread-safe and usable from multiple threads, no?
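For context, a small self-contained sketch of how Python's sqlite3 behaves here (the table and helper are made up, not DVC's state code): a connection is bound to its creating thread unless check_same_thread=False is passed, and even then concurrent writes should be serialized.

import sqlite3
import threading

# By default sqlite3 raises "SQLite objects created in a thread can only be
# used in that same thread"; check_same_thread=False lifts that check, but
# writes still have to be serialized (here with a plain lock).
conn = sqlite3.connect("state.db", check_same_thread=False)
conn.execute("CREATE TABLE IF NOT EXISTS state (path TEXT PRIMARY KEY, md5 TEXT)")
write_lock = threading.Lock()

def record_checksum(path, md5):
    with write_lock:
        conn.execute(
            "INSERT OR REPLACE INTO state (path, md5) VALUES (?, ?)", (path, md5)
        )
        conn.commit()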

efiop (Contributor) commented May 21, 2020

@pared Does the local remote preserve the protected mode when uploading? If it doesn't, maybe we should do that, and then trust the cache files on pull if they are still protected?
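A minimal sketch of that suggestion, with hypothetical helper names: protect the file when it lands in the local remote, and on pull trust anything that is still read-only (the same check as in the earlier sketch).

import os
import shutil
import stat

def upload_to_local_remote(src, dst):
    # Copy the file into the local remote and then drop the write bits,
    # mirroring how a protected cache file looks.
    shutil.copyfile(src, dst)
    os.chmod(dst, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

def still_protected(path):
    # On pull, a file that is still read-only could be trusted without
    # recomputing its md5.
    return not (os.stat(path).st_mode & stat.S_IWUSR)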

pared (Contributor) commented May 22, 2020

@shcheklein Maybe we are not initializing it properly? I get errors specifically pointing to the fact that we are accessing the DB from multiple threads.

@efiop The local remote does not preserve the mode. I think that could be a proper way to solve it: utilize the cache optimization and trust a protected local remote. I will prepare a PR for that.

shcheklein (Member) commented:
(quoting @pared) Maybe we are not initializing it properly? I get errors specifically pointing to the fact that we are accessing the DB from multiple threads.

To be honest, I don't know. We need to check the docs and do some research, but of course only if it's a bottleneck for solving this particular issue.

pared (Contributor) commented May 22, 2020

@Ykid sorry it took so long; you should no longer experience file checksum recalculation for the local remote.
As for parallelization, this issue needs further research.

efiop (Contributor) commented May 23, 2020

State not working in parallel mode is a well-known thing; it is just the way it is currently set up. We will be looking at hash optimizations separately. Closing this issue for now, since it is mitigated by #3858.

efiop closed this as completed May 23, 2020
efiop (Contributor) commented Aug 28, 2020

For the record: this was probably caused by #4485, preparing a fix. CC @pared

pared (Contributor) commented Aug 31, 2020

@efiop I am not sure they are related. This one was about cloning and pulling the project, not importing.
