Add parallelism to checksum calculation for pulling multiple large files #3416
Comments
@pared Do you remember the context? Aren't we trusting pulled checksums these days? Or is this about the unpacked dir case? |
@efiop @Ykid Sorry for the delay. @Ykid has a repository with a few big files (A, B, C), each having its own stage (A.dvc, B.dvc, C.dvc). @Ykid noticed that checksum calculations on So, I see 2 potential problems:
and so on. So I would vote for changing the name of the issue to |
@Ykid |
|
@pared Not sure I follow, what hashes are calculated on checkout? For the files we've downloaded? But we should've trusted those and then when checking out we are trusting the cache hashes, right? |
@efiop That's right, when pulling we should trust the checksums by default.
|
@pared I didn't change my config. What's the discrepancy you spotted? |
@Ykid With the version that you are using, dvc should not even calculate checksums when pulling. I need to try to reproduce your problem. |
@Ykid @efiop seems like a bug,
|
Interestingly, I am unable to write a test reproducing this behaviour:
This one passes. I think we should include this task in the next sprint. |
Well, my test was not working because of #2888; I will fix that as part of this issue. |
So I would say that the issue we have here is how we treat LocalRemote:
So, to help with slow checksum calculation, we could introduce parallelization here
Another thing is that in case of other remotes (besides |
Problem: |
@pared SQLite should be thread-safe, so shouldn't it be usable from multiple threads? |
@pared Does local remote preserve protected mode when uploading? If it doesn't, maybe we should do that and then trust on pull if cache files are still protected? |
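To make the "trust it if it is still protected" idea concrete, here is a minimal sketch. It assumes that "protected" simply means a regular file with no write permission bits set; `is_protected` and `needs_rehash` are hypothetical helper names for illustration, not DVC functions.

```python
import os
import stat

# Any of these bits means someone is allowed to modify the file.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def is_protected(path):
    """Return True if path is a regular file with no write bits set."""
    mode = os.stat(path).st_mode
    return stat.S_ISREG(mode) and not (mode & WRITE_BITS)


def needs_rehash(path):
    """Assumption: a file that is still read-only was not modified,
    so its recorded checksum can be trusted without recalculation."""
    return not is_protected(path)
```

Note that this is only a heuristic: the mode bits show that the file was left read-only, not that its content is guaranteed to be unchanged.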
@shcheklein Maybe we are not initializing it properly? I get errors specifically pointing to the fact that we are accessing the DB from multiple threads. @efiop Local remote does not preserve the mode. I think that could be a proper way to solve it: utilize the cache optimization and trust a protected local remote. I will prepare a PR for that. |
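For reference, the kind of error described here can be reproduced with the Python standard library alone. The sketch below uses a hypothetical `state` table (not DVC's actual schema) and shows that a `sqlite3` connection created with the default `check_same_thread=True` refuses to be used from another thread, while one opened with `check_same_thread=False` accepts it.

```python
import sqlite3
import threading


def make_conn(**kwargs):
    """Open an in-memory DB with a hypothetical path-to-checksum table."""
    conn = sqlite3.connect(":memory:", **kwargs)
    conn.execute("CREATE TABLE state (path TEXT, md5 TEXT)")
    return conn


def try_cross_thread_write(conn):
    """Run one INSERT from a new thread; return the raised error or None."""
    result = []

    def worker():
        try:
            conn.execute("INSERT INTO state VALUES (?, ?)", ("A", "abc"))
            result.append(None)
        except sqlite3.ProgrammingError as exc:
            result.append(exc)

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return result[0]


# Default connection: using it from another thread raises ProgrammingError.
default_error = try_cross_thread_write(make_conn())

# Opting out of the check lets other threads use the same connection.
shared_error = try_cross_thread_write(make_conn(check_same_thread=False))
```

With `check_same_thread=False`, the caller becomes responsible for serializing access, for example with a `threading.Lock` around writes or with one connection per thread.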
To be honest, I don't know. We need to check the docs and do some research, but only if it turns out to be a bottleneck for solving this particular issue. |
@Ykid sorry it took so long, you should not experience file checksum recalculation for local remote. |
State not working in parallel mode is a well-known thing; it is just because of the way it is currently set up. We will be looking at hash optimizations separately. Closing this issue for now, since it is mitigated by #3858 |
@efiop I am not sure if they are related. This one was related to cloning and pulling the project, not import. |
It would be better if multiple stages (which might correspond to large files) could perform checksum calculation at the same time.
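As a rough illustration of the request, the sketch below hashes several files concurrently with a thread pool. It is not DVC's implementation; `file_md5`, `parallel_md5`, and the `jobs` parameter are hypothetical names, and MD5 is assumed as the checksum. Threads can help here because `hashlib` releases the GIL while hashing larger buffers and file reads block on I/O.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1024 * 1024  # read files in 1 MiB chunks to bound memory use


def file_md5(path):
    """Return the hex MD5 digest of one file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parallel_md5(paths, jobs=4):
    """Hash several files concurrently; returns a {path: md5} dict."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=jobs) as executor:
        # executor.map preserves input order, so zip pairs each path
        # with its own digest.
        return dict(zip(paths, executor.map(file_md5, paths)))
```

For example, `parallel_md5(["A", "B", "C"])` would hash the three large files from the scenario above at the same time instead of one after another.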