Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

azure/gs/s3: re-design checksum computation for external dependencies/outputs #1410

Closed
efiop opened this issue Dec 4, 2018 · 8 comments
Closed
Assignees
Labels
enhancement Enhances DVC p2-medium Medium priority, should be done, but less important
Milestone

Comments

@efiop
Copy link
Contributor

efiop commented Dec 4, 2018

Unlike S3, azure does provide content-md5(need to investigate if it still works, since s3 had support for it some time ago and doesn't have it right now) This is related to external outs/deps feature, that is not implemented yet. Usual push/pull are not going to be affected.

@efiop efiop added the enhancement Enhances DVC label Dec 4, 2018
@efiop efiop added this to the Queue milestone Dec 4, 2018
@efiop efiop self-assigned this Dec 4, 2018
@efiop
Copy link
Contributor Author

efiop commented Dec 12, 2018

Ok, turns out user should set content-md5 by himself in order for this to work. Also ETag does indeed change when you simply copy blob from one location to another. So we need to require Content-Md5 to be present or to compute and set it ourselves. We can use ETag for dependencies though.

Also, md5 for google cloud storage only supported for non-composite objects, so looks like we need to do something with it as well.

@efiop efiop changed the title azure: use md5 instead of etag azure: support external dependency Mar 20, 2019
@efiop efiop added p2-medium Medium priority, should be done, but less important p3-nice-to-have It should be done this or next sprint and removed p2-medium Medium priority, should be done, but less important labels Mar 20, 2019
@efiop
Copy link
Contributor Author

efiop commented Apr 5, 2019

Ok, turns out when you move multipart objects on s3 they don't retain their etag. See https://discordapp.com/channels/485586884165107732/485596304961962003/563864123587297280 . So we might need to come up with a general approach to handle all of these clouds :(

@efiop efiop changed the title azure: support external dependency azure/gs/s3: support external dependency/output Apr 5, 2019
@efiop
Copy link
Contributor Author

efiop commented Apr 6, 2019

Another approach to handle all of these in a unified way is https://discordapp.com/channels/485586884165107732/485596304961962003/563899005151608843

@efiop efiop changed the title azure/gs/s3: support external dependency/output azure/gs/s3: re-design checksum computation for external dependencies/outputs Apr 6, 2019
@efiop
Copy link
Contributor Author

efiop commented Apr 6, 2019

@olveirap
Copy link
Contributor

olveirap commented Apr 6, 2019

I'm giving a shot at the s3 workaround. As far as I can see we need to check if the origin object of the copy in here is a multipart uploaded object by checking if it has a suffix in the etag. If that's the case, we need to calculate from the size it has and the number of parts it was uploaded in from the suffix of the etag, the chunk size of we need to set in the TransferConfig passed to the copy in order to achieve the same number of parts in the copy.

@olveirap
Copy link
Contributor

olveirap commented Apr 6, 2019

Actually, the head_object has the part count and the size, using that is less hacky.

@efiop efiop added p2-medium Medium priority, should be done, but less important and removed p3-nice-to-have It should be done this or next sprint labels Apr 6, 2019
efiop pushed a commit that referenced this issue Apr 15, 2019
* s3: fixed wrong etag when copying multipart objects

The etag of multipart objects depends of the number of parts, when copying to the cache we should do so in the same number of parts that the original object was moved/uploaded in.

Fixes part of #1410

* s3: added check on copy for equal etag

* s3: added specific exception for ETag mismatch

* s3: use multipart copy to preserve etags

Signed-off-by: Ruslan Kuprieiev <ruslan@iterative.ai>

* test: add tests for etag preservation on s3

Signed-off-by: Ruslan Kuprieiev <ruslan@iterative.ai>

* test: requirements: use dev version moto

Specifically because of this getmoto/moto#1941

Signed-off-by: Ruslan Kuprieiev <ruslan@iterative.ai>

* test: requirements: install dev moto without -e

Signed-off-by: Ruslan Kuprieiev <ruslan@iterative.ai>

* test: stop using moto

Turned out to be quite buggy.

Signed-off-by: Ruslan Kuprieiev <ruslan@iterative.ai>
@harshavardhana
Copy link

Actually, the head_object has the part count and the size, using that is less hacky.

aws/aws-sdk-java#1303 that has been generally considered undocumented and shouldn't be used as per aws-sdk-java

@efiop
Copy link
Contributor Author

efiop commented May 3, 2021

Closing in favor of #3920

@efiop efiop closed this as completed May 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhances DVC p2-medium Medium priority, should be done, but less important
Projects
None yet
Development

No branches or pull requests

3 participants