ETag mismatch on MinIO external dependency add. #3454
Seems to run fine for me locally for now.
Do you have an example dataset with which we can reproduce this issue?
@pared please try with the `--compat` flag. Read https://github.com/minio/minio#caveats to understand more about why.
Here's what I tried:
I still got a similar error:
Note that you need to purge the existing objects; dvc caches data.
Ok so here's what I did:
Now I get a slightly different error, but again related to ETags:
What's interesting is this particular line:
On the other hand, for other files the MD5/ETag seems to have been computed correctly (albeit with the
Yes, because now that you purged `.minio.sys` you lost all metadata, including the ETags for all your objects. Since dvc remembers ETags, it is not possible to start the server without `--compat` and then restart it with `--compat`. The server should be started with `--compat` from the get-go; when I said fresh, I meant everything fresh, including the content on MinIO, not just the metadata.
Unless of course a refresh of sorts can be requested from dvc. The original issue here is a bug in MinIO CopyObject, which is not preserving the previous ETag; that is not intentional. This happens only without the `--compat` flag, so it needs to be fixed on the MinIO side. Once that is fixed, I expect that Q: Does
@harshavardhana To be clear, I actually purged both the MinIO and DVC metadata at the same time, then restarted the MinIO server. But you seem to be saying that I need to ensure that the content is new? For example, copy it to another folder?
Yes, because you already have content; we don't compute new ETags for existing data. Purging `.minio.sys` simply loses all object metadata. It won't be regenerated magically.
Also, let's avoid a lengthy discussion here, as a courtesy to the dvc maintainers. You are not doing what I demonstrated anyway. Let's discuss on our Slack instead @benjamintanweihao
@harshavardhana Thanks a lot for the explanation! Looks like there is not much we can do on the dvc side, so closing this issue for now.
@efiop does it make sense to report it back to minio?
@shcheklein, there's an issue which was closed because it is supposed to work this way on minio. See: minio/minio#8012 (comment). We could, however, suggest that users use the `--compat` flag.
The solution provided by @harshavardhana seems to be working, though I encountered another problem. It seems that running the MinIO server with `--compat` works. The next error I encounter is:
@pared that exception looks like it's client side; perhaps a new issue should be reported. Feel free to CC me if you think it's MinIO.
@harshavardhana Yes, we try to ensure that the ETag is the same after we
Variable part sizes don't seem to be handled; even the number-of-parts calculation based on the ETag cannot know for certain what part size was used. This is tricky. Has there been any thought on fixing this by using something more appropriate for all use cases? https://github.com/s3git/s3git might give some ideas.
Yes, the assumption is that the part size stays the same. An important note is that this is only used for some very advanced functionality, so maybe it hasn't had enough runs in the wild, but so far it has worked fine. Thanks for sharing the s3git link! Looks like they don't rely on ETags, but instead manually compute hashes for objects on S3, right?
@harshavardhana thanks for sharing your insight!
Yes, that is correct. I shall cc @fwessels for more clarity.
@harshavardhana As far as I can tell, BLAKE2 is used, but even though it is ~2x faster than MD5, it still takes time to compute. That is something we avoid by leveraging ETags as a free way to ensure that objects are unchanged. We have thought about computing some hash ourselves instead, but so far there hasn't been enough need for it. We might come to that in the future, at which point we will probably take a look at BLAKE3 or something like that. Thanks for the question! 🙏
For consistent hashing you should look into the verified-streaming behaviour that BLAKE3 offers. It fixes the size of the individual parts for which hashes are computed at the lowest level, which then ultimately results in the final hash at the top. The "only" caveat is that, when working with multipart objects, you need to know where the part boundaries are (and some hashes will need to be computed spanning two consecutive parts). So it is definitely doable, but not completely trivial to implement; dropping parts upon finalization of the multipart upload would not be allowed, at least not without recomputing the hashes from the point where a part is left out.
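As a rough illustration of the chunked-hashing pattern being discussed (not DVC's or s3git's actual implementation; the function name and chunk size are made up for the example), here is a minimal sketch streaming a file through BLAKE2b from Python's standard `hashlib`:

```python
import hashlib

def blake2_file_hash(path, chunk_size=1 << 20):
    """Hash a file incrementally with BLAKE2b, 1 MiB at a time,
    so arbitrarily large objects never have to fit in memory."""
    h = hashlib.blake2b()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The same streaming pattern would apply to BLAKE3 via a third-party package, since `hashlib` does not ship BLAKE3.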
It should be handled, as far as I remember:

```python
obj = cls.get_head_object(
    s3, from_info.bucket, from_info.path, PartNumber=i
)
part_size = obj["ContentLength"]
```

So we try to preserve the same number of parts and the same length for each of them. It is still a gray area, since as far as I know it is not officially documented how the ETag is calculated for multipart uploads, but I think it is reasonable for our users to rely on that optimization for now. As @efiop mentioned, we can introduce BLAKE2 or something similar if Amazon at some point decides to change the logic behind it. Btw, I wonder if
No, it is not. Parts can be uploaded in this manner:

This will result in an ETag of. Now if you assume 3 parts and a content length of 11MiB, you have no idea what part length was used; the multipart ETag is nothing but the

The server-side copy of parts is called CopyObjectPart(), which I see you are using when the ETag looks like. NOTE: This assumption will also fail for SSE-C encrypted objects, because AWS S3 doesn't return a proper ETag for them, meaning an SSE-C object will change its ETag automatically upon an overwrite. https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html
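For reference, the commonly observed (but, as noted in this thread, not officially documented) S3 multipart ETag convention can be sketched in Python; the helper names here are hypothetical:

```python
import hashlib

def multipart_etag(parts):
    """MD5 of the concatenated binary MD5 digests of each part,
    suffixed with "-<number of parts>": the conventional multipart ETag."""
    digests = b"".join(hashlib.md5(p).digest() for p in parts)
    return "%s-%d" % (hashlib.md5(digests).hexdigest(), len(parts))

def part_count(etag):
    """Number of parts encoded in an ETag; plain single-part ETags
    carry no suffix, so report 1."""
    return int(etag.split("-")[1]) if "-" in etag else 1
```

This is exactly why the part boundaries matter: the same bytes split differently produce a different ETag, and the suffix reveals only the part count, never the part size.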
I think that's not how DVC does this. Please check this code: https://github.com/iterative/dvc/blob/master/dvc/remote/s3.py#L103-L107 Unless I'm missing something, it takes the exact sizes of all parts.
Could you share the script, please? It might be a bug.
Yep, I know. It doesn't make it official, thus the "gray" area. Though, like I mentioned, it's very unlikely they'll change it, so we made a decision to utilize this behavior.
Yes. See #2701 (comment) and iterative/dvc.org#774 |
Ah, you are using the PartNumber API @shcheklein, which was undocumented for a few years and then silently appeared.

```python
for i in range(1, n_parts + 1):
    obj = cls.get_head_object(
        s3, from_info.bucket, from_info.path, PartNumber=i
    )
    part_size = obj["ContentLength"]
    lastbyte = byte_position + part_size - 1
    if lastbyte > size:
        lastbyte = size - 1
```

We have had debates about whether we should ever implement this on MinIO, but since you guys use it, it looks like it makes sense to do it.
@harshavardhana thanks! 🙏 Will there be a ticket for us to follow?
Assuming we have a MinIO instance set up with two buckets (`dvc-cache`, `data`) on localhost:9000, and we try to add data from the `data` bucket as an external dependency, we will get an `ETag mismatch` error.

Example:

Will result with:
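A hypothetical reproduction setup, assuming the MinIO `--compat` flag discussed in this thread and placeholder paths; the bucket names match the ones in this issue:

```shell
# Start MinIO in S3-compatibility mode so ETags survive server-side copies.
minio server --compat /tmp/minio-data

# Point dvc's cache remote at the local MinIO endpoint.
dvc remote add -d minio s3://dvc-cache
dvc remote modify minio endpointurl http://localhost:9000
```

Without `--compat`, MinIO's CopyObject rewrites the ETag, which is what triggers the mismatch described above.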
Related: #2629 , #3441