md5 hash displayed to user is wrong #9501
Comments
I wonder if this bug is also what's making it difficult to use pooch to download tabular files that Dataverse was able to ingest. When I use pooch to try to download a tabular file that's been ingested, pooch tries to use a checksum to verify the file's integrity, and because of some checksum mixup that sounds like the one you described @charmoniumQ, pooch doesn't let me download the file. Been meaning to report this somewhere, but haven't had time to dig into it.
Also note that https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/MPQGZP/Q1IS9E&version=1.0 seems to just be wrong. It reports a hash of
https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/VSN1O0/QEH8LH&version=1.0 also seems wrong. It reports a size of 0 and hash of
The exact transformation that causes the hash to differ varies between these examples. But I think in all cases, there is some transformation (e.g., removing the headers) happening during the hash that is not happening before downloading, or vice versa.
There's a related GitHub issue at IQSS/dataverse.harvard.edu#37, in a repo where we track things related to the Harvard Dataverse Repository that are more at the "production"-level and less related to the current version of the Dataverse software, unless other repositories using Dataverse are affected, too. I'm just mentioning this since I think this GitHub issue and your GitHub issue at IQSS/dataverse.harvard.edu#220 might wind up being resolved with changes to the software (like how checksums are displayed and organized in metadata exports) and changes to the Harvard Dataverse (like an audit of that repository's files).
Thanks, I didn't know about the issue tracker for issues specific to Harvard's dataverse instance. I'll post future discrepancies to IQSS/dataverse.harvard.edu#37.
Definitely easier to find this main Dataverse repo :)
Huh. I'm seeing the same thing. After downloading the CSV, I'm getting 2965cd060e16781a2f6fafa5a54a6c59 as the MD5 checksum...
... but Dataverse is asserting that it's 71988070448ef1f28b8538ebee9919bf. @charmoniumQ thanks for the heads up about this! Very strange.
Apologies for the delay with this, I missed this issue earlier (@pdurbin brought your comment to my attention this morning). Please try downloading the above again, you should get the right file now. (I haven't even looked at the other things mentioned in the issue yet, will take a look and reply asap)
Thanks @landreev !
(I'm debating whether I should move this issue to the local support repo as well; but I'll deal with that later) Addressing the original report at the top:
Good catch, thank you. Plus, extra credit for figuring out that the displayed md5 was in fact that of the raw tabular data file without the variable name header. Unfortunately, I'm at a bit of a loss as to
As I said earlier, it really looks like we need to review our system of file integrity audits and re-validate everything.
Hi @landreev ! Is there any progress on this issue? We've just been hit by this. We have some RO-Crate related functions that depend on the md5 of the files and if that changes in an async manner due to ingestion it causes inconsistencies for us. By the way, is there a way to turn off ingestion of tabular files?
@beepsoft If this is causing problems with your RO-Crate use case, yes, it is very easy to disable ingest completely: just set
What steps does it take to reproduce the issue?
See the dataset file page here.
9e9be... . But this is not true.
20ddc4... .
1f75c2... .
(cat file.tab | tail --lines=+2 | md5sum) has an md5 hash of 9e9be... .
This is a bug because it will lead users to believe that they downloaded a corrupted file.
There are two parts: the incorrect labeling, and cutting off the header row. The label should be "Tab-Delimited File MD5" not "Original File MD5." Cutting off the header row is more interesting. Why does Dataverse send the file to the user, but hash a transformed version of that file?
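For anyone who wants to check a downloaded file against the displayed checksum, here is a minimal sketch of the comparison described above, using only Python's standard `hashlib`. The function names (`md5_hex`, `strip_first_line`, `diagnose`) are hypothetical helpers made up for illustration, not part of Dataverse or pooch, and the sketch assumes the header is exactly the first newline-terminated line, as in the `tail --lines=+2` command.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


def strip_first_line(data: bytes) -> bytes:
    """Drop everything up to and including the first newline,
    mirroring `tail --lines=+2` (assumes a one-line header)."""
    _, sep, rest = data.partition(b"\n")
    return rest if sep else b""


def diagnose(data: bytes, displayed_md5: str) -> str:
    """Report which variant of the file, if any, the displayed MD5 matches."""
    displayed = displayed_md5.strip().lower()
    if md5_hex(data) == displayed:
        return "as-downloaded"
    if md5_hex(strip_first_line(data)) == displayed:
        return "header-stripped"
    return "no-match"
```

Usage would look something like `diagnose(open("file.tab", "rb").read(), "<md5 shown on the file page>")`. A result of `"header-stripped"` would point to the mismatch described in this issue rather than to a corrupted download.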
Unknown
The Stata files of this dataset that I checked by hand.
Which version of Dataverse are you using?
The one hosted at https://dataverse.harvard.edu/, 5.13 build 1244-79d6e57