md5 hash displayed to user is wrong #9501

Open
charmoniumQ opened this issue Apr 6, 2023 · 12 comments

@charmoniumQ

charmoniumQ commented Apr 6, 2023

What steps does it take to reproduce the issue?

See the dataset file page here.

  • This page says the "Original File MD5" begins with 9e9be.... But this is not true.
  • The "Stata Binary (Original File Format)" file has an md5 hash beginning with 20ddc4....
  • The "Tab-Delimited" file has an md5 hash beginning with 1f75c2....
  • However, the "Tab-Delimited" file without the header row (cat file.tab | tail --lines=+2 | md5sum) has an md5 hash beginning with 9e9be....

This is a bug because it will lead users to believe that they downloaded a corrupted file.

There are two parts: the incorrect labeling, and cutting off the header row. The label should be "Tab-Delimited File MD5" not "Original File MD5." Cutting off the header row is more interesting. Why does Dataverse send the file to the user, but hash a transformed version of that file?
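
To make the comparison concrete, here is a minimal Python sketch of the same check (file.tab stands in for a local copy of the downloaded tab-delimited file):

```python
import hashlib

# "file.tab" is a placeholder for a local copy of the downloaded
# tab-delimited version of the file.
with open("file.tab", "rb") as f:
    data = f.read()

# MD5 of the file exactly as downloaded (begins with 1f75c2...).
print(hashlib.md5(data).hexdigest())

# MD5 with the first (variable-name header) row stripped, equivalent to
# `cat file.tab | tail --lines=+2 | md5sum`; this matches the value the
# page labels "Original File MD5" (begins with 9e9be...).
without_header = data.split(b"\n", 1)[1]
print(hashlib.md5(without_header).hexdigest())
```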

  • When does this issue occur?

Unknown

  • Which page(s) does it occur on?

The Stata files of this dataset that I checked by hand.

Which version of Dataverse are you using?

The one hosted at https://dataverse.harvard.edu/, 5.13 build 1244-79d6e57

@jggautier
Contributor

I wonder if this bug is also what's making it difficult to use pooch to download tabular files that Dataverse has ingested. When pooch downloads a file, it verifies the file's integrity against a known checksum, and because of a checksum mixup that sounds like the one you described, @charmoniumQ, the verification fails and pooch refuses to hand over the file.

Been meaning to report this somewhere, but haven't had time to dig into it.
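
For context, the pooch call looks roughly like this (a sketch; FILE_ID and the checksum are placeholders copied from the file page). Pooch hashes the downloaded bytes against known_hash and raises an error instead of returning the path when they don't match:

```python
import pooch

# FILE_ID and the checksum are placeholders. pooch verifies the
# downloaded bytes against known_hash; if Dataverse displays the
# checksum of a transformed version of the file, this raises an
# error instead of returning the local path.
path = pooch.retrieve(
    url="https://dataverse.harvard.edu/api/access/datafile/FILE_ID",
    known_hash="md5:CHECKSUM_FROM_FILE_PAGE",
)
```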

@charmoniumQ
Author

charmoniumQ commented Apr 6, 2023

Also note that https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/MPQGZP/Q1IS9E&version=1.0 seems to just be wrong. It reports a hash of 71988... and size of 642.5 KB, but the only file Dataverse lets me download has a hash of 2965... and size of 609.3 KiB (yes, I'm dividing by 1024 not 1000; neither seems to work) and no header. I don't see an obvious way to get the original hash or size.

https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/VSN1O0/QEH8LH&version=1.0 also seems wrong. It reports a size of 0 and hash of 1e67.... When I download, I get a size of 0 and hash of d41d8..., which is well known as the hash of the empty string. The 1e67... hash is actually md5_hash("\n\n").

The exact transformation causing the hash to differ varies between these examples. But I think in all cases some transformation (e.g., removing the header row) is happening before the hash is computed that is not happening before the file is downloaded, or vice versa.
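
The empty-file case at least is easy to double-check locally (a minimal Python sketch):

```python
import hashlib

# MD5 of the empty string: the well-known d41d8cd98f00b204e9800998ecf8427e,
# which is what the downloaded zero-byte file hashes to.
print(hashlib.md5(b"").hexdigest())

# MD5 of "\n\n"; per the observation above, this should print the
# 1e67... value that the file page displays.
print(hashlib.md5(b"\n\n").hexdigest())
```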

@jggautier
Contributor

There's a related GitHub issue at IQSS/dataverse.harvard.edu#37, in a repo where we track "production"-level issues specific to the Harvard Dataverse Repository, as opposed to issues with the current version of the Dataverse software that would affect other repositories using Dataverse, too.

I'm just mentioning this since I think this GitHub issue and your GitHub issue at IQSS/dataverse.harvard.edu#220 might wind up being resolved with changes to the software (like how checksums are displayed and organized in metadata exports) and changes to the Harvard Dataverse (like an audit of that repository's files).

@charmoniumQ
Author

Thanks, I didn't know about the issue tracker for issues specific to Harvard's dataverse instance. I'll post future discrepancies to IQSS/dataverse.harvard.edu#37.

@jggautier
Contributor

Definitely easier to find this main Dataverse repo :)

@charmoniumQ
Author

@atrisovic

@pdurbin
Member

pdurbin commented Apr 18, 2023

Also note that https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/MPQGZP/Q1IS9E&version=1.0 seems to just be wrong. It reports a hash of 71988... and size of 642.5 KB, but the only file Dataverse lets me download has a hash of 2965... and size of 609.3 KiB (yes, I'm dividing by 1024 not 1000; neither seems to work) and no header. I don't see an obvious way to get the original hash or size.

Huh. I'm seeing the same thing.

After downloading the CSV, I'm getting 2965cd060e16781a2f6fafa5a54a6c59 as the MD5 checksum...

$ md5 ph2_endline_attitudes_survey.csv 
MD5 (ph2_endline_attitudes_survey.csv) = 2965cd060e16781a2f6fafa5a54a6c59

... but Dataverse is asserting that it's 71988070448ef1f28b8538ebee9919bf

[Screenshot, Apr 18, 2023: the Dataverse file page asserting MD5 71988070448ef1f28b8538ebee9919bf]

@charmoniumQ thanks for the heads up about this! Very strange.

@landreev
Contributor

@charmoniumQ

Also note that https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/MPQGZP/Q1IS9E&version=1.0 seems to just be wrong. It reports a hash of 71988... and size of 642.5 KB, but the only file Dataverse lets me download has a hash of 2965... and size of 609.3 KiB (yes, I'm dividing by 1024 not 1000; neither seems to work) and no header. I don't see an obvious way to get the original hash or size.

Apologies for the delay with this, I missed this issue earlier (@pdurbin brought your comment to my attention this morning). Please try downloading the above again; you should get the right file now.
The underlying problem was a "partial ingest" failure (note that you were getting a tab-delimited file, not the actual CSV version). There was a bug in Dataverse at some point that occasionally resulted in this: tabular data ingest would fail, but the application would still end up saving the converted tab-delimited file in place of the original. (Luckily, the original would also be saved, with the .orig extension, so the fix is simply to move the .orig back.)
What's alarming (and super embarrassing) is that for this file it was never fixed; it had in fact been sitting in this state since 2019. I was honestly under the impression that after we found and fixed the bug, we ran an audit that found and fixed all the files that had been affected. It really looks like I need to do that again.

(I haven't even looked at the other things mentioned in the issue yet, will take a look and reply asap)

@charmoniumQ
Author

Thanks @landreev !

@landreev
Contributor

landreev commented Apr 18, 2023

(I'm debating whether I should move this issue to the local support repo as well; but I'll deal with that later)

Addressing the original report at the top:

See the dataset file page here.

* This page says the "Original File MD5" begins with `9e9be...`. But this is not true.

* The "Stata Binary (Original File Format)" file has an md5 hash beginning with `20ddc4...`.

* The "Tab-Delimited" file has an md5 hash beginning with `1f75c2...`.

* However, the "Tab-Delimited" file without the header row (`cat file.tab | tail --lines=+2 | md5sum`) has an md5 hash beginning with `9e9be...`.

Good catch, thank you. Plus, extra credit for figuring out that the displayed md5 was in fact that of the raw tabular data file without the variable name header.
I have corrected the md5sum entry in the database ("corrected" as in, the file page is now showing the correct md5 of the original file - "Original File MD5: 20ddc4ec170ffdadd8a91d5e2db0066e"; I'll address your other question, whether this is what we want to display, separately).

Unfortunately, I'm at a bit of a loss as to

  1. how this happened in the first place. Unlike that "partial ingest" problem, I have not seen this before. The tab-delimited version of the file is in fact stored on disk without the variable header (for, well, reasons), and the header is added in real time when the file is downloaded. So calculating the checksum from that stored copy would produce this md5... but I can't think of how or why it would ever be recalculated like that after ingest; and
  2. how it survived numerous audits without being detected.

As I said earlier, it really looks like we need to review our system of file integrity audits and re-validate everything.
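
(For anyone who wants to verify a checksum themselves: the saved original can be requested through the Data Access API with format=original. A sketch, where FILE_ID is a placeholder for the file's numeric database id:)

```python
import hashlib
import urllib.request

# FILE_ID is a placeholder for the file's numeric database id.
# "?format=original" asks Dataverse for the saved original (here, the
# Stata binary) instead of the tab-delimited derivative, so the MD5
# should match the "Original File MD5" shown on the file page.
url = "https://dataverse.harvard.edu/api/access/datafile/FILE_ID?format=original"
with urllib.request.urlopen(url) as resp:
    print(hashlib.md5(resp.read()).hexdigest())  # expect 20ddc4ec170ffdadd8a91d5e2db0066e
```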

@beepsoft
Contributor

beepsoft commented Nov 8, 2023

Hi @landreev! Is there any progress on this issue? We've just been hit by this. We have some RO-Crate-related functions that depend on the md5 of the files, and if it changes asynchronously due to ingest, it causes inconsistencies for us.

By the way, is there a way to turn off ingestion of tabular files?

@landreev
Contributor

landreev commented Nov 13, 2023

@beepsoft
Hi,
Please note that there are several different things discussed in this issue. One, the one that I replied to and addressed back in April, was actual data corruption of a specific file on our production server.
What this issue was originally opened for is that the way Dataverse handles and stores ingested tabular data files is confusing, especially in how the md5 signatures are shown to the user. This is not a bug per se, but very ancient legacy: it was implemented this way early on, back when tabular data ingest was a central feature of the application, for specific reasons that time may have forgotten.

If this is causing problems with your RO-Crate use case, then yes, it is very easy to disable ingest completely: just set :TabularIngestSizeLimit to 0 and it will stop.
Also, please note that there is an "uningest" API that can undo ingest for any already-ingested files.
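
For reference, both are plain HTTP calls; a minimal sketch (host, file id, and API token are placeholders, and the /api/admin/ endpoints are typically reachable only from localhost):

```python
import urllib.request

# Disable tabular ingest entirely by setting the size limit to 0.
# (Placeholder host; /api/admin/ is usually blocked except from localhost.)
req = urllib.request.Request(
    "http://localhost:8080/api/admin/settings/:TabularIngestSizeLimit",
    data=b"0",
    method="PUT",
)
print(urllib.request.urlopen(req).read())

# Undo ingest for an already-ingested file. FILE_ID and the token are
# placeholders; uningest requires a superuser API token.
req = urllib.request.Request(
    "https://dataverse.example.edu/api/files/FILE_ID/uningest",
    headers={"X-Dataverse-key": "SUPERUSER_API_TOKEN"},
    method="POST",
)
print(urllib.request.urlopen(req).read())
```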
