Direct S3 Upload: Does not detect non-browser supplied file types, locally detected, such as .dta #6762
Comments
An update: with #7124, direct upload now performs the same extension-based detection as normal upload, which allows more types to be recognized and, if the type is an ingestable one, ingested. What still does not happen is detection based on file contents. Implementing those tests is awaiting progress on #6937, which will help standardize how a few bytes can be retrieved from a file independent of storage and upload method. (Waiting is not strictly necessary if this issue becomes a higher priority; I just expect less work and rework if #6937 happens first.)
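The distinction drawn above between extension-based and content-based detection can be sketched as follows. This is a minimal illustration, not Dataverse's actual implementation: the function names, the signature table, the MIME type labels, and the precedence order (content first, then extension, then the browser-supplied type) are all assumptions made for the example. The only file-format facts relied on are that a FITS file begins with the ASCII keyword `SIMPLE  =` and that newer Stata `.dta` files begin with the tag `<stata_dta>`; older `.dta` releases use a different header and are not covered.

```python
import os

# Illustrative signature table (hypothetical): leading bytes -> MIME type label.
SIGNATURES = [
    (b"SIMPLE  =", "application/fits"),       # FITS primary header keyword
    (b"<stata_dta>", "application/x-stata"),  # newer Stata .dta releases only
]

# Illustrative extension table (hypothetical).
EXTENSIONS = {
    ".fits": "application/fits",
    ".dta": "application/x-stata",
}

UNKNOWN = "application/octet-stream"


def detect_by_extension(filename):
    """Return a MIME type based only on the file name, or None."""
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSIONS.get(ext)


def detect_by_content(head):
    """Return a MIME type based on the first few bytes of the file, or None."""
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    return None


def detect(filename, head, browser_type=None):
    """Combine the checks; the precedence order here is an assumption."""
    return (
        detect_by_content(head)
        or detect_by_extension(filename)
        or browser_type
        or UNKNOWN
    )
```

The `head` argument stands in for the "few bytes" mentioned above. With direct upload those bytes would have to be fetched from the store (for example with a ranged read) rather than from a local temporary file, which is why a storage-independent way of retrieving them matters.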
Now that #6937 is merged, can it be used to improve file type detection for files uploaded using direct to s3 upload? I ask because I'm wondering if it would help prevent the file type detection issues we see in the dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/CJ3YXU. That dataset's FITS files are being uploaded using direct to s3 upload, and many were detected as text files. I've used the redetect API so that Dataverse recognizes more of those FITS files as FITS files, but for some files the API call has failed, maybe because of the large size of the files? The remaining FITS files that are still labelled as text files are each several GB.
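For readers unfamiliar with the redetect call mentioned above, here is a rough sketch of how such a request might be built. It assumes the endpoint shape `POST /api/files/{id}/redetect?dryRun=...` and the `X-Dataverse-key` header as I understand them from the Dataverse native API guide; please check the guide for your version before relying on it. The helper name, server URL, file id, and token are hypothetical, and the sketch only constructs the request without sending it.

```python
import urllib.request


def build_redetect_request(server_url, file_id, api_token, dry_run=True):
    """Build (but do not send) a redetect request for one file.

    Assumed endpoint shape: POST {server}/api/files/{id}/redetect?dryRun=true|false
    """
    flag = "true" if dry_run else "false"
    url = f"{server_url.rstrip('/')}/api/files/{file_id}/redetect?dryRun={flag}"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"X-Dataverse-key": api_token},  # API token header (assumed name)
    )


# Hypothetical values for illustration only.
request = build_redetect_request("https://dataverse.example.edu", 42, "my-api-token")
# To actually send it: urllib.request.urlopen(request)
```

Starting with `dry_run=True` lets you see what type would be detected before anything is saved.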
As part of #9611 I've been testing FITS files with S3 direct upload and they've been coming up as unknown rather than text. In that pull request I added a few bullets to set expectations about what is expected not to work with S3 direct upload: https://dataverse-guide--9611.org.readthedocs.build/en/9611/developers/big-data-support.html#features-that-are-disabled-if-s3-direct-upload-is-enabled

I'm honestly not sure what to say about file detection. I'm not sure what the pattern is or what the exact expected behavior is. If anyone wants to add to that PR to further clarify the situation, please go ahead! I didn't touch the existing mention of "the file contents are not inspected to evaluate their mimetype"; I just added the section with bullets.

I'll go ahead and mark my PR as related to this issue. Again, maybe we can at least clarify the guides a bit to better set expectations.
To focus on the most important features and bugs, we are closing issues created before 2020 (version 5.0) that are not new feature requests with the label 'Type: Feature'. If you created this issue and you feel the team should revisit this decision, please reopen the issue and leave a comment.
Currently, direct to s3 upload relies solely on the browser-supplied file type. It does not make use of local file type detection (JHOVE, examining signature bits), so it does not detect types such as .dta.