Direct S3 Upload: Does not detect non-browser supplied file types, locally detected, such as .dta #6762

Closed
kcondon opened this issue Mar 23, 2020 · 4 comments

@kcondon
Contributor

kcondon commented Mar 23, 2020

Currently, direct-to-S3 upload relies solely on the file type supplied by the browser. It does not use local file type detection (JHOVE, examining signature bits), so it does not detect types such as .dta.

@qqmyers
Member

qqmyers commented Aug 12, 2020

An update: with #7124, direct upload now performs the same extension-based detection as normal upload, so more types are recognized and, if the type supports it, ingested.
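For readers who have not looked at this path, extension-based detection can be sketched roughly as below. This is an illustration only, not the code from #7124: the `EXTENSION_TYPES` table, the `GENERIC_TYPES` set, the function name, and the MIME strings are all assumptions made for the example.

```python
import os

# Hypothetical extension-to-MIME table for illustration; the real mapping
# used by Dataverse is not reproduced here.
EXTENSION_TYPES = {
    ".dta": "application/x-stata",
    ".fits": "application/fits",
    ".csv": "text/csv",
}

# Browser-supplied types treated as "no useful information" (an assumption).
GENERIC_TYPES = {"", "application/octet-stream"}


def detect_by_extension(filename, browser_type):
    """Keep a specific browser type; otherwise fall back to the extension."""
    if browser_type not in GENERIC_TYPES:
        return browser_type
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSION_TYPES.get(ext, "application/octet-stream")
```

With this sketch, `detect_by_extension("survey.DTA", "application/octet-stream")` falls back to the extension table, while a specific browser type such as `image/png` is passed through unchanged.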

What is still not happening is detection based on file contents. Implementing those tests is waiting on #6937, which will help standardize how a few bytes can be retrieved from a file independent of storage and upload method. (Waiting is not strictly necessary if this issue becomes a higher priority; I just expect less work/rework if #6937 happens first.)
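To make the idea of content-based detection concrete, here is a minimal sketch that reads a few leading bytes from any binary stream and compares them against known signatures. It is not the Dataverse or JHOVE implementation: `read_prefix`, `detect_by_content`, and the MIME strings are hypothetical names chosen for the example. The two signatures are the `SIMPLE  =` keyword that opens a FITS primary header and the `<stata_dta>` tag that opens newer Stata files (format 117 and later); older .dta formats would need a different check.

```python
import io

# Leading byte signatures mapped to illustrative MIME types (assumed strings).
SIGNATURES = [
    (b"SIMPLE  =", "application/fits"),       # start of a FITS primary header
    (b"<stata_dta>", "application/x-stata"),  # Stata .dta, format 117 and later
]


def read_prefix(stream, n=16):
    """Read up to n bytes from any binary file-like object."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:  # end of stream
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def detect_by_content(stream):
    """Return a MIME type if the leading bytes match a signature, else None."""
    prefix = read_prefix(stream)
    for magic, mime in SIGNATURES:
        if prefix.startswith(magic):
            return mime
    return None
```

Because the function only needs a readable stream, the same check could in principle run against a local temp file or against the first bytes of a stored object, which is the storage-independent access described above.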

@jggautier
Contributor

jggautier commented May 26, 2023

Now that #6937 is merged, can it be used to improve file type detection for files uploaded via direct-to-S3 upload?

I ask because I'm wondering if it will help prevent the file type detection issues we see in the dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/CJ3YXU. That dataset's FITS files are being uploaded using direct to s3 upload, and many were detected as text files.

I've used the redetect API so that Dataverse recognizes more of those FITS files as FITS files, but for some files the API call has failed, possibly because the files are so large. The remaining FITS files that are still labelled as text files are several GB in size.
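For anyone who wants to try the same thing, a redetect call can be sketched as below. This follows the `/api/files/$ID/redetect?dryRun=true` form with an `X-Dataverse-key` header as I understand it from the native API guide, so please check it against the guide for your version. The helper name `build_redetect_request` and the server URL, file id, and token in the usage line are placeholders, not values from the dataset above.

```python
import urllib.parse
import urllib.request


def build_redetect_request(server_url, file_id, api_token, dry_run=True):
    """Build (but do not send) a POST request for the redetect endpoint."""
    query = urllib.parse.urlencode({"dryRun": "true" if dry_run else "false"})
    url = f"{server_url.rstrip('/')}/api/files/{file_id}/redetect?{query}"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"X-Dataverse-key": api_token},
    )


# Placeholder values; sending the request would be done with
# urllib.request.urlopen(request).
request = build_redetect_request("https://demo.dataverse.org", 42, "xxxx-token")
```

Starting with `dry_run=True` reports the type that would be assigned without saving it, which seems a sensible first step for the multi-GB files where the call has been failing.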

@pdurbin
Member

pdurbin commented May 26, 2023

As part of #9611 I've been testing FITS files with S3 direct upload and they've been coming up as unknown rather than text.

In that pull request I added a few bullets to set expectations about what is not expected to work with S3 direct upload: https://dataverse-guide--9611.org.readthedocs.build/en/9611/developers/big-data-support.html#features-that-are-disabled-if-s3-direct-upload-is-enabled

I'm honestly not sure what to say about file detection. I'm not sure what the pattern is or what exactly the expected behavior is. If anyone wants to add to that PR to further clarify the situation, please go ahead!

I didn't touch the mention of "the file contents are not inspected to evaluate their mimetype", highlighted in blue in the screenshot below. I just added the section with bullets:

[Screenshot: Screen Shot 2023-05-26 at 11 34 44 AM]

I'll go ahead and mark my PR as related to this issue. Again, maybe we can at least clarify the guides a bit to better set expectations.

@cmbz

cmbz commented Aug 20, 2024

To focus on the most important features and bugs, we are closing issues created before 2020 (version 5.0) that are not new feature requests with the label 'Type: Feature'.

If you created this issue and you feel the team should revisit this decision, please reopen the issue and leave a comment.

@cmbz closed this as completed Aug 20, 2024