Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix S3 direct upload NPE and NetCDF/HDF5 and keep geospatial metadata extraction #9611

Merged
merged 8 commits into from
Jul 10, 2023

Conversation

pdurbin
Copy link
Member

@pdurbin pdurbin commented May 23, 2023

What this PR does / why we need it:

The fileMetadataExtractableFromNetcdf method added in PR #9541 breaks S3 direct upload. This PR fixes it.

It turns out a number of features don't work when S3 direct upload is enabled. This PR enumerates them.

Finally a config setting was added: dataverse.netcdf.geo-extract-s3-direct-upload.

Which issue(s) this PR closes:

Special notes for your reviewer:

As if (driverType.equals("tmp")) in IngestServiceBean.saveAndAddFilesToDataset() is executing, note that if S3 direct upload is enabled, driverType is "s3". However, if S3 direct upload is disabled (but S3 is still enabled), driverType is "tmp". (I found this confusing. I'd be happy to add a comment in the code about this.)

This means that methods like extractMetadataNcml() and code to generate thumbnails is not executed when S3 direct upload is enabled.

The behavior above is not changed by this pull request, but it did lead me to set expectations better in the guides.

Suggestions on how to test this:

Primary bug:

  • enable S3 direct upload
  • upload any file, should work
  • upload a NetCDF or HDF5 file, expect no NcML file (always) and no geospatial metadata extraction (unless you enable the new config setting: dataverse.netcdf.geo-extract-s3-direct-upload)

To test FITS extraction:

  • enable S3 direct upload
  • enable astro block
  • upload a FITS file (expect no metadata extraction and the following error in the log: "fits is a filename/extension Dataverse doesn't know about. Consider adding it to the MimeTypeDetectionByFileExtension.properties file.")

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

No.

Is there a release notes update needed for this change?:

Yes, I updated the previous release note to mention the new setting.

Additional documentation:

I expanded our docs on what doesn't work during S3 direct upload, including NetCDF/HDF5 metadata extraction. I also documented the new setting.

Note that the NcML aux file is not created when S3 direct upload is
enabled.
@pdurbin pdurbin self-assigned this May 23, 2023
@github-actions

This comment has been minimized.

@pdurbin pdurbin changed the title fix S3 direct upload NPE and keep NetCDF/HDF5 features #9601 fix S3 direct upload NPE and NetCDF/HDF5 and keep geospatial metadata extraction #9601 May 24, 2023
@pdurbin pdurbin marked this pull request as ready for review May 24, 2023 13:45
@pdurbin pdurbin removed their assignment May 24, 2023
@github-actions

This comment has been minimized.

By default, keep S3 direct upload fast. Don't download NetCDF or HDF5
files to try to pull geospatial metadata out of them when S3 direct
upload is configured. If you really want this, add this setting and make
it true.
@github-actions

This comment has been minimized.

@pdurbin pdurbin changed the title fix S3 direct upload NPE and NetCDF/HDF5 and keep geospatial metadata extraction #9601 fix S3 direct upload NPE and NetCDF/HDF5 and keep geospatial metadata extraction May 25, 2023
@github-actions

This comment has been minimized.

@pdurbin pdurbin added this to the 5.14 milestone May 25, 2023
@github-actions
Copy link

📦 Pushed preview application image as

ghcr.io/gdcc/dataverse:9601-keep-features

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

Conflicts:
doc/sphinx-guides/source/installation/config.rst
src/main/java/edu/harvard/iq/dataverse/ingest/IngestServiceBean.java
@pdurbin pdurbin removed their assignment Jun 15, 2023
@github-actions

This comment has been minimized.

@sekmiller sekmiller self-assigned this Jun 15, 2023
@sekmiller sekmiller self-assigned this Jun 21, 2023
@sekmiller sekmiller removed their assignment Jun 22, 2023
@kcondon kcondon self-assigned this Jun 23, 2023
@kcondon
Copy link
Contributor

kcondon commented Jun 23, 2023

Having some trouble getting this to work. Tried env var, tried jvm option, tried .properties file.
Will try a known good setting to make sure I am configuring it correctly.

@kcondon
Copy link
Contributor

kcondon commented Jun 27, 2023

@pdurbin I was unable to get this working using a jvm-option:
-Ddataverse.netcdf.geo-extract-s3-direct-upload=true

I also tried an env var, DATAVERSE_NETCDF_GEO_EXTRACT_S3_DIRECT_UPLOAD and no luck.

@kcondon kcondon assigned pdurbin and unassigned kcondon Jun 27, 2023
@@ -123,6 +123,10 @@ public enum JvmSettings {
SCOPE_UI(PREFIX, "ui"),
UI_ALLOW_REVIEW_INCOMPLETE(SCOPE_UI, "allow-review-for-incomplete"),
UI_SHOW_VALIDITY_FILTER(SCOPE_UI, "show-validity-filter"),

// NetCDF SETTINGS
SCOPE_NETCDF(PREFIX, "netcdf"),
Copy link
Contributor

@poikilotherm poikilotherm Jun 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please start having an "ingest" scope where we collect these? (Netcdf being a subscope of it)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe? The timing is poor because I was away for two weeks and now you're away until August.

What are you suggesting, specifically?

Instead of...

dataverse.netcdf.geo-extract-s3-direct-upload

... you'd like to see...

dataverse.ingest.netcdf.geo-extract-s3-direct-upload?

I'm ok with it if @landreev and/or @scolapasta are ok with it. And @sekmiller did the review. And @qqmyers does a lot of storage work. They might want to weigh in.

For the record, I looked for ingest stuff in our config page and we do already have these:

  • dataverse.files.<id>.ingestsizelimit
  • :TabularIngestSizeLimit

Personally, I've been referring to this feature as "metadata extraction" (from a file) rather than ingest. I'm not sure where we draw the line on what ingest is.

I'll plug having a glossary, like I always do! 😜

@cmbz cmbz added Size: 3 A percentage of a sprint. 2.1 hours. Status: Needs Input Applied to issues in need of input from someone currently unavailable labels Jul 5, 2023
@pdurbin pdurbin removed the Status: Needs Input Applied to issues in need of input from someone currently unavailable label Jul 10, 2023
@coveralls
Copy link

Coverage Status

coverage: 20.373% (+0.003%) from 20.37% when pulling bbea224 on 9601-keep-features into f24c7de on develop.

@github-actions
Copy link

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:9601-keep-features
ghcr.io/gdcc/configbaker:9601-keep-features

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

@pdurbin
Copy link
Member Author

pdurbin commented Jul 10, 2023

I removed the "waiting" label 'cuz I'm back, baby! 😜

First, I merged the latest from develop.

Then I retested, this time on my dev1 server using JVM options rather than on my laptop using Docker and environment variables (both should work).

Here's the file I'm using to test: https://github.com/IQSS/dataverse/blob/develop/src/test/resources/netcdf/ICOADS_R3.0.0_1662-10.nc

The geospatial block must be enabled for the bounding box to be populated.

By using <jvm-options>-Ddataverse.netcdf.geo-extract-s3-direct-upload=true</jvm-options> in domain.xml, it was populated, like this:

Screen Shot 2023-07-10 at 2 24 47 PM

If you remove that JVM option and keep the setup the same (s3 direct upload enabled), the bounding box should not be populated.

For reference, here are some of my JVM options. I configured a new collection to use this "s3" store:

<jvm-options>-Ddataverse.files.s3.type=s3</jvm-options>
<jvm-options>-Ddataverse.files.s3.label=s3</jvm-options>
<jvm-options>-Ddataverse.files.s3.custom-endpoint-url=s3.us-east-2.amazonaws.com</jvm-options>
<jvm-options>-Ddataverse.files.s3.custom-endpoint-region=us-east-2</jvm-options>
<jvm-options>-Ddataverse.files.s3.bucket-name=pdurbin</jvm-options>
<jvm-options>-Ddataverse.files.s3.access-key=REDACTED</jvm-options>
<jvm-options>-Ddataverse.files.s3.secret-key=REDACTED</jvm-options>
<jvm-options>-Ddataverse.files.s3.upload-redirect=true</jvm-options>
<jvm-options>-Ddataverse.netcdf.geo-extract-s3-direct-upload=true</jvm-options>

I'm putting this into "Ready for QA". I'm happy to be co-assigned to work through any problems.

@pdurbin pdurbin removed their assignment Jul 10, 2023
@kcondon kcondon self-assigned this Jul 10, 2023
@kcondon
Copy link
Contributor

kcondon commented Jul 10, 2023

@pdurbin Thanks for the examples. I misunderstood -I was thinking of the aux file that is generated post upload for netcdf files. Also, the netcdf file may not contain bounding box info so need to have one that does. I have confirmed bounding box metadata is extracted for direct s3 upload. Note that the netcdf aux file is not generated for direct s3 upload.

@kcondon kcondon merged commit 2b93c55 into develop Jul 10, 2023
@kcondon kcondon deleted the 9601-keep-features branch July 10, 2023 20:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Size: 3 A percentage of a sprint. 2.1 hours.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fileMetadataExtractableFromNetcdf causing regression for S3 direct upload
6 participants