extract metadata (NcML XML) from NetCDF/HDF5 files, new "requirements" option for external tools #9239

Merged
merged 14 commits into from
Jan 20, 2023
Changes from 9 commits
3 changes: 3 additions & 0 deletions doc/release-notes/9153-extract-metadata.md
@@ -0,0 +1,3 @@
For NetCDF and HDF5 files, an attempt will be made to extract metadata in NcML (XML) format and save it as an auxiliary file.

An "extractNcml" API endpoint has been added, especially for installations with existing NetCDF and HDF5 files. After upgrading, they can iterate through these files and try to extract an NcML file.
@@ -1,5 +1,5 @@
Tool Type Scope Description
Data Explorer explore file A GUI which lists the variables in a tabular data file allowing searching, charting and cross tabulation analysis. See the README.md file at https://github.com/scholarsportal/dataverse-data-explorer-v2 for the instructions on adding Data Explorer to your Dataverse.
Whole Tale explore dataset A platform for the creation of reproducible research packages that allows users to launch containerized interactive analysis environments based on popular tools such as Jupyter and RStudio. Using this integration, Dataverse users can launch Jupyter and RStudio environments to analyze published datasets. For more information, see the `Whole Tale User Guide <https://wholetale.readthedocs.io/en/stable/users_guide/integration.html>`_.
File Previewers explore file A set of tools that display the content of files - including audio, html, `Hypothes.is <https://hypothes.is/>`_ annotations, images, PDF, text, video, tabular data, spreadsheets, GeoJSON, and ZipFiles - allowing them to be viewed without downloading the file. The previewers can be run directly from github.io, so the only required step is using the Dataverse API to register the ones you want to use. Documentation, including how to optionally brand the previewers, and an invitation to contribute through github are in the README.md file. Initial development was led by the Qualitative Data Repository and the spreadsheet previewer was added by the Social Sciences and Humanities Open Cloud (SSHOC) project. https://github.com/gdcc/dataverse-previewers
File Previewers explore file A set of tools that display the content of files - including audio, html, `Hypothes.is <https://hypothes.is/>`_ annotations, images, PDF, text, video, tabular data, spreadsheets, GeoJSON, zip, and NcML files - allowing them to be viewed without downloading the file. The previewers can be run directly from github.io, so the only required step is using the Dataverse API to register the ones you want to use. Documentation, including how to optionally brand the previewers, and an invitation to contribute through github are in the README.md file. Initial development was led by the Qualitative Data Repository and the spreadsheet previewer was added by the Social Sciences and Humanities Open Cloud (SSHOC) project. https://github.com/gdcc/dataverse-previewers
Data Curation Tool configure file A GUI for curating data by adding labels, groups, weights and other details to assist with informed reuse. See the README.md file at https://github.com/scholarsportal/Dataverse-Data-Curation-Tool for the installation instructions.
@@ -0,0 +1,26 @@
{
"displayName": "AuxFileViewer",
"description": "Show an auxiliary file from a dataset file.",
"toolName": "auxPreviewer",
"scope": "file",
"types": [
"preview"
],
"toolUrl": "https://example.com/AuxFileViewer.html",
"toolParameters": {
"queryParameters": [
{
"fileid": "{fileId}"
}
]
},
"requirements": {
"auxFilesExist": [
{
"formatTag": "myFormatTag",
"formatVersion": "0.1"
}
]
},
"contentType": "application/foobar"
}
14 changes: 12 additions & 2 deletions doc/sphinx-guides/source/api/external-tools.rst
@@ -53,15 +53,21 @@ External tools must be expressed in an external tool manifest file, a specific J
Examples of Manifests
+++++++++++++++++++++

Let's look at two examples of external tool manifests (one at the file level and one at the dataset level) before we dive into how they work.
Let's look at a few examples of external tool manifests (both at the file level and at the dataset level) before we dive into how they work.

.. _tools-for-files:

External Tools for Files
^^^^^^^^^^^^^^^^^^^^^^^^

:download:`fabulousFileTool.json <../_static/installation/files/root/external-tools/fabulousFileTool.json>` is a file level both an "explore" tool and a "preview" tool that operates on tabular files:
:download:`fabulousFileTool.json <../_static/installation/files/root/external-tools/fabulousFileTool.json>` is a file level tool (both an "explore" tool and a "preview" tool) that operates on tabular files:

.. literalinclude:: ../_static/installation/files/root/external-tools/fabulousFileTool.json

:download:`auxFileTool.json <../_static/installation/files/root/external-tools/auxFileTool.json>` is a file level preview tool that operates on auxiliary files associated with a data file (note the "requirements" section):

.. literalinclude:: ../_static/installation/files/root/external-tools/auxFileTool.json

External Tools for Datasets
^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -113,6 +119,10 @@ Terminology
allowedApiCalls httpMethod Which HTTP method the specified callback uses such as ``GET`` or ``POST``.

allowedApiCalls timeOut For non-public datasets and datafiles, how many minutes the signed URLs given to the tool should be valid for. Must be an integer.

requirements **Resources your tool needs to function.** For now, the only requirement you can specify is that one or more auxiliary files exist (see auxFilesExist in the :ref:`tools-for-files` example). Currently, requirements only apply to preview tools. If the requirements are not met, the preview tool is not shown.

auxFilesExist **An array containing formatTag and formatVersion pairs** for each auxiliary file that your tool needs to download to function properly. For example, a required aux file could have a ``formatTag`` of "NcML" and a ``formatVersion`` of "1.0". See also :doc:`/developers/aux-file-support`.

toolName A **name** of an external tool that is used to differentiate between external tools and also used in bundle.properties for localization in the Dataverse installation web interface. For example, the toolName for Data Explorer is ``explorer``. For the Data Curation Tool the toolName is ``dct``. This is an optional parameter in the manifest JSON file.
=========================== ==========
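
One way to read the ``requirements`` and ``auxFilesExist`` rows above is as a containment test: a preview tool is shown only when every required ``formatTag``/``formatVersion`` pair matches an auxiliary file that already exists for the data file. The following is a minimal sketch of that reading, not the Dataverse implementation (the body of ``meetsRequirements`` is not part of this diff). The class ``AuxRequirementSketch``, its nested ``AuxKey`` type, and the method signature are all hypothetical names chosen for illustration.

```java
import java.util.List;
import java.util.Objects;
import java.util.Set;

/**
 * Hypothetical illustration of the "auxFilesExist" requirement.
 * None of these names come from the Dataverse code base.
 */
final class AuxRequirementSketch {

    /** A (formatTag, formatVersion) pair, as written in a tool manifest. */
    static final class AuxKey {
        final String formatTag;
        final String formatVersion;

        AuxKey(String formatTag, String formatVersion) {
            this.formatTag = formatTag;
            this.formatVersion = formatVersion;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof AuxKey)) {
                return false;
            }
            AuxKey other = (AuxKey) o;
            return Objects.equals(formatTag, other.formatTag)
                    && Objects.equals(formatVersion, other.formatVersion);
        }

        @Override
        public int hashCode() {
            return Objects.hash(formatTag, formatVersion);
        }
    }

    /**
     * Returns true when every required pair is present among the aux files
     * that exist for a data file. An empty requirement list is trivially met.
     */
    static boolean meetsRequirements(List<AuxKey> required, Set<AuxKey> existing) {
        return existing.containsAll(required);
    }
}
```

Under this reading, a tool requiring ("NcML", "1.0") would be hidden for a file whose only aux file is ("NcML", "0.9"), because both the tag and the version have to match.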
41 changes: 41 additions & 0 deletions doc/sphinx-guides/source/api/native-api.rst
@@ -2248,6 +2248,47 @@ Currently the following methods are used to detect file types:
- The file extension (e.g. ".ipynb") is used, defined in a file called ``MimeTypeDetectionByFileExtension.properties``.
- The file name (e.g. "Dockerfile") is used, defined in a file called ``MimeTypeDetectionByFileName.properties``.

.. _extractNcml:

Extract NcML
~~~~~~~~~~~~

As explained in the :ref:`netcdf-and-hdf5` section of the User Guide, when those file types are uploaded, an attempt is made to extract an NcML file from them and store it as an auxiliary file.

This happens automatically but superusers can also manually trigger this NcML extraction process with the API endpoint below.

Note that "true" will be returned if an NcML file was created. "false" will be returned if there was an error or if the NcML file already exists (check server.log for details).

.. code-block:: bash

export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
export SERVER_URL=https://demo.dataverse.org
export ID=24

curl -H "X-Dataverse-key:$API_TOKEN" -X POST "$SERVER_URL/api/files/$ID/extractNcml"

The fully expanded example above (without environment variables) looks like this:

.. code-block:: bash

curl -H "X-Dataverse-key:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" -X POST "https://demo.dataverse.org/api/files/24/extractNcml"

A curl example using a PID:

.. code-block:: bash

export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
export SERVER_URL=https://demo.dataverse.org
export PERSISTENT_ID=doi:10.5072/FK2/AAA000

curl -H "X-Dataverse-key:$API_TOKEN" -X POST "$SERVER_URL/api/files/:persistentId/extractNcml?persistentId=$PERSISTENT_ID"

The fully expanded example above (without environment variables) looks like this:

.. code-block:: bash

curl -H "X-Dataverse-key:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" -X POST "https://demo.dataverse.org/api/files/:persistentId/extractNcml?persistentId=doi:10.5072/FK2/AAA000"
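
For installations that need to call this endpoint for many existing NetCDF and HDF5 files, the same request can be issued in a loop. The following is a minimal sketch using the JDK's ``java.net.http`` client (Java 11 or later). The class name ``ExtractNcmlBatch``, its helper methods, and the convention of passing file database IDs as command-line arguments are assumptions made for illustration; how you obtain the list of IDs is up to you, and the API token must belong to a superuser.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical helper that POSTs to extractNcml for each given file id. */
final class ExtractNcmlBatch {

    /** Builds the endpoint URL for one file, tolerating a trailing slash on the server URL. */
    static URI endpointFor(String serverUrl, long fileId) {
        String base = serverUrl.replaceAll("/+$", "");
        return URI.create(base + "/api/files/" + fileId + "/extractNcml");
    }

    /** Builds a body-less POST carrying the API token in the X-Dataverse-key header. */
    static HttpRequest buildRequest(String serverUrl, String apiToken, long fileId) {
        return HttpRequest.newBuilder(endpointFor(serverUrl, fileId))
                .header("X-Dataverse-key", apiToken)
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length < 3) {
            System.err.println("usage: ExtractNcmlBatch <serverUrl> <apiToken> <fileId>...");
            return;
        }
        HttpClient client = HttpClient.newHttpClient();
        for (int i = 2; i < args.length; i++) {
            long fileId = Long.parseLong(args[i]);
            HttpResponse<String> response = client.send(
                    buildRequest(args[0], args[1], fileId),
                    HttpResponse.BodyHandlers.ofString());
            // Print the raw response; a "false" result may simply mean the NcML file already exists.
            System.out.println(fileId + " -> HTTP " + response.statusCode() + " " + response.body());
        }
    }
}
```

Because "false" is returned both for errors and for files that already have an NcML auxiliary file, a batch run like this should be read alongside server.log rather than treated as a pass/fail report.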

Replacing Files
~~~~~~~~~~~~~~~

13 changes: 13 additions & 0 deletions doc/sphinx-guides/source/user/dataset-management.rst
@@ -177,11 +177,15 @@ File Handling

Certain file types in the Dataverse installation are supported by additional functionality, which can include downloading in different formats, previews, file-level metadata preservation, file-level data citation; and exploration through data visualization and analysis. See the sections below for information about special functionality for specific file types.

.. _file-previews:

File Previews
-------------

Dataverse installations can add previewers for common file types uploaded by their research communities. The previews appear on the file page. If a preview tool for a specific file type is available, the preview will be created and will display automatically, after terms have been agreed to or a guestbook entry has been made, if necessary. File previews are not available for restricted files unless they are being accessed using a Private URL. See also :ref:`privateurl`.

Installation of previewers is explained in the :doc:`/admin/external-tools` section of the Admin Guide.

Tabular Data Files
------------------

@@ -299,6 +303,15 @@ Astronomy (FITS)

Metadata found in the header section of `Flexible Image Transport System (FITS) files <http://fits.gsfc.nasa.gov/fits_primer.html>`_ are automatically extracted by the Dataverse Software, aggregated and displayed in the Astronomy Domain-Specific Metadata of the Dataset that the file belongs to. This FITS file metadata is therefore searchable and browsable (facets) at the Dataset-level.

.. _netcdf-and-hdf5:

NetCDF and HDF5
---------------

For NetCDF and HDF5 files, an attempt will be made to extract metadata in NcML_ (XML) format and save it as an auxiliary file. (See also :doc:`/developers/aux-file-support` in the Developer Guide.) A previewer for these NcML files is available (see :ref:`file-previews`).

.. _NcML: https://docs.unidata.ucar.edu/netcdf-java/current/userguide/ncml_overview.html

Compressed Files
----------------

5 changes: 4 additions & 1 deletion src/main/java/edu/harvard/iq/dataverse/AuxiliaryFile.java
@@ -55,7 +55,10 @@ public class AuxiliaryFile implements Serializable {
private String formatTag;

private String formatVersion;


/**
* The application/entity that created the auxiliary file.
*/
private String origin;

private boolean isPublic;
@@ -70,9 +70,13 @@ public AuxiliaryFile save(AuxiliaryFile auxiliaryFile) {
* @param type how to group the files such as "DP" for "Differentially
* Private Statistics".
* @param mediaType user supplied content type (MIME type)
* @return success boolean - returns whether the save was successful
* @param save boolean - true to save immediately, false to let the cascade
* persist it to the database.
* @return an AuxiliaryFile with an id when save=true (assuming no
* exceptions) or an AuxiliaryFile without an id that will be persisted
* later through the cascade.
*/
public AuxiliaryFile processAuxiliaryFile(InputStream fileInputStream, DataFile dataFile, String formatTag, String formatVersion, String origin, boolean isPublic, String type, MediaType mediaType) {
public AuxiliaryFile processAuxiliaryFile(InputStream fileInputStream, DataFile dataFile, String formatTag, String formatVersion, String origin, boolean isPublic, String type, MediaType mediaType, boolean save) {

StorageIO<DataFile> storageIO = null;
AuxiliaryFile auxFile = new AuxiliaryFile();
@@ -114,7 +118,14 @@ public AuxiliaryFile processAuxiliaryFile(InputStream fileInputStream, DataFile
auxFile.setType(type);
auxFile.setDataFile(dataFile);
auxFile.setFileSize(storageIO.getAuxObjectSize(auxExtension));
auxFile = save(auxFile);
if (save) {
auxFile = save(auxFile);
} else {
if (dataFile.getAuxiliaryFiles() == null) {
dataFile.setAuxiliaryFiles(new ArrayList<>());
}
dataFile.getAuxiliaryFiles().add(auxFile);
}
} catch (IOException ioex) {
logger.severe("IO Exception trying to save auxiliary file: " + ioex.getMessage());
throw new InternalServerErrorException();
@@ -129,7 +140,11 @@ public AuxiliaryFile processAuxiliaryFile(InputStream fileInputStream, DataFile
}
return auxFile;
}


public AuxiliaryFile processAuxiliaryFile(InputStream fileInputStream, DataFile dataFile, String formatTag, String formatVersion, String origin, boolean isPublic, String type, MediaType mediaType) {
return processAuxiliaryFile(fileInputStream, dataFile, formatTag, formatVersion, origin, isPublic, type, mediaType, true);
}

public AuxiliaryFile lookupAuxiliaryFile(DataFile dataFile, String formatTag, String formatVersion) {

Query query = em.createNamedQuery("AuxiliaryFile.lookupAuxiliaryFile");
2 changes: 1 addition & 1 deletion src/main/java/edu/harvard/iq/dataverse/DatasetPage.java
@@ -5490,7 +5490,7 @@ public List<ExternalTool> getCachedToolsForDataFile(Long fileId, ExternalTool.Ty
return cachedTools;
}
DataFile dataFile = datafileService.find(fileId);
cachedTools = ExternalToolServiceBean.findExternalToolsByFile(externalTools, dataFile);
cachedTools = externalToolService.findExternalToolsByFile(externalTools, dataFile);
cachedToolsByFileId.put(fileId, cachedTools); //add to map so we don't have to do the lifting again
return cachedTools;
}
15 changes: 14 additions & 1 deletion src/main/java/edu/harvard/iq/dataverse/FilePage.java
@@ -39,6 +39,7 @@
import edu.harvard.iq.dataverse.util.JsfHelper;
import static edu.harvard.iq.dataverse.util.JsfHelper.JH;
import edu.harvard.iq.dataverse.util.SystemConfig;
import edu.harvard.iq.dataverse.util.json.JsonUtil;
import java.io.IOException;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
@@ -57,6 +58,9 @@
import javax.faces.view.ViewScoped;
import javax.inject.Inject;
import javax.inject.Named;
import javax.json.JsonArray;
import javax.json.JsonObject;
import javax.json.JsonValue;
import javax.validation.ConstraintViolation;

import org.primefaces.PrimeFaces;
@@ -125,6 +129,8 @@ public class FilePage implements java.io.Serializable {
ExternalToolServiceBean externalToolService;
@EJB
PrivateUrlServiceBean privateUrlService;
@EJB
AuxiliaryFileServiceBean auxiliaryFileService;

@Inject
DataverseRequestServiceBean dvRequestService;
@@ -285,8 +291,15 @@ public void setDatasetVersionId(Long datasetVersionId) {
this.datasetVersionId = datasetVersionId;
}

// findPreviewTools would be a better name
private List<ExternalTool> sortExternalTools(){
List<ExternalTool> retList = externalToolService.findFileToolsByTypeAndContentType(ExternalTool.Type.PREVIEW, file.getContentType());
List<ExternalTool> retList = new ArrayList<>();
List<ExternalTool> previewTools = externalToolService.findFileToolsByTypeAndContentType(ExternalTool.Type.PREVIEW, file.getContentType());
for (ExternalTool previewTool : previewTools) {
if (externalToolService.meetsRequirements(previewTool, file)) {
retList.add(previewTool);
}
}
Collections.sort(retList, CompareExternalToolName);
return retList;
}
21 changes: 21 additions & 0 deletions src/main/java/edu/harvard/iq/dataverse/api/Files.java
@@ -625,6 +625,27 @@ public Response redetectDatafile(@PathParam("id") String id, @QueryParam("dryRun
}
}

@Path("{id}/extractNcml")
@POST
public Response extractNcml(@PathParam("id") String id) {
try {
AuthenticatedUser au = findAuthenticatedUserOrDie();
if (!au.isSuperuser()) {
// We can always make a command in the future if there's a need
// for non-superusers to call this API.
return error(Response.Status.FORBIDDEN, "This API call can be used by superusers only");
}
DataFile dataFileIn = findDataFileOrDie(id);
java.nio.file.Path tempLocationPath = null;
boolean successOrFail = ingestService.extractMetadataNcml(dataFileIn, tempLocationPath);
NullSafeJsonBuilder result = NullSafeJsonBuilder.jsonObjectBuilder()
.add("result", successOrFail);
return ok(result);
} catch (WrappedResponse wr) {
return wr.getResponse();
}
}

/**
* Attempting to run metadata export, for all the formats for which we have
* metadata Exporters.
4 changes: 3 additions & 1 deletion src/main/java/edu/harvard/iq/dataverse/api/TestApi.java
@@ -63,7 +63,9 @@ public Response getExternalToolsForFile(@PathParam("id") String idSupplied, @Que
ApiToken apiToken = externalToolService.getApiToken(getRequestApiKey());
ExternalToolHandler externalToolHandler = new ExternalToolHandler(tool, dataFile, apiToken, dataFile.getFileMetadata(), null);
JsonObjectBuilder toolToJson = externalToolService.getToolAsJsonWithQueryParameters(externalToolHandler);
tools.add(toolToJson);
if (externalToolService.meetsRequirements(tool, dataFile)) {
tools.add(toolToJson);
}
}
return ok(tools);
} catch (WrappedResponse wr) {
@@ -39,6 +39,7 @@ public class ExternalTool implements Serializable {
public static final String CONTENT_TYPE = "contentType";
public static final String TOOL_NAME = "toolName";
public static final String ALLOWED_API_CALLS = "allowedApiCalls";
public static final String REQUIREMENTS = "requirements";

@Id
@GeneratedValue(strategy = GenerationType.IDENTITY)
@@ -103,6 +104,15 @@ public class ExternalTool implements Serializable {
@Column(nullable = true, columnDefinition = "TEXT")
private String allowedApiCalls;

/**
* When non-null, the tool has indicated that it has certain requirements
* that must be met before it should be shown to the user. This
* functionality was added for tools that operate on aux files rather than
* data files so "auxFilesExist" is one of the possible values.
*/
@Column(nullable = true, columnDefinition = "TEXT")
private String requirements;

/**
* This default constructor is only here to prevent this error at
* deployment:
@@ -118,10 +128,10 @@ public ExternalTool() {
}

public ExternalTool(String displayName, String toolName, String description, List<ExternalToolType> externalToolTypes, Scope scope, String toolUrl, String toolParameters, String contentType) {
this(displayName, toolName, description, externalToolTypes, scope, toolUrl, toolParameters, contentType, null);
this(displayName, toolName, description, externalToolTypes, scope, toolUrl, toolParameters, contentType, null, null);
}

public ExternalTool(String displayName, String toolName, String description, List<ExternalToolType> externalToolTypes, Scope scope, String toolUrl, String toolParameters, String contentType, String allowedApiCalls) {
public ExternalTool(String displayName, String toolName, String description, List<ExternalToolType> externalToolTypes, Scope scope, String toolUrl, String toolParameters, String contentType, String allowedApiCalls, String requirements) {
this.displayName = displayName;
this.toolName = toolName;
this.description = description;
@@ -131,6 +141,7 @@ public ExternalTool(String displayName, String toolName, String description, Lis
this.toolParameters = toolParameters;
this.contentType = contentType;
this.allowedApiCalls = allowedApiCalls;
this.requirements = requirements;
}

public enum Type {
@@ -326,5 +337,12 @@ public void setAllowedApiCalls(String allowedApiCalls) {
this.allowedApiCalls = allowedApiCalls;
}

public String getRequirements() {
return requirements;
}

public void setRequirements(String requirements) {
this.requirements = requirements;
}

}