Merge pull request #5049 from QualitativeDataRepository/IQSS/4706-DPN_Submission_of_archival_copies

Iqss/4706 dpn submission of archival copies

kcondon authored Jan 23, 2019
2 parents 4cb8cfd + 8d09fbf commit 14fa723
Showing 29 changed files with 2,171 additions and 280 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -48,3 +48,4 @@ scripts/installer/default.config
# do not track IntelliJ IDEA files
.idea
**/*.iml
/bin/
11 changes: 9 additions & 2 deletions doc/sphinx-guides/source/admin/integrations.rst
@@ -93,15 +93,22 @@ SHARE
`SHARE <http://www.share-research.org>`_ is building a free, open, data set about research and scholarly activities across their life cycle. It's possible to add an installation of Dataverse as one of the `sources <https://share.osf.io/sources>`_ they include if you contact the SHARE team.

Research Data Preservation
--------------------
+--------------------------

Archivematica
-+++++
++++++++++++++

`Archivematica <https://www.archivematica.org>`_ is an integrated suite of open-source tools for processing digital objects for long-term preservation, developed and maintained by Artefactual Systems Inc. Its configurable workflow is designed to produce system-independent, standards-based Archival Information Packages (AIPs) suitable for long-term storage and management.

Sponsored by the `Ontario Council of University Libraries (OCUL) <https://ocul.on.ca/>`_, this technical integration enables users of Archivematica to select datasets from connected Dataverse instances and process them for long-term access and digital preservation. For more information and list of known issues, please refer to Artefactual's `release notes <https://wiki.archivematica.org/Archivematica_1.8_and_Storage_Service_0.13_release_notes>`_, `integration documentation <https://www.archivematica.org/en/docs/archivematica-1.8/user-manual/transfer/dataverse/>`_, and the `project wiki <https://wiki.archivematica.org/Dataverse>`_.

DuraCloud/Chronopolis
+++++++++++++++++++++

Dataverse can be configured to submit a copy of published Datasets, packaged as `Research Data Alliance conformant <https://www.rd-alliance.org/system/files/Research%20Data%20Repository%20Interoperability%20WG%20-%20Final%20Recommendations_reviewed_0.pdf>`_ zipped `BagIt <https://tools.ietf.org/html/draft-kunze-bagit-17>`_ bags, to `Chronopolis <https://libraries.ucsd.edu/chronopolis/>`_ via `DuraCloud <https://duraspace.org/duracloud/>`_.

For details on how to configure this integration, look for "DuraCloud/Chronopolis" in the :doc:`/installation/config` section of the Installation Guide.
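
A minimal sketch of that configuration, using the admin settings API (the setting names come from the archiver workflow step in the Developer Guide; all values here, including the archiver class name and the assumed comma-separated format of ``:ArchiverSettings``, are illustrative placeholders to be taken from the Installation Guide):

.. code:: bash

  # Select the archiver implementation (placeholder class name - see the Installation Guide)
  curl http://localhost:8080/api/admin/settings/:ArchiverClassName -X PUT -d "<archiver class name>"
  # Tell Dataverse which additional settings the archiver reads (assumed comma-separated list)
  curl http://localhost:8080/api/admin/settings/:ArchiverSettings -X PUT -d ":DuraCloudHost, :DuraCloudPort, :DuraCloudContext"
  # DuraCloud connection details (example host value)
  curl http://localhost:8080/api/admin/settings/:DuraCloudHost -X PUT -d "my.duracloud.example.org"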

Future Integrations
-------------------

96 changes: 1 addition & 95 deletions doc/sphinx-guides/source/developers/big-data-support.rst
@@ -230,7 +230,7 @@ Configuring the RSAL Mock

Info for configuring the RSAL Mock: https://github.com/sbgrid/rsal/tree/master/mocks

-Also, to configure Dataverse to use the new workflow you must do the following (see also the section below on workflows):
+Also, to configure Dataverse to use the new workflow you must do the following (see also the :doc:`workflows` section):

1. Configure the RSAL URL:

@@ -301,98 +301,4 @@ In the GUI, this is called "Local Access". It's where you can compute on files o

``curl http://localhost:8080/api/admin/settings/:LocalDataAccessPath -X PUT -d "/programs/datagrid"``

Workflows
---------

Dataverse can perform two sequences of actions when datasets are published: one prior to publishing (marked by a ``PrePublishDataset`` trigger), and one after the publication has succeeded (``PostPublishDataset``). The pre-publish workflow is useful for having an external system prepare a dataset for public access (a possibly lengthy activity that requires moving files around, uploading videos to a streaming server, etc.), or for starting an approval process. A post-publish workflow might be used for sending notifications about the newly published dataset.

Workflow steps are created using *step providers*. Dataverse ships with an internal step provider that offers some basic functionality, and with the ability to load third-party step providers. This allows installations to implement the functionality they need without changing the Dataverse source code.

Steps can be internal (say, writing some data to the log) or external. External steps involve Dataverse sending a request to an external system and waiting for the system to reply. The wait period is arbitrary, which allows the external system unbounded operation time. This is useful, e.g., for steps that require human intervention, such as manual approval of a dataset publication.

The external system reports the step result back to Dataverse by sending an HTTP ``POST`` request to ``api/workflows/{invocation-id}``. The body of the request is passed to the paused step for further processing.

If a step in a workflow fails, Dataverse makes an effort to roll back all the steps that preceded it. Some actions, such as writing to the log, cannot be rolled back. If such an action has a public external effect (e.g., sending an email to a mailing list), it is advisable to put it in the post-publish workflow.

.. tip::
  For invoking external systems using a REST API, Dataverse's internal step
  provider offers a step for sending and receiving customizable HTTP requests.
  It's called *http/sr*, and is detailed below.

Administration
~~~~~~~~~~~~~~

A Dataverse instance stores a set of workflows in its database. Workflows can be managed using the ``api/admin/workflows/`` endpoints of the :doc:`/api/native-api`. Sample workflow files are available in ``scripts/api/data/workflows``.

At the moment, defining a workflow for each trigger is done for the entire instance, using the endpoint ``api/admin/workflows/default/«trigger type»``.

In order to prevent unauthorized resuming of workflows, Dataverse maintains a "white list" of IP addresses from which resume requests are honored. This list is maintained using the ``/api/admin/workflows/ip-whitelist`` endpoint of the :doc:`/api/native-api`. By default, Dataverse honors resume requests from localhost only (``127.0.0.1;::1``), so setups that use a single server work with no additional configuration.


Available Steps
~~~~~~~~~~~~~~~

Dataverse has an internal step provider, whose id is ``:internal``. It offers the following steps:

log
^^^

A step that writes data about the current workflow invocation to the instance log. It also writes the messages in its ``parameters`` map.

.. code:: json

  {
    "provider": ":internal",
    "stepType": "log",
    "parameters": {
      "aMessage": "message content",
      "anotherMessage": "message content, too"
    }
  }

pause
^^^^^

A step that pauses the workflow. The workflow is paused until a POST request is sent to ``/api/workflows/{invocation-id}``.

.. code:: json

  {
    "provider": ":internal",
    "stepType": "pause"
  }

http/sr
^^^^^^^

A step that sends an HTTP request to an external system and then waits for a response. The response has to match a regular expression specified in the step parameters. The URL, content type, and message body can use data from the workflow context, using a simple markup language. This step has specific parameters for rollback.

.. code:: json

  {
    "provider": ":internal",
    "stepType": "http/sr",
    "parameters": {
      "url": "http://localhost:5050/dump/${invocationId}",
      "method": "POST",
      "contentType": "text/plain",
      "body": "START RELEASE ${dataset.id} as ${dataset.displayName}",
      "expectedResponse": "OK.*",
      "rollbackUrl": "http://localhost:5050/dump/${invocationId}",
      "rollbackMethod": "DELETE ${dataset.id}"
    }
  }

Available variables are:

* ``invocationId``
* ``dataset.id``
* ``dataset.identifier``
* ``dataset.globalId``
* ``dataset.displayName``
* ``dataset.citation``
* ``minorVersion``
* ``majorVersion``
* ``releaseStatus``
1 change: 1 addition & 0 deletions doc/sphinx-guides/source/developers/index.rst
@@ -31,3 +31,4 @@ Developer Guide
geospatial
selinux
big-data-support
workflows
130 changes: 130 additions & 0 deletions doc/sphinx-guides/source/developers/workflows.rst
@@ -0,0 +1,130 @@
Workflows
================

Dataverse has a flexible workflow mechanism that can be used to trigger actions before and after Dataset publication.

.. contents:: |toctitle|
  :local:


Introduction
------------

Dataverse can perform two sequences of actions when datasets are published: one prior to publishing (marked by a ``PrePublishDataset`` trigger), and one after the publication has succeeded (``PostPublishDataset``). The pre-publish workflow is useful for having an external system prepare a dataset for public access (a possibly lengthy activity that requires moving files around, uploading videos to a streaming server, etc.), or for starting an approval process. A post-publish workflow might be used for sending notifications about the newly published dataset.

Workflow steps are created using *step providers*. Dataverse ships with an internal step provider that offers some basic functionality, and with the ability to load third-party step providers. This allows installations to implement the functionality they need without changing the Dataverse source code.

Steps can be internal (say, writing some data to the log) or external. External steps involve Dataverse sending a request to an external system and waiting for the system to reply. The wait period is arbitrary, which allows the external system unbounded operation time. This is useful, e.g., for steps that require human intervention, such as manual approval of a dataset publication.

The external system reports the step result back to Dataverse by sending an HTTP ``POST`` request to ``api/workflows/{invocation-id}``. The body of the request is passed to the paused step for further processing.
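
For example, a minimal sketch of resuming a paused workflow (``$INVOCATION_ID`` is a placeholder for the id Dataverse generated for this run; the body is whatever the paused step expects):

.. code:: bash

  # Report success back to the paused step, passing a message in the request body
  curl -X POST -d "OK - approved by curator" http://localhost:8080/api/workflows/$INVOCATION_ID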

If a step in a workflow fails, Dataverse makes an effort to roll back all the steps that preceded it. Some actions, such as writing to the log, cannot be rolled back. If such an action has a public external effect (e.g., sending an email to a mailing list), it is advisable to put it in the post-publish workflow.

.. tip::
  For invoking external systems using a REST API, Dataverse's internal step
  provider offers a step for sending and receiving customizable HTTP requests.
  It's called *http/sr*, and is detailed below.

Administration
~~~~~~~~~~~~~~

A Dataverse instance stores a set of workflows in its database. Workflows can be managed using the ``api/admin/workflows/`` endpoints of the :doc:`/api/native-api`. Sample workflow files are available in ``scripts/api/data/workflows``.
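
For example, a sketch of listing and adding workflows (assuming the GET/POST semantics documented for these endpoints in the Native API guide; ``my-workflow.json`` is a hypothetical file in the same format as the samples):

.. code:: bash

  # List the workflows stored in this instance
  curl http://localhost:8080/api/admin/workflows
  # Add a workflow from a JSON definition file
  curl -X POST -H "Content-type: application/json" -d @my-workflow.json http://localhost:8080/api/admin/workflows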

At the moment, defining a workflow for each trigger is done for the entire instance, using the endpoint ``api/admin/workflows/default/«trigger type»``.
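
For example, a sketch of setting the default workflow for a trigger (assuming the workflow's database id, here ``1``, is sent as the request body):

.. code:: bash

  # Make workflow 1 the default for the PrePublishDataset trigger
  curl -X PUT -d 1 http://localhost:8080/api/admin/workflows/default/PrePublishDataset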

In order to prevent unauthorized resuming of workflows, Dataverse maintains a "white list" of IP addresses from which resume requests are honored. This list is maintained using the ``/api/admin/workflows/ip-whitelist`` endpoint of the :doc:`/api/native-api`. By default, Dataverse honors resume requests from localhost only (``127.0.0.1;::1``), so setups that use a single server work with no additional configuration.
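
For example, a sketch of extending the whitelist (assuming a PUT of the full semicolon-separated list; the added address is a placeholder for the external system that will send resume requests):

.. code:: bash

  # Honor resume requests from localhost plus one external workflow host
  curl -X PUT -d "127.0.0.1;::1;10.0.1.2" http://localhost:8080/api/admin/workflows/ip-whitelist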


Available Steps
~~~~~~~~~~~~~~~

Dataverse has an internal step provider, whose id is ``:internal``. It offers the following steps:

log
+++

A step that writes data about the current workflow invocation to the instance log. It also writes the messages in its ``parameters`` map.

.. code:: json

  {
    "provider": ":internal",
    "stepType": "log",
    "parameters": {
      "aMessage": "message content",
      "anotherMessage": "message content, too"
    }
  }

pause
+++++

A step that pauses the workflow. The workflow is paused until a POST request is sent to ``/api/workflows/{invocation-id}``.

.. code:: json

  {
    "provider": ":internal",
    "stepType": "pause"
  }

http/sr
+++++++

A step that sends an HTTP request to an external system and then waits for a response. The response has to match a regular expression specified in the step parameters. The URL, content type, and message body can use data from the workflow context, using a simple markup language. This step has specific parameters for rollback.

.. code:: json

  {
    "provider": ":internal",
    "stepType": "http/sr",
    "parameters": {
      "url": "http://localhost:5050/dump/${invocationId}",
      "method": "POST",
      "contentType": "text/plain",
      "body": "START RELEASE ${dataset.id} as ${dataset.displayName}",
      "expectedResponse": "OK.*",
      "rollbackUrl": "http://localhost:5050/dump/${invocationId}",
      "rollbackMethod": "DELETE ${dataset.id}"
    }
  }

Available variables are:

* ``invocationId``
* ``dataset.id``
* ``dataset.identifier``
* ``dataset.globalId``
* ``dataset.displayName``
* ``dataset.citation``
* ``minorVersion``
* ``majorVersion``
* ``releaseStatus``

archiver
++++++++

A step that sends an archival copy of a Dataset Version to a configured archiver, e.g., the DuraCloud interface of Chronopolis. See the `DuraCloud/Chronopolis Integration documentation <http://guides.dataverse.org/en/latest/admin/integrations.html#id15>`_ for further detail.

Note: the example step includes two settings required for any archiver, and three (the ``DuraCloud*`` settings) that are specific to DuraCloud.

.. code:: json

  {
    "provider": ":internal",
    "stepType": "archiver",
    "parameters": {
      "stepName": "archive submission"
    },
    "requiredSettings": {
      ":ArchiverClassName": "string",
      ":ArchiverSettings": "string",
      ":DuraCloudHost": "string",
      ":DuraCloudPort": "string",
      ":DuraCloudContext": "string"
    }
  }
