Darwin Core Archive Publishing
Specify 7 is capable of exporting collections data in the Darwin Core Archive format. This capability expands on the existing Specify 6 functionality by supporting extensions to the core format.
Publishing of DwCAs in Specify 7 is managed via several App Resource records in XML format. There are three relevant types of resource: the Basic DwCA definition resource, the Dataset metadata resource, and the Export feed resource.
- Darwin Core Archive Publishing in Specify 7
- App resources
- Basic DwCA definition resource
- Dataset metadata resource
- Export feed resource
- Updating the data publishing feed
App resources
App resources can be accessed by admin users through the Resources option in the User Tools dialog that is opened by clicking on the user name, or by navigating to http://SITE/specify/appresources/ on a Specify 7 site.
The App Resources are organized hierarchically starting with Global and Discipline level resources and descending through Collection, User Types, and User levels. The various levels may be expanded or collapsed by clicking on the headings. At each level, individual resources can be opened by clicking on the name of the resource, and new resources can be added by clicking on New Resource and entering a resource name.
Basic DwCA definition resource
The fundamental resource for producing a DwCA consists of an XML definition that is modeled on the DwCA Metafile (meta.xml) as described in the Darwin Core Text Guide, hereafter DCTG.
New resources for DwCA definitions should be created under a particular collection level within the App Resource hierarchy corresponding to the collection whose data they will be used to export.
The basic structure is a single <archive> element containing exactly one <core> stanza and optional <extension> stanzas, each of which must indicate the rowType that specifies the class of data represented by that stanza. This corresponds exactly to the rowType that will appear in the generated meta.xml as described in the DCTG.
Each <core> and <extension> element may contain multiple free-standing <field> elements with term and value attributes. The term is the URI for the term represented by the field, and value is a value for that term which is the same for all rows in the data set represented by the stanza. Aside from these constant values, the data for each row in the data set will be produced from one or more queries defined in the <queries> stanza within each <core> and <extension> stanza.
<?xml version="1.0" encoding="utf-8"?>
<archive>
<!-- The core of this DwCA will consist of Occurrence records. -->
<core rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
<!-- Free standing constant fields -->
<field term="http://rs.tdwg.org/dwc/terms/institutionCode"
value="KU"/>
<field term="http://rs.tdwg.org/dwc/terms/institutionID"
value="http://grbio.org/cool/iakn-125z"/>
<field term="http://purl.org/dc/terms/rights"
value="http://creativecommons.org/licenses/by/4.0/deed.en_US"/>
<field term="http://purl.org/dc/terms/accessRights"
value="http://biodiversity.ku.edu/research/university-kansas-biodiversity-institute-data-publication-and-use-norms"/>
<field term="http://rs.tdwg.org/dwc/terms/collectionCode"
value="KUI"/>
<field term="http://rs.tdwg.org/dwc/terms/basisOfRecord"
value="PreservedSpecimen"/>
<field term="http://rs.tdwg.org/dwc/terms/datasetName"
value="University of Kansas Biodiversity Institute Fish Voucher Collection"/>
<!-- The remaining fields are produced by the following query: -->
<queries>
<query contextTableId="1" name="occurrence.txt">
<id term="http://rs.tdwg.org/dwc/terms/occurrenceID" isNot="false" isRelFld="false" oper="11" stringId="1.collectionobject.guid" value=""/>
<field term="http://purl.org/dc/terms/accessRights" isNot="false" isRelFld="false" oper="11" stringId="1,23,26,96,94.institution.termsOfUse" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/basisOfRecord" isNot="false" isRelFld="false" oper="11" stringId="1,23.collection.collectionType" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/catalogNumber" isNot="false" isRelFld="false" oper="1" stringId="1.collectionobject.catalogNumber" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/class" isNot="false" isRelFld="false" oper="11" stringId="1,9-determinations,4-preferredTaxon.taxon.Class" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/collectionCode" isNot="false" isRelFld="false" oper="11" stringId="1,23.collection.code" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/continent" isNot="false" isRelFld="false" oper="11" stringId="1,10,2,3.geography.Continent" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/country" isNot="false" isRelFld="false" oper="11" stringId="1,10,2,3.geography.Country" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/county" isNot="false" isRelFld="false" oper="11" stringId="1,10,2,3.geography.County" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/datasetName" isNot="false" isRelFld="false" oper="11" stringId="1,23.collection.description" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/decimalLatitude" isNot="false" isRelFld="false" oper="1" stringId="1,10,2.locality.latitude1" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/decimalLongitude" isNot="false" isRelFld="false" oper="1" stringId="1,10,2.locality.longitude1" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/eventDate" isNot="false" isRelFld="false" oper="1" stringId="1,10.collectingevent.startDate" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/family" isNot="false" isRelFld="false" oper="11" stringId="1,9-determinations,4-preferredTaxon.taxon.Family" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/fieldNumber" isNot="false" isRelFld="false" oper="11" stringId="1.collectionobject.fieldNumber" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/genus" isNot="false" isRelFld="false" oper="11" stringId="1,9-determinations,4-preferredTaxon.taxon.Genus" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/geodeticDatum" isNot="false" isRelFld="false" oper="11" stringId="1,10,2.locality.datum" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/georeferencedDate" isNot="false" isRelFld="false" oper="1" stringId="1,10,2,123-geoCoordDetails.geocoorddetail.geoRefDetDate" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/higherGeography" isNot="false" isRelFld="false" oper="11" stringId="1,10,2,3.geography.fullName" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/infraspecificEpithet" isNot="false" isRelFld="false" oper="11" stringId="1,9-determinations,4-preferredTaxon.taxon.Subspecies" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/institutionCode" isNot="false" isRelFld="false" oper="11" stringId="1,23,26,96,94.institution.code" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/institutionID" isNot="false" isRelFld="false" oper="11" stringId="1,23,26,96,94.institution.altName" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/kingdom" isNot="false" isRelFld="false" oper="11" stringId="1,9-determinations,4-preferredTaxon.taxon.Kingdom" value=""/>
<field term="http://purl.org/dc/terms/license" isNot="false" isRelFld="false" oper="11" stringId="1,23,26,96,94.institution.copyright" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/locality" isNot="false" isRelFld="false" oper="11" stringId="1,10,2.locality.localityName" value=""/>
<field term="http://purl.org/dc/terms/modified" isNot="false" isRelFld="false" oper="1" stringId="1.collectionobject.timestampModified" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/order" isNot="false" isRelFld="false" oper="11" stringId="1,9-determinations,4-preferredTaxon.taxon.Order" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/phylum" isNot="false" isRelFld="false" oper="11" stringId="1,9-determinations,4-preferredTaxon.taxon.Phylum" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/preparations" isNot="false" isRelFld="true" oper="11" stringId="1,63-preparations.preparation.preparations" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/recordedBy" isNot="false" isRelFld="true" oper="11" stringId="1,10,30-collectors.collector.collectors" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/scientificName" isNot="false" isRelFld="false" oper="11" stringId="1,9-determinations,4-preferredTaxon.taxon.fullName" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/specificEpithet" isNot="false" isRelFld="false" oper="11" stringId="1,9-determinations,4-preferredTaxon.taxon.Species" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/stateProvince" isNot="false" isRelFld="false" oper="11" stringId="1,10,2,3.geography.State" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/typeStatus" isNot="false" isRelFld="false" oper="1" stringId="1,9-determinations.determination.typeStatusName" value=""/>
<field isNot="false" isRelFld="false" oper="13" stringId="1,9-determinations.determination.isCurrent" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/sex" isNot="false" isRelFld="false" oper="1" stringId="1.collectionobject.text2" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/lifeStage" isNot="false" isRelFld="false" oper="1" stringId="1.collectionobject.text3" value=""/>
</query>
</queries>
</core>
<!-- The DwCA will include a Multimedia extension. -->
<extension rowType="http://rs.tdwg.org/ac/terms/Multimedia">
<!-- Constant fields. -->
<field term="http://purl.org/dc/terms/rights" value="http://creativecommons.org/licenses/by/4.0/deed.en_US"/>
<field term="http://purl.org/dc/terms/accessRights" value="http://biodiversity.ku.edu/research/university-kansas-biodiversity-institute-data-publication-and-use-norms"/>
<!-- In this case there are two queries to produce the rows. -->
<queries>
<!-- Rows from the collection object attachments records. -->
<query contextTableId="111" name="collectionobjectattachment.csv">
<id isNot="false" isRelFld="false" oper="1" stringId="111,1.collectionobject.guid" value=""/>
<field term="http://rs.tdwg.org/ac/terms/accessURI" isNot="false" isRelFld="true" oper="1" stringId="111,41.attachment.attachment" value="" formatName="AttachmentTest"/>
<field term="http://purl.org/dc/terms/identifier" isNot="false" isRelFld="false" oper="1" stringId="111,41.attachment.guid" value=""/>
<field term="http://purl.org/dc/terms/format" isNot="false" isRelFld="false" oper="1" stringId="111,41.attachment.mimeType" value=""/>
</query>
<!-- Rows from the research work attachment records. -->
<query contextTableId="1" name="researchworkattachment.csv">
<id isNot="false" isRelFld="false" oper="1" stringId="1.collectionobject.guid" value=""/>
<field term="http://rs.tdwg.org/ac/terms/accessURI" isNot="false" isRelFld="true" oper="1" stringId="1,29-collectionobjectcitations,69,143-referenceworkattachments,41.attachment.attachment" value="" formatName="AttachmentTest"/>
<field term="http://purl.org/dc/terms/identifier" isNot="true" isRelFld="false" oper="12" stringId="1,29-collectionobjectcitations,69,143-referenceworkattachments,41.attachment.guid" value=""/>
<field term="http://purl.org/dc/terms/format" isNot="false" isRelFld="false" oper="1" stringId="1,29-collectionobjectcitations,69,143-referenceworkattachments,41.attachment.mimeType" value=""/>
</query>
</queries>
</extension>
</archive>
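The structural rules described for a definition (an <archive> root, exactly one <core>, a rowType on every stanza, term and value on constant fields, unique query names, and one id field per query) can be checked before the resource is saved. The following is a minimal sketch using only the Python standard library. The function name check_definition is hypothetical, and the checks reflect this document's description rather than Specify's own validation.

```python
import xml.etree.ElementTree as ET


def check_definition(xml_text):
    """Return a list of problems found in a DwCA definition (empty if none)."""
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != "archive":
        return ["root element is <%s>, expected <archive>" % root.tag]

    cores = root.findall("core")
    if len(cores) != 1:
        problems.append("expected exactly one <core>, found %d" % len(cores))

    query_names = set()  # query names must be unique across the whole definition
    for stanza in cores + root.findall("extension"):
        if not stanza.get("rowType"):
            problems.append("<%s> is missing rowType" % stanza.tag)
        # Direct <field> children of a stanza are the constant fields.
        for field in stanza.findall("field"):
            if field.get("term") is None or field.get("value") is None:
                problems.append(
                    "constant <field> in <%s> needs term and value" % stanza.tag
                )
        for query in stanza.findall("queries/query"):
            name = query.get("name")
            if name in query_names:
                problems.append("query name %r is used more than once" % name)
            query_names.add(name)
            ids = query.findall("id")
            if len(ids) != 1:
                problems.append(
                    "query %r: expected exactly one <id>, found %d" % (name, len(ids))
                )
    return problems
```

An empty list means none of these checks failed. It does not guarantee that Specify will accept the definition or that the queries return the intended rows.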
The query stanzas within the DwCA definition are meant to be generated with the assistance of the Specify 7 query builder. After a Specify query has been defined, tested, and saved using the query builder, it can be used as the basis for a query stanza. The edit dialog that appears when the pencil icon next to the query name is clicked contains an export button. The export button opens a dialog containing XML that can be copied and pasted into an editor for inclusion in a DwCA definition.
After the XML export of the query has been copied, the Specify query can be deleted, if so desired.
<?xml version="1.0" ?>
<query contextTableId="1" name="DwCKUI">
<field isNot="false" isRelFld="false" oper="11" stringId="1,23,26,96,94.institution.code" value=""/>
<field isNot="false" isRelFld="false" oper="11" stringId="1,23,26,96,94.institution.altName" value=""/>
<field isNot="false" isRelFld="false" oper="11" stringId="1,23,26,96,94.institution.copyright" value=""/>
<!-- . -->
<!-- . -->
<!-- . -->
<!-- . -->
</query>
The exported XML contains enough information to generate the rows for the DwCA file, but it needs to be edited to include the term attribute for each field, which should be a URI. Fields that do not specify a term will not be included in the output but may be required for filtering within the query.
It is also necessary for one of the fields to be designated the core id field by changing the element tag from field to id. This field will become the <id> field if included in the <core> stanza or the <coreId> field if included in an <extension> stanza.
See the DCTG for more information about the <id> and <coreId> fields and term attributes.
Finally, the name attribute should be adjusted to an appropriate filename for the corresponding CSV file that will be included in the DwCA file. That is, it should be unique within a given DwCA definition and not contain special characters.
After these modifications, a query element and its contents can be copied into the <queries> element of a DwCA definition resource.
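The three manual edits (adding term attributes, changing one field into the id element, and renaming the query) can also be scripted. The sketch below uses the Python standard library. The function prepare_query and its parameters are hypothetical helpers for illustration and are not part of Specify.

```python
import xml.etree.ElementTree as ET


def prepare_query(exported_xml, filename, id_string_id, terms):
    """Turn XML exported from the query builder into a DwCA query stanza.

    filename     -- new value for the name attribute (the CSV file name)
    id_string_id -- stringId of the field to designate as the id field
    terms        -- dict mapping a field's stringId to its term URI
    """
    query = ET.fromstring(exported_xml)
    query.set("name", filename)

    found_id = False
    for field in query.findall("field"):
        string_id = field.get("stringId")
        # Fields without an entry in terms stay term-less, so they only filter.
        if string_id in terms:
            field.set("term", terms[string_id])
        if string_id == id_string_id:
            field.tag = "id"
            found_id = True

    if not found_id:
        raise ValueError("no field with stringId %r" % id_string_id)
    return ET.tostring(query, encoding="unicode")
```

For example, calling prepare_query(exported, "occurrence.csv", "1.collectionobject.guid", {"1.collectionobject.guid": "http://rs.tdwg.org/dwc/terms/occurrenceID"}) returns a stanza whose guid field has become the id element. Comments in the exported XML are dropped by the parser, and the result should still be reviewed by hand before it is pasted into a definition.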
When a query field requires formatting or aggregating rows from a table, Specify uses definitions from the DataObjFormatters app resource at the corresponding discipline level. By default, the data export will use the same formatters and aggregators as the form system as configured through the schema config tool. For increased flexibility, this mechanism can be overridden by adding a formatName attribute to <field> elements in the query stanza. The value of the attribute should be the name of one of the formatters or aggregators defined in DataObjFormatters for the relevant table.
For instance, the following would apply the formatter named CreateAttachmentURL to the rows from the attachment table:
<query contextTableId="111" name="collectionobjectattachment.csv">
<!-- . -->
<field term="http://rs.tdwg.org/ac/terms/accessURI" isNot="false"
isRelFld="true" oper="1"
stringId="111,41.attachment.attachment" value=""
formatName="CreateAttachmentURL"/>
<!-- . -->
</query>
After a DwCA definition resource has been created, it can be tested by instructing Specify to produce a DwCA file based on it.
The user should be logged into the collection corresponding to the DwCA definition resource. Clicking on the user name followed by Make DwCA in User Tools opens a dialog that accepts a DwCA definition and a Metadata resource. Enter the name of the DwCA definition resource in the DwCA definition field, and click start. The Metadata resource can be left blank. Specify will begin running the queries and generating the DwCA file. A notification with a link to download the archive will be produced when the process completes.
Dataset metadata resource
Specify 7 can include a metadata file in generated DwCA files describing the whole dataset in Ecological Metadata Language. Like the DwCA definition resource, the metadata resource should be created at the collection level of the resource hierarchy.
When generating a DwCA as described above, the name of a metadata resource can be provided. The contents of the resource will be included in the generated DwCA unchanged except for pubDate, which will be updated to reflect the date the archive was generated. Inside the archive the resource will be given the filename eml.xml.
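After downloading a generated archive, its contents and the updated pubDate can be inspected with a few lines of Python. This is a minimal sketch that assumes the archive is a zip file with eml.xml at its top level. The function name summarize_archive is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET


def summarize_archive(path_or_file):
    """Return (file names, pubDate text or None) for a DwCA zip file."""
    with zipfile.ZipFile(path_or_file) as archive:
        names = archive.namelist()
        pub_date = None
        if "eml.xml" in names:
            root = ET.fromstring(archive.read("eml.xml"))
            # Compare local names so the match works with or without a namespace.
            for element in root.iter():
                if element.tag.split("}")[-1] == "pubDate":
                    pub_date = (element.text or "").strip()
                    break
    return names, pub_date
```

If no metadata resource was supplied when the archive was generated, the second value is None.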
Export feed resource
One final resource, the export feed resource, allows DwCA generation to be automated and advertised through an RSS feed. While a single Specify 7 instance (or database) may utilize multiple DwCA definition and metadata resources, it may only use one export feed resource that will define all of the published exports. The export feed resource must be created at the Global Resources level of the hierarchy and must be named ExportFeed.
The content of the export feed resource should be XML with a root element <channel> comprising one or more <item> elements. The channel element corresponds to the eponymous RSS element and may also include <title>, <description>, and <language> elements to be passed through to the generated RSS unchanged. Each <item> element corresponds to an individual DwCA that may be published by the system.
Like the <channel> element, the <item> element accepts several child elements that will be included in the RSS unchanged. These are <title>, <description>, <id>, and <guid>.
Additionally, the <item> element requires the following attributes that control the generation of the DwCA the item is to represent:
- collectionId - This is the database id of the collection whose data the DwCA will contain. The id of a collection can be found by logging into the collection in a browser, then visiting the URL http://SITE/context/collection/ and inspecting the current value in the resulting JSON.
- userId - This is the database id of the Specify user under whose account the DwCA query should be executed. This can, in principle, affect things like which formatters are used and should generally be the same user who created the query. User ids can be found in the URL shown in the browser when editing a user record.
- notifyUserId - This attribute is optional and indicates that notifications generated during export should go to the indicated user rather than the user specified by userId.
- definition - The name of the DwCA definition resource to be used to generate the DwCA. This resource must be associated with the collection indicated by collectionId.
- metadata - The name of the dataset metadata resource to be included in the DwCA. This resource must be associated with the collection indicated by collectionId.
- days - When the RSS feed is updated via the command line tool, the DwCA represented by this item will only be updated if it is more than the stated number of days old.
- filename - The filename to be given to the generated DwCA. This filename will be used in URLs, so it is better if it doesn't contain any awkward characters.
- publish - Unless this attribute is included and set to true, the corresponding DwCA will not be included in the RSS feed. Nevertheless, it will be generated or updated when the feed is updated, and the file will continue to be available for download to those knowing the URL.
<channel>
<title>KUBI ichthyology RSS Feed</title>
<description>RSS feed for KUBI Ichthyology Voucher and Tissue collections</description>
<item collectionId="4" userId="2" notifyUserId="2" definition="DwCA_voucher" metadata="DwCA_voucher_metadata" days="7" filename="kui-dwca.zip" publish="true">
<title>KU Fish</title>
<id>8f79c802-a58c-447f-99aa-1d6a0790825a</id>
</item>
<item collectionId="32768" userId="2" notifyUserId="2" definition="DwCA_tissue" metadata="DwCA_tissue_metadata" days="7" filename="kuit-dwca.zip" publish="true">
<title>KU Fish Tissue</title>
<id>56caf05f-1364-4f24-85f6-0c82520c2792</id>
</item>
</channel>
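An ExportFeed resource can be checked for missing or malformed attributes before it is saved. The sketch below is written under stated assumptions: every listed attribute except notifyUserId and publish is treated as required, the id and days values are expected to be whole numbers, and filenames are expected to be distinct. None of these rules is confirmed against Specify's own behaviour, and check_export_feed is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Assumption: notifyUserId and publish are the only optional attributes.
REQUIRED = ("collectionId", "userId", "definition", "metadata", "days", "filename")
# Assumption: these attributes hold non-negative whole numbers.
NUMERIC = ("collectionId", "userId", "notifyUserId", "days")


def check_export_feed(xml_text):
    """Return a list of problems found in an ExportFeed resource."""
    channel = ET.fromstring(xml_text)
    if channel.tag != "channel":
        return ["root element is <%s>, expected <channel>" % channel.tag]

    problems = []
    items = channel.findall("item")
    if not items:
        problems.append("no <item> elements found")

    filenames = set()
    for index, item in enumerate(items, start=1):
        for attr in REQUIRED:
            if not item.get(attr):
                problems.append("item %d: missing %s" % (index, attr))
        for attr in NUMERIC:
            value = item.get(attr)
            if value and not value.isdigit():
                problems.append("item %d: %s=%r is not a number" % (index, attr, value))
        filename = item.get("filename")
        if filename in filenames:
            problems.append("item %d: filename %r is reused" % (index, filename))
        if filename:
            filenames.add(filename)
    return problems
```

Run against the example feed above, the function returns an empty list.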
When a valid ExportFeed resource is present at the Global Resources app resource level, the http://SITE/export/rss/ URL will become active and return the generated RSS feed.
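A consumer of the feed can list the published archives by parsing the RSS document. The following sketch assumes the generated feed follows the usual RSS 2.0 layout, in which each <item> may carry a <link> child. The exact elements Specify emits are not described here, so treat the link lookup as an assumption. The function name list_feed_items is hypothetical.

```python
import xml.etree.ElementTree as ET


def list_feed_items(rss_text):
    """Return (title, link) pairs for each <item> in an RSS document.

    Missing <title> or <link> elements are reported as empty strings.
    """
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]
```

The feed text itself could be fetched with urllib.request.urlopen("http://SITE/export/rss/").read(), with SITE replaced by the address of the Specify 7 server.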
Updating the data publishing feed
When an RSS publishing feed has been defined, its contents can be updated in two ways.
- An admin user may select Update Feed Now from the User Tools dialog. This will immediately update all items in the feed irrespective of any days attribute. Notifications will also go to the activating user rather than the user specified by the notifyUserId attribute.
- The following command may be invoked on the Specify 7 server from the Specify 7 installation directory:
  python manage.py update_feed [--force]
  The --force option will cause all items in the feed to be updated regardless of any days attributes. Otherwise, only DwCA files older than the given number of days will be updated by this command.
  This command is intended to facilitate a workflow utilizing cron or another task scheduler to maintain an up-to-date data publishing feed.
Note: If the installation is utilizing a python virtualenv, it must be activated prior to issuing the command.
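For scheduled runs, the command can be wrapped in a small script that a task scheduler calls. This is a sketch under stated assumptions: the server is Unix-like, and the virtualenv's interpreter is at bin/python inside the virtualenv directory. Calling that interpreter directly usually has the same effect as activating the virtualenv first. The function names and any paths passed to them are hypothetical.

```python
import os
import subprocess


def build_update_command(venv_dir=None, force=False):
    """Build the argument list for `python manage.py update_feed`."""
    # Use the virtualenv's own interpreter when one is given (Unix layout assumed).
    python = os.path.join(venv_dir, "bin", "python") if venv_dir else "python"
    command = [python, "manage.py", "update_feed"]
    if force:
        command.append("--force")
    return command


def update_feed(install_dir, venv_dir=None, force=False):
    """Run update_feed from the Specify 7 installation directory."""
    # check=True raises CalledProcessError if the command exits with an error.
    return subprocess.run(
        build_update_command(venv_dir, force), cwd=install_dir, check=True
    )
```

A scheduler entry would then call update_feed with the installation directory of the local Specify 7 server, and, where applicable, the directory of its virtualenv.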