Marking up your dataset with DCAT
The Data Catalog Vocabulary (DCAT) defines a standard way to publish machine-readable metadata about a dataset.
The simplest way to publish a description of your dataset is to publish DCAT metadata using RDFa. RDFa allows machine-readable metadata to be embedded in a webpage. This means that publishing your dataset metadata can be easily achieved by updating the HTML for your dataset homepage.
This guide provides a short introduction to publishing DCAT metadata using RDFa. For more advanced use cases, including publishing data in other formats, take a look at the official W3C documentation for DCAT. The RDFa primer may also be useful background reading.
The Open Data Certificates application supports reading DCAT published as RDFa. As well as providing machine-readable metadata for data consumers, using DCAT simplifies the process of certifying your dataset, because the application can automatically populate some of the answers for you.
The first thing to do is to let applications know that your web page is describing a dataset. To do this we need to declare the metadata schemas we will be using to describe the dataset and then indicate the type of thing being described.
Here is a fragment of HTML that provides a starting point. Replace {url} with the URL of your dataset page.
<html prefix="dct: http://purl.org/dc/terms/
rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
dcat: http://www.w3.org/ns/dcat#
foaf: http://xmlns.com/foaf/0.1/">
<body>
<div typeof="dcat:Dataset" resource="{url}">
...
</div>
</body>
</html>
The html element has a prefix attribute that declares the prefixes for the metadata schemas. The div element uses the resource and typeof attributes to declare which resource it is describing and the type of that resource.
The rest of the metadata about the dataset is then added to HTML elements nested inside this container <div>. You don't have to use a div element; any HTML element will do, so adapt this to the structure of your dataset's page.
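As a minimal sketch of that flexibility, the same container could be written with an article element. It assumes the prefix declarations from the html element shown above are in place, and {url} is the same placeholder for your dataset page's URL.

```html
<!-- Same metadata container as the div version: typeof gives the type,
     resource names the dataset being described. -->
<article typeof="dcat:Dataset" resource="{url}">
  ...
</article>
```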
The following sections each illustrate how to add extra metadata elements that flesh out the description of your dataset. Try to provide as full a description of your dataset as possible.
Again, the HTML elements used in the examples are just suggestions; they can be anything you want, so adapt the examples to your existing page. You may need to add some extra elements, e.g. to wrap dates, titles, etc. that are already in the page as plain text.
The important parts of the examples are the RDFa attributes: about, property, content, datatype, etc. These attributes define which property of the dataset is being described and provide the machine-readable metadata.
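To illustrate wrapping text that is already on the page, here is a hedged before-and-after sketch. The sentence itself is made up for illustration, and the markup is assumed to sit inside the dataset container described above so that the property applies to your dataset.

```html
<!-- Before: the date is plain text inside a sentence. -->
<p>Last updated on 25th October 2010.</p>

<!-- After: a span wraps just the date so it can carry the RDFa attributes.
     The visible text is unchanged; the content attribute holds the
     machine-readable value. -->
<p>Last updated on <span property="dct:modified" content="2010-10-25T09:00:00+00:00" datatype="xsd:dateTime">25th October 2010</span>.</p>
```

Only the wrapped text becomes the value of the property; the rest of the sentence is left alone.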
Specify the title for your dataset using the dct:title property:
<h1 property="dct:title">Example Dataset</h1>
Specify the date your dataset was created using the dct:created property:
<p property="dct:created" content='2010-10-25T09:00:00+00:00' datatype='xsd:dateTime'>25th October 2010</p>
In this case the human-readable text for the property is contained within the paragraph tag. Its value can be anything, but the machine-readable date (specified in the content attribute) must use a defined data type so it can be easily parsed.
It's recommended that you use the XML Schema date or XML Schema dateTime format.
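If you only know the day and not the time, the XML Schema date format (YYYY-MM-DD) can be used instead. A small sketch of the same creation date expressed that way:

```html
<!-- xsd:date carries just the calendar date, with no time component. -->
<p property="dct:created" content="2010-10-25" datatype="xsd:date">25th October 2010</p>
```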
Specify the date your dataset was last updated using the dct:modified property:
<p property="dct:modified" content='2010-10-25T09:00:00+00:00' datatype='xsd:dateTime'>25th October 2010</p>
See above for notes on the date formats.
Mark up the description of your dataset using the dct:description property:
<p property="dct:description">This is the description.</p>
The markup for declaring your dataset license is slightly more complex. You need to declare the license property (dct:license) as well as the name and URL for the license.
Replace the {license URL} and {license name} placeholders with the values that apply to your dataset:
<div property="dct:license"
resource="{license URL}">
<a href="{license URL}">
<span property="dct:title">{license name}</span>
</a>
</div>
For a more detailed guide on publishing a comprehensive rights statement for your dataset, including license, copyright statements and preferred form of attribution, read the Publishers Guide to the Open Data Rights Statement Vocabulary.
Declare the publisher of your dataset using the dct:publisher property. Again, there are several elements to declare here, including the name and homepage URL for the publisher.
Replace the {publisher URL} and {publisher name} placeholders with the appropriate values.
<div property="dct:publisher"
resource="{publisher URL}">
<a href="{publisher URL}" about="{publisher URL}" property="foaf:homepage">
<span property="foaf:name">{publisher name}</span>
</a>
</div>
Keywords can be attached to a dataset using the dcat:keyword property. The property values are simple labels or tags. You can have as many or as few (or none!) of these as you want.
<span property="dcat:keyword">Examples</span>, <span property="dcat:keyword">DCAT</span>
The dct:accrualPeriodicity property is used to define how often a dataset is updated. DCAT takes this property from Dublin Core Terms, hence the dct: prefix. The values for the property are URIs taken from a simple controlled vocabulary.
<a href="{frequency}" property="dct:accrualPeriodicity">{frequency (human readable)}</a>
Replace the {frequency} placeholder with one of the following URIs:
- http://purl.org/linked-data/sdmx/2009/code#freq-A - Annual
- http://purl.org/linked-data/sdmx/2009/code#freq-B - Every working day (Mon - Fri)
- http://purl.org/linked-data/sdmx/2009/code#freq-D - Daily (7 days a week)
- http://purl.org/linked-data/sdmx/2009/code#freq-M - Monthly
- http://purl.org/linked-data/sdmx/2009/code#freq-N - Every minute
- http://purl.org/linked-data/sdmx/2009/code#freq-Q - Every quarter
- http://purl.org/linked-data/sdmx/2009/code#freq-S - Half yearly
- http://purl.org/linked-data/sdmx/2009/code#freq-W - Weekly
A dataset can have a number of distributions. Distributions describe how a dataset is packaged and released. Your dataset may have several distributions, e.g. if you publish a series of data over a period of time as separate packages, or if it is available in different formats.
The markup here is a little more complex. It defines a new resource (a dcat:Distribution) and associates that with your dataset. The nested markup then provides metadata about the new resource, e.g. its format, size, publication date, etc.
<div property='dcat:distribution' typeof='dcat:Distribution'>
<span property="dct:title">{Distribution title}</span>
<ul>
<li><strong>Format</strong> <span content='{format}' property='dcat:mediaType'>{format (human readable)}</span></li>
<li><strong>Size</strong> <span content='{size in bytes}' datatype='xsd:decimal' property='dcat:byteSize'>{size (human readable)}</span></li>
<li><strong>Issued</strong> <span property='dct:issued' content='{date issued}' datatype='xsd:date'>{date issued (human readable)}</span></li>
</ul>
<p><a href='{link to data}' property='dcat:accessURL'>Download the full dataset</a></p>
</div>
The {format} placeholder should be a recognised MIME type (for example text/csv or application/json).
The dct:issued property specifies the date that the distribution was published. The property should follow the same guidelines as for the dct:created property outlined above.
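Where a dataset is available in more than one format, each format can be described as its own distribution. The following is a sketch only: the titles and download URLs are hypothetical, and both blocks are assumed to be nested inside the dataset container so that each one is attached to your dataset.

```html
<!-- First distribution: the CSV packaging of the dataset. -->
<div property="dcat:distribution" typeof="dcat:Distribution">
  <span property="dct:title">CSV download</span>
  (<span property="dcat:mediaType" content="text/csv">CSV</span>)
  <a href="http://example.org/distribution.csv" property="dcat:accessURL">Download as CSV</a>
</div>

<!-- Second distribution: the same data packaged as JSON. -->
<div property="dcat:distribution" typeof="dcat:Distribution">
  <span property="dct:title">JSON download</span>
  (<span property="dcat:mediaType" content="application/json">JSON</span>)
  <a href="http://example.org/distribution.json" property="dcat:accessURL">Download as JSON</a>
</div>
```

Each block repeats the dcat:distribution property, so a parser sees two separate distribution resources linked to the same dataset.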
Here is a complete example of an HTML page marked up using DCAT. It provides all of the core metadata for the dataset, including a description of a single distribution:
<!DOCTYPE html>
<html prefix="dct: http://purl.org/dc/terms/
rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
dcat: http://www.w3.org/ns/dcat#
foaf: http://xmlns.com/foaf/0.1/">
<head>
<title>DCAT in RDFa</title>
</head>
<body>
<div typeof="dcat:Dataset" resource="http://gov.example.org/dataset/finances">
<h1 property="dct:title">Example DCAT Dataset</h1>
<p property="dct:created" content='2010-10-25T09:00:00+00:00' datatype='xsd:dateTime'>25th October 2010</p>
<p property="dct:modified" content='2013-05-10T13:39:36+00:00' datatype='xsd:dateTime'>10th May 2013</p>
<p property="dct:description">This is the description.</p>
<div property="dct:license"
resource="http://reference.data.gov.uk/id/open-government-licence">
<a href="http://reference.data.gov.uk/id/open-government-licence">
<span property="dct:title">UK Open Government Licence (OGL)</span>
</a>
</div>
<div property="dct:publisher"
resource="http://example.org/publisher">
<a href="http://example.org/publisher" about="http://example.org/publisher" property="foaf:homepage">
<span property="foaf:name">Example Publisher</span>
</a>
</div>
<div>
<span property="dcat:keyword">Examples</span>, <span property="dcat:keyword">DCAT</span>
</div>
<div>
<a href="http://purl.org/linked-data/sdmx/2009/code#freq-W" property="dct:accrualPeriodicity">Weekly</a>
</div>
<div property='dcat:distribution' typeof='dcat:Distribution'>
<span property="dct:title">CSV download</span>
<ul>
<li><strong>Format</strong> <span content='text/csv' property='dcat:mediaType'>CSV</span></li>
<li><strong>Size</strong> <span content='240585277' datatype='xsd:decimal' property='dcat:byteSize'>240.6MB</span></li>
<li><strong>Issued</strong> <span property='dct:issued' content='2012-01-01' datatype='xsd:date'>1st January 2012</span></li>
</ul>
<p><a class='btn btn-primary' href='http://example.org/distribution.csv.zip' property='dcat:accessURL'>Download the full dataset</a></p>
</div>
</body>
</html>
You can also see an example 'in the wild' at http://smtm.labs.theodi.org/download/