Getting Started

If you are new to publishing schema.org, here are some general tips for getting started.

Goals

To provide a place for the scientific data community to work out how best to implement schema.org and other external vocabularies on web pages by publishing guidance documents. Pull requests and GitHub Issues are welcome!

Approach

  1. To be pragmatic with our use of schema.org and external vocabulary adoption.
  2. To consider schema.org classes and properties first before considering external vocabularies.
  3. Use JSON-LD in our guidance documents for simplicity and terseness as compared to Microdata and RDFa. For more, see Why JSON-LD? from the Conventions document.
  4. Presently, the Google Rich Results Tool enforces use of schema.org classes and properties by displaying an error whenever external vocabularies are used. schema.org proposes linking to external vocabularies using the schema:additionalType property. While this property is defined as a subproperty of rdf:type, its data type is a literal. The Schema.org Validator, by contrast, accepts external vocabularies. We encourage the use of JSON-LD '@type' for typing classes to external vocabularies. For more, see Typing to External Vocabularies from the Conventions document.
  5. See Governance for how we will govern the project.
  6. See Conventions for guidance on creating/editing guidance documents.

Prerequisites

  1. We assume a general understanding of JSON.
  2. We assume a basic knowledge about JSON-LD.

JSON-LD is valid JSON, so standard developer tools that support JSON can be used. For some specific JSON-LD and schema.org help though, there are some other resources.

JSON-LD resources https://json-ld.org

Generating the JSON-LD is best done via libraries like those you can find at https://json-ld.org. There are libraries for JavaScript, Python, PHP, Ruby, Java, C#, and Go. While JSON-LD is just JSON and can be generated in many ways, these libraries produce output that conforms to the JSON-LD specification.

The playground is hosted at the very useful JSON-LD web site. You can explore examples of JSON-LD and view how they convert to RDF, flatten, etc. Note that JSON-LD is not associated with schema.org. It can be used for much more, so most examples at the JSON-LD website don't use schema.org, and the site will NOT check whether you are using schema.org types and properties correctly; it will only check that your JSON-LD is well-formed.

  3. We assume that you've heard about schema.org and have already decided that it's useful to you.
  4. We assume that you have a general understanding of what may describe a scientific dataset.

Let's go!

Introduction

There is an emerging practice to leverage structured metadata to aid in the discovery of web-based resources. Much of this work is taking place in the context (no pun intended) of schema.org. This approach has extended to the resource type Dataset. This page presents approaches, tools, and references that will aid in the understanding and development of schema.org in JSON-LD and its connection to external vocabularies. For a more thorough presentation on this, visit the Google AI Blog entry of January 24, 2017 at https://ai.googleblog.com/2017/01/facilitating-discovery-of-public.html.

Using schema.org

Modifying web pages to include schema.org as JSON-LD

JSON-LD should be incorporated into the landing page HTML, inside the <head></head> element, as a <script> element with a type of application/ld+json.

<html>
  <head>
    ...
    <script id="schemaorg" type="application/ld+json">
    {
      "@context": "https://schema.org/",
       "@id": "http://opencoredata.org/id/dataset/bcd15975-680c-47db-a062-ac0bb6e66816",
       "@type": "Dataset",
       "description": "Janus Thermal Conductivity for ocean drilling ..."
    }
    </script>
    ...
  </head>
  ...
</html>
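To check that a landing page exposes its markup as intended, it can help to read the JSON-LD back out the way a harvester might. The following Python sketch uses only the standard library to collect every application/ld+json script block from a page. The names JsonLdExtractor and extract_jsonld and the trimmed sample page are our own illustrations, not part of any schema.org tooling, and production harvesters usually rely on more robust HTML and JSON-LD parsers.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed body of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content arrives here as raw text.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def extract_jsonld(html_text):
    """Return a list of JSON-LD documents embedded in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.documents


# Hypothetical landing page, trimmed from the example above.
SAMPLE_PAGE = """
<html>
  <head>
    <script id="schemaorg" type="application/ld+json">
    {
      "@context": "https://schema.org/",
      "@id": "http://opencoredata.org/id/dataset/bcd15975-680c-47db-a062-ac0bb6e66816",
      "@type": "Dataset",
      "description": "Janus Thermal Conductivity for ocean drilling ..."
    }
    </script>
  </head>
</html>
"""

if __name__ == "__main__":
    for document in extract_jsonld(SAMPLE_PAGE):
        print(document["@type"], document["@id"])
```

Invalid JSON inside the script element raises json.JSONDecodeError here, which is a quick way to catch quoting mistakes before a harvester does.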

Specifying the context

The context in a JSON-LD document defines the namespaces used in the document and their mappings to URIs when they are referenced using prefix notation. The JSON-LD 1.1 specification provides many rules that impact how the context is loaded and how it is retrieved, but ultimately the goal is to define a context map with the namespace mappings for each vocabulary used in the document. For the schema.org vocabulary specifically, the official namespace is http://schema.org/ (note this is not an https URI), but the context file for schema.org can be retrieved from the https web location at https://schema.org by following the JSON-LD processing rules. For providers, this translates to a few simple recommendations.

  1. We recommend retrieving the context file from its https location using the following syntax:
{
  "@context": "https://schema.org/",
   "@type": "Dataset",
   "name": "Example dataset title",
   ...
}

Using this approach, the schema.org namespace will be set to http URIs. For example the Dataset type will be expanded to http://schema.org/Dataset.

  2. Should you need to define additional namespaces in your context, it can be done by expanding the context using a JSON array as follows:
{
  "@context": [
    "https://schema.org/",
    {
      "prov": "http://www.w3.org/ns/prov#"
    }
  ],
  "@type": "Dataset",
  "name": "Example dataset title",
  "prov:wasDerivedFrom": {
    "@id": "https://doi.org/10.xxxx/Dataset-1"
  }
}

Note the square brackets, in which the first entry is the URL of a context file to be retrieved, and the second value is a JSON object to be combined with the retrieved context. This approach still retrieves the context from the secure https URL at schema.org, but then adds an additional namespace for the prov vocabulary to the context. Now, terms from the PROV namespace can be referenced using prefix notation (e.g., prov:wasDerivedFrom).

  3. Additional approaches to defining the context are possible, but users should take care to ensure that the terms within schema.org use the http://schema.org/ namespace as defined in the official schema.org context file. Because contributors to schema.org are working towards accepting https://schema.org/ as an equivalent namespace URI for all terms, processors should treat schema.org terms in the http and https URI spaces as equivalent, but providers might find it safer to continue to use http://schema.org/ as the official namespace for now. This particularly applies when defining a default vocabulary for un-prefixed terms, in which case we recommend using "@vocab": "http://schema.org/" if this is necessary. That said, most users should not need to define @vocab in typical usage.
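To make the effect of these recommendations concrete, here is a deliberately simplified Python sketch of how terms end up as full URIs. It is not a JSON-LD processor: it assumes every un-prefixed term belongs to schema.org and that prefixes come from an inline map like the one in the array example above. The function name expand_term is hypothetical, and a real processor resolves terms through the definitions in the retrieved context file.

```python
# Official schema.org namespace (http, not https), as discussed above.
SCHEMA_NS = "http://schema.org/"


def expand_term(term, prefixes, vocab=SCHEMA_NS):
    """Expand a term or prefixed name to a full URI (simplified)."""
    if term.startswith("@"):
        return term  # JSON-LD keywords such as @type are left alone
    if "://" in term:
        return term  # already an absolute URI
    prefix, separator, local_name = term.partition(":")
    if separator and prefix in prefixes:
        return prefixes[prefix] + local_name
    return vocab + term


# Inline prefix map taken from the second entry of the context array above.
PREFIXES = {"prov": "http://www.w3.org/ns/prov#"}

if __name__ == "__main__":
    for term in ("Dataset", "name", "prov:wasDerivedFrom"):
        print(term, "->", expand_term(term, PREFIXES))
```

Note that Dataset expands into the http namespace even though the context file itself was requested over https, which is the behavior the first recommendation describes.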

Provide a Sitemap.xml file

Many harvesters and aggregators depend on the existence of a sitemap.xml file on your site that lists all of the dataset landing pages from your site that you want to be harvested and indexed for search. Google Dataset Search, DataONE, and Geocodes all can make use of a sitemap to more efficiently harvest your site. A sitemap is a simple text file that lists each page that you want harvested. This can contain any webpage, but in this context we specifically want to list pages that contain a schema:Dataset entry to be harvested. Here's an example sitemap.xml file listing two Dataset landing pages, along with their lastmod date:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://arcticdata.io/catalog/view/doi%3A10.18739%2FA2GH9BB0Q</loc>
    <lastmod>2021-12-06</lastmod>
  </url>
  <url>
    <loc>https://arcticdata.io/catalog/view/doi%3A10.18739%2FA2ST7DZ2Q</loc>
    <lastmod>2021-12-07T12:15:05Z</lastmod>
  </url>
</urlset>

Note the <lastmod> field, which indicates when the page was last modified. It is formatted as a W3C Datetime and may vary in precision. Most harvesters will use that date along with the HTTP Last-Modified header to determine whether a page has changed since the last harvest attempt. Keeping accurate lastmod values can greatly improve the efficiency of indexing your catalog, as only the few items that have changed will need to be indexed.
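For readers who want to see what a harvester gets out of such a file, the sketch below parses a sitemap with Python's standard xml.etree.ElementTree module and returns each location with its lastmod value. The function name read_sitemap is our own, and the sample document simply repeats the two entries shown above.

```python
import xml.etree.ElementTree as ET

# XML namespace used by the sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_sitemap(xml_text):
    """Return a list of (loc, lastmod) pairs; lastmod is None when absent."""
    if isinstance(xml_text, str):
        # Parse bytes so the encoding declaration in the prolog is honored.
        xml_text = xml_text.strip().encode("utf-8")
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        entries.append((loc.strip(), lastmod.strip() if lastmod else None))
    return entries


SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://arcticdata.io/catalog/view/doi%3A10.18739%2FA2GH9BB0Q</loc>
    <lastmod>2021-12-06</lastmod>
  </url>
  <url>
    <loc>https://arcticdata.io/catalog/view/doi%3A10.18739%2FA2ST7DZ2Q</loc>
    <lastmod>2021-12-07T12:15:05Z</lastmod>
  </url>
</urlset>
"""

if __name__ == "__main__":
    for loc, lastmod in read_sitemap(SAMPLE_SITEMAP):
        print(lastmod, loc)
```

The two entries come back with different lastmod precision (a date and a full timestamp), so consumers should be prepared to handle both.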

Location: The sitemap.xml file can be located anywhere on your site that is above the path in the hierarchy in which your pages are listed. Typically, the sitemap.xml is placed at the root of the site, but other locations can be used as well. A great way to indicate to harvesters where your sitemap is located would be to include it in your robots.txt file, which is basically an instruction manual for harvesters, at the root of your web site. For example, you might have a robots.txt file with the following contents:

User-agent: *
Sitemap: https://arcticdata.io/sitemap1.xml

Sitemaps are limited to 50,000 records and 50 MB, so if your site is larger than that you can break up your sitemap into multiple files, linked together using a sitemap index. Details about sitemap-index.xml and other aspects of sitemaps are provided on the https://sitemaps.org site, as well as in the Google sitemap documentation.

By providing a sitemap and advertising its location, you make it simple for harvesters to find and index your Dataset listings.
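As a small illustration of the discovery step, a harvester might pull the advertised sitemap locations out of robots.txt roughly as follows. This is a minimal sketch with a hypothetical function name (sitemap_urls); it only looks at Sitemap lines and ignores every other robots.txt rule.

```python
def sitemap_urls(robots_txt):
    """Return the URLs listed on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments; assumes sitemap URLs carry no '#' fragment.
        line = line.split("#", 1)[0].strip()
        field, separator, value = line.partition(":")
        if separator and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


# The robots.txt example from above.
SAMPLE_ROBOTS = """User-agent: *
Sitemap: https://arcticdata.io/sitemap1.xml
"""

if __name__ == "__main__":
    print(sitemap_urls(SAMPLE_ROBOTS))
```

A robots.txt file may list several Sitemap lines, which is one way to advertise the multiple files of a large catalog.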

Data Types

For each schema.org type, such as Person or Event, there are fields that let you specify more information about that type. Each of these fields has an expected data type that is defined in the documentation, as you can see in Figure 1.

Figure 1. schema.org field data types
The expected data type for each field appears in the middle column. The left column is the name of the field, the middle column is the data type, and the right column is the description of the field.

Every data type is either a resource or a literal. Resources refer to other schema.org types. For example, a Dataset type has a field called 'author' of which the data type can be either a 'Person' or an 'Organization'. Because 'Person' and 'Organization' are other schema.org "types" with their own fields, they are called resources. In JSON-LD, you specify resources by using curly brackets {}:

{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "author": {
    "@type": "Person",
    "name": "Jane Goodall"
  }
}

In the JSON-LD above, the 'author' is a resource of type 'Person'. Fields that simply have a value are called literal data types. For example, the 'Person' type above has a 'name' of "Jane Goodall", a literal text value.

Schema.org defines six literal, or primitive, data types: Text, Number, Boolean, Date, DateTime, and Time. Text has two special variations: URL, and HTML, which marks text that should be interpreted as HTML.

When using schema.org, literal data types are not specified using curly brackets {} as these are reserved for specifying 'objects' or 'resources' such as other schema.org types like Person, Organization, etc. First, let's see how to use a primitive data type by using fields of CreativeWork, the superclass for Dataset.

Text

Imagine we want to say the name of our Creative Work is "Passenger Manifest for H.M.S. Titanic". The name field of CreativeWork specifies that it expects Text as the data type. We would use it in this way:

{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic"
}

Number

Let's say we want to specify the version number of our manifest using the version field of CreativeWork, which expects a Number. To specify numbers in JSON-LD, we omit the quotation marks surrounding the value:

{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic",
  "version": 1
}

URL

Now, let's specify the URL of our manifest using the url field of CreativeWork, a field inherited from Thing. This field expects a valid URL represented as Text:

{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic",
  "version": 1,
  "url": "https://raw.githubusercontent.com/Geoyi/Cleaning-Titanic-Data/master/titanic_original.csv"
}

Boolean

Using the Boolean value, we can specify that our manifest is accessible for free using the field isAccessibleForFree by using the text true or false and omitting the quotes:

{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic",
  "version": 1,
  "url": "https://raw.githubusercontent.com/Geoyi/Cleaning-Titanic-Data/master/titanic_original.csv",
  "isAccessibleForFree": true
}

Date

To specify the datePublished, which allows either a Date or DateTime, as a Date, we can use any ISO 8601 date format by wrapping the date in double-quotes:

{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic",
  "version": 1,
  "url": "https://raw.githubusercontent.com/Geoyi/Cleaning-Titanic-Data/master/titanic_original.csv",
  "isAccessibleForFree": true,
  "datePublished": "2018-07-29"
}

DateTime

To specify the dateModified as a DateTime, we must follow the ISO 8601 format for combining date and time representations using the form [-]CCYY-MM-DDThh:mm:ss[Z|(+|-)hh:mm]:

{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic",
  "version": 1,
  "url": "https://raw.githubusercontent.com/Geoyi/Cleaning-Titanic-Data/master/titanic_original.csv",
  "isAccessibleForFree": true,
  "datePublished": "2018-07-29",
  "dateModified": "2018-07-30T14:30Z"
}
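When the JSON-LD is generated by code rather than written by hand, the literal forms above mostly fall out of the language's native types. The following Python sketch builds the same manifest record from Python values and serializes it with the standard json module. The variable names are ours, and the URL field is left out only to keep the example short.

```python
import json
from datetime import date, datetime, timezone

manifest = {
    "@context": "https://schema.org/",
    "@type": "CreativeWork",
    # Text: a Python str becomes a quoted JSON string.
    "name": "Passenger Manifest for H.M.S. Titanic",
    # Number: a Python int is written without quotes.
    "version": 1,
    # Boolean: Python True is written as the unquoted JSON literal true.
    "isAccessibleForFree": True,
    # Date: ISO 8601 calendar date, carried as a string.
    "datePublished": date(2018, 7, 29).isoformat(),
    # DateTime: ISO 8601 date and time with an explicit UTC offset.
    "dateModified": datetime(2018, 7, 30, 14, 30, tzinfo=timezone.utc).isoformat(),
}

serialized = json.dumps(manifest, indent=2)

if __name__ == "__main__":
    print(serialized)
```

Python writes the UTC offset as +00:00 rather than Z; both spellings fit the [Z|(+|-)hh:mm] part of the form given above.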

Time

Time is a rarely used data type because it represents a point in time recurring on multiple days. It follows the XML Schema definition, using the form hh:mm:ss[Z|(+|-)hh:mm] (see XML schema for details).

{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic",
  "version": 1,
  "url": "https://raw.githubusercontent.com/Geoyi/Cleaning-Titanic-Data/master/titanic_original.csv",
  "isAccessibleForFree": true,
  "datePublished": "2018-07-29",
  "dateModified": "2018-07-30T14:30Z"
}

HTML

The HTML data type is a special variation of the Text data type. In some cases where Text is the expected data type, our actual data type may be HTML (because we are dealing with web pages). In this case, the schema.org JSON-LD context defines HTML to mean rdf:HTML, the data type for specifying that a string of text should be interpreted as HTML. Let's say that we have a description of our manifest and want to use the description field, but we have HTML inside that text. Using the text field as we did above for the name field, we would specify the description as:

{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic",
  "version": 1,
  "url": "https://raw.githubusercontent.com/Geoyi/Cleaning-Titanic-Data/master/titanic_original.csv",
  "isAccessibleForFree": true,
  "datePublished": "2018-07-29",
  "dateModified": "2018-07-30T14:30Z",
  "description": "<h3>Acquisition</h3><p>The data was acquired from an office outside of <a href=\"https://en.wikipedia.org/wiki/New_York_City\">New York City</a>."
}

However, to specify that the description field should be interpreted as HTML, you specify description as a resource, setting the @type of that resource to "HTML" and placing the HTML string in a JSON-LD property @value:

{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic",
  "version": 1,
  "url": "https://raw.githubusercontent.com/Geoyi/Cleaning-Titanic-Data/master/titanic_original.csv",
  "isAccessibleForFree": true,
  "datePublished": "2018-07-29",
  "dateModified": "2018-07-30T14:30Z",
  "description": {
    "@type": "HTML",
    "@value": "<h3>Acquisition</h3><p>The data was acquired from an office outside of <a href=\"https://en.wikipedia.org/wiki/New_York_City\">New York City</a>."
  }
}
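Hand-escaping the quotation marks inside an HTML string is error-prone, so it can be safer to let a JSON library do it. The sketch below wraps a markup string in the typed-value form shown above and serializes it with Python's json module. The helper name html_literal is hypothetical, and the record is shortened to the fields needed for the illustration.

```python
import json


def html_literal(markup):
    """Wrap an HTML string as a JSON-LD typed value of type HTML."""
    return {"@type": "HTML", "@value": markup}


description_html = (
    "<h3>Acquisition</h3><p>The data was acquired from an office outside of "
    '<a href="https://en.wikipedia.org/wiki/New_York_City">New York City</a>.'
)

record = {
    "@context": "https://schema.org/",
    "@type": "CreativeWork",
    "name": "Passenger Manifest for H.M.S. Titanic",
    "description": html_literal(description_html),
}

# json.dumps escapes the embedded double quotes for us.
serialized = json.dumps(record, indent=2)

if __name__ == "__main__":
    print(serialized)
```

Parsing the serialized text back with json.loads returns the original markup unchanged, which is a simple round-trip check for the escaping.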

Resource Types

All schema.org resources should make use of the @type property which 'classifies' the resources as a specific type. For example, an un-typed resource would look like:

{
  "@context": "https://schema.org/",
  "name": "My Dataset"
}

Even though the above resource has a name of 'My Dataset', harvesters are unaware that your intent was to classify it as a Dataset. Un-typed resources are not valid schema.org resources, and so they require the @type property:

{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "My Dataset"
}

In some cases, it is useful to multi-type a resource. One example of this may be a data repository, which typically functions both as an 'Organization' that employs people and has an address, and as a 'Service' to its user community. To assign multiple types to a resource, we use JSON arrays:

{
  "@context": "https://schema.org/",
  "@type": ["Organization", "Service"],
  "name": "My Data Repository"
}
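A quick pre-publication check for missing types can be sketched in a few lines of Python. The function below walks a JSON-LD document and reports the path of every resource that has no @type. It is a rough lint under our own assumptions, not a schema.org validator: find_untyped is a hypothetical name, and the sketch treats bare {"@id": ...} links, @value literals, and @list wrappers as exempt.

```python
def find_untyped(node, path="$"):
    """Return the paths of resource objects that lack an @type."""
    problems = []
    if isinstance(node, list):
        for index, child in enumerate(node):
            problems.extend(find_untyped(child, f"{path}[{index}]"))
    elif isinstance(node, dict):
        keys = set(node)
        is_value = "@value" in keys      # typed literal, e.g. HTML text
        is_list = "@list" in keys        # ordered-list wrapper
        is_reference = keys <= {"@id"}   # bare link to another resource
        if "@type" not in keys and not (is_value or is_list or is_reference):
            problems.append(path)
        for key, child in node.items():
            if key != "@context":
                problems.extend(find_untyped(child, f"{path}.{key}"))
    return problems


UNTYPED = {"@context": "https://schema.org/", "name": "My Dataset"}
MULTI_TYPED = {
    "@context": "https://schema.org/",
    "@type": ["Organization", "Service"],
    "name": "My Data Repository",
}

if __name__ == "__main__":
    print(find_untyped(UNTYPED))
    print(find_untyped(MULTI_TYPED))
```

Nested resources are checked as well, so an untyped author inside a typed Dataset is reported with its own path.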

All schema.org types may be found here.

Time of resource modification

An indication of when a resource was modified is valuable to a consumer for a variety of reasons.

A consumer tracking changes in a collection of SO:Dataset or similar resources being advertised with a sitemap.xml or similar mechanism has at least three timestamps that can be examined to determine if an already retrieved resource may have been modified: the schema.org/dateModified property in the JSON-LD, the Last-Modified time reported by the web server, and the <lastmod> time that may be reported in a sitemap.xml document.

The schema.org/dateModified value should be considered authoritative for indicating when the resource was modified. The Last-Modified header should reflect the corresponding schema.org/dateModified entry. This header gives consumers an important hint as to whether, for example, a cached copy of a resource should be updated. Similarly, the <lastmod> entry should reflect the Last-Modified header and the schema.org/dateModified value.

A typical pattern for a consumer interested in synchronizing a cache of resources is:

  1. Examine the sitemap for new or updated entries using hints from <lastmod>
  2. Retrieve the resource directly or preview it with an HTTP HEAD request. The Last-Modified header provides a hint as to whether the resource should be retrieved.
  3. Examine the schema.org/dateModified property of the resource(s) extracted from the resource.

Providing accurate hints early in the process lets consumers skip retrieving and parsing resources that have not changed, which reduces the cost of sharing data resources for providers and consumers alike.
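The pattern above can be sketched as a single decision function that a consumer calls at each step with whatever hints it has gathered so far. This is a minimal illustration under our own assumptions: the names parse_iso and changed_since are hypothetical, timestamps without a time zone are treated as UTC, and no network requests are made.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_iso(value):
    """Parse an ISO 8601 / W3C Datetime string into an aware datetime."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: date-only or zone-less values are taken as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def changed_since(cached_at, lastmod=None, last_modified=None, date_modified=None):
    """Decide whether a cached copy looks stale, using the best hint given.

    dateModified is treated as authoritative, then the HTTP Last-Modified
    header, then the sitemap lastmod value. With no hints at all the
    resource is assumed to have changed.
    """
    if date_modified is not None:
        return parse_iso(date_modified) > cached_at
    if last_modified is not None:
        return parsedate_to_datetime(last_modified) > cached_at
    if lastmod is not None:
        return parse_iso(lastmod) > cached_at
    return True


if __name__ == "__main__":
    cached = datetime(2018, 12, 10, 12, 0, tzinfo=timezone.utc)
    # Step 1: only the sitemap hint is known.
    print(changed_since(cached, lastmod="2018-12-11"))
    # Step 2: the HEAD request returned a Last-Modified header.
    print(changed_since(cached, last_modified="Mon, 10 Dec 2018 13:45:00 GMT"))
    # Step 3: the JSON-LD was retrieved and its dateModified inspected.
    print(changed_since(cached, date_modified="2018-12-10T13:45:00.000Z"))
```

A date-only lastmod is compared as midnight UTC in this sketch, so a change later on the same day as the cache time would be missed; that is one reason higher-precision values are preferable.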

1. schema.org/dateModified

Each schema.org instance derived from schema.org/CreativeWork may have a dateModified property to indicate "The date on which the CreativeWork was most recently modified or when the item's entry was modified within a DataFeed." This property should be provided with any instance of schema.org/Dataset or any other schema.org entity published in a landing page or through other mechanisms. The JSON spec does not include a built-in type for date time values, but the general consensus, and a sensible practice, is to represent a date time value as a time zone aware ISO 8601 formatted string. For example:

{
  "dateModified": "2018-12-10T13:45:00.000Z"
}

2. HTTP Last-Modified Header

A schema.org instance is typically embedded in a landing page or may be accessed directly as a JSON-LD document over the HTTP protocol. HTTP resource providers (i.e. web servers) may include a Last-Modified header, which contains the date and time at which the origin server believes the resource was last modified. The format for the date value follows the RFC 2616 specification (since superseded by RFC 9110, which keeps the same date format). For example:

Last-Modified: Mon, 10 Dec 2018 13:45:00 GMT

3. sitemap.xml lastmod value

A sitemap.xml document provides a mechanism for a resource server to advertise available resources. Each <url> element may include a <lastmod> tag to indicate when the resource identified by the <url>/<loc> was last modified. The specification is fairly loose, indicating that a date in the W3C Datetime format of YYYY-MM-DD may be sufficient. However, for the purposes of content synchronization, a higher precision is desirable and should be provided where possible. For example:

2018-12-10T13:45:00.000Z
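One way to keep the three values consistent is to derive them all from a single stored modification time. The Python sketch below does this with the standard library; the function name modification_hints and the dictionary keys are our own choices for illustration, and the input is assumed to be a time zone aware datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def modification_hints(modified):
    """Format one aware datetime for JSON-LD, HTTP, and sitemap use."""
    if modified.tzinfo is None:
        raise ValueError("modified must be time zone aware")
    utc = modified.astimezone(timezone.utc)
    # ISO 8601 with millisecond precision and a trailing Z for UTC.
    iso = utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")
    return {
        "dateModified": iso,                                 # JSON-LD property
        "Last-Modified": format_datetime(utc, usegmt=True),  # HTTP header value
        "lastmod": iso,                                      # sitemap <lastmod>
    }


if __name__ == "__main__":
    stamp = datetime(2018, 12, 10, 13, 45, tzinfo=timezone.utc)
    for name, value in modification_hints(stamp).items():
        print(f"{name}: {value}")
```

Because the HTTP date format has only one-second resolution, the header value can differ from dateModified by a fraction of a second when the stored time carries sub-second precision.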

JSON-LD Graph Techniques

JSON-LD documents represent a graph model, even though at times that graph is implicit rather than being named. Here are some techniques that may be useful when constructing such graphs.

Ordering items with JSON-LD @list

Unlike plain JSON, collections in JSON-LD are unordered [1, 2]. In cases where the ordering of items needs to be preserved, we can use the @list keyword to specify that order should be preserved for a collection. Ordered lists would be important, for example, when a list of authors or creators should be ordered as intended when rendering a view of the metadata, or when a list of bounding box coordinates in an array needs to come in a particular order.

In the following example, the list of creator items is not ordered, and so client tools could return the creator names in any order, and different tools may return them in different orders. This would be problematic for building a citation, for example.

Example 1. Ordering for this list of creators will not be preserved:

{
  "@context": "https://schema.org/",
  "@id": "unordered_01",
  "@type": "Dataset",
  "creator": [
    {
      "@id": "https://www.sample-data-repository.org/person/51317",
      "@type": "Person",
      "name": "Dr Uta Passow"
    },
    {
      "@id": "https://www.sample-data-repository.org/person/50663",
      "@type": "Person",
      "name": "Dr Mark Brzezinski"
    }
  ]
}

To order a list, use the JSON-LD @list keyword, as shown in Example 2:

Example 2. Order will be preserved for this list of creators:

{
  "@context": "https://schema.org/",
  "@id": "order_01",
  "@type": "Dataset",
  "creator": {
    "@list": [
      {
        "@id": "https://www.sample-data-repository.org/person/51317",
        "@type": "Person",
        "name": "Dr Uta Passow"
      },
      {
        "@id": "https://www.sample-data-repository.org/person/50663",
        "@type": "Person",
        "name": "Dr Mark Brzezinski"
      }
    ]
  }
}

Ordering may be specified globally within the document by specifying the container type in a context. For example, after retrieving the context file from schema.org, we can define the schema:creator to be a list container globally in the document using the @container property:

Example 3. Ordering of a list of creators is preserved anywhere such a list appears within the context.

{
  "@context": [
    "https://schema.org/",
    {
      "creator": {
        "@container": "@list"
      }
    }
  ],
  "@id": "order_02",
  "@type": "Dataset",
  "creator": [
    {
      "@id": "https://www.sample-data-repository.org/person/51317",
      "@type": "Person",
      "name": "Dr Uta Passow"
    },
    {
      "@id": "https://www.sample-data-repository.org/person/50663",
      "@type": "Person",
      "name": "Dr Mark Brzezinski"
    }
  ]
}

With this technique, ordering can be set once in the context using @list, and then order will be preserved any time that concept is used in the document.
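From a consumer's point of view, the three examples differ only in whether the order of the values can be relied on. The Python sketch below reads a property from a compact JSON-LD document and reports whether its values are ordered, either through an inline @list or through a list container declared in an inline context. It is a simplification under our own assumptions: the function names are hypothetical, remote context files are not retrieved, and only the plain "@container": "@list" form is recognized. A full JSON-LD processor handles many more cases.

```python
def list_container_terms(context):
    """Return terms that an inline context declares as @list containers."""
    entries = context if isinstance(context, list) else [context]
    terms = set()
    for entry in entries:
        if isinstance(entry, dict):  # skip context URLs such as schema.org
            for term, definition in entry.items():
                if isinstance(definition, dict) and definition.get("@container") == "@list":
                    terms.add(term)
    return terms


def values_with_order(document, prop):
    """Return (values, ordered) for a property of a compact document."""
    value = document.get(prop)
    if value is None:
        return [], False
    if isinstance(value, dict) and "@list" in value:
        return list(value["@list"]), True
    ordered = prop in list_container_terms(document.get("@context", []))
    items = value if isinstance(value, list) else [value]
    return list(items), ordered


CREATORS = [{"name": "Dr Uta Passow"}, {"name": "Dr Mark Brzezinski"}]

# Shortened versions of Examples 1 to 3 above.
EXAMPLE_1 = {"@context": "https://schema.org/", "creator": CREATORS}
EXAMPLE_2 = {"@context": "https://schema.org/", "creator": {"@list": CREATORS}}
EXAMPLE_3 = {
    "@context": ["https://schema.org/", {"creator": {"@container": "@list"}}],
    "creator": CREATORS,
}

if __name__ == "__main__":
    for example in (EXAMPLE_1, EXAMPLE_2, EXAMPLE_3):
        people, ordered = values_with_order(example, "creator")
        print([person["name"] for person in people], "ordered:", ordered)
```

A citation builder could use the ordered flag to decide whether the creator sequence is safe to print as given or needs some other ordering rule.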