Signposting support in Dataverse #5962

Closed
rvanheest opened this issue Jun 21, 2019 · 17 comments · Fixed by #8981
Labels
GDCC: DANS related to GDCC work for DANS

Comments

@rvanheest

rvanheest commented Jun 21, 2019

Signposting is an approach to make the scholarly web more friendly to machines. It uses Typed Links as a means to clarify patterns that occur repeatedly in scholarly portals. For resources of any media type, these typed links are provided in HTTP Link headers.

At DANS we implemented Signposting in EASY on our dataset landing pages with patterns for 'bibliographic metadata' and 'identifiers'. See https://signposting.org/adopters/#dans for examples.

It would be nice to have Signposting implemented in Dataverse as well. We have to see which patterns are suitable. Let's discuss that in this issue.

@pdurbin changed the title from "Signposting support in DataVerse" to "Signposting support in Dataverse" on Jun 21, 2019
@rvanheest
Author

rvanheest commented Jul 17, 2019

After some discussion with @hvdsomp at DANS (one of the authors of Signposting), we came up with the following specs for integrating Signposting in Dataverse. Please feel free to comment on this proposal and share ideas for improvement.

The trick in Signposting is to add a header with key Link whose value is a series of comma-separated entries following the pattern <[url]>; rel="[relation identifier]". See as an example the DANS/EASY implementation that we worked on a couple of years ago. In the specs below each example starts a new Link header so that its syntax is complete, but in practice the entries need to be aggregated into one comma-separated list.
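
The aggregation described above could look roughly like the following Python sketch. The function names `format_link` and `build_link_header` are made up for illustration and are not part of Dataverse or EASY; the two example links are taken from the patterns further down in this proposal.

```python
def format_link(url, rel, **params):
    """Format one entry as <url>; rel="..." followed by optional attributes."""
    parts = [f"<{url}>", f'rel="{rel}"']
    parts += [f'{key}="{value}"' for key, value in params.items()]
    return "; ".join(parts)


def build_link_header(links):
    """Join (url, rel, params) tuples into a single comma-separated header value."""
    return ", ".join(format_link(url, rel, **params) for url, rel, params in links)


# Two relations aggregated into one Link header value.
header = build_link_header([
    ("https://hdl.handle.net/10411/DCXGYS", "cite-as", {}),
    ("http://schema.org/Dataset", "type", {}),
])
print("Link: " + header)
```

Keeping the formatting of a single entry separate from the aggregation makes it easy to add attributes such as type or profile later on.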

Signposting has various so-called 'patterns'. Aggregating these patterns may result in a large Link header, which may require appropriate configuration of the Dataverse webservers. In that case it may be wise to prioritize among these URLs and only provide a subset of them (for example, a limited number of describedby or author links).
A better alternative, however, is to implement Signposting 'by reference' instead of 'by value' using the linkset relation type (new version of the official proposal is upcoming in the next couple of weeks). Although this spec is still 'work in progress', @hvdsomp has confirmed that the part we would typically use is already finished. At the end of this post we will discuss what changes would have to be made in order to implement Signposting by reference. The first part of this post assumes an implementation of Signposting by value.

Patterns

Author pattern

https://signposting.org/author/
Use on dataset landing page

For every author in a dataset, this will try to add one author relation for the authorID (ORCID, VIAF, ISNI).

In general:

for every author in the dataset
  if author has ORCID
    Link: <http://orcid.org/:id>; rel="author"
  else if author has VIAF
    Link: <http://viaf.org/viaf/:id/>; rel="author"
  else if author has ISNI
    Link: <http://www.isni.org/:id>; rel="author"

Example:

https://dataverse.nl/dataset.xhtml?persistentId=hdl:10411/DCXGYS contains 4 authors with an ORCID. Hence we expect a Link header like:

Link: <http://orcid.org/0000-0002-2260-9672>
      ; rel="author",
      <http://orcid.org/0000-0003-1069-7510>
      ; rel="author",
      <http://orcid.org/0000-0002-3505-9753>
      ; rel="author",
      <http://orcid.org/0000-0001-8266-2216>
      ; rel="author"
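
As a rough illustration of the selection logic above, here is a minimal Python sketch. It assumes each author is available as a dictionary from identifier scheme to identifier, which is a made-up shape and not how Dataverse stores authors; the URL templates and the priority order are copied from the pseudocode.

```python
# Priority order and URL templates as given in the pseudocode above.
AUTHOR_ID_TEMPLATES = [
    ("ORCID", "http://orcid.org/{id}"),
    ("VIAF", "http://viaf.org/viaf/{id}/"),
    ("ISNI", "http://www.isni.org/{id}"),
]


def author_links(authors):
    """Return one rel="author" entry per author that has a known identifier.

    `authors` is a list of dicts such as {"ORCID": "0000-0002-2260-9672"}
    (hypothetical shape, chosen for this sketch only).
    """
    links = []
    for author in authors:
        for scheme, template in AUTHOR_ID_TEMPLATES:
            if author.get(scheme):
                links.append(f'<{template.format(id=author[scheme])}>; rel="author"')
                break  # at most one link per author, highest priority first
    return links


example = author_links([
    {"ORCID": "0000-0002-2260-9672"},
    {"ORCID": "0000-0003-1069-7510"},
])
print("Link: " + ", ".join(example))
```

Authors without any of the three identifiers are simply skipped, as in the pseudocode.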

Identifier pattern

https://signposting.org/identifier/
Use on dataset landing page, as well as the file landing page of every file related to the dataset.

This will add cite-as relations for every dataset identifier corresponding to a dataset.

If the files have their own persistent identifier, the landing page should contain a similar Link header with the URL for that identifier. If the files don't have their own persistent identifier, the landing page should instead contain a Link header with URL of the identifier of their dataset, the same as the one that is displayed above. Basically, the URL to be used for the file landing page is the one displayed in 'File citation'.

If a dataset has multiple identifiers (e.g. DOI, Handle, etc.), an entry like the one above can be added (comma separated) for each identifier.

Example (1):

https://dataverse.nl/dataset.xhtml?persistentId=hdl:10411/KYCKAI has a Handle and 2 files. The files don't have a persistent identifier. Hence we expect the following Link header on both the dataset landing page and both file landing pages:

Link: <https://hdl.handle.net/10411/KYCKAI>
      ; rel="cite-as"

Example (2):

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/BUPJNR has a DOI and 49 files. The files have their own persistent identifier. Hence we expect the following Link header on the dataset landing page:

Link: <https://doi.org/10.7910/DVN/BUPJNR>
      ; rel="cite-as"

and we expect the following Link header on the first file in that list (the rest of the files go in a similar way):

Link: <https://doi.org/10.7910/DVN/BUPJNR/AQEWZS>
      ; rel="cite-as"
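
The fallback rule for files (use the file's own identifier if it has one, otherwise the dataset's) could be sketched in Python as follows. The helper names and the `doi:`/`hdl:` prefix handling are assumptions made for this example, based on the persistentId values in the URLs above.

```python
# Resolver base URLs for the two identifier schemes used in the examples.
RESOLVERS = {"doi": "https://doi.org/", "hdl": "https://hdl.handle.net/"}


def pid_to_url(pid):
    """Turn 'doi:10.7910/DVN/BUPJNR' or 'hdl:10411/KYCKAI' into a resolver URL."""
    scheme, _, value = pid.partition(":")
    return RESOLVERS[scheme] + value


def cite_as_link(dataset_pid, file_pid=None):
    """Build the cite-as entry; a file without its own PID falls back to the dataset PID."""
    target = file_pid or dataset_pid
    return f'<{pid_to_url(target)}>; rel="cite-as"'


# Example (1): files without their own identifier.
print("Link: " + cite_as_link("hdl:10411/KYCKAI"))
# Example (2): a file with its own DOI.
print("Link: " + cite_as_link("doi:10.7910/DVN/BUPJNR", "doi:10.7910/DVN/BUPJNR/AQEWZS"))
```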

Bibliographic pattern

https://signposting.org/bibliographic_metadata/
Use on dataset and file landing pages.

This will add describedby relations for every citation format and metadata format exported by this dataset/file. Conversely, every URL that is used in a describedby relation should have a describes relation itself.

Example:
If the dataset/file has a DOI, the following Links must be added:

Link: <[URL to doi.org]>
      ; rel="describedby"
      ; type="application/vnd.datacite.datacite+xml",
      <[URL to doi.org]>
      ; rel="describedby"
      ; type="application/vnd.citationstyles.csl+json"

Additional citation formats could be added if your Link header is not getting too large. For example:

https://dataverse.nl/dataset.xhtml?persistentId=hdl:10411/YDUA7T has 3 citation formats (EndNote, RIS and BibTeX), as well as 5 (suitable) metadata export formats (Dublin Core, DDI, DataCite, OAI_ORE and Schema.org JSON-LD). Hence we expect a Link header on the dataset landing page like:

Link: <https://dataverse.nl/api/datasets/export?exporter=dcterms&persistentId=hdl%3A10411/YDUA7T>
      ; rel="describedby"
      ; type="application/xml"
      ; profile="http://dublincore.org/documents/dcmi-terms/",
      <https://dataverse.nl/api/datasets/export?exporter=ddi&persistentId=hdl%3A10411/YDUA7T>
      ; rel="describedby"
      ; type="application/xml"
      ; profile="http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd",
      <https://dataverse.nl/api/datasets/export?exporter=Datacite&persistentId=hdl%3A10411/YDUA7T>
      ; rel="describedby"
      ; type="application/vnd.datacite.datacite+xml",
      <https://dataverse.nl/api/datasets/export?exporter=OAI_ORE&persistentId=hdl%3A10411/YDUA7T>
      ; rel="resourcemap"
      ; type="application/ld+json",
      <https://dataverse.nl/api/datasets/export?exporter=schema.org&persistentId=hdl%3A10411/YDUA7T>
      ; rel="describedby"
      ; type="application/ld+json"
      ; profile="http://schema.org",
      <[URL to 'Cite Dataset As BibTex']>
      ; rel="describedby"
      ; type="application/x-bibtex",
      <[URL to 'Cite Dataset As RIS']>
      ; rel="describedby"
      ; type="application/x-research-info-systems",
      <[URL to 'Cite Dataset As EndNote']>
      ; rel="describedby"
      ; type="application/x-research-info-systems"

Conversely, all these URLs should have their own Link header, like:

Link: <https://dataverse.nl/dataset.xhtml?persistentId=hdl:10411/YDUA7T>
      ; rel="describes"

The same holds for the landing pages of every file in the dataset, but with the URLs and metadata formats for the file.
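
A minimal Python sketch of how the describedby entries and the converse describes entry might be generated. The format table is only an excerpt copied from the example above, and the function names are hypothetical; the export URL layout is the one shown in the example.

```python
from urllib.parse import quote

# (exporter name, media type, profile or None), copied from the example above.
EXPORT_FORMATS = [
    ("dcterms", "application/xml", "http://dublincore.org/documents/dcmi-terms/"),
    ("Datacite", "application/vnd.datacite.datacite+xml", None),
    ("schema.org", "application/ld+json", "http://schema.org"),
]


def describedby_links(base_url, pid):
    """Build one describedby entry per metadata export format."""
    links = []
    for exporter, media_type, profile in EXPORT_FORMATS:
        # quote() encodes ':' but keeps '/', giving hdl%3A10411/YDUA7T as in the example.
        url = f"{base_url}/api/datasets/export?exporter={exporter}&persistentId={quote(pid)}"
        link = f'<{url}>; rel="describedby"; type="{media_type}"'
        if profile:
            link += f'; profile="{profile}"'
        links.append(link)
    return links


def describes_link(landing_page_url):
    """Converse entry to be returned in the Link header of each export URL."""
    return f'<{landing_page_url}>; rel="describes"'


links = describedby_links("https://dataverse.nl", "hdl:10411/YDUA7T")
print("Link: " + ",\n      ".join(links))
```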

Resource Type pattern

https://signposting.org/resource_type/
Use on dataset landing page, as well as the file landing page of every file related to the dataset.

This will add a type relation for the dataset/file landing page.

Example (1):
https://dataverse.nl/dataset.xhtml?persistentId=hdl:10411/DCXGYS is a dataset landing page. Hence we expect a Link header like:

Link: <http://schema.org/Dataset>
      ; rel="type"

Example (2):
https://dataverse.nl/file.xhtml?fileId=15529&version=1.0 is a file landing page. Depending on the kind of file we can determine which schema.org value is to be chosen.
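
How the schema.org value for a file is chosen is left open above. One possible approach, sketched here in Python purely as an assumption, is to map the top-level MIME type of the file to a schema.org type and fall back to a generic one. The mapping table and the fallback are illustrative choices, not part of this proposal.

```python
# Hypothetical mapping from the top-level MIME type of a file to a schema.org type.
SCHEMA_TYPE_BY_MEDIA = {
    "image": "http://schema.org/ImageObject",
    "audio": "http://schema.org/AudioObject",
    "video": "http://schema.org/VideoObject",
}
# Assumed fallback for everything else.
FALLBACK_TYPE = "http://schema.org/MediaObject"


def type_link(content_type=None):
    """Dataset pages pass no content type; file pages pass the file's MIME type."""
    if content_type is None:
        target = "http://schema.org/Dataset"
    else:
        top_level = content_type.split("/", 1)[0].lower()
        target = SCHEMA_TYPE_BY_MEDIA.get(top_level, FALLBACK_TYPE)
    return f'<{target}>; rel="type"'


print("Link: " + type_link())             # dataset landing page
print("Link: " + type_link("image/png"))  # file landing page
```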

Publication Boundary pattern (1)

https://signposting.org/publication_boundary/
Use on dataset landing page, as well as the file landing page of every file related to the dataset.

This will add an item relation from the dataset landing page to the file landing page of every file in the dataset, as well as the converse collection relation from the file landing pages to their corresponding dataset landing page.

Example:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/XNOARV is a dataset with 9 files. Hence we expect a Link header on the dataset landing page like:

Link: <https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/XNOARV/MEYZEF&version=2.1>
      ; rel="item"
      ; type="text/html",
      <https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/XNOARV/JTN6MT&version=2.1>
      ; rel="item"
      ; type="text/html",
      [etcetera]

Conversely, on each of these file landing pages we expect a Link header like:

Link: <https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/XNOARV>
      ; rel="collection"
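
The two directions of this pattern could be generated together, as in the following Python sketch. The function name and return shape are made up for illustration; the URLs are the ones from the example.

```python
def boundary_links(dataset_url, file_urls):
    """Return the header value for the dataset page and one per file page.

    The dataset page lists every file landing page as rel="item"; each file
    landing page points back to the dataset with rel="collection".
    """
    dataset_header = ", ".join(
        f'<{url}>; rel="item"; type="text/html"' for url in file_urls
    )
    file_headers = {url: f'<{dataset_url}>; rel="collection"' for url in file_urls}
    return dataset_header, file_headers


dataset_url = "https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/XNOARV"
file_urls = [
    "https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/XNOARV/MEYZEF&version=2.1",
    "https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/XNOARV/JTN6MT&version=2.1",
]
dataset_header, file_headers = boundary_links(dataset_url, file_urls)
print("Link: " + dataset_header)
```

The dataset header grows linearly with the number of files, which is why its size matters for this pattern.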

Warning:
This pattern can quickly cause the Link header to grow very large as the number of files grows larger. It is therefore recommended to implement this pattern when Signposting is implemented by reference. In case we use a by value approach, it is generally recommended to not implement this pattern.

Publication Boundary pattern (2)

https://signposting.org/publication_boundary/
Use on file landing page

This will add an item relation from the file landing page to the actual file download URL, as well as the converse collection relation from the file download URL to the corresponding file landing page.

Example:
https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/XNOARV/MEYZEF&version=2.1 is a file landing page. We expect a Link header on this page like:

Link: <https://dataverse.harvard.edu/api/access/datafile/3347023?gbrecs=true>
      ; rel="item"
      ; type="type/x-r-syntax"

Conversely, on the download URL we expect a Link header like:

Link: <https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/XNOARV/MEYZEF&version=2.1>
      ; rel="collection"

Signposting by value vs. by reference

As mentioned before, the number of relations proposed in the above section will cause the Link header to grow large quite quickly. This may require the Dataverse webservers to be configured in special ways in order to support such large Link headers.

One approach would be to implement only a subset of these patterns and relations. For example:

  • only list the relation for the first author as opposed to all authors
  • only implement the DOI relations in the Bibliographic pattern
  • do not implement the Publication Boundary pattern at all

An obvious drawback of this is the loss of relations that might help in the discoverability of the datasets and files.
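
If a subset has to be chosen, one option is to order the entries by priority and keep adding them until a size budget is reached. The Python sketch below illustrates that idea; the function name and the budget value are assumptions, and the actual limit depends on how the webservers are configured.

```python
def fit_links(links_by_priority, max_bytes):
    """Keep entries in priority order until the header value would exceed max_bytes."""
    kept = []
    size = 0
    for link in links_by_priority:
        # Every entry after the first also costs a ", " separator.
        extra = len(link.encode("utf-8")) + (2 if kept else 0)
        if size + extra > max_bytes:
            break
        kept.append(link)
        size += extra
    return ", ".join(kept)


links = [
    '<https://hdl.handle.net/10411/DCXGYS>; rel="cite-as"',
    '<http://schema.org/Dataset>; rel="type"',
    '<http://orcid.org/0000-0002-2260-9672>; rel="author"',
]
# The budget of 100 bytes is arbitrary and only serves to show the truncation.
print("Link: " + fit_links(links, max_bytes=100))
```

Entries that do not fit are dropped silently here, which is exactly the loss of relations mentioned above.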

An alternative to this so-called by value implementation is the by reference implementation that uses the linkset relation type (new version of the official proposal is upcoming in the next couple of weeks). This proposal is still work in progress, but is, according to @hvdsomp, completed in the parts that we require for this implementation.

In the by reference implementation we provide only one relation in the Link header of the dataset/file landing pages. Following this link returns a response that is formatted like the Link headers above, containing all relations.

Example:
On a dataset/file landing page we expect a Link header like:

Link: <https://dataverse.nl/api/[datasets|files]/export?exporter=signposting&persistentId=[dataset/file id]>
      ; rel="linkset"

Please note that the URL is made up and is very much open for discussion.

Following this URL will return a formatted text like:

Link: <http://orcid.org/:id>
      ; rel="author"
      ; anchor="https://dataverse.nl/dataset.xhtml?persistentId=hdl:10411/DCXGYS",
      <https://hdl.handle.net/10411/DCXGYS>
      ; rel="cite-as"
      ; anchor="https://dataverse.nl/dataset.xhtml?persistentId=hdl:10411/DCXGYS",
      <http://schema.org/Dataset>
      ; rel="type"
      ; anchor="https://dataverse.nl/dataset.xhtml?persistentId=hdl:10411/DCXGYS",
      <https://dataverse.nl/api/datasets/export?exporter=dcterms&persistentId=hdl%3A10411/YDUA7T>
      ; rel="describedby"
      ; type="application/xml"
      ; profile="http://dublincore.org/documents/dcmi-terms/"
      ; anchor="https://dataverse.nl/dataset.xhtml?persistentId=hdl:10411/DCXGYS",
      [etcetera]

Note that now every relation also has an anchor field pointing to the landing page, as we are no longer fetching this data from the landing page URL itself.
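
A small Python sketch of the by reference variant: the landing page gets a single linkset entry, and the referenced document repeats every relation with an explicit anchor. The function names are hypothetical, and the output format follows the example above rather than the final linkset specification.

```python
def linkset_pointer(linkset_url):
    """The single entry that goes into the Link header of the landing page."""
    return f'<{linkset_url}>; rel="linkset"'


def linkset_body(anchor_url, links):
    """Format (target_url, rel) pairs as linkset text, each anchored to the landing page."""
    return ",\n".join(
        f'<{target}>; rel="{rel}"; anchor="{anchor_url}"' for target, rel in links
    )


landing_page = "https://dataverse.nl/dataset.xhtml?persistentId=hdl:10411/DCXGYS"
body = linkset_body(landing_page, [
    ("https://hdl.handle.net/10411/DCXGYS", "cite-as"),
    ("http://schema.org/Dataset", "type"),
])
print(body)
```

Because the anchor is explicit, the same text can be served from any URL without losing track of which landing page the relations belong to.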

One of the major advantages of the by reference implementation is that we do not have to limit ourselves to a subset of all relations. We can provide as many relations as we want, without worrying about the length of the Link header.
Besides that it keeps the header of the landing page light-weight, and limits the number of bytes spent on headers that typically are not consumed by your browser.

We may want to consider implementing Signposting by reference on dataset and file landing pages, whereas for all other links (such as API calls) the by value approach will suffice, since these mostly carry just one relation.

@pdurbin
Member

pdurbin commented Jul 17, 2019

@rvanheest thanks for the detailed write up! I guess I have a few questions:

  • Are you able to make a pull request for this? Or someone at DANS?
  • Would it be configurable? Could an installation of Dataverse turn this feature off if they don't want it?
  • Do you have any concerns about performance? Would it slow down the dataset page?

Thanks!

@rvanheest
Author

Are you able to make a pull request for this? Or someone at DANS?

Personally I don't have much experience with the Dataverse code base yet, so I'm not sure if I'd be the one to implement this. But I'll check on this within DANS and I'll let you know about this. On the other hand, would it be a possibility for Harvard (or another organisation) to do the implementation? In that case I'd be glad to help out with any further questions or advice regarding Signposting.

Would it be configurable? Could an installation of Dataverse turn this feature off if they don't want it?

I guess you could make it so that this feature can be disabled. However, I don't see the point of that. Signposting is considered to be the 'low-hanging fruit' of machine-to-machine discoverability in archives/repositories, so I'm wondering whether there is a special use case in which this feature would need to be disabled. The only reason I can think of is the by value issue, that is, concerns about server configuration related to the length of the Link header. That, however, would be solved when Signposting is implemented by reference for the dataset and file landing pages.

Of course you could make this as fancy as you would like it to be. In a very rich version you could imagine even having a UI panel where you can select which link relations are to be enabled/disabled.

Do you have any concerns about performance? Would it slow down the dataset page?

In our implementation in EASY we did not experience any performance issues with this. Of course, a huge number of relations in the Link header might be an issue, but as I mentioned in the proposal, this is why @hvdsomp is now working on the linkset relation type. With it, the Link header of the dataset and file landing pages would contain only one (static?) URL, and the relations would only be generated when that URL is fetched. Finally, if performance turns out to be an issue there as well, you could even consider generating these relations beforehand and storing them somewhere in Dataverse (only updating them when the data/metadata changes).
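
The idea of generating the relations once and only regenerating them when the data/metadata changes could look roughly like this in Python. This is only a sketch: the class name is made up, and it assumes some last-modified value is available per dataset, which says nothing about how Dataverse actually tracks changes.

```python
class LinksetCache:
    """Cache generated linkset text per identifier until its metadata changes."""

    def __init__(self, generate):
        self._generate = generate  # callable: pid -> linkset text
        self._entries = {}         # pid -> (last_modified, linkset text)

    def get(self, pid, last_modified):
        cached = self._entries.get(pid)
        # Regenerate only on first use or when the last-modified marker differs.
        if cached is None or cached[0] != last_modified:
            cached = (last_modified, self._generate(pid))
            self._entries[pid] = cached
        return cached[1]


calls = []

def generate_linkset(pid):
    # Stand-in for the real (possibly expensive) database lookups.
    calls.append(pid)
    return f'<https://hdl.handle.net/{pid}>; rel="cite-as"'

cache = LinksetCache(generate_linkset)
cache.get("10411/DCXGYS", "2019-07-17")
cache.get("10411/DCXGYS", "2019-07-17")  # served from the cache
cache.get("10411/DCXGYS", "2019-07-18")  # metadata changed, regenerated
print(len(calls))
```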

@pdurbin
Member

pdurbin commented Jul 17, 2019

I'll let you know about this. On the other hand, would it be a possibility for Harvard (or another organisation) to do the implementation?

Harvard/IQSS has a long to-do list at https://www.iq.harvard.edu/roadmap-dataverse-project so it would be great if we could identify an external contributor who is willing to make a pull request. I'm happy to mentor anyone to get up to speed with hacking on Dataverse. 😄

In a very rich version you could imagine even having a UI panel where you can select which link relations are to be enabled/disabled.

For the initial version I think all these relations should be hard coded. Later, they could be configurable.

In our implementation in EASY we did not experience any performance issues with this.

Sorry, I meant the lookups from the Dataverse database. You're right, we could do some caching if necessary. I just wouldn't want this feature that many people haven't heard of (yet) to slow down their installation of Dataverse in any way. That's why I was suggesting a toggle to turn the feature off if it isn't desired.

At this point I would suggest starting a "Signposting" thread at https://groups.google.com/forum/#!forum/dataverse-community that explains the benefit and points to this issue so that more people in the Dataverse community can learn about what it is.

@markwilkinson

Would it perhaps be more efficient/expressive to use a Link "meta" header, pointing to a metadata URI (rather than having many more precise links to subsets of metadata)?

@markwilkinson

Apologies - though the "meta" Link header is still in the W3C's blog showing the utility of Link headers, it looks like it never made it through the standardization process. It appears that it was replaced by "describedBy" (https://www.iana.org/assignments/link-relations/link-relations.xhtml)

It seems that this would point to a "generic" document containing all metadata about the subject.

@jggautier
Contributor

Folks building a tool to assess the "FAIRNESS" of datasets in Dataverse repositories (https://www.fairsfair.eu/fairsfair-data-object-assessment-metrics-request-comments) recommended that Dataverse implement signposting. Typed links are one of several ways their tools are looking for information about datasets in Dataverse-based repositories.

@philippconzett
Contributor

Here's some more information from the FAIRsFAIR team / @kitchenprinzessin3880:

The author of signposting (Herbert van de Sompel) is now working on the signposting patterns for FAIR data, see the draft here: https://signposting.org/FAIR/.

I'm currently working on a grant proposal together with DANS, Harvard and others for funding to upgrade DataverseNO, and we might consider including a Signposting implementation in Dataverse in one of the work packages. Does anyone have an idea of approximately how many resources (expressed in person months) this will require?

@hvdsomp

hvdsomp commented Oct 8, 2020

Ha! That draft wasn't really public yet, but, hey, I guess it is now :-)

The gist of the material is there but I still am working on examples at the end of the document. Feedback to the draft is very welcome, of course!

Great to hear that you might look into implementing the FAIR Signposting Profile for Dataverse. That would be super, of course. I am not a Dataverse expert at all, so it is hard for me to assess the resources required to implement, but you would be looking at:

  • For Level 1 and Level 2: Ability to add HTTP Link header for the landing page and provide the typed links required for the respective Level. For Level 1 that's ~4+ links. For Level 2 that is just 1 "linkset" link.
  • For Level 3: Ability to add HTTP Link header for the dataset files. That is just 1 "linkset" link.
  • For Level 2 and Level 3: Ability to (dynamically) expose a Linkset document that contains all required typed links. I think that the OAI-ORE Resource Map that Dataverse makes available contains the info that is required and could be transformed to a Linkset document. I also think that the approach shown in section Implementing Level 2 and Level 3 with a Single Link Set is probably feasible in Dataverse.

@kitchenprinzessin3880

@hvdsomp would you like to create a google doc version of https://signposting.org/FAIR/, so that we can provide feedback on the work? or do you prefer email instead? I think the gdoc option is better as the work is applicable to other repositories collaborating with the FAIRsFAIR team.

@philippconzett regarding the resources required, my software architect @uschindler took a day or so to implement the previous patterns ;) if you need further input on technical implementation, please do not hesitate to get in touch with him..

@philippconzett
Contributor

@hvdsomp Sorry for that! I wasn't aware that the webpage wasn't public yet. Thanks for further input!

@kitchenprinzessin3880 Thanks for this feedback. I guess then that this will be rather easy to implement once Dataverse has decided to do so.

@hvdsomp

hvdsomp commented Oct 8, 2020

@kitchenprinzessin3880 I am OK to upload a version of the document into Google docs but will need to figure out how to do that as it's HTML with external CSS etc. Alternatively, we could just create a blank Google doc to provide feedback in, referring to section/paragraph in the doc?

@kitchenprinzessin3880

@hvdsomp yes external references can be problematic. how about i create an issue at https://github.com/pangaea-data-publisher/fuji and add my feedback there? i will share issue link to other project collaborators. what do you say?

@pdurbin
Member

pdurbin commented Oct 8, 2020

It's nice to see some chatter here. 😄 I'd feel remiss if I didn't mention that back in January it was a pleasure to meet @hvdsomp at PIDapalooza (when one could still travel!) where I also had a brief chat with Martin Klein, whom I asked about ResourceSync. In terms of time to implement, Martin said Signposting should be much easier since (from my understanding), you're just making some headers available. The devil is in the details, I'm sure, but there seems to be enough demand that we should probably try to get Signposting into Dataverse someday. 😄

@hvdsomp

hvdsomp commented Oct 9, 2020

Indeed, @pdurbin, it was good to chat at PIDapalooza in the days when that was still possible in person. I now have a significantly reworked draft available at http://signposting.org/FAIR/ . Feedback invited via https://github.com/hvdsomp/signposting . I hope the Dataverse community can work towards embracing this approach.

@philippconzett
Contributor

I just noticed that Signposting support is set out as a desired characteristic in the COAR Community Framework for Good Practices in Repositories (https://doi.org/10.5281/zenodo.4110829); cf.:

1.8 The repository supports HTTP link headers to provide automated discovery of metadata records and content resources associated with repository items. We recommend Signposting typed links to support this.

@hvdsomp

hvdsomp commented Jan 22, 2021

Indeed. BTW, Implementation of the FAIR Signposting Profile (Level 1 and Level 2) for Dataverse is currently ongoing at DANS.

@jggautier jggautier mentioned this issue Jun 3, 2021
@qqmyers qqmyers added the GDCC: DANS related to GDCC work for DANS label Sep 8, 2021
pdurbin added a commit to GlobalDataverseCommunityConsortium/dataverse that referenced this issue Mar 13, 2023
pdurbin added a commit to GlobalDataverseCommunityConsortium/dataverse that referenced this issue Mar 13, 2023
pdurbin added a commit to GlobalDataverseCommunityConsortium/dataverse that referenced this issue Mar 13, 2023
pdurbin added a commit to GlobalDataverseCommunityConsortium/dataverse that referenced this issue Mar 13, 2023
pdurbin added a commit to GlobalDataverseCommunityConsortium/dataverse that referenced this issue Mar 22, 2023
8 participants