ProjectReportingExample

Data validation and schema interoperability

Leyla Garcia (corresponding author) (http://orcid.org/0000-0003-3986-0510)
Jerven Bolleman (https://orcid.org/0000-0002-7449-1266)
Michel Dumontier (http://orcid.org/0000-0003-4727-9435)
Simon Jupp (https://orcid.org/0000-0002-0643-3144)
Jose Emilio Labra Gayo (http://orcid.org/0000-0001-8907-5348)
Thomas Liener (https://orcid.org/0000-0003-3257-9937)
Tazro Ohta (https://orcid.org/0000-0003-3777-5945)
Núria Queralt-Rosinach (https://orcid.org/0000-0003-0169-8159)
Chunlei Wu (https://orcid.org/0000-0002-2629-6124)

State the problem you worked on

Develop ShEx schemas for a number of bio datasets
Develop new tools to visualize ShEx schemas and integrate those visualizations in documentation pipelines
Enable the interconversion of JSON schema, ShEx, and SHACL
Create a FAIR schema repository and API for JSON schema, ShEx, and SHACL documents
Create a proof of concept of a ShEx creator, a tool to support the creation of ShEx expressions

Give the state-of-the art/plan

Validating RDF data is necessary to ensure that the data complies with the conceptual model it follows, e.g., the schema or ontology behind the data. Validation can also help with data consistency and completeness. There are different approaches to validating RDF data. For instance, JSON schema is particularly useful for data expressed in the JSON-LD RDF serialization, while Shape Expressions (ShEx) (Baker and Prud’hommeaux, 2019) and the Shapes Constraint Language (SHACL) (Knublauch and Kontokostas, 2017) can be used with other serializations as well. Currently, no validation approach prevails over the others; depending on data characteristics and personal preferences, one or the other can be used. In some cases the approaches are interchangeable; however, that is not always so, which makes it necessary to identify a subset that can be seamlessly translated from one approach to another.

During the DBCLS/NDBC 2019 BioHackathon, we worked on a variety of topics related to RDF data validation, including (i) development of ShEx shapes for a number of datasets, (ii) development of a tool to semi-automatically create ShEx shapes, (iii) improvements to the RDFShape tool (Labra-Gayo et al., 2018), and (iv) enabling the conversion of validation schemas from one format to another. In the following sections we detail the work done on each front.

Describe what you have done/results starting with The working group created...

Development of ShEx shapes

We created and updated ShEx shapes for different biomedical resources including Health Care and Life Sciences (HCLS) dataset descriptions (Gray et al., 2015), Bioschemas (Gray et al., 2017), ~~European Joint Program (EJP) on rare disease data~~ and DisGeNET (Piñero et al., 2017). To make future updates easier, we developed applications that automatically create ShEx shapes from the HCLS dataset specification and from Bioschemas profiles.

Bioschemas

Bioschemas is a community-driven project aiming to support schema.org types for the Life Sciences. It contributes to the community by adding Life Science types to schema.org, defining profiles adjusted to community needs, and developing supporting tools. A Bioschemas profile is a type customization including property cardinality and requirement level. Bioschemas shapes currently focus on profiles corresponding to the Biotea project, particularly those related to bibliographic data. Biotea (Garcia et al., 2018) provides a model to express scholarly articles in RDF, including not only bibliographic data but also article structure and named entities recognized in the text.

Biotea-Bioschemas ShEx shapes are created via a Jupyter notebook from the YAML Bioschemas profile files. Schema.org datatypes are transformed to XML Schema Definition (XSD) datatypes, while supporting shapes are created for any combination of schema.org types used as ranges. In addition, three main shapes are created for any Bioschemas profile, corresponding to the three property requirement levels, i.e., minimum, recommended and optional. Profile information, i.e., profile name, schema.org type and YAML file location, is encoded in a comma-separated values (CSV) file, making it easy to use the code to generate shapes for any other Bioschemas profile. More information is available at the GitHub repository for this project.
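As a rough illustration of the generation step described above, the following Python sketch builds one shape per requirement level from a profile description. It is not the notebook's code: the in-memory profile structure, the `SCHEMA_TO_XSD` mapping, the shape naming, the example properties, and the assumption that each level also includes the stricter levels are all hypothetical choices made for this example. Ranges that are schema.org types, which would need supporting shapes, are left out.

```python
# Hypothetical mapping from schema.org datatypes to XSD datatypes.
SCHEMA_TO_XSD = {
    "Text": "xsd:string",
    "Date": "xsd:date",
    "DateTime": "xsd:dateTime",
    "Integer": "xsd:integer",
    "Boolean": "xsd:boolean",
}

# Requirement levels, from strictest to most permissive.
LEVELS = ["minimum", "recommended", "optional"]

PREFIXES = (
    "PREFIX schema: <http://schema.org/>\n"
    "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\n"
)


def build_shapes(profile_name, schema_type, properties):
    """Return ShExC text with one shape per requirement level.

    Assumption: the shape for a level requires the properties of that
    level and of all stricter levels.
    """
    shapes = []
    for index, level in enumerate(LEVELS):
        included = LEVELS[: index + 1]
        constraints = [f"  a [schema:{schema_type}]"]
        for prop in properties:
            if prop["level"] not in included:
                continue
            datatype = SCHEMA_TO_XSD[prop["range"]]
            cardinality = "+" if prop["many"] else ""
            constraints.append(
                f"  schema:{prop['name']} {datatype} {cardinality}".rstrip()
            )
        body = " ;\n".join(constraints)
        shapes.append(f"<{profile_name}_{level}> {{\n{body}\n}}")
    return PREFIXES + "\n" + "\n\n".join(shapes) + "\n"


# Made-up profile entries, standing in for a parsed YAML profile file.
profile = [
    {"name": "headline", "range": "Text", "level": "minimum", "many": False},
    {"name": "datePublished", "range": "Date", "level": "recommended", "many": False},
    {"name": "keywords", "range": "Text", "level": "optional", "many": True},
]

shex = build_shapes("ScholarlyArticle", "ScholarlyArticle", profile)
print(shex)
```

Keeping the datatype mapping and the level list as plain data means that supporting another profile only requires a new list of property entries, which mirrors the CSV-driven setup described above.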

DisGeNET

DisGeNET is a comprehensive gene-disease association knowledge base in the Life Sciences. It is widely used by the biomedical community, and its Linked Data representation has been selected as an ELIXIR Europe interoperability resource. However, it still lacks an easy way to query this vast amount of information and to explore this knowledge across other domains through its SPARQL endpoint.

During the BioHackathon we implemented the DisGeNET-RDF ShEx shape. To do so, we used RDFShape and the suite of generation and validation tools it comes with. We detected some disagreements between the DisGeNET schema illustrated on its website and the actual underlying data. We also discussed how best to approach the development of the ShEx shapes in an automatic and data-driven way, so that we can continue working on it after the BioHackathon.

HCLS

The HCLS Community Profile for Dataset Descriptions offers a concrete guideline to specify dataset metadata as RDF, including elements of data description, versioning, and provenance, so as to support discovery, exchange, query, and retrieval of dataset metadata. As part of their work, the HCLS Community created Validata, a web application to check the compliance of RDF documents with the guideline specifications. Validata used a non-standard extension of ShEx to check various compliance levels.

We created ShEx-compliant documents by processing the HCLS guideline with a PHP script. The result is several ShEx documents that can be used to check compliance at various levels (MUST, SHOULD, MAY, SHOULD NOT, MUST NOT). We validated our work against the exemplar documents provided as part of the guideline and also used it to check HCLS metadata from UniProt. This work revealed errors in both the UniProt metadata and the RDFShape tool.
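One way to picture how requirement keywords can become ShEx constraints is sketched below in Python. This is not the PHP script mentioned above: the mapping of keywords to cardinalities, the `ex:` properties, the two checking levels, and the shape naming are assumptions made for illustration only. The function returns a shape fragment, so prefix declarations would still have to be added to obtain a complete schema.

```python
# Assumption: which requirement keywords are enforced at each checking level.
ENFORCED = {
    "MUST": {"MUST", "MUST NOT"},
    "SHOULD": {"MUST", "MUST NOT", "SHOULD", "SHOULD NOT"},
}


def constraint(prop, requirement, check_level):
    """Return one triple constraint for a property at a checking level."""
    if requirement in ENFORCED[check_level]:
        # Enforced prohibitions allow no value; enforced obligations need one or more.
        cardinality = "{0}" if requirement.endswith("NOT") else "+"
    else:
        # Keywords that are not enforced at this level leave the property unconstrained.
        cardinality = "*"
    return f"  {prop} . {cardinality}"


def shape_for(name, guideline, check_level):
    """Build a ShExC shape fragment for one checking level."""
    body = " ;\n".join(
        constraint(prop, requirement, check_level)
        for prop, requirement in guideline.items()
    )
    return f"<{name}_{check_level}> {{\n{body}\n}}"


# Placeholder properties, not taken from the HCLS guideline.
guideline = {
    "ex:title": "MUST",
    "ex:homepage": "SHOULD",
    "ex:version": "MUST NOT",
    "ex:keyword": "MAY",
}

print(shape_for("Summary", guideline, "MUST"))
print(shape_for("Summary", guideline, "SHOULD"))
```

Generating a separate shape per checking level keeps each resulting document within standard ShEx, at the cost of repeating the property list in every shape.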

ShEx creator

While ShEx is very useful for validating RDF data, as demonstrated above, the syntax of ShEx expressions can be hard for new users and time-consuming even for experienced ones. Therefore, a prototype of a ShEx creator was proposed for the BioHackathon. This tool should help users write correct ShEx expressions faster. The prototype is implemented as a JavaScript tool that guides the user, e.g., through dropdown menus, towards a correct ShEx structure, and it uses the RDFShape API in the background to validate the created ShEx expression. The prototype can be found at https://github.com/LLTommy/RDFvalidation4humans.

Improvements to RDFShape tool

The RDFShape tool (Labra-Gayo et al., 2018) comprises a set of tools to create and validate RDF data via ShEx and SHACL shapes. During the BioHackathon, it was used to create shapes and validate RDF data from different endpoints. This use revealed possible improvements, such as the ability to validate triples obtained from a mix of RDF data provided by the user and data already contained in a SPARQL endpoint. This feature was added to the new version developed during the BioHackathon.

We also explored and implemented new visualization features for ShEx. The implementation is now separated into several modules:

RDFShape client is a JavaScript client based on the React framework.
RDFShape server contains the server part and is implemented in Scala using the http4s library.
umlSHaclEx is a module that generates UML-like visualizations from Shapes schemas. The library can be used as a standalone command line tool.
SHaclEX contains the main validation modules for ShEx and SHACL.
SRDF defines a simple RDF interface with the main features required by the validation library. The module contains several implementations of that interface, which enable the use of validation with Apache Jena models, RDF4j, or SPARQL endpoints.
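The idea of validating against a small RDF interface with interchangeable backends can be sketched as follows. The sketch is written in Python for brevity and does not reproduce the Scala API of SRDF: `TripleSource`, `InMemorySource`, `UnionSource`, and `count_values` are hypothetical names, and the second in-memory source merely stands in for a SPARQL endpoint.

```python
from typing import Iterable, Iterator, Protocol, Tuple

Triple = Tuple[str, str, str]


class TripleSource(Protocol):
    """Minimal interface a validator needs: the triples of a given node."""

    def triples_with_subject(self, subject: str) -> Iterator[Triple]: ...


class InMemorySource:
    """Backend holding triples in a list (stand-in for any concrete store)."""

    def __init__(self, triples: Iterable[Triple]):
        self._triples = list(triples)

    def triples_with_subject(self, subject: str) -> Iterator[Triple]:
        return (t for t in self._triples if t[0] == subject)


class UnionSource:
    """Backend that merges several sources and drops duplicate triples."""

    def __init__(self, *sources: TripleSource):
        self._sources = sources

    def triples_with_subject(self, subject: str) -> Iterator[Triple]:
        seen = set()
        for source in self._sources:
            for triple in source.triples_with_subject(subject):
                if triple not in seen:
                    seen.add(triple)
                    yield triple


def count_values(source: TripleSource, node: str, predicate: str) -> int:
    """Count the values of a predicate for a node, whatever the backend is."""
    return sum(1 for _, p, _ in source.triples_with_subject(node) if p == predicate)


user_data = InMemorySource([("ex:gene1", "ex:label", "gene one")])
endpoint_stub = InMemorySource([
    ("ex:gene1", "ex:label", "gene one"),
    ("ex:gene1", "ex:associatedWith", "ex:disease1"),
])
merged = UnionSource(user_data, endpoint_stub)
print(count_values(merged, "ex:gene1", "ex:associatedWith"))
```

Because the counting function only depends on the interface, the same cardinality check runs unchanged on user-provided data, on the endpoint stand-in, or on their union.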

Schema conversion across validation approaches

During the BioHackathon we also worked on identifying a common subset of ShEx that could serve as the basis for generating documentation of RDF data models and that can later be converted to JSON schema, ShEx or SHACL. Although full interoperability between those languages is not feasible, we consider that a subset language could be defined to handle the most common cases (Labra-Gayo et al., 2019).

Through CD2H's Data Discovery Engine project, we previously developed a web-based tool called Schema Playground to facilitate schema visualization, hosting and extension. It helps developers publish their existing schemas as well as build new schemas by extending existing ones. Schema Playground currently supports schema.org schemas defined in JSON-LD format and JSON-schema-based data validation. While JSON-schema is a good fit for the underlying JSON-based data structure, ShEx and SHACL provide a more expressive way to describe validation rules when the underlying data are presented as triples. At the hackathon, we converted several JSON-schema-based validation rules to ShEx and performed the validation on the underlying data (e.g., dataset metadata). These exercises helped us identify the requirements for adding ShEx support to our BioThings Schema Playground.
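The kind of rule conversion described above can be illustrated with a minimal Python sketch that handles only a small, commonly used part of JSON schema: flat objects with typed properties, a `required` list, and arrays of simple values. It is not the conversion performed at the hackathon; the function name, the datatype mapping, the default vocabulary IRI, and the example schema are assumptions made for this example.

```python
# Assumed mapping from JSON schema primitive types to XSD datatypes.
JSON_TO_XSD = {
    "string": "xsd:string",
    "integer": "xsd:integer",
    "number": "xsd:double",
    "boolean": "xsd:boolean",
}


def json_schema_to_shex(shape_name, schema, vocab="http://schema.org/"):
    """Convert a flat JSON schema object definition to ShExC text."""
    required = set(schema.get("required", []))
    constraints = []
    for name, spec in schema.get("properties", {}).items():
        if spec.get("type") == "array":
            datatype = JSON_TO_XSD[spec["items"]["type"]]
            # An empty array yields no triples, so only minItems >= 1 forces a value.
            cardinality = "+" if spec.get("minItems", 0) >= 1 else "*"
        else:
            datatype = JSON_TO_XSD[spec["type"]]
            cardinality = "" if name in required else "?"
        constraints.append(f"  :{name} {datatype} {cardinality}".rstrip())
    body = " ;\n".join(constraints)
    return (
        f"PREFIX : <{vocab}>\n"
        "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\n\n"
        f"<{shape_name}> {{\n{body}\n}}\n"
    )


# Made-up schema for dataset metadata.
dataset_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "version": {"type": "string"},
        "keywords": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name"],
}

print(json_schema_to_shex("DatasetShape", dataset_schema))
```

Constructs outside this subset, such as nested objects, `oneOf`, or string patterns, would need additional rules, which is where the limits of interoperability between the languages start to show.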

Write a conclusion

We developed a formal description of the HCLS dataset metadata guidelines in a manner that is compliant with the latest version of ShEx. This work is important not only to the HCLS community that uses the guideline, but can also form a basis for automated computational validation of metadata descriptions, as per the FAIR (Findable, Accessible, Interoperable, Reusable) principles. In a similar vein, we prototyped a (semi)automatic solution to generate ShEx shapes for Bioschemas, which could later be extended to Bioschemas profiles other than those defined by Biotea. We also developed a prototype corresponding to the first formal description of the DisGeNET-RDF data model using ShEx. Our strategy to generate the DisGeNET-RDF ShEx shape comprised three steps: (i) manually building the shape from the schema depicted on the website, (ii) polishing it via inference from some actual data instances, and (iii) validating it against the whole database via the SPARQL endpoint. The shapes created for DisGeNET will serve as a basis to develop a more automated solution for this resource.

The development of ShEx shapes using the RDFShape tools resulted in a user testing exercise in which bugs were identified. This direct interaction with users allowed us to quickly implement fixes and immediately test them with users, leading to a new version. In addition, from the creation of ShEx shapes and the transformation from one format to another, we identified a need to improve the tools and technologies used to describe and validate RDF data. Such validation could facilitate machine-readable community agreements regarding metadata, thus leading to more FAIR data, as community-based validators could interoperate with the FAIR metrics evaluator (Wilkinson et al., 2018).

Write up any future work

Regarding the generation of ShEx shapes, the HCLS team plans to check the compliance of other HCLS dataset metadata documents on the web and report the findings to the community, while Bioschemas will work on a validation platform that can later communicate with the FAIR evaluator. Regarding DisGeNET, the ShEx shapes will be finalized and their generation will move towards a more automatic approach.

To spare users the need to learn yet another syntax, i.e., the ShEx syntax, work on the ShEx creator will continue. Currently, the ShEx creator is a rough prototype. Future work consists of (i) making the code more stable and potentially publishing it as an npm module and (ii) integrating the ShEx creator within the RDFShape tool website, so that it can be combined with existing functionality, e.g., ShEx visualization in the RDFShape platform.

RDFShape will continue using user feedback to improve the services provided, taking into account the scalability requirements of big SPARQL endpoints. Several issues appeared when validating those big data portals, such as the need to improve error messages and to handle streaming validation for big RDF data. Regarding the visualization, we will work in a direction similar to the one taken by the Japanese Life Science Database Integration portal. This portal uses manually drawn data model representations that combine instances and schemas. In this way, it can show a visualization that users follow more easily, as they observe real data rather than only the underlying model. Future work could extend the visualization capabilities of RDFShape to generate this kind of visualization automatically. Other future work on the RDFShape platform includes the development of Jupyter notebooks integrating and showcasing the different tools provided.

The BioThings team also plans to continue its work after the hackathon to allow publishing and visualizing ShEx schemas in Schema Playground, along with the support of schemas defined in schema.org and JSON-schema format. The ShEx parsing tools developed at RDFShape will be adopted to convert an input ShEx schema into its JSON format for indexing purposes. The visualization tool from RDFShape can also be used to generate the graph representation of a ShEx schema.

Jupyter notebooks created

Bioschemas ShEx shapes: https://github.com/biotea/validation-shapes-bioschemas

References

Baker, Thomas, and Eric Prud’hommeaux. “Shape Expressions (ShEx) Primer,” October 9, 2019. https://shexspec.github.io/primer/.
Garcia, Alexander, Federico Lopez, Leyla Garcia, Olga Giraldo, Victor Bucheli, and Michel Dumontier. “Biotea: Semantics for Pubmed Central.” PeerJ 6 (January 2, 2018): e4201. https://doi.org/10.7717/peerj.4201.
Gray, Alasdair J. G., Joachim Baran, M. Scott Marshall, and Michel Dumontier. “Dataset Descriptions: HCLS Community Profile,” 2015. https://www.w3.org/TR/hcls-dataset/.
Gray, Alasdair J. G., Carole Goble, and Rafael C. Jimenez. “From Potato Salad to Protein Annotation,” 2017.
Knublauch, Holger, and Dimitris Kontokostas. “Shapes Constraint Language (SHACL),” July 20, 2017. https://www.w3.org/TR/shacl/.
Labra-Gayo, Jose Emilio, Daniel Fernandez-Alvarez, and Herminio Garcia-Gonzalez. “RDFShape: An RDF Playground Based on Shapes,” 2018. http://ceur-ws.org/Vol-2180/paper-35.pdf.
Labra-Gayo, Jose Emilio, Herminio García-González, Daniel Fernández-Alvarez, and Eric Prud’hommeaux. “Challenges in RDF Validation.” In Current Trends in Semantic Web Technologies: Theory and Practice, edited by Giner Alor-Hernández, José Luis Sánchez-Cervantes, Alejandro Rodríguez-González, and Rafael Valencia-García, 121–51. Studies in Computational Intelligence. Cham: Springer International Publishing, 2019. https://doi.org/10.1007/978-3-030-06149-4_6.
Piñero, Janet, Àlex Bravo, Núria Queralt-Rosinach, Alba Gutiérrez-Sacristán, Jordi Deu-Pons, Emilio Centeno, Javier García-García, Ferran Sanz, and Laura I. Furlong. “DisGeNET: A Comprehensive Platform Integrating Information on Human Disease-Associated Genes and Variants.” Nucleic Acids Research 45, no. D1 (January 4, 2017): D833–39. https://doi.org/10.1093/nar/gkw943.
Wilkinson, Mark D., Michel Dumontier, Susanna-Assunta Sansone, Luiz Olavo Bonino da Silva Santos, Mario Prieto, Peter McQuilton, Julian Gautier, Derek Murphy, Mercè Crosas, and Erik Schultes. “Evaluating FAIR-Compliance Through an Objective, Automated, Community-Governed Framework.” BioRxiv, September 25, 2018, 418376. https://doi.org/10.1101/418376.
