add support for schema.org #263

Open
wants to merge 5 commits into
base: master

tomkralidis
Member

Fixes #231. Also adds an early exit for schema autodetection (stops at the first schema found).

@tomkralidis tomkralidis requested a review from pvgenuchten March 25, 2025 14:34
Contributor

@pvgenuchten pvgenuchten left a comment

Nice work Tom,

Write support

I get the impression you didn't check your implementation against https://validator.schema.org/, because it still has quite a few validation issues; see below.

Currently the Dataset type is not detected by https://validator.schema.org/.
When using the validator, make sure to embed the JSON in

<script type="application/ld+json">{}</script>
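A minimal sketch of wrapping a generated record for pasting into the validator. The helper name `to_jsonld_script` and the sample record are hypothetical and not part of pygeometa.

```python
import json


def to_jsonld_script(record: dict) -> str:
    """Wrap a JSON-LD record in the script element the validator expects."""
    payload = json.dumps(record, indent=2)
    # Escape "</" so a string value can never close the script element early;
    # "\/" is a valid JSON escape for "/".
    payload = payload.replace('</', '<\\/')
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Hypothetical sample record, for illustration only.
sample = {
    '@context': 'http://schema.org/',
    '@type': 'Dataset',
    'name': 'Example dataset'
}
html = to_jsonld_script(sample)
```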

I wonder if we should use some of this work inside pycsw/pygeoapi...

I noticed this on distribution:
[image]
It should be @type: 'schema:DataDownload'.
A format or encoding property can be used for the MIME type (schema.org defines encodingFormat for this).
It seems the validator treats 'type' as '@type'.

Read support

I notice you also added read support. I tried it with:

pygeometa metadata import schema-org.json --schema schema-org -v DEBUG

and got:

WARNING:pygeometa.core:Import failed: list indices must be integers or slices, not str
null
...

When debugging:

from pygeometa.schemas.schema_org import SchemaOrgOutputSchema
sos = SchemaOrgOutputSchema()
f = open("./schema-org.json", "r")
f2 = sos.import_(f.read())

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/geopython/lib/python3.10/site-packages/pygeometa-0.17.dev1-py3.10.egg/pygeometa/schemas/schema_org/__init__.py", line 116, in import_
    geo = md['spatialCoverage']['geo']
TypeError: list indices must be integers or slices, not str

It would be nice if this error were reported by the command-line client.
schema-org.zip

The interesting part here is that RDF typically allows either a single item or a list of items as the content of an element. That brings us to the next topic: this implementation seems to expect a JSON-LD serialisation of RDF, which is indeed the most common form of schema.org. However, quite a few implementations of schema.org use RDFa or microdata. In theory one can also serialise schema.org as Turtle or RDF/XML. To support those cases, rdflib can be used to read the RDF and serialise it to JSON-LD before parsing.
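The single-or-list point can be handled with a small normalisation step before indexing. This is only a sketch: `as_list` and `first_geo` are hypothetical helpers, not functions in the pull request, and it covers only the JSON-LD case.

```python
def as_list(value):
    """Return value as a list, whether it was absent, a single item or a list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def first_geo(md: dict):
    """Return the first 'geo' object found under spatialCoverage, or None."""
    for place in as_list(md.get('spatialCoverage')):
        if isinstance(place, dict) and 'geo' in place:
            return place['geo']
    return None


# Both shapes are valid JSON-LD and should give the same result.
single = {'spatialCoverage': {'geo': {'box': '1 2 3 4'}}}
multiple = {'spatialCoverage': [{'geo': {'box': '1 2 3 4'}}]}
```

With this in place, `md['spatialCoverage']['geo']` no longer fails when `spatialCoverage` is a list.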

After fixing the spatialCoverage handling, the next error is:

  File "/geopython/lib/python3.10/site-packages/pygeometa-0.17.dev1-py3.10.egg/pygeometa/schemas/schema_org/__init__.py", line 123, in import_
    mcf['spatial']['datatype'] = 'vector'
KeyError: 'spatial'

It seems the datatype is set before mcf['spatial'] is initialized.
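One possible fix, sketched under the assumption that `mcf` is a plain dict at this point, is to create the section on demand. The partial `mcf` below is hypothetical.

```python
# Hypothetical partial MCF without a 'spatial' section yet.
mcf = {'metadata': {}, 'identification': {}}

# setdefault creates the 'spatial' section only if it is missing,
# so the assignment cannot raise KeyError.
mcf.setdefault('spatial', {})['datatype'] = 'vector'
```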

LOGGER.debug('Generating baseline record')
record = {
    'identifier': mcf['metadata']['identifier'],
    '@type': dict(zip(TYPES.values(), TYPES.keys()))[mcf['metadata']['hierarchylevel']], # noqa
Contributor

@pvgenuchten pvgenuchten Mar 28, 2025

This should have the schema: prefix, e.g. "@type": "schema:journalpaper", and a context, e.g. "@context": "http://schema.org/".
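A sketch of what the baseline record could look like with both suggestions applied. The contents of `TYPES` here are hypothetical (the real table in the pull request may differ), as is the `baseline_record` helper.

```python
# Hypothetical mapping of schema.org types to MCF hierarchy levels.
TYPES = {
    'Dataset': 'dataset',
    'Service': 'service'
}

# Invert once instead of rebuilding the dict on every lookup.
MCF_TO_SCHEMA_ORG = {mcf_type: so_type for so_type, mcf_type in TYPES.items()}


def baseline_record(identifier: str, hierarchylevel: str) -> dict:
    """Build the record skeleton with a context and a prefixed type."""
    so_type = MCF_TO_SCHEMA_ORG[hierarchylevel]
    return {
        '@context': 'http://schema.org/',
        '@type': f'schema:{so_type}',
        'identifier': identifier
    }


record = baseline_record('abc-123', 'dataset')
```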

'box': f'{miny},{minx} {maxy},{maxx}'
}
}],
'title': title[0],
Contributor

The title property does not exist in schema.org; it is called name.


if 'url' in license:
    LOGGER.debug('Encoding license as link')
    license_link = {
Contributor

license is not a distribution; see https://schema.org/license

record['license'] = license['name']

LOGGER.debug('Checking for distribution')
for value in mcf['distribution'].values():
Contributor

Each distribution should be of type https://schema.org/DataDownload
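A rough sketch of encoding MCF distribution entries as DataDownload objects. The helper `to_data_download` is hypothetical, and the assumption that each MCF entry carries `url`, `name` and a MIME type in `type` may not hold for every record.

```python
def to_data_download(dist: dict) -> dict:
    """Encode one MCF distribution entry as a schema.org DataDownload."""
    download = {
        '@type': 'schema:DataDownload',
        'contentUrl': dist['url']
    }
    if dist.get('name'):
        download['name'] = dist['name']
    # Assumption: 'type' holds a MIME type; map it to encodingFormat.
    if dist.get('type'):
        download['encodingFormat'] = dist['type']
    return download


# Hypothetical MCF distribution section.
mcf_distribution = {
    'csv': {
        'url': 'https://example.com/data.csv',
        'type': 'text/csv',
        'name': 'CSV download'
    }
}
distribution = [to_data_download(v) for v in mcf_distribution.values()]
```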

record['datePublished'] = generate_datetime(value)

LOGGER.debug('Checking for contacts')
record['contacts'] = self.generate_contacts(
Contributor

contacts is not a schema.org property; it should be split by role:

dataset.creator: {},
dataset.funder: [{}, {}]

and each entry should be of type https://schema.org/Person or https://schema.org/Organization
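A sketch of splitting contacts by role. The role-to-property table, the fallback to `contributor`, and the MCF contact keys (`individualname`, `organization`, `email`) are assumptions made for illustration, not the mapping the pull request must use.

```python
# Hypothetical mapping of MCF contact roles to schema.org properties.
ROLE_TO_PROPERTY = {
    'author': 'author',
    'originator': 'creator',
    'publisher': 'publisher',
    'funder': 'funder'
}


def to_agent(contact: dict) -> dict:
    """Encode one contact as a schema.org Person or Organization."""
    if contact.get('individualname'):
        agent = {'@type': 'schema:Person', 'name': contact['individualname']}
    else:
        agent = {'@type': 'schema:Organization',
                 'name': contact.get('organization')}
    # schema.org uses a single 'email' property, not an 'emails' list.
    if contact.get('email'):
        agent['email'] = contact['email']
    return agent


def contacts_by_role(contacts: dict) -> dict:
    """Group contacts under one schema.org property per role."""
    properties = {}
    for role, contact in contacts.items():
        # Assumption: unmapped roles fall back to 'contributor'.
        prop = ROLE_TO_PROPERTY.get(role, 'contributor')
        properties.setdefault(prop, []).append(to_agent(contact))
    return properties


# Hypothetical MCF contact section.
contacts = {
    'originator': {'organization': 'Example Org', 'email': 'foo@example.com'},
    'funder': {'individualname': 'Jane Doe'}
}
by_role = contacts_by_role(contacts)
```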

rp['phones'] = [{'value': phone}]

if contact.get('email') is not None:
    rp['emails'] = [{'value': contact.get('email')}]
Contributor

emails is not a schema.org property; use email instead, e.g.
organisation.email: 'foo@example.com'

'@type': 'Place',
'geo': {
    '@type': 'GeoShape',
    'box': f'{miny},{minx} {maxy},{maxx}'
Member

The GeoShape box for schema.org should be formatted as "box": "miny minx maxy maxx" (note: no commas).

Contributor

I've seen other implementations use "y1,x1 y2,x2", which is parsed properly by Google Dataset Search:

https://inspire-geoportal.ec.europa.eu/srv/api/records/e4778203-5558-4278-9183-b3b9a59191a4

Maybe "y1 x1 y2 x2" also works; the specs are unfortunately not clear.

Member

Indeed, Google Dataset Search is more forgiving; however, I believe we should be consistent with "box": "miny minx maxy maxx" across all FOSS4G projects for schema.org.
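The space-separated form proposed in this thread could be produced as follows. This is a sketch; `format_box` is a hypothetical helper, and as noted above the specification does not clearly settle the separator.

```python
def format_box(minx: float, miny: float, maxx: float, maxy: float) -> str:
    """Format a bounding box as 'miny minx maxy maxx' (no commas)."""
    return f'{miny} {minx} {maxy} {maxx}'


# Hypothetical bounding box, for illustration only.
spatial_coverage = {
    '@type': 'Place',
    'geo': {
        '@type': 'GeoShape',
        'box': format_box(-141.0, 41.7, -52.6, 83.1)
    }
}
```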

@jmckenna
Member

Seconded. Good work, but testing through https://validator.schema.org/ is critical (I wish there were an API available for validation, instead of testing manually through the validator).

@pvgenuchten
Contributor

The validator is not available as a service, but a SHACL-oriented test is available at https://github.com/google/schemarama/blob/main/core/test/shacl-test.js

Successfully merging this pull request may close these issues.

add schema.org schema
3 participants