add support for schema.org #263
base: master
Nice work Tom,
Write support
I get the impression you didn't check your implementation on https://validator.schema.org/, because it still has quite a few validation issues; see below.
Currently the Dataset type is not detected by https://validator.schema.org/.
when using the validator, make sure to embed json in
I wonder if we should use some of this work inside pycsw/pygeoapi...
Noticed this on distribution:
should be @type: 'schema:DataDownload'
format or encoding can be used for the mimetype
It seems the validator treats 'type' as '@type'.
Read support
I notice you also added read support. I tried it with:
pygeometa metadata import schema-org.json --schema schema-org -v DEBUG
and got:
WARNING:pygeometa.core:Import failed: list indices must be integers or slices, not str
null
...
when debugging
from pygeometa.schemas.schema_org import SchemaOrgOutputSchema
sos = SchemaOrgOutputSchema()
f = open("./schema-org.json", "r")
f2 = sos.import_(f.read())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/geopython/lib/python3.10/site-packages/pygeometa-0.17.dev1-py3.10.egg/pygeometa/schemas/schema_org/__init__.py", line 116, in import_
    geo = md['spatialCoverage']['geo']
TypeError: list indices must be integers or slices, not str
It would be nice if this error were reported by the command-line client.
schema-org.zip
The interesting part here is that RDF typically allows either a single item or a list of items as the content of an element. That brings us to the next topic: this implementation seems to expect a JSON-LD serialisation of RDF, which is indeed the most common form of schema.org. However, quite a few implementations of schema.org use RDFa or microdata. In theory one can also serialise schema.org as Turtle or RDF/XML. To support those cases, rdflib can be used to read the RDF and serialise it to JSON-LD before parsing.
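The TypeError above is what happens when spatialCoverage arrives as a list instead of a single object. A minimal sketch of normalising single-or-list values before indexing, assuming plain dicts parsed from JSON-LD; as_list and first_geo are hypothetical helper names, not part of pygeometa:

```python
import json


def as_list(value):
    """Wrap value in a list unless it already is one (None becomes [])."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def first_geo(md):
    """Return the first 'geo' object found in spatialCoverage, or None."""
    for place in as_list(md.get('spatialCoverage')):
        if isinstance(place, dict) and 'geo' in place:
            return place['geo']
    return None


# Both shapes are valid JSON-LD for the same statement
single = json.loads('{"spatialCoverage": {"geo": {"box": "1 2 3 4"}}}')
listed = json.loads('{"spatialCoverage": [{"geo": {"box": "1 2 3 4"}}]}')
```

With a helper like this, the import code can treat both shapes the same way instead of indexing spatialCoverage directly.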
After fixing the spatialCoverage handling, the next error is:
  File "/geopython/lib/python3.10/site-packages/pygeometa-0.17.dev1-py3.10.egg/pygeometa/schemas/schema_org/__init__.py", line 123, in import_
    mcf['spatial']['datatype'] = 'vector'
KeyError: 'spatial'
It seems the datatype is set before spatial is initialized.
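One possible way to avoid this KeyError, sketched with a plain dict standing in for the MCF, is to create the spatial section before writing into it:

```python
mcf = {'metadata': {}}  # stand-in for the MCF being built during import

# setdefault() creates the 'spatial' section if it is missing and returns it,
# so the assignment works whether or not 'spatial' already exists
mcf.setdefault('spatial', {})['datatype'] = 'vector'
```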
LOGGER.debug('Generating baseline record')
record = {
    'identifier': mcf['metadata']['identifier'],
    '@type': dict(zip(TYPES.values(), TYPES.keys()))[mcf['metadata']['hierarchylevel']],  # noqa
This should have the schema: prefix, e.g. "@type": "schema:journalpaper", and a context, e.g. "@context": "http://schema.org/".
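A sketch of what the baseline record could look like with a context and a prefixed type. The TYPES mapping here is a hypothetical two-entry stand-in, not the one in the PR, and baseline_record is an illustrative name:

```python
import json

# Hypothetical stand-in mapping: schema.org type -> MCF hierarchylevel
TYPES = {'Dataset': 'dataset', 'Service': 'service'}
# Invert once instead of rebuilding the dict for every record
HIERARCHYLEVEL_TO_TYPE = {v: k for k, v in TYPES.items()}


def baseline_record(identifier, hierarchylevel):
    """Build the top-level keys of a schema.org JSON-LD record."""
    return {
        '@context': 'http://schema.org/',
        '@type': f'schema:{HIERARCHYLEVEL_TO_TYPE[hierarchylevel]}',
        'identifier': identifier,
    }


print(json.dumps(baseline_record('abc-123', 'dataset'), indent=2))
```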
            'box': f'{miny},{minx} {maxy},{maxx}'
        }
    }],
    'title': title[0],
The title property does not exist in schema.org; it is called name.
if 'url' in license:
    LOGGER.debug('Encoding license as link')
    license_link = {
The license is not a distribution; see https://schema.org/license
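schema.org defines license as a property of the work itself, taking a URL or a CreativeWork. A minimal sketch, assuming the MCF license dict carries optional name and url keys as the diff suggests; encode_license is a hypothetical helper:

```python
def encode_license(license):
    """Return a value for record['license']: a CreativeWork, a URL, or a name."""
    name, url = license.get('name'), license.get('url')
    if url and name:
        # Both known: describe the license as a CreativeWork
        return {'@type': 'CreativeWork', 'name': name, 'url': url}
    # Only one known: a bare URL is valid; a bare name is the fallback
    return url or name
```

The result would be assigned to record['license'] directly rather than appended to the distribution list.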
record['license'] = license['name']

LOGGER.debug('Checking for distribution')
for value in mcf['distribution'].values():
Each distribution should be of type https://schema.org/DataDownload
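A sketch of encoding distributions as DataDownload objects. It assumes each MCF distribution entry has a url and optional name and type keys, and that type holds a media type; that mapping to encodingFormat is an assumption, and encode_distribution is a hypothetical helper:

```python
def encode_distribution(distribution):
    """Turn MCF distribution entries into a list of schema.org DataDownload."""
    downloads = []
    for value in distribution.values():
        download = {
            '@type': 'schema:DataDownload',
            'contentUrl': value['url'],
        }
        if value.get('name'):
            download['name'] = value['name']
        if value.get('type'):
            # Assumption: the MCF 'type' field carries a media type
            download['encodingFormat'] = value['type']
        downloads.append(download)
    return downloads
```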
record['datePublished'] = generate_datetime(value)

LOGGER.debug('Checking for contacts')
record['contacts'] = self.generate_contacts(
contacts is not a schema.org property; it should be split by role, e.g.
dataset.creator: {},
dataset.funder: [{}, {}]
and each should be of type https://schema.org/Person or https://schema.org/Organization
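Splitting contacts by role could be sketched as follows. The role-to-property mapping and the MCF contact keys (individualname, organization, email) are assumptions for illustration, and both helper names are hypothetical:

```python
# Hypothetical mapping from MCF contact role to schema.org property
ROLE_TO_PROPERTY = {
    'author': 'creator',
    'funder': 'funder',
    'publisher': 'publisher',
}


def encode_contact(contact):
    """Encode one contact as a schema.org Person or Organization."""
    if contact.get('individualname'):
        agent = {'@type': 'schema:Person', 'name': contact['individualname']}
    else:
        agent = {'@type': 'schema:Organization',
                 'name': contact.get('organization')}
    if contact.get('email'):
        agent['email'] = contact['email']  # singular 'email', a plain string
    return agent


def encode_contacts(contacts):
    """Group contacts under the schema.org property matching their role."""
    record = {}
    for role, contact in contacts.items():
        # Roles without a known mapping fall back to 'contributor'
        prop = ROLE_TO_PROPERTY.get(role, 'contributor')
        record.setdefault(prop, []).append(encode_contact(contact))
    return record
```

The returned dict would be merged into the record, so that each role becomes its own top-level property instead of a single contacts key.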
rp['phones'] = [{'value': phone}]

if contact.get('email') is not None:
    rp['emails'] = [{'value': contact.get('email')}]
emails is not a schema.org property; use email with a plain string value, e.g.
organisation.email: 'foo@example.com'
'@type': 'Place',
'geo': {
    '@type': 'GeoShape',
    'box': f'{miny},{minx} {maxy},{maxx}'
The GeoShape box for schema.org should be formatted as: "box": "miny minx maxy maxx"
(note: no commas)
I've seen other implementations use "y1,x1 y2,x2", which is parsed properly by Google Dataset Search:
https://inspire-geoportal.ec.europa.eu/srv/api/records/e4778203-5558-4278-9183-b3b9a59191a4
Maybe "y1 x1 y2 x2" also works; the specs are unfortunately not clear.
Indeed, Google Dataset Search is more forgiving. However, I believe we should be consistent and use "box": "miny minx maxy maxx"
for schema.org across all FOSS4G projects.
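A small sketch of both sides of this discussion: writing the box in the space-separated form, and reading either form back for import. The function names are hypothetical:

```python
def format_box(minx, miny, maxx, maxy):
    """Write a GeoShape box as 'miny minx maxy maxx' (no commas)."""
    return f'{miny} {minx} {maxy} {maxx}'


def parse_box(box):
    """Read 'y1 x1 y2 x2' or 'y1,x1 y2,x2' into (minx, miny, maxx, maxy)."""
    miny, minx, maxy, maxx = (float(v) for v in box.replace(',', ' ').split())
    return minx, miny, maxx, maxy
```

Being strict on output and tolerant on input keeps generated records consistent while still importing records from implementations that use commas.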
Seconded. Good work, but testing through https://validator.schema.org/ is critical (I wish there were an API available for validation, instead of testing manually through the validator).
The validator is not available as a service, but a SHACL-oriented test is available at https://github.com/google/schemarama/blob/main/core/test/shacl-test.js
Fixes #231. Also adds early out for autodetection (first schema found).