Representing geospatial data, a set of approaches #105

Open
fils opened this issue Jun 1, 2020 · 42 comments
Labels
Dataset question Further information is requested

Comments

@fils
Collaborator

fils commented Jun 1, 2020

Based on some work going on in the IGSN community, we have been looking at what recommendation to make to this community. With help from @datadavev @ashepherd @abhritchie @dblodgett-usgs @jesserobertson, I've worked up the following for discussion.

The goal of this simple data graph is to present 4 options for representing spatial data available to us.

  1. subjectOf link to .geojson (the "linked data" pattern)
  2. the geosparql:hasGeometry.. (the OGC spatial in graph pattern)
  3. JSON literal... (the embed JSON in JSON-LD as a literal pattern see geoblob in the context and the body) This approach requires JSON-LD 1.1
  4. schema.org

This is being put forth for discussion and comments so that we can better refine the example. The basic POV is that there are many ways to represent spatial data, and any given community may have use cases that drive its choice.

For example, the schema.org approach is likely the only one that will deliver spatial data to Google, while the other approaches allow for the representation of CRS parameters or deliver spatial data more aligned with OGC patterns. GeoSPARQL, for example, is likely the best pattern for spatial data in spatially aware triple stores.

This issue is more to provide information on the options and not provide a recommendation.

Gist link: https://gist.github.com/fils/5899894e5d5783f8da0f92043a97badd?short_path=86de4fd

Load to playground: https://tinyurl.com/y9zajhov

{
    "@context": {
        "@version": 1.1,
        "geoblob": {
            "@id": "http://example.com/vocab/json",
            "@type": "@json"
        },
        "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        "xsd": "http://www.w3.org/2001/XMLSchema#",
        "description": "http://igsn.org/core/v1/description",
        "geosparql": "http://www.opengis.net/ont/geosparql#",
        "schema": "https://schema.org/"
    },
    "@id": "https://samples.earth/id/do/bqs2dn2u6s73o70jdup0",
    "@type": "http://igsn.org/core/v1/Sample",
    "description": "A fake ID for testing",
    "schema:subjectOf": [
        {
            "schema:url": "https://samples.earth/id/do/bqs2dn2u6s73o70jdup0.geojson",
            "@type": "schema:DigitalDocument",
            "schema:format": [
                "application/vnd.geo+json"
            ],
            "schema:conformsTo": "https://igsn.org/schema/spatial.schema.json"
        }
    ],
    "geosparql:hasGeometry": {
        "@id": "_:N98e75cacc29f40deb555eb583cb162dc",
        "@type": "http://www.opengis.net/ont/sf#Point",
        "geosparql:asWKT": {
            "@type": "http://www.opengis.net/ont/geosparql#wktLiteral",
            "@value": "POINT(-76 -18)"
        },
        "geosparql:crs": {
            "@id": "http://www.opengis.net/def/crs/OGC/1.3/CRS84"
        }
    },
    "geoblob": {
        "type": "GeometryCollection",
        "geometries": [{
            "type": "Point",
            "coordinates": [-76, -18]
        }]
    },
    "schema:spatialCoverage": {
        "@type": "schema:Place",
        "schema:geo": {
          "@type": "schema:GeoCoordinates",
          "schema:latitude": -18,
          "schema:longitude": -76
        }
      }
}
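As a rough illustration of how the three inline representations in the example relate, here is a minimal Python sketch that pulls the point out of each one and checks that they agree. The helper names are hypothetical, the WKT parser handles only a 2-D `POINT`, and it assumes x is longitude and y is latitude (as in CRS84). Note that schema.org names latitude and longitude explicitly, while WKT and GeoJSON rely on position order.

```python
import re

# Values copied from the example document above (only the spatial parts).
doc = {
    "geosparql:hasGeometry": {"geosparql:asWKT": {"@value": "POINT(-76 -18)"}},
    "geoblob": {
        "type": "GeometryCollection",
        "geometries": [{"type": "Point", "coordinates": [-76, -18]}],
    },
    "schema:spatialCoverage": {
        "schema:geo": {"schema:latitude": -18, "schema:longitude": -76}
    },
}

# Matches a 2-D WKT point only; anything else is rejected.
_WKT_POINT = re.compile(
    r"\s*POINT\s*\(\s*([-+0-9.eE]+)\s+([-+0-9.eE]+)\s*\)\s*", re.IGNORECASE
)


def lon_lat_from_wkt(wkt):
    """Return (lon, lat) from a 2-D WKT POINT, assuming x=lon, y=lat."""
    match = _WKT_POINT.fullmatch(wkt)
    if match is None:
        raise ValueError(f"not a 2-D WKT point: {wkt!r}")
    return float(match.group(1)), float(match.group(2))


def lon_lat_from_geojson(geometry_collection):
    """Return (lon, lat) of the first geometry, assumed to be a Point."""
    lon, lat = geometry_collection["geometries"][0]["coordinates"][:2]
    return float(lon), float(lat)


def lon_lat_from_schema(place):
    """Return (lon, lat) from a schema:Place with schema:GeoCoordinates."""
    geo = place["schema:geo"]
    return float(geo["schema:longitude"]), float(geo["schema:latitude"])


points = {
    "wkt": lon_lat_from_wkt(doc["geosparql:hasGeometry"]["geosparql:asWKT"]["@value"]),
    "geojson": lon_lat_from_geojson(doc["geoblob"]),
    "schema.org": lon_lat_from_schema(doc["schema:spatialCoverage"]),
}

# All three inline encodings should describe the same location.
assert len(set(points.values())) == 1
```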
@rduerr
Collaborator

rduerr commented Jun 1, 2020

Just checking: do all four methods support all of the various types of spatial data? For example, bounding boxes, sets of points, etc.?

@fils
Collaborator Author

fils commented Jun 1, 2020

@rduerr

The "subjectOf" approach is a link to external GeoJSON, so the full set of geometries GeoJSON supports is available there.

The GeoSPARQL approach is really just WKT as a geosparql:wktLiteral, so again all the geometry types of WKT are available.

The schema.org approach is https://schema.org/GeoShape (with all the wonderful features of that.. sigh).

The JSON literal approach should work, but my only concern there is that I've never tested it with a highly complex GeoJSON document as the literal. If we can break it, that is likely something to report to JSON-LD 1.1 itself as an issue. So my guess (hope) is that this has all been covered.
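On the WKT option: a GeoSPARQL `wktLiteral` may optionally start with a CRS IRI in angle brackets, and CRS84 is assumed when none is given. A minimal sketch of splitting such a literal into its CRS and WKT parts follows. The function name is hypothetical, and the sketch does not validate the WKT itself.

```python
import re

# Default CRS for a wktLiteral that carries no leading IRI.
DEFAULT_CRS = "http://www.opengis.net/def/crs/OGC/1.3/CRS84"

# Optional "<crs-iri>" prefix, then the WKT text itself.
_WKT_LITERAL = re.compile(
    r"^\s*(?:<(?P<crs>[^>]+)>\s*)?(?P<wkt>.+?)\s*$", re.DOTALL
)


def split_wkt_literal(literal):
    """Return (crs_iri, wkt_text) for the lexical form of a wktLiteral."""
    match = _WKT_LITERAL.match(literal)
    if match is None:
        raise ValueError("empty wktLiteral")
    return match.group("crs") or DEFAULT_CRS, match.group("wkt")


# No CRS prefix: falls back to CRS84 (longitude, latitude order).
print(split_wkt_literal("POINT(-76 -18)"))

# Explicit CRS prefix: axis order is then whatever that CRS defines.
print(split_wkt_literal("<http://www.opengis.net/def/crs/EPSG/0/4326> POINT(-18 -76)"))
```

A client that only understands CRS84 can use the first element of the result to decide whether it can safely interpret the coordinates.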

@fils
Collaborator Author

fils commented Jun 1, 2020

For those curious about GeoJSON-LD and why it's not here I'd reference you to opengeospatial/SELFIE#52

@jesserobertson

The only thing you'd have to be careful about with the direct embedding approaches (both WKT and the plain ol' GeoJSON blob) would be performance.

Wouldn't be great UX if you've embedded JSON data as a header in a landing page and a complex geometry requires your browser to load 1 Gb of GeoJSON before the page renders...

@dr-shorthair
Collaborator

dr-shorthair commented Jun 2, 2020

Yes, if you embed serialized geometry in the triple store it can blow out storage and performance.

In Loc-I we separated out all the geometry data into a separate store, and only serialize it on demand. See http://loci.cat/geometry-data-service.html

We have an instance deployed here https://gds.loci.cat/
e.g. https://gds.loci.cat/geometry/asgs16_sosr/203

There are multiple representations available using conneg & args

Also, simplified geometries:

And centroids:

The URI is the same for all of these, just variations in the args.

Swagger here: https://gds.loci.cat/api/doc/

@dr-shorthair
Collaborator

dr-shorthair commented Jun 2, 2020

@jyucsiro did the implementation.
@jyucsiro @benjaminleighton and @dr-shorthair did the design.

@jesserobertson

@dr-shorthair @jyucsiro @benjaminleighton nice API!

I guess it's slightly orthogonal to the web architecture that we're proposing in IGSN though, since you don't want to have to understand an API to crawl pages right?

@jesserobertson

One point we raised in the sprint meeting tonight is that these aren't mutually exclusive ways of publishing data - if publishers wanted to support both complex JSON and a simplified geometry for schema.org/Google purposes (e.g. a bounding box or centroid) they could include both serializations in the document at the same time.
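To make the "include both" idea concrete, here is a hedged Python sketch that derives a preview-grade bounding box and center point from a GeoJSON geometry and expresses them with schema.org terms. The function names are hypothetical. It assumes the `schema:box` convention of "south west north east" (latitude before longitude), uses the bounding-box center rather than a true centroid, and ignores geometries that cross the antimeridian.

```python
def iter_positions(geometry):
    """Yield every position ([lon, lat, ...]) in a GeoJSON geometry."""
    if geometry["type"] == "GeometryCollection":
        for member in geometry["geometries"]:
            yield from iter_positions(member)
        return

    def walk(coords):
        # A position is a list whose first element is a number.
        if coords and isinstance(coords[0], (int, float)):
            yield coords
        else:
            for child in coords:
                yield from walk(child)

    yield from walk(geometry["coordinates"])


def preview_place(geometry):
    """Build a schema:Place holding a bounding box and its center point."""
    lons, lats = zip(*((p[0], p[1]) for p in iter_positions(geometry)))
    west, east = min(lons), max(lons)
    south, north = min(lats), max(lats)
    return {
        "@type": "schema:Place",
        "schema:geo": [
            {
                "@type": "schema:GeoShape",
                # Assumed order: south west north east.
                "schema:box": f"{south} {west} {north} {east}",
            },
            {
                "@type": "schema:GeoCoordinates",
                "schema:latitude": (south + north) / 2,
                "schema:longitude": (west + east) / 2,
            },
        ],
    }


polygon = {
    "type": "Polygon",
    "coordinates": [[[-77, -19], [-75, -19], [-75, -17], [-77, -17], [-77, -19]]],
}
place = preview_place(polygon)
```

The full-resolution geometry can then travel alongside this in the same document, or behind a link, using any of the other options.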

@dr-shorthair
Collaborator

> I guess it's slightly orthogonal to the web architecture that we're proposing

Is it? Instead of inlining the geometry, you can have a URI reference. Then your crawler can follow the link. The basic link to a geometry gets you a web page which is littered with more links which can be crawled.

You can set an HTTP Accept header to get the geometry in whichever representation you want without needing to understand the API - try it!

https://gds.loci.cat/geometry/asgs16_sosr/203
Accept: text/plain or text/turtle or application/json or text/html
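A minimal sketch of that request using only the Python standard library. The function names are hypothetical, and whether the service is still reachable and which media types it honours are assumptions taken from the comment above.

```python
from urllib.request import Request, urlopen

# URI taken from the comment above; availability is not guaranteed.
GEOMETRY_URI = "https://gds.loci.cat/geometry/asgs16_sosr/203"


def build_request(uri, media_type):
    """Build a GET request that asks for one representation of the resource."""
    return Request(uri, headers={"Accept": media_type})


def fetch(uri, media_type, timeout=30):
    """Return (content type reported by the server, response body as bytes)."""
    with urlopen(build_request(uri, media_type), timeout=timeout) as response:
        return response.headers.get_content_type(), response.read()


if __name__ == "__main__":
    # Same URI each time; only the Accept header changes.
    for media_type in ("text/plain", "text/turtle", "application/json", "text/html"):
        content_type, body = fetch(GEOMETRY_URI, media_type)
        print(media_type, "->", content_type, len(body), "bytes")
```

A crawler that already follows links needs nothing beyond the `Accept` header to take part in this, which is the point being made here.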

@dr-shorthair
Collaborator

@jyucsiro could we add schema:GeoShape to the options in GDS?

@dr-shorthair
Collaborator

and schema:Place (maybe linked into the 'centroid' group, since it is just a lat-long).

@jesserobertson

> Is it? Instead of inlining the geometry, you can have a URI reference.

Agreed it's not from the PoV of the crawler. It might be from the PoV of the publisher if this is another service that needs to be run. It's a good option for publishing if they've already got a spatial store somewhere (likely if spatial data is important to them).

@dblodgett-usgs

We explored this in ELFIE. Here's where we landed. https://docs.opengeospatial.org/per/18-097.html#_preview_geometry

Search that doc for "geojson" to see other places that might have interesting content re: this issue.

At the end of the day, I think it's critical to think about "what is the use case?" AND "who is the client?"

The answer to the first will dictate whether you need an "analysis-grade" geometry and all that entails or if you can get away with a "preview-grade" geometry.

The second will answer what encoding you can get away with and what kind of network architecture your client will tolerate.

As unsatisfying as it is, a point and/or convex hull encoded in a schema:geo block is probably about all we should be considering in the world of search indexing.

In other use cases -- linked data graphs, for example -- I think the calculus is much more nuanced but we probably want to lean toward what the GeoSPARQL folks decided and some of the network architecture logic (in line or not) that @dr-shorthair applies in spades there.

@abhritchie

Building on/recapitulating what @dblodgett-usgs says and summarizing a side conversation that has been running with him, @fils, @jesserobertson, and @ashepherd...

One thing that is becoming clear in the discussions the ELFIEs have provoked is that there is a useful set of data shapes that can be defined and implemented based on media type, with each media type being better aligned to particular clients and their use cases.

JSON-LD (using the schema.org vocabulary) as structured data in HTML is perfectly pitched at the indexing use case and, picking an example entirely at random, Google's expectations as a client. Using schema.org geometries in this context isn't unsatisfying at all - we are speaking the language of the target audience.

Meanwhile, JSON-LD allows us to be more expressive for data engineers and scientists, expanding our vocabulary to use domain ontologies, including GeoSPARQL and its more robust spatial data types (well understood data types that are much easier for me to use in other systems, like PostGIS).

Straight away we can access the content we need using content negotiation. (I know there are nuances here, but it is a good start.)

GeoJSON nicely straddles these worlds allowing us to provide rich spatial representations to things like web applications, but at a cost - most notably the restriction to WGS84 as a CRS. This isn't a problem. GeoJSON is successful because it does a few things and does them well. Where it doesn't meet our needs we have alternatives (JSON-LD+GeoSPARQL). I labour this point because there is a tendency to try and merge representations rather than switch between them. GeoJSON-LD is a good example and is the subject of a whole 'nother thread. To me, however, it is a solution in search of a problem.

Ultimately, we think there's value in binding a default spatial data type to a media type (HTML+JSON-LD: schema.org; JSON-LD: GeoSPARQL; GeoJSON: GeoJSON). This being the open world, we can link across these (as @dr-shorthair shows) and use other vocabularies as appropriate, but a core set of shapes for each media type is surely more developer friendly (it certainly makes this data engineer happy).

@abhritchie

Whether to link to or inline geometries is a slightly different problem for which there can be no hard and fast rules. @fils's desire to 'provide information on the options and not provide a recommendation' is wise here.

Sure, it is probably always unwise to inline a non-schema.org geometry in an HTML+JSON-LD landing page, but elsewhere the choice depends heavily on the ontology and use case - sometimes minimizing the number of requests a client has to make is better for API performance than minimizing the size of a response.

@dr-shorthair
Collaborator

dr-shorthair commented Jun 3, 2020

It's a shame that we had 2 broadly adopted geometry serializations, WKT & GeoJSON, which already had a lot of support in software and libraries, and then Schema.org had to butt in with their own. Their community process in this area is strangely impervious to prior art (how did this happen @danbri?). But that's life I guess.

Just make sure the type is clear and we can rely on libraries to take care of it I guess.

@dr-shorthair
Collaborator

FWIW - adding both GeoJSON and Schema.org serializations, alongside WKT and GML, is already on the agenda for the revision of GeoSPARQL

@dr-shorthair
Collaborator

(We'll likely test it out in http://linked.data.gov.au/def/geox first)

@jesserobertson

@abhritchie out of curiosity is there anything against agreeing on an extra 'crs' member in your GeoJSON?

It's not against the spec (see https://tools.ietf.org/html/rfc7946#section-6.1) but I guess this would be non-normative and your json wouldn't work in a webmap straight away.

Might be a better approach than munging everything into EPSG:4326 though...

@dr-shorthair
Collaborator

Hahahaha about 5 years ago I tried to get the GeoJSON guys to soften just a little and accept an optional CRS pointer. This really was a point on which they absolutely would not budge. I tried to sell it to them on the basis that without it they exclude some important markets, but nothing doing. They really see the non-CRS niche as big enough. Maybe they're right. They can get quite rude about people who want more, and shoo them off to GML.

@abhritchie

Oh they budged ... into a more restrictive position than in the 2008 specification. The current spec is quite explicit about the use of WGS84: https://tools.ietf.org/html/rfc7946#section-4.

We could take advantage of the wriggle room they give

> However, where all involved parties have a prior arrangement, alternative coordinate reference systems can be used without risk of data being misinterpreted.

but we (I assume) want widespread, not niche, use. The risk of misinterpretation is high.

@dr-shorthair
Collaborator

Yeah - it was while they were moving GeoJSON into IETF that I was talking to them. My interventions possibly caused the robust clarification to be added. Recommended to don an emotional suit-of-armour ahead of every interaction.

@dr-shorthair
Collaborator

@abhritchie

abhritchie commented Jun 4, 2020

From a purely technical perspective it was a poor decision, but there's merit in that its simplicity does make for a more straightforward implementation path. There are fewer choices to make and things to understand. This is a big factor in its success.

Being in a charitable mood I assume schema.org's (flawed) decision to bake their own serialization is motivated by a similar desire for internal consistency and simplicity.

(Only a cynic would assume ego plays a significant role in the standardization process.)

@jesserobertson

@dr-shorthair that initial thread, yikes

> The only people who are going to complain about this change are geodesists and other coordinate nerds, but they have the GML book to take shelter with.

@abhritchie

There's an endearing clarity that comes from ignorance.

Still, there's something the science data community can learn from here. We deal with more complex data and need the freedom to say new or different things, and we need to support multiple communities. But ... we should strive for some simplicity/elegance/consistency wherever possible to help with adoption and uptake: first by embracing GeoJSON and schema.org as is (what I understand scienceonschema.org is doing), and then by focusing on a complementary effort to fill the gaps.

I'm labouring the point because @dr-shorthair's comment about the revision of GeoSPARQL (now with 100% more WKT, GML, GeoJSON, schema.org) made me sad. It feels a bit like being all things to all people with the effect that, like Vogon spaceships, the spec is not so much constructed as congealed.

Saying 'just make sure the type is clear and we can rely on libraries to take care of it I guess' is easy to type, but as someone writing scripts to parse these/take care of data it's dispiriting. Especially because in the past I've been criticized for advocating approaches that involve making a lot of nuanced decisions during implementation.

I'll stop now before @fils yells at me for hijacking his issue.

@dr-shorthair
Collaborator

dr-shorthair commented Jun 4, 2020

I'm not sure it's as tragic as that, @abhritchie.

One thing that GeoSPARQL got right is a clear boundary between semantics and coordinates/shapes. The latter are generally processed by different tools than reasoners, so having a clear transition from the semantic-graph to the geometry-blob is good. And it also means that you can substitute different microformats in the geometry-blob without disturbing the basic integrity of the semantics. There is no suggestion that boundary will be breached in any revision of GeoSPARQL, so I think we are safe.

Having an external geometry-data-service, so the geometry is via a URI-reference, with negotiation about the format of the payload, makes this separation even more clear.

What I worry more about is GeoJSON embedded in JSON-LD - this really does blur the semantic/geometry boundary.

@dr-shorthair
Collaborator

BTW - the GeoJSON dudes are definitely not ignorant - some very skillful people there. Including the creator of Shapely. They were really trying to find the 90-10 sweet spot.

@abhritchie

Just to be clear @dr-shorthair, the intent was to call that comment ignorant, not the community. Again, we've a lot to learn from them and a lot to gain by simply using it 'as is'.

@smrgeoinfo
Contributor

my 2 cents--
Schema.org is (as far as I can tell) oriented towards dataset/resource level indexing. Yes, their decision to ignore prior art (per @dr-shorthair, above) is unfortunate, but of course not without precedent. It seems to me that from the point of view of dataset indexing, bounding boxes and centroids have been in use for awhile now (under various standards and serialization schemes), and although not perfect, seem to have performed as a good 80/20 solution for indexing/discovery. The current Science on Schema.org recommendations (see #104) attempt to provide a convention for consistency/interoperability at that level.

I think we can consider encoding specific feature locations in data, e.g. sampling features of various types (boreholes, sample locations, image footprints) or feature extents (geologic polygons, vegetation classes, buildings ...), to be a separate problem that can be solved by a variety of the solutions mentioned in the discussion above. I'd argue that for data distribution, what we need are conventions to define and identify profiles (data types) that specify particular geospatial location conventions for data conforming to the profile, and content-negotiated services that advertise the profiles available and how to get those representations.

@smrgeoinfo
Contributor

smrgeoinfo commented Jun 9, 2020

There are several orthogonal considerations for location:

  1. what is the spatial reference system within which coordinate locations are represented
  2. what is the serialization scheme for the coordinates of the location (I think @fils's options at the top of this thread are focused on this)
  3. what is the relationship of the reported coordinates to the feature that the data item is about (the role). It might be a centroid, bounding box, set of bounding boxes, actual sample location(s), detailed feature geometry in 2-D or 3-D... Interesting to note a related conversation going on at geological-survey-of-queensland, which is mostly about this aspect.

It would be useful if any location entity in a dataset could be parsed to learn what its conventions are for all three of these aspects. I'd suggest that something like @type could be used for the serialization convention (?mime type, media type?), but there also needs to be something like dcat:conformsTo (note there is no schema:conformsTo) that identifies a profile/specification/set of conventions. A profile might specify a default SRS and relation/role, or conventions for how the srs and role are specified explicitly.
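One way to picture the three aspects is as a small record that a parser tries to fill in for each location node. The sketch below is purely illustrative: the `LocationConventions` class and `conventions_for` function are hypothetical, and the lookup covers only the two node types used in the example at the top of this thread. It also shows the gap being described, since neither node type states the role of the coordinates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LocationConventions:
    srs: Optional[str]            # 1. spatial reference system
    serialization: Optional[str]  # 2. how the coordinates are encoded
    role: Optional[str]           # 3. centroid, bounding box, sample site, ...
    profile: Optional[str] = None  # identifier of a profile, if one is declared


def conventions_for(node):
    """Report what can be learned from a single JSON-LD location node."""
    node_type = node.get("@type")
    if node_type == "http://www.opengis.net/ont/sf#Point":
        return LocationConventions(
            srs=node.get("geosparql:crs", {}).get("@id"),
            serialization="WKT",
            role=None,  # not stated anywhere in the node
        )
    if node_type == "schema:GeoCoordinates":
        return LocationConventions(
            srs="WGS 84",  # per the schema.org property descriptions
            serialization="schema.org latitude/longitude",
            role=None,  # not stated anywhere in the node
        )
    raise ValueError(f"unrecognized location type: {node_type!r}")


wkt_node = {
    "@type": "http://www.opengis.net/ont/sf#Point",
    "geosparql:asWKT": {"@value": "POINT(-76 -18)"},
    "geosparql:crs": {"@id": "http://www.opengis.net/def/crs/OGC/1.3/CRS84"},
}
found = conventions_for(wkt_node)
```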

@datadavev
Collaborator

Wouldn't the mechanisms for identifying the SRS and serialization of the spatial data be defined by the type definition of the spatial data (specified in a context elsewhere which in turn may refer to a schema or ontology as appropriate) and the relationship between the entity being described and the spatial data be asserted by the predicate of that edge? Much like the original example of @fils but with perhaps more specific predicates relating to the spatial data?

@smrgeoinfo
Contributor

@datadavev yes, that is possible. "@type": "http://www.opengis.net/ont/sf#Point" and "@type": "schema:GeoCoordinates" do provide SRS (as a property or default value). As far as I can tell, none of schema:subjectOf, geosparql:hasGeometry, geoblob, or schema:spatialCoverage are explicit about the relationship of the reported location to the described feature. If they had more explicit definitions/specifications they could (in some future world... :) )

@smrgeoinfo
Contributor

I guess while I'm at it @fils I'd suggest that instead of schema:subjectOf, schema:about would make more sense (to me...) as a link from a resource description to the location of the resource.

@datadavev
Collaborator

http://schema.org/location?

@smrgeoinfo
Contributor

That would certainly make sense, but the domain of schema:location is {Action, Event, Organization}

@datadavev
Collaborator

datadavev commented Jun 10, 2020

Yep, though the type of the object is http://igsn.org/core/v1/Sample, which may be considered to be an Event (though that is not asserted anywhere). Could also use spatial, locationCreated, spatialCoverage, or contentLocation if the sample is considered more akin to a CreativeWork, which seems to be implied by the later use of spatialCoverage.

Restating the original example would make it a bit clearer that all the various geometry expressions are different ways of providing spatial information for the spatialCoverage of the http://igsn.org/core/v1/Sample:

{
    "@context": {
        "@version": 1.1,
        "geoblob": {
            "@id": "http://example.com/vocab/json",
            "@type": "@json"
        },
        "description": "http://igsn.org/core/v1/description",
        "geosparql": "http://www.opengis.net/ont/geosparql#",
        "schema": "http://schema.org/"
    },
    "@id": "https://samples.earth/id/do/bqs2dn2u6s73o70jdup0",
    "@type": "http://igsn.org/core/v1/Sample",
    "description": "A fake ID for testing",
    "schema:spatialCoverage": [
      {
        "@type": "schema:Place",
        "schema:geo": {
          "@type": "schema:GeoCoordinates",
          "schema:latitude": -18,
          "schema:longitude": -76
        }
      },
      {
        "geoblob": {
            "type": "GeometryCollection",
            "geometries": [{
                "type": "Point",
                "coordinates": [-76, -18]
            }]
        }
      },
      {
        "@id": "_:N98e75cacc29f40deb555eb583cb162dc",
        "@type": "http://www.opengis.net/ont/sf#Point",
        "geosparql:asWKT": {
            "@type": "http://www.opengis.net/ont/geosparql#wktLiteral",
            "@value": "POINT(-76 -18)"
        },
        "geosparql:crs": {
            "@id": "http://www.opengis.net/def/crs/OGC/1.3/CRS84"
        }
      },
      {
        "schema:url": "https://samples.earth/id/do/bqs2dn2u6s73o70jdup0.geojson",
        "@type": "schema:DigitalDocument",
        "schema:format": [
          "application/vnd.geo+json"
        ],
        "schema:conformsTo": "https://igsn.org/schema/spatial.schema.json"
      }
    ]
}

(edited to correct namespace)

@smrgeoinfo
Contributor

Good point-- http://igsn.org/core/v1/Sample is not a schema.org entity, so a sample description profile for this @type should be able to define properties from any namespace, in which case this might be an approach:

{
    "@context": {
       .... same as above, but add dcat:  namespace
    },
    "@id": "https://samples.earth/id/do/bqs2dn2u6s73o70jdup0",
    "@type": "http://igsn.org/core/v1/Sample",
    "dcat:conformsTo": <URI for core sample description conventions>,  
    "rdfs:label": "BoreholeCoreID",
    "rdfs:description": "A fake ID for testing",
    "schema:location":
      {
        "@type": "schema:Place",
        "schema:geo": {
          "@type": "schema:GeoCoordinates",
          "schema:latitude": -18,
          "schema:longitude": -76
        }
      },
...
}

I like location more than spatialCoverage... but either will work. Note that if you're describing core, you'd probably want to indicate the depth interval sampled (with appropriate vertical reference system information) as well. schema:elevation isn't very useful for this.

@jesserobertson

@smrgeoinfo there’s a couple of subtleties here in that we’re trying to separate the core metadata for samples (essentially just the igsn, the registrant and any related resources) from a community-driven descriptive metadata profile. Not all communities need spatial data (eg synthetic materials science samples, or comet samples).

A profile would be a coupled set of JSON-LD with a JSONSchema to make it easier for publishers who don’t want to bother with JSON-LD to do the right thing (which has the bonus that you can use JSONSchema tools for validation across publishers if you want).

We have this for our registration data and are developing it for the descriptions now. So any semantic definition for location would be down to the community. I could imagine from a sample POV the sampling event might be what you care about, but a museum use case might rather use location for the actual location of the specimen now.

The plan is to work on the description parts with the communities involved - so you’d have a geological samples profile with geo sample types like core or fossils or thin sections, and a bio samples profile, and a material samples profile etc etc.

I think that work might kick off soon - we recommended this to the steering committee of IGSN this week.

Apologies for taking the thread off track!

@dr-shorthair
Collaborator

Please keep in mind that, in the context of IGSN, a sample is a physical sample. This contrasts with other kinds of samples - e.g. statistical samples from social sciences. So if you end up proposing it to schema.org the broader uses of the term should be considered/respected.

@smrgeoinfo
Contributor

@jesserobertson yes, goes back to comment about the orthogonal considerations for what is located. These considerations are not restricted to physical samples.

In the original IGSN data scheme there is the concept of a Birth Certificate for a physical sample-- a minimal, cross domain profile binding an identifier with 'who, what, when', that I think is your 'core metadata'. The original intention was that different sample registration agents (like SESAR, GA...) would define content models extending this. The hope is/was that these extension profiles would share the same vocabulary for sample description, only adding new entities/properties where necessary for their particular community.

@jesserobertson

The issues with coordinates and ordering might be a little less of a problem than we thought: https://www.w3.org/TR/json-ld/#example-84-coordinates-expressed-in-json-ld

Doesn't solve the CRS issues but might simplify the GeoJSON use case...
