Decide how to modularize GAZ such that individual subsets can be managed in github #21

Open
cmungall opened this issue Apr 29, 2019 · 11 comments
Labels
question questions or discussion items (comnsider splitting) technical Anything regarding the build/release pipeline or requiring dev help

Comments

@cmungall
Member

@cmungall cmungall added question questions or discussion items (comnsider splitting) technical Anything regarding the build/release pipeline or requiring dev help labels Apr 29, 2019
@cmungall
Member Author

cmungall commented May 1, 2019

@rctauber how were you planning to split things into modules? I see you have a breakdown by country just now. Do you include everything that is located in a country, including geographic features such as lakes, rivers and the like? What about features that overlap two countries?

@beckyjackson
Collaborator

The country modules are everything that is related by either located_in or subClassOf. I'm not sure how overlapping features are handled currently in GAZ, but the modules would reflect that.

We originally discussed starting with countries, then expanding to other subsets like oceans and seas.

But if overlapping features appear in multiple modules (and I imagine there will be overlap between things like countries and oceans and seas), it will be hard to make sure things stay up-to-date if we are using the modules to develop...
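
As a rough illustration of the module rule described above (everything reachable from a country via located_in or subClassOf), here is a minimal sketch in plain Python. The IDs and edges are made up for illustration and are not real GAZ terms, and this is not the ROBOT-based extraction actually used. It shows how an entity asserted to be located in two countries ends up in both modules.

```python
from collections import defaultdict

# Hypothetical edges (child, relation, parent); these are NOT real GAZ IDs.
EDGES = [
    ("ex:city_a", "located_in", "ex:region_1"),
    ("ex:region_1", "located_in", "ex:country_x"),
    ("ex:lake_b", "located_in", "ex:country_x"),
    ("ex:lake_b", "located_in", "ex:country_y"),  # asserted in two countries
    ("ex:city_c", "located_in", "ex:country_y"),
    ("ex:country_x_sub", "subClassOf", "ex:country_x"),
]


def build_module(root, edges, relations=("located_in", "subClassOf")):
    """Return root plus every entity that reaches it via the given relations."""
    children = defaultdict(set)
    for child, rel, parent in edges:
        if rel in relations:
            children[parent].add(child)
    module, stack = {root}, [root]
    while stack:
        node = stack.pop()
        # Only visit children not already collected, so cycles terminate.
        for child in children[node] - module:
            module.add(child)
            stack.append(child)
    return module


module_x = build_module("ex:country_x", EDGES)
module_y = build_module("ex:country_y", EDGES)
shared = module_x & module_y  # entities that would be duplicated across modules
print(sorted(shared))
```

Any entity in `shared` is one that would have to be kept in sync across two module files.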

@pbuttigieg
Member

@rctauber

I'm not sure how overlapping features are handled currently in GAZ, but the modules would reflect that.

As long as they're of different types, I think there shouldn't be conflicts in the subClassOf hierarchies. The RO:overlaps relation and its subproperties can be (is?) used to assert this sort of mereotopology. Even if these are in different modules, this should hold as long as there are some checks in place to make sure classes/instances are present across modules.

On that note, @cmungall and I had several conversations over the years about the need to generalise spatial relations in ontologies like BSPO and RO to the planetary science case. I think GAZ will need these too. @cmungall time for an RO-geo subset? Branching off to #24

@beckyjackson
Collaborator

As long as they're of different types, I think there shouldn't be conflicts in the subClassOf hierarchies.

What about the 'located in' hierarchies, though? The modules include subclasses and located in. For example, say a river is located in two countries and we need to update the label of that river. Even if we check for 'overlaps', how do we know which one is newer? I guess I could write a script that takes the changes from the most-recently updated modules but it may get complicated.
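
To make the label problem concrete, here is a small sketch, assuming each module can be reduced to a mapping from entity ID to label (the module names, IDs and labels below are hypothetical). It only reports entities whose labels disagree between modules; it does not attempt to decide which version is newer, which is the hard part raised above.

```python
from collections import defaultdict

# Hypothetical module contents: module name -> {entity ID: label}.
MODULES = {
    "country_x": {"ex:river_r": "River R", "ex:city_a": "City A"},
    "country_y": {"ex:river_r": "R River", "ex:city_c": "City C"},
}


def conflicting_labels(modules):
    """Map each entity whose label differs between modules to {label: [modules]}."""
    seen = defaultdict(lambda: defaultdict(list))
    for module_name, labels in modules.items():
        for entity, label in labels.items():
            seen[entity][label].append(module_name)
    return {
        entity: {label: sorted(names) for label, names in by_label.items()}
        for entity, by_label in seen.items()
        if len(by_label) > 1  # more than one distinct label was seen
    }


conflicts = conflicting_labels(MODULES)
print(conflicts)
```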

How do we determine the initial conversion is not lossy?

Before we tackle the above problem, I think this is the more important issue.

On another note, I can regenerate the modules from GAZ to keep them up-to-date, but I'm using a version of ROBOT that has a few unreleased features. The two main ones are improved templating and use of Jena's TDB feature to store a dataset on-disk (which makes querying infinitely faster). I'm pushing to get the updated templating merged in, then I need to make a PR for the Jena stuff. I don't want to include a custom ROBOT JAR in this repo since there are already many large files.

As soon as these features are released, I can add the rules to the Makefile to generate modules so that anybody can do this. That said, it doesn't solve our problem of using modules to actually build GAZ, but at least the modules can be kept up-to-date.

@cmungall
Member Author

cmungall commented May 7, 2019

@rctauber

What about the 'located in' hierarchies, though? The modules include subclasses and located in. For example, say a river is located in two countries and we need to update the label of that river

Not sure if I am totally following. This issue is about modularization rather than labels, it sounds like you may also be making unique labels? (see #26).

But in answer to the main question, it should not be possible for an entity to be in RO:located-in two locations where those locations do not overlap (by definition). Thus if we choose non-overlapping units as the modules and placement in the modules is determined by located-in, then nothing should be in more than one module. But note:

  • There is no guarantee that located_in has been used in this strict RO sense in GAZ, or that mistakes have not crept in
  • Due to these mereotopological properties, there will be some entities that cannot be placed in a module. For example, a river should not be located in two countries; instead, partial-overlaps will have been used. We could have some binning strategy where something gets binned up to the next level (e.g. continent, and then up to earth). But this starts getting complex
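
One way to picture the "bin up to the next level" idea is as a lowest-common-container lookup. The sketch below assumes a simple child-to-parent containment table among the candidate bins; the IDs and the hierarchy are hypothetical, and real GAZ data would likely be messier (for example, countries spanning continents).

```python
# Hypothetical containment hierarchy among candidate bins (child -> parent).
BIN_PARENT = {
    "ex:country_x": "ex:continent_k",
    "ex:country_y": "ex:continent_k",
    "ex:country_z": "ex:continent_m",
    "ex:continent_k": "ex:earth",
    "ex:continent_m": "ex:earth",
}


def ancestors(bin_id):
    """Return bin_id followed by its containers, nearest first."""
    chain = [bin_id]
    while chain[-1] in BIN_PARENT:
        chain.append(BIN_PARENT[chain[-1]])
    return chain


def choose_bin(overlapped_countries):
    """Pick the lowest bin that contains every country the entity overlaps."""
    countries = list(overlapped_countries)
    if not countries:
        return "ex:earth"  # nothing known: fall back to the top-level bin
    common = set(ancestors(countries[0]))
    for country in countries[1:]:
        common &= set(ancestors(country))
    # Walking up from the first country, the first shared ancestor is the lowest.
    for candidate in ancestors(countries[0]):
        if candidate in common:
            return candidate
    return "ex:earth"


print(choose_bin(["ex:country_x", "ex:country_y"]))
```

An entity overlapping a single country stays in that country's module; anything spanning several is pushed up only as far as needed.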

@cmungall
Member Author

cmungall commented May 7, 2019

Let me also state a few assumptions to check I'm on the same page as everyone:

  • I assume there will be a one-time conversion of the current gaz source into a modular RDF representation in github
  • Editors will edit individual module files using Protege
  • A custom release process will build a complete gaz.owl and gaz.obo file and these will be distributed by a mechanism OTHER than github raw files
  • There will be some kind of Makefile-automated QC sparql checks to make sure that editors creating new entities place them in a module that is consistent with the located-in axiom
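
The last assumption could presumably be implemented as a SPARQL query run from the Makefile. The sketch below shows only the underlying logic in plain Python, under two simplifying assumptions that are not from the thread: each entity has at most one asserted located-in parent, and the module an entity was created in is known. All IDs are hypothetical.

```python
# Hypothetical data: the module (country) each entity was created in ...
HOME_MODULE = {
    "ex:city_a": "ex:country_x",
    "ex:region_1": "ex:country_x",
    "ex:city_c": "ex:country_x",  # placed in the wrong module file
}
# ... and each entity's single asserted located-in parent (an assumption).
LOCATED_IN = {
    "ex:city_a": "ex:region_1",
    "ex:region_1": "ex:country_x",
    "ex:city_c": "ex:country_y",
}


def containing_places(entity, located_in):
    """Follow located-in upward and return every place that contains entity."""
    seen = []
    current = located_in.get(entity)
    while current is not None and current not in seen:  # guard against cycles
        seen.append(current)
        current = located_in.get(current)
    return seen


def misplaced_entities(home_module, located_in):
    """Return entities whose module root is not among their containing places."""
    return sorted(
        entity
        for entity, module_root in home_module.items()
        if module_root not in containing_places(entity, located_in)
    )


print(misplaced_entities(HOME_MODULE, LOCATED_IN))
```

A non-empty result would fail the QC step and point the editor at the entities to move.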

@beckyjackson
Collaborator

This issue is about modularization rather than labels, it sounds like you may also be making unique labels?

Sorry, I wasn't super clear. I was just using that as an example if we wanted to update the label of an entity that existed in two modules. This wouldn't be a problem if we are able to define non-overlapping modules, as you suggest above.

I agree with your stated assumptions.

@cmungall
Member Author

@rctauber going back to your comment from May 6. What are your plans for robot templates here?

@beckyjackson
Collaborator

I don't have templates for the modules right now, but I can always make them if need be. I'm starting to see that ROBOT is having some trouble with any entities that are both named individuals and classes. For example, GAZ:00005229:

<!-- http://purl.obolibrary.org/obo/GAZ_00005229 -->

<owl:Class rdf:about="http://purl.obolibrary.org/obo/GAZ_00005229">
    <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Vennesla</rdfs:label>
    <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/GAZ_00002718"/>
    <obo:IAO_0000115 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">A populated place.</obo:IAO_0000115>
    <oboInOwl:hasOBONamespace rdf:datatype="http://www.w3.org/2001/XMLSchema#string">GAZ</oboInOwl:hasOBONamespace>
    <oboInOwl:id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">GAZ:00005229</oboInOwl:id>
</owl:Class>

and...

<!-- http://purl.obolibrary.org/obo/GAZ_00005229 -->

<owl:NamedIndividual rdf:about="http://purl.obolibrary.org/obo/GAZ_00005229">
    <obo:RO_0001025 rdf:resource="http://purl.obolibrary.org/obo/GAZ_00012611"/>
</owl:NamedIndividual>

I'm trying to use robot filter to create a "bucket" of things missing from the country modules, but filter isn't working for these types of terms. We may need to resolve #20 before proceeding with modules.
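
For reference, terms like this can be listed without ROBOT. The following is a minimal sketch using only Python's standard library; it assumes the ontology is serialized as RDF/XML with typed node elements, as in the snippets above, and it would miss types asserted through explicit rdf:type triples. The sample is a trimmed copy of the quoted declarations, wrapped in an rdf:RDF element so that it parses; the bare declaration for GAZ_00002718 is added here only as a non-punned contrast.

```python
import xml.etree.ElementTree as ET

OWL = "http://www.w3.org/2002/07/owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Trimmed sample based on the declarations quoted in the thread.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:obo="http://purl.obolibrary.org/obo/">
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/GAZ_00005229">
    <rdfs:label>Vennesla</rdfs:label>
    <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/GAZ_00002718"/>
  </owl:Class>
  <owl:NamedIndividual rdf:about="http://purl.obolibrary.org/obo/GAZ_00005229">
    <obo:RO_0001025 rdf:resource="http://purl.obolibrary.org/obo/GAZ_00012611"/>
  </owl:NamedIndividual>
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/GAZ_00002718"/>
</rdf:RDF>"""


def punned_iris(rdfxml_text):
    """Return IRIs declared as both owl:Class and owl:NamedIndividual."""
    root = ET.fromstring(rdfxml_text)
    about = f"{{{RDF}}}about"
    classes = {el.get(about) for el in root.iter(f"{{{OWL}}}Class")}
    individuals = {el.get(about) for el in root.iter(f"{{{OWL}}}NamedIndividual")}
    # Anonymous (blank-node) elements have no rdf:about and yield None.
    return sorted((classes & individuals) - {None})


print(punned_iris(SAMPLE))
```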

@cmungall
Member Author

cmungall commented Jul 9, 2019

I agree we should fix the punning first.

My question was more along the lines of what you thought was best for the overall strategy. One possibility would be to maintain the entire ontology as a TSV and generate via robot template. I thought you might be thinking along these lines. There would be some definite advantages here. But it could be awkward editing the relational graph. And having mixed mode TSV and OWL may just add more complexity to what is already likely to turn into quite a complex build.

It may be the case that we don't need to worry about templates just now and just focus on modularizing the OWL (but still, fixing the punning would be good)

@beckyjackson
Collaborator

My plan was to modularize first, and then determine if we want to move to templates later. So I think we are in agreement there.

I think we should discuss #20 on our next GAZ call and (perhaps) move forward on converting all those into individuals. Then, I could work on building a "bucket" that contains all the terms not in one of the country modules.
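
Conceptually, the "bucket" is a set difference between all GAZ terms and the union of the country modules. A minimal sketch, assuming term IDs have already been extracted into plain Python sets (the IDs below are hypothetical):

```python
def build_bucket(all_terms, modules):
    """Return the terms that do not appear in any module."""
    covered = set().union(*modules.values())  # empty set if there are no modules
    return set(all_terms) - covered


# Hypothetical term IDs, for illustration only.
ALL_TERMS = {"ex:city_a", "ex:city_c", "ex:ocean_o", "ex:river_r"}
MODULES = {
    "country_x": {"ex:city_a", "ex:river_r"},
    "country_y": {"ex:city_c", "ex:river_r"},
}

bucket = build_bucket(ALL_TERMS, MODULES)
print(sorted(bucket))
```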


3 participants