feature request - add an option to deal with edge lists that mention nodes that are not in the node list file #191

Open
justaddcoffee opened this issue Nov 8, 2022 · 2 comments

Comments

@justaddcoffee
Collaborator

While we (@cmungall @caufieldjh @hrshdhgd) were working on oakx-grape, we found that it would be useful, here and in other use cases, to be able to handle edge lists that reference nodes not present in the node list.

e.g.

nodes.tsv:

id   category
foo  biolink:Gene
bar  biolink:Protein

edges.tsv:

subject      predicate                      object
foo          biolink:interacts_with         bar
foo          biolink:interacts_with         baz

Do either of these two possible behaviors seem reasonable/doable?

  1. Add an argument to from_csv that ignores edges referencing nodes that are not in the node list (ignore_edges_with_unknown_nodes=False or some such?). Here we'd ignore the foo biolink:interacts_with baz edge.

  2. Add an argument to from_csv that instantiates nodes with the default node type when they are referenced in the edge file but not in the node file (autocreate_nodes_from_edge_list=False or some such?). Here we'd create a node baz with default_node_type.
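To make the two behaviors concrete, here is a minimal in-memory sketch applied to the example above. It is plain Python and does not use Grape's API; the function names and the `biolink:NamedThing` default are placeholders chosen for illustration, not existing options.

```python
from typing import Dict, List, Tuple

# An edge is (subject, predicate, object), matching the edges.tsv columns.
Edge = Tuple[str, str, str]


def drop_dangling_edges(nodes: Dict[str, str], edges: List[Edge]) -> List[Edge]:
    """Behavior 1: keep only edges whose subject and object are both known nodes."""
    return [edge for edge in edges if edge[0] in nodes and edge[2] in nodes]


def autocreate_missing_nodes(
    nodes: Dict[str, str], edges: List[Edge], default_node_type: str
) -> Dict[str, str]:
    """Behavior 2: return a copy of the node map with missing endpoints added."""
    completed = dict(nodes)  # leave the caller's node map untouched
    for subject, _predicate, obj in edges:
        for node_id in (subject, obj):
            completed.setdefault(node_id, default_node_type)
    return completed


# The example from this issue: baz appears in edges.tsv but not in nodes.tsv.
nodes = {"foo": "biolink:Gene", "bar": "biolink:Protein"}
edges = [
    ("foo", "biolink:interacts_with", "bar"),
    ("foo", "biolink:interacts_with", "baz"),
]

kept_edges = drop_dangling_edges(nodes, edges)
# "biolink:NamedThing" is an assumed stand-in for default_node_type.
all_nodes = autocreate_missing_nodes(nodes, edges, "biolink:NamedThing")
```

With option 1, `kept_edges` contains only the foo/bar edge; with option 2, `all_nodes` gains a baz entry carrying the default type.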

@LucaCappelletti94
Member

We could add support for this, but such corner cases are intentionally unsupported, for the following reasons:

  1. These are malformed input files. The set of nodes should contain all nodes.
  2. Loading of the graph can no longer be parallel when such corner cases are present, making it much slower.
  3. Many assumptions in creating the graph data structure are no longer valid when edges can get thrown out because of malformations.

From our perspective, since we want the best and fastest experience loading graph objects from CSVs, the graph files should be fixed before loading, not as they are being loaded.

What are the reasons for having incomplete node lists?

@justaddcoffee
Collaborator Author

justaddcoffee commented Nov 9, 2022

Thanks @LucaCappelletti94

Possibly a solution here would be to build a helper tool as part of Grape (or OAK or KG-Hub) that reads in the node and edge files and applies either option 1) or option 2) above: it would reject the offending edges or add the missing nodes to the nodes file, respectively. Maybe we could help write this if it's of interest?
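As a rough sketch of what such a pre-processing helper could look like, the following uses only the Python standard library and fixes the files before they are handed to the loader. Everything here is an assumption for illustration: `repair_graph_files`, its `mode` values, and the `biolink:NamedThing` default are hypothetical, and the column names (`id`, `category`, `subject`, `object`) are taken from the example files above.

```python
import csv


def read_tsv(path):
    """Read a TSV file into a list of row dicts plus its header."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        rows = list(reader)
        return rows, list(reader.fieldnames or [])


def write_tsv(path, rows, fieldnames):
    """Write row dicts as TSV; columns missing from a row are left empty."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=fieldnames, delimiter="\t",
            lineterminator="\n", extrasaction="ignore",
        )
        writer.writeheader()
        writer.writerows(rows)


def repair_graph_files(nodes_path, edges_path, out_nodes_path, out_edges_path,
                       mode="drop", default_category="biolink:NamedThing"):
    """Write repaired copies of the node and edge files.

    mode="drop": remove edges that reference unknown nodes (option 1).
    mode="add":  append unknown nodes with default_category (option 2).
    Returns the sorted list of node ids that were missing from the node file.
    """
    if mode not in ("drop", "add"):
        raise ValueError(f"unknown mode: {mode!r}")
    nodes, node_fields = read_tsv(nodes_path)
    edges, edge_fields = read_tsv(edges_path)
    known = {row["id"] for row in nodes}
    missing = set()
    kept = []
    for edge in edges:
        absent = [n for n in (edge["subject"], edge["object"]) if n not in known]
        missing.update(absent)
        if absent and mode == "drop":
            continue  # skip the dangling edge, but still report its nodes
        for node_id in absent:
            if node_id not in known:  # guards against self-loops on a new node
                known.add(node_id)
                nodes.append({"id": node_id, "category": default_category})
        kept.append(edge)
    write_tsv(out_nodes_path, nodes, node_fields)
    write_tsv(out_edges_path, kept, edge_fields)
    return sorted(missing)
```

Returning the list of missing ids lets the caller log or fail on them, so the repair is never silent. Because the helper runs as a separate step, the loader itself can keep assuming a complete node list.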

> What are the reasons for having incomplete node lists?

In the oakx-grape use case this happens, I believe, because "dangling edges" (edges that reference nodes that are not explicitly declared as entities/nodes) are permitted in the OWL specification. I think @cmungall can elaborate.

In the KG-Hub-like use cases, this will likely happen occasionally when a developer forgets to write out node information while ingesting and processing edges. Essentially a data bug. This isn't currently a problem when reading KG-Hub graphs into Grape, because KGX currently rejects edges like this during the merge step in our ETL pipeline (which is possibly not ideal, since it silently omits information from the final graph, but that's a separate discussion).
