feature request - add an option to deal with edge lists that mention nodes that are not in the node list file #191

justaddcoffee · 2022-11-08T21:26:49Z

While we (@cmungall @caufieldjh @hrshdhgd) were working on oakx-grape, we observed that it'd be useful in this and other use cases to be able to deal with edge lists which contain references to nodes that are not present in the nodes list.

e.g.

nodes.tsv:

id    category
foo  biolink:Gene
bar  biolink:Protein

edges.tsv:

subject      predicate                      object
foo          biolink:interacts_with         bar
foo          biolink:interacts_with         baz

Do either of these two possible behaviors seem reasonable/doable?

1. Add an argument to from_csv to ignore edges that reference nodes that are not in the nodes list (ignore_edges_with_unknown_nodes=False or some such?) - here we'd ignore the foo biolink:interacts_with baz edge
2. Add an argument to from_csv to instantiate nodes with the default node type when they are referenced in the edge file but not in the node file (autocreate_nodes_from_edge_list=False or some such?) - here we'd create a node baz with default_node_type


LucaCappelletti94 · 2022-11-09T11:24:12Z

We could add support for this, but such corner cases are intentionally unsupported, for the following reasons:

These are malformed input files. The set of nodes should contain all nodes.
Loading of the graph can no longer be parallel when such corner cases are present, making it much slower.
Many assumptions in creating the graph data structure are no longer valid when edges can get thrown out because of malformations.

From our perspective, since we want the best and fastest experience loading graph objects from CSVs, the graph files should be fixed before loading, not as they are being loaded.

What are the reasons for having incomplete node lists?

justaddcoffee · 2022-11-09T13:13:46Z

Thanks @LucaCappelletti94

Possibly a solution here would be to build a helper tool as part of Grape (or OAK or KG-Hub) that reads in node and edge files and does 1) and 2) above - either rejects edges or adds missing nodes to the nodes file, respectively. Maybe we could help write this if it's of interest?

> What are the reasons for having incomplete node lists?

In the oakx-grape use case, I believe this happens because "dangling edges" (edges that reference nodes that are not explicitly declared as entities/nodes) are permitted in the OWL specification. I think @cmungall can elaborate.

In the KG-Hub-like use cases, this will likely happen occasionally when a developer forgets to write out node information while ingesting and processing edges. Essentially a data bug. This isn't currently a problem when reading KG-Hub graphs into Grape, because KGX rejects edges like this during the merge step in our ETL pipeline (which is possibly not ideal, since it silently omits information from the final graph, but that's a separate discussion).
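To make the helper-tool idea concrete, here is a rough sketch of what such a pre-processing step might look like. It is not part of Grape, OAK or KG-Hub: the function names are made up for illustration, the column names (id, category, subject, predicate, object) are taken from the example files at the top of this issue, and biolink:NamedThing is only an assumed placeholder for the default node type.

```python
import csv
import io

# Assumed placeholder for the default node type; not taken from Grape.
DEFAULT_CATEGORY = "biolink:NamedThing"


def read_tsv(handle):
    """Read a tab-separated file into a list of dict rows."""
    return list(csv.DictReader(handle, delimiter="\t"))


def drop_dangling_edges(nodes, edges):
    """Option 1: keep only edges whose subject and object are both in the node list."""
    known = {row["id"] for row in nodes}
    return [e for e in edges if e["subject"] in known and e["object"] in known]


def add_missing_nodes(nodes, edges, default_category=DEFAULT_CATEGORY):
    """Option 2: append a node row for every ID that appears only in the edge list."""
    known = {row["id"] for row in nodes}
    completed = list(nodes)
    for edge in edges:
        for node_id in (edge["subject"], edge["object"]):
            if node_id not in known:
                known.add(node_id)
                completed.append({"id": node_id, "category": default_category})
    return completed


# The example files from the top of the issue, inlined for the sketch.
nodes = read_tsv(io.StringIO("id\tcategory\nfoo\tbiolink:Gene\nbar\tbiolink:Protein\n"))
edges = read_tsv(io.StringIO(
    "subject\tpredicate\tobject\n"
    "foo\tbiolink:interacts_with\tbar\n"
    "foo\tbiolink:interacts_with\tbaz\n"
))

kept_edges = drop_dangling_edges(nodes, edges)  # drops the foo -> baz edge
all_nodes = add_missing_nodes(nodes, edges)     # adds a row for baz
```

Running either function before loading would leave the files well-formed, so the loader itself would not need to change. Writing the cleaned rows back out (e.g. with csv.DictWriter) is left out of the sketch.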
