feature request - add an option to deal with edge lists that mention nodes that are not in the node list file #191
Comments
We could add support for this, but the reasons such corner cases are intentionally not supported are that:
From our perspective, since we want the best and fastest experience loading graph objects from CSVs, the graph files should be fixed before loading, not as they are being loaded. What are the reasons for having incomplete node lists?
Thanks @LucaCappelletti94. Possibly a solution here would be to build a helper tool as part of Grape (or OAK or KG-Hub) that reads in node and edge files and does 1) or 2) from the issue description: either rejects edges that reference unknown nodes, or adds the missing nodes to the nodes file, respectively. Maybe we could help write this if it's of interest?
In the oakx-grape use case this happens, I believe, because "dangling edges" (edges that reference nodes that are not explicitly mentioned as entities/nodes) are permitted in the OWL specification - I think @cmungall can elaborate.
In the KG-Hub-like use cases, this will likely happen sometimes if the developer forgets to write out node information when ingesting and processing edges. Essentially a data bug. This isn't currently a problem when reading KG-Hub graphs into Grape, because KGX currently rejects edges like this during the merge step in our ETL pipeline (which is possibly not ideal, since it silently omits information from the final graph, but that's a separate discussion).
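One way to surface such dangling edges before loading, rather than dropping them silently, would be a small pre-check along these lines. This is only a sketch: the function name `find_dangling_edges` is hypothetical, the KGX-style column names `id`, `subject`, `predicate` and `object` are assumptions about the file layout, and none of this is an existing Grape or KGX API. The sample rows are illustrative.

```python
import csv
import io


def find_dangling_edges(nodes_file, edges_file, id_col="id",
                        subject_col="subject", object_col="object"):
    """Return (edge_row, missing_ids) pairs for edges that mention unknown nodes.

    Both arguments are open text file objects holding tab-separated data
    with a header row. The column names are assumptions (KGX-style).
    """
    known = {row[id_col] for row in csv.DictReader(nodes_file, delimiter="\t")}
    dangling = []
    for row in csv.DictReader(edges_file, delimiter="\t"):
        # Collect whichever endpoints of this edge are absent from the node list.
        missing = [row[c] for c in (subject_col, object_col) if row[c] not in known]
        if missing:
            dangling.append((row, missing))
    return dangling


# Illustrative data only: "baz" is referenced by an edge but never declared.
nodes = io.StringIO("id\tcategory\nfoo\tbiolink:Gene\nbar\tbiolink:Gene\n")
edges = io.StringIO(
    "subject\tpredicate\tobject\n"
    "foo\tbiolink:interacts_with\tbar\n"
    "foo\tbiolink:interacts_with\tbaz\n"
)
report = find_dangling_edges(nodes, edges)
for edge, missing in report:
    print(edge["subject"], edge["predicate"], edge["object"], "-> missing:", missing)
```

Reporting the offending rows keeps the data bug visible to whoever maintains the ingest, instead of hiding it in a smaller graph.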
While we (@cmungall @caufieldjh @hrshdhgd) were working on oakx-grape, we observed that it'd be useful in this and other use cases to be able to deal with edge lists which contain references to nodes that are not present in the nodes list.
e.g.
nodes.tsv:
edges.tsv
Do either of these two possible behaviors seem reasonable/doable?
1) add an argument to `from_csv` to ignore edges that reference nodes that are not in the nodes list (`ignore_edges_with_unknown_nodes=False` or some such?) - here we'd ignore the `foo biolink:interacts_with baz` edge
2) add an argument to `from_csv` to instantiate nodes with the default node type when they are referenced in the edge file but not in the node file (`autocreate_nodes_from_edge_list=False` or some such?) - here we'd create a node `baz` with `default_node_type`
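To make the two proposed behaviors concrete, here is a rough sketch of each as a plain pre-processing step on in-memory data. The function names, the `(subject, predicate, object)` tuple layout, and the node types are all hypothetical and chosen for illustration; this is not how `from_csv` works today, nor a proposal for its internals.

```python
def drop_edges_with_unknown_nodes(node_ids, edges):
    """Option 1: keep only edges whose subject and object are both known."""
    known = set(node_ids)
    return [e for e in edges if e[0] in known and e[2] in known]


def autocreate_missing_nodes(nodes, edges, default_node_type):
    """Option 2: return a copy of `nodes` extended with every unknown endpoint.

    `nodes` maps node id -> node type; new entries get `default_node_type`.
    """
    result = dict(nodes)
    for subject, _predicate, obj in edges:
        for node_id in (subject, obj):
            result.setdefault(node_id, default_node_type)
    return result


# Illustrative data only: "baz" appears in an edge but not in the node list.
nodes = {"foo": "biolink:Gene", "bar": "biolink:Gene"}
edges = [
    ("foo", "biolink:interacts_with", "bar"),
    ("foo", "biolink:interacts_with", "baz"),
]

kept = drop_edges_with_unknown_nodes(nodes, edges)
completed = autocreate_missing_nodes(nodes, edges, "biolink:NamedThing")
print(kept)
print(completed)
```

Option 1 loses the `foo biolink:interacts_with baz` edge, while option 2 keeps it at the cost of a `baz` node that carries only the default type.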