Graphs to be added as automatic retrieval #51

Open
LucaCappelletti94 opened this issue Dec 29, 2020 · 13 comments

@LucaCappelletti94
Member

We would like to add some more graphs to the automatic retrieval mechanism.

Currently, we support only StringPPI (human version), CompleteStringPPI (cross-species) and KG-COVID-19.

Which graphs should we add to the list? The requirements for the graph are:

  1. Must be publicly available at a URL that can be downloaded with wget.
  2. Must be a delimited text file (TSV, CSV, or another separator-based format).
  3. The server where it is hosted must be reasonably fast.
  4. May be a zip, gzip, or tar.gz archive, or a plain uncompressed file.
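
As an illustration of requirements 2 and 4, here is a minimal sketch of how a candidate could be pre-checked before being proposed. It is not the retrieval code used by the project: the function names `archive_kind` and `guess_separator` and the list of candidate separators are assumptions made for this example, and the check only looks at the URL suffix and a small text sample.

```python
from typing import Optional
from urllib.parse import urlparse

# Archive suffixes named in the requirements; ".tar.gz" is listed before
# ".gz" so that a tarball is not mistaken for a plain gzip file.
ARCHIVE_SUFFIXES = ((".tar.gz", "tar.gz"), (".zip", "zip"), (".gz", "gzip"))


def archive_kind(url: str) -> str:
    """Return 'tar.gz', 'zip', 'gzip' or 'plain' based on the URL path (hypothetical helper)."""
    path = urlparse(url).path.lower()
    for suffix, kind in ARCHIVE_SUFFIXES:
        if path.endswith(suffix):
            return kind
    return "plain"


def guess_separator(sample: str, candidates: str = "\t,;") -> Optional[str]:
    """Return the first candidate that occurs the same non-zero number of
    times on every non-empty line of the sample, or None if there is none."""
    lines = [line for line in sample.splitlines() if line]
    for separator in candidates:
        counts = {line.count(separator) for line in lines}
        if len(counts) == 1 and counts != {0}:
            return separator
    return None
```

For example, `archive_kind` applied to a URL ending in `kg-microbe.tar.gz` reports `tar.gz`, and `guess_separator` applied to the first few lines of an edge list reports the separator to pass to the loader.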
@LucaCappelletti94
Member Author

Karate

@LucaCappelletti94
Member Author

LucaCappelletti94 commented Jan 14, 2021

Finish adding graphs from Network Repository.

Most of these are now available; we still need to add support for timestamped graphs and graphs with multi-labelled nodes.

@LucaCappelletti94
Member Author

LucaCappelletti94 commented Jan 14, 2021

Added graphs from kghub with this pull request.

@callahantiff

@LucaCappelletti94 - what about including any of the resources in this table (also copied below)? If you want, we could select a few to start with. I'd be happy to write some simple code that converts them from PyKEEN format into the spec you provide above. Let me know what you think!

I put a ➡️ next to the ones I think are worth starting with and a ⭐ next to others worth considering for future incorporation.

| Name | Reference | Description |
| --- | --- | --- |
| ⭐ ckg | pykeen.datasets.CKG | The Clinical Knowledge Graph (CKG) dataset from [santos2020]_. |
| ➡️ codexlarge | pykeen.datasets.CoDExLarge | The CoDEx large dataset. |
| codexmedium | pykeen.datasets.CoDExMedium | The CoDEx medium dataset. |
| codexsmall | pykeen.datasets.CoDExSmall | The CoDEx small dataset. |
| ➡️ conceptnet | pykeen.datasets.ConceptNet | The ConceptNet dataset from [speer2017]_. |
| ⭐ cskg | pykeen.datasets.CSKG | The CSKG dataset. |
| ⭐ drkg | pykeen.datasets.DRKG | The DRKG dataset. |
| fb15k | pykeen.datasets.FB15k | The FB15k dataset. |
| fb15k237 | pykeen.datasets.FB15k237 | The FB15k-237 dataset. |
| ⭐ hetionet | pykeen.datasets.Hetionet | The Hetionet dataset is a large biological network. |
| ➡️ kinships | pykeen.datasets.Kinships | The Kinships dataset. |
| nations | pykeen.datasets.Nations | The Nations dataset. |
| ogbbiokg | pykeen.datasets.OGBBioKG | The OGB BioKG dataset. |
| ogbwikikg | pykeen.datasets.OGBWikiKG | The OGB WikiKG dataset. |
| ➡️ openbiolink | pykeen.datasets.OpenBioLink | The OpenBioLink dataset. |
| openbiolinkf1 | pykeen.datasets.OpenBioLinkF1 | The PyKEEN First Filtered OpenBioLink 2020 Dataset. |
| openbiolinkf2 | pykeen.datasets.OpenBioLinkF2 | The PyKEEN Second Filtered OpenBioLink 2020 Dataset. |
| openbiolinklq | pykeen.datasets.OpenBioLinkLQ | The low-quality variant of the OpenBioLink dataset. |
| umls | pykeen.datasets.UMLS | The UMLS dataset. |
| wn18 | pykeen.datasets.WN18 | The WN18 dataset. |
| wn18rr | pykeen.datasets.WN18RR | The WN18-RR dataset. |
| yago310 | pykeen.datasets.YAGO310 | The YAGO3-10 dataset is a subset of YAGO3 that only contains entities with at least 10 relations. |
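
A rough sketch of what such a conversion could look like, assuming the triples have already been extracted as `(head, relation, tail)` string tuples. No PyKEEN call is used here, and the function name `triples_to_tsv` and the `subject`/`predicate`/`object` column names are placeholders chosen for this example rather than anything fixed by the spec above.

```python
import csv
import io
from typing import Iterable, Tuple

# A triple as (head, relation, tail) labels; how these are obtained from a
# PyKEEN dataset is deliberately left out of this sketch.
Triple = Tuple[str, str, str]


def triples_to_tsv(triples: Iterable[Triple]) -> str:
    """Serialise triples as a tab-separated edge list with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Column names are an assumption made for illustration only.
    writer.writerow(["subject", "predicate", "object"])
    for head, relation, tail in triples:
        writer.writerow([head, relation, tail])
    return buffer.getvalue()
```

The resulting text could then be written to a file and hosted at a public URL to satisfy the requirements listed at the top of the issue.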

@callahantiff

Also, see the datasets listed in this KG Embedding Review at the bottom of page 22. These are the datasets most frequently used by people developing new KG embedding methods:

[Image: Screen Shot 2021-01-27 at 11 51 46]

@LucaCappelletti94
Member Author

LucaCappelletti94 commented Jan 28, 2021

Thank you, @callahantiff! We still need to add multi-class support for nodes (that is, nodes with multiple classes, such as a node that is both of class mammal and of class cat). Although we plan to add support for these and other node and edge features, we will work on them after finishing Grape. Do you know whether these graphs have multiple classes per node, or just nodes of multiple classes, with each node belonging to a single class?

[UPDATE 2021/04/19] We have support for multi-labeled nodes in graphs now!

If it's the second option, then we can already support all of the graphs under consideration.
How hard is it to convert them into a CSV-like format? And more importantly, where could we host these? Maybe on kg-hub? Would that be an option, @justaddcoffee?
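
To make the two cases concrete, here is a small sketch that lists the nodes carrying more than one class. It assumes the node types have already been parsed into a dictionary from node name to a list of class names; the function name `multi_label_nodes` and that input shape are assumptions for this example, not part of any of the datasets or of the library.

```python
from typing import Dict, List


def multi_label_nodes(node_types: Dict[str, List[str]]) -> List[str]:
    """Return, in sorted order, the nodes that have more than one distinct class."""
    return sorted(
        node for node, classes in node_types.items() if len(set(classes)) > 1
    )


# First case: a node that is both a mammal and a cat.
multi_class_graph = {"felix": ["mammal", "cat"], "rex": ["dog"]}
# Second case: several classes overall, but exactly one class per node.
single_class_graph = {"felix": ["cat"], "rex": ["dog"]}

print(multi_label_nodes(multi_class_graph))
print(multi_label_nodes(single_class_graph))
```

An empty result corresponds to the second option, where each node has a single class.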

@justaddcoffee
Collaborator

My general feeling is that we can and should allow easy ingest of remote graphs as we are discussing here.

But I think we should avoid hosting other people's graphs on KG-hub unless they are transformed versions that we are incorporating into our own knowledge graphs (like our ChEMBL transform that we include in KG-COVID-19).

Glad to discuss, though.

@LucaCappelletti94
Member Author

After adding support for time intervals, import graphs from http://www.sociopatterns.org/

@hrshdhgd

hrshdhgd commented Apr 27, 2021

Hey @LucaCappelletti94, as discussed earlier, here's the link for the kg-microbe graphs: https://kg-hub.berkeleybop.io/kg-microbe/20210422/kg-microbe.tar.gz

@LucaCappelletti94
Member Author

Thank you @hrshdhgd!

@LucaCappelletti94
Member Author

Hi @hrshdhgd, sorry for the long wait. All versions of KG-Microbe and KG-COVID are now integrated into the automatic retrieval.

@hrshdhgd

No problem, @LucaCappelletti94, thank you very much!

@LucaCappelletti94
Member Author

I am now iterating once more on the graphs available through the automatic graph retrieval (we are now at over 80K downloadable graphs). Do you have more suggestions?
