This repository collects multiple Text-Attributed Graph (TAG) datasets from various sources and provides a unified approach for preprocessing and loading them. We also offer a standardized task generation pipeline for evaluating the performance of GNNs/LLMs on these datasets. The technical report of TAGLAS is available on arXiv. The project is still under construction, so expect more datasets and features in the future. Stay tuned!
- 2024.06: First version release.
Here are the currently included datasets:
Dataset (key) | Avg. #N | Avg. #E | #G | Task level | Task | Split (train/val/test) | Domain | Description | Source
---|---|---|---|---|---|---|---|---|---
Cora_node (cora_node) | 2708 | 10556 | 1 | Node | 7-way classification | 140/500/2068 | Co-Citation | Predict the category of papers. | Graph-LLM, OFA
Cora_link (cora_link) | 2708 | 10556 | 1 | Link | Binary classification | 17944/1056/2112 | Co-Citation | Predict whether two papers are co-cited by other papers. | Graph-LLM, OFA
Pubmed_node (pubmed_node) | 19717 | 88648 | 1 | Node | 3-way classification | 60/500/19157 | Co-Citation | Predict the category of papers. | Graph-LLM, OFA
Pubmed_link (pubmed_link) | 19717 | 88648 | 1 | Link | Binary classification | 150700/8866/17730 | Co-Citation | Predict whether two papers are co-cited by other papers. | Graph-LLM, OFA
Arxiv (arxiv) | 169343 | 1166243 | 1 | Node | 40-way classification | 90941/29799/48603 | Citation | Predict the category of papers. | OGB, OFA
WikiCS (wikics) | 11701 | 216123 | 1 | Node | 10-way classification | 580/1769/5847 | Wiki page | Predict the category of wiki pages. | PyG, OFA
Product-subset (products) | 54025 | 144638 | 1 | Node | 47-way classification | 14695/1567/36982 | Co-purchase | Predict the category of products. | TAPE
FB15K237 (fb15k237) | 14541 | 310116 | 1 | Link | 237-way classification | 272115/17535/20466 | Knowledge graph | Predict the relationship between two entities. | OFA
WN18RR (wn18rr) | 40943 | 93003 | 1 | Link | 11-way classification | 86835/3034/3134 | Knowledge graph | Predict the relationship between two entities. | OFA
MovieLens-1m (ml1m) | 9923 | 2000418 | 1 | Link | Regression/5-way classification | 850177/50011/100021 | Movie rating | Predict the rating between users and movies. | PyG
Chembl_pretrain (chemblpre) | 25.87 | 55.92 | 365065 | Graph | 1048-way binary classification | 341952/0/0 | Molecular | Predict the effectiveness of the molecule on multiple assays. | GIMLET, OFA
PCBA (pcba) | 25.97 | 56.20 | 437929 | Graph | 128-way binary classification | 349854/43650/43588 | Molecular | Predict the effectiveness of the molecule on multiple assays. | GIMLET, OFA
HIV (hiv) | 25.51 | 54.94 | 41127 | Graph | Binary classification | 32901/4113/4113 | Molecular | Predict the effectiveness of the molecule against HIV. | GIMLET, OFA
BBBP (bbbp) | 24.06 | 51.91 | 2039 | Graph | Binary classification | 1631/204/204 | Molecular | Predict the blood-brain barrier permeability of the molecule. | GIMLET, OFA
BACE (bace) | 34.09 | 73.72 | 1513 | Graph | Binary classification | 1210/151/152 | Molecular | Predict the effectiveness of the molecule against the BACE1 protease. | GIMLET, OFA
ToxCast (toxcast) | 18.76 | 38.50 | 8575 | Graph | 588-way binary classification | 6859/858/858 | Molecular | Predict the effectiveness of the molecule on multiple assays. | GIMLET, OFA
ESOL (esol) | 13.29 | 27.35 | 1128 | Graph | Regression | 902/113/113 | Molecular | Predict the solubility of the molecule. | GIMLET, OFA
FreeSolv (freesolv) | 8.72 | 16.76 | 642 | Graph | Regression | 513/64/65 | Molecular | Predict the free energy of hydration of the molecule. | GIMLET, OFA
Lipo (lipo) | 27.04 | 59.00 | 4200 | Graph | Regression | 3360/420/420 | Molecular | Predict the lipophilicity of the molecule. | GIMLET, OFA
CYP450 (cyp450) | 24.52 | 53.02 | 16896 | Graph | 5-way binary classification | 13516/1690/1690 | Molecular | Predict the effectiveness of the molecule on the CYP450 enzyme family. | GIMLET, OFA
Tox21 (tox21) | 18.57 | 38.59 | 7831 | Graph | 12-way binary classification | 6264/783/784 | Molecular | Predict the effectiveness of the molecule on multiple assays. | GIMLET, OFA
MUV (muv) | 24.23 | 52.56 | 93087 | Graph | 17-way binary classification | 74469/9309/9309 | Molecular | Predict the effectiveness of the molecule on multiple assays. | GIMLET, OFA
ExplaGraphs (expla_graph) | 5.17 | 4.25 | 2766 | Graph | Question Answering | 1659/553/554 | Commonsense | Commonsense reasoning. | G-Retriever
SceneGraphs (scene_graph) | 19.13 | 68.44 | 100000 | Graph | Question Answering | 59978/19997/20025 | Scene graph | Scene graph question answering. | G-Retriever
MAG240m-subset (mag240m) | 5875010 | 26434726 | 1 | Node | 153-way classification | 900722/63337/63338/132585 | Citation | Predict the category of papers. | OGB
Ultrachat200k (ultrachat200k) | 3.72 | 2.72 | 449929 | Graph | Question Answering | 400000/20000/29929 | Conversation | Answer the question given the previous conversation. | UltraChat200k
Requirements:

```
PyG>=2.3
datasets
torch>=2.0.1
transformers>=4.36.2
huggingface_hub
rdkit
```
You can directly clone the repository into your working project using the following command:

```bash
git clone https://github.com/JiaruiFeng/TAGLAS.git
```
We will provide a more user-friendly installation method in the future.
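In the meantime, one minimal way to use the cloned copy from a script outside the repository is to put its parent directory on `sys.path`; a quick sketch (the path below is hypothetical, adjust it to your setup):

```python
import sys

# Hypothetical path: the directory that contains the cloned TAGLAS folder.
sys.path.append("/path/to/your_project")

from TAGLAS import get_dataset  # now importable from anywhere
```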
The basic way to load a dataset is by using its key. The dataset key can be found in the table above. For example, to load the Arxiv dataset:
```python
from TAGLAS import get_dataset

dataset = get_dataset("arxiv")
```
You can also load multiple datasets at the same time:
```python
from TAGLAS import get_datasets

dataset_list = get_datasets(["arxiv", "pcba"])
```
By default, all data files are saved in the `./TAGDataset` directory under the repository root. If you want to change the data path, set the `root` parameter when loading the dataset:
```python
from TAGLAS import get_datasets

dataset_list = get_datasets(["arxiv", "pcba"], root="your_path")
```
The above functions load datasets in the default way, which is suitable for most cases. However, some datasets have additional arguments. For finer control over the loading process, you can pass additional arguments directly:
```python
from TAGLAS import get_dataset

dataset = get_dataset("fb15k237", to_undirected=False)
```
Finally, importing the dataset class directly is also supported:
```python
from TAGLAS.datasets import Arxiv

dataset = Arxiv()
```
All data samples are stored in the dataset as `TAGData` objects, a class inherited from the `Data` class in the `torch_geometric` package. Different pieces of information are stored under different keys. Most datasets contain the following keys:
- `x`: Text features for all nodes. Usually a `list` or `np.ndarray`.
- `node_map`: A mapping from node index to node text feature. Usually a `torch.LongTensor`.
- `edge_attr`: Text features for all edges. Usually a `list` or `np.ndarray`.
- `edge_map`: A mapping from edge index to edge text feature. Usually a `torch.LongTensor`.
- `label`: Text features for all labels. Usually a `list` or `np.ndarray`.
- `label_map`: A mapping from label index to label text feature. Usually a `torch.LongTensor`.
- `edge_index`: The graph structure. Usually a `torch.LongTensor`.
Some datasets may also contain:
- `x_original`: The vector features of all nodes in the original data source. Usually a `torch.Tensor`.
- `edge_attr_original`: The vector features of all edges in the original data source. Usually a `torch.Tensor`.
- `question`: Text features of questions for the QA tasks.
- `question_map`: A mapping from question index to question text feature.
- `answer`: Text features of answers for the QA tasks.
- `answer_map`: A mapping from answer index to answer text feature.
Here is a simple demonstration:
```python
from TAGLAS import get_dataset

dataset = get_dataset("arxiv")
# Get node text features for the whole dataset.
x = dataset.x
# Get the first graph sample in the dataset.
data = dataset[0]
# Get edge text features for the sample.
edge_attr = data.edge_attr
```
For graph-level datasets, all `_map` keys, such as `node_map` and `edge_map`, store the mapping into the global features of all data samples. The global features can be accessed by:
```python
from TAGLAS import get_dataset

dataset = get_dataset("hiv")
# Get the global node text features.
dataset.x
# Get the global edge text features.
dataset.edge_attr
```
The feature for a specific sample can be obtained by:
```python
from TAGLAS import get_dataset

dataset = get_dataset("hiv")
# Global node text features.
x = dataset.x
data = dataset[0]
# Get node text features for sample 0 via its node_map key.
sample_x = [x[i] for i in data.node_map]
# We also provide direct access to the text features of each sample:
sample_x = dataset[0].x
```
For node/edge-level datasets, since they contain only one graph, the local map is also the global map, and the logic remains the same. The reason we store the features this way is to avoid repeated text features, especially for large datasets with only a few unique text features (like molecule datasets).
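To see this deduplication concretely, you can compare the size of the global text table against a per-sample map; a small sketch on the `hiv` dataset (the printed values are illustrative, not guaranteed):

```python
from TAGLAS import get_dataset

dataset = get_dataset("hiv")
data = dataset[0]
# The global table stores each unique text once; node_map indexes into it.
print(len(dataset.x))  # number of unique node texts across the whole dataset
print(data.node_map)   # per-sample indices into the global text table
```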
In this repository, we provide a unified way to generate tasks based on datasets. Currently, we support the following five task types:
- `default`: The `default` task implements the most common methods used in the graph community for node-, edge-, and graph-level tasks. Specifically, it returns the entire original graph for node- and edge-level tasks and an original graph sample for graph-level tasks. Additionally, it uses the node and edge features from the original source if available; otherwise, it generates identical features. This type is mainly used for debugging and baseline evaluation.
- `default_text`: The logic of `default_text` tasks is the same as `default`, except that all features are replaced with text features. Additionally, we support converting all text features to sentence embeddings.
- `subgraph`: The `subgraph` task converts node- and edge-level tasks into subgraph-based tasks. Specifically, for the target node or edge, it samples a subgraph around the target. Like the `default` task, it uses the original node and edge features.
- `subgraph_text`: The logic of `subgraph_text` tasks is the same as `subgraph`, except that all features are replaced with text features.
- `QA`: The `QA` task converts all tasks into a question-answering format. Each sample includes a `question` and an `answer` key (see the sketch after this list). By default, `QA` tasks sample subgraphs for node- and edge-level tasks.
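For example, a `QA` task sample carries question and answer text alongside the (sub)graph; a quick sketch, assuming the `scene_graph` dataset and that the keys described above are exposed directly on each sample:

```python
from TAGLAS import get_task

# QA tasks attach question/answer text to each sample.
task = get_task("scene_graph", "QA")
sample = task[0]
print(sample.question)  # the question text for this sample (assumed key)
print(sample.answer)    # the corresponding answer text (assumed key)
```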
To load a specific task, simply call:
```python
from TAGLAS import get_task

# Load the default node-level task on cora.
task = get_task("cora_node", "default")
# Load the subgraph_text edge-level task on pubmed with the val split.
task = get_task("pubmed_link", "subgraph_text", split="val")
```
Similarly, you can load multiple tasks at the same time:
```python
from TAGLAS import get_tasks

# Load QA tasks on all datasets.
tasks = get_tasks(["cora_node", "arxiv", "wn18rr", "scene_graph"], "QA")
# Specify the task type for each dataset.
tasks = get_tasks(["cora_node", "arxiv"], ["QA", "subgraph_text"])
```
By default, generated tasks are not saved. For fast loading and repeated experiments, you can save and load generated tasks with:
```python
from TAGLAS import get_task

# save_data saves the generated task into the corresponding folder. load_saved tries to load a saved task before generating a new one.
arxiv_task = get_task("arxiv", "subgraph_text", split="test", save_data=True, load_saved=True)
# By default, the saved task file is named after the key arguments used (like split, hop, ...). You can also specify the name yourself:
arxiv_task = get_task("arxiv", "subgraph_text", split="test", save_data=True, load_saved=True, save_name="your_name")
```
Directly constructing a task from a dataset is also supported:
```python
from TAGLAS.datasets import Arxiv
from TAGLAS.tasks import SubgraphTextNPTask

dataset = Arxiv()
# Load the subgraph_text node-level task on the Arxiv dataset.
task = SubgraphTextNPTask(dataset)
```
For the `default_text`, `subgraph_text`, and `QA` task types, we also provide a function to convert raw text features into sentence embeddings:
```python
from TAGLAS import get_task
from TAGLAS.tasks.text_encoder import SentenceEncoder

encoder_name = "ST"
encoder = SentenceEncoder(encoder_name)
arxiv_task = get_task("arxiv", "subgraph_text", split="test")
arxiv_task.convert_text_to_embedding(encoder_name, encoder)
```
In TAGLAS, we implement several commonly used LLMs for sentence embedding, including `ST` (Sentence Transformer), `BERT` (vanilla BERT), `e5` (E5), `llama2_7b` (Llama2-7b), and `llama2_13b` (Llama2-13b). You can load different models by passing the respective `model_key` into `SentenceEncoder`. Additionally, you can implement your own sentence embedding model as long as it has a `__call__` function that converts input text lists into embeddings.
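For instance, here is a minimal sketch of a custom encoder built on Hugging Face `transformers`; the class name and the mean-pooling choice are our own, and we assume only that TAGLAS needs a callable mapping a list of strings to a tensor of embeddings:

```python
import torch
from transformers import AutoModel, AutoTokenizer

class MyMeanPoolEncoder:
    """Hypothetical custom encoder: any object with a __call__ that maps
    a list of strings to an embedding tensor should work."""

    def __init__(self, model_name: str = "bert-base-uncased"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)

    @torch.no_grad()
    def __call__(self, texts: list) -> torch.Tensor:
        inputs = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        hidden = self.model(**inputs).last_hidden_state
        # Mean-pool token embeddings into one vector per sentence.
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```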
For all tasks in TAGLAS, we provide a unified collate function. Specifically, call the collate function by:
```python
from TAGLAS import get_task

arxiv_task = get_task("arxiv", "subgraph_text", split="test")
# Call the collate function to get a batch of data.
batch = arxiv_task.collate([arxiv_task[i] for i in range(16)])
```
The collate function is implemented based on `torch_geometric.loader.dataloader.Collater`. However, there is a major difference: for all text feature keys like `x` and `edge_attr`, it only stores the unique text features in the batch, and all `_map` keys store the mapping from those unique text features to all elements in the batch. For example:
```python
from TAGLAS import get_task

arxiv_task = get_task("arxiv", "subgraph_text", split="test")
batch = arxiv_task.collate([arxiv_task[i] for i in range(16)])
# Get node text features for all nodes in the batch.
x = batch.x[batch.node_map]
# Get edge text features for all edges in the batch.
edge_attr = batch.edge_attr[batch.edge_map]
```
In this way, the batch data is more memory- and computation-efficient, as each unique text only needs to be encoded once.
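Under this layout, a typical encoding step encodes `batch.x` once and then expands the result with `node_map`; a sketch, assuming `SentenceEncoder` is callable and returns one embedding row per input text:

```python
from TAGLAS import get_task
from TAGLAS.tasks.text_encoder import SentenceEncoder

task = get_task("arxiv", "subgraph_text", split="test")
batch = task.collate([task[i] for i in range(16)])
encoder = SentenceEncoder("ST")
# Encode each unique text once, then index with node_map so every node
# in the batch receives its embedding without re-encoding duplicates.
unique_emb = encoder(list(batch.x))    # assumed shape: [num_unique_texts, dim]
node_emb = unique_emb[batch.node_map]  # shape: [num_nodes_in_batch, dim]
```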
"For each dataset and task, we provide a default evaluation tool for performance evaluation based on torchmetric
.
Specifically, for each dataset, we support two types of evaluation based on its supported task types."
default
: Used for all task types exceptQA
. It supports evaluation based on tensor output, which is commonly used.QA
: It supports evaluation based on text output.
To get an evaluator for a certain task, simply call:
```python
from TAGLAS import get_evaluator, get_evaluators

# Get the default evaluator for the cora_node task. metric_name is a string indicating the name of the metric.
metric_name, evaluator = get_evaluator("cora_node", "subgraph_text")
# Get the QA evaluator for arxiv.
metric_name, evaluator = get_evaluator("arxiv", "QA")
# Get evaluators for multiple input tasks.
metric_name_list, evaluator_list = get_evaluators(["cora_node", "arxiv"], "QA")
```
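Once you have an evaluator, evaluation typically follows the `torchmetrics` calling convention; a sketch with hypothetical predictions, assuming the `default` evaluator accepts `(preds, targets)` tensors (shapes chosen to match the 7-way `cora_node` test split):

```python
import torch
from TAGLAS import get_evaluator

metric_name, evaluator = get_evaluator("cora_node", "default")
# Hypothetical model outputs; we assume the evaluator follows the
# torchmetrics pattern of evaluator(preds, targets).
preds = torch.randn(2068, 7)            # logits for the 7-way test split
targets = torch.randint(0, 7, (2068,))  # ground-truth class indices
print(metric_name, evaluator(preds, targets))
```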
The project is still in development. If you encounter any issues or bugs while using it, please feel free to open an issue in the GitHub repository.
If you find TAGLAS helpful in your project, please consider citing it. Thank you!
```bibtex
@misc{taglas,
      title={TAGLAS: An atlas of text-attributed graph datasets in the era of large graph and language models},
      author={Jiarui Feng and Hao Liu and Lecheng Kong and Yixin Chen and Muhan Zhang},
      year={2024},
      eprint={2406.14683},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2406.14683},
}
```