
Graphormer multi label classification label input format #23697

Closed · techthiyanes opened this issue May 23, 2023 · 8 comments · Fixed by #23862

@techthiyanes

System Info

NA

Who can help?

@clefourrier

Kindly share the input format for multi-label classification, especially on the label side.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

NA

Expected behavior

NA

@clefourrier
Member

Hi!

It's basically a list of ints.

You can see an example of a graph with multiple labels in the ogbg-molpcba dataset. There is a detailed explanation of the types needed as inputs for graph classification in the blog post on graph classification using transformers.

Can you please tell me what added information you need?
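
To make that concrete, here is a small inspection sketch (hypothetical, not part of the original thread; it assumes the OGB/ogbg-molpcba dataset on the Hub, where the label column is named y and unmeasured tasks show up as NaN):

from datasets import load_dataset

dataset = load_dataset("OGB/ogbg-molpcba")
graph = dataset["train"][0]
# Each graph carries one value per task (128 tasks for ogbg-molpcba):
# 0.0 or 1.0 where the task was measured, NaN where it was not.
print(graph["y"])

So for multi-label classification, the per-graph label is itself a list of values, one per task, rather than a single int.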

@techthiyanes
Author


While trying to train, I'm getting the following error:

TypeError: _stack_dispatcher() got an unexpected keyword argument 'dim'

At the same time, it works for regression, binary classification, and multi-class classification use cases.

@clefourrier
Member

Hi!
Could you provide your full stack trace please?

@techthiyanes
Author


Please find below stack trace:

/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
warnings.warn(
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <cell line: 1>:1 │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:1664 in train │
│ │
│ 1661 │ │ inner_training_loop = find_executable_batch_size( │
│ 1662 │ │ │ self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size │
│ 1663 │ │ ) │
│ ❱ 1664 │ │ return inner_training_loop( │
│ 1665 │ │ │ args=args, │
│ 1666 │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1667 │ │ │ trial=trial, │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:1909 in _inner_training_loop │
│ │
│ 1906 │ │ │ │ rng_to_sync = True │
│ 1907 │ │ │ │
│ 1908 │ │ │ step = -1 │
│ ❱ 1909 │ │ │ for step, inputs in enumerate(epoch_iterator): │
│ 1910 │ │ │ │ total_batched_samples += 1 │
│ 1911 │ │ │ │ if rng_to_sync: │
│ 1912 │ │ │ │ │ self._load_rng_state(resume_from_checkpoint) │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:633 in __next__ │
│ │
│ 630 │ │ │ if self._sampler_iter is None: │
│ 631 │ │ │ │ # TODO(pytorch/pytorch#76750) │
│ 632 │ │ │ │ self._reset() # type: ignore[call-arg] │
│ ❱ 633 │ │ │ data = self._next_data() │
│ 634 │ │ │ self._num_yielded += 1 │
│ 635 │ │ │ if self._dataset_kind == _DatasetKind.Iterable and \ │
│ 636 │ │ │ │ │ self._IterableDataset_len_called is not None and \ │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:677 in _next_data │
│ │
│ 674 │ │
│ 675 │ def _next_data(self): │
│ 676 │ │ index = self._next_index() # may raise StopIteration │
│ ❱ 677 │ │ data = self._dataset_fetcher.fetch(index) # may raise StopIteration │
│ 678 │ │ if self._pin_memory: │
│ 679 │ │ │ data = _utils.pin_memory.pin_memory(data, self._pin_memory_device) │
│ 680 │ │ return data │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py:54 in fetch │
│ │
│ 51 │ │ │ │ data = [self.dataset[idx] for idx in possibly_batched_index] │
│ 52 │ │ else: │
│ 53 │ │ │ data = self.dataset[possibly_batched_index] │
│ ❱ 54 │ │ return self.collate_fn(data) │
│ 55 │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/models/graphormer/collating_graphormer.py:132 │
│ in __call__ │
│ │
│ 129 │ │ │ else: # binary classification │
│ 130 │ │ │ │ batch["labels"] = torch.from_numpy(np.concatenate([i["labels"] for i in │
│ 131 │ │ else: # multi task classification, left to float to keep the NaNs │
│ ❱ 132 │ │ │ batch["labels"] = torch.from_numpy(np.stack([i["labels"] for i in features], │
│ 133 │ │ │
│ 134 │ │ return batch │
│ 135 │
│ in stack:179 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: _stack_dispatcher() got an unexpected keyword argument 'dim'
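
For context on the error itself: _stack_dispatcher is numpy's argument-dispatch helper for np.stack, which only accepts an axis keyword; dim is the torch-style name for the same thing, so passing it raises exactly this TypeError. A minimal stand-alone reproduction (not from the original run; the exact wording of the message can vary with the numpy version):

import numpy as np

labels = [np.array([0.0, np.nan, 1.0]), np.array([1.0, 0.0, np.nan])]
np.stack(labels, axis=0)  # fine: produces a (2, 3) float array
np.stack(labels, dim=0)   # TypeError: _stack_dispatcher() got an unexpected keyword argument 'dim'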

@clefourrier
Member

Hi @techthiyanes ,
Could you please provide the command you launched or a code snippet so I can make sure I'm working on the same thing as you?

@techthiyanes
Author

Hi @clefourrier ,

Thank you for your time and response.
Please find below the code snippet I tried, where num_classes is not passed as an argument since it's multi-label classification.

# -*- coding: utf-8 -*-

"""Untitled334.ipynb

Automatically generated by Colaboratory.

Original file is located at
https://colab.research.google.com/drive/1Xnz4vI75fkIdQVT6wKKiuDipoQzO4uZ1
"""

!pip install -q -U datasets transformers Cython accelerate

!pip install -q -U matplotlib networkx

from transformers.utils import is_cython_available
print("Cython is installed:", is_cython_available())

from datasets import load_dataset
dataset = load_dataset("OGB/ogbg-molpcba")
dataset['train'] = dataset['train'].select(list(range(1000)))
dataset['test'] = dataset['test'].select(list(range(100)))
dataset['validation'] = dataset['validation'].select(list(range(100)))
from datasets import load_metric
metric = load_metric("accuracy")
import networkx as nx
import matplotlib.pyplot as plt

# We want to plot the first train graph

graph = dataset["train"][0]
edges = graph["edge_index"]
num_edges = len(edges[0])
num_nodes = graph["num_nodes"]

# Conversion to networkx format

G = nx.Graph()
G.add_nodes_from(range(num_nodes))
G.add_edges_from([(edges[0][i], edges[1][i]) for i in range(num_edges)])

# Plot

nx.draw(G)

dataset

from transformers.models.graphormer.collating_graphormer import preprocess_item, GraphormerDataCollator
dataset_processed = dataset.map(preprocess_item, batched=False)

# split up training into training + validation

train_ds = dataset_processed['train']
val_ds = dataset_processed['validation']

from transformers import GraphormerForGraphClassification

model_checkpoint = "clefourrier/graphormer-base-pcqm4mv2" # pre-trained model from which to fine-tune

model = GraphormerForGraphClassification.from_pretrained(
    model_checkpoint,
    # num_classes=2,  # commented out because this is multi-label classification
    ignore_mismatched_sizes=True,  # provide this in case you're planning to fine-tune an already fine-tuned checkpoint
)

from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
    "graph-classification",
    logging_dir="graph-classification",
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    auto_find_batch_size=True,  # batch size can be changed automatically to prevent OOMs
    gradient_accumulation_steps=10,
    dataloader_num_workers=4,
    num_train_epochs=20,
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    # push_to_hub=False,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=GraphormerDataCollator(),
)

trainer.train()

Thanks
Thiya

@clefourrier
Member

Ok, thank you very much for reporting!

I can reproduce your issue; I'll fix it ASAP.

@clefourrier
Member

clefourrier commented May 30, 2023

I fixed this problem in the PR above (now we need to wait for the fix to be merged, which will not be instantaneous). Thank you very much for reporting! 🤗
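
For anyone blocked before the merge, the change presumably amounts to stacking the per-graph label lists with numpy's axis keyword in the multi-task branch of the collator, keeping the values as floats so the NaNs survive for masking. A sketch under that assumption (not the exact diff from the PR):

import numpy as np
import torch

# Stand-in for the preprocessed examples handed to the collator; a real
# ogbg-molpcba batch would have 128 task values per graph instead of 3.
features = [
    {"labels": [1.0, 0.0, float("nan")]},
    {"labels": [0.0, float("nan"), 1.0]},
]
batch_labels = torch.from_numpy(np.stack([f["labels"] for f in features], axis=0))
print(batch_labels.shape)  # torch.Size([2, 3]); NaNs are preserved so unmeasured tasks can be masked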

Note that for multi-label classification, you will also need to provide the correct number of labels (in this case 128) to num_classes, like so:

model_checkpoint = "clefourrier/graphormer-base-pcqm4mv2" # pre-trained model from which to fine-tune

model = GraphormerForGraphClassification.from_pretrained(
    model_checkpoint,
    num_classes=128, # HERE
    ignore_mismatched_sizes = True, # provide this in case you're planning to fine-tune an already fine-tuned checkpoint
)
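
As a quick sanity check (a hypothetical snippet, not from the thread; it assumes the label column is named y, as in the OGB datasets on the Hub), the number of tasks can be read from the dataset instead of being hard-coded:

from datasets import load_dataset

dataset = load_dataset("OGB/ogbg-molpcba")
labels = dataset["train"][0]["y"]
# Labels are stored as one row of task values per graph; unwrap if nested.
num_tasks = len(labels[0]) if isinstance(labels[0], list) else len(labels)
print(num_tasks)  # expected: 128 for ogbg-molpcba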
