
Graphormer multi label classification label input format #23697

Closed · techthiyanes opened this issue May 23, 2023 · 8 comments · Fixed by #23862

@techthiyanes

System Info

NA

Who can help?

@clefourrier

Kindly share the input format for multi-label classification, especially on the label side.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

NA

Expected behavior

NA

@clefourrier
Member

Hi!

It's basically a list of ints.

You can see an example of a graph with multiple labels in the ogbg-molpcba dataset. There is a detailed explanation of the types needed as inputs for graph classification in the blog post on graph classification using transformers.

Can you please tell me what added information you need?
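
To make that concrete, here is a small inspection sketch (hypothetical, not part of the original thread; it assumes the OGB/ogbg-molpcba dataset on the Hub, where the label column is named y and unmeasured tasks show up as NaN):

from datasets import load_dataset

dataset = load_dataset("OGB/ogbg-molpcba")
graph = dataset["train"][0]
# Each graph carries one value per task (128 tasks for ogbg-molpcba):
# 0.0 or 1.0 where the task was measured, NaN where it was not.
print(graph["y"])

So for multi-label classification, the per-graph label is itself a list of values, one per task, rather than a single int.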

@techthiyanes
Author


While trying to train, I'm getting the following error:

TypeError: _stack_dispatcher() got an unexpected keyword argument 'dim'

At the same time, it works for regression, binary classification, and multi-class classification use cases.

@clefourrier
Member

Hi!
Could you provide your full stack trace please?

@techthiyanes
Author


Please find below stack trace:

/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
warnings.warn(
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <cell line: 1>:1 │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:1664 in train │
│ │
│ 1661 │ │ inner_training_loop = find_executable_batch_size( │
│ 1662 │ │ │ self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size │
│ 1663 │ │ ) │
│ ❱ 1664 │ │ return inner_training_loop( │
│ 1665 │ │ │ args=args, │
│ 1666 │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1667 │ │ │ trial=trial, │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:1909 in _inner_training_loop │
│ │
│ 1906 │ │ │ │ rng_to_sync = True │
│ 1907 │ │ │ │
│ 1908 │ │ │ step = -1 │
│ ❱ 1909 │ │ │ for step, inputs in enumerate(epoch_iterator): │
│ 1910 │ │ │ │ total_batched_samples += 1 │
│ 1911 │ │ │ │ if rng_to_sync: │
│ 1912 │ │ │ │ │ self._load_rng_state(resume_from_checkpoint) │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:633 in __next__ │
│ │
│ 630 │ │ │ if self._sampler_iter is None: │
│ 631 │ │ │ │ # TODO(pytorch/pytorch#76750) │
│ 632 │ │ │ │ self._reset() # type: ignore[call-arg] │
│ ❱ 633 │ │ │ data = self._next_data() │
│ 634 │ │ │ self._num_yielded += 1 │
│ 635 │ │ │ if self._dataset_kind == _DatasetKind.Iterable and \ │
│ 636 │ │ │ │ │ self._IterableDataset_len_called is not None and \ │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:677 in _next_data │
│ │
│ 674 │ │
│ 675 │ def _next_data(self): │
│ 676 │ │ index = self._next_index() # may raise StopIteration │
│ ❱ 677 │ │ data = self._dataset_fetcher.fetch(index) # may raise StopIteration │
│ 678 │ │ if self._pin_memory: │
│ 679 │ │ │ data = _utils.pin_memory.pin_memory(data, self._pin_memory_device) │
│ 680 │ │ return data │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py:54 in fetch │
│ │
│ 51 │ │ │ │ data = [self.dataset[idx] for idx in possibly_batched_index] │
│ 52 │ │ else: │
│ 53 │ │ │ data = self.dataset[possibly_batched_index] │
│ ❱ 54 │ │ return self.collate_fn(data) │
│ 55 │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/models/graphormer/collating_graphormer.py:132 │
│ in __call__ │
│ │
│ 129 │ │ │ else: # binary classification │
│ 130 │ │ │ │ batch["labels"] = torch.from_numpy(np.concatenate([i["labels"] for i in │
│ 131 │ │ else: # multi task classification, left to float to keep the NaNs │
│ ❱ 132 │ │ │ batch["labels"] = torch.from_numpy(np.stack([i["labels"] for i in features], │
│ 133 │ │ │
│ 134 │ │ return batch │
│ 135 │
│ in stack:179 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: _stack_dispatcher() got an unexpected keyword argument 'dim'
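
For context on the error itself: _stack_dispatcher is numpy's argument-dispatch helper for np.stack, which only accepts an axis keyword; dim is the torch-style name for the same thing, so passing it raises exactly this TypeError. A minimal stand-alone reproduction (not from the original run; the exact wording of the message can vary with the numpy version):

import numpy as np

labels = [np.array([0.0, np.nan, 1.0]), np.array([1.0, 0.0, np.nan])]
np.stack(labels, axis=0)  # fine: produces a (2, 3) float array
np.stack(labels, dim=0)   # TypeError: _stack_dispatcher() got an unexpected keyword argument 'dim'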

@clefourrier
Member

Hi @techthiyanes ,
Could you please provide the command you launched or a code snippet so I can make sure I'm working on the same thing as you?

@techthiyanes
Author

Hi @clefourrier ,

Thank you for your time and response.
Please find below the code snippet I tried, where num_classes is not passed as an argument since it's multi-label classification.

# -*- coding: utf-8 -*-

"""Untitled334.ipynb

Automatically generated by Colaboratory.

Original file is located at
https://colab.research.google.com/drive/1Xnz4vI75fkIdQVT6wKKiuDipoQzO4uZ1
"""

!pip install -q -U datasets transformers Cython accelerate

!pip install -q -U matplotlib networkx

from transformers.utils import is_cython_available
print("Cython is installed:", is_cython_available())

from datasets import load_dataset
dataset = load_dataset("OGB/ogbg-molpcba")
dataset['train'] = dataset['train'].select(list(range(1000)))
dataset['test'] = dataset['test'].select(list(range(100)))
dataset['validation'] = dataset['validation'].select(list(range(100)))
from datasets import load_metric
metric = load_metric("accuracy")
import networkx as nx
import matplotlib.pyplot as plt

# We want to plot the first train graph

graph = dataset["train"][0]
edges = graph["edge_index"]
num_edges = len(edges[0])
num_nodes = graph["num_nodes"]

# Conversion to networkx format

G = nx.Graph()
G.add_nodes_from(range(num_nodes))
G.add_edges_from([(edges[0][i], edges[1][i]) for i in range(num_edges)])

# Plot

nx.draw(G)

dataset

from transformers.models.graphormer.collating_graphormer import preprocess_item, GraphormerDataCollator
dataset_processed = dataset.map(preprocess_item, batched=False)

# split up training into training + validation

train_ds = dataset_processed['train']
val_ds = dataset_processed['validation']

from transformers import GraphormerForGraphClassification

model_checkpoint = "clefourrier/graphormer-base-pcqm4mv2" # pre-trained model from which to fine-tune

model = GraphormerForGraphClassification.from_pretrained(
    model_checkpoint,
    # num_classes=2,  # commented out because this is multi-label classification
    ignore_mismatched_sizes=True,  # provide this in case you're planning to fine-tune an already fine-tuned checkpoint
)

from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
    "graph-classification",
    logging_dir="graph-classification",
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    auto_find_batch_size=True,  # batch size can be changed automatically to prevent OOMs
    gradient_accumulation_steps=10,
    dataloader_num_workers=4,
    num_train_epochs=20,
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    # push_to_hub=False,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=GraphormerDataCollator(),
)

trainer.train()

Thanks
Thiya

@clefourrier
Member

Ok, thank you very much for reporting!

I can reproduce your issue; I'll fix it ASAP.

@clefourrier
Member

clefourrier commented May 30, 2023

I fixed this problem in the PR above (now we need to wait for the fix to be merged, which will not be instantaneous). Thank you very much for reporting! 🤗
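
For anyone blocked before the merge, the change presumably amounts to stacking the per-graph label lists with numpy's axis keyword in the multi-task branch of the collator, keeping the values as floats so the NaNs survive for masking. A sketch under that assumption (not the exact diff from the PR):

import numpy as np
import torch

# Stand-in for the preprocessed examples handed to the collator; a real
# ogbg-molpcba batch would have 128 task values per graph instead of 3.
features = [
    {"labels": [1.0, 0.0, float("nan")]},
    {"labels": [0.0, float("nan"), 1.0]},
]
batch_labels = torch.from_numpy(np.stack([f["labels"] for f in features], axis=0))
print(batch_labels.shape)  # torch.Size([2, 3]); NaNs are preserved so unmeasured tasks can be masked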

Note that for multi-label classification, you will also need to provide the correct number of labels (in this case 128) to num_classes, like so:

model_checkpoint = "clefourrier/graphormer-base-pcqm4mv2" # pre-trained model from which to fine-tune

model = GraphormerForGraphClassification.from_pretrained(
    model_checkpoint,
    num_classes=128, # HERE
    ignore_mismatched_sizes = True, # provide this in case you're planning to fine-tune an already fine-tuned checkpoint
)
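
As a quick sanity check (a hypothetical snippet, not from the thread; it assumes the label column is named y, as in the OGB datasets on the Hub), the number of tasks can be read from the dataset instead of being hard-coded:

from datasets import load_dataset

dataset = load_dataset("OGB/ogbg-molpcba")
labels = dataset["train"][0]["y"]
# Labels are stored as one row of task values per graph; unwrap if nested.
num_tasks = len(labels[0]) if isinstance(labels[0], list) else len(labels)
print(num_tasks)  # expected: 128 for ogbg-molpcba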
