Data collator class type integrity is not intact throughout the runtime #34830

kmehant · 2024-11-20T15:32:47Z

System Info

transformers: v4.46.3
python: 3.11

Who can help?

trainer: @muellerzr @SunMarc

Anyone else in the community is as well open to comment or suggest. Thank you.

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

The original data type / class of the data collator meta information is completely lost after the below step which happens at couple of places as listed below.

transformers/src/transformers/trainer.py

Line 975 in f297af5

    
           data_collator = self._get_collator_with_removed_columns(data_collator, description="training")

transformers/src/transformers/trainer.py

Line 1071 in f297af5

    
           data_collator = self._get_collator_with_removed_columns(data_collator, description="evaluation")

transformers/src/transformers/trainer.py

Line 1113 in f297af5

    
           data_collator = self._get_collator_with_removed_columns(data_collator, description="test")

The meta information loss happens since the collator is completely replaced with RemoveColumnsCollator wrapper collator class when the original dataset is NOT of type datasets.Dataset, mostly meant to support datasets.IterableDataset and others.

This raises issues for complex usecases where we wish to inject custom behaviour as part of the Trainer object when the collator is of certain class type and etc. Since the current code completely removes that piece of information and changes it to RemoveColumnsCollator, its really hard to know what was the original datacollator class. There are workarounds like writing case specific code handling RemoveColumnsCollator with special care, however, given the growing transformers code, things could change in the future and break such case specific code. On the other hand, it would be great to actually handle this situation better by preserving the original collator class information.

I propose the following to options

Dynamically modify the RemoveColumnsCollator class to subclass from the original data collator class that was passed.

This is bit of a fancy way of doing by creating a custom class/type using Python's type() API.

OR

Monkey patch the data collator object's caller functions (like __call__ etc) to include the remove columns logic on top of it. This would mean to remove RemoveColumnsCollator completely and do a simple monkey patch.

Example of monkey patch being already adopted in existing HF code making it a good option for this fix following the existing code style

https://github.com/huggingface/accelerate/blob/d7b1b368e9f484a18636a71600566b757d5cf87e/src/accelerate/utils/operations.py#L819

I am happy to discuss and raise a PR to fix this behaviour.

Expected behavior

The class type information of the original data collator has to be intact and preserved throughout the runtime.

The text was updated successfully, but these errors were encountered:

SunMarc · 2024-11-21T17:21:45Z

Thanks for the report. Just a quick question, you should still have access to the original data_collator in the data_collator field of RemoveColumnsCollator. Other than that, I don't mind doing both options you suggested, maybe the monkeypatch will better. cc @muellerzr

kmehant added the bug label Nov 20, 2024

kmehant changed the title ~~Data collator class type integrity is not intact until the start/end of the training~~ Data collator class type integrity is not intact throughout the runtime Nov 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data collator class type integrity is not intact throughout the runtime #34830

Data collator class type integrity is not intact throughout the runtime #34830

kmehant commented Nov 20, 2024 •

edited

Loading

SunMarc commented Nov 21, 2024

Data collator class type integrity is not intact throughout the runtime #34830

Data collator class type integrity is not intact throughout the runtime #34830

Comments

kmehant commented Nov 20, 2024 • edited Loading

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

SunMarc commented Nov 21, 2024

kmehant commented Nov 20, 2024 •

edited

Loading