Trainer automatically drops unused columns in nlp datasets #6449
Conversation
@@ -151,8 +162,6 @@ class Trainer:
        compute_metrics (:obj:`Callable[[EvalPrediction], Dict]`, `optional`):
            The function that will be used to compute metrics at evaluation. Must take a
            :class:`~transformers.EvalPrediction` and return a dictionary string to metric values.
        prediction_loss_only (:obj:`bool`, `optional`, defaults to `False`):
Forgot to remove this in #6426
src/transformers/trainer_utils.py
Outdated
class FinalActivation(ExplicitEnum):
    """
    Possible values for the ``final_activation`` argument in :meth:`Trainer.from_nlp_dataset`.
    Useful for tab-completion in an IDE.
    """

    NONE = "none"
    ARGMAX = "argmax"
    SIGMOID = "sigmoid"
    SOFTMAX = "softmax"
I thought it was better to have some kind of enum for the possible final activation functions rather than letting the user provide their own string.
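For illustration only (not part of the diff), usage could look like the snippet below, assuming the `FinalActivation` class from the hunk above is importable; it relies on standard `Enum` behaviour, so strings and enum members are interchangeable:

```python
# Hypothetical usage of the enum defined above: `final_activation="softmax"` and
# `FinalActivation.SOFTMAX` refer to the same member, which is what makes the
# string/enum duality convenient for users.
assert FinalActivation("softmax") is FinalActivation.SOFTMAX
assert FinalActivation.ARGMAX.value == "argmax"
```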
Codecov Report
@@            Coverage Diff             @@
##           master    #6449     +/-   ##
==========================================
- Coverage   79.98%   77.86%    -2.13%
==========================================
  Files         153      153
  Lines       28005    28031      +26
==========================================
- Hits        22401    21827     -574
- Misses       5604     6204     +600

Continue to review full report at Codecov.
src/transformers/trainer.py
Outdated
        model (:class:`~transformers.PreTrainedModel`):
            The model to train, evaluate or use for predictions.
        args (:class:`~transformers.TrainingArguments`):
            The arguments to tweak training.
(nit) "The arguments to tweak for training"
        if metrics is not None:
            compute_metrics = ComputeNLPMetrics(metrics, activation=final_activation)

        # Inspect model forward signature to keep only the arguments it accepts.
Great functionality! Love it
src/transformers/trainer.py
Outdated
        cls,
        model: PreTrainedModel,
        args: TrainingArguments,
        dataset: "nlp.dataset_dict.DatasetDict",
I think I would prefer not to have `dataset` as an input argument, but `train_dataset` and `eval_dataset` instead. IMO, this has two advantages:
- The user is not forced to use the whole dataset (e.g. if they only want to evaluate, it is much cheaper in RAM to load just the eval set)
- The class method signature stays the same as the init signature

Also putting @lhoestq in cc here.
Alternative: we can support both by having `dataset_dict`, `train_dataset` and `eval_dataset` as arguments (with an assert to check that both are not provided at the same time).
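A hypothetical illustration of this alternative (not the PR's code; the helper name and the split names "train"/"validation" are assumptions): accept either a full dataset dict or explicit splits, and assert that the caller does not mix the two.

```python
from transformers import Trainer

def build_trainer(model, args, dataset_dict=None, train_dataset=None, eval_dataset=None, **kwargs):
    # Mutually exclusive inputs: a whole DatasetDict, or explicit train/eval splits.
    assert dataset_dict is None or (train_dataset is None and eval_dataset is None), (
        "Provide either `dataset_dict` or `train_dataset`/`eval_dataset`, not both."
    )
    if dataset_dict is not None:
        # DatasetDict behaves like a regular dict of splits.
        train_dataset = dataset_dict.get("train")
        eval_dataset = dataset_dict.get("validation")
    return Trainer(
        model=model, args=args, train_dataset=train_dataset, eval_dataset=eval_dataset, **kwargs
    )
```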
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a big advantage to having `dataset_dict` as an input? It saves a couple of lines of code, but I don't really see a big argument for it.
The most common use case of Trainer is train + eval, and the preprocessing is often the same for both training and evaluation, so IMO it's more useful to support `dataset_dict`.
If we go down the road of "it just saves a few lines of code", then we can just say there is no need for a class method and the user can use the regular init ;-)
I think I would go with the more flexible `train_dataset` and `eval_dataset`, in particular since the internal names for these in a dict might be different (in particular `test` can sometimes be `validation`).
It doesn't even save you a line of code in most cases (you just split the dict in the call) so better to have it explicit imo.
I agree that from the user's perspective it can be confusing to see that the input expects a dataset dict. Asking explicitly for the train/val/test dataset splits is clearer imo.
src/transformers/trainer.py
Outdated
""" | ||
assert is_nlp_available(), "This method requires the nlp library: `pip install nlp`." | ||
if metrics is not None: | ||
compute_metrics = ComputeNLPMetrics(metrics, activation=final_activation) |
Intuitively I would say that the metric is responsible for choosing which "final_activation" to use here, so I would move this into the metrics instead and think the user should not even have to specify it. Or is it very common that the same metric has multiple possible "final_activation" functions that can be applied?
`nlp` metrics don't have an activation built in. IMO a metric is just a function; it should not have an idea of what the final activation is.
It's the task model that has an idea of the activation function, though it would not know whether it needs to take argmax (for metrics that use prediction indices, like accuracy) or softmax (for metrics that use the probabilities, like roc_score or auc_score). In the end, I think we could have a default that works most of the time by mapping models to a given final activation, but it still needs to be available as an argument to the user for all the cases that can go wrong.
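A rough sketch of what the `ComputeNLPMetrics` helper referenced in the diff could look like under that logic; the implementation details are assumptions, not the PR's final code, and it assumes `nlp` metrics expose `compute(predictions=..., references=...)`:

```python
import numpy as np

class ComputeNLPMetrics:
    """Applies a final activation to the raw predictions before calling the nlp metric(s)."""

    def __init__(self, metrics, activation=None):
        self.metrics = metrics if isinstance(metrics, (list, tuple)) else [metrics]
        self.activation = activation  # "argmax", "softmax", a callable, or None

    def __call__(self, eval_prediction):
        preds = eval_prediction.predictions
        if self.activation == "argmax":
            # Class indices, for metrics that expect predicted labels (e.g. accuracy).
            preds = np.argmax(preds, axis=-1)
        elif self.activation == "softmax":
            # Probabilities, for metrics that expect scores (e.g. ROC AUC).
            preds = np.exp(preds - preds.max(axis=-1, keepdims=True))
            preds = preds / preds.sum(axis=-1, keepdims=True)
        elif callable(self.activation):
            preds = self.activation(preds)
        results = {}
        for metric in self.metrics:
            results.update(metric.compute(predictions=preds, references=eval_prediction.label_ids))
        return results
```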
Great, LGTM!
src/transformers/trainer.py
Outdated
signature_columns += ["label", "label_ids"] | ||
dataset_columns = dataset[list(dataset.keys())[0]].column_names | ||
columns = [k for k in signature_columns if k in dataset_columns] | ||
dataset.set_format(columns=columns) |
Could we also print a `logger.warning` if some values are skipped?
Did an info instead of a warning, since it's what the API is supposed to do and users are scared of warnings. Also properly documented that the API does this.
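A minimal sketch of the behaviour being discussed, assuming an `nlp.Dataset`-like object with `column_names` and `set_format` (function name and wording of the log message are illustrative, not the PR's exact code):

```python
import inspect
import logging

logger = logging.getLogger(__name__)

def remove_unused_columns(dataset, model):
    # Columns the model's forward pass accepts, plus the usual label columns.
    signature = inspect.signature(model.forward)
    signature_columns = list(signature.parameters.keys())
    signature_columns += ["label", "label_ids"]

    # Log (at info level, as discussed above) the columns that will be dropped.
    ignored = [c for c in dataset.column_names if c not in signature_columns]
    if ignored:
        logger.info(
            f"The following columns don't have a corresponding argument in "
            f"`{model.__class__.__name__}.forward` and have been ignored: {', '.join(ignored)}."
        )

    kept = [c for c in dataset.column_names if c in signature_columns]
    dataset.set_format(columns=kept)  # stateful: modifies the dataset object in place
```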
This is sweet!
Overall, and from a user-API point of view, I'm actually wondering if we really need a new `classmethod` specifically designed for `nlp` here.
In particular, given that we've built `nlp` with the goal of being very explicit about everything and having a dataset that behaves like a normal python container.
As I see it there are a few elements added by this method:
- filtering the inputs of the model, but this is already done by default in the tokenizers themselves (to avoid adding unnecessary inputs) and maybe could be just improved by testing the format of the dataset if the dataset is an instance of `nlp.Dataset` in the init method, and setting the format if some columns cannot be handled by the model. Be aware that the format is a stateful property of the dataset, so this actually has side effects (maybe we could change this).
- handling metrics, but maybe this could be obtained by working on the `nlp` side to add support for model outputs, and on this side by allowing an `nlp.Metric` to be passed as `compute_metrics` in the normal init method.
Let's have a quick discussion about this maybe.
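A small illustration of the "stateful format" caveat mentioned in the first bullet above (the dataset and column names are only examples):

```python
import nlp  # the library under discussion (later renamed to `datasets`)

ds = nlp.load_dataset("glue", "sst2", split="validation")
print(list(ds[0].keys()))  # all columns, e.g. ['sentence', 'label', 'idx']

# set_format mutates the dataset object itself: after this call, the same object
# only returns the selected columns, a side effect visible to any code holding `ds`.
ds.set_format(columns=["label"])
print(list(ds[0].keys()))  # now only ['label']
```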
src/transformers/trainer.py
Outdated
train_dataset: "nlp.dataset_dict.Dataset" = None, | ||
eval_dataset: "nlp.dataset_dict.Dataset" = None, |
Probably rather `nlp.Dataset`.
src/transformers/trainer.py
Outdated
eval_dataset: "nlp.dataset_dict.Dataset" = None, | ||
data_collator: Optional[DataCollator] = None, | ||
metrics: Optional[Union["nlp.Metric", List["nlp.Metric"]]] = None, | ||
final_activation: Optional[Union[str, FinalActivation, Callable]] = None, |
I think I would call this `metrics_preprocessing` and accept either a callable or an activation string/enum.
Removed all the changes linked to metrics and moved the column dropping to anywhere we pass a Dataset (init, evaluate and predict). As discussed, we'll propose an API for the metrics once we have changed all examples to use `nlp`.
…ce#6449)

* Add a classmethod to easily build a Trainer from nlp dataset and metric
* Fix docstrings
* Split train/eval
* Formatting
* Log dropped columns + docs
* Authorize callable activations
* Poc for auto activation
* Be framework-agnostic
* Formatting
* Remove class method
* Remove unnecessary code
Here is a basic example of use for evaluation on SST-2:
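The original snippet is not preserved here; below is a minimal sketch of what such an evaluation could look like with the merged behaviour (automatic column dropping on `nlp` datasets). The checkpoint, preprocessing choices and metric wiring are assumptions, not the PR author's exact example.

```python
import numpy as np
import nlp
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize the SST-2 validation split; extra columns such as `sentence` and `idx`
# are dropped automatically by the Trainer thanks to this PR.
eval_dataset = nlp.load_dataset("glue", "sst2", split="validation")
eval_dataset = eval_dataset.map(
    lambda examples: tokenizer(examples["sentence"], truncation=True, padding="max_length"),
    batched=True,
)

metric = nlp.load_metric("glue", "sst2")

def compute_metrics(eval_prediction):
    predictions = np.argmax(eval_prediction.predictions, axis=-1)
    return metric.compute(predictions=predictions, references=eval_prediction.label_ids)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sst2_eval"),
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)
print(trainer.evaluate())
```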
The goal is to then refine this new API by trying to use it in all examples.