
Trainer automatically drops unused columns in nlp datasets #6449

Merged
merged 11 commits into master on Aug 20, 2020

Conversation

@sgugger (Collaborator) commented Aug 12, 2020

Here is a basic example of use for evaluation on SST-2:

from nlp import load_dataset, load_metric
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
dataset = load_dataset('glue', 'sst2')
metric = load_metric('glue', 'sst2')

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

encoded_dataset = dataset.map(lambda examples: tokenizer(examples['sentence'], padding=True), batched=True)
args = TrainingArguments(output_dir = "test")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    return metric.compute(predictions.argmax(axis=-1), labels)

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    compute_metrics=compute_metrics,
)
trainer.evaluate()

The goal is then to refine this new API by trying to use it in all the examples.

@@ -151,8 +162,6 @@ class Trainer:
compute_metrics (:obj:`Callable[[EvalPrediction], Dict]`, `optional`):
The function that will be used to compute metrics at evaluation. Must take a
:class:`~transformers.EvalPrediction` and return a dictionary string to metric values.
prediction_loss_only (:obj:`bool`, `optional`, defaults to `False`):
Collaborator Author (sgugger):

Forgot to remove this in #6426

Comment on lines 60 to 69
class FinalActivation(ExplicitEnum):
"""
Possible values for the ``final_activation`` argument in :meth:`Trainer.from_nlp_dataset`.
Useful for tab-completion in an IDE.
"""

NONE = "none"
ARGMAX = "argmax"
SIGMOID = "sigmoid"
SOFTMAX = "softmax"
Collaborator Author (sgugger):

I thought it was better to have some kind of enum for the possible final activation functions rather than letting the user provide their own.
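
For illustration, a minimal sketch of how the enum values could be resolved to actual functions applied to the raw predictions (hypothetical helper, not code from this PR):

import numpy as np

def resolve_final_activation(activation):
    # Hypothetical helper: map a FinalActivation value or its string form to a
    # callable applied to the raw predictions before they reach the metric.
    activation = getattr(activation, "value", activation)  # accept the enum or a plain string
    if activation is None or activation == "none":
        return lambda x: x
    if activation == "argmax":
        return lambda x: np.argmax(x, axis=-1)
    if activation == "sigmoid":
        return lambda x: 1.0 / (1.0 + np.exp(-x))
    if activation == "softmax":
        return lambda x: np.exp(x - x.max(axis=-1, keepdims=True)) / np.exp(x - x.max(axis=-1, keepdims=True)).sum(axis=-1, keepdims=True)
    return activation  # assume it is already a callable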

@codecov

codecov bot commented Aug 12, 2020

Codecov Report

Merging #6449 into master will decrease coverage by 2.12%.
The diff coverage is 37.03%.


@@            Coverage Diff             @@
##           master    #6449      +/-   ##
==========================================
- Coverage   79.98%   77.86%   -2.13%     
==========================================
  Files         153      153              
  Lines       28005    28031      +26     
==========================================
- Hits        22401    21827     -574     
- Misses       5604     6204     +600     
Impacted Files Coverage Δ
src/transformers/__init__.py 99.27% <ø> (ø)
src/transformers/trainer.py 37.28% <27.27%> (-0.57%) ⬇️
src/transformers/file_utils.py 82.16% <80.00%> (-0.03%) ⬇️
src/transformers/optimization.py 25.55% <0.00%> (-70.00%) ⬇️
src/transformers/pipelines.py 26.26% <0.00%> (-53.69%) ⬇️
src/transformers/optimization_tf.py 33.33% <0.00%> (-24.33%) ⬇️
src/transformers/modeling_tf_auto.py 48.79% <0.00%> (-18.08%) ⬇️
src/transformers/modeling_auto.py 64.16% <0.00%> (-14.46%) ⬇️
src/transformers/data/processors/squad.py 13.76% <0.00%> (-14.38%) ⬇️
src/transformers/modeling_tf_gpt2.py 65.68% <0.00%> (-6.16%) ⬇️
... and 13 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bc82047...69d3ec5. Read the comment docs.

model (:class:`~transformers.PreTrainedModel`):
The model to train, evaluate or use for predictions.
args (:class:`~transformers.TrainingArguments`):
The arguments to tweak training.
Contributor:

(nit) "The arguments to tweak for training"

if metrics is not None:
    compute_metrics = ComputeNLPMetrics(metrics, activation=final_activation)

# Inspect model forward signature to keep only the arguments it accepts.
Contributor:

Great functionality! Love it

cls,
model: PreTrainedModel,
args: TrainingArguments,
dataset: "nlp.dataset_dict.DatasetDict",
Contributor:

I think I would prefer not to have dataset as an input argument, but train_dataset and eval_dataset instead. IMO, this has two advantages:

  1. The user is not forced to use the whole dataset (e.g. if they only want to evaluate, it is much less costly RAM-wise to load just the eval set)
  2. The classmethod signature stays the same as the init signature

Also putting @lhoestq in cc here.

Collaborator Author (sgugger):

Alternative: we could support both by having dataset_dict, train_dataset and eval_dataset as arguments (with an assert to check that both forms are not provided at the same time).
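
A rough sketch of what that could look like on Trainer (argument handling only; names and defaults are illustrative, not taken from this PR):

@classmethod
def from_nlp_dataset(cls, model, args, dataset_dict=None, train_dataset=None, eval_dataset=None, **kwargs):
    # Sketch: accept either a full DatasetDict or explicit splits, but not both.
    assert dataset_dict is None or (train_dataset is None and eval_dataset is None), (
        "Provide either `dataset_dict` or `train_dataset`/`eval_dataset`, not both."
    )
    if dataset_dict is not None:
        train_dataset = dataset_dict.get("train")
        eval_dataset = dataset_dict.get("validation")
    return cls(model, args, train_dataset=train_dataset, eval_dataset=eval_dataset, **kwargs)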

Contributor:

Is there a big advantage to having dataset_dict as an input? It saves a couple of lines of code, but I don't really see a big argument for it

Collaborator Author (sgugger):

The most common use case of Trainer is train + eval, and the preprocessing is often the same for both training and evaluation, so IMO it's more useful to support dataset_dict.

If we go down the road of "it just saves a few lines of code", then we can just say there is no need for a class method and the user can use the regular init ;-)

Member:

I think I would go with the more flexible train_dataset and eval_dataset, in particular since the internal names for these splits in a dict might differ (test can sometimes be called validation).

It doesn't even save you a line of code in most cases (you just split the dict in the call), so better to have it explicit imo.

Member:

I agree that from the user's perspective it can be confusing to see that the input expects a dataset dict. Asking explicitly for the train/val/test dataset splits is clearer imo.

"""
assert is_nlp_available(), "This method requires the nlp library: `pip install nlp`."
if metrics is not None:
compute_metrics = ComputeNLPMetrics(metrics, activation=final_activation)
Contributor:

Intuitively I would say that the metric is responsible for choosing which "final_activation" to use here, so I would move this into the metrics instead; I think the user should not even have to specify it. Or is it very common that the same metric has multiple possible "final_activation" functions that can be applied?

Collaborator Author (sgugger):

nlp metrics don't have an activation built in. IMO a metric is just a function; it should not have any notion of what the final activation is.
It's the task model that has an idea of the activation function, though it would not know whether it needs to take the argmax (for metrics that use prediction indices, like accuracy) or the softmax (for metrics that use the probabilities, like roc_score or auc_score). In the end, I think we could have a default that works most of the time by mapping models to a given final activation, but it still needs to be available as an argument to the user for all the cases that can go wrong.
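
As a small illustration of that distinction (a sketch, not code from this PR; it assumes the nlp and scikit-learn APIs of the time):

import numpy as np
from nlp import load_metric
from sklearn.metrics import roc_auc_score

logits = np.array([[2.0, -1.0], [0.5, 1.5]])  # raw model outputs for two examples
labels = np.array([0, 1])

# Metrics based on predicted classes (e.g. accuracy) want the argmax of the logits.
accuracy_metric = load_metric("glue", "sst2")
accuracy_metric.compute(predictions=np.argmax(logits, axis=-1), references=labels)

# Metrics based on scores/probabilities (e.g. ROC AUC) want a softmax instead.
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
roc_auc_score(labels, probs[:, 1])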

@LysandreJik (Member) left a comment:

Great, LGTM!

Comment on lines 306 to 309
signature_columns += ["label", "label_ids"]
dataset_columns = dataset[list(dataset.keys())[0]].column_names
columns = [k for k in signature_columns if k in dataset_columns]
dataset.set_format(columns=columns)
Member:

Could we also print a logger.warning if some values are skipped?

Collaborator Author (sgugger):

I did an info instead of a warning, since dropping those columns is exactly what the API is supposed to do and users are scared of warnings. Also properly documented that the API does this.
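
For reference, a simplified sketch of the column dropping plus the info log (operating on a single nlp.Dataset; names approximate, not the exact diff):

import inspect
import logging

logger = logging.getLogger(__name__)

def _remove_unused_columns(model, dataset, description="training"):
    # Inspect the model forward signature to keep only the arguments it accepts,
    # plus the label columns used by nlp datasets.
    signature = inspect.signature(model.forward)
    signature_columns = list(signature.parameters.keys()) + ["label", "label_ids"]
    ignored_columns = [k for k in dataset.column_names if k not in signature_columns]
    if ignored_columns:
        logger.info(
            f"The following columns in the {description} set don't have a corresponding argument "
            f"in `{model.__class__.__name__}.forward` and have been ignored: {', '.join(ignored_columns)}."
        )
    columns = [k for k in dataset.column_names if k in signature_columns]
    dataset.set_format(columns=columns)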

@sshleifer (Contributor):

This is sweet!

@thomwolf (Member) left a comment:

Overall, and from a user-API point of view, I'm actually wondering if we really need a new classmethod specifically designed for nlp here, in particular given that we've built nlp with the goal of being very explicit about everything, with a dataset that behaves like a normal Python container.

As I see it there are a few elements added by this method:

  • filtering the inputs of the model: this is already done by default in the tokenizers themselves (to avoid adding unnecessary inputs), and it could maybe just be improved by testing the format of the dataset in the init method if the dataset is an instance of nlp.Dataset, and setting the format if some columns cannot be handled by the model. Be aware that the format is a stateful property of the dataset, so this actually has side effects (maybe we could change this).
  • handling metrics: maybe this could be obtained by working on the nlp side to add support for model outputs, and on this side by allowing an nlp.Metric to be passed as compute_metrics in the normal init method.

Let's have a quick discussion about this maybe.

Comment on lines 252 to 253
train_dataset: "nlp.dataset_dict.Dataset" = None,
eval_dataset: "nlp.dataset_dict.Dataset" = None,
Member:

This should probably be nlp.Dataset instead.

eval_dataset: "nlp.dataset_dict.Dataset" = None,
data_collator: Optional[DataCollator] = None,
metrics: Optional[Union["nlp.Metric", List["nlp.Metric"]]] = None,
final_activation: Optional[Union[str, FinalActivation, Callable]] = None,
Member:

I think I would call this metrics_preprocessing and accept either a callable or an activation string/enum

@sgugger changed the title from "Add a classmethod to easily build a Trainer from nlp dataset and metric" to "Trainer automatically drops unused columns in nlp datasets" on Aug 19, 2020
@sgugger (Collaborator Author) commented Aug 19, 2020

Removed all the changes linked to metrics and moved the column dropping to anywhere we pass a Dataset (init, evaluate and predict). As discussed, we'll propose an API for the metrics once we have changed all examples to use Trainer and nlp, so we know exactly what the API has to support.
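
Concretely, with the example from the PR description this means the raw tokenized dataset can be passed as-is (a minimal usage sketch reusing the objects defined above):

# The encoded dataset still contains the raw "sentence" and "idx" columns; they don't match any
# argument of the model's forward(), so the Trainer drops them (in __init__, evaluate and predict)
# and logs an info message listing the ignored columns.
metrics = trainer.evaluate(encoded_dataset["validation"])
predictions = trainer.predict(encoded_dataset["validation"])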

@sgugger sgugger merged commit e5f4522 into master Aug 20, 2020
@sgugger sgugger deleted the nlp_trainer branch August 20, 2020 20:29
Zigur pushed a commit to Zigur/transformers that referenced this pull request Oct 26, 2020
…ce#6449)

* Add a classmethod to easily build a Trainer from nlp dataset and metric

* Fix docstrings

* Split train/eval

* Formatting

* Log dropped columns + docs

* Authorize callable activations

* Poc for auto activation

* Be framework-agnostic

* Formatting

* Remove class method

* Remove unnecessary code