
Trainer automatically drops unused columns in nlp datasets #6449

Merged
merged 11 commits into master on Aug 20, 2020

Conversation

@sgugger (Collaborator) commented Aug 12, 2020

Here is a basic example of use for evaluation on SST-2:

from nlp import load_dataset, load_metric
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
dataset = load_dataset('glue', 'sst2')
metric = load_metric('glue', 'sst2')

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

encoded_dataset = dataset.map(lambda examples: tokenizer(examples['sentence'], padding=True), batched=True)
args = TrainingArguments(output_dir = "test")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    return metric.compute(predictions.argmax(axis=-1), labels)

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    compute_metrics=compute_metrics,
)
trainer.evaluate()

The goal is then to refine this new API by trying to use it in all the examples.

@@ -151,8 +162,6 @@ class Trainer:
compute_metrics (:obj:`Callable[[EvalPrediction], Dict]`, `optional`):
The function that will be used to compute metrics at evaluation. Must take a
:class:`~transformers.EvalPrediction` and return a dictionary string to metric values.
prediction_loss_only (:obj:`bool`, `optional`, defaults to `False`):
Collaborator Author (sgugger):

Forgot to remove this in #6426

Comment on lines 60 to 69
class FinalActivation(ExplicitEnum):
"""
Possible values for the ``final_activation`` argument in :meth:`Trainer.from_nlp_dataset`.
Useful for tab-completion in an IDE.
"""

NONE = "none"
ARGMAX = "argmax"
SIGMOID = "sigmoid"
SOFTMAX = "softmax"
Collaborator Author (sgugger):

I thought it was better to have some kind of enum for the possible final activation functions rather than letting the user provide their own.
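
For illustration, a minimal sketch of how the enum values could be resolved to actual functions applied to the raw predictions (hypothetical helper, not code from this PR):

import numpy as np

def resolve_final_activation(activation):
    # Hypothetical helper: map a FinalActivation value or its string form to a
    # callable applied to the raw predictions before they reach the metric.
    activation = getattr(activation, "value", activation)  # accept the enum or a plain string
    if activation is None or activation == "none":
        return lambda x: x
    if activation == "argmax":
        return lambda x: np.argmax(x, axis=-1)
    if activation == "sigmoid":
        return lambda x: 1.0 / (1.0 + np.exp(-x))
    if activation == "softmax":
        return lambda x: np.exp(x - x.max(axis=-1, keepdims=True)) / np.exp(x - x.max(axis=-1, keepdims=True)).sum(axis=-1, keepdims=True)
    return activation  # assume it is already a callable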

@codecov

codecov bot commented Aug 12, 2020

Codecov Report

Merging #6449 into master will decrease coverage by 2.12%.
The diff coverage is 37.03%.


@@            Coverage Diff             @@
##           master    #6449      +/-   ##
==========================================
- Coverage   79.98%   77.86%   -2.13%     
==========================================
  Files         153      153              
  Lines       28005    28031      +26     
==========================================
- Hits        22401    21827     -574     
- Misses       5604     6204     +600     
Impacted Files Coverage Δ
src/transformers/__init__.py 99.27% <ø> (ø)
src/transformers/trainer.py 37.28% <27.27%> (-0.57%) ⬇️
src/transformers/file_utils.py 82.16% <80.00%> (-0.03%) ⬇️
src/transformers/optimization.py 25.55% <0.00%> (-70.00%) ⬇️
src/transformers/pipelines.py 26.26% <0.00%> (-53.69%) ⬇️
src/transformers/optimization_tf.py 33.33% <0.00%> (-24.33%) ⬇️
src/transformers/modeling_tf_auto.py 48.79% <0.00%> (-18.08%) ⬇️
src/transformers/modeling_auto.py 64.16% <0.00%> (-14.46%) ⬇️
src/transformers/data/processors/squad.py 13.76% <0.00%> (-14.38%) ⬇️
src/transformers/modeling_tf_gpt2.py 65.68% <0.00%> (-6.16%) ⬇️
... and 13 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bc82047...69d3ec5. Read the comment docs.

model (:class:`~transformers.PreTrainedModel`):
The model to train, evaluate or use for predictions.
args (:class:`~transformers.TrainingArguments`):
The arguments to tweak training.
Contributor:

(nit) "The arguments to tweak for training"

if metrics is not None:
    compute_metrics = ComputeNLPMetrics(metrics, activation=final_activation)

# Inspect model forward signature to keep only the arguments it accepts.
Contributor:

Great functionality! Love it

cls,
model: PreTrainedModel,
args: TrainingArguments,
dataset: "nlp.dataset_dict.DatasetDict",
Contributor:

I think I would prefer not to have dataset as an input argument, but train_dataset and eval_dataset instead. IMO, this has two advantages:

  1. The user is not forced to use the whole dataset (e.g. if they only want to evaluate, it is much less costly RAM-wise to load just the eval set)
  2. The classmethod signature stays the same as the init signature

Also putting @lhoestq in cc here.

Collaborator Author (sgugger):

Alternative: we could support both by having dataset_dict, train_dataset and eval_dataset as arguments (with an assert to check that both forms are not provided at the same time).
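
A rough sketch of what that could look like on Trainer (argument handling only; names and defaults are illustrative, not taken from this PR):

@classmethod
def from_nlp_dataset(cls, model, args, dataset_dict=None, train_dataset=None, eval_dataset=None, **kwargs):
    # Sketch: accept either a full DatasetDict or explicit splits, but not both.
    assert dataset_dict is None or (train_dataset is None and eval_dataset is None), (
        "Provide either `dataset_dict` or `train_dataset`/`eval_dataset`, not both."
    )
    if dataset_dict is not None:
        train_dataset = dataset_dict.get("train")
        eval_dataset = dataset_dict.get("validation")
    return cls(model, args, train_dataset=train_dataset, eval_dataset=eval_dataset, **kwargs)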

Contributor:

Is there a big advantage to having dataset_dict as an input? It saves a couple of lines of code, but I don't really see a big argument for it

Collaborator Author (sgugger):

The most common use case of Trainer is train + eval, and the preprocessing is often the same for both training and evaluation, so IMO it's more useful to support dataset_dict.

If we go down the road of "it just saves a few lines of code", then we can just say there is no need for a class method and the user can use the regular init ;-)

Member:

I think I would go with the more flexible train_dataset and eval_dataset, in particular since the internal names for these splits in a dict might differ (test can sometimes be called validation).

It doesn't even save you a line of code in most cases (you just split the dict in the call), so better to have it explicit imo.

Member:

I agree that from the user's perspective it can be confusing to see that the input expects a dataset dict. Asking explicitly for the train/val/test dataset splits is clearer imo.

"""
assert is_nlp_available(), "This method requires the nlp library: `pip install nlp`."
if metrics is not None:
compute_metrics = ComputeNLPMetrics(metrics, activation=final_activation)
Contributor:

Intuitively I would say that the metric is responsible for choosing which "final_activation" to use here, so I would move this into the metrics instead; I think the user should not even have to specify it. Or is it very common that the same metric has multiple possible "final_activation" functions that can be applied?

Collaborator Author (sgugger):

nlp metrics don't have an activation built in. IMO a metric is just a function; it should not have any notion of what the final activation is.
It's the task model that has an idea of the activation function, though it would not know whether it needs to take the argmax (for metrics that use prediction indices, like accuracy) or the softmax (for metrics that use the probabilities, like roc_score or auc_score). In the end, I think we could have a default that works most of the time by mapping models to a given final activation, but it still needs to be available as an argument to the user for all the cases that can go wrong.
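
As a small illustration of that distinction (a sketch, not code from this PR; it assumes the nlp and scikit-learn APIs of the time):

import numpy as np
from nlp import load_metric
from sklearn.metrics import roc_auc_score

logits = np.array([[2.0, -1.0], [0.5, 1.5]])  # raw model outputs for two examples
labels = np.array([0, 1])

# Metrics based on predicted classes (e.g. accuracy) want the argmax of the logits.
accuracy_metric = load_metric("glue", "sst2")
accuracy_metric.compute(predictions=np.argmax(logits, axis=-1), references=labels)

# Metrics based on scores/probabilities (e.g. ROC AUC) want a softmax instead.
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
roc_auc_score(labels, probs[:, 1])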

@LysandreJik (Member) left a comment:

Great, LGTM!

Comment on lines 306 to 309
signature_columns += ["label", "label_ids"]
dataset_columns = dataset[list(dataset.keys())[0]].column_names
columns = [k for k in signature_columns if k in dataset_columns]
dataset.set_format(columns=columns)
Member:

Could we also print a logger.warning if some values are skipped?

Collaborator Author (sgugger):

I did an info instead of a warning, since dropping those columns is exactly what the API is supposed to do and users are scared of warnings. Also properly documented that the API does this.
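
For reference, a simplified sketch of the column dropping plus the info log (operating on a single nlp.Dataset; names approximate, not the exact diff):

import inspect
import logging

logger = logging.getLogger(__name__)

def _remove_unused_columns(model, dataset, description="training"):
    # Inspect the model forward signature to keep only the arguments it accepts,
    # plus the label columns used by nlp datasets.
    signature = inspect.signature(model.forward)
    signature_columns = list(signature.parameters.keys()) + ["label", "label_ids"]
    ignored_columns = [k for k in dataset.column_names if k not in signature_columns]
    if ignored_columns:
        logger.info(
            f"The following columns in the {description} set don't have a corresponding argument "
            f"in `{model.__class__.__name__}.forward` and have been ignored: {', '.join(ignored_columns)}."
        )
    columns = [k for k in dataset.column_names if k in signature_columns]
    dataset.set_format(columns=columns)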

@sshleifer (Contributor):

This is sweet!

@thomwolf (Member) left a comment:

Overall, and from a user-API point of view, I'm actually wondering if we really need a new classmethod specifically designed for nlp here, in particular given that we've built nlp with the goal of being very explicit about everything, with a dataset that behaves like a normal Python container.

As I see it there are a few elements added by this method:

  • filtering the inputs of the model: this is already done by default in the tokenizers themselves (to avoid adding unnecessary inputs), and it could maybe just be improved by testing the format of the dataset in the init method if the dataset is an instance of nlp.Dataset, and setting the format if some columns cannot be handled by the model. Be aware that the format is a stateful property of the dataset, so this actually has side effects (maybe we could change this).
  • handling metrics: maybe this could be obtained by working on the nlp side to add support for model outputs, and on this side by allowing an nlp.Metric to be passed as compute_metrics in the normal init method.

Let's have a quick discussion about this maybe.

Comment on lines 252 to 253
train_dataset: "nlp.dataset_dict.Dataset" = None,
eval_dataset: "nlp.dataset_dict.Dataset" = None,
Member:

This should probably be nlp.Dataset instead.

eval_dataset: "nlp.dataset_dict.Dataset" = None,
data_collator: Optional[DataCollator] = None,
metrics: Optional[Union["nlp.Metric", List["nlp.Metric"]]] = None,
final_activation: Optional[Union[str, FinalActivation, Callable]] = None,
Member:

I think I would call this metrics_preprocessing and accept either a callable or an activation string/enum

@sgugger changed the title from "Add a classmethod to easily build a Trainer from nlp dataset and metric" to "Trainer automatically drops unused columns in nlp datasets" on Aug 19, 2020
@sgugger (Collaborator Author) commented Aug 19, 2020

Removed all the changes linked to metrics and moved the column dropping to anywhere we pass a Dataset (init, evaluate and predict). As discussed, we'll propose an API for the metrics once we have changed all examples to use Trainer and nlp, so we know exactly what the API has to support.
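
Concretely, with the example from the PR description this means the raw tokenized dataset can be passed as-is (a minimal usage sketch reusing the objects defined above):

# The encoded dataset still contains the raw "sentence" and "idx" columns; they don't match any
# argument of the model's forward(), so the Trainer drops them (in __init__, evaluate and predict)
# and logs an info message listing the ignored columns.
metrics = trainer.evaluate(encoded_dataset["validation"])
predictions = trainer.predict(encoded_dataset["validation"])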

@sgugger sgugger merged commit e5f4522 into master Aug 20, 2020
@sgugger sgugger deleted the nlp_trainer branch August 20, 2020 20:29
Zigur pushed a commit to Zigur/transformers that referenced this pull request Oct 26, 2020
…ce#6449)

* Add a classmethod to easily build a Trainer from nlp dataset and metric

* Fix docstrings

* Split train/eval

* Formatting

* Log dropped columns + docs

* Authorize callable activations

* Poc for auto activation

* Be framework-agnostic

* Formatting

* Remove class method

* Remove unnecessary code