
Replace as_target context managers by direct calls #18325

Merged
merged 11 commits into main from as_target_context_managers on Jul 29, 2022

Conversation

@sgugger (Collaborator) commented Jul 27, 2022

What does this PR do?

This PR deprecates the context managers as_target_tokenizer and as_target_processor in favor of passing more arguments to the __call__ method (or the pad method for certain processors).

Let's look at one example for a tokenizer in a seq2seq task. The current workflow is:

# Tokenize inputs
model_inputs = tokenizer(inputs, max_length=128, truncation=True)
# Tokenize labels inside the context manager
with tokenizer.as_target_tokenizer():
    labels = tokenizer(targets, max_length=128, truncation=True)
# Put everything together
model_inputs["labels"] = labels["input_ids"]

After this PR, this simply becomes:

model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)

which is more natural and much easier.
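For reference, here is a minimal, self-contained sketch of the new call (the checkpoint and sentences are illustrative; the exact keys returned depend on the tokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
inputs = ["translate English to German: The house is wonderful."]
targets = ["Das Haus ist wunderbar."]
model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
# The encoded targets are stored directly under the "labels" key
print(model_inputs.keys())  # dict_keys(['input_ids', 'attention_mask', 'labels'])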

It gets trickier if:

  1. you want to tokenize the targets with different keyword arguments, or
  2. you want to add more than the input IDs of the targets to your model inputs.

In those cases you still need two calls:
# Tokenize inputs
model_inputs = tokenizer(inputs, max_length=128, truncation=True)
# Tokenize labels
labels = tokenizer(text_target=targets, max_length=64, truncation=True)
# Put everything together
model_inputs["labels"] = labels["input_ids"]
model_inputs["labels_mask"] = labels["attention_mask"]

As before, if you forget to tell the tokenizer that you are tokenizing labels (now by passing them as text_target=..., previously by tokenizing under the context manager), the labels will be tokenized like the inputs.
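To make the failure mode concrete, a sketch (the distinction matters most for tokenizers with target-specific preprocessing, such as the target-language codes of multilingual tokenizers like MBart):

# Correct: the tokenizer knows these are labels
labels = tokenizer(text_target=targets, max_length=128, truncation=True)

# Silently wrong for such tokenizers: the targets go through the input
# preprocessing, so any target-specific special tokens are skipped
labels = tokenizer(targets, max_length=128, truncation=True)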

For processors, the same changes apply, except that you can use modality names directly:

input_values = processor(ds[0]["audio"]["array"], return_tensors="pt")
with processor.as_target_processor():
    labels = processor(ds[0]["text"], return_tensors="pt")
input_values["labels"] = labels["input_ids"]

can now simply be:

input_values = processor(audio=ds[0]["audio"]["array"], text=ds[0]["text"], return_tensors="pt")

As before, you can also do it in two separate calls (one with audio, one with text) if you need different keyword arguments for each modality, or want a more complex merge than just taking the labels' input IDs.
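For instance, a sketch of the two-call variant (assuming a Wav2Vec2-style processor, whose audio path accepts sampling_rate; the argument values are illustrative):

# One call per modality, each with its own keyword arguments
input_values = processor(audio=ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
labels = processor(text=ds[0]["text"], return_tensors="pt")
# Merge however you need
input_values["labels"] = labels["input_ids"]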

Padding is handled the same way. Previously you had to write something like this:

batch = self.processor.pad(input_features, padding=self.padding, return_tensors="pt")
with self.processor.as_target_processor():
    labels_batch = self.processor.pad(label_features, padding=self.padding, return_tensors="pt")
batch["labels"] = labels_batch["input_ids"]

This can now be done with:

batch = self.processor.pad(input_features, labels=label_features, padding=self.padding, return_tensors="pt")

or in two calls as before if something more involved is needed (different keyword arguments for the labels, or accessing more than the labels' input IDs).
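For instance, a sketch of that two-call variant (the padding strategies are illustrative):

# Pad inputs and labels with different strategies
batch = self.processor.pad(input_features, padding=self.padding, return_tensors="pt")
labels_batch = self.processor.pad(labels=label_features, padding="longest", return_tensors="pt")
# Keep more than just the input IDs if needed
batch["labels"] = labels_batch["input_ids"]
batch["labels_mask"] = labels_batch["attention_mask"]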

This introduces no breaking changes.

The current version does not touch any of the documentation, examples, or tests (to double-check nothing breaks); those will need to be adapted. This can be done in this PR or in follow-ups if you prefer lighter diffs.

@HuggingFaceDocBuilderDev commented Jul 27, 2022

The documentation is not available anymore as the PR was closed or merged.

@amyeroberts (Collaborator) left a comment

All looks good to me!

@patrickvonplaten (Contributor) left a comment
Checked the audio processors and it looks very nice to me! Thanks for doing the refactor here. Like the simple naming of "audio" and "text".

@LysandreJik (Member) left a comment
Thanks for your efforts, this looks good!

Comment on lines -149 to +150

-         batch = self.processor.pad(
-             input_features,
-             padding=self.padding,
-             return_tensors="pt",
-         )
-         with self.processor.as_target_processor():
-             labels_batch = self.processor.pad(
-                 label_features,
-                 padding=self.padding,
-                 return_tensors="pt",
-             )
+         batch = self.processor.pad(input_features, padding=self.padding, return_tensors="pt")
+         labels_batch = self.processor.pad(labels=label_features, padding=self.padding, return_tensors="pt")
A reviewer (Member) left a comment
That's a clean API :)

@amyeroberts (Collaborator) left a comment

LGTM! Thanks for doing this refactor - the result looks 🤩

Only one comment where I think the call to the tokenizer might want to be different.

@sgugger sgugger merged commit 986526a into main Jul 29, 2022
@sgugger sgugger deleted the as_target_context_managers branch July 29, 2022 12:09
oneraghavan pushed a commit to oneraghavan/transformers that referenced this pull request Sep 26, 2022
* Preliminary work on tokenizers

* Quality + fix tests

* Treat processors

* Fix pad

* Remove all uses of  in tests, docs and examples

* Replace all as_target_tokenizer

* Fix tests

* Fix quality

* Update examples/flax/image-captioning/run_image_captioning_flax.py

Co-authored-by: amyeroberts <amy@huggingface.co>

* Style

Co-authored-by: amyeroberts <amy@huggingface.co>