
Replace as_target context managers by direct calls #18325

Merged
merged 11 commits into main from as_target_context_managers on Jul 29, 2022

Conversation

@sgugger (Collaborator) commented Jul 27, 2022

What does this PR do?

This PR deprecates the context managers as_target_tokenizer and as_target_processor in favor of passing more arguments to the __call__ method (or the pad method for certain processors).

Let's look at one example for a tokenizer in a seq2seq task. The current workflow is:

# Tokenize inputs
model_inputs = tokenizer(inputs, max_length=128, truncation=True)
# Tokenize labels inside the context manager
with tokenizer.as_target_tokenizer():
    labels = tokenizer(targets, max_length=128, truncation=True)
# Put everything together
model_inputs["labels"] = labels["input_ids"]

After this PR, this simply becomes:

model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)

which is more natural and much easier.
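For reference, here is a minimal, self-contained sketch of the new call (the checkpoint and sentences are illustrative; the exact keys returned depend on the tokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
inputs = ["translate English to German: The house is wonderful."]
targets = ["Das Haus ist wunderbar."]
model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
# The encoded targets are stored directly under the "labels" key
print(model_inputs.keys())  # dict_keys(['input_ids', 'attention_mask', 'labels'])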

It gets trickier if:

  1. you want to tokenize the targets with different keyword arguments, or
  2. you want to add more than the input IDs of the targets to your model inputs.

In those cases you still need two calls:
# Tokenize inputs
model_inputs = tokenizer(inputs, max_length=128, truncation=True)
# Tokenize labels
labels = tokenizer(text_target=targets, max_length=64, truncation=True)
# Put everything together
model_inputs["labels"] = labels["input_ids"]
model_inputs["labels_mask"] = labels["attention_mask"]

As before, if you forget to tell the tokenizer that you are tokenizing labels (now by passing them as text_target=..., previously by tokenizing under the context manager), the labels will be tokenized like the inputs.
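To make the failure mode concrete, a sketch (the distinction matters most for tokenizers with target-specific preprocessing, such as the target-language codes of multilingual tokenizers like MBart):

# Correct: the tokenizer knows these are labels
labels = tokenizer(text_target=targets, max_length=128, truncation=True)

# Silently wrong for such tokenizers: the targets go through the input
# preprocessing, so any target-specific special tokens are skipped
labels = tokenizer(targets, max_length=128, truncation=True)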

For processors, the same changes apply, except that you can use modality names directly:

input_values = processor(ds[0]["audio"]["array"], return_tensors="pt")
with processor.as_target_processor():
    labels = processor(ds[0]["text"], return_tensors="pt")
input_values["labels"] = labels["input_ids"]

can now simply be:

input_values = processor(audio=ds[0]["audio"]["array"], text=ds[0]["text"], return_tensors="pt")

As before, you can also do it in two separate calls (one with audio, one with text) if you need different keyword arguments for each modality, or want a more complex merge than just taking the labels' input IDs.
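For instance, a sketch of the two-call variant (assuming a Wav2Vec2-style processor, whose audio path accepts sampling_rate; the argument values are illustrative):

# One call per modality, each with its own keyword arguments
input_values = processor(audio=ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
labels = processor(text=ds[0]["text"], return_tensors="pt")
# Merge however you need
input_values["labels"] = labels["input_ids"]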

Padding is handled the same way. Previously you had to write something like this:

batch = self.processor.pad(input_features, padding=self.padding, return_tensors="pt")
with self.processor.as_target_processor():
    labels_batch = self.processor.pad(label_features, padding=self.padding, return_tensors="pt")
batch["labels"] = labels_batch["input_ids"]

This can now be done with:

batch = self.processor.pad(input_features, labels=label_features, padding=self.padding, return_tensors="pt")

or in two calls as before if something more involved is needed (different keyword arguments for the labels, or accessing more than the labels' input IDs).
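For instance, a sketch of that two-call variant (the padding strategies are illustrative):

# Pad inputs and labels with different strategies
batch = self.processor.pad(input_features, padding=self.padding, return_tensors="pt")
labels_batch = self.processor.pad(labels=label_features, padding="longest", return_tensors="pt")
# Keep more than just the input IDs if needed
batch["labels"] = labels_batch["input_ids"]
batch["labels_mask"] = labels_batch["attention_mask"]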

This introduces no breaking changes.

The current version does not touch any of the documentation, examples, or tests (to double-check nothing breaks); those will need to be adapted. This can be done in this PR or in follow-ups if you prefer lighter diffs.

@HuggingFaceDocBuilderDev commented Jul 27, 2022

The documentation is not available anymore as the PR was closed or merged.

@amyeroberts (Collaborator) left a comment

All looks good to me!

@patrickvonplaten (Contributor) left a comment
Checked the audio processors and it looks very nice to me! Thanks for doing the refactor here. Like the simple naming of "audio" and "text".

@LysandreJik (Member) left a comment
Thanks for your efforts, this looks good!

Comment on lines -149 to +150

-         batch = self.processor.pad(
-             input_features,
-             padding=self.padding,
-             return_tensors="pt",
-         )
-         with self.processor.as_target_processor():
-             labels_batch = self.processor.pad(
-                 label_features,
-                 padding=self.padding,
-                 return_tensors="pt",
-             )
+         batch = self.processor.pad(input_features, padding=self.padding, return_tensors="pt")
+         labels_batch = self.processor.pad(labels=label_features, padding=self.padding, return_tensors="pt")
A reviewer (Member) left a comment
That's a clean API :)

@amyeroberts (Collaborator) left a comment

LGTM! Thanks for doing this refactor - the result looks 🤩

Only one comment where I think the call to the tokenizer might want to be different.

@sgugger sgugger merged commit 986526a into main Jul 29, 2022
@sgugger sgugger deleted the as_target_context_managers branch July 29, 2022 12:09
oneraghavan pushed a commit to oneraghavan/transformers that referenced this pull request Sep 26, 2022
* Preliminary work on tokenizers

* Quality + fix tests

* Treat processors

* Fix pad

* Remove all uses of  in tests, docs and examples

* Replace all as_target_tokenizer

* Fix tests

* Fix quality

* Update examples/flax/image-captioning/run_image_captioning_flax.py

Co-authored-by: amyeroberts <amy@huggingface.co>

* Style

Co-authored-by: amyeroberts <amy@huggingface.co>