New run_clm script #8105
Conversation
Very nice!
Just a few comments and suggestions.
from transformers.trainer_utils import is_main_process

logger = logging.getLogger(__name__)
Should we use `transformers`'s library `logging`? (cc @LysandreJik)
No, for the script, we should use the regular one. @LysandreJik had a very long explanation of why, which I don't remember.
The gist of it is that imo the `transformers` `logging` utility should only be used to control the logging of the `transformers` module, not of the users' scripts directly, as it is not made for that and would lead to very weird behavior.
In my opinion the control of logging in a user script should contain both:

import logging
from transformers import logging as hf_logging

hf_logging.set_verbosity_xxx()
logger = logging.getLogger(__name__)

# then do stuff with the logger without worrying about the HF logging, which has already been managed before
logger.warning("xxx")
def tokenize_function(examples):
    return tokenizer(examples[text_column_name])

tokenized_datasets = datasets.map(
In the two calls to `map` (here and below) it could be nice to add a reference to multi-processing with `num_proc` (and maybe a link to the doc: https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map).
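For illustration, a multi-processed version of the tokenization step could look like the sketch below; the dataset name, checkpoint, column name and worker count are assumptions made for the example, not values from the PR:

from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("wikitext", "wikitext-2-raw-v1")  # example dataset
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example checkpoint

def tokenize_function(examples):
    return tokenizer(examples["text"])

# num_proc runs the preprocessing in several worker processes.
tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    num_proc=4,  # arbitrary worker count for the example
    remove_columns=["text"],
)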
We could do the same thing to the `run_glue` script too, in passing.
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
Great, LGTM!
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
Do we need the copyright to Google AI and NVIDIA? Are there some snippets taken from their codebases?
No, it's a bad copy-paste.
""" | ||
Fine-tuning the library models for causal language modeling (GPT, GPT-2, CTRL) on a text file or a dataset. | ||
""" |
From experience, users will understand that only GPT, GPT-2 and CTRL are supported by that script. I would put `(GPT, GPT-2, CTRL, ...)` instead, and provide a link:
""" | |
Fine-tuning the library models for causal language modeling (GPT, GPT-2, CTRL) on a text file or a dataset. | |
""" | |
""" | |
Fine-tuning the library models for causal language modeling (GPT, GPT-2, CTRL, ...) on a text file or a dataset. | |
Find the full list of model architectures that can be fine-tuned by this script on the documentation: | |
https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoModelWithLMHead | |
""" |
But that might be a bit too much. Maybe adding a README would be simpler.
Not AutoModelWithLMHead, just CausalLM, but I can add that.
Maybe the link is overkill; I just had an issue with `(GPT, GPT-2, CTRL)`, which seems to imply that only those three models are supported.
Or this link? https://huggingface.co/models?filter=lm-head
This works too, but it shows checkpoints, whereas this script can also train from scratch, so showing architectures would probably be better.
No, this link shows all kinds of LMs. The script will only work with a model that can be loaded with `AutoModelForCausalLM` (since it uses that class).
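As a small illustration of what the script relies on (the gpt2 checkpoint below is only an example, not a requirement):

from transformers import AutoConfig, AutoModelForCausalLM

# Fine-tuning an existing checkpoint: any architecture registered for causal LM works.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Training from scratch: instantiate the same architecture from a config only.
config = AutoConfig.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_config(config)

# A masked-LM-only architecture such as BERT would raise a ValueError in both cases.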
(the other one is the deprecated one, will remove soon)
Great!
logger.warning(
    f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
    + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
)
# Set the verbosity to info of the Transformers logger (on main process only):
if is_main_process(training_args.local_rank):
    transformers.utils.logging.set_verbosity_info()
logger.info("Training/evaluation parameters %s", training_args)
That's exactly what I'm talking about :)
if data_args.block_size <= 0:
    block_size = tokenizer.max_len
else:
    block_size = min(data_args.block_size, tokenizer.max_len)
Should we print a warning here to tell the user their `block_size` isn't going to be used if it's larger than the tokenizer's max length?
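One possible shape for that warning, just as a sketch (the exact wording and placement are made up here):

if data_args.block_size <= 0:
    block_size = tokenizer.max_len
else:
    if data_args.block_size > tokenizer.max_len:
        # Hypothetical warning: tell the user the requested value will be capped.
        logger.warning(
            f"The block_size passed ({data_args.block_size}) is larger than the maximum length supported by the"
            f" model ({tokenizer.max_len}). Using block_size={tokenizer.max_len} instead."
        )
    block_size = min(data_args.block_size, tokenizer.max_len)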
I can add that.
LGTM
* New run_clm script
* Formatting
* More comments
* Remove unused imports
* Apply suggestions from code review

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

* Address review comments
* Change link to the hub

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
This reverts commit 3bd3960.
What does this PR do?
This PR adds an example of causal language modeling fine-tuning (or training from scratch) using the 🤗 Datasets library. It supports loading a dataset via its name (from the Hub) or from local files. A test of training on a small text file is added.