Add-support-for-examples-scripts-to-run-on-sagemaker #9367
Conversation
Added a few comments before others chime in and review this.
Orthogonal to this PR, but in the HuggingFace SageMaker extension, would we be able to tweak it so those two lines are simpler:

```python
estimator = HuggingFace(
    entry_point='run_glue.py',
    source_dir='../../transformers/examples/text-classification',
)
```

maybe just

```python
estimator = HuggingFace(
    task_name='text-classification',
)
```

Or do you think that would be too different from the standard SageMaker convention?
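A `task_name` shorthand could sit as a thin mapping layer on top of the existing `entry_point`/`source_dir` parameters. A minimal sketch, assuming a hypothetical lookup table (the task names, script names, and paths below are illustrative, not part of any released SDK):

```python
import os

# Root of the examples tree, as used in the estimator call above.
EXAMPLES_ROOT = "../../transformers/examples"

# Hypothetical map from a short task name to (subdirectory, entry point).
TASK_SCRIPTS = {
    "text-classification": ("text-classification", "run_glue.py"),
    "question-answering": ("question-answering", "run_qa.py"),
}

def resolve_task(task_name):
    """Return (entry_point, source_dir) for a known task name."""
    if task_name not in TASK_SCRIPTS:
        raise ValueError(f"Unknown task: {task_name!r}")
    subdir, script = TASK_SCRIPTS[task_name]
    return script, os.path.join(EXAMPLES_ROOT, subdir)
```

The wrapper would then forward the resolved pair to the regular estimator constructor, keeping the standard SageMaker parameters available as an escape hatch.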
```python
if is_run_on_sagemaker():
    # Need to save the state, since Trainer.save_model saves only the tokenizer with the model
    trainer.state.save_to_json(os.path.join(os.environ["SM_MODEL_DIR"], "trainer_state.json"))
else:
    # Need to save the state, since Trainer.save_model saves only the tokenizer with the model
    trainer.state.save_to_json(os.path.join(training_args.output_dir, "trainer_state.json"))
#### $custom end ####
```
This `if` clause should not be needed if we tweak `.output_dir` above, right?
Yes. But we have to be careful, since in SageMaker we have two output directories: one for logs and files (e.g. `eval_result.txt`) and the other for the model and associated files (config etc.). `SM_MODEL_DIR` represents the model directory and `SM_OUTPUT_DATA_DIR` represents the directory for logs and files.
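Resolving both directories once, up front, would let the rest of the script stay branch-free. A minimal sketch, assuming only the two environment variables named above (`resolve_output_dirs` is a hypothetical helper, not existing API):

```python
import os

def resolve_output_dirs(default_dir):
    """Return (model_dir, data_dir), preferring the SageMaker environment
    variables when they are set and falling back to the script's default."""
    model_dir = os.environ.get("SM_MODEL_DIR", default_dir)
    data_dir = os.environ.get("SM_OUTPUT_DATA_DIR", default_dir)
    return model_dir, data_dir
```

Outside SageMaker both values collapse to the single `output_dir` the script already uses, so existing behavior is unchanged.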
Yes, that we can probably add "upstream" in `TrainingArguments` if we don't already have it (the possibility of having different output dirs for the model and the logs).
```python
##### $custom ####
if is_run_on_sagemaker():
    trainer.save_model(os.environ["SM_MODEL_DIR"])  # Saves the tokenizer too for easy upload
else:
    trainer.save_model()  # Saves the tokenizer too for easy upload
#### $custom end ####
```
Same here: we shouldn't need this `if`, no?
Thanks for the PR!
We may run into some trouble when changing the official example scripts, for two reasons:
- They are supposed to run with a source install, not the latest release, as we keep them up to date with master at all times (which is why you have a problem with the `TrainerOutput` `metrics` field here; it's not a bug, just a feature that is on master but not yet released).
- They are supposed to be as simple as possible, so ideally we don't want to add a lot of custom code just for SageMaker.
If we can't address 1, we will definitely need to have special versions of the example scripts for SageMaker (adapted from the official examples at each release).
For 2, we can have a few extra lines here and there, but would ideally need to keep the SageMaker-specific code to a minimum. From what I'm looking at, most of it can be done in the function that prepares the arguments, if we properly set the `output_dir`.
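Keeping the SageMaker-specific code in the argument-preparation step could look roughly like this: rewrite the argument list once, before `HfArgumentParser` runs, so no later branches are needed. A sketch under the assumption that `SM_OUTPUT_DATA_DIR` is the desired default (the function name is illustrative):

```python
import os

def prepare_argv(argv):
    """Inject --output_dir from the SageMaker environment when the caller
    did not pass one, so the rest of the script needs no special branches."""
    argv = list(argv)
    if "--output_dir" not in argv and "SM_OUTPUT_DATA_DIR" in os.environ:
        argv += ["--output_dir", os.environ["SM_OUTPUT_DATA_DIR"]]
    return argv
```

Outside SageMaker the environment variable is absent and the argument list passes through untouched.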
@sgugger's 1st point is a good point and we should probably either:

```python
estimator = HuggingFace(
    task_name="text-classification",
    dataset="imdb",
    from_model="distilbert-base-cased",
    publish_model="my-fine-tuned-model",
    huggingface_token="...",
)
```

(Then there's not even a need to have a free-standing script. My question on whether this is SageMaker-idiomatic still stands.)
That's true, both of you are right. We must be able to ensure that the correct script version is used for the correct transformers & datasets version within the container image. I would not bundle them into the official DLC container since there is always that need to have an

So for this version, we could use the script from https://github.com/huggingface/transformers/tree/v4.1.1.
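Pinning the example script to the release that matches the container's pinned `transformers` version could be as simple as deriving the script URL from the release tag. A sketch with a hypothetical helper (the URL template assumes the v4.x examples layout, where `run_glue.py` lives under `examples/text-classification/`):

```python
# Hypothetical helper: derive the example-script URL from the transformers
# release tag baked into the container image.
SCRIPT_TEMPLATE = (
    "https://github.com/huggingface/transformers/blob/"
    "v{version}/examples/text-classification/run_glue.py"
)

def pinned_script_url(transformers_version):
    """Return the run_glue.py URL for a given transformers release tag."""
    return SCRIPT_TEMPLATE.format(version=transformers_version)
```

The extension could then download that exact revision at job-creation time instead of shipping a possibly mismatched copy inside the image.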
Closed by stale bot. If this shouldn't have been closed, let me know.
Hello guys,

I am currently working on how we could edit/extend the fine-tuning scripts from `examples/` to work out of the box within SageMaker. For that I adjusted the `run_glue.py` script. To test it, I created a custom HuggingFace extension for SageMaker, where I created a SageMaker-compatible Docker container and a HuggingFace estimator.
The container was built with `transformers==4.1.1` and `datasets==1.1.3`. That is also the reason why I only adjusted `run_glue.py` and not any other files: I can pass `run_glue.py` into the SageMaker training job dynamically, but if I adjusted any other files I would have to rebuild the container. For all the functions that should move to a different directory, I added a comment `# Should be moved to path_to_file/filename.py`.

As an example of how you could use this to create a SageMaker training job using the extension I built, you would create a `HuggingFace()` estimator and then call `.fit()`. The example I used is demonstrated below, or you can find it in this GitHub repository.

Note: SageMaker requirements
In SageMaker you can define hyperparameters, which get passed into the training script via the `HuggingFace(hyperparameters={})` dictionary. These parameters are then passed into the training script as named arguments, so the hyperparameters from the example look like this when the training script is executed: `--do_eval True --do_train True --learning_rate 2e-05 --max_seq_length 128 --model_name_or_path distilbert-base-cased --num_train_epochs 3.0 --output_dir Not defined sagemaker --per_device_train_batch_size 32 --task_name MRPC`.
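The dict-to-flags conversion described above can be sketched as follows (a simplification of what the SageMaker toolkit does, not its actual code; keys are sorted here only to match the alphabetical order shown in the example):

```python
def hyperparameters_to_argv(hyperparameters):
    """Flatten a SageMaker-style hyperparameters dict into the named CLI
    arguments the training script receives."""
    argv = []
    for key in sorted(hyperparameters):
        # Every value, including booleans, is stringified and appended
        # after its flag, which is why "--do_train True" appears above.
        argv += [f"--{key}", str(hyperparameters[key])]
    return argv
```

This stringification of booleans is exactly what forces the `sys.argv` cleanup described below.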
How I proceeded

I created a function `is_run_on_sagemaker()` to determine whether the script is running in a SageMaker runtime environment. This function should be moved to the `transformers/src/transformers/file_utils.py` file.

I had to adjust `sys.argv` because:
- `TrainingArguments` expects the parameter `output_dir`, but in a SageMaker runtime the output dir is defined by the environment variable `SM_OUTPUT_DATA_DIR`.
- `TrainingArguments` does not expect a `True` value for boolean parameters: if `--do_train` exists it's `True`, otherwise it's `False`. In SageMaker you cannot pass keys only, so I removed all the `True`s from `sys.argv` at the beginning. A better solution could be to adjust `HfArgumentParser` to accept `'True'` for boolean arguments.
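The two helpers described above might look roughly like this (a sketch; the detection heuristic and the set of boolean flags are assumptions based on the `SM_*` variables discussed in this thread):

```python
import os

def is_run_on_sagemaker():
    """SageMaker training containers export SM_* environment variables
    such as SM_MODEL_DIR, so their presence is a reasonable runtime check."""
    return "SM_MODEL_DIR" in os.environ

def strip_boolean_values(argv, flags=("--do_train", "--do_eval")):
    """Drop the literal 'True' that SageMaker appends after boolean flags,
    since HfArgumentParser expects bare flags for booleans."""
    cleaned = []
    skip_next = False
    for arg in argv:
        if skip_next and arg == "True":
            skip_next = False
            continue
        skip_next = arg in flags
        cleaned.append(arg)
    return cleaned
```

Non-boolean arguments pass through untouched, so the cleanup is safe to run unconditionally at script start.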
Therefore I created a `parse_sagemaker_args()` function, which:
- replaces `--output_dir` with the correct value for SageMaker;
- handles the training data passed to `.fit()`, which is on S3 and is downloaded before the training starts. I added two options: you can either pass the direct S3 URI of a file (e.g. `s3://my-data-bucket/path/to/my/training/data.csv`), or pass a path (e.g. `s3://my-data-bucket/path/to/data`) and supply the file as the hyperparameter `train_file`;
- removes the `True`s from `sys.argv` for the boolean parameters.

I also adjusted all the file-saving and model-saving sections and added conditions for when the script is run on SageMaker.
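The steps above could be sketched together as one function (simplified; the channel handling and environment-variable fallbacks are assumptions, and `SM_CHANNEL_TRAIN` assumes the `.fit()` channel is named `train`):

```python
import os
import sys

def parse_sagemaker_args(argv=None):
    """Rewrite sys.argv-style arguments for a SageMaker runtime: strip the
    'True' after boolean flags, set --output_dir from the environment, and
    point --train_file at the downloaded S3 channel."""
    argv = list(sys.argv[1:] if argv is None else argv)
    cleaned, skip = [], False
    for arg in argv:
        if skip and arg == "True":
            skip = False
            continue
        skip = arg in ("--do_train", "--do_eval", "--do_predict")
        cleaned.append(arg)
    if "--output_dir" not in cleaned and "SM_OUTPUT_DATA_DIR" in os.environ:
        cleaned += ["--output_dir", os.environ["SM_OUTPUT_DATA_DIR"]]
    # Training data passed to .fit() is downloaded into the 'train' channel.
    train_dir = os.environ.get("SM_CHANNEL_TRAIN")
    if train_dir and "--train_file" in cleaned:
        i = cleaned.index("--train_file") + 1
        cleaned[i] = os.path.join(train_dir, cleaned[i])
    return cleaned
```

The direct-S3-URI option mentioned above is omitted here for brevity; it would only require detecting an `s3://` prefix before joining with the channel directory.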
Testing

I tested it using the Jupyter notebook I provided at the top. The log of the training script is attached.

For local testing you can run this script; it will add all the required SageMaker environment variables to the script.
I would love to receive suggestions for improvement.

If it looks okay to you, I would move `is_run_on_sagemaker()` to the correct path and we could merge it.

P.S. I also added a fix for the `train_result.metrics` issue (a mistake on my side): https://discuss.huggingface.co/t/attributeerror-trainoutput-object-has-no-attribute-metrics-when-finetune-custom-dataset/2970