Describe the bug
When attempting to train locally with a simple script using a Hugging Face training container (i.e. from here), I get the following error:
jq: error: Could not open file /opt/ml/input/config/resourceconfig.json: No such file or directory
To reproduce
The local training script is as follows:
This is run locally with Python and produces the error above. I don't think the contents of train.py are relevant, as the failure happens while the training environment is being set up. The train and val inputs are tokenized tensors in Arrow format, written by calling .save_to_disk on a Hugging Face datasets Dataset object.
Note that the same error also occurred with previous versions of the training container.
Expected behavior
Training should complete as it does when running in SageMaker. The same configuration runs fine as part of a SageMaker pipeline on SageMaker-managed instances. Local training also worked previously, and we can't isolate what changed to cause the above error.
Screenshots or logs
Full logs:
INFO:sagemaker:Creating training-job with name: training-2023-09-08-09-31-48-957
INFO:sagemaker.local.local_session:Starting training job
WARNING:sagemaker.local.image:Using the short-lived AWS credentials found in session. They might expire while running.
INFO:sagemaker.local.image:docker compose file:
networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-si1mk:
    command: train
    container_name: c1sy6fk9n1-algo-1-si1mk
    environment:
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    image: 763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-training:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04
    networks:
      sagemaker-local:
        aliases:
        - algo-1-si1mk
    stdin_open: true
    tty: true
    volumes:
    - /private/var/folders/dr/n0xslz555m128480ykmyr0t40000gp/T/tmp5jg5mcvx/algo-1-si1mk/input:/opt/ml/input
    - /private/var/folders/dr/n0xslz555m128480ykmyr0t40000gp/T/tmp5jg5mcvx/algo-1-si1mk/output:/opt/ml/output
    - /private/var/folders/dr/n0xslz555m128480ykmyr0t40000gp/T/tmp5jg5mcvx/algo-1-si1mk/output/data:/opt/ml/output/data
    - /private/var/folders/dr/n0xslz555m128480ykmyr0t40000gp/T/tmp5jg5mcvx/model:/opt/ml/model
    - /Users/owenturner/dev/banquo-bert/tests/test_output_data/preprocessed/train:/opt/ml/input/data/train
    - /Users/owenturner/dev/banquo-bert/tests/test_output_data/preprocessed/val:/opt/ml/input/data/val
version: '2.3'
INFO:sagemaker.local.image:docker command: docker-compose -f /private/var/folders/dr/n0xslz555m128480ykmyr0t40000gp/T/tmp5jg5mcvx/docker-compose.yaml up --build --abort-on-container-exit
Creating network "sagemaker-local" with the default driver
Creating c1sy6fk9n1-algo-1-si1mk ...
Creating c1sy6fk9n1-algo-1-si1mk ... done
Attaching to c1sy6fk9n1-algo-1-si1mk
c1sy6fk9n1-algo-1-si1mk | jq: error: Could not open file /opt/ml/input/config/resourceconfig.json: No such file or directory
c1sy6fk9n1-algo-1-si1mk | changehostname.c: In function ‘gethostname’:
c1sy6fk9n1-algo-1-si1mk | changehostname.c:15:21: error: expected expression before ‘;’ token
c1sy6fk9n1-algo-1-si1mk | 15 | const char *val = ;
c1sy6fk9n1-algo-1-si1mk | | ^
c1sy6fk9n1-algo-1-si1mk | gcc: error: changehostname.o: No such file or directory
c1sy6fk9n1-algo-1-si1mk | ERROR: ld.so: object '/libchangehostname.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
c1sy6fk9n1-algo-1-si1mk | Reporting training FAILURE
c1sy6fk9n1-algo-1-si1mk | Framework Error:
c1sy6fk9n1-algo-1-si1mk | Traceback (most recent call last):
c1sy6fk9n1-algo-1-si1mk |   File "/opt/conda/lib/python3.10/site-packages/sagemaker_training/trainer.py", line 70, in train
c1sy6fk9n1-algo-1-si1mk |     env = environment.Environment()
c1sy6fk9n1-algo-1-si1mk |   File "/opt/conda/lib/python3.10/site-packages/sagemaker_training/environment.py", line 576, in __init__
c1sy6fk9n1-algo-1-si1mk |     resource_config = resource_config or read_resource_config()
c1sy6fk9n1-algo-1-si1mk |   File "/opt/conda/lib/python3.10/site-packages/sagemaker_training/environment.py", line 254, in read_resource_config
c1sy6fk9n1-algo-1-si1mk |     return _read_json(resource_config_file_dir)
c1sy6fk9n1-algo-1-si1mk |   File "/opt/conda/lib/python3.10/site-packages/sagemaker_training/environment.py", line 201, in _read_json
c1sy6fk9n1-algo-1-si1mk |     with open(path, "r") as f:
c1sy6fk9n1-algo-1-si1mk | FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
c1sy6fk9n1-algo-1-si1mk |
c1sy6fk9n1-algo-1-si1mk | [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
c1sy6fk9n1-algo-1-si1mk | Encountered exit_code 2
c1sy6fk9n1-algo-1-si1mk exited with code 2
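One way to narrow this down might be to check, on the host, whether the config files were ever written into the directory that the compose file mounts at /opt/ml/input (here the `.../algo-1-si1mk/input` temp path). The sketch below is only a diagnostic idea, not part of the SageMaker SDK: the helper name `missing_config_files` is made up for illustration, and it assumes the temporary directory is still present when you run it.

```python
from pathlib import Path

# File names the training toolkit looks for under /opt/ml/input/config.
EXPECTED_CONFIG_FILES = (
    "hyperparameters.json",
    "inputdataconfig.json",
    "resourceconfig.json",
)


def missing_config_files(host_input_dir):
    """Return the expected config files absent from <host_input_dir>/config.

    host_input_dir is the host-side path mounted at /opt/ml/input
    (hypothetical helper, for debugging only).
    """
    config_dir = Path(host_input_dir) / "config"
    return [
        name
        for name in EXPECTED_CONFIG_FILES
        if not (config_dir / name).is_file()
    ]


if __name__ == "__main__":
    import sys

    # Pass the host path from the "volumes:" entry that maps to /opt/ml/input.
    host_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    missing = missing_config_files(host_dir)
    if missing:
        print("Missing on host:", ", ".join(missing))
    else:
        print("All expected config files are present on the host.")
```

If the files are present on the host but still not visible inside the container, that would point at the volume mount rather than at the SDK failing to write them; if they are missing on the host, the problem is earlier, in how the local session prepares the job.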