Issue when training in local mode with huggingface training container #193

Open
ojturner opened this issue Sep 11, 2023 · 1 comment

@ojturner

Describe the bug
When I try to train locally with a simple script using a prebuilt Hugging Face training container, I get the following error:

jq: error: Could not open file /opt/ml/input/config/resourceconfig.json: No such file or directory
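
For anyone narrowing this down: the error says the container cannot find resourceconfig.json under /opt/ml/input/config. A minimal sketch for checking which config files are absent from a given input directory (inside the container, or in the host directory mounted to /opt/ml/input). The helper name missing_config_files is hypothetical, and the file list is an assumption based on the conventional SageMaker config layout, not something confirmed for this image.

```python
from pathlib import Path

# Config files the SageMaker training toolkit conventionally reads from
# /opt/ml/input/config (assumed list; adjust if your container differs).
EXPECTED_CONFIG_FILES = (
    "hyperparameters.json",
    "inputdataconfig.json",
    "resourceconfig.json",
)


def missing_config_files(input_dir):
    """Return the expected config file names absent from <input_dir>/config."""
    config_dir = Path(input_dir) / "config"
    return [
        name for name in EXPECTED_CONFIG_FILES
        if not (config_dir / name).is_file()
    ]


if __name__ == "__main__":
    import sys

    # Default to the in-container location; pass a host directory to check that instead.
    target = sys.argv[1] if len(sys.argv) > 1 else "/opt/ml/input"
    missing = missing_config_files(target)
    print("missing:", missing if missing else "none")
```

If all three files are reported missing, the whole config directory is probably not visible at that location, rather than a single file having been skipped.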

To reproduce
The local training script is as follows:

import os
from sagemaker.huggingface import HuggingFace
from sagemaker.local import LocalSession

sagemaker_session = LocalSession()
sagemaker_session.config = {'local': {'local_code': True}}

huggingface_estimator = HuggingFace(
    py_version=None,
    entry_point="train.py",
    image_uri="763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-training:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04",
    role="sagemaker-studio-user-prod",
    source_dir="scripts/pipeline_scripts",
    instance_type='local',
    instance_count=1,
    input_mode='File',
    output_path=f"file://{os.getcwd()}/tests/test_output_data/trained_model",
    code_location="path_to_s3_dir",
)

huggingface_estimator.fit({
    'train': f"file://{os.getcwd()}/tests/test_output_data/preprocessed/train",
    'val': f"file://{os.getcwd()}/tests/test_output_data/preprocessed/val",
})

The script is run locally with python and fails with the error above. I don't think the contents of train.py are relevant, since the failure happens while the training environment is being set up. The train and val inputs are tokenized datasets in Arrow format, written by calling .save_to_disk on a Hugging Face datasets Dataset object.
Note that the same error also occurred with previous versions of the training container.

Expected behavior
Training should complete as it does when running in SageMaker. The same configuration runs without error as part of a SageMaker pipeline on SageMaker-managed instances. Local training with this setup also worked previously, and we have not been able to isolate what changed to cause the error above.

Screenshots or logs
Full logs:

INFO:sagemaker:Creating training-job with name: training-2023-09-08-09-31-48-957
INFO:sagemaker.local.local_session:Starting training job
WARNING:sagemaker.local.image:Using the short-lived AWS credentials found in session. They might expire while running.
INFO:sagemaker.local.image:docker compose file:
networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-si1mk:
    command: train
    container_name: c1sy6fk9n1-algo-1-si1mk
    environment:
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    image: 763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-training:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04
    networks:
      sagemaker-local:
        aliases:
        - algo-1-si1mk
    stdin_open: true
    tty: true
    volumes:
    - /private/var/folders/dr/n0xslz555m128480ykmyr0t40000gp/T/tmp5jg5mcvx/algo-1-si1mk/input:/opt/ml/input
    - /private/var/folders/dr/n0xslz555m128480ykmyr0t40000gp/T/tmp5jg5mcvx/algo-1-si1mk/output:/opt/ml/output
    - /private/var/folders/dr/n0xslz555m128480ykmyr0t40000gp/T/tmp5jg5mcvx/algo-1-si1mk/output/data:/opt/ml/output/data
    - /private/var/folders/dr/n0xslz555m128480ykmyr0t40000gp/T/tmp5jg5mcvx/model:/opt/ml/model
    - /Users/owenturner/dev/banquo-bert/tests/test_output_data/preprocessed/train:/opt/ml/input/data/train
    - /Users/owenturner/dev/banquo-bert/tests/test_output_data/preprocessed/val:/opt/ml/input/data/val
version: '2.3'

INFO:sagemaker.local.image:docker command: docker-compose -f /private/var/folders/dr/n0xslz555m128480ykmyr0t40000gp/T/tmp5jg5mcvx/docker-compose.yaml up --build --abort-on-container-exit
Creating network "sagemaker-local" with the default driver
Creating c1sy6fk9n1-algo-1-si1mk ...
Creating c1sy6fk9n1-algo-1-si1mk ... done
Attaching to c1sy6fk9n1-algo-1-si1mk
c1sy6fk9n1-algo-1-si1mk | jq: error: Could not open file /opt/ml/input/config/resourceconfig.json: No such file or directory
c1sy6fk9n1-algo-1-si1mk | changehostname.c: In function ‘gethostname’:
c1sy6fk9n1-algo-1-si1mk | changehostname.c:15:21: error: expected expression before ‘;’ token
c1sy6fk9n1-algo-1-si1mk |    15 |   const char *val = ;
c1sy6fk9n1-algo-1-si1mk |       |                     ^
c1sy6fk9n1-algo-1-si1mk | gcc: error: changehostname.o: No such file or directory
c1sy6fk9n1-algo-1-si1mk | ERROR: ld.so: object '/libchangehostname.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
c1sy6fk9n1-algo-1-si1mk | Reporting training FAILURE
c1sy6fk9n1-algo-1-si1mk | Framework Error:
c1sy6fk9n1-algo-1-si1mk | Traceback (most recent call last):
c1sy6fk9n1-algo-1-si1mk |   File "/opt/conda/lib/python3.10/site-packages/sagemaker_training/trainer.py", line 70, in train
c1sy6fk9n1-algo-1-si1mk |     env = environment.Environment()
c1sy6fk9n1-algo-1-si1mk |   File "/opt/conda/lib/python3.10/site-packages/sagemaker_training/environment.py", line 576, in __init__
c1sy6fk9n1-algo-1-si1mk |     resource_config = resource_config or read_resource_config()
c1sy6fk9n1-algo-1-si1mk |   File "/opt/conda/lib/python3.10/site-packages/sagemaker_training/environment.py", line 254, in read_resource_config
c1sy6fk9n1-algo-1-si1mk |     return _read_json(resource_config_file_dir)
c1sy6fk9n1-algo-1-si1mk |   File "/opt/conda/lib/python3.10/site-packages/sagemaker_training/environment.py", line 201, in _read_json
c1sy6fk9n1-algo-1-si1mk |     with open(path, "r") as f:
c1sy6fk9n1-algo-1-si1mk | FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
c1sy6fk9n1-algo-1-si1mk |
c1sy6fk9n1-algo-1-si1mk | [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
c1sy6fk9n1-algo-1-si1mk | Encountered exit_code 2
c1sy6fk9n1-algo-1-si1mk exited with code 2
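
The compose file above lists the bind mounts as host:container pairs, so the missing container path can be traced back to a host path and inspected directly. A minimal sketch of that mapping, assuming plain host:container[:mode] volume strings with no colon in the host path; the function name host_path_for and the shortened example paths are hypothetical.

```python
from pathlib import PurePosixPath


def host_path_for(container_path, volumes):
    """Map a container path to a host path using 'host:container' volume specs.

    The most specific (longest) matching container mount wins, since a nested
    bind mount shadows its parent. Returns None if no mount covers the path.
    """
    target = PurePosixPath(container_path)
    best = None
    for spec in volumes:
        parts = spec.split(":")  # tolerates a trailing ':ro' / ':rw' mode
        host_root, mount = PurePosixPath(parts[0]), PurePosixPath(parts[1])
        if target == mount or mount in target.parents:
            if best is None or len(mount.parts) > len(best[1].parts):
                best = (host_root, mount)
    if best is None:
        return None
    host_root, mount = best
    return str(host_root / target.relative_to(mount))


if __name__ == "__main__":
    # Shortened stand-ins for the volume entries in the compose file.
    volumes = [
        "/tmp/job/algo-1/input:/opt/ml/input",
        "/tmp/job/algo-1/output:/opt/ml/output",
        "/data/train:/opt/ml/input/data/train",
    ]
    print(host_path_for("/opt/ml/input/config/resourceconfig.json", volumes))
```

Checking whether the resulting host path exists while the job is starting would show whether the file was never written on the host or was written but is not visible inside the container. Neither possibility has been confirmed for this report.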

System information
  • SageMaker training version - 4.5.0
  • Prebuilt Docker image URL - the image_uri given in the script above

@nikhilKumarMarepally

I am also facing the same issue.
