Issue when training in local mode with huggingface training container #193

ojturner · 2023-09-11T09:27:39Z

Describe the bug
When I try to train locally with a simple script using a Hugging Face training container (i.e. from here), I get the following error:

jq: error: Could not open file /opt/ml/input/config/resourceconfig.json: No such file or directory

To reproduce
The local training script is as follows:

import os
from sagemaker.huggingface import HuggingFace
from sagemaker.local import LocalSession

sagemaker_session = LocalSession()
sagemaker_session.config = {'local': {'local_code': True}}

huggingface_estimator = HuggingFace(
    py_version=None,
    entry_point="train.py",
    image_uri="763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-training:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04",
    role="sagemaker-studio-user-prod",
    source_dir="scripts/pipeline_scripts",
    instance_type='local',
    instance_count=1,
    input_mode='File',
    output_path=f"file://{os.getcwd()}/tests/test_output_data/trained_model",
    code_location="path_to_s3_dir",
)

huggingface_estimator.fit({
    'train': f"file://{os.getcwd()}/tests/test_output_data/preprocessed/train",
    'val': f"file://{os.getcwd()}/tests/test_output_data/preprocessed/val",
})

This is run locally with python and produces the error above. I don't think the contents of train.py are relevant, since the failure happens while the training environment is being set up. The train and val inputs are tokenized tensors in Arrow format, produced by calling .save_to_disk on a Hugging Face datasets Dataset object.
Note that the same error also occurred with previous versions of the training container.

Expected behavior
Training should complete as it does when running in SageMaker. The same configuration runs fine as part of a SageMaker pipeline on SageMaker-managed instances. This local training also worked previously, and we can't isolate what changed to cause the error above.

Screenshots or logs
Full logs:

INFO:sagemaker:Creating training-job with name: training-2023-09-08-09-31-48-957
INFO:sagemaker.local.local_session:Starting training job
WARNING:sagemaker.local.image:Using the short-lived AWS credentials found in session. They might expire while running.
INFO:sagemaker.local.image:docker compose file:
networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-si1mk:
    command: train
    container_name: c1sy6fk9n1-algo-1-si1mk
    environment:
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    image: 763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-training:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04
    networks:
      sagemaker-local:
        aliases:
        - algo-1-si1mk
    stdin_open: true
    tty: true
    volumes:
    - /private/var/folders/dr/n0xslz555m128480ykmyr0t40000gp/T/tmp5jg5mcvx/algo-1-si1mk/input:/opt/ml/input
    - /private/var/folders/dr/n0xslz555m128480ykmyr0t40000gp/T/tmp5jg5mcvx/algo-1-si1mk/output:/opt/ml/output
    - /private/var/folders/dr/n0xslz555m128480ykmyr0t40000gp/T/tmp5jg5mcvx/algo-1-si1mk/output/data:/opt/ml/output/data
    - /private/var/folders/dr/n0xslz555m128480ykmyr0t40000gp/T/tmp5jg5mcvx/model:/opt/ml/model
    - /Users/owenturner/dev/banquo-bert/tests/test_output_data/preprocessed/train:/opt/ml/input/data/train
    - /Users/owenturner/dev/banquo-bert/tests/test_output_data/preprocessed/val:/opt/ml/input/data/val
version: '2.3'

INFO:sagemaker.local.image:docker command: docker-compose -f /private/var/folders/dr/n0xslz555m128480ykmyr0t40000gp/T/tmp5jg5mcvx/docker-compose.yaml up --build --abort-on-container-exit
Creating network "sagemaker-local" with the default driver
Creating c1sy6fk9n1-algo-1-si1mk ...
Creating c1sy6fk9n1-algo-1-si1mk ... done
Attaching to c1sy6fk9n1-algo-1-si1mk
c1sy6fk9n1-algo-1-si1mk | jq: error: Could not open file /opt/ml/input/config/resourceconfig.json: No such file or directory
c1sy6fk9n1-algo-1-si1mk | changehostname.c: In function ‘gethostname’:
c1sy6fk9n1-algo-1-si1mk | changehostname.c:15:21: error: expected expression before ‘;’ token
c1sy6fk9n1-algo-1-si1mk |    15 |   const char *val = ;
c1sy6fk9n1-algo-1-si1mk |       |                     ^
c1sy6fk9n1-algo-1-si1mk | gcc: error: changehostname.o: No such file or directory
c1sy6fk9n1-algo-1-si1mk | ERROR: ld.so: object '/libchangehostname.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
c1sy6fk9n1-algo-1-si1mk | Reporting training FAILURE
c1sy6fk9n1-algo-1-si1mk | Framework Error:
c1sy6fk9n1-algo-1-si1mk | Traceback (most recent call last):
c1sy6fk9n1-algo-1-si1mk |   File "/opt/conda/lib/python3.10/site-packages/sagemaker_training/trainer.py", line 70, in train
c1sy6fk9n1-algo-1-si1mk |     env = environment.Environment()
c1sy6fk9n1-algo-1-si1mk |   File "/opt/conda/lib/python3.10/site-packages/sagemaker_training/environment.py", line 576, in __init__
c1sy6fk9n1-algo-1-si1mk |     resource_config = resource_config or read_resource_config()
c1sy6fk9n1-algo-1-si1mk |   File "/opt/conda/lib/python3.10/site-packages/sagemaker_training/environment.py", line 254, in read_resource_config
c1sy6fk9n1-algo-1-si1mk |     return _read_json(resource_config_file_dir)
c1sy6fk9n1-algo-1-si1mk |   File "/opt/conda/lib/python3.10/site-packages/sagemaker_training/environment.py", line 201, in _read_json
c1sy6fk9n1-algo-1-si1mk |     with open(path, "r") as f:
c1sy6fk9n1-algo-1-si1mk | FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
c1sy6fk9n1-algo-1-si1mk |
c1sy6fk9n1-algo-1-si1mk | [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
c1sy6fk9n1-algo-1-si1mk | Encountered exit_code 2
c1sy6fk9n1-algo-1-si1mk exited with code 2
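
The traceback above fails on /opt/ml/input/config/resourceconfig.json, which sits under the first volume in the compose file (the host .../algo-1-si1mk/input directory mounted at /opt/ml/input). One way to narrow this down is to check whether the config files exist on the host side of that mount. The following is a minimal sketch, not part of the SageMaker SDK: `missing_config_files` is a hypothetical helper, and the three file names are an assumption based on what the training toolkit conventionally reads from the config directory.

```python
from pathlib import Path

# Assumption: these are the config files the training toolkit expects
# under /opt/ml/input/config in this image.
EXPECTED_CONFIG_FILES = (
    "hyperparameters.json",
    "inputdataconfig.json",
    "resourceconfig.json",
)


def missing_config_files(host_input_dir):
    """Return the expected config file names absent from <host_input_dir>/config.

    host_input_dir is the host path that the compose file mounts at /opt/ml/input.
    """
    config_dir = Path(host_input_dir) / "config"
    return [
        name
        for name in EXPECTED_CONFIG_FILES
        if not (config_dir / name).is_file()
    ]
```

Pass the host path from the first volumes entry while the job is still starting, since the temporary directory may be cleaned up afterwards. An empty list would mean the files exist on the host, which would point at the bind mount rather than at file generation; that reading is a guess and is not confirmed by the logs.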

System information

SageMaker training version: 4.5.0
Prebuilt Docker image URL



nikhilKumarMarepally · 2024-10-19T18:33:47Z

I am also facing the same issue
