Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Train": executable file not found in $PATH #253

Open
celsofranssa opened this issue Oct 13, 2023 · 0 comments
Open

"Train": executable file not found in $PATH #253

celsofranssa opened this issue Oct 13, 2023 · 0 comments

Comments

@celsofranssa
Copy link

BUG Description
I am facing an error that does not give any direction to resolve it when migrating to run on Sagemaker.

The code runs perfectly on the local machine.

To reproduce

role = "arn:..."

    estimator = PyTorch(
        image_uri="1...ecr...amazonaws.com/...:prototype",
        git_config={"repo": "https://github.com/celsofranssa/LightningPrototype.git", "branch": "sagemaker"},
        entry_point="main.py",
        role=role,
        region="us-...",
        instance_type="local", # ml.g4dn.2xlarge
        instance_count=1,
        volume_size=225,
        hyperparameters=hparams
    )
    estimator.fit()

Expected behavior
The model is expected to start to train and log metrics and losses.

Screenshots or logs

Cloning into '/tmp/tmpycpzvkcn'...
remote: Enumerating objects: 246, done.
remote: Counting objects: 100% (246/246), done.
remote: Compressing objects: 100% (190/190), done.
remote: Total 246 (delta 40), reused 232 (delta 29), pack-reused 0
Receiving objects: 100% (246/246), 39.10 MiB | 27.69 MiB/s, done.
Resolving deltas: 100% (40/40), done.
Branch 'sagemaker' set up to track remote branch 'sagemaker' from 'origin'.
Switched to a new branch 'sagemaker'
[2023-10-12 19:22:15,073][sagemaker][INFO] - Creating training-job with name: xmtc-2023-10-13-02-22-09-781
[2023-10-12 19:22:15,116][sagemaker.local.image][INFO] - 'Docker Compose' found using Docker CLI.
[2023-10-12 19:22:15,117][sagemaker.local.local_session][INFO] - Starting training job
[2023-10-12 19:22:15,118][sagemaker.local.image][INFO] - Using the long-lived AWS credentials found in session
[2023-10-12 19:22:15,121][sagemaker.local.image][INFO] - docker compose file: 
networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-55row:
    command: train
    container_name: 1l7x1nzly6-algo-1-55row
    environment:
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    image: 179395270822.dkr.ecr.us-east-2.amazonaws.com/xmtc:prototype
    networks:
      sagemaker-local:
        aliases:
        - algo-1-55row
    stdin_open: true
    tty: true
    volumes:
    - /tmp/tmpsvd2b_wm/algo-1-55row/output/data:/opt/ml/output/data
    - /tmp/tmpsvd2b_wm/algo-1-55row/input:/opt/ml/input
    - /tmp/tmpsvd2b_wm/algo-1-55row/output:/opt/ml/output
    - /tmp/tmpsvd2b_wm/model:/opt/ml/model
version: '2.3'

[2023-10-12 19:22:15,121][sagemaker.local.image][INFO] - docker command: docker compose -f /tmp/tmpsvd2b_wm/docker-compose.yaml up --build --abort-on-container-exit
time="2023-10-12T19:22:15-07:00" level=warning msg="a network with name sagemaker-local exists but was not created for project \"tmpsvd2b_wm\".\nSet `external: true` to use an existing network"
 Container 1l7x1nzly6-algo-1-55row  Creating
 Container 1l7x1nzly6-algo-1-55row  Created
Attaching to 1l7x1nzly6-algo-1-55row
Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "train": executable file not found in $PATH: unknown
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/image.py", line 296, in train
    _stream_output(process)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/image.py", line 984, in _stream_output
    raise RuntimeError("Process exited with code: %s" % exit_code)
RuntimeError: Process exited with code: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_on_sagemaker.py", line 28, in run_on_sagemaker
    estimator.fit()
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/workflow/pipeline_context.py", line 311, in wrapper
    return run_func(*args, **kwargs)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/estimator.py", line 1311, in fit
    self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_config)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/estimator.py", line 2374, in start_new
    estimator.sagemaker_session.train(**train_args)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/session.py", line 941, in train
    self._intercept_create_request(train_request, submit, self.train.__name__)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/session.py", line 5618, in _intercept_create_request
    return create(request)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/session.py", line 939, in submit
    self.sagemaker_client.create_training_job(**request)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/local_session.py", line 203, in create_training_job
    training_job.start(
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/entities.py", line 243, in start
    self.model_artifacts = self.container.train(
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/image.py", line 301, in train
    raise RuntimeError(msg)
RuntimeError: Failed to run: ['docker', 'compose', '-f', '/tmp/tmpsvd2b_wm/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: sagemaker 2.192.0
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): Pytorch 2.0.1
  • Python version: Python 3.10
  • Docker: 24.0.6
  • Custom Docker image (Y/N): Yes, on ECR.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant