BUG Description
When migrating my training code to run on SageMaker, I hit an error that gives no indication of how to resolve it.
The code runs perfectly on the local machine.
To reproduce
    from sagemaker.pytorch import PyTorch

    # hparams: dict of hyperparameters, defined elsewhere in the script
    role = "arn:..."
    estimator = PyTorch(
        image_uri="1...ecr...amazonaws.com/...:prototype",
        git_config={
            "repo": "https://github.com/celsofranssa/LightningPrototype.git",
            "branch": "sagemaker",
        },
        entry_point="main.py",
        role=role,
        region="us-...",
        instance_type="local",  # ml.g4dn.2xlarge
        instance_count=1,
        volume_size=225,
        hyperparameters=hparams,
    )
    estimator.fit()
Expected behavior
The model is expected to start to train and log metrics and losses.
Screenshots or logs
    Cloning into '/tmp/tmpycpzvkcn'...
    remote: Enumerating objects: 246, done.
    remote: Counting objects: 100% (246/246), done.
    remote: Compressing objects: 100% (190/190), done.
    remote: Total 246 (delta 40), reused 232 (delta 29), pack-reused 0
    Receiving objects: 100% (246/246), 39.10 MiB | 27.69 MiB/s, done.
    Resolving deltas: 100% (40/40), done.
    Branch 'sagemaker' set up to track remote branch 'sagemaker' from 'origin'.
    Switched to a new branch 'sagemaker'
    [2023-10-12 19:22:15,073][sagemaker][INFO] - Creating training-job with name: xmtc-2023-10-13-02-22-09-781
    [2023-10-12 19:22:15,116][sagemaker.local.image][INFO] - 'Docker Compose' found using Docker CLI.
    [2023-10-12 19:22:15,117][sagemaker.local.local_session][INFO] - Starting training job
    [2023-10-12 19:22:15,118][sagemaker.local.image][INFO] - Using the long-lived AWS credentials found in session
    [2023-10-12 19:22:15,121][sagemaker.local.image][INFO] - docker compose file:
    networks:
      sagemaker-local:
        name: sagemaker-local
    services:
      algo-1-55row:
        command: train
        container_name: 1l7x1nzly6-algo-1-55row
        environment:
        - '[Masked]'
        - '[Masked]'
        - '[Masked]'
        - '[Masked]'
        - '[Masked]'
        image: 179395270822.dkr.ecr.us-east-2.amazonaws.com/xmtc:prototype
        networks:
          sagemaker-local:
            aliases:
            - algo-1-55row
        stdin_open: true
        tty: true
        volumes:
        - /tmp/tmpsvd2b_wm/algo-1-55row/output/data:/opt/ml/output/data
        - /tmp/tmpsvd2b_wm/algo-1-55row/input:/opt/ml/input
        - /tmp/tmpsvd2b_wm/algo-1-55row/output:/opt/ml/output
        - /tmp/tmpsvd2b_wm/model:/opt/ml/model
    version: '2.3'

    [2023-10-12 19:22:15,121][sagemaker.local.image][INFO] - docker command: docker compose -f /tmp/tmpsvd2b_wm/docker-compose.yaml up --build --abort-on-container-exit
    time="2023-10-12T19:22:15-07:00" level=warning msg="a network with name sagemaker-local exists but was not created for project \"tmpsvd2b_wm\".\nSet `external: true` to use an existing network"
    Container 1l7x1nzly6-algo-1-55row Creating
    Container 1l7x1nzly6-algo-1-55row Created
    Attaching to 1l7x1nzly6-algo-1-55row
    Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "train": executable file not found in $PATH: unknown
    Error executing job with overrides: []
    Traceback (most recent call last):
      File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/image.py", line 296, in train
        _stream_output(process)
      File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/image.py", line 984, in _stream_output
        raise RuntimeError("Process exited with code: %s" % exit_code)
    RuntimeError: Process exited with code: 1

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "run_on_sagemaker.py", line 28, in run_on_sagemaker
        estimator.fit()
      File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/workflow/pipeline_context.py", line 311, in wrapper
        return run_func(*args, **kwargs)
      File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/estimator.py", line 1311, in fit
        self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_config)
      File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/estimator.py", line 2374, in start_new
        estimator.sagemaker_session.train(**train_args)
      File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/session.py", line 941, in train
        self._intercept_create_request(train_request, submit, self.train.__name__)
      File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/session.py", line 5618, in _intercept_create_request
        return create(request)
      File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/session.py", line 939, in submit
        self.sagemaker_client.create_training_job(**request)
      File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/local_session.py", line 203, in create_training_job
        training_job.start(
      File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/entities.py", line 243, in start
        self.model_artifacts = self.container.train(
      File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/image.py", line 301, in train
        raise RuntimeError(msg)
    RuntimeError: Failed to run: ['docker', 'compose', '-f', '/tmp/tmpsvd2b_wm/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1
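The one line in the logs that names a concrete failure is the Docker daemon error: the container is started with `command: train` (see the compose file), and the daemon reports that no executable called `train` is on the image's `$PATH`. As a rough aid for spotting this in a long log, here is a minimal sketch that extracts the missing executable name from captured output. It assumes the message format shown in the logs; the helper name `missing_executable` is hypothetical and not part of the SageMaker SDK.

```python
import re
from typing import Optional

# Matches the OCI runtime message seen in the logs, e.g.
#   exec: "train": executable file not found in $PATH
# (the format is taken from this log only; other Docker versions may differ)
_MISSING_EXEC = re.compile(r'exec: "([^"]+)": executable file not found in \$PATH')


def missing_executable(log_text: str) -> Optional[str]:
    """Return the name of the executable the runtime could not find, or None."""
    match = _MISSING_EXEC.search(log_text)
    return match.group(1) if match else None


# Excerpt of the daemon error from the logs in this report.
sample = (
    "Error response from daemon: failed to create task for container: "
    "failed to create shim task: OCI runtime create failed: runc create failed: "
    'unable to start container process: exec: "train": '
    "executable file not found in $PATH: unknown"
)

print(missing_executable(sample))  # prints: train
```

If this returns `train`, a reasonable next thing to check is whether the custom image at `image_uri` actually provides a `train` executable, since that is the command the container is launched with.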