Multi gpu support #127

Open. Wants to merge 12 commits into base: master.
Conversation

@waytrue17 commented Aug 4, 2022

Description of changes:
Enabling multi-GPU support. The change passes context information to handler functions so that the model and data can be assigned to specific GPU devices.

  • To enable this feature, customers need to add the context argument to their custom handler declarations, e.g. changing input_fn(input_data, content_type) to input_fn(input_data, content_type, context). See the sketch after the process table below.
  • For backward compatibility, this implementation does not break existing use cases where no context is passed: input_fn(input_data, content_type) still works.
  • Tested the implementation on an 8-GPU instance with SAGEMAKER_MODEL_SERVER_WORKERS=9. The workers were assigned to different GPUs as expected:
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    396466      C   /opt/conda/bin/python3.8         1729MiB |
|    1   N/A  N/A    396465      C   /opt/conda/bin/python3.8         1729MiB |
|    1   N/A  N/A    396467      C   /opt/conda/bin/python3.8         1729MiB |
|    2   N/A  N/A    396470      C   /opt/conda/bin/python3.8         1729MiB |
|    3   N/A  N/A    396462      C   /opt/conda/bin/python3.8         1729MiB |
|    4   N/A  N/A    396469      C   /opt/conda/bin/python3.8         1729MiB |
|    5   N/A  N/A    396463      C   /opt/conda/bin/python3.8         1729MiB |
|    6   N/A  N/A    396468      C   /opt/conda/bin/python3.8         1729MiB |
|    7   N/A  N/A    396464      C   /opt/conda/bin/python3.8         1729MiB |
+-----------------------------------------------------------------------------+
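A minimal sketch of what a context-aware handler could look like. The context attribute names below (system_properties, "gpu_id") are assumptions based on the MMS-style context object, not code taken from this PR, and model.pt is a placeholder artifact name:

```python
import os
import torch

def model_fn(model_dir, context=None):
    # Hypothetical handler: each model server worker is handed a context whose
    # system_properties carry the GPU ordinal assigned to that worker
    # (assumed MMS-style API).
    if context is not None and torch.cuda.is_available():
        gpu_id = context.system_properties.get("gpu_id")
        device = torch.device(f"cuda:{gpu_id}")
    else:
        device = torch.device("cpu")
    model = torch.jit.load(os.path.join(model_dir, "model.pt"), map_location=device)
    return model.to(device)

def input_fn(input_data, content_type, context=None):
    # The trailing, optional context parameter is the new third argument.
    ...
```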

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@sagemaker-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-inference-toolkit-pr
  • Commit ID: 47d6f21
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-inference-toolkit-pr
  • Commit ID: 2a0dbda
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-inference-toolkit-pr
  • Commit ID: d869551
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-inference-toolkit-pr
  • Commit ID: c774dd0
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-inference-toolkit-pr
  • Commit ID: b5c5037
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-inference-toolkit-pr
  • Commit ID: 1ab8249
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-inference-toolkit-pr
  • Commit ID: 7fad13e
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-inference-toolkit-pr
  • Commit ID: 2d6ce94
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-inference-toolkit-pr
  • Commit ID: 0d293f4
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@ashishgupta023

Have you tested the below inference use cases with the DLC container?

  1. customer provides an inference script with context (new way)
  2. customer provides an inference script without the context (old way)

Could you please attach the test details to the description?

@ashishgupta023 commented Aug 10, 2022

I think this change will also be required for the MXNet DLC containers with MMS. Instead of adding a new transformer and adapting the handler service in the PyTorch toolkit, could we consider adapting the transformer and handler service in the inference toolkit to work with the context, so that the change applies to both? It would also be less error prone in the future, since the change would live in a single place.
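
For reference, one backward-compatible way a shared transformer could dispatch to both handler signatures is to inspect the user function before calling it. This is an illustrative sketch of the technique, not the PR's actual implementation; call_handler is a hypothetical helper name:

```python
import inspect

def call_handler(fn, *args, context=None):
    # Pass the context only when the user-supplied handler declares the
    # extra parameter; legacy handlers keep working unchanged.
    if len(inspect.signature(fn).parameters) == len(args) + 1:
        return fn(*args, context)
    return fn(*args)

# Usage:
#   call_handler(input_fn, data, content_type, context=ctx)
# calls input_fn(data, content_type, context) when the handler accepts three
# arguments, and input_fn(data, content_type) otherwise.
```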

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-inference-toolkit-pr
  • Commit ID: ef19cb7
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@waytrue17 (Author)

> I think this change will also be required for the MXNet DLC containers with MMS. Instead of adding a new transformer and adapting the handler service in the PyTorch toolkit, could we consider adapting the transformer and handler service in the inference toolkit to work with the context, so that the change applies to both? It would also be less error prone in the future, since the change would live in a single place.

Makes sense. I will split the code and re-run some tests, and will post the test results afterward.

