
Worker initialization #115

Open
romank87 opened this issue Jun 24, 2019 · 8 comments

Labels
type: bug Something isn't working

Comments

@romank87

It seems like there is a bug in the initialization logic.
Gunicorn workers are initialized not at container start but when the first request arrives.
The global app variable here is not shared between gunicorn processes, so each process is initialized only when it receives its first request.

This causes seemingly random behavior. If a request lands on a worker that has already been initialized, it is processed quickly. If it lands on a worker that has not yet been initialized, the response is delayed for quite some time (>30 sec in my case).
This can even cause /ping requests to time out, making it impossible to deploy the container to AWS.
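
To illustrate the failure mode, here is a minimal sketch of the pattern being described (not the container's actual code; app and _load_user_module are placeholder names). Each gunicorn worker process holds its own copy of the module-level app, so the slow setup runs independently in every worker, on that worker's first request:

import time

app = None  # module-level; each gunicorn worker process has its own copy


def _load_user_module():
    # stand-in for importing the user script and loading the model
    time.sleep(30)  # the >30 sec delay reported above
    return object()


def handle_request(request):
    global app
    if app is None:
        # the first request that reaches THIS worker pays the initialization
        # cost, so /ping can time out if it lands on a cold worker
        app = _load_user_module()
    return 'ok'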

@stale

stale bot commented Jul 1, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jul 1, 2019
@icywang86rui
Contributor

@romank87 Thanks for the feedback. We have an internal backlog item tracking this issue and will keep you updated on progress.

@stale stale bot removed the stale label Jul 1, 2019
@laurenyu laurenyu added the type: bug Something isn't working label Jul 1, 2019
@scottpletcher

scottpletcher commented Jul 9, 2019

Seeing this as well when trying to deploy a plain 1.1.0 container: it fails the health check and never completes deployment. Frustratingly enough, I'm trying to use this PyTorch container to complete an instructional video on how to submit custom models to the AWS Marketplace...

@chuyang-deng
Contributor

Hi @scottpletcher, I apologize for the inconvenience. We have assigned a dedicated engineer to work on this issue.

One workaround you can try is to pre-install the modules in the container instead of installing dependencies at runtime.

Thanks for your patience!

@nbeuchat

I have been referred to this thread by AWS support.

One workaround you can try is to pre-install the modules in the container instead of installing dependencies at runtime.
@ChuyangDeng Could you please give more information on how to do this? We are deploying our model through a Jupyter notebook on SageMaker.

from sagemaker.session import Session
from sagemaker.pytorch import PyTorchModel

model_data = Session().upload_data(path='model.tar.gz', key_prefix='model')

env = {
    "SAGEMAKER_REQUIREMENTS": "requirements.txt", # path relative to `source_dir` below.
}

# role and endpoint_name are defined earlier in the notebook
model = PyTorchModel(model_data=model_data,
                     entry_point='generate.py',
                     role=role,
                     env=env,
                     source_dir='.',
                     name=endpoint_name,
                     framework_version='1.0.0')

predictor = model.deploy(initial_instance_count=1, instance_type='ml.m5.large')

In the requirements.txt, we have pytorch_pretrained_bert.

I tried removing pytorch_pretrained_bert from the requirements file and using !pip install pytorch_pretrained_bert in the notebook, but I am still not able to deploy. I receive the following error:

ValueError: Error hosting endpoint bert-offer-type-multilang: Failed Reason:  The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint.

Note that I used a huge machine here as we had "No space left" issues in the logs (although initially, that very same model could be successfully deployed to an ml.t2.large instance).

@ChoiByungWook
Contributor

ChoiByungWook commented Jul 12, 2019

@nbeuchat,

Is the requirements.txt installation not working for you?

The notebook environment in which you installed pytorch_pretrained_bert is different from the environment where your model is hosted. You will need to do an explicit install in your generate.py file, which is the file that gets persisted over.
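
As a rough sketch (using the pytorch_pretrained_bert package mentioned above; the guard logic here is illustrative, not a required interface), such an explicit install at the top of generate.py could look like:

import subprocess
import sys

# Install the dependency into the hosting container's environment the first
# time this module is imported, then import it as usual.
try:
    import pytorch_pretrained_bert  # noqa: F401
except ImportError:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install',
                           'pytorch_pretrained_bert'])
    import pytorch_pretrained_bert  # noqa: F401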

However, that script execution happens after the ping (during worker initialization on the first request), so @romank87 would still have the same issue. There doesn't seem to be a nice solution other than modifying the Docker container to contain your dependency, if you need to get past worker initialization.

You would have to either modify the container at runtime or build it yourself with a modified Dockerfile.

Modify at runtime

  1. Log in to our ECR repo
    • $(aws ecr get-login --no-include-email --registry-id 520713654638)
  2. Pull down the PyTorch image from ECR
    • docker pull 520713654638.dkr.ecr.us-west-2.amazonaws.com/sagemaker-pytorch:1.1.0-cpu-py3
  3. Run your container with a bash session
    • docker run -it 520713654638.dkr.ecr.us-west-2.amazonaws.com/sagemaker-pytorch:1.1.0-cpu-py3 bash
  4. Install your dependencies
    • pip install blah blah blah
  5. In another bash session, commit the running container as another image
    • docker commit --change='ENTRYPOINT ["bash", "-m", "start_with_right_hostname.sh"]' $(docker ps -q) sagemaker-pytorch-container:1.0

The command above assumes there is only one running container in your Docker session; otherwise you will need to replace docker ps -q with the right container ID.

Build

  1. Modify the Dockerfile to install your dependencies
  2. Follow the instructions here: https://github.com/aws/sagemaker-pytorch-container#building-your-image

Testing your new image

  1. Change the image constructor parameter in PyTorchModel to your new image (sagemaker-pytorch-container:1.0)
  2. Change instance_type to 'local'.
  3. Deploy

This will run the container on your local machine, which lets you iterate much more quickly than waiting for instances to provision. Once the container runs as expected, you can push the image to an ECR repo and deploy in SageMaker.
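
As a sketch of that local test (assuming the image was committed as sagemaker-pytorch-container:1.0 per step 5 above, and that model_data, role, and env are the same values from the earlier snippet in this thread; local mode also requires Docker and the SageMaker SDK's local-mode dependencies on your machine):

from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(model_data=model_data,
                     entry_point='generate.py',
                     role=role,
                     env=env,
                     source_dir='.',
                     image='sagemaker-pytorch-container:1.0',  # your modified image
                     framework_version='1.0.0')

# instance_type='local' runs the container on this machine via Docker
# instead of provisioning a SageMaker endpoint instance
predictor = model.deploy(initial_instance_count=1, instance_type='local')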

I apologize for all of the inconvenience. To a certain extent, being required to do either of the options listed above defeats the purpose of these images, since we want them to be an abstraction.

Please let me know if there is anything I can clarify.

@sivakhno

@ChoiByungWook - thanks for the clarification above. This is exactly what we are trying to do:

sagemaker_serving = PyTorchModel(model_data=merged_models_file_path,
                                 source_dir='./',
                                 image=image_name,
                                 role=os.environ['SAGEMAKER_ROLE'],
                                 framework_version='1.0.0',
                                 entry_point='serving.py',
                                 predictor_cls=utils.JSONPredictor)

where image_name is our custom (sagemaker-pytorch extended) image.
However, as per https://github.com/aws/sagemaker-containers/blob/master/TRAINING_IN_DETAIL.rst it seems that

One difference between a Framework Container and a BYOC is  ... the former doesn't include the user entry point and needs to download it from S3

which in our case seems to lead to the creation of a massive source.tar.gz file (I also could not find any documentation on how to control what goes into this file; it seems to contain a snapshot of all files in the current directory and subdirectories).
Can we set up PyTorchModel such that the entrypoint in the Docker image is used instead?
Thanks!

@icywang86rui
Contributor

@sivakhno
PyTorchModel tars up everything under source_dir, and that makes the source.tar.gz file. So if you would like to control what goes into it, you can create a folder with only the relevant files in it and point source_dir to that folder. entry_point should point to the entry point script you would like the container to execute; in this case serving.py, assuming it's directly under source_dir.
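
For illustration, a rough sketch of that layout (the folder name serving_src is just a placeholder):

# serving_src/
# ├── serving.py          <- entry point
# └── requirements.txt    <- runtime dependencies only
sagemaker_serving = PyTorchModel(model_data=merged_models_file_path,
                                 source_dir='serving_src',  # only this folder is tarred into source.tar.gz
                                 entry_point='serving.py',  # relative to source_dir
                                 image=image_name,
                                 role=os.environ['SAGEMAKER_ROLE'],
                                 framework_version='1.0.0',
                                 predictor_cls=utils.JSONPredictor)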

Hope this answers your question. Please let us know if you have any further questions.
