Worker initialization #115
Comments
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@romank87 Thanks for the feedback. We have an internal backlog item tracking this issue and will keep you updated on the progress.
Seeing this as well when trying to deploy a plain 1.1.0 container... it fails the health check and never completes deployment. Frustratingly enough, I'm trying to use this PyTorch container to complete an instructional video on how to submit custom models to the AWS Marketplace...
Hi @scottpletcher, I apologize for the inconvenience. We have assigned a dedicated engineer to work on this issue. One workaround you can try is to load pre-installed modules into the container instead of installing dependencies at runtime. Thanks for your patience!
I have been referred to this thread by AWS support.
In the requirements.txt, we have `pytorch_pretrained_bert`, which I tried to remove.
Note that I used a huge machine here because we had "No space left" issues in the logs (although initially, that very same model could be successfully deployed to a ml.t2.large instance).
Is the requirements.txt installation not working for you? The notebook environment in which you installed pytorch_pretrained_bert is different from the one where your model is hosted. You will need to do an explicit install in your generate.py file, which is what gets persisted over. However, that script execution happens after the ping (during worker initialization of the first request), so @romank87 would still have the same issue. There doesn't seem to be a nice solution other than to modify the Docker container to contain your dependency, if you need to get past worker initialization. You would have to either modify the container at runtime or build it yourself with a modified Dockerfile.

Modify at runtime
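For example, something along these lines (a sketch only; the package name is an assumption based on the dependency mentioned in this thread, and the new image tag is illustrative):

```sh
# Install the missing dependency inside the already-running serving container,
# then commit the result as a new image that can later be pushed to ECR.
docker exec $(docker ps -q) pip install pytorch_pretrained_bert
docker commit $(docker ps -q) sagemaker-pytorch-extended:latest
```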
The command above assumes there is only one running container in your Docker session; otherwise you will need to replace `docker ps -q` with the right container ID.

Build
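A minimal Dockerfile for this approach might look like the following (a sketch; the base image name and tag are placeholders for the sagemaker-pytorch image you actually deploy):

```dockerfile
# Extend the stock SageMaker PyTorch serving image so the dependency is baked in
# at build time instead of being installed during worker initialization.
FROM sagemaker-pytorch:1.1.0-cpu-py3
RUN pip install pytorch_pretrained_bert
```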
Testing your new image
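For instance (a sketch; the image name, port mapping, and model mount are illustrative, and SageMaker starts hosting containers with the `serve` argument):

```sh
# Build the extended image, run it locally in serve mode, and hit the health check.
# The -v mount places local model artifacts at SageMaker's expected /opt/ml/model path.
docker build -t sagemaker-pytorch-extended .
docker run -d -p 8080:8080 -v "$(pwd)/model:/opt/ml/model" sagemaker-pytorch-extended serve
curl http://localhost:8080/ping
```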
This will run the container on your local machine, which should iterate a lot quicker than waiting for instances to provision. When the container runs as expected, we can push the image to an ECR repo and deploy in SageMaker. I apologize for all of the inconvenience. I think that, to a certain extent, being required to do either one of the options listed above defeats the purpose of these images existing, since we want them to be an abstraction. Please let me know if there is anything I can clarify.
@ChoiByungWook - thanks for the clarification above. This is exactly what we are trying to do:
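(a sketch of the kind of deployment call, assuming the SageMaker Python SDK v1 PyTorchModel API; the model data, role, and image URI below are placeholders, and generate.py is the entry point mentioned earlier in this thread)

```python
# Illustrative sketch: deploy with a custom (sagemaker-pytorch extended) serving image.
from sagemaker.pytorch import PyTorchModel

image_name = "123456789012.dkr.ecr.us-east-1.amazonaws.com/sagemaker-pytorch-extended:latest"  # placeholder URI

model = PyTorchModel(
    model_data="s3://my-bucket/model.tar.gz",             # placeholder artifact location
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder execution role
    entry_point="generate.py",
    image=image_name,                                     # note: renamed to image_uri in SDK v2
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.t2.large")
```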
where image_name is our custom (sagemaker-pytorch extended) image, which in our case seems to lead to the creation of massive ...
@sivakhno Hope this answers your question. Please let us know if you have any further questions.
It seems like there is a bug in the initialization logic. Gunicorn processes are initialized not at container start but at the time of the first request's arrival. The global `app` variable here is not shared between gunicorn processes, so each process is initialized only when a request arrives. This causes random behavior: if the request comes to a worker that is already initialized, it is processed quickly; if the request comes to a worker that is not yet initialized, the response is delayed for quite some time (>30 sec in my case).
This will even cause `/ping` requests to time out and make it impossible to deploy the container to AWS.
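To make the failure mode concrete, here is a minimal sketch of the lazy-initialization pattern being described (illustrative only, not the container's actual source): each forked gunicorn worker holds its own module-level `app`, so the expensive setup runs on that worker's first request rather than at startup.

```python
# Sketch of per-worker lazy initialization (illustrative; the names and the Flask app
# are assumptions, not the sagemaker-pytorch-container source).
from flask import Flask

app = None  # each forked gunicorn worker gets its own copy, initially None


def build_app():
    """Stand-in for the real app factory, which would install requirements.txt
    dependencies and load the model (the slow part that delays the response)."""
    flask_app = Flask(__name__)

    @flask_app.route("/ping")
    def ping():
        return "", 200

    return flask_app


def main(environ, start_response):
    # Called by gunicorn for every request (e.g. `gunicorn module:main`).
    # The first request handled by each worker triggers the slow build_app(),
    # which is what makes /ping time out during deployment.
    global app
    if app is None:
        app = build_app()
    return app(environ, start_response)
```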