Worker initialization #115
Comments
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@romank87 Thanks for the feedback. We have an internal backlog item tracking this issue and will keep you updated on the progress.
Seeing this as well when trying to deploy a plain 1.1.0 container... it fails the health check and never completes deployment. Frustratingly enough, I'm trying to use this PyTorch container to complete an instructional video on how to submit custom models to the AWS Marketplace...
Hi @scottpletcher, I apologize for the inconvenience. We have assigned a dedicated engineer to work on this issue. One workaround you can try is to load pre-installed modules into the container instead of installing dependencies at runtime. Thanks for your patience!
I have been referred to this thread by AWS support.
In the requirements.txt, we have `pytorch_pretrained_bert`, which I tried to remove.
Note that I used a huge machine here because we had "No space left" issues in the logs (although initially, that very same model could be successfully deployed to a ml.t2.large instance).
Is the requirements.txt installation not working for you? The notebook environment in which you installed pytorch_pretrained_bert is different from the one where your model is hosted. You will need to do an explicit install in your generate.py file, which is what gets persisted over. However, that script execution happens after the ping (during worker initialization of the first request), so @romank87 would still have the same issue. There doesn't seem to be a nice solution other than to modify the Docker container to contain your dependency, if you need to get past worker initialization. You would have to either modify the container at runtime or build it yourself with a modified Dockerfile.

Modify at runtime
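For example, something along these lines (a sketch only; the package name is an assumption based on the dependency mentioned in this thread, and the new image tag is illustrative):

```sh
# Install the missing dependency inside the already-running serving container,
# then commit the result as a new image that can later be pushed to ECR.
docker exec $(docker ps -q) pip install pytorch_pretrained_bert
docker commit $(docker ps -q) sagemaker-pytorch-extended:latest
```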
The command above assumes there is only one running container in your Docker session; otherwise you will need to replace `docker ps -q` with the right container ID.

Build
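A minimal Dockerfile for this approach might look like the following (a sketch; the base image name and tag are placeholders for the sagemaker-pytorch image you actually deploy):

```dockerfile
# Extend the stock SageMaker PyTorch serving image so the dependency is baked in
# at build time instead of being installed during worker initialization.
FROM sagemaker-pytorch:1.1.0-cpu-py3
RUN pip install pytorch_pretrained_bert
```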
Testing your new image
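For instance (a sketch; the image name, port mapping, and model mount are illustrative, and SageMaker starts hosting containers with the `serve` argument):

```sh
# Build the extended image, run it locally in serve mode, and hit the health check.
# The -v mount places local model artifacts at SageMaker's expected /opt/ml/model path.
docker build -t sagemaker-pytorch-extended .
docker run -d -p 8080:8080 -v "$(pwd)/model:/opt/ml/model" sagemaker-pytorch-extended serve
curl http://localhost:8080/ping
```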
This will run the container on your local machine, which should iterate a lot quicker than waiting for instances to provision. When the container runs as expected, we can push the image to an ECR repo and deploy in SageMaker. I apologize for all of the inconvenience. I think that, to a certain extent, being required to do either one of the options listed above defeats the purpose of these images existing, since we want them to be an abstraction. Please let me know if there is anything I can clarify.
@ChoiByungWook - thanks for the clarification above. This is exactly what we are trying to do:
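(a sketch of the kind of deployment call, assuming the SageMaker Python SDK v1 PyTorchModel API; the model data, role, and image URI below are placeholders, and generate.py is the entry point mentioned earlier in this thread)

```python
# Illustrative sketch: deploy with a custom (sagemaker-pytorch extended) serving image.
from sagemaker.pytorch import PyTorchModel

image_name = "123456789012.dkr.ecr.us-east-1.amazonaws.com/sagemaker-pytorch-extended:latest"  # placeholder URI

model = PyTorchModel(
    model_data="s3://my-bucket/model.tar.gz",             # placeholder artifact location
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder execution role
    entry_point="generate.py",
    image=image_name,                                     # note: renamed to image_uri in SDK v2
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.t2.large")
```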
where image_name is our custom (sagemaker-pytorch extended) image, which in our case seems to lead to the creation of massive ...
@sivakhno Hope this answers your question. Please let us know if you have any further questions.
It seems like there is a bug in the initialization logic. Gunicorn processes are initialized not at container start but at the time of the first request's arrival. The global `app` variable here is not shared between gunicorn processes, so each process is initialized only when a request arrives. This causes random behavior: if the request comes to a worker that is already initialized, it is processed quickly; if the request comes to a worker that is not yet initialized, the response is delayed for quite some time (>30 sec in my case).
This will even cause `/ping` requests to time out and make it impossible to deploy the container to AWS.
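To make the failure mode concrete, here is a minimal sketch of the lazy-initialization pattern being described (illustrative only, not the container's actual source): each forked gunicorn worker holds its own module-level `app`, so the expensive setup runs on that worker's first request rather than at startup.

```python
# Sketch of per-worker lazy initialization (illustrative; the names and the Flask app
# are assumptions, not the sagemaker-pytorch-container source).
from flask import Flask

app = None  # each forked gunicorn worker gets its own copy, initially None


def build_app():
    """Stand-in for the real app factory, which would install requirements.txt
    dependencies and load the model (the slow part that delays the response)."""
    flask_app = Flask(__name__)

    @flask_app.route("/ping")
    def ping():
        return "", 200

    return flask_app


def main(environ, start_response):
    # Called by gunicorn for every request (e.g. `gunicorn module:main`).
    # The first request handled by each worker triggers the slow build_app(),
    # which is what makes /ping time out during deployment.
    global app
    if app is None:
        app = build_app()
    return app(environ, start_response)
```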