Template for deploying any HuggingFace pipeline for supported tasks with Torchserve #1818

Closed
tripathiarpan20 opened this issue Aug 27, 2022 · 6 comments
Labels: enhancement (New feature or request), triaged_wait (Waiting for the Reporter's response)


tripathiarpan20 commented Aug 27, 2022

Hi!
Lately I have been working on my repo, which contains a template to deploy any HuggingFace model supported by pipeline, where pipeline is a simple-to-use abstraction provided by HF. The repo also includes copy-paste commands in READMEs for AWS EC2 instances.

So far I have focused only on deploying models with the PyTorch backend, as I will be adding scripts to deploy TorchScripted & LLM.int8() pipeline models soon. Moreover, TF models present in an HF repo (example of an HF repo) can also be deployed by changing the framework attribute while initialising the pipeline, as sketched below.
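
For context, here is a minimal sketch of what initialising such a pipeline looks like; the model id and image file name are illustrative, not necessarily the repo's own example:

```python
from transformers import pipeline

# Illustrative model id; any pipeline-supported task/model pair works the same way.
classifier = pipeline(
    "image-classification",
    model="apple/mobilevit-xx-small",
    framework="pt",   # use framework="tf" to load TF weights from the same repo, if present
)

# pipeline accepts a local path, URL, or PIL image for vision tasks.
print(classifier("cat.jpg"))
```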

I have also tried to make the repo as beginner-friendly as possible by including comments, references and compact code. There are also plans to integrate the HuggingFace Optimum library, which works elegantly with pipeline, so by extension it would fit into my repo as well with a few short scripts.

My repo could be useful to the open-source community and I believe it would reach a greater audience if added to the News section and/or examples/Huggingface_Transformers.

Thanks.

@msaroufim (Member)

Hi @tripathiarpan20, thank you for sharing this. I'm wondering how the present state differs from the work here: https://github.com/pytorch/serve/tree/master/examples/Huggingface_Transformers

Ultimately, if we choose to merge in your work, I'd like us to have one recommended way of deploying HF models.

cc: @HamidShojanazeri

msaroufim added the enhancement (New feature or request) and triaged_wait (Waiting for the Reporter's response) labels on Aug 27, 2022

tripathiarpan20 commented Aug 27, 2022

Hi @msaroufim,
The main difference is that examples/Huggingface_Transformers does not use the HuggingFace pipeline abstraction at all and thus supports only a limited number of tasks, whereas pipeline can infer the required AutoTokenizer, AutoConfig and AutoModel class for the selected task, for any of the listed tasks, with cleaner code. Hence, the work here can be used as a template, with a few modifications to the handler (explained here), to support the remaining tasks that examples/Huggingface_Transformers does not cover.
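
For illustration, a rough sketch of what a pipeline-based TorchServe handler could look like; the class layout, hard-coded task and configuration details here are assumptions, not the repo's actual handler:

```python
import io

from PIL import Image
from transformers import pipeline
from ts.torch_handler.base_handler import BaseHandler


class HFPipelineHandler(BaseHandler):
    """Sketch of a handler that delegates model/tokenizer selection to pipeline."""

    def initialize(self, context):
        # model_dir holds whatever the .mar / mounted volume provides.
        model_dir = context.system_properties.get("model_dir")
        # pipeline resolves the right AutoConfig/AutoTokenizer/AutoModel
        # (or image processor) for the chosen task internally.
        self.pipe = pipeline(task="image-classification", model=model_dir)
        self.initialized = True

    def preprocess(self, data):
        # TorchServe passes a list of request dicts; decode image bytes into PIL images.
        images = []
        for row in data:
            payload = row.get("data") or row.get("body")
            images.append(Image.open(io.BytesIO(payload)).convert("RGB"))
        return images

    def inference(self, inputs):
        return [self.pipe(image) for image in inputs]

    def postprocess(self, outputs):
        return outputs
```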

Moreover, examples/Huggingface_Transformers does not have an example of serving inference requests with Docker, which is important for production purposes; the work aims to provide simple copy-paste commands to handle the Docker part for beginners (which many HuggingFace users are, given how easy to use HF is).
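
As a hedged illustration of that Docker-based flow (not the repo's exact commands), assuming the official pytorch/torchserve image is running with its default ports published and a model registered under an illustrative name:

```python
# Assumed setup, e.g.:
#   docker run --rm -p 8080:8080 -p 8081:8081 \
#       -v $(pwd)/model-store:/home/model-server/model-store pytorch/torchserve
import requests

# Send an image to the containerized inference endpoint (model name assumed).
with open("cat.jpg", "rb") as f:
    resp = requests.post("http://localhost:8080/predictions/mobilevit", data=f)
print(resp.status_code, resp.json())
```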

The work currently has code for the image-classification task with the MobileViT XX Small model, which was deployable even on the weakest AWS EC2 instance (t2.micro) with an average single-inference time of 126 ms (when max_batch_delay for the registered model was set to 10 ms).
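
For reference, max_batch_delay can be set when registering a model through TorchServe's management API (port 8081); the archive and model names below are assumed:

```python
import requests

# Register the archive from the model store with batching parameters.
params = {
    "url": "mobilevit.mar",     # archive name assumed
    "model_name": "mobilevit",
    "batch_size": 8,            # illustrative value
    "max_batch_delay": 10,      # milliseconds, matching the figure above
    "initial_workers": 1,
}
resp = requests.post("http://localhost:8081/models", params=params)
print(resp.status_code, resp.text)
```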

@msaroufim (Member)

Ok, that clarifies things, thank you. Please feel free to make your PR directly. I think for now we can focus on improving the support for our HF models, and once the PR is in we can work together to publicize the work. Please add @HamidShojanazeri and myself as reviewers for your PRs.


tripathiarpan20 commented Aug 27, 2022

Thanks, how exactly do you suggest I make the PR? For example, should I make a new folder in examples/Huggingface_Transformers called docker_integration, or something similar?

Another point I missed is that the work utilises Docker volume mounts (storage shared between host and container) to keep the .mar file down to roughly the size of the handler file, by passing a dummy/empty file to the --serialized-file argument of torch-model-archiver and having the handler load model checkpoints only from the mounted HF-models folder (Ctrl-F this README for "HF-models" for context).

This might be important in scenarios with LLMs like BLOOM, which can take up a lot of disk space and cause problems on low-storage machines if a copy of the model checkpoint is made during the model-archiving process.
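
A rough sketch of that dummy-serialized-file idea, written here as a Python call to the torch-model-archiver CLI (the file, handler and model names are illustrative; the repo's actual commands may differ):

```python
import pathlib
import subprocess

# Empty placeholder so the .mar stays roughly the size of the handler;
# the handler later loads real weights from the mounted HF-models volume.
pathlib.Path("dummy.pt").touch()
pathlib.Path("model-store").mkdir(exist_ok=True)

subprocess.run([
    "torch-model-archiver",
    "--model-name", "mobilevit",       # name assumed
    "--version", "1.0",
    "--serialized-file", "dummy.pt",   # placeholder instead of a checkpoint copy
    "--handler", "handler.py",         # the pipeline-based handler
    "--export-path", "model-store",
], check=True)
```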

@msaroufim (Member)

We can split the work into

  1. Add support for pipeline in existing handler code, refactor and simplify as needed
  2. Docker support for large models, in its own subfolder

@tripathiarpan20 (Author)

I have raised the PR; we can plan there how to integrate the pipeline into the code.
