Integration with https://github.com/michaelfeil/infinity #4793
Asked by michaelfeil in Ideas · Answered by ssheng
Hi there, I figured it would be cool to add an example using https://github.com/michaelfeil/infinity. Infinity is a fast backend for encoder LLMs (bert-large, text embeddings, etc.). Here is how an Infinity example could look:

```python
import bentoml


@bentoml.service(
    traffic={
        "timeout": 300,
    },
    resources={
        "gpu": 1,
        "gpu_type": "nvidia-l4",
    },
)
class INFINITYEMB:
    def __init__(self) -> None:
        from infinity_emb import AsyncEmbeddingEngine, EngineArgs

        ENGINE_ARGS = EngineArgs(model_name_or_path="BAAI/bge-small-en-v1.5", engine="torch")
        self.engine = AsyncEmbeddingEngine.from_engine_args(ENGINE_ARGS)

    @bentoml.api
    async def embed(
        self,
        sentences: list[str] = ["Explain superconductors like I'm five years old"],
    ) -> list[list[float]]:
        # Lazily start the engine on the first request.
        if not self.engine.is_running:
            await self.engine.astart()
        embeddings, usage = await self.engine.embed(sentences=sentences)
        return embeddings
```
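Once the service is running (e.g. via `bentoml serve`), the `embed` endpoint could be called with BentoML's standard HTTP client. A minimal sketch, assuming the server listens on the default port 3000:

```python
import bentoml

# Assumes the INFINITYEMB service above is running locally via `bentoml serve`.
with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    embeddings = client.embed(
        sentences=["Explain superconductors like I'm five years old"]
    )
    # One embedding vector per input sentence.
    print(len(embeddings), len(embeddings[0]))
```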
Here is the current example from your website for vLLM:

```python
import uuid
from typing import Annotated, AsyncGenerator

import bentoml
from annotated_types import Ge, Le

# Module-level constants referenced below; values are illustrative.
MAX_TOKENS = 1024
PROMPT_TEMPLATE = """<s>[INST] {user_prompt} [/INST] """


@bentoml.service(
    traffic={
        "timeout": 300,
    },
    resources={
        "gpu": 1,
        "gpu_type": "nvidia-l4",
    },
)
class VLLM:
    def __init__(self) -> None:
        from vllm import AsyncEngineArgs, AsyncLLMEngine

        ENGINE_ARGS = AsyncEngineArgs(
            model="meta-llama/Llama-2-7b-chat-hf",
            max_model_len=MAX_TOKENS,
        )
        self.engine = AsyncLLMEngine.from_engine_args(ENGINE_ARGS)

    @bentoml.api
    async def generate(
        self,
        prompt: str = "Explain superconductors like I'm five years old",
        max_tokens: Annotated[int, Ge(128), Le(MAX_TOKENS)] = MAX_TOKENS,
    ) -> AsyncGenerator[str, None]:
        from vllm import SamplingParams

        SAMPLING_PARAM = SamplingParams(max_tokens=max_tokens)
        prompt = PROMPT_TEMPLATE.format(user_prompt=prompt)
        stream = await self.engine.add_request(uuid.uuid4().hex, prompt, SAMPLING_PARAM)

        # Yield only the newly generated text on each iteration.
        cursor = 0
        async for request_output in stream:
            text = request_output.outputs[0].text
            yield text[cursor:]
            cursor = len(text)
```
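The streaming `generate` endpoint can likewise be consumed chunk by chunk over HTTP. A minimal sketch, again assuming a local server on the default port 3000:

```python
import bentoml

# Assumes the VLLM service above is running locally via `bentoml serve`.
with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    # Streaming endpoints yield text chunks as they are produced.
    for chunk in client.generate(
        prompt="Explain superconductors like I'm five years old",
        max_tokens=256,
    ):
        print(chunk, end="", flush=True)
```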
Answered by ssheng · Jun 14, 2024
Would you like to create a BentoInfinity repo and we can link it from the BentoML GitHub page?
Answer selected by michaelfeil