Integration with https://github.com/michaelfeil/infinity #4793
Asked by michaelfeil in Ideas · Answered by ssheng
Hi there, I figured it would be cool to add an example using https://github.com/michaelfeil/infinity. Infinity is a fast backend for encoder LLMs (bert-large, text embeddings, etc.). Here is how an Infinity example could look:

```python
import bentoml


@bentoml.service(
    traffic={
        "timeout": 300,
    },
    resources={
        "gpu": 1,
        "gpu_type": "nvidia-l4",
    },
)
class INFINITYEMB:
    def __init__(self) -> None:
        from infinity_emb import AsyncEmbeddingEngine, EngineArgs

        ENGINE_ARGS = EngineArgs(model_name_or_path="BAAI/bge-small-en-v1.5", engine="torch")
        self.engine = AsyncEmbeddingEngine.from_engine_args(ENGINE_ARGS)

    @bentoml.api
    async def embed(
        self,
        sentences: list[str] = ["Explain superconductors like I'm five years old"],
    ) -> list[list[float]]:
        # Lazily start the engine on the first request.
        if not self.engine.is_running:
            await self.engine.astart()
        embeddings, usage = await self.engine.embed(sentences=sentences)
        return embeddings
```
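Once the service is running (e.g. via `bentoml serve`), the `embed` endpoint could be called with BentoML's standard HTTP client. A minimal sketch, assuming the server listens on the default port 3000:

```python
import bentoml

# Assumes the INFINITYEMB service above is running locally via `bentoml serve`.
with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    embeddings = client.embed(
        sentences=["Explain superconductors like I'm five years old"]
    )
    # One embedding vector per input sentence.
    print(len(embeddings), len(embeddings[0]))
```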
Here is the current example from your website for vLLM:

```python
import uuid
from typing import Annotated, AsyncGenerator

import bentoml
from annotated_types import Ge, Le

# Module-level constants referenced below; values are illustrative.
MAX_TOKENS = 1024
PROMPT_TEMPLATE = """<s>[INST] {user_prompt} [/INST] """


@bentoml.service(
    traffic={
        "timeout": 300,
    },
    resources={
        "gpu": 1,
        "gpu_type": "nvidia-l4",
    },
)
class VLLM:
    def __init__(self) -> None:
        from vllm import AsyncEngineArgs, AsyncLLMEngine

        ENGINE_ARGS = AsyncEngineArgs(
            model="meta-llama/Llama-2-7b-chat-hf",
            max_model_len=MAX_TOKENS,
        )
        self.engine = AsyncLLMEngine.from_engine_args(ENGINE_ARGS)

    @bentoml.api
    async def generate(
        self,
        prompt: str = "Explain superconductors like I'm five years old",
        max_tokens: Annotated[int, Ge(128), Le(MAX_TOKENS)] = MAX_TOKENS,
    ) -> AsyncGenerator[str, None]:
        from vllm import SamplingParams

        SAMPLING_PARAM = SamplingParams(max_tokens=max_tokens)
        prompt = PROMPT_TEMPLATE.format(user_prompt=prompt)
        stream = await self.engine.add_request(uuid.uuid4().hex, prompt, SAMPLING_PARAM)

        # Yield only the newly generated text on each iteration.
        cursor = 0
        async for request_output in stream:
            text = request_output.outputs[0].text
            yield text[cursor:]
            cursor = len(text)
```
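The streaming `generate` endpoint can likewise be consumed chunk by chunk over HTTP. A minimal sketch, again assuming a local server on the default port 3000:

```python
import bentoml

# Assumes the VLLM service above is running locally via `bentoml serve`.
with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    # Streaming endpoints yield text chunks as they are produced.
    for chunk in client.generate(
        prompt="Explain superconductors like I'm five years old",
        max_tokens=256,
    ):
        print(chunk, end="", flush=True)
```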
Answered by ssheng · Jun 14, 2024
Would you like to create a BentoInfinity repo and we can link it from the BentoML GitHub page?
Answer selected by michaelfeil