🚀 The feature, motivation and pitch
Consider a scenario where a large model is deployed in the cloud, and the application is deployed on a computationally limited embedded device.
To support multimodal dialogue with vision and language, each request would have to send an image (and, once dialogue history is included, many images). Given network bandwidth and other constraints, this introduces significant latency.
Therefore, if the VLM's image encoder and projector were deployed on the embedded device and requests could send the encoded vector instead of the raw image, the transmitted data would be much smaller, reducing latency and improving the user experience.
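To make the on-device side concrete, here is a minimal sketch of how the image encoder and projector could run on the embedded device to produce the vector. The encoder checkpoint, projector shape, and image path below are placeholders, not part of this proposal; in practice they would have to match the target VLM exactly so the resulting vectors line up with the LLM's embedding space.

```python
# Minimal client-side sketch (assumed setup): run the VLM's vision encoder and
# multimodal projector on the device and send only the projected vectors.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Placeholder encoder; the real one must be the same encoder the cloud VLM uses.
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
# Placeholder projector; real weights must be loaded from the VLM checkpoint.
projector = torch.nn.Linear(1024, 4096)

image = Image.open("frame.jpg")  # placeholder image path
pixel_values = processor(images=image, return_tensors="pt").pixel_values
with torch.no_grad():
    patch_features = encoder(pixel_values).last_hidden_state   # (1, num_patches+1, 1024)
    vector = projector(patch_features).squeeze(0).numpy()      # (num_tokens, hidden_size)

# `vector` is what a request would carry instead of the raw image bytes.
```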
Alternatives
The suggested usage method is as follows:
```python
import numpy as np

# Refer to the HuggingFace repo for the correct format to use
prompt = "USER: <vector>\nWhat is the content of this image?\nASSISTANT:"

# Image encoded vector (x are placeholder values)
vector = np.array([x, x, x, x])

# Single prompt inference (`llm` is an already-initialized vLLM LLM instance)
outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"vector": vector},
})
```
With this usage, a deployment that serves only the language-model part could still support multimodal inputs, and the supported modality would not be limited to images.
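As a rough sketch of how the vector could travel from device to server, the snippet below packs the embedding as base64-encoded float16 with shape metadata and posts it over HTTP. The endpoint and JSON schema are assumptions for illustration only; vLLM does not define such a wire format today, which is exactly what this feature request asks for.

```python
# Hypothetical wire format for transmitting a precomputed embedding vector.
import base64
import numpy as np
import requests

def encode_vector(vector: np.ndarray) -> dict:
    """Pack an embedding as base64 float16 plus shape metadata."""
    buf = vector.astype(np.float16).tobytes()
    return {
        "dtype": "float16",
        "shape": list(vector.shape),
        "data": base64.b64encode(buf).decode("ascii"),
    }

payload = {
    "prompt": "USER: <vector>\nWhat is the content of this image?\nASSISTANT:",
    "multi_modal_data": {"vector": encode_vector(vector)},
}
# Hypothetical endpoint; the actual API shape would be defined by this feature.
requests.post("http://cloud-host:8000/generate", json=payload)
```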
Additional context
No response