Just Curious - Can we use it to deploy rag-based chatbots ? #80
Shivansh12t started this conversation in General
Hi everyone,
We are currently building a Retrieval-Augmented Generation (RAG) chatbot for Saturnalia, a large-scale event where we expect a significant surge in user traffic on the main day. We're exploring options to accelerate inference on CPUs, and we are considering models like Mistral 7B or LLaMA 7B for this deployment.
Given that Microsoft BitNet can help optimize LLM performance on CPU, I wanted to ask for advice on the CPU specifications we should consider for this setup.
We're fairly new to RAG and LLMs, and we're curious about the future scope and implications of deploying large models on CPUs. While our focus is on this particular event, we're excited about the potential and about how this technology could scale to other scenarios. Any insights on best practices, or even limitations, would be really helpful!
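For concreteness, here's a minimal sketch of the retrieval-and-prompting side of what we have in mind. Everything in it is a placeholder, not our actual stack: the toy bag-of-words similarity stands in for a real embedding model and vector store, the sample documents are invented, and the final prompt would be handed to a locally running CPU model (e.g. via llama.cpp or bitnet.cpp bindings, not shown here):

```python
# Toy sketch of a RAG retrieval + prompt-assembly step.
# Assumptions: bag-of-words cosine similarity instead of real embeddings,
# hard-coded sample docs, and no generation step (a CPU-hosted model such
# as a BitNet or llama.cpp build would consume `prompt` downstream).
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy "embedding": lowercase bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank documents by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    # Stuff the retrieved passages into a grounded-answer prompt.
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Saturnalia is the annual techno-cultural fest.",
    "The main stage opens at 6 pm on the final day.",
    "Registration closes a week before the event.",
]
print(build_prompt("When does the main stage open?", docs))
```

The per-request cost on CPU is dominated by the generation step, so the retrieval side above can stay simple; it's mainly the token throughput of the quantized model that we'd need to size the CPUs for.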
Thanks in advance for your guidance!