---
title: "Implementing Local RAG using Phi-3 ONNX Runtime and Sidecar Pattern on Linux App Service"
author_name: "Tulika Chaudharie"
toc: true
toc_sticky: true
---

In our previous post, ["Phi-3 ONNX: Leveraging the Power of ONNX Runtime to Run SLM on CPU"](https://azure.github.io/AppService/2024/08/19/Phi-3-ONNX.html), we explored how to deploy and run the Phi-3 ONNX Runtime model on Linux App Service, enabling sophisticated language model capabilities without the need for a GPU. Building on that foundation, this blog post will take you through the implementation of Retrieval-Augmented Generation (RAG) using local resources.

To recap, Retrieval-Augmented Generation (RAG) is a hybrid approach that enhances the capabilities of a language model by incorporating external knowledge. This technique combines the strengths of traditional retrieval systems and generative models to provide more accurate and contextually relevant responses. In our case, we're applying this concept to build an intelligent Assistant for our Fashion Store.

The Assistant we're creating uses external knowledge stored in several markdown (.md) files. These files contain essential information about our store's products, as well as shipping, payment, and exchange policies. The Assistant leverages this knowledge to answer customer queries with relevant, grounded responses.

Our sample application is composed of two main components:

1. **Frontend Chat Application**: Built using Python Flask, this serves as the user interface where customers can interact with the Assistant.

2. **Backend Application**: Developed in .NET, this application loads the markdown files into an in-memory vector database. Upon receiving a query, the backend performs a search within the vector database and passes the search results as context to the Phi-3 model. The generated response is then streamed back to the chat client.

For the in-memory vector database, we're using [Build5Nines.SharpVector](https://github.com/Build5Nines/SharpVector), an excellent open-source project by [Chris Pietschmann](https://pietschsoft.com/). SharpVector makes it easy to store and retrieve vectorized data, making it an ideal choice for our sample RAG implementation.
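
To give a feel for the library, here's a minimal sketch of storing and searching text with SharpVector. The type and member names (`BasicMemoryVectorDatabase`, `AddText`, `Search`) follow the project's documented basic usage, but check the SharpVector README for the exact API surface in the version you install:

```csharp
using Build5Nines.SharpVector;

// Create an in-memory vector database; no external services required.
var vectorDatabase = new BasicMemoryVectorDatabase();

// Add a couple of text chunks; the second argument is arbitrary metadata
// (here, the source file name).
vectorDatabase.AddText("We ship worldwide within 5-7 business days.", "shipping.md");
vectorDatabase.AddText("Returns are accepted within 30 days of purchase.", "returns.md");

// Search returns the stored chunks most similar to the query.
var results = vectorDatabase.Search("How long does delivery take?");
foreach (var item in results.Texts)
{
    Console.WriteLine($"{item.Metadata}: {item.Text}");
}
```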

It's important to note that this setup is designed for demo purposes. For building enterprise-scale applications, you may need to consider using distributed solutions like Cosmos DB or other scalable data storage and retrieval systems to handle larger datasets and more complex operations.

### Prerequisites

To build and run this demo, ensure you have the following:

- **.NET 8 SDK**: Required for the backend application. [Download .NET 8](https://dotnet.microsoft.com/download/dotnet/8.0).
- **Visual Studio Code (VS Code)**: For editing and developing project files. [Get VS Code](https://code.visualstudio.com/).
- **Docker Desktop**: Necessary for containerizing the applications. [Install Docker](https://www.docker.com/products/docker-desktop).
- **Python 3.9 or higher**: Used for the Python Flask frontend. [Download Python](https://www.python.org/downloads/).
- **Git**: To clone the sample repository. [Get Git](https://git-scm.com/downloads).

Clone the sidecar samples repository to get started:

```bash
git clone https://github.com/Azure-Samples/sidecar-samples
```

### Setting Up the .NET Backend

Once you have cloned the repository, you'll find the code for the backend in the `phi3-vector-api` directory. This backend is responsible for loading and processing the markdown files into an in-memory vector database and then generating responses using the Phi-3 ONNX model. Most of the key logic is contained within the `Phi3Controller.cs` file, which you can view [here](https://github.com/Azure-Samples/sidecar-samples/blob/main/phi3-vector/phi3-vector-api/Controllers/Phi3Controller.cs).

#### Key Dependencies

We’ve added three important packages to the project:

```csharp
using Build5Nines.SharpVector;
using Build5Nines.SharpVector.Data;
using Microsoft.ML.OnnxRuntimeGenAI;
```

These packages let us build and query the in-memory vector database (SharpVector) and run the Phi-3 model through the ONNX Runtime generative AI library.

#### Loading the Model and Creating the Tokenizer

The model is loaded from the filesystem, and a tokenizer is created to handle text processing:

```csharp
_model = new Model("/app/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4");
_tokenizer = new Tokenizer(_model);
```
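
The snippets that follow also reference a few controller fields, such as `_vectorDatabase` and `_systemPrompt`, which are declared elsewhere in the controller. As a rough sketch of what those declarations might look like (the exact types and the wording of the system prompt are assumptions, not the sample's actual code):

```csharp
private readonly Model _model;
private readonly Tokenizer _tokenizer;

// In-memory vector database that will hold the chunked markdown content.
private readonly BasicMemoryVectorDatabase _vectorDatabase = new();

// Illustrative system prompt; the sample's actual prompt may differ.
private readonly string _systemPrompt =
    "You are a helpful assistant for a fashion store. " +
    "Answer customer questions using only the provided context.";
```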

#### Loading Documents into the Vector Database

Next, we load the knowledge files (`.txt`, `.md`, and `.mdx`) from the local filesystem into the in-memory vector database. In this demo, we're using paragraph chunking, which splits each document at paragraph boundaries. However, you could opt for other chunking methods, such as line-based or character-limit chunking:

```csharp
private async Task LoadAdditionalDocuments(string directoryPath)
{
    var files = Directory.GetFiles(directoryPath, "*.*", SearchOption.AllDirectories)
        .Where(f => f.EndsWith(".txt", StringComparison.OrdinalIgnoreCase) ||
                    f.EndsWith(".md", StringComparison.OrdinalIgnoreCase) ||
                    f.EndsWith(".mdx", StringComparison.OrdinalIgnoreCase)).ToArray();

    var vectorDataLoader = new TextDataLoader<int, string>(_vectorDatabase);
    var tasks = files.Select(async file =>
    {
        Console.WriteLine($"Loading {file}");
        if (System.IO.File.Exists(file))
        {
            var fileContents = await System.IO.File.ReadAllTextAsync(file);
            await vectorDataLoader.AddDocumentAsync(fileContents, new TextChunkingOptions<string>
            {
                Method = TextChunkingMethod.Paragraph,
                RetrieveMetadata = (chunk) => file
            });
        }
    });

    await Task.WhenAll(tasks);
}
```

#### Generating Responses

When a user sends a query, we search the vector database for relevant information and use that as context for generating a response with the Phi-3 model. The response is then streamed back to the client as it's generated:

```csharp
[HttpPost("generate-response")]
public async Task GenerateResponse([FromBody] string userPrompt)
{
Console.WriteLine("Generate method called with : " + userPrompt);
string resultText = await SearchVectorDatabase(_vectorDatabase, userPrompt);
Console.WriteLine("Vector search returned : " + resultText);
var fullPrompt = $"{_systemPrompt}\n\n{resultText}\n\n{userPrompt}";

Response.ContentType = "text/plain";
await foreach (var token in GenerateAiResponse(fullPrompt))
{
if(token == null || token == "") {
break;
}
await Response.WriteAsync(token);
await Response.Body.FlushAsync(); // Flush the response stream to send the token immediately
}
}
```
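
The `SearchVectorDatabase` helper isn't shown above. Here's a minimal sketch of what such a helper could look like with SharpVector; the `SearchAsync` call, the `pageCount` parameter, and the way the results are flattened into a single string are assumptions based on the library's documented usage:

```csharp
private async Task<string> SearchVectorDatabase(BasicMemoryVectorDatabase vectorDatabase, string userPrompt)
{
    // Retrieve the chunks most similar to the user's query.
    // pageCount limits how many chunks end up in the prompt context.
    var results = await vectorDatabase.SearchAsync(userPrompt, pageCount: 3);

    // Concatenate the retrieved chunks into a single context string for the model.
    return string.Join("\n\n", results.Texts.Select(item => item.Text));
}
```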

#### AI Response Generation

Finally, the `GenerateAiResponse` method drives the generation loop, producing the next token in the sequence until the model signals it is done:

```csharp
private async IAsyncEnumerable<string> GenerateAiResponse(string fullPrompt)
{
    var tokens = _tokenizer.Encode(fullPrompt);
    var generatorParams = new GeneratorParams(_model);
    generatorParams.SetSearchOption("max_length", 4096);
    generatorParams.SetInputSequences(tokens);
    var generator = new Generator(_model, generatorParams);
    Console.WriteLine("Generator created.");

    while (!generator.IsDone())
    {
        generator.ComputeLogits();
        generator.GenerateNextToken();
        var output = GetOutputTokens(generator, _tokenizer);
        Console.WriteLine("Generating next token..." + output);
        if (output == null || output == "")
        {
            break;
        }
        yield return output; // Yield each token as it's generated
    }
}
```
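
The `GetOutputTokens` helper isn't shown either; its job is to decode the most recently generated token so it can be streamed back. A sketch based on the ONNX Runtime GenAI C# API (`Generator.GetSequence` and `Tokenizer.Decode`):

```csharp
private static string GetOutputTokens(Generator generator, Tokenizer tokenizer)
{
    // Take the full output sequence and decode only the newest token.
    var outputTokens = generator.GetSequence(0);
    var newToken = outputTokens.Slice(outputTokens.Length - 1, 1);
    return tokenizer.Decode(newToken);
}
```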

**Note: This demo focuses on the basic setup and functionality. We haven’t applied any optimizations to the response generation process, so the output may occasionally be incorrect. For more advanced scenarios, consider tuning ONNX Runtime generation parameters like `top_k`, `top_p`, `temperature`, and others to refine the model’s output. Refer to the documentation [here](https://onnxruntime.ai/docs/genai/reference/config.html).**
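
For example, sampling-related options can be set on the `GeneratorParams` before the generator is created; the values below are purely illustrative, not tuned for this scenario:

```csharp
generatorParams.SetSearchOption("do_sample", true);   // switch from greedy decoding to sampling
generatorParams.SetSearchOption("temperature", 0.6);
generatorParams.SetSearchOption("top_k", 40);
generatorParams.SetSearchOption("top_p", 0.9);
```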

#### Creating the container image for the backend

Once the backend code is ready, the final step is to containerize the application for deployment. We’ve included a Dockerfile in the repository, which you can find [here](https://github.com/Azure-Samples/sidecar-samples/blob/main/phi3-vector/phi3-vector-api/Dockerfile). This Dockerfile is set up to build and run the .NET backend in a container.

To containerize the backend, navigate to the `phi3-vector-api` directory and build the Docker image:

```bash
docker build -t your-registry-url/phi3-vector-api:latest .
```

Replace `your-registry-url` with the name of your container registry, such as Azure Container Registry, Docker Hub, or a private registry.

After building the image, push it to your container registry:

```bash
docker push your-registry-url/phi3-vector-api:latest
```

### Creating the Frontend Chat Application

To enable interaction with the backend API, we’ll set up a simple Python Flask app as the chat interface. One of the advantages of the Sidecar pattern is that it allows you to run multiple containers using different languages and frameworks side by side, seamlessly integrating them into a single application.

#### Flask Chat Application

Navigate to the `flask-chat-app` folder; the application code lives in `app.py`. The `generate` method below sends the user's input to the backend API and streams the response back in real time:

```python
def generate():
    start_time = time.time()
    response = requests.post(API_URL, json=input_text, stream=True)
    print(f"Time taken: {time.time() - start_time:.2f} seconds")

    if response.status_code == 200:
        for chunk in response.iter_content(chunk_size=1):
            if chunk:
                yield chunk.decode('utf-8')
    else:
        yield 'Error: ' + response.text

return Response(generate(), content_type='text/plain')
```

#### Containerizing the Flask Application

To containerize the app, use the provided [Dockerfile](https://github.com/Azure-Samples/sidecar-samples/blob/main/phi3-vector/flask-chat-app/Dockerfile) in the `flask-chat-app` directory. Build and push the image with:

```bash
docker build -t your-registry-url/flask-chat-app:latest .
docker push your-registry-url/flask-chat-app:latest
```

**Note: The Sidecar feature is currently enabled for custom-container scenarios on Linux App Service, with plans to support code-based scenarios soon.**

## Deploying the Application to Linux App Service

1. **Create a New Linux Web App in Azure**

Create a new Linux Web App from the portal and choose the options for Container and Linux.

![Create web app]({{site.baseurl}}/media/2024/07/CreateWebApp.jpg)

On the Container tab, make sure that Sidecar support is Enabled.

Specify the details of your application image (Flask app).

![Create web app]({{site.baseurl}}/media/2024/07/AddContainer.jpg)

*Make sure to set the port to 5000*

*Note: We strongly recommend enabling [Managed Identity](https://learn.microsoft.com/azure/app-service/overview-managed-identity?tabs=portal%2Chttp) for your Azure resources.*

2. **Add Phi-3 Vector API Sidecar**

Go to the Deployment Center for your application and add a sidecar container.

![Add Phi-3 Vector API sidecar]({{site.baseurl}}/media/2024/09/vector-api.jpg)

*Make sure to set the port to 5045*

This [document](https://learn.microsoft.com/azure/app-service/tutorial-custom-container-sidecar) tells you how to add sidecars, step-by-step.

3. **Verify the deployment**

Once your deployment is complete, you can browse to your application URL and see the chat frontend.

![Website output]({{site.baseurl}}/media/2024/09/phi-vector-output.jpg)


*Note: Since we are deploying a language model, please be aware that the application might take a little longer to start up the first time. This delay is due to the initial setup and loading of the Phi-3 model, which ensures that it is ready to handle requests efficiently. Subsequent startups should be faster once the model is properly initialized.*

## In Summary

In this blog, we demonstrated how to implement a Retrieval-Augmented Generation (RAG) system using the Phi-3 ONNX Runtime on Linux App Service with the Sidecar pattern. We walked through setting up the .NET backend to load markdown files into an in-memory vector database and generate responses using the Phi-3 model. We then built a Python Flask frontend to interact with the backend, showcasing how different languages and frameworks can work together seamlessly using Sidecars. Finally, we containerized both the backend and frontend for deployment.

We’d love to hear about your experiences and what you're building with the Sidecar feature. Your feedback is crucial as we continue to expand its capabilities and add more scenarios.