[Feature] Symfony docs chat bot (RAG) #19445

Open
etienneleba opened this issue Jan 20, 2024 · 20 comments

Comments

@etienneleba

etienneleba commented Jan 20, 2024

Hi everyone,

I'm thinking about creating a chatbot (or at least a query system with enhanced results) based on the Symfony/PHP documentation.

Tell me what you think, whether you see any blockers, or if you have any suggestions. Happy coding!

1. Issues

  • Mainstream chatbots are based on outdated data (2021/2022). When we look for information in technical docs, we need up-to-date data.
  • Mainstream chatbots usually don't give their sources
  • Mainstream chatbots are general-purpose; a specialized one could be more efficient

2. Solution

Of course, I'm not crazy: I don't want to train a model, it's quite expensive and I don't have the skills. My idea is to create a RAG (retrieval-augmented generation) system. By indexing several documentation sets from the Symfony ecosystem in a vector database and then using a small open-source model (Mistral 7B) to analyse and enhance the results, I think I can create something really nice.
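To make the retrieval half of that idea concrete, here is a minimal sketch in Python. It is not the POC's code: the bag-of-words `embed` function is a toy stand-in for a real vectorizer, and `TinyVectorIndex` and the `example.test` URLs are hypothetical names used only for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a real vectorizer: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyVectorIndex:
    """Hypothetical in-memory replacement for a vector database."""

    def __init__(self):
        self._items = []  # list of (vector, chunk) pairs

    def add(self, text, url):
        self._items.append((embed(text), {"text": text, "url": url}))

    def search(self, query, k=2):
        query_vec = embed(query)
        ranked = sorted(
            self._items,
            key=lambda item: cosine(query_vec, item[0]),
            reverse=True,
        )
        return [chunk for _, chunk in ranked[:k]]


index = TinyVectorIndex()
index.add(
    "The Scheduler component triggers recurring messages through Messenger.",
    "https://example.test/scheduler",
)
index.add(
    "The Cache component stores items in pools backed by adapters.",
    "https://example.test/cache",
)
index.add(
    "Doctrine maps PHP objects to database tables.",
    "https://example.test/doctrine",
)

hits = index.search("How do I schedule recurring messages?", k=1)
print(hits[0]["url"])
```

The retrieved chunks keep their URL, so whatever the model generates afterwards can link back to its source, which addresses the "no sources" issue listed above.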

3. Steps

The POC

V1

  • index Symfony LTS docs until 4.4, PHP docs until 7.4 and famous PHP/Symfony libraries (APIP, Doctrine, PHPUnit...)
  • AI module written in Python around Weaviate and served by FastAPI
  • Back-end to handle users, caching, auto-completion, rate-limiting (Symfony)
  • Simple console client to request the back-end (Symfony)
  • Dockerize and script everything

V2

  • from a query system to a chatbot (history and context)
  • Simple website interface with Symfony UX

More features

  • possibility to give some code with the question
  • index Stack Overflow questions and answers
  • put a link in the profiler, so in one click we can send the error to the chatbot

4. Business model

Hosting the infrastructure and running the Mistral 7B model will have a cost that I can afford as a side project, but not if people start using it a lot. I'm open to any suggestions.

Sources

@etienneleba etienneleba changed the title Symfony docs chat bot (RAG) [Feature] Symfony docs chat bot (RAG) Jan 20, 2024
@chr-hertel
Contributor

Yes, that would be awesome, but maybe not as trivial. @OskarStark and I already started this before, but the first results were actually not better than plain GPT, since the model already knows a lot about Symfony. But you're right - it is not up to date.

Were you already able to combine Symfony Docs, RAG and the coding "skills" of a model? Do you already have a repo we could look at?

@etienneleba
Author

OK, interesting. Which model did you use?

No, it's just the beginning. I only played with the Mistral 7B model and I was quite impressed that a small, open-source model gave me good results. With up-to-date docs it should give great results, I think. GPT-4 will always be better for analysis and programming, but I think a tool able to give pertinent results and links to the docs could add a lot of value to the current Symfony documentation 🤔

@chr-hertel
Contributor

We used ada for embeddings, pgvector for retrieval and GPT-4 as the model

@etienneleba
Author

And a RAG with GPT-4 was weaker than GPT-4 alone?

So weird 🤔

But maybe creating a chatbot is a bit too ambitious. Let's start with a simple RAG to simplify the doc navigation

@etienneleba
Author

And what was your strategy for splitting the docs into chunks?

Did you add metadata to the chunks?

@etienneleba
Author

etienneleba commented Jan 21, 2024

Feedback POC

I created a POC able to retrieve/parse/chunk a Symfony doc page. I used the Scheduler page because the Scheduler is a really new component.

Vector database: Weaviate
Vectorizer: text2vec_contextionary
Generative model: mistralai/Mixtral-8x7B-Instruct-v0.1
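The parse/chunk step is not shown in this comment, so the following is only a sketch of one plausible strategy, assuming the page is available as reStructuredText with underlined section headings: split on headings and attach the section title and page URL as metadata. The function name `chunk_by_heading` and the sample page are hypothetical.

```python
import re

# A heading underline: three or more of the same character among = - ~
UNDERLINE = re.compile(r"^([=\-~])\1{2,}\s*$")


def chunk_by_heading(page_text, page_url):
    """Split an RST-like page into one chunk per section, with metadata."""
    lines = page_text.splitlines()
    chunks = []
    title, body = "", []

    def flush():
        # Store the section collected so far, skipping empty sections.
        text = "\n".join(body).strip()
        if text:
            chunks.append({"title": title, "text": text, "url": page_url})

    i = 0
    while i < len(lines):
        is_heading = (
            i + 1 < len(lines)
            and lines[i].strip() != ""
            and UNDERLINE.match(lines[i + 1]) is not None
        )
        if is_heading:
            flush()
            title, body = lines[i].strip(), []
            i += 2  # skip the heading text and its underline
        else:
            body.append(lines[i])
            i += 1
    flush()
    return chunks


page = """Scheduler
=========

The Scheduler component manages task scheduling.

Custom Triggers
---------------

Implement TriggerInterface to define your own trigger.
"""

chunks = chunk_by_heading(page, "https://example.test/scheduler")
for chunk in chunks:
    print(chunk["title"], "->", chunk["text"])
```

Keeping the title and URL with each chunk is what later allows every retrieved document to link back to its documentation section.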

Prompt :

"You are an expert in the programming language PHP and the PHP framework Symfony. With the help of the following documents, your knowledge about PHP and the Symfony framework, please answer the question/instruction : 'How to create a custom trigger with the scheduler component ?'. Include code examples of PHP and Symfony. You will format you answer in markdown"

Answer :

Creating a custom trigger with the Symfony scheduler component involves implementing the TriggerInterface
in a custom class. This class will be responsible for determining the next run date for the message.

Here's an example of how to create a custom trigger that runs a message every other day:

  1. Create a new class CustomTrigger that implements the TriggerInterface:
// src/Scheduler/Trigger/CustomTrigger.php
namespace App\Scheduler\Trigger;

use Symfony\Component\Scheduler\TriggerInterface;
use Symfony\Component\Scheduler\Trigger\TriggerableInterface;

class CustomTrigger implements TriggerInterface
{
    public function getNextRunDate(TriggerableInterface $triggerable, \DateTimeImmutable $run): ?\DateTimeImmutable
    {
        $nextRun = $run->modify('+2 days');

        return $nextRun;
    }

    public function __toString(): string
    {
        return 'Custom Trigger';
    }
}
  2. Define your recurring message:
use App\Scheduler\Message\MyMessage;
use App\Scheduler\Trigger\CustomTrigger;

RecurringMessage::trigger(
    new CustomTrigger(),
    new MyMessage()
);
  3. Attach the recurring message to a schedule:
// src/Scheduler/MyScheduleProvider.php
namespace App\Scheduler;

use App\Scheduler\Message\MyMessage;
use App\Scheduler\Trigger\CustomTrigger;
use Symfony\Component\Scheduler\Schedule;
use Symfony\Component\Scheduler\ScheduleProviderInterface;

#[AsSchedule('uptoyou')]
class SaleTaskProvider implements ScheduleProviderInterface
{
    public function getSchedule(): Schedule
    {
        return $this->schedule ??= (new Schedule())
            ->with(
                RecurringMessage::trigger(
                    new CustomTrigger(),
                    new MyMessage()
                )
            );
    }
}
  4. Consume the messages (run the worker):
$ php bin/console messenger:consume scheduler_nameofyourschedule

This is a simple example of creating a custom trigger that runs a message every other day. You can modify
the getNextRunDate method to implement any custom schedule you need.


The result is quite nice, I think. With the plain Mistral 7B model, the results were not good. I gave the same prompt to the model without any documents, and it was able to answer to some extent, but not as richly as the RAG version. (I don't know how it knows about the Scheduler given that it's a new component; maybe it is confusing it with the older bundle 🤔)
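For anyone who wants to reproduce that with/without comparison, here is a minimal sketch of how the two prompts could be assembled. The base text is adapted from the prompt quoted in this comment. The `build_prompt` helper, the "Documents:" layout and the `max_chars` budget are assumptions for illustration; the comment does not show how the POC actually attaches documents.

```python
BASE_PROMPT = (
    "You are an expert in the programming language PHP and the PHP framework "
    "Symfony. With the help of the following documents, your knowledge about "
    "PHP and the Symfony framework, please answer the question/instruction : "
    "'{question}'. Include code examples of PHP and Symfony."
)


def build_prompt(question, chunks=(), max_chars=4000):
    """Build the prompt, optionally appending retrieved chunks as context."""
    prompt = BASE_PROMPT.format(question=question)
    sections, used = [], 0
    for chunk in chunks:
        entry = f"Source: {chunk['url']}\n{chunk['text']}"
        if used + len(entry) > max_chars:
            break  # stop once the (assumed) context budget is exhausted
        sections.append(entry)
        used += len(entry)
    if sections:
        prompt += "\n\nDocuments:\n\n" + "\n\n".join(sections)
    return prompt


question = "How to create a custom trigger with the scheduler component ?"
docs = [
    {
        "url": "https://example.test/scheduler",  # placeholder URL
        "text": "Custom triggers implement TriggerInterface.",
    }
]

prompt_without_docs = build_prompt(question)
prompt_with_docs = build_prompt(question, docs)
print(len(prompt_without_docs), len(prompt_with_docs))
```

Sending both prompts to the same model gives a like-for-like view of what the retrieved documents contribute.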

@etienneleba
Author

etienneleba commented Jan 22, 2024

Just see this : https://youtu.be/zFJgRd05Noo?si=eHgUklO5-iW11Fcf

@yoanbernabeu

Just see this : https://youtu.be/zFJgRd05Noo?si=eHgUklO5-iW11Fcf

This is a POC for educational purposes, to test the LLPhant library. Personally, I've tested it by indexing the entirety of the documentation, and the results are quite good, although GPT-4 is already not too bad by default.

The main advantage is noticeable in the freshness of the information.

@etienneleba
Author

etienneleba commented Jan 23, 2024

Feedback POC

I indexed all the current documentation. With a basic user question, the vector database was not really able to retrieve the most relevant documents... One way to improve this was to ask the model to rephrase the question, but in the end the most efficient approach was to ask the model to answer the question and then search the DB for passages similar to that answer. So for this first step, that's what I'm going to do: the model answers the question, and then I use the answer to retrieve the most pertinent documents. This way the user can dig deeper directly in the documentation (for each document I have the link to the documentation).
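The answer-then-retrieve flow described here can be sketched as follows. This is a simplified illustration, not the POC's code: `draft_answer` is a hypothetical stub standing in for the model call, the bag-of-words `embed` replaces the real vectorizer, and the URLs are placeholders.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a real vectorizer: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


DOCS = [
    {
        "url": "https://example.test/scheduler#custom-triggers",
        "text": "Custom triggers implement TriggerInterface and return the next run date.",
    },
    {
        "url": "https://example.test/setup",
        "text": "Install Symfony and start the local web server.",
    },
]


def draft_answer(question):
    """Hypothetical stub for the model call; returns a canned answer."""
    return (
        "Implement TriggerInterface in a class and return the next run date "
        "from getNextRunDate."
    )


def answer_with_further_reading(question, docs, k=1):
    # Step 1: the model answers on its own, with no documents in the prompt.
    answer = draft_answer(question)
    # Step 2: the answer, not the question, is used as the similarity query.
    query_vec = embed(answer)
    ranked = sorted(
        docs,
        key=lambda d: cosine(query_vec, embed(d["text"])),
        reverse=True,
    )
    return {"answer": answer, "further_reading": [d["url"] for d in ranked[:k]]}


result = answer_with_further_reading(
    "How to create a custom trigger with the scheduler component ?", DOCS
)
print(result["further_reading"])
```

Because the answer usually contains more domain vocabulary than the question, it gives the similarity search more to match against, which fits the observation above that the raw question lacked context.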

@chr-hertel
Contributor

So retrieval is basically a similarity search for further reading, based on the plain model's response?

@Guikingone
Contributor

I might be wrong, but as the documentation already uses Meilisearch as its primary search engine, and as Meilisearch recently introduced a vector search feature, maybe the team can work around it to build a "tailor-made" bot?

Documentation: https://www.meilisearch.com/docs/learn/experimental/vector_search

@etienneleba
Author

etienneleba commented Jan 24, 2024

So retrieval is basically a similarity search for further reading, based on the plain model's response?

Yeah exactly. I tested 4 different approaches and this one seems to be the best.

1. Retrieval with the user question

This one gave really bad results: the vector database seemed lost, as if there was not enough context.

2. Retrieval with the keywords extracted from the question

I asked the model to extract the main keywords/concepts of the user question. It was much better, but I was losing the user's intention and the vector database was still missing some context. For example, if the model returned the word 'Symfony' at some point, all the retrieved documents were about installation or the local web server...

3. Retrieval with the context around the question

I asked the model to give keywords, main concepts and context around the user question. This worked great, but at some point the model was just answering the question...

4. Retrieval with the answer of the model

In all the previous approaches, I was giving the retrieved documents to the model as context for answering the question. As @chr-hertel said previously, the models (GPT-4, Mixtral 8x7B) are really good and I was not seeing significant improvements. So I decided to let the model answer on its own (of course, I'm still guiding it with some prompt engineering) and then use the answer to retrieve the documents. This approach is the most efficient and really cheap: since I'm not giving documents as context, querying the model is much cheaper. With this approach I have a specialized Symfony bot (thanks to prompt engineering), and if the user wants to dig deeper or get up-to-date data, the most closely related doc sections complement the model's answer thanks to the retrieval.
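The cost argument can be illustrated with a back-of-the-envelope sketch. Everything here is an assumption for illustration: the "about 4 characters per token" rule is only a rough heuristic (real tokenizers differ), and the three 2000-character chunks are placeholders, not measurements from the POC.

```python
def rough_token_count(text):
    """Very rough heuristic: about 4 characters per token."""
    return max(1, len(text) // 4)


def generation_prompt_size(question, retrieved_chunks, use_context):
    """Estimate the prompt size sent to the model, with or without context."""
    parts = [question]
    if use_context:
        parts.extend(retrieved_chunks)
    return rough_token_count("\n\n".join(parts))


question = "How to create a custom trigger with the scheduler component ?"
# Three placeholder chunks of 2000 characters each (hypothetical sizes).
chunks = ["x" * 2000, "y" * 2000, "z" * 2000]

with_context = generation_prompt_size(question, chunks, use_context=True)
without_context = generation_prompt_size(question, chunks, use_context=False)

print("approaches 1-3 (documents in the prompt):", with_context)
print("approach 4 (question only):", without_context)
```

With input billed per token, leaving the documents out of the generation prompt shrinks it by roughly the size of the retrieved context, while the retrieval itself still runs against the vector database.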

@etienneleba
Author

I might be wrong, but as the documentation already uses Meilisearch as its primary search engine, and as Meilisearch recently introduced a vector search feature, maybe the team can work around it to build a "tailor-made" bot?

Documentation: https://www.meilisearch.com/docs/learn/experimental/vector_search

Interesting, maybe there is an easy bridge to make 🤔

@javiereguiluz
Member

Thanks for this proposal. I find this discussion super interesting. Same for the talk that @chr-hertel delivered in the recent Symfony online conference.

However, there's a big problem that will prevent us from using this: money. Using a custom solution hosted by us is out of the question, but using embeddings with Meilisearch won't be possible either because of this:

It is mandatory to pass an OpenAI API key through the OPENAI_API_KEY environment variable or the apiKey field when using an OpenAI embedder.

We can't pay for this ... so we need to hope that GPT-4 and others crawl Symfony Docs again and their contents update with the latest changes. Thanks for understanding 🙏

@etienneleba
Author

Last update

Screencast.from.25.01.2024.21.03.56.webm

@etienneleba
Author

Thanks for this proposal. I find this discussion super interesting. Same for the talk that @chr-hertel delivered in the recent Symfony online conference.

However, there's a big problem that will prevent us from using this: money. Using a custom solution hosted by us is out of the question, but using embeddings with Meilisearch won't be possible either because of this:

It is mandatory to pass an OpenAI API key through the OPENAI_API_KEY environment variable or the apiKey field when using an OpenAI embedder.

We can't pay for this ... so we need to hope that GPT-4 and others crawl Symfony Docs again and their contents update with the latest changes. Thanks for understanding 🙏

The indexing part is not really an issue: with Weaviate's text2vec_contextionary vectorizer it's free. Anyway, I will start this as a side project, and if it works, let's talk again about how we can bridge it with the Symfony docs.

@KevinMartinsDevUbi

Hello, excellent idea! I've been thinking about applying a similar concept to internal documentation. I'm very interested in your progress in this area. Although I lack experience in this field, I'm happy to contribute. Please don't hesitate to contact me if you need any help.

@florentdestremau
Contributor

Hi 👋 Have you also looked at a custom GPT in ChatGPT, where you can upload docs as permanent context? https://help.openai.com/en/articles/8554397-creating-a-gpt It would need to be fed a big text version of the docs, but it could work.

@smnandre
Member

I created one dedicated to Symfony UX and gave it all the source code + documentation + some blog posts... and the results are sometimes impressive, sometimes 100% off-topic. And it has a weird habit of persisting in its mistakes.

This one has a lot of "custom-made" instructions, tailored to my specific needs during the experiment, but I hope to recreate and open a more "standard" one next month, if anyone wants to play with it.

@WebMamba
Contributor

@florentdestremau I made one but the thing is you have to be premium to access it: https://chat.openai.com/g/g-hpapJXDZg-symfonychatbot


9 participants