LLM Powered Search Engine #3503

Open · t83714 opened this issue Feb 16, 2024 · 0 comments


This epic is about adding an LLM (Large Language Model) powered search engine to the open-source Magda codebase, in addition to the existing keyword-based search engine.

This ticket is an epic that provides an overview of the problem we are trying to solve.

1. Motivation

We need a vector store / search engine to support LLM embedding-based indexing & searching.

2. Indexing Strategy

  • We should be able to locate the most relevant information for building LLM context without the LLM itself being involved in the search process.
  • We will need a flexible indexing framework that can support various data sources/formats:
    • indexing not only text-based metadata fields but also relevant data files
    • For text-based data files (e.g. PDF, Word documents), we can index them as large chunks of text.
      • We need to decide how to include the metadata:
        • option 1: include it in the text, e.g. document author name etc.
        • option 2: store extra metadata fields that are used to recover the full context of a text chunk, e.g. chunk position (see the sketch after this list)
      • We also need to design a protocol for handling non-text content within text-based documents.
        • Scenario 1: graphic items, e.g. charts, graphs etc. We can index:
          • the name (e.g. fig1) & a short description of the graphic item (usually found underneath it)
          • also, the N chunks of text that mention the graphic item's name (e.g. fig1)
    • For non-text-based data files, we need to design an indexing strategy for each data format.
      • e.g. for tabular data (CSV), we should at least index the list of column names
        • If any data dictionary information is available, we should index it as well.
        • Where data dictionary information is missing and the indexing module has the capability, we should try to guess the column data types as well.
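
As a rough illustration of option 2 and the tabular case above, the indexed documents could look something like the sketch below. Every field name here (chunkIndex, dataType etc.) is a hypothetical placeholder, not a settled schema:

```typescript
// Hypothetical shape of an indexed text chunk (illustrative only).
interface IndexedChunk {
    datasetId: string;        // Magda registry record id the chunk belongs to
    distributionId?: string;  // set when the chunk comes from a data file
    sourceFormat: string;     // e.g. "PDF", "DOCX"
    chunkIndex: number;       // chunk position within the source document
    totalChunks: number;      // allows recovering the full context around a chunk
    text: string;             // the raw chunk text that was embedded
    embedding: number[];      // the embedding vector, stored in a knn_vector field
}

// For tabular data files (e.g. CSV) we would index column-level metadata instead.
interface IndexedTabularSource {
    datasetId: string;
    distributionId: string;
    columns: Array<{
        name: string;
        description?: string;  // from the data dictionary, if available
        dataType?: string;     // from the data dictionary, or guessed by the indexer
    }>;
}
```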

3. Vector Store

We will use the OpenSearch 2.x knn_vector field type. Why? Magda already runs a keyword-based search index, so storing embeddings in OpenSearch adds vector search without introducing a separate datastore.
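
For illustration only, creating and querying such an index through the official OpenSearch JavaScript client could look roughly like this; the index name, vector dimension and HNSW parameters below are placeholder assumptions, not a final design:

```typescript
import { Client } from "@opensearch-project/opensearch";

const client = new Client({ node: "https://localhost:9200" });

// Create an index with a knn_vector field. The dimension must match the
// embedding model's output size; 768 is just a placeholder here.
await client.indices.create({
    index: "magda-embeddings",
    body: {
        settings: { "index.knn": true },
        mappings: {
            properties: {
                datasetId: { type: "keyword" },
                text: { type: "text" },
                embedding: {
                    type: "knn_vector",
                    dimension: 768,
                    method: { name: "hnsw", space_type: "cosinesimil", engine: "lucene" }
                }
            }
        }
    }
});

// Approximate k-NN search: queryEmbedding must come from the same embedding
// model that produced the indexed vectors.
declare const queryEmbedding: number[];
const result = await client.search({
    index: "magda-embeddings",
    body: {
        size: 5,
        query: { knn: { embedding: { vector: queryEmbedding, k: 5 } } }
    }
});
```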

4. Indexing Module / Microservice

We need to introduce a new module to our platform based on Magda's minion framework.

  • How it works:
    • Magda's registry (metadata store) can notify the minion of any metadata changes.
    • The minion should wake up and perform indexing tasks based on those changes (a minimal sketch follows this list).
      • The minion module should have an extensible code base so that we can adopt changes to the indexing strategy (see section 2 above) as it evolves.
      • The minion can be written in TypeScript or any other language, but it requires the capability of calling modules in other languages as forked processes.
      • The minion should store the indexing results in our internal OpenSearch instance, according to our OpenSearch index design.
    • The minion should support a recrawl interface to redo the index globally.
      • This gives us the option to rebuild the index after index design changes.
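
A minimal sketch of what the minion could look like, assuming the onRecordFound-callback style of Magda's existing minion framework; the package name, option names and the dataset-format aspect access below are assumptions for illustration, not an exact API:

```typescript
// Assumption: Magda's minion framework exposes a minion() entry point with an
// onRecordFound hook; the exact package name and option shape may differ.
import minion from "@magda/minion-sdk";

// Pluggable per-format indexers, so the code base can evolve together with
// the indexing strategy in section 2.
interface FormatIndexer {
    supports(format: string): boolean;
    // Produce chunk documents (text + metadata + embedding) ready for OpenSearch.
    buildChunks(record: unknown): Promise<object[]>;
}

const indexers: FormatIndexer[] = []; // e.g. pdfIndexer, csvIndexer, ...

minion({
    id: "minion-embedding-indexer",     // hypothetical minion id
    aspects: ["dataset-distributions"], // aspect changes that wake the minion up
    // Called whenever the registry notifies the minion of a record change.
    async onRecordFound(record: any) {
        const format = record?.aspects?.["dataset-format"]?.format; // assumed path
        for (const indexer of indexers) {
            if (indexer.supports(format)) {
                const chunks = await indexer.buildChunks(record);
                // ... bulk-index `chunks` into the knn_vector index ...
            }
        }
    }
});
```

A recrawl would then simply replay onRecordFound over every registry record, which gives us the global re-index described above.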

Some of the design has been covered by tickets I created for the AI4M data-sharing platform:

But this ticket is for more generic use cases and will become the common base/facility for all Magda-based projects.
