LLM Powered Search Engine
This epic is about adding an LLM (Large Language Model) powered search engine to the open-source Magda code base, in addition to the existing keyword-based search engine.
This ticket is an epic that provides an overview of the problem we are trying to solve.
1. Motivation
We need a vector store / search engine to facilitate LLM embedding-based indexing & searching.
2. Indexing Strategy
We should be able to locate the most relevant information for building LLM context without involving the LLM in the search process itself.
We will need a flexible indexing framework that can support various data sources/formats:
- not only indexing text-based metadata fields but also relevant data files
- For text-based data files (e.g. PDF, Word), we can index them as large chunks of text
  - although we need to decide how to include the metadata:
    - option 1: include it in the text, e.g. document author name etc.
    - option 2: use extra metadata fields to recover the full context of a text chunk, e.g. chunk position (see the sketch after this list)
- We also need to design a protocol for handling non-text content within text-based documents
  - Scenario 1: graphic items, e.g. charts, graphs etc. We can index:
    - the name (e.g. fig1) & a short description of the graphic item (usually found underneath it)
    - also, pick N chunks of text that mention the graphic item's name (e.g. fig1)
- For non-text-based data files, we need to design an indexing strategy for each data format
  - e.g. for tabular data (CSV), we should at least index the list of column names
  - if any data dictionary information is available, we should index it as well
  - where data dictionary information is missing, the indexing module should (if it has the capability) try to guess the column data types as well (see the second sketch below)
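To make the two metadata options concrete, here is a minimal sketch of what an indexed chunk could look like under option 2, plus a helper for the graphic-item scenario. All names and fields below are illustrative assumptions, not a settled schema:

```typescript
// Illustrative only: field names below are assumptions, not a settled schema.
// Under option 1 we would instead fold metadata (e.g. author name) into `text`.
interface IndexedChunk {
  recordId: string;    // Magda registry record the source file belongs to
  sourceFile: string;  // e.g. the distribution's download URL
  chunkIndex: number;  // option 2: position used to recover the full context
  totalChunks: number; // lets us fetch neighbouring chunks around a hit
  text: string;        // the raw chunk text that gets embedded
  embedding: number[]; // vector produced by the embedding model
}

// Scenario 1 (graphic items): alongside a chunk holding the item's name and
// caption, pick up to N existing text chunks that mention the name.
function chunksMentioningGraphic(
  name: string, // e.g. "fig1"
  allChunks: IndexedChunk[],
  n: number
): IndexedChunk[] {
  return allChunks.filter(c => c.text.includes(name)).slice(0, n);
}
```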
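For the tabular case, a rough sketch of a per-column index entry, with a deliberately naive type guesser standing in for whatever inference approach we settle on; again, all names are placeholders:

```typescript
type ColumnType = "integer" | "number" | "boolean" | "date" | "string";

// Placeholder shape for a per-column index entry.
interface IndexedColumn {
  name: string;             // always available from the CSV header row
  description?: string;     // taken from the data dictionary when available
  guessedType?: ColumnType; // only inferred when no data dictionary exists
}

// Deliberately naive inference over a sample of cell values; a stand-in for
// whatever type-guessing approach the indexing module actually adopts.
function guessColumnType(samples: string[]): ColumnType {
  const values = samples.filter(v => v.trim() !== "");
  if (values.length === 0) return "string";
  if (values.every(v => /^-?\d+$/.test(v))) return "integer";
  if (values.every(v => !Number.isNaN(Number(v)))) return "number";
  if (values.every(v => /^(true|false)$/i.test(v))) return "boolean";
  if (values.every(v => !Number.isNaN(Date.parse(v)))) return "date";
  return "string";
}
```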
3. Vector Store
We will use the OpenSearch 2.x `knn_vector` field. Why?
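As a reference point, here is roughly what creating an index with a `knn_vector` field looks like through the OpenSearch JavaScript client; the index name, field names, dimension, and HNSW parameters below are placeholder assumptions, not final design decisions:

```typescript
import { Client } from "@opensearch-project/opensearch";

const client = new Client({ node: "http://localhost:9200" }); // placeholder node

// Sketch of an OpenSearch 2.x index with a knn_vector field; everything
// below (index name, dimension, HNSW parameters) is a placeholder.
async function createChunkIndex(): Promise<void> {
  await client.indices.create({
    index: "magda-llm-chunks",
    body: {
      settings: { index: { knn: true } },
      mappings: {
        properties: {
          recordId: { type: "keyword" },
          text: { type: "text" },
          embedding: {
            type: "knn_vector",
            dimension: 768, // must match the embedding model's output size
            method: { name: "hnsw", space_type: "cosinesimil", engine: "nmslib" },
          },
        },
      },
    },
  });
}
```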
4. Indexing Module / Microservice
We need to introduce a new module to our platform based on Magda's minion framework.
- Magda's registry metadata store can notify the minion of any metadata changes
- the minion should wake up and perform the indexing tasks based on those changes
- The minion module should implement an extendable code base so that we can adopt changes to the Indexing Strategy (see section 2 above) as it evolves.
- The minion can be written in TypeScript or any other language, but it requires the capability of calling modules written in other languages as forked processes (a sketch follows this list).
- The minion should store the indexing results in our internal OpenSearch instance according to our OpenSearch indexing design.
- The minion should support a `recrawl` interface to redo the index globally.
  - This offers us an option to rebuild the index after index design changes.
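A small sketch of how a TypeScript minion could delegate format-specific work to a process written in another language; the helper script path and its contract are hypothetical, and the actual minion wiring (registry notifications, recrawl) is elided here:

```typescript
import { execFile } from "child_process";
import { promisify } from "util";

const run = promisify(execFile);

// Hypothetical: delegate text extraction for a given file format to an
// external (e.g. Python) module run as a separate child process.
// "./extractors/extract_text.py" is a placeholder path, not an existing file.
async function extractText(filePath: string): Promise<string> {
  const { stdout } = await run("python3", ["./extractors/extract_text.py", filePath]);
  return stdout;
}
```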
Some of the design has been covered by tickets I created for the AI4M data-sharing platform:
But this ticket is for more generic use cases and will become the common base/facility for all Magda-based projects.