RAG-based Specification Management System

Quick Start

Installation

Configure a Python environment (e.g. conda create -n llm_prj_env python=3.11, then conda activate llm_prj_env)
Install requirements
- pip install -r requirements.txt
Create a Pinecone index
- from console: pinecone-console
- from code: documentation
Set secrets
- Add API keys to .streamlit/secrets.toml (already listed in .gitignore)
- OPENAI_API_KEY, PINECONE_API_KEY, PINECONE_INDEX_NAME, UPSTAGE_API_KEY, ...
- Set the AWS variables if you want to use S3
- See .streamlit/secrets_sample.toml for a template

Configuration Guide

{
    "global": {
        "lang": {
            "user": "Korean",
            "source": "English",
            "assistant": "Korean"
        },
        "context_hierarchy": true
    },
    "chat": {

    },
    "rag": {
        "ingestion": {
            "ingestor": "pinecone-multivector", // one of IngestorManager.ingestors
            "embeddings": "text-embedding-3-small",
            "namespace": "parent", // pinecone index namespace
            "sub_namespace": "child" // pinecone index namespace
        },
        "transformation": {
            "model": "gpt-4o-mini",
            "enable": {
                "translation": true,
                "rewriting": true,
                "expansion": false,
                "hyde": true
            }
        },
        "retrieval": {
            "retriever": ["pinecone-multivector"], // sub list of RetrieverManager.retrievers
            "namespace": "parent", // pinecone index namespace
            "sub_namespace": "child", // pinecone index namespace
            "embeddings": "text-embedding-3-small",
            "top_k": 6
        },
        "generation": {
            "model": "gpt-4o"
        },
        "fact_verification": {
            "model": "gpt-4o",
            "enable": false
        }
    }
}
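
The configuration above contains // comments, which the standard json module rejects. How the project itself parses the file is not stated here, so the following is only a sketch of one way to read such a file: it strips // comments outside string literals and then checks that ingestion and retrieval agree on namespace and embedding model. The function names and the consistency check are assumptions for illustration.

```python
import json
import re

# Matches either a JSON string literal (kept) or a // comment (dropped).
_TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|//[^\n]*')


def strip_line_comments(text):
    """Remove // comments that sit outside string literals."""
    return _TOKEN.sub(
        lambda m: m.group(0) if m.group(0).startswith('"') else "", text
    )


def load_config(text):
    """Parse JSON text that may contain // comments."""
    return json.loads(strip_line_comments(text))


def mismatched_keys(config, keys=("namespace", "sub_namespace", "embeddings")):
    """List keys whose values differ between rag.ingestion and rag.retrieval.

    Assumption: retrieval should read from the namespaces that ingestion
    wrote to, using the same embedding model.
    """
    ingestion = config["rag"]["ingestion"]
    retrieval = config["rag"]["retrieval"]
    return [k for k in keys if ingestion.get(k) != retrieval.get(k)]


SAMPLE = """
{
    "rag": {
        "ingestion": {
            "ingestor": "pinecone-multivector", // one of IngestorManager.ingestors
            "embeddings": "text-embedding-3-small",
            "namespace": "parent",
            "sub_namespace": "child"
        },
        "retrieval": {
            "retriever": ["pinecone-multivector"],
            "namespace": "parent",
            "sub_namespace": "child",
            "embeddings": "text-embedding-3-small",
            "top_k": 6
        }
    }
}
"""

config = load_config(SAMPLE)
print(config["rag"]["retrieval"]["top_k"])  # 6
print(mismatched_keys(config))  # []
```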

Ingestions

General Ingestion Guide

Configure your own BaseRAGLoader
Run ingest.py with your loader

Run python ingest.py -h for further information

Upstage Ingestion Guide

python ingest.py -l upstage_layout -s [source_dir] -b [backup_dir] -a -d

-l: loader type. One of upstage_layout, upstage_backup, or pypdf
-s: source directory
-b: backup directory
-a: all. When set, all files are downloaded from S3 during download (if -d is enabled), and all files are re-analyzed during layout analysis
-d: download. When set, files are downloaded from S3 into the configured source directory
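
To make the flag semantics concrete, here is a small argparse sketch that mirrors the options described above. It is not the actual ingest.py: the long option names, the dest names, and the choice to make -l required are assumptions, and the default directories are the ones this README documents.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented ingest.py flags (sketch)."""
    parser = argparse.ArgumentParser(
        prog="ingest.py",
        description="Sketch of the ingestion flags; not the real script.",
    )
    # Assumed required here; the real script may define a default loader.
    parser.add_argument("-l", "--loader", required=True,
                        help="loader type, e.g. upstage_layout, upstage_backup, pypdf")
    parser.add_argument("-s", "--source-dir", default="./source_documents",
                        help="directory holding the source documents")
    parser.add_argument("-b", "--backup-dir", default="./backup",
                        help="directory where layout analysis results are cached")
    parser.add_argument("-a", "--all", dest="ingest_all", action="store_true",
                        help="process every file instead of only the missing ones")
    parser.add_argument("-d", "--download", action="store_true",
                        help="download files from S3 into the source directory")
    return parser


# Equivalent to: python ingest.py -l upstage_layout -a -d
args = build_parser().parse_args(["-l", "upstage_layout", "-a", "-d"])
print(args.loader, args.source_dir, args.backup_dir, args.ingest_all, args.download)
```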

Set UPSTAGE_API_KEY
Prepare source documents
- Default source directory: ./source_documents/*
- You can set your own directory by running ingest.py with the -s [source_dir] option
- To attach metadata, place [file_name].metadata.json in the same location as the original document.
- Example documents:
- Metadata
- Note: to download the documents from S3, use the -d option. The download can take a long time.
Set backup directory
- Layout analysis is an expensive task. You can cache its results by specifying a backup directory with the -b [backup_dir] option. The default backup directory is ./backup/*
To ingest all documents, add the -a option. If it is not set, the ingestor scans ingestor_logs.txt and ingests only the missing files.
Run python ingest.py with the -l upstage_layout option.
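
The "ingest only missing files" behavior can be pictured as a set difference between the source directory and the ingestion log. The sketch below illustrates that idea only; the real format of ingestor_logs.txt is not documented here, so it assumes one file name per line, and the function names are hypothetical.

```python
from pathlib import Path


def read_ingested(log_path):
    """Return the set of file names recorded in the log.

    Assumption: the log holds one file name per line.
    """
    log = Path(log_path)
    if not log.exists():
        return set()
    lines = log.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def files_to_ingest(source_dir, log_path, ingest_all=False, pattern="*.pdf"):
    """Pick the documents to process, skipping logged ones unless ingest_all."""
    candidates = sorted(Path(source_dir).glob(pattern))
    if ingest_all:  # corresponds to the -a option
        return candidates
    done = read_ingested(log_path)
    return [path for path in candidates if path.name not in done]


if __name__ == "__main__":
    for path in files_to_ingest("./source_documents", "ingestor_logs.txt"):
        print("would ingest:", path)
```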

To ingest from the backup directory, use the upstage_backup loader (-l upstage_backup) with the appropriate -b [backup_dir]:
python ingest.py -l upstage_backup -b [backup_dir] -a

Run App

Run streamlit run chat.py
To deploy the Streamlit app, see link

Project Structure

Caveats

When the default PyPDFLoader is used (loader name: pypdf), the local file path is stored as-is in doc_id (e.g. /home/fadu/prj/source_documents/major.pdf)
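
If machine-specific paths in doc_id are a problem, one possible workaround is to normalize the path relative to the source directory before ingestion. This is a sketch of that idea, not something the project is known to do; portable_doc_id is a hypothetical helper.

```python
from pathlib import PurePosixPath


def portable_doc_id(local_path, source_root):
    """Turn an absolute local path into an ID relative to source_root.

    Falls back to the bare file name when the path lies outside source_root.
    """
    path = PurePosixPath(local_path)
    try:
        return path.relative_to(PurePosixPath(source_root)).as_posix()
    except ValueError:
        return path.name


print(portable_doc_id(
    "/home/fadu/prj/source_documents/major.pdf",
    "/home/fadu/prj/source_documents",
))  # major.pdf
```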

Summary

Cold Start

python ingest.py -l upstage_layout -a -d (download all documents from S3 -> convert all documents to text -> insert into Pinecone)
streamlit run chat.py

If an error occurs during Cold Start (restart)

python ingest.py -l upstage_loader (upstage_loader checks whether backed-up markdown documents exist and ingests the already-parsed markdown documents first)
python ingest.py -l upstage_backup (because of a paging issue, a small number of pages may have been parsed but not ingested; check those documents and ingest them)

When adding specific documents

Upload the .pdf and .metadata.json files to S3
python ingest.py -l upstage_layout -d (download the added documents from S3 -> convert the added documents to text -> insert into Pinecone)
streamlit run chat.py

PJH6029/llm-rag-project
