- Configure python environment (ex. `conda create -n llm_prj_env python=3.11`, `conda activate llm_prj_env`)
- Install requirements: `pip install -r requirements.txt`
- Create pinecone index
  - from console: pinecone-console
  - from code: documentation
- Set secret environments
  - set API Keys to `.streamlit/secrets.toml` (already registered in `.gitignore`): `OPENAI_API_KEY`, `PINECONE_API_KEY`, `PINECONE_INDEX_NAME`, `UPSTAGE_API_KEY`, ...
  - set AWS envs if you want to use S3
    - see `.streamlit/secrets_sample.toml`
{
  "global": {
    "lang": {
      "user": "Korean",
      "source": "English",
      "assistant": "Korean"
    },
    "context_hierarchy": true
  },
  "chat": {},
  "rag": {
    "ingestion": {
      "ingestor": "pinecone-multivector", // one of IngestorManager.ingestors
      "embeddings": "text-embedding-3-small",
      "namespace": "parent", // pinecone index namespace
      "sub_namespace": "child" // pinecone index namespace
    },
    "transformation": {
      "model": "gpt-4o-mini",
      "enable": {
        "translation": true,
        "rewriting": true,
        "expansion": false,
        "hyde": true
      }
    },
    "retrieval": {
      "retriever": ["pinecone-multivector"], // sub list of RetrieverManager.retrievers
      "namespace": "parent", // pinecone index namespace
      "sub_namespace": "child", // pinecone index namespace
      "embeddings": "text-embedding-3-small",
      "top_k": 6
    },
    "generation": {
      "model": "gpt-4o"
    },
    "fact_verification": {
      "model": "gpt-4o",
      "enable": false
    }
  }
}
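The configuration above carries `//` comments, which Python's standard `json` module rejects. The sketch below shows one way to read such a file and to check that retrieval points at the same namespaces and embedding model as ingestion, which is presumably what the matching values above intend. The function names are hypothetical, and how the project itself parses this file is not stated here.

```python
import json


def strip_line_comments(text):
    """Remove // comments that sit outside JSON string literals."""
    out = []
    in_string = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < len(text):
                out.append(text[i + 1])  # keep the escaped character as-is
                i += 1
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif text.startswith("//", i):
            # Skip to the end of the line; the newline itself is kept.
            while i < len(text) and text[i] != "\n":
                i += 1
            continue
        else:
            out.append(ch)
        i += 1
    return "".join(out)


def load_config(text):
    """Parse the commented JSON and check ingestion/retrieval consistency."""
    config = json.loads(strip_line_comments(text))
    ingestion = config["rag"]["ingestion"]
    retrieval = config["rag"]["retrieval"]
    # Assumption: vectors written at ingestion time are only found again if
    # retrieval uses the same namespaces and the same embedding model.
    for key in ("namespace", "sub_namespace", "embeddings"):
        if ingestion[key] != retrieval[key]:
            raise ValueError(f"rag.ingestion.{key} != rag.retrieval.{key}")
    return config
```

The stripper tracks string state so that a value such as a URL containing `//` is left untouched.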
- Configure your own `BaseRAGLoader`
- Run `ingest.py` with your loader
  Run `python ingest.py -h` for further information
python ingest.py -l upstage_layout -s [source_dir] -b [backup_dir] -a -d
-l: loader type. One of upstage_layout, upstage_backup, or pypdf
-s: source directory
-b: backup directory
-a: all. When set, downloading (with -d enabled) fetches every file from S3, and layout analysis re-analyzes every file
-d: download. When set, files are downloaded from S3 into the configured source directory
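As an illustration of how these flags fit together, here is a minimal `argparse` sketch that mirrors the options listed above. It is not the actual `ingest.py`: the `dest` names and help strings are assumptions, the defaults are taken from the directories mentioned in this document, and `python ingest.py -h` remains the authoritative reference.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented ingest.py flags."""
    parser = argparse.ArgumentParser(prog="ingest.py")
    parser.add_argument(
        "-l",
        dest="loader",
        choices=["upstage_layout", "upstage_backup", "pypdf"],
        help="loader type",
    )
    # Defaults below are assumptions based on the directories named in this README.
    parser.add_argument("-s", dest="source_dir", default="./source_documents",
                        help="source directory")
    parser.add_argument("-b", dest="backup_dir", default="./backup",
                        help="backup directory")
    parser.add_argument("-a", dest="all", action="store_true",
                        help="process every file instead of only missing ones")
    parser.add_argument("-d", dest="download", action="store_true",
                        help="download files from S3 into the source directory")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-l", "upstage_layout", "-a", "-d"])
    print(args.loader, args.source_dir, args.backup_dir, args.all, args.download)
```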
- Set `UPSTAGE_API_KEY`
- Prepare source documents
  - Default source directory: `./source_documents/*`
    - You can set your own directory by running `ingest.py` with the `-s [source_dir]` option
  - To attach metadata, place `[file_name].metadata.json` in the same location as the original document.
    - Example documents:
    - Metadata
  - Note: to download the documents from S3, use the `-d` option. The download can take a long time.
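The metadata convention above can be sketched as a sidecar lookup. The text does not say whether `[file_name]` includes the document's extension, so this hypothetical helper tries both `major.metadata.json` and `major.pdf.metadata.json`; the loaders in this project may resolve it differently.

```python
import json
from pathlib import Path


def load_sidecar_metadata(doc_path):
    """Return metadata stored next to a document, or {} if none is found."""
    doc = Path(doc_path)
    # Assumption: either naming scheme may be in use, so both are checked.
    candidates = [
        doc.with_name(doc.stem + ".metadata.json"),  # e.g. major.metadata.json
        doc.with_name(doc.name + ".metadata.json"),  # e.g. major.pdf.metadata.json
    ]
    for candidate in candidates:
        if candidate.is_file():
            with candidate.open(encoding="utf-8") as f:
                return json.load(f)
    return {}
```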
- Set backup directory
  - Layout analysis is an expensive task. You can cache the result by specifying `backup_dir` with the `-b [backup_dir]` option.
- If you want to ingest all documents, add the `-a` option. If it is not set, the ingestor scans `ingestor_logs.txt` and ingests only the missing files. The default backup directory is `./backup/*`.
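The difference between a full and an incremental run can be sketched as follows. This is only an illustration: the format of `ingestor_logs.txt` is not described here, so the sketch assumes one already-ingested file name per line, and `files_to_ingest` is a hypothetical helper rather than the project's function.

```python
from pathlib import Path


def files_to_ingest(source_dir, log_path="ingestor_logs.txt", ingest_all=False):
    """List the source files that a run would process.

    Assumption: the log holds one ingested file name per line.
    """
    sources = sorted(
        p for p in Path(source_dir).iterdir()
        if p.is_file() and not p.name.endswith(".metadata.json")
    )
    if ingest_all:  # corresponds to passing -a
        return sources
    log = Path(log_path)
    done = set()
    if log.exists():
        lines = log.read_text(encoding="utf-8").splitlines()
        done = {line.strip() for line in lines if line.strip()}
    # Without -a, only files missing from the log are returned.
    return [p for p in sources if p.name not in done]
```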
- Run `python ingest.py` with the `-l upstage_layout` option.
  If you want to ingest from the backup directory, use the `-l upstage_backup` loader with the proper `-b [backup_dir]`:
  python ingest.py -l upstage_backup -b [backup_dir] -a
- Run `streamlit run chat.py`
- If you want to deploy the streamlit app, see link
- If you use the default PyPDFLoader (loader name: `pypdf`), the local path is stored as-is in doc_id (ex. `/home/fadu/prj/source_documents/major.pdf`)
- `python ingest.py -l upstage_layout -a -d` (download all documents from S3 -> convert all documents to text -> insert into Pinecone)
- `streamlit run chat.py`
- `python ingest.py -l upstage_loader` (`upstage_loader` checks whether backed-up markdown documents exist and ingests the already-parsed markdown documents first)
- `python ingest.py -l upstage_backup` (because of a paging issue, a few pages may have been parsed but not ingested; check those documents and ingest them)
- Upload the `.pdf` and `.metadata.json` files to S3
- `python ingest.py -l upstage_layout -d` (download the additional documents from S3 -> convert the additional documents to text -> insert into Pinecone)
- `streamlit run chat.py`