Kor-xiliuxiliu

Kor-xiliuxiliu is a text extraction tool built on LangChain and Kor. It extracts text from various sources and returns clean, structured output.

Features

Text extraction from different file formats, including PDF, Word documents, and HTML.
Support for multiple languages, allowing extraction of text in different languages.
Automatic language detection for efficient processing.
Clean and structured output, removing unnecessary formatting and preserving the original document structure.

Installation

To install Kor-xiliuxiliu, follow these steps:

Clone the repository: git clone https://github.com/dcuplover/kor-xiliuxiliu.git
Navigate to the project directory: cd kor-xiliuxiliu
Install the required dependencies: pip install -r requirements.txt

Usage

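Edit config.py and fill in the required parameters.
Run the command:

python get_data.py --model_name "model name; see config.py for the options" \
--schema_name "schema to use; schema files live in the schemas folder" \
--data_type "url (the data is fetched from a URL)" \
--data "URL of the page" \
--chunk_size "maximum number of characters per chunk when splitting the text"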
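
For readers who want to see how the five flags fit together, here is a minimal sketch of a parser that accepts the same options. It is an illustration only: the function name build_parser, the help strings, and the choice to make every flag required are assumptions, and the actual get_data.py may parse its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the Usage command."""
    parser = argparse.ArgumentParser(
        description="Extract structured data from a document (illustrative sketch)."
    )
    # Flag names come from the README; everything else here is assumed.
    parser.add_argument("--model_name", required=True,
                        help="Model to use; the available names are listed in config.py.")
    parser.add_argument("--schema_name", required=True,
                        help="Schema to apply; schema files live in the schemas folder.")
    parser.add_argument("--data_type", required=True,
                        help="Kind of input, e.g. 'url' when the data is fetched from a URL.")
    parser.add_argument("--data", required=True,
                        help="The input itself, e.g. the URL of the page.")
    parser.add_argument("--chunk_size", type=int, required=True,
                        help="Maximum number of characters per chunk when splitting text.")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args([
    "--model_name", "gpt-3.5-turbo",
    "--schema_name", "PersonDialogue",
    "--data_type", "url",
    "--data", "http://www.gudianmingzhu.com/guji/hongloumeng/11369.html",
    "--chunk_size", "500",
])
print(args.model_name, args.schema_name, args.chunk_size)
```

Declaring --chunk_size with type=int means a non-numeric value is rejected at the command line rather than later during splitting.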

DEMO

python get_data.py --model_name "gpt-3.5-turbo" \
--schema_name "PersonDialogue" \
--data_type "url" \
--data "http://www.gudianmingzhu.com/guji/hongloumeng/11369.html" \
--chunk_size 500
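
To make the effect of --chunk_size concrete, the sketch below splits a string into pieces of at most chunk_size characters. It is a plain fixed-width split written for illustration; the helper name split_text is hypothetical, and the project itself may rely on a LangChain text splitter that respects separators or overlaps chunks.

```python
from typing import List


def split_text(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    # Step through the text in strides of chunk_size; the last chunk may be shorter.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# A 1200-character document with chunk_size 500 yields three chunks.
chunks = split_text("a" * 1200, 500)
print([len(chunk) for chunk in chunks])  # [500, 500, 200]
```

A smaller chunk size means more chunks and therefore more model calls per document, which matters when each call is billed.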

Using the API costs money, so please think carefully before you use it.

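Extraction quality varies from model to model. I tested several models; see the following write-up for reference:

Testing the ability of different large language models to extract structured information from unstructured information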

TodoList

Support more document formats
API
Web UI
Improve how documents from multiple URLs are fetched

