Kor-xiliuxiliu is a text extraction tool built on LangChain and Kor. It extracts text from various sources and returns clean, structured output.
- Text extraction from different file formats, including PDF, Word documents, and HTML.
- Support for extracting text written in multiple languages.
- Automatic language detection for efficient processing.
- Clean and structured output, removing unnecessary formatting and preserving the original document structure.
To install Kor-xiliuxiliu, follow these steps:
- Clone the repository:
git clone https://github.com/dcuplover/kor-xiliuxiliu.git
- Navigate to the project directory:
cd kor-xiliuxiliu
- Install the required dependencies:
pip install -r requirements.txt
- Edit the config.py file and fill in the required parameters.
- Run the command:
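python get_data.py --model_name "model name; see config.py for reference" \
--schema_name "schema to use; schema files are stored in the schemas folder" \
--data_type "url (the data is fetched from a URL)" \
--data "URL address" \
--chunk_size "maximum number of characters per chunk when splitting the text"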
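To make the flags above easier to read at a glance, here is a minimal sketch of how such a command line could be parsed with Python's standard `argparse` module. It is not the code of `get_data.py`: the function name `build_parser`, the choice of which flags are required, and the default values are all assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (not the project's code)."""
    parser = argparse.ArgumentParser(prog="get_data.py")
    parser.add_argument("--model_name", required=True,
                        help="model name; see config.py for reference")
    parser.add_argument("--schema_name", required=True,
                        help="schema to use, stored in the schemas folder")
    # Assumption: "url" is the default, since it is the only data type documented.
    parser.add_argument("--data_type", default="url",
                        help="how the data is obtained")
    parser.add_argument("--data", required=True,
                        help="URL address to fetch")
    # Assumption: the chunk size is an integer; the default of 500 is illustrative.
    parser.add_argument("--chunk_size", type=int, default=500,
                        help="maximum number of characters per chunk")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
example_args = build_parser().parse_args([
    "--model_name", "gpt-3.5-turbo",
    "--schema_name", "PersonDialogue",
    "--data", "http://www.gudianmingzhu.com/guji/hongloumeng/11369.html",
    "--chunk_size", "500",
])
print(example_args.data_type, example_args.chunk_size)
```

Declaring `--chunk_size` with `type=int` means a non-numeric value is rejected when the arguments are parsed, before any document is fetched.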
DEMO
python get_data.py --model_name "gpt-3.5-turbo" \
--schema_name "PersonDialogue" \
--data_type "url" \
--data "http://www.gudianmingzhu.com/guji/hongloumeng/11369.html" \
--chunk_size 500
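The `--chunk_size` value caps how many characters each piece of text may contain when the document is split. How the project actually performs the split is not shown here, so the following is only a minimal sketch that cuts at fixed character offsets using the standard library. The function name `split_text` is hypothetical, and a real splitter would probably prefer to break at sentence or paragraph boundaries.

```python
def split_text(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters.

    Hypothetical helper for illustration only; it cuts at fixed offsets
    and ignores sentence and paragraph boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# A 1200-character document with chunk_size 500 gives three chunks.
chunks = split_text("a" * 1200, 500)
print([len(chunk) for chunk in chunks])  # [500, 500, 200]
```

A smaller chunk size keeps each piece short but produces more pieces to process for the same document.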
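Different models give different extraction results. I tested the extraction results of several models, which can be used as a reference.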
- Support more document formats
- API
- Web UI
- Improve how documents from multiple URLs are fetched