diff --git a/pipelines/examples/unsupervised-question-answering/README.md b/pipelines/examples/unsupervised-question-answering/README.md
new file mode 100644
index 000000000000..71f801016603
--- /dev/null
+++ b/pipelines/examples/unsupervised-question-answering/README.md
@@ -0,0 +1,224 @@
+# 无监督智能检索问答系统
+
+## 1. 场景概述
+
+智能问答(QA)是获取信息和知识的更直接、更高效的方式之一。传统的信息检索系统只能找到相关的文档,而问答系统能够直接找到精准的答案,极大地节省了人们获取信息的时间。问答系统中最关键的挑战之一是标注数据的稀缺性:为目标领域获取问答对或常见问答对(FAQ)的成本很高,需要消耗大量的人力和时间。这一制约导致检索式问答系统落地困难,解决此问题的一种方法是依据上下文或大量非结构化文本自动生成QA问答对。
+
+本项目,即无监督智能检索问答(基于问答对自动生成的智能检索式问答),基于PaddleNLP问题生成、UIE、检索式问答,支持以非结构化文本作为上下文自动生成QA问答对,生成的问答对语料可以通过无监督的方式构建检索式问答系统。
+
+若开发者已有FAQ语料,请参考FAQ检索式问答。
+
+## 2. 产品功能介绍
+
+本项目提供了低成本搭建问答对自动生成智能检索问答系统的能力。开发者只需要提供非结构化的纯文本,就可以使用本项目预置的问答对生成模块生成大量的问答对,并基于此快速搭建一个针对自己业务的检索问答系统,同时可以提供Web可视化产品服务。Web可视化产品服务支持问答检索、在线问答对生成、在线文件上传和解析、在线索引库更新等功能,用户也可根据需要自行调整。
+
+**【注意】** 以下教程使用预置模型,如果用户想训练并接入自己训练的模型,可以参考[intelligent-qa](paddle/paddlenlp/unsupervised_qa_pipelines/PaddleNLP/applications/question_answering)。
+
+### 2.1 系统特色
++ 低成本
+    + 可通过自动生成的方式快速合成大量QA语料,大大降低人力成本
+    + 可控性好,合成语料和语义检索解耦,可以人工筛查和删除合成的问答对,也可以添加人工标注的问答对
++ 端到端
+    + 提供包括问答语料生成、索引库构建、模型服务部署、WebUI可视化在内的一整套端到端智能问答系统能力
+    + 支持Txt、Word、PDF、Image多源数据上传,同时支持离线、在线QA语料生成和ANN数据库更新
++ 效果好
+    + 可通过自动问答对生成提升问答对语料覆盖度,缓解中长尾问题覆盖较少的问题
+    + 依托百度领先的NLP技术,预置效果领先的深度学习模型
+
+## 3. 快速开始:搭建无监督智能检索问答系统
+
+以下是针对Mac和Linux的搭建流程。
+
+### 3.1 运行环境和安装说明
+
+本项目在以下运行环境中测试通过,用户也可以在自己的GPU硬件环境中运行,详细说明如下:
+
+a. 软件环境:
+- python >= 3.7.0
+- paddlenlp >= 2.4.3
+- paddlepaddle-gpu >= 2.3
+- CUDA Version: 10.2
+- NVIDIA Driver Version: 440.64.00
+- Ubuntu 16.04.6 LTS (Docker)
+
+b. 硬件环境:
+
+- NVIDIA Tesla V100 16GB x4卡
+- Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
+
+c. 依赖安装:
+首先需要安装PaddlePaddle,安装方式请参考[官方安装文档](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html)。然后安装paddle-pipelines依赖,使用pip安装命令如下:
+```bash
+# pip一键安装
+pip install --upgrade paddle-pipelines -i https://pypi.tuna.tsinghua.edu.cn/simple
+```
+或者进入pipelines目录下,从源码安装:
+```bash
+# 源码安装
+cd PaddleNLP/pipelines/
+pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
+python setup.py install
+```
+**【注意】** 以下所有流程都只需要在`pipelines`根目录下进行,不需要跳转目录。
+
+### 3.2 数据说明
+我们以提供的纯文本文件[source_file.txt](https://paddlenlp.bj.bcebos.com/applications/unsupervised_qa/source_file.txt)为例:系统将文件中的每一行视为一个上下文,为其生成多个问答对,并基于生成的问答对构建索引库。该文件可直接下载放入`./data`目录,开发者也可以使用自己的文件。
+
+
+### 3.3 一键体验无监督智能检索问答系统
+
+开发者可以通过如下命令快速体验无监督智能检索问答系统的效果。系统将自动根据提供的纯文本文件构建问答对语料库,并基于生成的问答对语料库构造检索数据库。
+我们建议在GPU环境下运行本示例,运行速度较快,运行命令如下:
+```bash
+# GPU环境下运行示例
+# 设置1个空闲的GPU卡,此处假设0卡为空闲GPU
+export CUDA_VISIBLE_DEVICES=0
+python examples/unsupervised-question-answering/unsupervised_question_answering_example.py --device gpu --source_file data/source_file.txt --doc_dir data/my_data --index_name faiss_index --retriever_batch_size 16
+```
+关键参数释义如下:
+- `device`: 使用的设备,默认为'gpu',可选择['cpu', 'gpu']。
+- `source_file`: 源文件路径,指定该路径后将自动为其生成问答对并保存至`doc_dir`。
+- `doc_dir`: 生成的问答对语料保存的位置,系统将根据该位置自动构建检索数据库,默认为'data/my_data'。
+- `index_name`: FAISS的ANN索引名称,默认为'faiss_index'。
+- `retriever_batch_size`: 构建ANN索引时的批量大小,默认为16。
+
+如果只有CPU机器,可以通过`--device`参数指定cpu,运行耗时较长,运行命令如下:
+```bash
+# CPU环境下运行示例
+unset CUDA_VISIBLE_DEVICES
+python examples/unsupervised-question-answering/unsupervised_question_answering_example.py --device cpu --source_file data/source_file.txt --doc_dir data/my_data
+```
+**【注意】** `unsupervised_question_answering_example.py`中`DensePassageRetriever`和`ErnieRanker`的模型介绍请参考[API介绍](../../API.md)。
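+
+除了运行示例脚本,也可以在Python中直接组装问答对生成流水线。下面是一个最小示例(组件与参数取自本示例脚本,实际输出字段以运行结果为准):
+```python
+from pipelines.nodes import AnswerExtractor, QAFilter, QuestionGenerator
+from pipelines.pipelines import QAGenerationPipeline
+
+# 问答对生成流水线:答案抽取 -> 问题生成 -> 过滤
+# 各组件默认在GPU上运行,也可传入device='cpu'
+pipe = QAGenerationPipeline(
+    answer_extractor=AnswerExtractor(model='uie-base-answer-extractor', schema=['答案'], position_prob=0.01),
+    question_generator=QuestionGenerator(model='unimo-text-1.0-question-generation', num_return_sequences=2),
+    qa_filter=QAFilter(model='uie-base-qa-filter', schema=['答案'], position_prob=0.1),
+)
+
+# meta中的每个元素是一个上下文段落
+prediction = pipe.run(meta=["世界上最早的电影院是美国洛杉矶的“电气剧场”,建于1902年。"],
+                      params={"QAFilter": {"is_filter": True}})
+for pair in prediction['filtered_cqa_triples']:
+    print(pair['synthetic_question'], pair['synthetic_answer'])
+```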
+
+### 3.4 构建Web可视化无监督智能检索问答系统
+
+整个Web可视化无监督智能检索问答系统主要包含3大组件:
+1. 基于ElasticSearch的ANN服务,用于搭建在线索引库
+2. 基于RestAPI构建的模型后端服务
+3. 基于Streamlit构建的前端WebUI
+
+接下来我们依次搭建这些服务,得到可视化、可交互的无监督智能检索问答系统。
+
+
+#### 3.4.1 离线生成问答对语料
+执行以下命令将自动根据提供的纯文本文件离线构建问答对语料库:
+```bash
+# GPU环境下运行示例
+# 设置1个空闲的GPU卡,此处假设0卡为空闲GPU
+export CUDA_VISIBLE_DEVICES=0
+python examples/unsupervised-question-answering/offline_question_answer_pairs_generation.py --device gpu --source_file data/source_file.txt --doc_dir data/my_data
+```
+关键参数释义如下:
+- `device`: 使用的设备,默认为'gpu',可选择['cpu', 'gpu']。
+- `source_file`: 源文件路径,指定该路径后将自动为其生成问答对并保存至`doc_dir`。
+- `doc_dir`: 生成的问答对语料保存的位置,系统将根据该位置自动构建检索数据库,默认为'data/my_data'。
+
+
+如果只有CPU机器,可以通过`--device`参数指定cpu,运行耗时较长,运行命令如下:
+```bash
+# CPU环境下运行示例
+unset CUDA_VISIBLE_DEVICES
+python examples/unsupervised-question-answering/offline_question_answer_pairs_generation.py --device cpu --source_file data/source_file.txt --doc_dir data/my_data
+```
+
+#### 3.4.2 启动ElasticSearch ANN服务
+1. 参考官方文档下载安装 [elasticsearch-8.3.2](https://www.elastic.co/cn/downloads/elasticsearch) 并解压。
+2. 启动ElasticSearch服务。
+
+首先修改`config/elasticsearch.yml`的配置:
+```
+xpack.security.enabled: false
+```
+然后启动elasticsearch:
+```bash
+./bin/elasticsearch
+```
+3. 检查确保ElasticSearch服务启动成功。
+
+执行以下命令,如果ElasticSearch里面没有数据,结果会输出为空,即{ }。
+```bash
+curl http://localhost:9200/_aliases?pretty=true
+```
+
+备注:ElasticSearch服务默认开启端口为 9200。
+
+#### 3.4.3 ANN索引库构建
+执行以下命令建立ANN索引库:
+```
+python utils/offline_ann.py --index_name my_data \
+    --doc_dir data/my_data \
+    --split_answers \
+    --delete_index
+```
+参数含义说明:
+* `index_name`: 索引的名称
+* `doc_dir`: txt文本数据的路径
+* `host`: ElasticSearch的IP地址
+* `port`: ElasticSearch的端口号
+* `split_answers`: 是否将每一行的数据切分为query和answer两部分
+* `delete_index`: 是否删除现有的索引和数据,用于清空ElasticSearch中的数据,默认为false
+
+执行以下命令打印几条数据,检查ANN索引库是否构建成功:
+```
+curl http://localhost:9200/my_data/_search
+```
+如果索引库正常,会输出类似如下的结果:
+```
+{"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":5,"relation":"eq"},"max_score":1.0,"hits":[{"_index":"my_data","_id":"fb308738f2767626d72282f5a35402e5","_score":1.0,"_source":{"content":......
+```
+
+#### 3.4.4 启动RestAPI模型后端
+```bash
+export CUDA_VISIBLE_DEVICES=0
+# 指定无监督智能检索问答系统的Yaml配置文件
+export PIPELINE_YAML_PATH=rest_api/pipeline/unsupervised_qa.yaml
+# 使用端口号8896启动模型服务
+python rest_api/application.py 8896
+```
+Linux用户推荐采用Shell脚本来启动服务:
+
+```bash
+sh examples/unsupervised-question-answering/run_unsupervised_question_answering_server.sh
+```
+启动后可以使用curl命令验证是否成功运行:
+```bash
+curl -X POST -k http://localhost:8896/query -H 'Content-Type: application/json' -d '{"query": "企业如何办理养老保险?","params": {"Retriever": {"top_k": 5}, "Ranker":{"top_k": 5}}}'
+```
+如果成功运行,则会返回结果。
+
+#### 3.4.5 启动Streamlit WebUI前端
+```bash
+# 配置模型服务地址
+export API_ENDPOINT=http://127.0.0.1:8896
+# 在指定端口8508启动WebUI
+python -m streamlit run ui/webapp_unsupervised_question_answering.py --server.port 8508
+```
+Linux用户推荐采用Shell脚本来启动服务:
+
+```bash
+sh examples/unsupervised-question-answering/run_unsupervised_question_answering_web.sh
+```
+
+到这里您就可以打开浏览器访问地址 http://127.0.0.1:8508 体验无监督智能检索问答系统服务了。
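+
+除了WebUI,也可以直接调用模型后端注册的在线问答对生成接口`/query_qa_pairs`(见`rest_api/controller/search.py`)进行验证,请求体中的`meta`为上下文列表,以下示例文本仅供参考:
+```bash
+curl -X POST http://localhost:8896/query_qa_pairs -H 'Content-Type: application/json' -d '{"meta": ["世界上最早的电影院是美国洛杉矶的“电气剧场”,建于1902年。"], "params": {}}'
+```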
+
+
+**【注意】** 如果安装遇见问题可以查看[FAQ文档](../../FAQ.md)。
+
+## Reference
+[1] Y. Sun et al., “[ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation](https://arxiv.org/pdf/2107.02137.pdf),” arXiv:2107.02137 [cs], Jul. 2021. [Online]. Available: http://arxiv.org/abs/2107.02137
+
+[2] Y. Qu et al., “[RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2010.08191),” arXiv:2010.08191 [cs], May 2021. [Online]. Available: http://arxiv.org/abs/2010.08191
+
+[3] H. Tang, H. Li, J. Liu, Y. Hong, H. Wu, and H. Wang, “[DuReader_robust: A Chinese Dataset Towards Evaluating Robustness and Generalization of Machine Reading Comprehension in Real-World Applications](https://arxiv.org/pdf/2004.11142.pdf),” arXiv:2004.11142 [cs], Jul. 2021. [Online]. Available: http://arxiv.org/abs/2004.11142
+
+[4] W. Li et al., “[UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning](https://arxiv.org/abs/2012.15409),” arXiv:2012.15409 [cs], 2020. [Online]. Available: http://arxiv.org/abs/2012.15409
+
+## Acknowledge
+
+我们借鉴了 Deepset.ai [Haystack](https://github.com/deepset-ai/haystack) 优秀的框架设计,在此对 [Haystack](https://github.com/deepset-ai/haystack) 作者及其开源社区表示感谢。
+
+We learn from the excellent framework design of Deepset.ai [Haystack](https://github.com/deepset-ai/haystack), and we would like to express our thanks to the authors of Haystack and their open source community.
diff --git a/pipelines/examples/unsupervised-question-answering/offline_question_answer_pairs_generation.py b/pipelines/examples/unsupervised-question-answering/offline_question_answer_pairs_generation.py
new file mode 100644
index 000000000000..1c604799d1ab
--- /dev/null
+++ b/pipelines/examples/unsupervised-question-answering/offline_question_answer_pairs_generation.py
@@ -0,0 +1,80 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
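+
+# Offline question-answer pair generation: read one context per line from
+# --source_file, synthesize QA pairs, and write them to --doc_dir.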
+
+import argparse
+import os
+
+from pipelines.nodes import AnswerExtractor, QAFilter, QuestionGenerator
+from pipelines.pipelines import QAGenerationPipeline
+
+# yapf: disable
+parser = argparse.ArgumentParser()
+parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to run dense_qa system, defaults to gpu.")
+parser.add_argument("--doc_dir", default="data/my_data", type=str, help="The question-answer pairs file to be loaded when building ANN index.")
+parser.add_argument("--source_file", default=None, type=str, help="The source raw texts file to be loaded when creating question-answer pairs.")
+
+args = parser.parse_args()
+# yapf: enable
+
+
+def offline_qa_generation():
+    # Extract candidate answer spans from each context with UIE.
+    answer_extractor = AnswerExtractor(
+        model='uie-base-answer-extractor',
+        device=args.device,
+        schema=['答案'],
+        position_prob=0.01,
+    )
+
+    # Generate a question for every extracted answer span.
+    question_generator = QuestionGenerator(
+        model='unimo-text-1.0-question-generator',
+        device=args.device,
+    )
+
+    # Filter out pairs whose answer cannot be recovered from the context.
+    qa_filter = QAFilter(
+        model='uie-base-qa-filter',
+        device=args.device,
+        schema=['答案'],
+        position_prob=0.1,
+    )
+
+    pipe = QAGenerationPipeline(answer_extractor=answer_extractor,
+                                question_generator=question_generator,
+                                qa_filter=qa_filter)
+    pipeline_params = {"QAFilter": {"is_filter": True}}
+
+    if args.source_file:
+        meta = []
+        with open(args.source_file, 'r', encoding='utf-8') as rf:
+            for line in rf:
+                meta.append(line.strip())
+        prediction = pipe.run(meta=meta, params=pipeline_params)
+        prediction = prediction['filtered_cqa_triples']
+        if not os.path.exists(args.doc_dir):
+            os.makedirs(args.doc_dir)
+        # Save the generated pairs as "question\tanswer" lines.
+        with open(os.path.join(args.doc_dir, 'generated_qa_pairs.txt'),
+                  'w',
+                  encoding='utf-8') as wf:
+            for pair in prediction:
+                wf.write(pair['synthetic_question'].strip() + '\t' +
+                         pair['synthetic_answer'].strip() + '\n')
+
+
+if __name__ == "__main__":
+    offline_qa_generation()
diff --git a/pipelines/examples/unsupervised-question-answering/run_unsupervised_question_answering_server.sh b/pipelines/examples/unsupervised-question-answering/run_unsupervised_question_answering_server.sh
new file mode 100644
index 000000000000..07fe42ebc70e
--- /dev/null
+++ b/pipelines/examples/unsupervised-question-answering/run_unsupervised_question_answering_server.sh
@@ -0,0 +1,22 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
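+
+# Start the REST backend of the unsupervised QA system (run from the
+# pipelines root directory).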
+
+# 环境变量设置
+export CUDA_VISIBLE_DEVICES=0
+
+# 指定无监督智能检索问答系统的Yaml配置文件
+export PIPELINE_YAML_PATH=rest_api/pipeline/unsupervised_qa.yaml
+
+# 使用端口号 8896 启动模型服务
+python rest_api/application.py 8896
\ No newline at end of file
diff --git a/pipelines/examples/unsupervised-question-answering/run_unsupervised_question_answering_web.sh b/pipelines/examples/unsupervised-question-answering/run_unsupervised_question_answering_web.sh
new file mode 100644
index 000000000000..da69ca06d019
--- /dev/null
+++ b/pipelines/examples/unsupervised-question-answering/run_unsupervised_question_answering_web.sh
@@ -0,0 +1,23 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# 环境变量设置
+unset http_proxy && unset https_proxy
+export CUDA_VISIBLE_DEVICES=0
+
+# 配置模型服务地址
+export API_ENDPOINT=http://127.0.0.1:8896
+
+# 在指定端口8508启动WebUI
+python -m streamlit run ui/webapp_unsupervised_question_answering.py --server.port 8508
\ No newline at end of file
diff --git a/pipelines/examples/unsupervised-question-answering/unsupervised_question_answering_example.py b/pipelines/examples/unsupervised-question-answering/unsupervised_question_answering_example.py
new file mode 100644
index 000000000000..890e61c4c9a0
--- /dev/null
+++ b/pipelines/examples/unsupervised-question-answering/unsupervised_question_answering_example.py
@@ -0,0 +1,158 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
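+
+# End-to-end example: generate question-answer pairs from raw text, then
+# build a FAISS index over them and run a semantic-search query.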
+
+import argparse
+import os
+from pprint import pprint
+
+from pipelines.nodes import AnswerExtractor, QAFilter, QuestionGenerator
+from pipelines.nodes import ErnieRanker, DensePassageRetriever
+from pipelines.document_stores import FAISSDocumentStore
+from pipelines.utils import convert_files_to_dicts, print_documents
+from pipelines.pipelines import QAGenerationPipeline, SemanticSearchPipeline
+
+# yapf: disable
+parser = argparse.ArgumentParser()
+parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to run dense_qa system, defaults to gpu.")
+parser.add_argument("--index_name", default='faiss_index', type=str, help="The ann index name of FAISS.")
+parser.add_argument("--max_seq_len_query", default=64, type=int, help="The maximum total length of query after tokenization.")
+parser.add_argument("--max_seq_len_passage", default=256, type=int, help="The maximum total length of passage after tokenization.")
+parser.add_argument("--retriever_batch_size", default=16, type=int, help="The batch size of retriever to extract passage embedding for building ANN index.")
+parser.add_argument("--doc_dir", default="data/my_data", type=str, help="The question-answer pairs file to be loaded when building ANN index.")
+parser.add_argument("--source_file", default=None, type=str, help="The source raw texts file to be loaded when creating question-answer pairs.")
+
+args = parser.parse_args()
+# yapf: enable
+
+
+def dense_faq_pipeline():
+    use_gpu = args.device == 'gpu'
+    faiss_document_store = "faiss_document_store.db"
+    if os.path.exists(args.index_name) and os.path.exists(faiss_document_store):
+        # Connect to the existing FAISS index.
+        document_store = FAISSDocumentStore.load(args.index_name)
+        retriever = DensePassageRetriever(
+            document_store=document_store,
+            query_embedding_model="rocketqa-zh-dureader-query-encoder",
+            passage_embedding_model="rocketqa-zh-dureader-query-encoder",
+            max_seq_len_query=args.max_seq_len_query,
+            max_seq_len_passage=args.max_seq_len_passage,
+            batch_size=args.retriever_batch_size,
+            use_gpu=use_gpu,
+            embed_title=False,
+        )
+    else:
+        dicts = convert_files_to_dicts(dir_path=args.doc_dir,
+                                       split_paragraphs=True,
+                                       split_answers=True,
+                                       encoding='utf-8')
+
+        if os.path.exists(args.index_name):
+            os.remove(args.index_name)
+        if os.path.exists(faiss_document_store):
+            os.remove(faiss_document_store)
+
+        document_store = FAISSDocumentStore(embedding_dim=768,
+                                            faiss_index_factory_str="Flat")
+        document_store.write_documents(dicts)
+
+        retriever = DensePassageRetriever(
+            document_store=document_store,
+            query_embedding_model="rocketqa-zh-dureader-query-encoder",
+            passage_embedding_model="rocketqa-zh-dureader-query-encoder",
+            max_seq_len_query=args.max_seq_len_query,
+            max_seq_len_passage=args.max_seq_len_passage,
+            batch_size=args.retriever_batch_size,
+            use_gpu=use_gpu,
+            embed_title=False,
+        )
+
+        # Update embeddings for all indexed documents.
+        document_store.update_embeddings(retriever)
+
+        # Save the index to disk for reuse.
+        document_store.save(args.index_name)
+
+    ### Ranker
+    ranker = ErnieRanker(
+        model_name_or_path="rocketqa-zh-dureader-cross-encoder",
+        use_gpu=use_gpu)
+
+    pipe = SemanticSearchPipeline(retriever, ranker)
+
+    pipeline_params = {"Retriever": {"top_k": 50}, "Ranker": {"top_k": 1}}
+    prediction = pipe.run(query="世界上最早的地雷发明者是谁?", params=pipeline_params)
+
+    print_documents(prediction, print_name=False, print_meta=True)
+
+
+def qa_generation_pipeline():
+    answer_extractor = 
AnswerExtractor( + model='uie-base-answer-extractor', + device=args.device, + schema=['答案'], + max_answer_candidates=3, + position_prob=0.01, + ) + + question_generator = QuestionGenerator( + model='unimo-text-1.0-question-generation', + device=args.device, + num_return_sequences=2, + ) + + qa_filter = QAFilter( + model='uie-base-qa-filter', + device=args.device, + schema=['答案'], + position_prob=0.1, + ) + + pipe = QAGenerationPipeline(answer_extractor=answer_extractor, + question_generator=question_generator, + qa_filter=qa_filter) + pipeline_params = {"QAFilter": {"is_filter": True}} + + # list example + meta = [ + "世界上最早的电影院是美国洛杉矶的“电气剧场”,建于1902年。", + "以脸书为例,2020年时,54%的成年人表示,他们从该平台获取新闻。而现在,这个数字下降到了44%。与此同时,YouTube在过去几年里一直保持平稳,约有三分之一的用户在该平台上获取新闻。" + ] + prediction = pipe.run(meta=meta, params=pipeline_params) + prediction = prediction['filtered_cqa_triples'] + pprint(prediction) + + # file example + if args.source_file: + meta = [] + with open(args.source_file, 'r', encoding='utf-8') as rf: + for line in rf: + meta.append(line.strip()) + prediction = pipe.run(meta=meta, params=pipeline_params) + prediction = prediction['filtered_cqa_triples'] + if not os.path.exists(args.doc_dir): + os.makedirs(args.doc_dir) + with open(os.path.join(args.doc_dir, 'generated_qa_pairs.txt'), + 'w', + encoding='utf-8') as wf: + for pair in prediction: + wf.write(pair['synthetic_question'].strip() + '\t' + + pair['synthetic_answer'].strip() + '\n') + + +if __name__ == "__main__": + qa_generation_pipeline() + dense_faq_pipeline() diff --git a/pipelines/pipelines/__init__.py b/pipelines/pipelines/__init__.py index 4e4edb9dba82..a704e7362caa 100644 --- a/pipelines/pipelines/__init__.py +++ b/pipelines/pipelines/__init__.py @@ -37,11 +37,9 @@ from pipelines.schema import Document, Answer, Label, Span from pipelines.nodes import BaseComponent from pipelines.pipelines import Pipeline -from pipelines.pipelines.standard_pipelines import (BaseStandardPipeline, - ExtractiveQAPipeline, - SemanticSearchPipeline, - DocPipeline, - TextToImagePipeline) +from pipelines.pipelines.standard_pipelines import ( + BaseStandardPipeline, ExtractiveQAPipeline, SemanticSearchPipeline, + TextToImagePipeline, QAGenerationPipeline, DocPipeline) import pandas as pd pd.options.display.max_colwidth = 80 diff --git a/pipelines/pipelines/nodes/__init__.py b/pipelines/pipelines/nodes/__init__.py index 4b2a2e02aacb..ab753d078255 100644 --- a/pipelines/pipelines/nodes/__init__.py +++ b/pipelines/pipelines/nodes/__init__.py @@ -31,3 +31,5 @@ from pipelines.nodes.retriever import BaseRetriever, DensePassageRetriever from pipelines.nodes.document import DocOCRProcessor, DocPrompter from pipelines.nodes.text_to_image_generator import ErnieTextToImageGenerator +from pipelines.nodes.answer_extractor import AnswerExtractor, QAFilter, AnswerExtractorPreprocessor, QAFilterPostprocessor +from pipelines.nodes.question_generator import QuestionGenerator diff --git a/pipelines/pipelines/nodes/answer_extractor/__init__.py b/pipelines/pipelines/nodes/answer_extractor/__init__.py new file mode 100644 index 000000000000..87089b190666 --- /dev/null +++ b/pipelines/pipelines/nodes/answer_extractor/__init__.py @@ -0,0 +1,17 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from pipelines.nodes.answer_extractor.answer_extractor import AnswerExtractor +from pipelines.nodes.answer_extractor.answer_extractor_preprocessor import AnswerExtractorPreprocessor +from pipelines.nodes.answer_extractor.qa_filter import QAFilter +from pipelines.nodes.answer_extractor.qa_filter_postprocessor import QAFilterPostprocessor diff --git a/pipelines/pipelines/nodes/answer_extractor/answer_extractor.py b/pipelines/pipelines/nodes/answer_extractor/answer_extractor.py new file mode 100644 index 000000000000..01b9affe8f29 --- /dev/null +++ b/pipelines/pipelines/nodes/answer_extractor/answer_extractor.py @@ -0,0 +1,192 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2021 deepset GmbH. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +import json +import sys +import argparse +import re +from tqdm import tqdm + +import paddle +from paddlenlp import Taskflow +from pipelines.nodes.base import BaseComponent +from paddlenlp.utils.env import PPNLP_HOME +from paddlenlp.taskflow.utils import download_file +from paddle.dataset.common import md5file + + +class AnswerExtractor(BaseComponent): + """ + Answer Extractor based on Universal Information Extraction. 
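+    It wraps a UIE information-extraction Taskflow and returns candidate
+    answer spans extracted from each input paragraph.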
+ """ + resource_files_names = { + "model_state": "model_state.pdparams", + "model_config": "model_config.json", + "vocab_file": "vocab.txt", + "special_tokens_map": "special_tokens_map.json", + "tokenizer_config": "tokenizer_config.json" + } + + resource_files_urls = { + "uie-base-answer-extractor": { + "model_state": [ + "https://bj.bcebos.com/paddlenlp/pipelines/answer_generator/uie-base-answer-extractor/uie-base-answer-extractor-v1/model_state.pdparams", + "c8619f631a0c20434199840d34bb8b8c" + ], + "model_config": [ + "https://bj.bcebos.com/paddlenlp/pipelines/answer_generator/uie-base-answer-extractor/uie-base-answer-extractor-v1/model_config.json", + "74f033ab874a1acddb3aec9b9c4d9cde" + ], + "vocab_file": [ + "https://bj.bcebos.com/paddlenlp/pipelines/answer_generator/uie-base-answer-extractor/uie-base-answer-extractor-v1/vocab.txt", + "1c1c1f4fd93c5bed3b4eebec4de976a8" + ], + "special_tokens_map": [ + "https://bj.bcebos.com/paddlenlp/pipelines/answer_generator/uie-base-answer-extractor/uie-base-answer-extractor-v1/special_tokens_map.json", + "8b3fb1023167bb4ab9d70708eb05f6ec" + ], + "tokenizer_config": [ + "https://bj.bcebos.com/paddlenlp/pipelines/answer_generator/uie-base-answer-extractor/uie-base-answer-extractor-v1/tokenizer_config.json", + "3e623b57084882fd73e17f544bdda47d" + ] + }, + } + + return_no_answers: bool + outgoing_edges = 1 + query_count = 0 + query_time = 0 + + def __init__(self, + model='uie-base-answer-extractor', + schema=['答案'], + task_path=None, + device="gpu", + batch_size=64, + position_prob=0.01, + max_answer_candidates=5): + paddle.set_device(device) + self.model = model + self._from_taskflow = False + self._custom_model = False + if task_path: + self._task_path = task_path + self._custom_model = True + else: + if model in ["uie-base"]: + self._task_path = None + self._from_taskflow = True + else: + self._task_path = os.path.join( + PPNLP_HOME, "pipelines", "unsupervised_question_answering", + self.model) + self._check_task_files() + self.batch_size = batch_size + self.max_answer_candidates = max_answer_candidates + self.schema = schema + self.answer_generator = Taskflow( + "information_extraction", + model=self.model if self._from_taskflow else "uie-base", + schema=schema, + task_path=self._task_path, + batch_size=batch_size, + position_prob=position_prob) + + def _check_task_files(self): + """ + Check files required by the task. 
+ """ + for file_id, file_name in self.resource_files_names.items(): + path = os.path.join(self._task_path, file_name) + url = self.resource_files_urls[self.model][file_id][0] + md5 = self.resource_files_urls[self.model][file_id][1] + + downloaded = True + if not os.path.exists(path): + downloaded = False + else: + if not self._custom_model: + if os.path.exists(path): + # Check whether the file is updated + if not md5file(path) == md5: + downloaded = False + if file_id == "model_state": + self._param_updated = True + else: + downloaded = False + if not downloaded: + download_file(self._task_path, file_name, url, md5) + + def answer_generation_from_paragraphs(self, + paragraphs, + batch_size=16, + model=None, + max_answer_candidates=5, + schema=None, + wf=None): + """Generate answer from given paragraphs.""" + result = [] + buffer = [] + i = 0 + len_paragraphs = len(paragraphs) + for paragraph_tobe in tqdm(paragraphs): + buffer.append(paragraph_tobe) + if len(buffer) == batch_size or (i + 1) == len_paragraphs: + predicts = model(buffer) + paragraph_list = buffer + buffer = [] + for predict_dict, paragraph in zip(predicts, paragraph_list): + answers = [] + probabilitys = [] + for prompt in schema: + if prompt in predict_dict: + answer_dicts = predict_dict[prompt] + answers += [ + answer_dict['text'] + for answer_dict in answer_dicts + ] + probabilitys += [ + answer_dict['probability'] + for answer_dict in answer_dicts + ] + else: + answers += [] + probabilitys += [] + candidates = sorted(list( + set([(a, p) for a, p in zip(answers, probabilitys)])), + key=lambda x: -x[1]) + if len(candidates) > max_answer_candidates: + candidates = candidates[:max_answer_candidates] + outdict = { + 'context': paragraph, + 'answer_candidates': candidates, + } + if wf: + wf.write(json.dumps(outdict, ensure_ascii=False) + "\n") + result.append(outdict) + i += 1 + return result + + def run(self, meta): + print('createing synthetic answers...') + synthetic_context_answer_pairs = self.answer_generation_from_paragraphs( + meta, + batch_size=self.batch_size, + model=self.answer_generator, + max_answer_candidates=self.max_answer_candidates, + schema=self.schema, + wf=None) + results = {"ca_pairs": synthetic_context_answer_pairs} + return results, "output_1" diff --git a/pipelines/pipelines/nodes/answer_extractor/answer_extractor_preprocessor.py b/pipelines/pipelines/nodes/answer_extractor/answer_extractor_preprocessor.py new file mode 100644 index 000000000000..b5cec953a86e --- /dev/null +++ b/pipelines/pipelines/nodes/answer_extractor/answer_extractor_preprocessor.py @@ -0,0 +1,34 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2021 deepset GmbH. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from pipelines.nodes.base import BaseComponent +import paddle + + +class AnswerExtractorPreprocessor(BaseComponent): + """ + Answer Extractor Preprocessor used to preprocess the result of textconvert. 
+ """ + return_no_answers: bool + outgoing_edges = 1 + query_count = 0 + query_time = 0 + + def __init__(self, device="gpu"): + paddle.set_device(device) + + def run(self, documents): + results = {"meta": [document['content'] for document in documents]} + return results, "output_1" diff --git a/pipelines/pipelines/nodes/answer_extractor/qa_filter.py b/pipelines/pipelines/nodes/answer_extractor/qa_filter.py new file mode 100644 index 000000000000..6a74dabc2d70 --- /dev/null +++ b/pipelines/pipelines/nodes/answer_extractor/qa_filter.py @@ -0,0 +1,227 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2021 deepset GmbH. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +import json +import sys +import argparse +import re +from tqdm import tqdm + +import paddle +from paddlenlp import Taskflow +from pipelines.nodes.base import BaseComponent +from paddlenlp.utils.env import PPNLP_HOME +from paddlenlp.taskflow.utils import download_file +from paddle.dataset.common import md5file + + +class QAFilter(BaseComponent): + """ + Question Answer Pairs Filter based on Universal Information Extraction. + """ + resource_files_names = { + "model_state": "model_state.pdparams", + "model_config": "model_config.json", + "vocab_file": "vocab.txt", + "special_tokens_map": "special_tokens_map.json", + "tokenizer_config": "tokenizer_config.json" + } + + resource_files_urls = { + "uie-base-qa-filter": { + "model_state": [ + "https://bj.bcebos.com/paddlenlp/pipelines/qa_filter/uie-base-qa-filter-v1/model_state.pdparams", + "feb2d076fa2f78a0d3c3e3d20e9d5dc5" + ], + "model_config": [ + "https://bj.bcebos.com/paddlenlp/pipelines/qa_filter/uie-base-qa-filter-v1/model_config.json", + "74f033ab874a1acddb3aec9b9c4d9cde" + ], + "vocab_file": [ + "https://bj.bcebos.com/paddlenlp/pipelines/qa_filter/uie-base-qa-filter-v1/vocab.txt", + "1c1c1f4fd93c5bed3b4eebec4de976a8" + ], + "special_tokens_map": [ + "https://bj.bcebos.com/paddlenlp/pipelines/qa_filter/uie-base-qa-filter-v1/special_tokens_map.json", + "8b3fb1023167bb4ab9d70708eb05f6ec" + ], + "tokenizer_config": [ + "https://bj.bcebos.com/paddlenlp/pipelines/qa_filter/uie-base-qa-filter-v1/tokenizer_config.json", + "3e623b57084882fd73e17f544bdda47d" + ] + }, + } + + return_no_answers: bool + outgoing_edges = 1 + query_count = 0 + query_time = 0 + + def __init__( + self, + model='uie-base-qa-filter', + schema=['答案'], + task_path=None, + device="gpu", + batch_size=64, + position_prob=0.1, + ): + paddle.set_device(device) + self.model = model + self._custom_model = False + self._from_taskflow = False + if task_path: + self._task_path = task_path + self._custom_model = True + else: + if model in ["uie-base"]: + self._task_path = None + self._from_taskflow = True + else: + self._task_path = os.path.join( + PPNLP_HOME, "pipelines", "unsupervised_question_answering", + self.model) + self._check_task_files() + self.batch_size = batch_size + self.schema = schema + self.filtration_model = Taskflow( + "information_extraction", + 
model=self.model if self._from_taskflow else "uie-base", + schema=schema, + task_path=self._task_path, + batch_size=batch_size, + position_prob=position_prob) + + def _check_task_files(self): + """ + Check files required by the task. + """ + for file_id, file_name in self.resource_files_names.items(): + path = os.path.join(self._task_path, file_name) + url = self.resource_files_urls[self.model][file_id][0] + md5 = self.resource_files_urls[self.model][file_id][1] + + downloaded = True + if not os.path.exists(path): + downloaded = False + else: + if not self._custom_model: + if os.path.exists(path): + # Check whether the file is updated + if not md5file(path) == md5: + downloaded = False + if file_id == "model_state": + self._param_updated = True + else: + downloaded = False + if not downloaded: + download_file(self._task_path, file_name, url, md5) + + def filtration(self, + paragraphs, + batch_size=16, + model=None, + schema=None, + wf=None, + wf_debug=None): + result = [] + buffer = [] + valid_num, invalid_num = 0, 0 + i = 0 + len_paragraphs = len(paragraphs) + for paragraph_tobe in tqdm(paragraphs): + buffer.append(paragraph_tobe) + if len(buffer) == batch_size or (i + 1) == len_paragraphs: + model_inputs = [] + for d in buffer: + context = d['context'] + synthetic_question = d['synthetic_question'] + prefix = '问题:' + synthetic_question + '上下文:' + content = prefix + context + model_inputs.append(content) + predicts = model(model_inputs) + paragraph_list = buffer + buffer = [] + for predict_dict, paragraph in zip(predicts, paragraph_list): + context = paragraph['context'] + synthetic_question = paragraph['synthetic_question'] + synthetic_question_probability = paragraph[ + 'synthetic_question_probability'] + synthetic_answer = paragraph['synthetic_answer'] + synthetic_answer_probability = paragraph[ + 'synthetic_answer_probability'] + + answers = [] + probabilitys = [] + for prompt in schema: + if prompt in predict_dict: + answer_dicts = predict_dict[prompt] + answers += [ + answer_dict['text'] + for answer_dict in answer_dicts + ] + probabilitys += [ + answer_dict['probability'] + for answer_dict in answer_dicts + ] + else: + answers += [] + probabilitys += [] + candidates = [ + an for an, pro in sorted([( + a, p) for a, p in zip(answers, probabilitys)], + key=lambda x: -x[1]) + ] + out_dict = { + 'context': + context, + 'synthetic_answer': + synthetic_answer, + 'synthetic_answer_probability': + synthetic_answer_probability, + 'synthetic_question': + synthetic_question, + 'synthetic_question_probability': + synthetic_question_probability, + } + if synthetic_answer in candidates: + if wf: + wf.write( + json.dumps(out_dict, ensure_ascii=False) + "\n") + result.append(out_dict) + valid_num += 1 + else: + if wf_debug: + wf_debug.write( + json.dumps(out_dict, ensure_ascii=False) + "\n") + invalid_num += 1 + i += 1 + print('valid synthetic question-answer pairs number:', valid_num) + print('invalid sythetic question-answer pairs numbewr:', invalid_num) + return result + + def run(self, cqa_triples, is_filter=True): + if is_filter: + print('filtering synthetic question-answer pairs...') + filtered_cqa_triples = self.filtration(cqa_triples, + batch_size=self.batch_size, + model=self.filtration_model, + schema=self.schema) + print('filter synthetic question-answer pairs successfully!') + else: + filtered_cqa_triples = cqa_triples + + results = {"filtered_cqa_triples": filtered_cqa_triples} + return results, "output_1" diff --git 
a/pipelines/pipelines/nodes/answer_extractor/qa_filter_postprocessor.py b/pipelines/pipelines/nodes/answer_extractor/qa_filter_postprocessor.py new file mode 100644 index 000000000000..4177870c2fd5 --- /dev/null +++ b/pipelines/pipelines/nodes/answer_extractor/qa_filter_postprocessor.py @@ -0,0 +1,44 @@ +# coding:utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from pipelines.nodes.base import BaseComponent +import paddle + + +class QAFilterPostprocessor(BaseComponent): + """ + QA Filter Postprocessor used to postprocess the result of qa filter. + """ + + return_no_answers: bool + outgoing_edges = 1 + query_count = 0 + query_time = 0 + + def __init__(self, device="gpu"): + paddle.set_device(device) + + def run(self, filtered_cqa_triples): + results = { + "documents": [{ + 'content': triple['synthetic_question'], + 'content_type': 'text', + 'meta': { + 'answer': triple['synthetic_answer'], + '_split_id': 0 + } + } for triple in filtered_cqa_triples] + } + return results, "output_1" diff --git a/pipelines/pipelines/nodes/question_generator/__init__.py b/pipelines/pipelines/nodes/question_generator/__init__.py new file mode 100644 index 000000000000..9d52d4007895 --- /dev/null +++ b/pipelines/pipelines/nodes/question_generator/__init__.py @@ -0,0 +1,15 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from pipelines.nodes.question_generator.question_generator import QuestionGenerator diff --git a/pipelines/pipelines/nodes/question_generator/question_generator.py b/pipelines/pipelines/nodes/question_generator/question_generator.py new file mode 100644 index 000000000000..b50c60ad5235 --- /dev/null +++ b/pipelines/pipelines/nodes/question_generator/question_generator.py @@ -0,0 +1,279 @@ +# coding:utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
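+
+# QuestionGenerator wraps the PaddleNLP question_generation Taskflow
+# (UNIMO-Text) to synthesize a question for each extracted answer span.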
+ +import os +import json +import sys +import argparse +import re +from tqdm import tqdm + +import paddle +from paddlenlp import Taskflow +from pipelines.nodes.base import BaseComponent +from paddlenlp.utils.env import PPNLP_HOME +from paddlenlp.taskflow.utils import download_file +from paddle.dataset.common import md5file + + +class QuestionGenerator(BaseComponent): + """ + Question Generator based on Unimo Text. + """ + resource_files_names = { + "model_state": "model_state.pdparams", + "model_config": "model_config.json", + "vocab_file": "vocab.txt", + "special_tokens_map": "special_tokens_map.json", + "tokenizer_config": "tokenizer_config.json" + } + + resource_files_urls = { + "unimo-text-1.0-question-generator": { + "model_state": [ + "https://bj.bcebos.com/paddlenlp/pipelines/question_generator/unimo-text-1.0-question-generator-v1/model_state.pdparams", + "856a2980f83dc227a8fed4ecd730696d" + ], + "model_config": [ + "https://bj.bcebos.com/paddlenlp/pipelines/question_generator/unimo-text-1.0-question-generator-v1/model_config.json", + "b5bab534683d9f0ef82fc84803ee6f3d" + ], + "vocab_file": [ + "https://bj.bcebos.com/paddlenlp/pipelines/question_generator/unimo-text-1.0-question-generator-v1/vocab.txt", + "ea3f8a8cc03937a8df165d2b507c551e" + ], + "special_tokens_map": [ + "https://bj.bcebos.com/paddlenlp/pipelines/question_generator/unimo-text-1.0-question-generator-v1/special_tokens_map.json", + "8b3fb1023167bb4ab9d70708eb05f6ec" + ], + "tokenizer_config": [ + "https://bj.bcebos.com/paddlenlp/pipelines/question_generator/unimo-text-1.0-question-generator-v1/tokenizer_config.json", + "ef261f5d413a46ed1d6f071aed6fb345" + ] + }, + } + + return_no_answers: bool + outgoing_edges = 1 + query_count = 0 + query_time = 0 + + def __init__(self, + model='unimo-text-1.0-question-generation', + task_path=None, + device="gpu", + batch_size=16, + output_scores=True, + is_select_from_num_return_sequences=False, + max_length=50, + decode_strategy="sampling", + temperature=1.0, + top_k=5, + top_p=1.0, + num_beams=6, + num_beam_groups=1, + diversity_rate=0.0, + num_return_sequences=1, + template=1): + paddle.set_device(device) + self.model = model + self._from_taskflow = False + self._custom_model = False + if task_path: + self._task_path = task_path + self._custom_model = True + else: + if model in [ + "unimo-text-1.0", "unimo-text-1.0-dureader_qg", + "unimo-text-1.0-question-generation", + "unimo-text-1.0-question-generation-dureader_qg" + ]: + self._task_path = None + self._from_taskflow = True + else: + self._task_path = os.path.join( + PPNLP_HOME, "pipelines", "unsupervised_question_answering", + self.model) + self._check_task_files() + self.model = "unimo-text-1.0" + self.num_return_sequences = num_return_sequences + self.batch_size = batch_size + if self._from_taskflow: + self.question_generation = Taskflow( + "question_generation", + model=self.model if self._from_taskflow else "unimo-text-1.0", + output_scores=True, + max_length=max_length, + is_select_from_num_return_sequences= + is_select_from_num_return_sequences, + num_return_sequences=num_return_sequences, + batch_size=batch_size, + decode_strategy=decode_strategy, + num_beams=num_beams, + num_beam_groups=num_beam_groups, + diversity_rate=diversity_rate, + top_k=top_k, + top_p=top_p, + temperature=temperature, + template=1) + else: + self.question_generation = Taskflow( + "question_generation", + model=self.model if self._from_taskflow else "unimo-text-1.0", + task_path=self._task_path, + output_scores=True, + max_length=max_length, + 
is_select_from_num_return_sequences= + is_select_from_num_return_sequences, + num_return_sequences=num_return_sequences, + batch_size=batch_size, + decode_strategy=decode_strategy, + num_beams=num_beams, + num_beam_groups=num_beam_groups, + diversity_rate=diversity_rate, + top_k=top_k, + top_p=top_p, + temperature=temperature, + template=1) + + def _check_task_files(self): + """ + Check files required by the task. + """ + for file_id, file_name in self.resource_files_names.items(): + path = os.path.join(self._task_path, file_name) + url = self.resource_files_urls[self.model][file_id][0] + md5 = self.resource_files_urls[self.model][file_id][1] + + downloaded = True + if not os.path.exists(path): + downloaded = False + else: + if not self._custom_model: + if os.path.exists(path): + # Check whether the file is updated + if not md5file(path) == md5: + downloaded = False + if file_id == "model_state": + self._param_updated = True + else: + downloaded = False + if not downloaded: + download_file(self._task_path, file_name, url, md5) + + def create_question(self, + json_file_or_pair_list, + out_json=None, + num_return_sequences=1, + all_sample_num=None, + batch_size=8): + if out_json: + wf = open(out_json, 'w', encoding='utf-8') + if isinstance(json_file_or_pair_list, list): + all_lines = json_file_or_pair_list + else: + rf = open(json_file_or_pair_list, 'r', encoding='utf-8') + all_lines = [] + for json_line in rf: + line_dict = json.loads(json_line) + all_lines.append(line_dict) + rf.close() + num_all_lines = len(all_lines) + output = [] + context_buffer = [] + answer_buffer = [] + answer_probability_buffer = [] + true_question_buffer = [] + i = 0 + for index, line_dict in enumerate(tqdm(all_lines)): + if 'question' in line_dict: + q = line_dict['question'] + else: + q = '' + c = line_dict['context'] + assert 'answer_candidates' in line_dict + answers = line_dict['answer_candidates'] + if not answers: + continue + for j, pair in enumerate(answers): + a, p = pair + context_buffer += [c] + answer_buffer += [a] + answer_probability_buffer += [p] + true_question_buffer += [q] + if (i + 1) % batch_size == 0 or ( + all_sample_num and + (i + 1) == all_sample_num) or ((index + 1) == num_all_lines + and j == len(answers) - 1): + result_buffer = self.question_generation([{ + 'context': context, + 'answer': answer + } for context, answer in zip(context_buffer, answer_buffer) + ]) + context_buffer_temp, answer_buffer_temp, answer_probability_buffer_temp, true_question_buffer_temp = [], [], [], [] + for context, answer, answer_probability, true_question in zip( + context_buffer, answer_buffer, + answer_probability_buffer, true_question_buffer): + context_buffer_temp += [context] * num_return_sequences + answer_buffer_temp += [answer] * num_return_sequences + answer_probability_buffer_temp += [ + answer_probability + ] * num_return_sequences + true_question_buffer_temp += [true_question + ] * num_return_sequences + result_one_two_buffer = [ + (one, two) + for one, two in zip(result_buffer[0], result_buffer[1]) + ] + for context, answer, answer_probability, true_question, result in zip( + context_buffer_temp, answer_buffer_temp, + answer_probability_buffer_temp, + true_question_buffer_temp, result_one_two_buffer): + fake_quesitons_tokens = [result[0]] + fake_quesitons_scores = [result[1]] + for fake_quesitons_token, fake_quesitons_score in zip( + fake_quesitons_tokens, fake_quesitons_scores): + out_dict = { + 'context': context, + 'synthetic_answer': answer, + 'synthetic_answer_probability': + 
answer_probability,
+                                'synthetic_question': fake_quesitons_token,
+                                'synthetic_question_probability':
+                                fake_quesitons_score,
+                                'true_question': true_question,
+                            }
+                            if out_json:
+                                wf.write(
+                                    json.dumps(out_dict, ensure_ascii=False) +
+                                    "\n")
+                            output.append(out_dict)
+                    # Reset all four batch buffers so later batches stay aligned.
+                    context_buffer = []
+                    answer_buffer = []
+                    answer_probability_buffer = []
+                    true_question_buffer = []
+                if all_sample_num and (i + 1) >= all_sample_num:
+                    break
+                i += 1
+        if out_json:
+            wf.close()
+        return output
+
+    def run(self, ca_pairs):
+        print('creating synthetic question-answer pairs...')
+        synthetic_context_answer_question_triples = self.create_question(
+            ca_pairs, None, self.num_return_sequences, None, self.batch_size)
+        print('created synthetic question-answer pairs successfully!')
+        results = {"cqa_triples": synthetic_context_answer_question_triples}
+        return results, "output_1"
diff --git a/pipelines/pipelines/pipelines/__init__.py b/pipelines/pipelines/pipelines/__init__.py
index 40f7714d6b5c..3929f9061228 100644
--- a/pipelines/pipelines/pipelines/__init__.py
+++ b/pipelines/pipelines/pipelines/__init__.py
@@ -13,8 +13,12 @@
 # limitations under the License.
 
 from pipelines.pipelines.base import Pipeline, RootNode
-from pipelines.pipelines.standard_pipelines import (BaseStandardPipeline,
-                                                    ExtractiveQAPipeline,
-                                                    SemanticSearchPipeline,
-                                                    DocPipeline,
-                                                    TextToImagePipeline)
\ No newline at end of file
+
+from pipelines.pipelines.standard_pipelines import (
+    BaseStandardPipeline,
+    ExtractiveQAPipeline,
+    SemanticSearchPipeline,
+    DocPipeline,
+    TextToImagePipeline,
+    QAGenerationPipeline,
+)
diff --git a/pipelines/pipelines/pipelines/standard_pipelines.py b/pipelines/pipelines/pipelines/standard_pipelines.py
index 04e0a87816db..7f8ed86fbb1c 100644
--- a/pipelines/pipelines/pipelines/standard_pipelines.py
+++ b/pipelines/pipelines/pipelines/standard_pipelines.py
@@ -25,6 +25,8 @@
 from pipelines.nodes.retriever import BaseRetriever
 from pipelines.document_stores import BaseDocumentStore
 from pipelines.nodes.text_to_image_generator import ErnieTextToImageGenerator
+from pipelines.nodes.answer_extractor import AnswerExtractor, QAFilter
+from pipelines.nodes.question_generator import QuestionGenerator
 from pipelines.pipelines import Pipeline
 from pipelines.nodes.base import BaseComponent
 
@@ -331,3 +333,41 @@
         return output
+
+
+class QAGenerationPipeline(BaseStandardPipeline):
+    """
+    Pipeline for unsupervised question-answer pair generation.
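+    It chains AnswerExtractor, QuestionGenerator and QAFilter to turn raw
+    text contexts into filtered question-answer pairs.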
+ """ + + def __init__(self, answer_extractor: AnswerExtractor, + question_generator: QuestionGenerator, qa_filter: QAFilter): + """ + :param retriever: Retriever instance + """ + self.pipeline = Pipeline() + self.pipeline.add_node(component=answer_extractor, + name="AnswerExtractor", + inputs=["Query"]) + self.pipeline.add_node(component=question_generator, + name="QuestionGenerator", + inputs=["AnswerExtractor"]) + self.pipeline.add_node(component=qa_filter, + name="QAFilter", + inputs=["QuestionGenerator"]) + + def run(self, + meta: List[str], + params: Optional[dict] = None, + debug: Optional[bool] = None): + """ + :param query: the query string. + :param params: params for the `retriever` and `reader`. For instance, params={"Retriever": {"top_k": 10}} + :param debug: Whether the pipeline should instruct nodes to collect debug information + about their execution. By default these include the input parameters + they received and the output they generated. + All debug information can then be found in the dict returned + by this method under the key "_debug" + """ + output = self.pipeline.run(meta=meta, params=params, debug=debug) + return output diff --git a/pipelines/rest_api/config.py b/pipelines/rest_api/config.py index ab86d26d9855..d2b968c6ad42 100644 --- a/pipelines/rest_api/config.py +++ b/pipelines/rest_api/config.py @@ -20,6 +20,8 @@ str((Path(__file__).parent / "pipeline" / "pipelines.yaml").absolute())) QUERY_PIPELINE_NAME = os.getenv("QUERY_PIPELINE_NAME", "query") INDEXING_PIPELINE_NAME = os.getenv("INDEXING_PIPELINE_NAME", "indexing") +INDEXING_QA_GENERATING_PIPELINE_NAME = os.getenv( + "INDEXING_QA_GENERATING_PIPELINE_NAME", "indexing_qa_generating") FILE_UPLOAD_PATH = os.getenv( "FILE_UPLOAD_PATH", str((Path(__file__).parent / "file-upload").absolute())) diff --git a/pipelines/rest_api/controller/file_upload.py b/pipelines/rest_api/controller/file_upload.py index 56f36c5ea260..3af543fab41f 100644 --- a/pipelines/rest_api/controller/file_upload.py +++ b/pipelines/rest_api/controller/file_upload.py @@ -28,7 +28,7 @@ from pipelines.pipelines.base import Pipeline from pipelines.pipelines.config import get_component_definitions, get_pipeline_definition, read_pipeline_config_from_yaml -from rest_api.config import PIPELINE_YAML_PATH, FILE_UPLOAD_PATH, INDEXING_PIPELINE_NAME, FILE_PARSE_PATH +from rest_api.config import PIPELINE_YAML_PATH, FILE_UPLOAD_PATH, INDEXING_PIPELINE_NAME, INDEXING_QA_GENERATING_PIPELINE_NAME, FILE_PARSE_PATH from rest_api.controller.utils import as_form logger = logging.getLogger(__name__) @@ -55,11 +55,17 @@ "Indexing Pipeline with FAISSDocumentStore or InMemoryDocumentStore is not supported with the REST APIs." ) INDEXING_PIPELINE = None + INDEXING_QA_GENERATING_PIPELINE = None else: + INDEXING_QA_GENERATING_PIPELINE = Pipeline.load_from_yaml( + Path(PIPELINE_YAML_PATH), + pipeline_name=INDEXING_QA_GENERATING_PIPELINE_NAME) INDEXING_PIPELINE = Pipeline.load_from_yaml( Path(PIPELINE_YAML_PATH), pipeline_name=INDEXING_PIPELINE_NAME) + except KeyError: INDEXING_PIPELINE = None + INDEXING_QA_GENERATING_PIPELINE = None logger.warning( "Indexing Pipeline not found in the YAML configuration. File Upload API will not be available." 
) @@ -89,6 +95,55 @@ class Response(BaseModel): file_id: str +@router.post("/file-upload-qa-generate") +def upload_file( + files: List[UploadFile] = File(...), + # JSON serialized string + meta: Optional[str] = Form("null"), # type: ignore + fileconverter_params: FileConverterParams = Depends( + FileConverterParams.as_form), # type: ignore +): + """ + You can use this endpoint to upload a file for indexing + """ + if not INDEXING_QA_GENERATING_PIPELINE: + raise HTTPException( + status_code=501, + detail="INDEXING_QA_GENERATING_PIPELINE is not configured.") + + file_paths: list = [] + file_metas: list = [] + meta_form = json.loads(meta) or {} # type: ignore + if not isinstance(meta_form, dict): + raise HTTPException( + status_code=500, + detail= + f"The meta field must be a dict or None, not {type(meta_form)}") + + for file in files: + try: + file_path = Path( + FILE_UPLOAD_PATH) / f"{uuid.uuid4().hex}_{file.filename}" + with file_path.open("wb") as buffer: + shutil.copyfileobj(file.file, buffer) + + file_paths.append(file_path) + meta_form["name"] = file.filename + file_metas.append(meta_form) + finally: + file.file.close() + + result = INDEXING_QA_GENERATING_PIPELINE.run( + file_paths=file_paths, + meta=file_metas, + params={ + "TextFileConverter": fileconverter_params.dict(), + "PDFFileConverter": fileconverter_params.dict(), + }, + ) + return {'message': "OK"} + + @router.post("/file-upload") def upload_file( files: List[UploadFile] = File(...), diff --git a/pipelines/rest_api/controller/search.py b/pipelines/rest_api/controller/search.py index bb2d010f50f3..4561b199a7b6 100644 --- a/pipelines/rest_api/controller/search.py +++ b/pipelines/rest_api/controller/search.py @@ -27,7 +27,7 @@ from pipelines.pipelines.base import Pipeline from rest_api.config import PIPELINE_YAML_PATH, QUERY_PIPELINE_NAME from rest_api.config import LOG_LEVEL, CONCURRENT_REQUEST_PER_WORKER -from rest_api.schema import QueryRequest, QueryResponse, DocumentRequest, DocumentResponse, QueryImageResponse +from rest_api.schema import QueryRequest, QueryResponse, DocumentRequest, DocumentResponse, QueryImageResponse, QueryQAPairResponse, QueryQAPairRequest from rest_api.controller.utils import RequestLimiter logging.getLogger("pipelines").setLevel(LOG_LEVEL) @@ -41,6 +41,9 @@ PIPELINE = Pipeline.load_from_yaml(Path(PIPELINE_YAML_PATH), pipeline_name=QUERY_PIPELINE_NAME) + +QA_PAIR_PIPELINE = Pipeline.load_from_yaml(Path(PIPELINE_YAML_PATH), + pipeline_name="query_qa_pairs") DOCUMENT_STORE = PIPELINE.get_document_store() logging.info(f"Loaded pipeline nodes: {PIPELINE.graph.nodes.keys()}") @@ -76,6 +79,7 @@ def query(request: QueryRequest): This endpoint receives the question as a string and allows the requester to set additional parameters that will be passed on to the pipelines pipeline. """ + print('query', request) with concurrency_limiter.run(): result = _process_request(PIPELINE, request) return result @@ -118,6 +122,25 @@ def query_documents(request: DocumentRequest): return result +@router.post("/query_qa_pairs", + response_model=QueryQAPairResponse, + response_model_exclude_none=True) +def query_qa_pairs(request: QueryQAPairRequest): + """ + This endpoint receives the question as a string and allows the requester to set + additional parameters that will be passed on to the pipelines pipeline. 
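+    The request body carries 'meta' (a list of context strings) and optional
+    'params' that are forwarded to the question-answer pair generation pipeline.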
+ """ + print('request', request) + result = {} + result['meta'] = request.meta + params = request.params or {} + res = QA_PAIR_PIPELINE.run(meta=request.meta, + params=params, + debug=request.debug) + result['filtered_cqa_triples'] = res['filtered_cqa_triples'] + return result + + def _process_request(pipeline, request) -> Dict[str, Any]: start_time = time.time() diff --git a/pipelines/rest_api/pipeline/pipelines.yaml b/pipelines/rest_api/pipeline/pipelines.yaml index c2243a90093b..d844a1187d48 100644 --- a/pipelines/rest_api/pipeline/pipelines.yaml +++ b/pipelines/rest_api/pipeline/pipelines.yaml @@ -29,6 +29,10 @@ components: # define all the building-blocks for Pipeline type: PDFToTextConverter - name: DocxFileConverter type: DocxToTextConverter + - name: AnswerExtractorPreprocessor + type: AnswerExtractorPreprocessor + - name: QAFilterPostprocessor + type: QAFilterPostprocessor - name: Preprocessor type: PreProcessor params: @@ -64,3 +68,31 @@ pipelines: inputs: [Preprocessor] - name: DocumentStore inputs: [Retriever] + + - name: indexing_qa_generating + type: Indexing_qa_generating + nodes: + - name: FileTypeClassifier + inputs: [File] + - name: TextFileConverter + inputs: [FileTypeClassifier.output_1] + - name: PDFFileConverter + inputs: [FileTypeClassifier.output_2] + - name: DocxFileConverter + inputs: [FileTypeClassifier.output_4] + - name: ImageFileConverter + inputs: [FileTypeClassifier.output_6] + - name: AnswerExtractorPreprocessor + inputs: [PDFFileConverter, TextFileConverter, DocxFileConverter, ImageFileConverter] + - name: AnswerExtractor + inputs: [AnswerExtractorPreprocessor] + - name: QuestionGenerator + inputs: [AnswerExtractor] + - name: QAFilter + inputs: [QuestionGenerator] + - name: QAFilterPostprocessor + inputs: [QAFilter] + - name: Retriever + inputs: [QAFilterPostprocessor] + - name: DocumentStore + inputs: [Retriever] diff --git a/pipelines/rest_api/pipeline/unsupervised_qa.yaml b/pipelines/rest_api/pipeline/unsupervised_qa.yaml new file mode 100644 index 000000000000..40d799d85627 --- /dev/null +++ b/pipelines/rest_api/pipeline/unsupervised_qa.yaml @@ -0,0 +1,106 @@ +version: '1.1.0' + +components: # define all the building-blocks for Pipeline + - name: DocumentStore + type: ElasticsearchDocumentStore # consider using MilvusDocumentStore or WeaviateDocumentStore for scaling to large number of documents + params: + host: localhost + port: 9200 + index: my_data + embedding_dim: 312 + - name: Retriever + type: DensePassageRetriever + params: + document_store: DocumentStore # params can reference other components defined in the YAML + top_k: 10 + query_embedding_model: rocketqa-zh-nano-query-encoder + passage_embedding_model: rocketqa-zh-nano-para-encoder + embed_title: False + - name: Ranker # custom-name for the component; helpful for visualization & debugging + type: ErnieRanker # pipelines Class name for the component + params: + model_name_or_path: rocketqa-nano-cross-encoder + top_k: 3 + - name: TextFileConverter + type: TextConverter + - name: ImageFileConverter + type: ImageToTextConverter + - name: PDFFileConverter + type: PDFToTextConverter + - name: DocxFileConverter + type: DocxToTextConverter + - name: AnswerExtractorPreprocessor + type: AnswerExtractorPreprocessor + - name: QAFilterPostprocessor + type: QAFilterPostprocessor + - name: Preprocessor + type: PreProcessor + params: + split_by: passage + split_respect_sentence_boundary: False + split_answers: True + - name: FileTypeClassifier + type: FileTypeClassifier + - name: AnswerExtractor + type: 
AnswerExtractor + params: + model: uie-base-answer-extractor + schema: ['答案'] + position_prob: 0.01 + max_answer_candidates: 3 + - name: QuestionGenerator + type: QuestionGenerator + params: + model: unimo-text-1.0-question-generation + num_return_sequences: 2 + - name: QAFilter + type: QAFilter + params: + model: uie-base-qa-filter + schema: ['答案'] + position_prob: 0.1 + +pipelines: + - name: query # a sample extractive-qa Pipeline + type: Query + nodes: + - name: Retriever + inputs: [Query] + - name: Ranker + inputs: [Retriever] + - name: indexing_qa_generating + type: Indexing_qa_generating + nodes: + - name: FileTypeClassifier + inputs: [File] + - name: TextFileConverter + inputs: [FileTypeClassifier.output_1] + - name: PDFFileConverter + inputs: [FileTypeClassifier.output_2] + - name: DocxFileConverter + inputs: [FileTypeClassifier.output_4] + - name: ImageFileConverter + inputs: [FileTypeClassifier.output_6] + - name: AnswerExtractorPreprocessor + inputs: [PDFFileConverter, TextFileConverter, DocxFileConverter, ImageFileConverter] + - name: AnswerExtractor + inputs: [AnswerExtractorPreprocessor] + - name: QuestionGenerator + inputs: [AnswerExtractor] + - name: QAFilter + inputs: [QuestionGenerator] + - name: QAFilterPostprocessor + inputs: [QAFilter] + - name: Retriever + inputs: [QAFilterPostprocessor] + - name: DocumentStore + inputs: [Retriever] + - name: query_qa_pairs + type: Query + nodes: + - name: AnswerExtractor + inputs: [Query] + - name: QuestionGenerator + inputs: [AnswerExtractor] + - name: QAFilter + inputs: [QuestionGenerator] diff --git a/pipelines/rest_api/schema.py b/pipelines/rest_api/schema.py index 486078554a94..c0f823707f7c 100644 --- a/pipelines/rest_api/schema.py +++ b/pipelines/rest_api/schema.py @@ -106,3 +106,19 @@ class QueryImageResponse(BaseModel): answers: List[str] = [] documents: List[DocumentSerialized] = [] debug: Optional[Dict] = Field(None, alias="_debug") + + +class QueryQAPairRequest(BaseModel): + meta: List[str] + params: Optional[dict] = None + debug: Optional[bool] = False + + class Config: + # Forbid any extra fields in the request to avoid silent failures + extra = Extra.forbid + + +class QueryQAPairResponse(BaseModel): + meta: List[str] + filtered_cqa_triples: List[dict] = [] + debug: Optional[Dict] = Field(None, alias="_debug") \ No newline at end of file diff --git a/pipelines/ui/utils.py b/pipelines/ui/utils.py index 540c44cc2247..a66106c84624 100644 --- a/pipelines/ui/utils.py +++ b/pipelines/ui/utils.py @@ -24,6 +24,12 @@ import streamlit as st from io import StringIO +import paddle +from pipelines.utils import convert_files_to_dicts, fetch_archive_from_http +from pipelines.document_stores import ElasticsearchDocumentStore, MilvusDocumentStore +from pipelines.nodes import DensePassageRetriever +from pipelines.utils import launch_es + API_ENDPOINT = os.getenv("API_ENDPOINT") STATUS = "initialized" HS_VERSION = "hs_version" @@ -32,6 +38,8 @@ DOC_UPLOAD = "file-upload" DOC_PARSE = 'files' IMAGE_REQUEST = 'query_text_to_images' +QA_PAIR_REQUEST = 'query_qa_pairs' +FILE_UPLOAD_QA_GENERATE = 'file-upload-qa-generate' def pipelines_is_ready(): @@ -214,6 +222,31 @@ def text_to_image_search( return results, response +def text_to_qa_pair_search(query, + is_filter=True + ) -> Tuple[List[Dict[str, Any]], Dict[str, str]]: + """ + Send a prompt text and corresponding parameters to the REST API + """ + url = f"{API_ENDPOINT}/{QA_PAIR_REQUEST}" + params = { + "QAFilter": { + "is_filter": is_filter, + }, + } + + req = {"meta": [query], "params": 
+    response_raw = requests.post(url, json=req)
+    if response_raw.status_code >= 400 and response_raw.status_code != 503:
+        raise Exception(f"{vars(response_raw)}")
+
+    response = response_raw.json()
+    if "errors" in response:
+        raise Exception(", ".join(response["errors"]))
+    results = response["filtered_cqa_triples"]
+    return results, response
+
+
 def send_feedback(query, answer_obj, is_correct_answer, is_correct_document,
                   document) -> None:
     """
@@ -242,6 +273,13 @@
     return response
 
 
+def file_upload_qa_generate(file):
+    url = f"{API_ENDPOINT}/{FILE_UPLOAD_QA_GENERATE}"
+    files = [("files", file)]
+    response = requests.post(url, files=files).json()
+    return response
+
+
 def get_backlink(result) -> Tuple[Optional[str], Optional[str]]:
     if result.get("document", None):
         doc = result["document"]
@@ -252,3 +290,58 @@
                 "title", None):
             return doc["meta"]["url"], doc["meta"]["title"]
     return None, None
+
+
+def offline_ann(index_name,
+                doc_dir,
+                search_engine="elastic",
+                host="127.0.0.1",
+                port="9200",
+                query_embedding_model="rocketqa-zh-nano-query-encoder",
+                passage_embedding_model="rocketqa-zh-nano-para-encoder",
+                params_path="checkpoints/model_40/model_state.pdparams",
+                embedding_dim=312,
+                split_answers=True):
+    if search_engine == "milvus":
+        document_store = MilvusDocumentStore(embedding_dim=embedding_dim,
+                                             host=host,
+                                             index=index_name,
+                                             port=port,
+                                             index_param={
+                                                 "M": 16,
+                                                 "efConstruction": 50
+                                             },
+                                             index_type="HNSW")
+    else:
+        launch_es()
+        document_store = ElasticsearchDocumentStore(host=host,
+                                                    port=port,
+                                                    username="",
+                                                    password="",
+                                                    embedding_dim=embedding_dim,
+                                                    index=index_name)
+    # Split each document into paragraphs
+    dicts = convert_files_to_dicts(dir_path=doc_dir,
+                                   split_paragraphs=True,
+                                   split_answers=split_answers,
+                                   encoding='utf-8')
+
+    # Write the documents into the document store
+    document_store.write_documents(dicts)
+
+    # Semantic indexing (retrieval) model
+    retriever = DensePassageRetriever(
+        document_store=document_store,
+        query_embedding_model=query_embedding_model,
+        passage_embedding_model=passage_embedding_model,
+        params_path=params_path,
+        output_emb_size=embedding_dim,
+        max_seq_len_query=64,
+        max_seq_len_passage=256,
+        batch_size=1,
+        use_gpu=True,
+        embed_title=False,
+    )
+
+    # Build the ANN index by computing embeddings for all documents
+    document_store.update_embeddings(retriever)
diff --git a/pipelines/ui/webapp_unsupervised_question_answering.py b/pipelines/ui/webapp_unsupervised_question_answering.py
new file mode 100644
index 000000000000..c288f316d406
--- /dev/null
+++ b/pipelines/ui/webapp_unsupervised_question_answering.py
@@ -0,0 +1,340 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+# Copyright 2021 deepset GmbH. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import sys
+import logging
+import pandas as pd
+from json import JSONDecodeError
+from pathlib import Path
+import streamlit as st
+from annotated_text import annotation
+from markdown import markdown
+
+from ui.utils import pipelines_is_ready, semantic_search, send_feedback, upload_doc, file_upload_qa_generate, pipelines_version, get_backlink, text_to_qa_pair_search, offline_ann
+
+# Adjust to a question that you would like users to see in the search bar when they load the UI:
+# DEFAULT_QUESTION_AT_STARTUP = os.getenv("DEFAULT_QUESTION_AT_STARTUP", "如何办理企业养老保险?")
+DEFAULT_QUESTION_AT_STARTUP = os.getenv("DEFAULT_QUESTION_AT_STARTUP", "")
+# Sliders
+DEFAULT_DOCS_FROM_RETRIEVER = int(os.getenv("DEFAULT_DOCS_FROM_RETRIEVER",
+                                            "30"))
+DEFAULT_NUMBER_OF_ANSWERS = int(os.getenv("DEFAULT_NUMBER_OF_ANSWERS", "3"))
+# Labels for the evaluation
+EVAL_LABELS = os.getenv("EVAL_FILE",
+                        str(Path(__file__).parent / "insurance_faq.csv"))
+# Corpus dir for ANN
+CORPUS_DIR = os.getenv("CORPUS_DIR", str('data/my_data'))
+# QA pairs file to be saved
+UPDATE_FILE = os.getenv("UPDATE_FILE", str('data/my_data/custom_qa_pairs.txt'))
+# Whether the file upload should be enabled or not
+DISABLE_FILE_UPLOAD = bool(os.getenv("DISABLE_FILE_UPLOAD"))
+
+DEFAULT_NUMBER_OF_FILTER_STRENGTH = int(
+    os.getenv("DEFAULT_NUMBER_OF_FILTER_STRENGTH", "10"))
+
+
+def set_state_if_absent(key, value):
+    if key not in st.session_state:
+        st.session_state[key] = value
+
+
+def on_change_text():
+    st.session_state.question = st.session_state.quest
+    st.session_state.answer = None
+    st.session_state.results = None
+    st.session_state.raw_json = None
+
+
+def on_change_text_qag():
+    st.session_state.qag_question = st.session_state.qag_quest
+    st.session_state.answer = None
+    st.session_state.qag_results = None
+    st.session_state.qag_raw_json = None
+
+
+def upload():
+    data_files = st.session_state.upload_files['files']
+    for data_file in data_files:
+        # Upload file
+        if data_file and data_file.name not in st.session_state.upload_files[
+                'uploaded_files']:
+            # raw_json = upload_doc(data_file)
+            raw_json = file_upload_qa_generate(data_file)
+            st.session_state.upload_files['uploaded_files'].append(
+                data_file.name)
+    # Save the uploaded files
+    st.session_state.upload_files['uploaded_files'] = list(
+        set(st.session_state.upload_files['uploaded_files']))
+
+
+def main():
+
+    st.set_page_config(page_title="PaddleNLP无监督智能检索问答", page_icon='🐮')
+    # page_icon="https://github.com/PaddlePaddle/Paddle/blob/develop/doc/imgs/logo.png")
+
+    # Persistent state
+    set_state_if_absent("question", DEFAULT_QUESTION_AT_STARTUP)
+    set_state_if_absent("qag_question", DEFAULT_QUESTION_AT_STARTUP)
+    set_state_if_absent("results", None)
+    set_state_if_absent("qag_results", None)
+    set_state_if_absent("raw_json", None)
+    set_state_if_absent("qag_raw_json", None)
+    set_state_if_absent("random_question_requested", False)
+    set_state_if_absent("upload_files", {'uploaded_files': [], 'files': []})
+
+    # Small callback to reset the interface in case the text of the question changes
+    def reset_results(*args):
+        st.session_state.answer = None
+        st.session_state.results = None
+        st.session_state.raw_json = None
+
+    def reset_results_qag(*args):
+        st.session_state.answer = None
+        st.session_state.qag_results = None
+        st.session_state.qag_raw_json = None
+
+    # Title
+    st.write("## 无监督智能检索问答")
+    # Sidebar
+    st.sidebar.header("选项")
+    st.sidebar.write("### 问答对生成:")
+    is_filter = st.sidebar.selectbox(
+        "是否进行自动过滤",
+        ('是', '否'),
+        on_change=reset_results_qag,
+    )
+    st.sidebar.write("### 问答检索:")
+    top_k_reader = st.sidebar.slider(
+        "返回答案数量",
+        min_value=1,
+        max_value=30,
+        value=DEFAULT_NUMBER_OF_ANSWERS,
+        step=1,
+        on_change=reset_results,
+    )
+    top_k_retriever = st.sidebar.slider(
+        "最大检索数量",
+        min_value=1,
+        max_value=100,
+        value=DEFAULT_DOCS_FROM_RETRIEVER,
+        step=1,
+        on_change=reset_results,
+    )
+
+    if not DISABLE_FILE_UPLOAD:
+        st.sidebar.write("### 文件上传:")
+        data_files = st.sidebar.file_uploader(
+            "",
+            type=["pdf", "txt", "docx", "png"],
+            help="选择多个文件",
+            accept_multiple_files=True)
+        st.session_state.upload_files['files'] = data_files
+        st.sidebar.button("文件上传并自动生成载入问答对", on_click=upload)
+        for data_file in st.session_state.upload_files['uploaded_files']:
+            st.sidebar.write(str(data_file) + "    ✅ ")
+
+    hs_version = ""
+    try:
+        hs_version = f" (v{pipelines_version()})"
+    except Exception:
+        pass
+    # Load csv into pandas dataframe
+    try:
+        df = pd.read_csv(EVAL_LABELS, sep=";")
+    except Exception:
+        st.error("The eval file was not found.")
+        sys.exit(f"The eval file was not found under `{EVAL_LABELS}`.")
+
+    ## QA pairs generation
+    # Search bar
+    st.write("### 问答对生成:")
+    context = st.text_input("",
+                            value=st.session_state.qag_question,
+                            key="qag_quest",
+                            on_change=on_change_text_qag,
+                            max_chars=350,
+                            placeholder='请输入要抽取问答对的文本')
+    qag_col1, qag_col2 = st.columns(2)
+    qag_col1.markdown("",
+                      unsafe_allow_html=True)
+    qag_col2.markdown("",
+                      unsafe_allow_html=True)
+
+    # Run button
+    qag_run_pressed = qag_col1.button("开始生成")
+
+    # Write the generated QA pairs to file and refresh the ANN index
+    if qag_col2.button("存入数据库"):
+        with open(UPDATE_FILE, 'a', encoding='utf-8') as wf:
+            for count, result in enumerate(st.session_state.qag_results):
+                context = result["context"]
+                synthetic_answer = result["synthetic_answer"]
+                synthetic_question = result["synthetic_question"]
+                wf.write(synthetic_question.strip() + '\t' +
+                         synthetic_answer.strip() + '\n')
+        offline_ann('my_data', CORPUS_DIR)
+        reset_results_qag()
+
+    # st.session_state.random_question_requested = False
+    qag_run_query = (qag_run_pressed or context != st.session_state.qag_question
+                     ) and not st.session_state.random_question_requested
+    # qag_run_query = qag_run_pressed
+
+    # Check the connection
+    with st.spinner("⌛️    pipelines is starting..."):
+        if not pipelines_is_ready():
+            st.error("🚫    Connection Error. Is pipelines running?")
+            qag_run_query = False
+            reset_results_qag()
+    # Get results for query
+    if (qag_run_query or st.session_state.qag_results is None) and context:
+        reset_results_qag()
+        st.session_state.qag_question = context
+        with st.spinner(
+                "🧠    Performing neural search on documents... \n "
+                "Do you want to optimize speed or accuracy? \n"):
+            try:
+                st.session_state.qag_results, st.session_state.qag_raw_json = text_to_qa_pair_search(
+                    context, is_filter=(is_filter == "是"))
+            except JSONDecodeError as je:
+                st.error(
+                    "👓    An error occurred reading the results. Is the document store working?"
+                )
+                return
+            except Exception as e:
+                logging.exception(e)
+                if "The server is busy processing requests" in str(
+                        e) or "503" in str(e):
+                    st.error(
+                        "🧑‍🌾    All our workers are busy! Try again later."
+                    )
+                else:
+                    st.error(
+                        "🐞    An error occurred during the request.")
+                return
+
+    if st.session_state.qag_results:
+        st.write("#### 返回结果:")
+        for count, result in enumerate(st.session_state.qag_results):
+            context = result["context"]
+            synthetic_answer = result["synthetic_answer"]
+            synthetic_answer_probability = result[
+                "synthetic_answer_probability"]
+            synthetic_question = result["synthetic_question"]
+            synthetic_question_probability = result[
+                "synthetic_question_probability"]
+            st.write(
+                markdown(context),
+                unsafe_allow_html=True,
+            )
+            st.write(
+                markdown('**问题:**' + synthetic_question),
+                unsafe_allow_html=True,
+            )
+            st.write(
+                markdown('**答案:**' + synthetic_answer),
+                unsafe_allow_html=True,
+            )
+
+            st.write("___")
+
+    ## QA search
+    # Search bar
+    st.write("### 问答检索:")
+    question = st.text_input("",
+                             value=st.session_state.question,
+                             key="quest",
+                             on_change=on_change_text,
+                             max_chars=100,
+                             placeholder='请输入您的问题')
+    col1, col2 = st.columns(2)
+    col1.markdown("",
+                  unsafe_allow_html=True)
+    col2.markdown("",
+                  unsafe_allow_html=True)
+
+    # Run button
+    run_pressed = col1.button("运行")
+
+    # Get next random question from the CSV
+    if col2.button("随机提问"):
+        reset_results()
+        new_row = df.sample(1)
+        while (
+                new_row["Question Text"].values[0] == st.session_state.question
+        ):  # Avoid picking the same question twice (the change is not visible on the UI)
+            new_row = df.sample(1)
+        st.session_state.question = new_row["Question Text"].values[0]
+        st.session_state.random_question_requested = True
+        # Re-runs the script setting the random question as the textbox value
+        # Unfortunately necessary as the Random Question button is _below_ the textbox
+        st.experimental_rerun()
+
+    st.session_state.random_question_requested = False
+
+    run_query = (run_pressed or question != st.session_state.question
+                 ) and not st.session_state.random_question_requested
+
+    # Check the connection
+    with st.spinner("⌛️    pipelines is starting..."):
+        if not pipelines_is_ready():
+            st.error("🚫    Connection Error. Is pipelines running?")
+            run_query = False
+            reset_results()
+    # Get results for query
+    if (run_query or st.session_state.results is None) and question:
+        reset_results()
+        st.session_state.question = question
+        with st.spinner(
+                "🧠    Performing neural search on documents... \n "
+                "Do you want to optimize speed or accuracy? \n"):
+            try:
+                st.session_state.results, st.session_state.raw_json = semantic_search(
+                    question,
+                    top_k_reader=top_k_reader,
+                    top_k_retriever=top_k_retriever)
+            except JSONDecodeError as je:
+                st.error(
+                    "👓    An error occurred reading the results. Is the document store working?"
+                )
+                return
+            except Exception as e:
+                logging.exception(e)
+                if "The server is busy processing requests" in str(
+                        e) or "503" in str(e):
+                    st.error(
+                        "🧑‍🌾    All our workers are busy! Try again later."
+                    )
+                else:
+                    st.error(
+                        "🐞    An error occurred during the request.")
+                return
+
+    if st.session_state.results:
+
+        st.write("#### 返回结果:")
+        for count, result in enumerate(st.session_state.results):
+            context = result["context"]
+            st.write(
+                markdown(context),
+                unsafe_allow_html=True,
+            )
+            st.write("**答案:** ", result["answer"])
+            st.write("**Relevance:** ", result["relevance"])
+            st.write("___")
+
+
+main()
diff --git a/pipelines/utils/__init__.py b/pipelines/utils/__init__.py
new file mode 100644
index 000000000000..97043fd7ba68
--- /dev/null
+++ b/pipelines/utils/__init__.py
@@ -0,0 +1,13 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
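For reviewers who want to try the two new REST endpoints end to end, here is a minimal smoke-test sketch. It assumes the REST server from this change is already running; the host/port (`localhost:8891`) and the sample file name are assumptions, while the payload shape follows `QueryQAPairRequest` and the `file-upload-qa-generate` handler above.

```python
# Hedged sketch: exercise the two endpoints added in this change.
# ASSUMPTIONS: the REST API listens on localhost:8891 and sample.txt is any local text file.
import requests

API = "http://localhost:8891"

# /query_qa_pairs mirrors QueryQAPairRequest: `meta` is a list of context strings,
# `params` is forwarded to the pipeline nodes (here: toggling QAFilter).
resp = requests.post(
    f"{API}/query_qa_pairs",
    json={"meta": ["中国平安人寿保险公司成立于2002年。"],
          "params": {"QAFilter": {"is_filter": True}}})
resp.raise_for_status()
for triple in resp.json()["filtered_cqa_triples"]:
    print(triple["synthetic_question"], "->", triple["synthetic_answer"])

# /file-upload-qa-generate converts the file, generates QA pairs and indexes them.
with open("sample.txt", "rb") as f:
    resp = requests.post(f"{API}/file-upload-qa-generate", files=[("files", f)])
print(resp.json())  # expected: {"message": "OK"}
```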
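The same QA-generation flow can also be driven in-process instead of over REST. A minimal sketch, with the constructor arguments taken from `unsupervised_qa.yaml` above; the `pipelines.nodes` import path for the three nodes is an assumption, not confirmed by this diff.

```python
# Hedged sketch: wire AnswerExtractor -> QuestionGenerator -> QAFilter by hand,
# mirroring the query_qa_pairs pipeline defined in unsupervised_qa.yaml.
# ASSUMPTION: the three node classes are importable from pipelines.nodes.
from pipelines.pipelines.base import Pipeline
from pipelines.nodes import AnswerExtractor, QuestionGenerator, QAFilter

answer_extractor = AnswerExtractor(model="uie-base-answer-extractor",
                                   schema=['答案'],
                                   position_prob=0.01,
                                   max_answer_candidates=3)
question_generator = QuestionGenerator(model="unimo-text-1.0-question-generation",
                                       num_return_sequences=2)
qa_filter = QAFilter(model="uie-base-qa-filter",
                     schema=['答案'],
                     position_prob=0.1)

pipe = Pipeline()
pipe.add_node(component=answer_extractor, name="AnswerExtractor", inputs=["Query"])
pipe.add_node(component=question_generator, name="QuestionGenerator",
              inputs=["AnswerExtractor"])
pipe.add_node(component=qa_filter, name="QAFilter", inputs=["QuestionGenerator"])

# `meta` is a list of context strings, as in the run() signature above.
result = pipe.run(meta=["中国平安人寿保险公司成立于2002年。"])
print(result["filtered_cqa_triples"])
```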
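Finally, the indexing path can be loaded straight from the new YAML, which is what `/file-upload-qa-generate` does internally. A sketch under two assumptions: it is executed from the `pipelines/` directory, and `doc.txt` is a placeholder for any local input file.

```python
# Hedged sketch: run the indexing_qa_generating pipeline from the YAML added here.
# ASSUMPTIONS: run from the pipelines/ directory; doc.txt exists; an Elasticsearch
# instance is reachable as configured in unsupervised_qa.yaml.
from pathlib import Path
from pipelines.pipelines.base import Pipeline

indexing_pipe = Pipeline.load_from_yaml(
    Path("rest_api/pipeline/unsupervised_qa.yaml"),
    pipeline_name="indexing_qa_generating")

# Converts the file, extracts answers, generates and filters questions, then
# writes the resulting QA pairs into the document store via the Retriever.
indexing_pipe.run(file_paths=["doc.txt"], meta=[{"name": "doc.txt"}])
```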