The following hyperparameters can be customized through command line arguments in the scripts/run_dpr_training.sh
script.
Command line argument | What it is for | Default value | Data type |
---|---|---|---|
--num_hard_neg | number of hard negative samples per positive sample in DPR fine tuning | 0 | int |
--bs | batch size in DPR fine tuning | 128 | int |
--epochs | number of epochs | 3 | int |
--eval_every | evaluate DPR performance every specified number of training steps | 87 | int |
--lr | learning rate of AdamW optimizer | 1e-5 | float |
--warmup | number of warmup steps | 20 | int |
--query_encoder | pretrained model name on HF model hub as the starting point of the query encoder | microsoft/infoxlm-base | str (any model on the Hugging Face model hub that can be loaded by the AutoModel.from_pretrained method) |
--doc_encoder | pretrained model name on HF model hub as the starting point of the document encoder | microsoft/infoxlm-base | str (any model on the Hugging Face model hub that can be loaded by the AutoModel.from_pretrained method) |
--max_len_query | maximum sequence length of the query | 64 | int |
--max_len_passage | maximum sequence length of the document passage | 500 | int |
There are some other hyperparameters that can be further customized through the Haystack DPR APIs. Please refer to the Haystack source code for more information.
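As an illustrative sketch only (assuming scripts/run_dpr_training.sh forwards these arguments to the training code; the values shown are placeholders, not recommendations), a fine-tuning run might be launched like this:

```bash
# Sketch: fine-tune DPR with one hard negative per positive and a smaller batch size.
# Flag names come from the table above; values are examples only.
bash scripts/run_dpr_training.sh \
  --num_hard_neg 1 \
  --bs 64 \
  --epochs 3 \
  --eval_every 87 \
  --lr 1e-5 \
  --warmup 20 \
  --query_encoder microsoft/infoxlm-base \
  --doc_encoder microsoft/infoxlm-base \
  --max_len_query 64 \
  --max_len_passage 500
```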
You can use command line arguments to the scripts/run_distributed_indexing.sh script to customize the indexing pipeline with the functionality already implemented in our use case. Advanced users can write their own methods and incorporate customized functionality into src/gen-sods-doc-image-ray.py.
Command line argument | Options | What it does |
---|---|---|
--retrieval_method | all, dpr, bm25 | all: index into both the Elasticsearch and PostgreSQL databases for BM25 and DPR retrieval. dpr: only index into the PostgreSQL database. bm25: only index into the Elasticsearch database. |
--add_doc | this is a "store_true" flag, you either include it or not in the command | if included, the pipeline will add documents into database(s) |
--embed_doc | this is a "store_true" flag, you either include it or not in the command | if included, the pipeline will embed documents with DPR passage encoder into a FAISS index file |
--toy_example | this is a "store_true" flag, you either include it or not in the command | if included, the pipeline will run a toy example with a small subset of images |
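For a quick smoke test, a minimal run over the toy subset might look like the sketch below (assumed flag handling; BM25-only, so no DPR embedding step is needed):

```bash
# Sketch: index the small toy subset of images into Elasticsearch only (BM25 retrieval).
bash scripts/run_distributed_indexing.sh \
  --retrieval_method bm25 \
  --add_doc \
  --toy_example
```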
Command line argument | What it is | Default value |
---|---|---|
--index_name | name of the index table to be stored in database(s) | dureadervis-documents |
--index_file | path to the faiss index file in haystack-api container | /home/user/output/index_files/faiss-indexfile.faiss |
You can experiment with the following FAISS document store parameters to tune the DPR retrieval performance. Higher values give better recall and MRR, but slower retrieval.
Command line argument | What it is | Default value |
---|---|---|
--faiss_nlinks | n_links param for faiss document store | 512 |
--faiss_efsearch | ef_search param for faiss document store | 128 |
--faiss_efconstruct | ef_construct param for faiss document store | 200 |
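As a sketch of how these options fit together (assumed flag handling; the index file path is the default listed above), a DPR-only indexing run with explicit FAISS settings might look like:

```bash
# Sketch: add and embed documents for DPR retrieval with explicit FAISS HNSW parameters.
# Values are examples; higher n_links/ef_search/ef_construct trade retrieval speed for recall/MRR.
bash scripts/run_distributed_indexing.sh \
  --retrieval_method dpr \
  --add_doc \
  --embed_doc \
  --index_name dureadervis-documents \
  --index_file /home/user/output/index_files/faiss-indexfile.faiss \
  --faiss_nlinks 512 \
  --faiss_efsearch 128 \
  --faiss_efconstruct 200
```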
The following image preprocessing methods are supported through command line arguments in scripts/run_distributed_indexing.sh.
Method | What it does | command line argument |
---|---|---|
grayscale | turn a color image into a grayscale image | --preprocess grayscale |
binarize | turn a color image into a black-and-white image with locally adaptive thresholding | --preprocess binarize |
none | do nothing to the image, use the original | --preprocess none |
crop_image | crop images to a certain size, you can customize the cropping in src/utils.py | --crop_image |
Currently we support the following two OCR engines; you can pick one through command line arguments in scripts/run_distributed_indexing.sh.
OCR engine | What it is | command line argument with example values |
---|---|---|
PaddleOCR | OCR engine developed by Baidu https://github.com/PaddlePaddle/PaddleOCR | --ocr_engine paddleocr |
Tesseract | OCR engine developed by HP and Google https://github.com/tesseract-ocr/tesseract | --ocr_engine tesseract --ocr_lang chi_sim |
Both PaddleOCR and Tesseract support multiple languages; please refer to their documentation for the list of supported languages.
- For PaddleOCR: you need to change the `lang` arg in `src/test_pocr.py` for a language other than Chinese.
- For Tesseract: you need to install additional language packages if you want to extract text in a language other than English (see the sketch below).
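As an illustration only (not part of this repository's scripts), installing the Simplified Chinese language data used by the --ocr_lang chi_sim example might look like this on Debian/Ubuntu; package names differ across distributions and Tesseract versions:

```bash
# Sketch: install Tesseract plus the Simplified Chinese language data on Debian/Ubuntu.
sudo apt-get install -y tesseract-ocr tesseract-ocr-chi-sim

# Verify that the language is now available to Tesseract.
tesseract --list-langs
```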
Advanced users: if you want to use another OCR engine, you need to implement your own OCR method and change the source code of src/gen-sods-doc-image-ray.py accordingly.
The following passage generation methods are supported through command line arguments in scripts/run_distributed_indexing.sh:
- Splitting
  This method splits document text into passages with a specified maximum length and overlap; each passage must be at least a specified minimum length.
  For example: `--max_seq_len_passage 500 --overlap 10 --min_chars 5 --split_doc`
  The `--split_doc` flag tells the pipeline to split the document, and the other args specify the splitting params: each passage has a max length of 500 characters with a 10-character overlap, and the minimum passage length is 5; any passage shorter than 5 characters will be discarded.
- Clustering
  This method clusters the text blocks extracted by the OCR engine into passages. The blocks are first embedded into vectors with a pretrained language model, principal component analysis (PCA) is applied to reduce the dimensionality of the vectors, K-means clustering is performed on the reduced vectors, and finally the blocks belonging to the same cluster are concatenated to form a passage. There is an option to further split the clustered passages into shorter ones, and an option to either force the number of clusters to be 2 or dynamically determine the number of clusters based on the length of the document.
  For example:
  `--cluster_doc --cluster_model microsoft/infoxlm-base --max_seq_len_passage 500`
  The command line arguments above will use the specified model (microsoft/infoxlm-base here) to embed the text blocks and dynamically decide the number of clusters based on the total length of the document and the specified max passage length (500 here).
  `--cluster_doc --cluster_model microsoft/infoxlm-base --force_num_cluster`
  The command line arguments above will force the number of clusters to be 2.
  `--cluster_doc --cluster_model microsoft/infoxlm-base --force_num_cluster --split_doc --max_seq_len_passage 500 --overlap 10 --min_chars 5`
  The command line arguments above will force the number of clusters to be 2, and then for each cluster split the text into passages with the specified max length, overlap, and minimum length.
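Putting the pieces together, an end-to-end indexing command might look like the sketch below (assumed flag handling; values are examples that you should adjust to your own setup):

```bash
# Sketch: index documents for both BM25 and DPR retrieval, binarize images,
# run PaddleOCR, and cluster the OCR text blocks into passages.
bash scripts/run_distributed_indexing.sh \
  --retrieval_method all \
  --add_doc \
  --embed_doc \
  --index_name dureadervis-documents \
  --preprocess binarize \
  --ocr_engine paddleocr \
  --cluster_doc \
  --cluster_model microsoft/infoxlm-base \
  --max_seq_len_passage 500
```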
In most cases, you don't need to change the docker-compose yaml files. You can pick one of the three docker-compose yaml files we provide for the three retrieval methods: bm25, dpr, and ensemble.
The env config file may need editing. The variables are described below. In most cases, you just need to change `PIPELINE_PATH` to point to your customized pipeline yaml file and leave the other variables unchanged, as long as you follow the folder structure described here. If you changed the index name, you need to update the `INDEX_NAME` variable accordingly.
Variable | What it does | Value |
---|---|---|
PIPELINE_NAME | functionality of the pipeline | fixed at 'query', no need to change |
PIPELINE_PATH | path to pipeline yaml file (path inside haystack-api container) | Example: /home/user/application/configs/pipelines_ensemble.haystack-pipeline.yml |
APPLICATION | volume to be mounted to haystack-api container that has the use case resource code | <your-path-to-source-code> |
MODEL | volume to be mounted to haystack-api container that has the DPR models | <your-path-to-dpr-models> |
ESDB | volume to be mounted to elasticsearch container that has the elasticsearch database files | <your-path-to-esdb-files> |
DB | volume to be mounted to postgresql container that has the postgresql database files | <your-path-to-postgresqldb-files> |
INDEX_NAME | name of the index table | must match the index name specified in indexing |
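For illustration, an env config file following the variables above might look like the sketch below; the host paths are placeholders to replace with your own:

```bash
# Sketch of an env config file; replace the placeholder paths with your own.
PIPELINE_NAME=query
PIPELINE_PATH=/home/user/application/configs/pipelines_ensemble.haystack-pipeline.yml
APPLICATION=/path/to/your/use-case-source-code
MODEL=/path/to/your/dpr-models
ESDB=/path/to/your/esdb-files
DB=/path/to/your/postgresqldb-files
INDEX_NAME=dureadervis-documents
```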
You can use one of the three pipeline yaml files we provide as a starting point. The params that you can customize are listed in the table below.
Retrieval method | Component | Param | What it is |
---|---|---|---|
BM25, Ensemble | DocumentStore | index | the index name specified in indexing |
DPR, Ensemble | DocumentStore | faiss_index_path | path to the faiss index file (.faiss file) in the haystack-api container |
DPR, Ensemble | DensePassageRetriever | query_embedding_model, passage_embedding_model | path to the fine-tuned local DPR models or pretrained model names on Huggingface model hub |
Ensemble | Ensembler | weights | weighting factors for BM25 and DPR retrievers, the first number is BM25 weight and the second number is DPR weight |
BM25, DPR, Ensemble | Retriever | top_k | how many documents to retrieve by BM25 or DPR retrievers |
Ensemble | Ensembler | top_k_join | how many documents to be returned by ensembler |
Important:
- Please do not change the filename of the UI config file, leave it as "config.yml".
- Do not change the volume mount path in the UI container in the docker-compose yaml file.
There are two parts in the UI config file:
- dataset: only one dataset is allowed
- pipelines: can have multiple pipelines
To customize the UI:
- dataset: specify the name of your own dataset and some questions that you want to show as examples on the webpage.
- pipelines: you can copy one of the pipelines in the UI config file, paste it at the bottom of the file, and modify it as needed. For example, suppose you have a customized pipeline config yaml called "pipelines.Ranker.haystack-pipeline.yml" in which a Retriever is followed by a Ranker. The pipeline yaml file may look similar to the following:
components:
- name: DocumentStore
type: ElasticsearchDocumentStore
params:
host: localhost
index: document
- name: Retriever
type: BM25Retriever
params:
document_store: DocumentStore
top_k: 100
- name: Ranker
type: SentenceTransformersRanker
params:
model_name_or_path: /home/user/model/my_ranker_model
top_k: 5
Then, in the pipelines section of the UI config file, you can add a sub-section like the following at the bottom of the file.
- name: pipelines.Ranker.haystack-pipeline.yml
top_k_sliders:
- name: answer
desc: "Max. number of answers"
default_value: 5
keys:
- key: Ranker # match component name in pipeline yaml
param: top_k # match param name of Ranker in pipeline yaml
- name: retriever
desc: "Max. number of documents from retriever"
default_value: 100
keys:
- key: Retriever # match component name in pipeline yaml
param: top_k # match param name of Retriever in pipeline yaml
The most important variables are the `keys`: each `key` must match the `name` of the component in the pipeline yaml file, and the `param` in the UI config should match the param name of that component in the pipeline yaml.