
Research

BGE-M3 (Paper, Code)

In this project, we introduce BGE-M3, the first embedding model that supports:

Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval.
Multi-Linguality: It can support more than 100 working languages.
Multi-Granularity: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.
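The three retrieval functionalities listed above can be illustrated with a small scoring sketch. This is not the BGE-M3 implementation: the function names, the toy vectors, and the equal default weights in `hybrid_score` are assumptions made for illustration. The sketch treats dense scoring as an inner product of single vectors, sparse scoring as a sum of weight products over shared tokens, and multi-vector scoring as an averaged best match per query token vector.

```python
def dense_score(q_vec, d_vec):
    """Dense retrieval: inner product of two single-vector embeddings."""
    return sum(q * d for q, d in zip(q_vec, d_vec))


def sparse_score(q_weights, d_weights):
    """Sparse retrieval: sum of weight products over tokens shared by both sides."""
    return sum(w * d_weights[tok] for tok, w in q_weights.items() if tok in d_weights)


def multi_vector_score(q_vecs, d_vecs):
    """Multi-vector retrieval: for each query token vector, take its best match
    among the document token vectors, then average over query tokens."""
    best = [max(dense_score(q, d) for d in d_vecs) for q in q_vecs]
    return sum(best) / len(q_vecs)


def hybrid_score(dense, sparse, multi, weights=(1.0, 1.0, 1.0)):
    """Weighted average of the three scores (the weights are hypothetical)."""
    w_d, w_s, w_m = weights
    return (w_d * dense + w_s * sparse + w_m * multi) / (w_d + w_s + w_m)


if __name__ == "__main__":
    d = dense_score([1.0, 0.0], [0.5, 0.5])
    s = sparse_score({"a": 0.5, "b": 1.0}, {"b": 2.0, "c": 3.0})
    m = multi_vector_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.5, 0.5]])
    print(d, s, m, hybrid_score(d, s, m))
```

In practice the weights would be tuned per task, and the three scores may need rescaling before they are combined.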

The training code and fine-tuning data will be open-sourced in the near future.

Visualized-BGE

In this project, we introduce Visualized-BGE, which integrates image token embedding into the BGE Text Embedding framework. Visualized-BGE can be used for various hybrid modal retrieval tasks, such as Multi-Modal Knowledge Retrieval, Composed Image Retrieval, and Knowledge Retrieval with Multi-Modal Queries.

Our model delivers outstanding zero-shot performance across multiple hybrid modal retrieval tasks. It can also serve as a base model for downstream fine-tuning for hybrid modal retrieval tasks.

LongLLM QLoRA

We extend the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA fine-tuning. The entire training cycle is efficient: it takes 8 hours on one 8xA800 (80G) GPU machine, and the context length can go far beyond 80K with more computing resources. The resulting model exhibits superior performance across a broad range of evaluation tasks, such as NIHS, topic retrieval, and long-context language understanding, while preserving the original capability over short contexts.

Activation Beacon

The utilization of long contexts poses a big challenge for large language models due to their limited context window length. Activation Beacon condenses an LLM's raw activations into more compact forms so that the model can perceive a much longer context within a limited context window. It is an effective, efficient, compatible, and low-cost (in training) method for extending the context length of an LLM. For more details, please refer to our paper and code.

LM-Cocktail

LM-Cocktail automatically merges fine-tuned models and the base model, using a simple function to compute the merging weights. It can be used to improve performance on a target domain without degrading general capabilities outside that domain, and to generate a model for new tasks without fine-tuning. You can use it to merge LLMs (e.g., Llama) or embedding models. For more details, please refer to our report (LM-Cocktail) and code.
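To make the idea of merging by computed weights concrete, here is a minimal sketch. It is not the LM-Cocktail API: `merging_weights` and `merge_models` are hypothetical helpers, and the weighting rule (a softmax over negative losses measured on a few target-domain examples, with a `temperature` parameter) is an assumption about what such a "simple function" could look like. Models are represented as plain dictionaries of flat parameter lists.

```python
import math


def merging_weights(losses, temperature=1.0):
    """Turn per-model losses into weights: lower loss gets a larger weight.
    Assumed rule: softmax over negative losses (not confirmed by this README)."""
    scaled = [-loss / temperature for loss in losses]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def merge_models(state_dicts, weights):
    """Weighted average of parameters, name by name.
    Each state dict maps a parameter name to a flat list of floats."""
    merged = {}
    for name in state_dicts[0]:
        params = [sd[name] for sd in state_dicts]
        merged[name] = [
            sum(w * p[i] for w, p in zip(weights, params))
            for i in range(len(params[0]))
        ]
    return merged


if __name__ == "__main__":
    base = {"layer.weight": [0.0, 2.0]}
    tuned = {"layer.weight": [2.0, 4.0]}
    weights = merging_weights([1.0, 1.0])  # equal losses give equal weights
    print(weights, merge_models([base, tuned], weights))
```

Averaging in parameter space only makes sense when all models share the same architecture and parameter names, which holds for a base model and its fine-tuned variants.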

LLM Embedder

LLM Embedder is fine-tuned on feedback from LLMs. It supports the retrieval augmentation needs of large language models, including knowledge retrieval, memory retrieval, example retrieval, and tool retrieval. It is fine-tuned over 6 tasks: Question Answering, Conversational Search, Long Conversation, Long-Range Language Modeling, In-Context Learning, and Tool Learning. For more details, please refer to our report and ./llm_embedder/README.md.

BGE Reranker

A cross-encoder performs full attention over the input pair, which makes it more accurate than an embedding model (i.e., a bi-encoder) but also more time-consuming. It can therefore be used to re-rank the top-k documents returned by an embedding model. We train the cross-encoder on multilingual pair data. The data format is the same as for the embedding model, so you can fine-tune it easily by following our example. For more details, please refer to ./reranker/README.md.
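The retrieve-then-rerank pattern described above can be sketched as follows. The two scoring functions are toy stand-ins, not BGE models: `toy_embed_score` plays the role of a fast bi-encoder and `toy_cross_score` the role of a slower, more precise cross-encoder. All names here are hypothetical.

```python
def retrieve_then_rerank(query, docs, embed_score, cross_score, top_k=2):
    """Stage 1: keep the top_k docs by the cheap embedding score.
    Stage 2: re-order only those candidates with the expensive cross score."""
    candidates = sorted(docs, key=lambda d: embed_score(query, d), reverse=True)[:top_k]
    return sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)


def toy_embed_score(query, doc):
    """Stand-in for a bi-encoder: number of words shared by query and doc."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def toy_cross_score(query, doc):
    """Stand-in for a cross-encoder: word overlap normalized by doc length."""
    return toy_embed_score(query, doc) / len(doc.lower().split())


if __name__ == "__main__":
    docs = [
        "bge reranker cross encoder model for search",
        "bge reranker",
        "unrelated text",
    ]
    print(retrieve_then_rerank("bge reranker", docs, toy_embed_score, toy_cross_score))
```

The cross-encoder only sees `top_k` candidates, so its higher per-pair cost stays bounded regardless of corpus size.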

LLM Reranker

We provide a new version of the cross-encoder that supports more languages and longer inputs. The data format is similar to that of our embedding models, but now includes prompt data for fine-tuning and inference. You can perform inference using specific layers or all layers. You can fine-tune it easily by following our example. For more details, please refer to ./llm_reranker/README.md.

BGE Embedding

BGE embedding is a general-purpose embedding model. We pre-train the models using RetroMAE and then train them on large-scale pair data using contrastive learning. You can fine-tune the embedding model on your data by following our examples. We also provide a pre-training example. Note that the goal of pre-training is to reconstruct the text, so the pre-trained model cannot be used for similarity calculation directly; it needs to be fine-tuned first. Refer to our report (C-Pack) and code for more details.
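As an illustration of contrastive learning on pair data, the sketch below computes an InfoNCE-style loss with in-batch negatives: `passages[i]` is the positive for `queries[i]`, and every other passage in the batch acts as a negative. This is a generic formulation, not the BGE training code; the function name and the default `temperature` of 0.05 are assumptions.

```python
import math


def dot(a, b):
    """Inner product of two vectors given as lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(queries, passages, temperature=0.05):
    """Mean cross-entropy over the batch, where the correct 'class' for
    query i is passage i and all other passages serve as negatives."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, p) / temperature for p in passages]
        top = max(logits)  # log-sum-exp with max subtraction for stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


if __name__ == "__main__":
    vectors = [[1.0, 0.0], [0.0, 1.0]]
    print(info_nce_loss(vectors, vectors))        # positives aligned: low loss
    print(info_nce_loss(vectors, vectors[::-1]))  # positives swapped: high loss
```

Minimizing this loss pulls each query toward its paired passage and pushes it away from the other passages in the batch.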

BGE uses the last hidden state of [CLS] as the sentence embedding: sentence_embeddings = model_output[0][:, 0]. Using mean pooling instead causes a significant decrease in performance.
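The difference between the two pooling strategies can be shown with a small sketch. Nested Python lists of shape (batch, sequence, hidden) stand in for the tensor `model_output[0]`; the helper names and the toy values are assumptions for illustration only.

```python
def cls_pooling(last_hidden_state):
    """Take the first token vector ([CLS]) of every sequence.
    Equivalent to last_hidden_state[:, 0] on a tensor."""
    return [sequence[0] for sequence in last_hidden_state]


def mean_pooling(last_hidden_state, attention_mask):
    """Average the token vectors of every sequence, skipping padded positions."""
    pooled = []
    for sequence, mask in zip(last_hidden_state, attention_mask):
        kept = [vec for vec, keep in zip(sequence, mask) if keep]
        pooled.append([sum(column) / len(kept) for column in zip(*kept)])
    return pooled


if __name__ == "__main__":
    # One sequence of three tokens with hidden size 2; the last token is padding.
    hidden = [[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]]
    mask = [[1, 1, 0]]
    print(cls_pooling(hidden))         # the [CLS] vector only
    print(mean_pooling(hidden, mask))  # average of the two real tokens
```

The two functions generally return different vectors for the same input, so a model trained with one pooling strategy should be used with that same strategy at inference time.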

C-MTEB

A benchmark for Chinese text embedding. This benchmark has been merged into MTEB. Refer to our report (C-Pack) and code for more details.
