gpustack/bge-m3-GGUF

Name: gpustack/bge-m3-GGUF
Author: gpustack

High-quality GGUF model

2.0K 📥 Downloads

14 ❤️ Likes

9 📁 GGUF Files

4.51 GB 💾 Total Size

2 years ago 🔄 Last Updated

📋 Model Description

license: mit
pipeline_tag: sentence-similarity
tags:

sentence-transformers
feature-extraction
sentence-similarity
text-embeddings-inference

bge-m3-GGUF

Model creator: BAAI

Original model: bge-m3

GGUF quantization: based on llama.cpp release 61408e7f

For more details please refer to our github repo: https://github.com/FlagOpen/FlagEmbedding

BGE-M3 (paper, code)

In this project, we introduce BGE-M3, which is distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.

Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval.
Multi-Linguality: It can support more than 100 working languages.
Multi-Granularity: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.

Some suggestions for retrieval pipeline in RAG

We recommend the following pipeline: hybrid retrieval + re-ranking.

Hybrid retrieval leverages the strengths of various methods, offering higher accuracy and stronger generalization capabilities.

A classic example: using both embedding retrieval and the BM25 algorithm.
Now, you can try to use BGE-M3, which supports both embedding and sparse retrieval.
This allows you to obtain token weights (similar to BM25) at no additional cost when generating dense embeddings.
To use hybrid retrieval, you can refer to Vespa and Milvus.

As a cross-encoder model, a re-ranker demonstrates higher accuracy than a bi-encoder embedding model.

Utilizing the re-ranking model (e.g., bge-reranker, bge-reranker-v2) after retrieval can further filter the selected text.
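The two-stage pipeline described above can be sketched in plain Python. This is only an illustration under stated assumptions: the document ids and all scores are made up, and `hybrid_then_rerank` is a hypothetical helper, not part of FlagEmbedding, Vespa, or Milvus. In practice the first-stage scores would come from the dense and sparse outputs of BGE-M3 and the second-stage scores from a re-ranker such as bge-reranker.

```python
from typing import Callable, Dict, List


def hybrid_then_rerank(
    dense_scores: Dict[str, float],
    sparse_scores: Dict[str, float],
    rerank: Callable[[str], float],
    first_stage_k: int = 3,
    final_k: int = 2,
) -> List[str]:
    """Merge the top candidates of two retrievers, then re-rank the union."""

    def top_k(scores: Dict[str, float], k: int) -> List[str]:
        return sorted(scores, key=scores.get, reverse=True)[:k]

    # Stage 1: union of the best candidates from each retrieval mode.
    candidates = set(top_k(dense_scores, first_stage_k)) | set(
        top_k(sparse_scores, first_stage_k)
    )
    # Stage 2: re-score only the candidates with the more expensive re-ranker.
    # Ties are broken by document id so the output is deterministic.
    reranked = sorted(candidates, key=lambda doc_id: (-rerank(doc_id), doc_id))
    return reranked[:final_k]


# Made-up scores for four hypothetical documents.
dense = {"doc_a": 0.62, "doc_b": 0.35, "doc_c": 0.30, "doc_d": 0.10}
sparse = {"doc_a": 0.20, "doc_b": 0.01, "doc_c": 0.00, "doc_d": 0.18}
reranker_scores = {"doc_a": 0.91, "doc_b": 0.40, "doc_c": 0.15, "doc_d": 0.77}

result = hybrid_then_rerank(
    dense, sparse, reranker_scores.__getitem__, first_stage_k=2, final_k=2
)
print(result)
```

Here `doc_d` is missed by the dense retriever's top 2 but recovered by the sparse one, which is the kind of case hybrid retrieval is meant to cover.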

News:

2024/7/1: We updated the MIRACL evaluation results of BGE-M3. To reproduce the new results, you can refer to: bge-m3_miracl_2cr. We have also updated our paper on arXiv.

Details

The previous test results were lower because we mistakenly removed the passages that have the same id as the query from the search results. After correcting this mistake, the overall performance of BGE-M3 on MIRACL is higher than the previous results, but the experimental conclusion remains unchanged. The other results are not affected by this mistake. To reproduce the previous lower results, you need to add the --remove-query parameter when using pyserini.search.faiss or pyserini.search.lucene to search the passages.

2024/3/20: Thanks to the Milvus team! You can now use hybrid retrieval of bge-m3 in Milvus: pymilvus/examples/hello_hybrid_sparse_dense.py.
2024/3/8: Thanks for the experimental results from @Yannael. In this benchmark, BGE-M3 achieves top performance in both English and other languages, surpassing models such as OpenAI.
2024/3/2: Release unified fine-tuning example and data
2024/2/6: We release the MLDR (a long document retrieval dataset covering 13 languages) and evaluation pipeline.
2024/2/1: Thanks for the excellent tool from Vespa. You can easily use multiple modes of BGE-M3 following this notebook

Specs

Model

Model Name	Dimension	Sequence Length	Introduction
BAAI/bge-m3	1024	8192	multilingual; unified fine-tuning (dense, sparse, and colbert) from bge-m3-unsupervised
BAAI/bge-m3-unsupervised	1024	8192	multilingual; contrastive learning from bge-m3-retromae
BAAI/bge-m3-retromae	--	8192	multilingual; extend the max_length of xlm-roberta to 8192 and further pretrained via retromae
BAAI/bge-large-en-v1.5	1024	512	English model
BAAI/bge-base-en-v1.5	768	512	English model
BAAI/bge-small-en-v1.5	384	512	English model

Data

Dataset	Introduction
MLDR	Document Retrieval Dataset, covering 13 languages
bge-m3-data	Fine-tuning data used by bge-m3

FAQ

1. Introduction for different retrieval methods

Dense retrieval: map the text into a single embedding, e.g., DPR, BGE-v1.5
Sparse retrieval (lexical matching): a vector of size equal to the vocabulary, with the majority of positions set to zero, calculating a weight only for tokens present in the text. e.g., BM25, unicoil, and splade
Multi-vector retrieval: use multiple vectors to represent a text, e.g., ColBERT.

2. How to use BGE-M3 in other projects?

For embedding retrieval, you can employ the BGE-M3 model using the same approach as BGE.
The only difference is that the BGE-M3 model no longer requires adding instructions to the queries.

For hybrid retrieval, you can use Vespa and Milvus.

3. How to fine-tune bge-M3 model?

You can follow the common approach in this example to fine-tune the dense embedding.

If you want to fine-tune all embedding functions of M3 (dense, sparse, and colbert), you can refer to the unified_fine-tuning example.

Usage

Install:
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install -e .

or:
pip install -U FlagEmbedding

Generate Embedding for text

Dense Embedding

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)  # Setting use_fp16 to True speeds up computation with a slight performance degradation

sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

embeddings_1 = model.encode(sentences_1,
                            batch_size=12,
                            max_length=8192,  # If you don't need such a long length, you can set a smaller value to speed up the encoding process.
                            )['dense_vecs']
embeddings_2 = model.encode(sentences_2)['dense_vecs']
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
# [[0.6265, 0.3477], [0.3499, 0.678 ]]
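To make the matrix product in the dense example concrete, here is a minimal sketch in plain Python. It assumes the dense vectors are L2-normalized, in which case the dot product equals cosine similarity. The 3-dimensional vectors are toy values, not real BGE-M3 embeddings (which have 1024 dimensions), and the helper names are hypothetical.

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(
    queries: List[List[float]], passages: List[List[float]]
) -> List[List[float]]:
    """Plain-Python equivalent of `queries @ passages.T`."""
    return [[sum(a * b for a, b in zip(q, p)) for p in passages] for q in queries]


# Toy vectors standing in for two query and two passage embeddings.
queries = [l2_normalize(v) for v in ([1.0, 0.0, 0.0], [0.0, 2.0, 0.0])]
passages = [l2_normalize(v) for v in ([3.0, 4.0, 0.0], [0.0, 1.0, 1.0])]

scores = similarity_matrix(queries, passages)
# Index of the best-matching passage for each query (row-wise argmax).
best = [max(range(len(row)), key=row.__getitem__) for row in scores]
print(scores)
print(best)
```

Row i of the matrix holds the scores of query i against every passage, so ranking passages for a query is a sort over one row.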

You can also use sentence-transformers and huggingface transformers to generate dense embeddings.
Refer to baai_general_embedding for details.

Sparse Embedding (Lexical Weight)

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)  # Setting use_fp16 to True speeds up computation with a slight performance degradation

sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=False)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=False)

# you can see the weight for each token:
print(model.convert_id_to_token(output_1['lexical_weights']))
# [{'What': 0.08356, 'is': 0.0814, 'B': 0.1296, 'GE': 0.252, 'M': 0.1702, '3': 0.2695, '?': 0.04092},
#  {'De': 0.05005, 'fin': 0.1368, 'ation': 0.04498, 'of': 0.0633, 'BM': 0.2515, '25': 0.3335}]

# compute the scores via lexical matching
lexical_scores = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][0])
print(lexical_scores)
# 0.19554901123046875

print(model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_1['lexical_weights'][1]))
# 0.0
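The 0.0 score between the two queries can be checked by hand from the token weights printed above. The sketch below assumes the lexical matching score is the sum, over tokens shared by both texts, of the product of their weights. `lexical_matching_score` is a hypothetical re-implementation keyed on token strings, whereas the library works on token ids, and `passage` is a made-up example.

```python
from typing import Dict


def lexical_matching_score(
    weights_a: Dict[str, float], weights_b: Dict[str, float]
) -> float:
    """Sum of weight products over the tokens present in both texts."""
    score = 0.0
    for token, weight in weights_a.items():
        if token in weights_b:
            score += weight * weights_b[token]
    return score


# Token weights copied from the printed output for the two queries.
query_1 = {'What': 0.08356, 'is': 0.0814, 'B': 0.1296, 'GE': 0.252,
           'M': 0.1702, '3': 0.2695, '?': 0.04092}
query_2 = {'De': 0.05005, 'fin': 0.1368, 'ation': 0.04498, 'of': 0.0633,
           'BM': 0.2515, '25': 0.3335}

# The two queries share no tokens, so the score is zero.
print(lexical_matching_score(query_1, query_2))

# Made-up weights for a hypothetical passage that mentions "BM25".
passage = {'BM': 0.2, '25': 0.3, 'ranks': 0.1}
print(lexical_matching_score(query_2, passage))
```

Only the overlapping tokens ('BM' and '25' in the second call) contribute, which is why this mode behaves like a learned form of lexical matching.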

Multi-Vector (ColBERT)

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3',  use_fp16=True)

sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.", 
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=True)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=True)

print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][0]))
print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][1]))
# 0.7797
# 0.4620
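For intuition about the multi-vector score, here is a minimal sketch of ColBERT-style late interaction (MaxSim) in plain Python: each query token vector is matched with its most similar passage token vector, and the maxima are averaged. Averaging over query tokens is an assumption about how the library normalizes the score. The 2-dimensional vectors are toy values and `colbert_maxsim` is a hypothetical helper.

```python
from typing import List


def colbert_maxsim(
    query_vecs: List[List[float]], passage_vecs: List[List[float]]
) -> float:
    """Mean over query tokens of the best dot product with any passage token."""
    total = 0.0
    for q in query_vecs:
        total += max(sum(a * b for a, b in zip(q, p)) for p in passage_vecs)
    return total / len(query_vecs)


# Toy unit vectors: two query tokens and two passage tokens.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
passage_vecs = [[1.0, 0.0], [0.6, 0.8]]

score = colbert_maxsim(query_vecs, passage_vecs)
print(score)
```

The first query token matches the first passage token exactly (1.0), the second is best matched by the second passage token (0.8), so the score is their mean.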

Compute score for text pairs

Given a list of text pairs, you can get the scores computed by different methods.

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3',  use_fp16=True)

sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.", 
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

sentence_pairs = [[i, j] for i in sentences_1 for j in sentences_2]

print(model.compute_score(sentence_pairs,
                          max_passage_length=128,  # a smaller max length leads to a lower latency
                          weights_for_different_modes=[0.4, 0.2, 0.4]))  # weights_for_different_modes (w) is used to do a weighted sum: w[0]*dense_score + w[1]*sparse_score + w[2]*colbert_score

{
  'colbert': [0.7796499729156494, 0.4621465802192688, 0.4523794651031494, 0.7898575067520142], 
  'sparse': [0.195556640625, 0.00879669189453125, 0.0, 0.1802978515625], 
  'dense': [0.6259765625, 0.347412109375, 0.349853515625, 0.67822265625], 
  'sparse+dense': [0.482503205537796, 0.23454029858112335, 0.2332356721162796, 0.5122477412223816], 
  'colbert+sparse+dense': [0.6013619303703308, 0.3255828022956848, 0.32089319825172424, 0.6232916116714478]
}
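The combined entries in the printed output can be recomputed from the per-mode scores. The sketch below assumes the weighted sum is divided by the sum of the weights in use, which is consistent with the printed numbers. `combine` is a hypothetical helper, and the small remaining differences are most likely due to reduced-precision arithmetic in the model.

```python
from typing import List

# Per-mode scores and weights copied from the example above.
printed = {
    'colbert': [0.7796499729156494, 0.4621465802192688, 0.4523794651031494, 0.7898575067520142],
    'sparse': [0.195556640625, 0.00879669189453125, 0.0, 0.1802978515625],
    'dense': [0.6259765625, 0.347412109375, 0.349853515625, 0.67822265625],
    'sparse+dense': [0.482503205537796, 0.23454029858112335, 0.2332356721162796, 0.5122477412223816],
    'colbert+sparse+dense': [0.6013619303703308, 0.3255828022956848, 0.32089319825172424, 0.6232916116714478],
}
weights = [0.4, 0.2, 0.4]  # order: dense, sparse, colbert


def combine(mode_scores: List[float], mode_weights: List[float]) -> float:
    """Weighted sum of the scores, normalized by the total weight."""
    weighted = sum(w * s for w, s in zip(mode_weights, mode_scores))
    return weighted / sum(mode_weights)


sparse_dense = [
    combine([d, s], weights[:2])
    for d, s in zip(printed['dense'], printed['sparse'])
]
all_modes = [
    combine([d, s, c], weights)
    for d, s, c in zip(printed['dense'], printed['sparse'], printed['colbert'])
]

for i in range(4):
    print(i, round(sparse_dense[i], 6), round(all_modes[i], 6))
```

Because of the normalization, the 'sparse+dense' entry is not simply 0.4*dense + 0.2*sparse: it is that sum divided by 0.6.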

Evaluation

We provide evaluation scripts for MKQA and MLDR.

Benchmarks from the open-source community

The BGE-M3 model emerged as the top performer on this benchmark (OAI is short for OpenAI). For more details, please refer to the article and GitHub repo.

Our results

Multilingual (Miracl dataset)

[Figure: results on the MIRACL dataset]

Cross-lingual (MKQA dataset)

[Figure: results on the MKQA dataset]

Long Document Retrieval

- MLDR: [Figure: results on MLDR]

Please note that MLDR is a document retrieval dataset we constructed via LLM, covering 13 languages and including a test set, a validation set, and a training set. We used the MLDR training set to enhance the model's long document retrieval capabilities. Therefore, comparing baselines with Dense w.o.long (fine-tuning without the long document dataset) is more equitable. Additionally, this long document retrieval dataset will be open-sourced to address the current lack of open-source multilingual long text retrieval datasets. We believe this data will help the open-source community train document retrieval models.

- NarrativeQA: [Figure: results on NarrativeQA]

Comparison with BM25

We utilized Pyserini to implement BM25, and the test results can be reproduced by this script.
We tested BM25 using two different tokenizers:
one using Lucene Analyzer and the other using the same tokenizer as M3 (i.e., the tokenizer of xlm-roberta).
The results indicate that BM25 remains a competitive baseline,
especially in long document retrieval.

[Figure: comparison of BM25 and BGE-M3]

Training

Self-knowledge Distillation: combining multiple outputs from different retrieval modes as a reward signal to enhance the performance of a single mode (especially for sparse retrieval and multi-vector (ColBERT) retrieval).

Efficient Batching: improves efficiency when fine-tuning on long text.

The small-batch strategy is simple but effective, and it can also be used to fine-tune large embedding models.

MCLS: a simple method to improve performance on long text without fine-tuning.

This method is useful if you do not have enough resources to fine-tune the model on long text.

Refer to our report for more details.

Acknowledgement

Thanks to the authors of open-source datasets, including MIRACL, MKQA, NarrativeQA, etc.
Thanks to open-source libraries such as Tevatron and Pyserini.

Citation

If you find this repository useful, please consider giving it a star ⭐ and citing it.

@misc{bge-m3,
      title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation}, 
      author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
      year={2024},
      eprint={2402.03216},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

📂 GGUF File List

📁 Filename	⚙️ Quantization	📦 Size
bge-m3-FP16.gguf	FP16	1.08 GB
bge-m3-Q2_K.gguf	Q2	349.15 MB
bge-m3-Q3_K.gguf	Q3	383.65 MB
bge-m3-Q4_0.gguf (Recommended)	Q4	402.03 MB
bge-m3-Q4_K_M.gguf	Q4	417.5 MB
bge-m3-Q5_0.gguf	Q5	438.03 MB
bge-m3-Q5_K_M.gguf	Q5	446 MB
bge-m3-Q6_K.gguf	Q6	476.28 MB
bge-m3-Q8_0.gguf	Q8	605.16 MB
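The usage examples in the model description target the original PyTorch checkpoint through FlagEmbedding. For the GGUF files listed here, one possible route is the llama-cpp-python bindings, sketched below. This is an assumption rather than something documented by this repository: it presumes `llama-cpp-python` is installed, that the chosen file has been downloaded to the hypothetical local path `MODEL_PATH`, and that the build of llama.cpp in use supports this model's architecture. Only dense embeddings are covered by this sketch.

```python
import math
import os
from typing import List

MODEL_PATH = "bge-m3-Q4_0.gguf"  # hypothetical local path to a downloaded file


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embed_with_gguf(model_path: str, texts: List[str]) -> List[List[float]]:
    """Return one dense embedding per input text from a GGUF model."""
    # Imported lazily so the rest of the module works without llama-cpp-python.
    from llama_cpp import Llama

    llm = Llama(model_path=model_path, embedding=True, verbose=False)
    response = llm.create_embedding(texts)
    return [item["embedding"] for item in response["data"]]


# Only runs when the model file is actually present on disk.
if os.path.exists(MODEL_PATH):
    vectors = embed_with_gguf(
        MODEL_PATH,
        ["What is BGE M3?", "BGE M3 is an embedding model supporting dense retrieval."],
    )
    print(len(vectors[0]), cosine(vectors[0], vectors[1]))
```

Scores from a quantized file will not match the FP16 numbers shown in the model description exactly, and lower-bit quantizations generally deviate more.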

📊 Model Information

🆔 Model ID: gpustack/bge-m3-GGUF

📅 Created: 2 years ago

🔄 Last Updated: 2 years ago

📥 Downloads: 2.0K

❤️ Likes: 14

🎯 Difficulty: Beginner

⚙️ Quantization: FP16, Q2, Q3, Q4, Q5, Q6, Q8

🏷️ Tags

sentence-transformers, gguf, feature-extraction, sentence-similarity, text-embeddings-inference, arxiv:2402.03216, arxiv:2004.04906, arxiv:2106.14807, arxiv:2107.05720, arxiv:2004.12832, license:mit, endpoints_compatible, deploy:azure, region:us
