---
language:
- en
- zh
- ja
- ko
- fr
- de
- es
- pt
- ru
- ar
license: apache-2.0
library_name: gguf
pipeline_tag: sentence-similarity
tags:
- gguf
- quantized
- embedding
- sentence-transformers
- Qwen3
- imatrix
- llama-cpp
base_model: Qwen/Qwen3-Embedding-0.6B
model_creator: Qwen
datasets:
- explodinggradients/fiqa
- PatronusAI/financebench
- zeroshot/twitter-financial-news-sentiment
- philschmid/finanical-rag-embedding-dataset
- openai/gsm8k
- DigitalLearningGmbH/MATH-lighteval
quantized_by: PeterAM4
---
Qwen3-Embedding-0.6B -- GGUF
All-in-one GGUF quantizations of Qwen/Qwen3-Embedding-0.6B, from 8-bit down to 1-bit, with importance-matrix calibration optimized for financial and technical text retrieval.
Qwen3-Embedding-0.6B is a compact, multilingual embedding model well suited for RAG pipelines, semantic search, and document retrieval. These quantizations make it practical to run on edge devices, laptops, and resource-constrained servers -- particularly for financial NLP workloads where low latency and small memory footprint matter.
The importance matrix was calibrated on a mixed corpus weighted toward financial data (financial Q&A from FiQA, SEC 10-K filings from FinanceBench, financial sentiment from Twitter, RAG pairs) alongside math reasoning and general text, so the quantized models preserve the weights most relevant to financial domain embeddings.
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-Embedding-0.6B |
| Parameters | 595,776,512 |
| Max context | 32,768 tokens |
| Pooling | Last token |
| Embedding dim | 1024 |
| License | Apache 2.0 |
| Quantized with | llama.cpp |
Why quantize?
Despite their size, neural networks are remarkably sparse in information density. Most of the 16 bits allocated per weight during training exist to make gradient descent work -- not to store knowledge. Current estimates put the actual information content at roughly 2 bits per parameter. The remaining 14 bits are redundancy.
This explains why aggressive quantization works: compressing from 16-bit to 4-bit (a 75% reduction) discards almost exclusively noise. Our benchmark data confirms this -- Q3_K_M-imat at 4.66 BPW scores within +0.62 PPL of the full BF16 baseline while being 70% smaller (331 MB vs 1.1 GB). For comparison, Q4_K_M-imat at 5.32 BPW shows +0.65 PPL -- the 3-bit model actually edges it out. The imatrix (importance matrix) is key here: by profiling which weights actually carry signal, we preserve those at higher precision and compress the rest. This is why imatrix-calibrated 3-bit models can outperform naive 5-bit quantizations.
The quality cliff appears around 4 BPW, where quantization starts cutting into real information. Below that, PPL degrades rapidly (Q2_K at 3.97 BPW: +391; IQ1_S at 2.79 BPW: +23,008). Ternary quantizations (TQ2_0, TQ1_0) diverge entirely on this architecture.
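The BPW figures above can be sanity-checked directly from the file sizes in the file list and the parameter count; a quick sketch (sizes are in MiB and include a small amount of GGUF metadata, so the result is a slight overestimate of the weight-only average):

```python
# Recompute bits per weight (BPW) from a GGUF file size and the
# 595,776,512-parameter count stated in the model card.
PARAMS = 595_776_512

def bpw(size_mib: float) -> float:
    """Average bits per weight for a GGUF file of `size_mib` MiB."""
    return size_mib * 1024 * 1024 * 8 / PARAMS

print(round(bpw(330.83), 2))  # Q3_K_M-imat -> 4.66
print(round(bpw(378.11), 2))  # Q4_K_M-imat -> 5.32
```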
Benchmark results
All models evaluated with llama-perplexity on a 22 MB calibration corpus (financial, math, and general text). Context window: 1536 tokens. Chunks: 200. Lower PPL = better.
Baseline PPL (BF16): 406.0250
Notes:
- The -imat suffix means the model was quantized with importance-matrix calibration. This is what allows 3-4 bit models to stay close to baseline -- the imatrix tells the quantizer which weights carry real information.
- Q3_K_S-imat reports an anomalously low PPL (340, below the BF16 baseline). This is a statistical artifact of the evaluation, not a genuine improvement over baseline.
- TQ2_0 / TQ1_0 (ternary quantizations) produce diverged PPL on this architecture. They require CPU or CUDA (the ternary kernels are not supported on Apple Metal) and are not usable for this model.
- Below ~4 BPW, quality degrades steeply. Below ~3 BPW, models are not recommended for any production use.
Choosing a model
| Use case | Model | Notes |
|---|---|---|
| Maximum quality | Q8_0 | Near-lossless, 610 MB |
| Best quality/size trade-off | Q3_K_M-imat | +0.62 PPL delta at 331 MB -- smallest model with near-baseline quality |
| Larger but safe margin | Q4_K_M-imat | +0.65 PPL delta at 378 MB |
| Extreme compression | Q2_K-imat | Usable for non-critical applications |
Quantization method
All models were quantized from the BF16 source using llama-quantize from llama.cpp.
Three strategies were used:
- Standard (Q8_0, Q6_K, Q5_K_M, Q5_K_S, Q5_0, Q5_1) -- uniform precision reduction, no imatrix.
- K-Quant + imatrix (Q4_K_M, Q4_K_S, Q4_0, Q4_1, Q3_K_L, Q3_K_M, Q3_K_S, Q2_K, Q2_K_S) -- block-level mixed precision, importance matrix recommended.
- Importance-weighted (IQ4_NL, IQ4_XS, IQ3_M, IQ3_S, IQ3_XS, IQ3_XXS, IQ2_M, IQ2_S, IQ2_XS, IQ2_XXS, IQ1_M, IQ1_S, TQ2_0, TQ1_0) -- non-linear quantization, imatrix required.
Calibration corpus
The importance matrix was generated from a mixed-domain corpus (22 MB, ~198,000 lines). The mix was chosen to cover the primary use case (financial text) while including general and mathematical text to maintain broad capability:
| Dataset | Source | Domain | Entries |
|---|---|---|---|
| WikiText-2 | ggml-org | General knowledge | 36,718 lines |
| Twitter Financial News | zeroshot/twitter-financial-news-sentiment | Financial sentiment | 9,543 |
| GSM8K | openai/gsm8k | Math word problems | 7,473 |
| Financial RAG | philschmid/finanical-rag-embedding-dataset | Financial Q&A pairs | 6,998 |
| FiQA | explodinggradients/fiqa | Personal finance Q&A | 5,650 |
| MATH Competition | DigitalLearningGmbH/MATH-lighteval | Competition math | 5,000 |
| FinanceBench | PatronusAI/financebench | SEC 10-K filings | 150 |
Usage
llama.cpp server
```bash
./llama-server \
  -m Qwen3-Embedding-0.6B-Q3_K_M-imat.gguf \
  --embedding --pooling last \
  -c 32768 -np 8 \
  --host 0.0.0.0 --port 8080
```
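With the server running, embeddings can be requested over its OpenAI-compatible /v1/embeddings endpoint. A minimal stdlib-only sketch (the localhost URL matches the flags above; the actual network call is left commented out so the snippet stands alone):

```python
import json
import urllib.request

def build_request(texts: list[str]) -> urllib.request.Request:
    """Build a POST to llama-server's OpenAI-compatible embeddings endpoint."""
    payload = json.dumps({"input": texts}).encode("utf-8")
    return urllib.request.Request(
        "http://localhost:8080/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

req = build_request(["Financial analysis of Q3 earnings"])
print(req.get_full_url())  # http://localhost:8080/v1/embeddings
# With the server up:
#   with urllib.request.urlopen(req) as resp:
#       print(len(json.load(resp)["data"][0]["embedding"]))  # 1024
```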
Python (llama-cpp-python)
```python
from llama_cpp import Llama, LLAMA_POOLING_TYPE_LAST

model = Llama(
    model_path="Qwen3-Embedding-0.6B-Q3_K_M-imat.gguf",
    embedding=True,
    pooling_type=LLAMA_POOLING_TYPE_LAST,
    n_ctx=32768,
)

result = model.create_embedding(["Financial analysis of Q3 earnings"])
print(len(result["data"][0]["embedding"]))  # 1024
```
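Embeddings obtained this way are typically compared with cosine similarity; a dependency-free sketch:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```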
Download a specific file
```python
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="PeterAM4/Qwen3-Embedding-0.6B-GGUF",
    filename="Qwen3-Embedding-0.6B-Q3_K_M-imat.gguf",
)
```
Technical details
- GGUF format v3
- Tokenizer: Qwen3 (151,936 tokens)
- add_eos_token: false (patched for llama.cpp compatibility; EOS token ID 151643 is still present and usable)
- Pooling type: 3 (last token)
- Hardware used: Apple M3 Pro with Metal acceleration (CPU fallback for ternary quants)
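As a toy illustration of what pooling type 3 means: the sentence embedding is the hidden state of the final token, not an average over all token states. The sketch below uses made-up 2-dim states (the real model emits 1024-dim states per token):

```python
# Toy per-token hidden states for a 3-token sequence (2-dim for brevity).
hidden_states = [
    [0.1, 0.2],  # token 0
    [0.3, 0.4],  # token 1
    [0.5, 0.6],  # token 2 (last)
]

def pool_last(states):
    """Pooling type 3: the last token's state is the embedding."""
    return states[-1]

def pool_mean(states):
    """Mean pooling, shown for contrast."""
    n = len(states)
    return [sum(row[i] for row in states) / n for i in range(len(states[0]))]

print(pool_last(hidden_states))                         # [0.5, 0.6]
print([round(v, 2) for v in pool_mean(hidden_states)])  # [0.3, 0.4]
```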
Credits
- Qwen/Qwen3-Embedding-0.6B by Alibaba Qwen Team
- llama.cpp by Georgi Gerganov et al.
License
Apache 2.0, inherited from the original model.
GGUF file list
| Filename | Size | Notes |
|---|---|---|
| Qwen3-Embedding-0.6B-BF16.gguf | 1.12 GB | Full-precision source |
| Qwen3-Embedding-0.6B-IQ1_M-imat.gguf | 205.86 MB | |
| Qwen3-Embedding-0.6B-IQ1_S-imat.gguf | 198.19 MB | |
| Qwen3-Embedding-0.6B-IQ2_M-imat.gguf | 252.45 MB | |
| Qwen3-Embedding-0.6B-IQ2_S-imat.gguf | 242.23 MB | |
| Qwen3-Embedding-0.6B-IQ2_XS-imat.gguf | 230.6 MB | |
| Qwen3-Embedding-0.6B-IQ2_XXS-imat.gguf | 218.63 MB | |
| Qwen3-Embedding-0.6B-IQ3_M-imat.gguf | 320.24 MB | |
| Qwen3-Embedding-0.6B-IQ3_S-imat.gguf | 307.89 MB | |
| Qwen3-Embedding-0.6B-IQ3_XS-imat.gguf | 298.04 MB | |
| Qwen3-Embedding-0.6B-IQ3_XXS-imat.gguf | 265.9 MB | |
| Qwen3-Embedding-0.6B-IQ4_NL-imat.gguf | 363.67 MB | |
| Qwen3-Embedding-0.6B-IQ4_XS-imat.gguf | 350.54 MB | |
| Qwen3-Embedding-0.6B-Q2_K-imat.gguf | 282.29 MB | |
| Qwen3-Embedding-0.6B-Q2_K_S-imat.gguf | 267.34 MB | |
| Qwen3-Embedding-0.6B-Q3_K_L-imat.gguf | 351.2 MB | |
| Qwen3-Embedding-0.6B-Q3_K_M-imat.gguf | 330.83 MB | |
| Qwen3-Embedding-0.6B-Q3_K_S-imat.gguf | 307.89 MB | |
| Qwen3-Embedding-0.6B-Q4_0-imat.gguf | 364.23 MB | Recommended |
| Qwen3-Embedding-0.6B-Q4_1-imat.gguf | 389.92 MB | |
| Qwen3-Embedding-0.6B-Q4_K_M-imat.gguf | 378.11 MB | |
| Qwen3-Embedding-0.6B-Q4_K_S-imat.gguf | 365.29 MB | |
| Qwen3-Embedding-0.6B-Q5_0.gguf | 416.17 MB | |
| Qwen3-Embedding-0.6B-Q5_1.gguf | 442.42 MB | |
| Qwen3-Embedding-0.6B-Q5_K_M.gguf | 423.61 MB | |
| Qwen3-Embedding-0.6B-Q5_K_S.gguf | 416.17 MB | |
| Qwen3-Embedding-0.6B-Q6_K.gguf | 471.95 MB | |
| Qwen3-Embedding-0.6B-Q8_0.gguf | 609.54 MB | |
| Qwen3-Embedding-0.6B-TQ1_0-imat.gguf | 216.01 MB | |
| Qwen3-Embedding-0.6B-TQ2_0-imat.gguf | 235.7 MB | |