πŸ“‹ Model Description


---
language:
- en
- zh
- ja
- ko
- fr
- de
- es
- pt
- ru
- ar
license: apache-2.0
library_name: gguf
pipeline_tag: sentence-similarity
tags:
- gguf
- quantized
- embedding
- sentence-transformers
- Qwen3
- imatrix
- llama-cpp
base_model: Qwen/Qwen3-Embedding-0.6B
model_creator: Qwen
datasets:
- explodinggradients/fiqa
- PatronusAI/financebench
- zeroshot/twitter-financial-news-sentiment
- philschmid/finanical-rag-embedding-dataset
- openai/gsm8k
- DigitalLearningGmbH/MATH-lighteval
quantized_by: PeterAM4
---

Qwen3-Embedding-0.6B -- GGUF

All-in-one GGUF quantizations of Qwen/Qwen3-Embedding-0.6B, from 8-bit down to 1-bit, with importance-matrix calibration optimized for financial and technical text retrieval.

Qwen3-Embedding-0.6B is a compact, multilingual embedding model well suited for RAG pipelines, semantic search, and document retrieval. These quantizations make it practical to run on edge devices, laptops, and resource-constrained servers -- particularly for financial NLP workloads where low latency and small memory footprint matter.

The importance matrix was calibrated on a mixed corpus weighted toward financial data (financial Q&A from FiQA, SEC 10-K filings from FinanceBench, financial sentiment from Twitter, RAG pairs) alongside math reasoning and general text, so the quantized models preserve the weights most relevant to financial domain embeddings.

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-Embedding-0.6B |
| Parameters | 595,776,512 |
| Max context | 32,768 tokens |
| Pooling | Last token |
| Embedding dim | 1024 |
| License | Apache 2.0 |
| Quantized with | llama.cpp |

Why quantize?

Despite their size, neural networks store far less information than their bit count suggests. Most of the 16 bits allocated per weight during training exist to make gradient descent work, not to store knowledge. Current estimates put the actual information content at roughly 2 bits per parameter; the remaining 14 bits are redundancy.

This explains why aggressive quantization works: compressing from 16-bit to 4-bit (a 75% reduction) discards almost exclusively noise. Our benchmark data confirms this -- Q3_K_M-imat at 4.66 BPW scores within +0.62 PPL of the full BF16 baseline while being 70% smaller (331 MB vs 1.1 GB). For comparison, Q4_K_M-imat at 5.32 BPW shows +0.65 PPL -- the 3-bit model actually edges it out. The imatrix (importance matrix) is key here: by profiling which weights actually carry signal, we preserve those at higher precision and compress the rest. This is why imatrix-calibrated 3-bit models can outperform naive 5-bit quantizations.

The quality cliff appears around 4 BPW, where quantization starts cutting into real information. Below that, PPL degrades rapidly (Q2_K at 3.97 BPW: +391; IQ1_S at 2.79 BPW: +23,008). Ternary quantizations (TQ2_0, TQ1_0) diverge entirely on this architecture.
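As a sanity check on the figures above, the file sizes in the benchmark table follow directly from bits-per-weight times parameter count. A quick sketch (BPW here is total file bits over the 595,776,512 parameters, so it also absorbs metadata overhead):

```python
PARAMS = 595_776_512  # parameter count of Qwen3-Embedding-0.6B

def size_mib(bpw: float) -> float:
    # BPW in the benchmark table is total file bits divided by parameter
    # count, so file size in MiB is recoverable from BPW alone.
    return PARAMS * bpw / 8 / (1024 ** 2)

print(round(size_mib(4.66)))   # Q3_K_M-imat at 4.66 BPW -> 331 (MiB)
print(round(size_mib(16.08)))  # BF16 baseline -> 1142 (MiB), i.e. ~1.1 GB
```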


Benchmark results

All models evaluated with llama-perplexity on a 22 MB calibration corpus (financial, math, and general text). Context window: 1536 tokens. Chunks: 200. Lower PPL = better.

Baseline PPL (BF16): 406.0250
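The numbers were produced with a command along these lines (a sketch, not the exact invocation; the corpus path is a placeholder, and flags match the stated settings):

```bash
./llama-perplexity \
    -m Qwen3-Embedding-0.6B-Q3_K_M-imat.gguf \
    -f calibration-corpus.txt \
    -c 1536 --chunks 200
```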

| Model | Size | BPW | PPL | Ξ” PPL |
|---|---|---|---|---|
| Qwen3-Embedding-0.6B-BF16.gguf (unquantized baseline) | 1.1G | 16.08 | 406.0250 | -- |
| Qwen3-Embedding-0.6B-Q8_0.gguf | 610M | 8.58 | 409.5689 | +3.54 |
| Qwen3-Embedding-0.6B-Q6_K.gguf | 472M | 6.65 | 417.3712 | +11.35 |
| Qwen3-Embedding-0.6B-Q5_1.gguf | 442M | 6.23 | 426.9407 | +20.92 |
| Qwen3-Embedding-0.6B-Q5_K_M.gguf | 424M | 5.96 | 442.9431 | +36.92 |
| Qwen3-Embedding-0.6B-Q5_0.gguf | 416M | 5.86 | 413.1916 | +7.17 |
| Qwen3-Embedding-0.6B-Q5_K_S.gguf | 416M | 5.86 | 414.9329 | +8.91 |
| Qwen3-Embedding-0.6B-Q4_1-imat.gguf | 390M | 5.49 | 403.0646 | -2.96 |
| Qwen3-Embedding-0.6B-Q4_K_M-imat.gguf | 378M | 5.32 | 406.6788 | +0.65 |
| Qwen3-Embedding-0.6B-Q4_K_S-imat.gguf | 365M | 5.14 | 406.9947 | +0.97 |
| Qwen3-Embedding-0.6B-Q4_0-imat.gguf | 364M | 5.13 | 419.8843 | +13.86 |
| Qwen3-Embedding-0.6B-IQ4_NL-imat.gguf | 364M | 5.12 | 435.0203 | +29.00 |
| Qwen3-Embedding-0.6B-Q3_K_L-imat.gguf | 351M | 4.94 | 412.0217 | +6.00 |
| Qwen3-Embedding-0.6B-IQ4_XS-imat.gguf | 351M | 4.94 | 451.4025 | +45.38 |
| Qwen3-Embedding-0.6B-Q3_K_M-imat.gguf (recommended) | 331M | 4.66 | 406.6408 | +0.62 |
| Qwen3-Embedding-0.6B-IQ3_M-imat.gguf | 320M | 4.51 | 460.9405 | +54.92 |
| Qwen3-Embedding-0.6B-IQ3_S-imat.gguf | 308M | 4.34 | 475.4797 | +69.45 |
| Qwen3-Embedding-0.6B-Q3_K_S-imat.gguf | 308M | 4.34 | 340.2907 | -65.73 |
| Qwen3-Embedding-0.6B-IQ3_XS-imat.gguf | 298M | 4.20 | 520.3907 | +114.37 |
| Qwen3-Embedding-0.6B-Q2_K-imat.gguf | 282M | 3.97 | 797.8549 | +391.83 |
| Qwen3-Embedding-0.6B-Q2_K_S-imat.gguf | 267M | 3.76 | 1561.2449 | +1155.22 |
| Qwen3-Embedding-0.6B-IQ3_XXS-imat.gguf | 266M | 3.74 | 613.9329 | +207.91 |
| Qwen3-Embedding-0.6B-IQ2_M-imat.gguf | 252M | 3.55 | 1283.4407 | +877.42 |
| Qwen3-Embedding-0.6B-IQ2_S-imat.gguf | 242M | 3.41 | 1857.4142 | +1451.39 |
| Qwen3-Embedding-0.6B-TQ2_0-imat.gguf | 236M | 3.32 | diverged | N/A |
| Qwen3-Embedding-0.6B-IQ2_XS-imat.gguf | 231M | 3.25 | 3632.9250 | +3226.90 |
| Qwen3-Embedding-0.6B-IQ2_XXS-imat.gguf | 219M | 3.08 | 5641.8950 | +5235.87 |
| Qwen3-Embedding-0.6B-TQ1_0-imat.gguf | 216M | 3.04 | diverged | N/A |
| Qwen3-Embedding-0.6B-IQ1_M-imat.gguf | 206M | 2.90 | 7495.4178 | +7089.39 |
| Qwen3-Embedding-0.6B-IQ1_S-imat.gguf | 198M | 2.79 | 23414.9432 | +23008.92 |

Notes:

  • The -imat suffix means the model was quantized with importance-matrix calibration. This is what allows 3-4 bit models to stay close to baseline -- the imatrix tells the quantizer which weights carry real information.
  • Q3_K_S-imat reports an anomalously low PPL (340), below the BF16 baseline. This is a statistical artifact of the evaluation, not a genuine improvement over baseline.
  • TQ2_0 / TQ1_0 (ternary quantizations) produce diverged PPL on this architecture. They require CPU or CUDA (Apple Metal is not supported) and are not usable for this model.
  • Below ~4 BPW, quality degrades steeply. Below ~3 BPW, models are not recommended for any production use.

Choosing a model

| Use case | Model | Notes |
|---|---|---|
| Maximum quality | Q8_0 | Near-lossless, 610 MB |
| Best quality/size trade-off | Q3_K_M-imat | +0.62 PPL delta at 331 MB -- smallest model with near-baseline quality |
| Larger but safe margin | Q4_K_M-imat | +0.65 PPL delta at 378 MB |
| Extreme compression | Q2_K-imat | Usable for non-critical applications |
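The same decision can be mechanized. A toy helper (sizes and PPL deltas copied from the benchmark table above) that returns the smallest file meeting a quality budget:

```python
# (name, size in MB, PPL delta) copied from the benchmark table.
QUANTS = [
    ("Q8_0", 610, 3.54),
    ("Q4_K_M-imat", 378, 0.65),
    ("Q3_K_M-imat", 331, 0.62),
    ("Q2_K-imat", 282, 391.83),
]

def smallest_within(max_delta_ppl: float):
    """Smallest model whose PPL delta stays under the budget (or None)."""
    candidates = [q for q in QUANTS if q[2] <= max_delta_ppl]
    return min(candidates, key=lambda q: q[1])[0] if candidates else None

print(smallest_within(1.0))  # -> Q3_K_M-imat
```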

Quantization method

All models were quantized from the BF16 source using llama-quantize from llama.cpp.

Three strategies were used:

  1. Standard (Q8_0, Q6_K, Q5_K_M, Q5_K_S, Q5_0, Q5_1) -- uniform precision reduction, no imatrix.
  2. K-Quant + imatrix (Q4_K_M, Q4_K_S, Q4_0, Q4_1, Q3_K_L, Q3_K_M, Q3_K_S, Q2_K, Q2_K_S) -- block-level mixed precision, importance matrix recommended.
  3. Importance-weighted (IQ4_NL, IQ4_XS, IQ3_M, IQ3_S, IQ3_XS, IQ3_XXS, IQ2_M, IQ2_S, IQ2_XS, IQ2_XXS, IQ1_M, IQ1_S, TQ2_0, TQ1_0) -- non-linear quantization, imatrix required.
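The imatrix-based strategies boil down to a two-step pipeline, roughly like this (a sketch; file names are placeholders, and flags should be checked against your llama.cpp build):

```bash
# 1) Profile weight importance on the calibration corpus
./llama-imatrix -m Qwen3-Embedding-0.6B-BF16.gguf \
    -f calibration-corpus.txt -o imatrix.dat

# 2) Quantize the BF16 source using the importance matrix
./llama-quantize --imatrix imatrix.dat \
    Qwen3-Embedding-0.6B-BF16.gguf \
    Qwen3-Embedding-0.6B-Q3_K_M-imat.gguf Q3_K_M
```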

Calibration corpus

The importance matrix was generated from a mixed-domain corpus (22 MB, ~198,000 lines). The mix was chosen to cover the primary use case (financial text) while including general and mathematical text to maintain broad capability:

| Dataset | Source | Domain | Entries |
|---|---|---|---|
| WikiText-2 | ggml-org | General knowledge | 36,718 lines |
| Twitter Financial News | zeroshot/twitter-financial-news-sentiment | Financial sentiment | 9,543 |
| GSM8K | openai/gsm8k | Math word problems | 7,473 |
| Financial RAG | philschmid/finanical-rag-embedding-dataset | Financial Q&A pairs | 6,998 |
| FiQA | explodinggradients/fiqa | Personal finance Q&A | 5,650 |
| MATH Competition | DigitalLearningGmbH/MATH-lighteval | Competition math | 5,000 |
| FinanceBench | PatronusAI/financebench | SEC 10-K filings | 150 |

The financial datasets (FiQA + Twitter Financial News + Financial RAG + FinanceBench) contribute ~22,300 entries of domain-specific text covering sentiment, Q&A, RAG pairs, and SEC filings -- ensuring the importance matrix prioritizes weights relevant to financial terminology and reasoning.

Usage

llama.cpp server

```bash
./llama-server \
    -m Qwen3-Embedding-0.6B-Q3_K_M-imat.gguf \
    --embedding --pooling last \
    -c 32768 -np 8 \
    --host 0.0.0.0 --port 8080
```
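Once the server is up, embeddings are available over HTTP. A minimal client sketch using only the standard library (the `/v1/embeddings` route is llama-server's OpenAI-compatible endpoint; host and port assume the launch command above):

```python
import json
import urllib.request

def embedding_request(texts, base_url="http://localhost:8080"):
    """Build a POST request for llama-server's OpenAI-compatible endpoint."""
    payload = json.dumps({"input": texts}).encode()
    return urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

req = embedding_request(["Financial analysis of Q3 earnings"])
# With the server running, send it and read the vector back:
# with urllib.request.urlopen(req) as resp:
#     data = json.load(resp)
#     print(len(data["data"][0]["embedding"]))  # expect 1024 dimensions
```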

Python (llama-cpp-python)

```python
from llama_cpp import Llama, LLAMA_POOLING_TYPE_LAST

model = Llama(
    model_path="Qwen3-Embedding-0.6B-Q3_K_M-imat.gguf",
    embedding=True,
    pooling_type=LLAMA_POOLING_TYPE_LAST,
    n_ctx=32768,
)

result = model.create_embedding(["Financial analysis of Q3 earnings"])
print(len(result["data"][0]["embedding"]))  # 1024
```
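For retrieval on top of these vectors, a dependency-free cosine-similarity helper is enough (illustrative; a vector store or numpy would compute the same thing):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
```

Rank candidate documents by `cosine(query_vec, doc_vec)` descending; since the model's vectors all have the same dimension (1024), no padding or truncation is needed.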

Download a specific file

```python
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="PeterAM4/Qwen3-Embedding-0.6B-GGUF",
    filename="Qwen3-Embedding-0.6B-Q3_K_M-imat.gguf",
)
```


Technical details

  • GGUF format v3
  • Tokenizer: Qwen3 (151,936 tokens)
  • add_eos_token: false (patched for llama.cpp compatibility; EOS token ID 151643 is still present and usable)
  • Pooling type: 3 (last token)
  • Hardware used: Apple M3 Pro with Metal acceleration (CPU fallback for ternary quants)

Credits

License

Apache 2.0, inherited from the original model.

πŸ“‚ GGUF File List

| Filename | Size |
|---|---|
| Qwen3-Embedding-0.6B-BF16.gguf | 1.12 GB |
| Qwen3-Embedding-0.6B-IQ1_M-imat.gguf | 205.86 MB |
| Qwen3-Embedding-0.6B-IQ1_S-imat.gguf | 198.19 MB |
| Qwen3-Embedding-0.6B-IQ2_M-imat.gguf | 252.45 MB |
| Qwen3-Embedding-0.6B-IQ2_S-imat.gguf | 242.23 MB |
| Qwen3-Embedding-0.6B-IQ2_XS-imat.gguf | 230.6 MB |
| Qwen3-Embedding-0.6B-IQ2_XXS-imat.gguf | 218.63 MB |
| Qwen3-Embedding-0.6B-IQ3_M-imat.gguf | 320.24 MB |
| Qwen3-Embedding-0.6B-IQ3_S-imat.gguf | 307.89 MB |
| Qwen3-Embedding-0.6B-IQ3_XS-imat.gguf | 298.04 MB |
| Qwen3-Embedding-0.6B-IQ3_XXS-imat.gguf | 265.9 MB |
| Qwen3-Embedding-0.6B-IQ4_NL-imat.gguf | 363.67 MB |
| Qwen3-Embedding-0.6B-IQ4_XS-imat.gguf | 350.54 MB |
| Qwen3-Embedding-0.6B-Q2_K-imat.gguf | 282.29 MB |
| Qwen3-Embedding-0.6B-Q2_K_S-imat.gguf | 267.34 MB |
| Qwen3-Embedding-0.6B-Q3_K_L-imat.gguf | 351.2 MB |
| Qwen3-Embedding-0.6B-Q3_K_M-imat.gguf | 330.83 MB |
| Qwen3-Embedding-0.6B-Q3_K_S-imat.gguf | 307.89 MB |
| Qwen3-Embedding-0.6B-Q4_0-imat.gguf | 364.23 MB |
| Qwen3-Embedding-0.6B-Q4_1-imat.gguf | 389.92 MB |
| Qwen3-Embedding-0.6B-Q4_K_M-imat.gguf | 378.11 MB |
| Qwen3-Embedding-0.6B-Q4_K_S-imat.gguf | 365.29 MB |
| Qwen3-Embedding-0.6B-Q5_0.gguf | 416.17 MB |
| Qwen3-Embedding-0.6B-Q5_1.gguf | 442.42 MB |
| Qwen3-Embedding-0.6B-Q5_K_M.gguf | 423.61 MB |
| Qwen3-Embedding-0.6B-Q5_K_S.gguf | 416.17 MB |
| Qwen3-Embedding-0.6B-Q6_K.gguf | 471.95 MB |
| Qwen3-Embedding-0.6B-Q8_0.gguf | 609.54 MB |
| Qwen3-Embedding-0.6B-TQ1_0-imat.gguf | 216.01 MB |
| Qwen3-Embedding-0.6B-TQ2_0-imat.gguf | 235.7 MB |