

---
license: apache-2.0
base_model: Qwen/Qwen3-VL-Embedding-8B
tags:
  - multimodal
  - embedding
  - gguf
  - llama.cpp
  - quantized
library_name: llama.cpp
language:
  - en
  - zh
  - multilingual
---

# Qwen3-VL-Embedding-8B GGUF

GGUF quantizations of Qwen/Qwen3-VL-Embedding-8B for efficient CPU inference with llama.cpp.

## Model Description

Qwen3-VL-Embedding-8B is a multimodal embedding model for information retrieval and cross-modal understanding. It supports text, images, screenshots, videos, and mixed multimodal inputs.

Original model specs:

  • Parameters: 8B
  • Context Length: 32K tokens
  • Embedding Dimension: 64-4096 (configurable)
  • Languages: 30+
  • Input Modalities: Text, Images, Videos
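
The configurable embedding dimension (64-4096) is commonly exposed by truncating the full vector and re-normalizing (Matryoshka-style). A minimal sketch of that convention, assuming the leading components carry the coarse-grained signal:

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Keep the first `dim` components of an embedding and re-normalize.

    Illustrative sketch of Matryoshka-style dimension reduction; the
    exact convention used by the model may differ.
    """
    v = np.asarray(vec, dtype=np.float32)[:dim]
    return v / np.linalg.norm(v)
```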

## Available Quantizations

| File | Size | Use Case |
|------|------|----------|
| Qwen3-VL-Embedding-8B-F16.gguf | 14.1 GB | Maximum quality, baseline reference |
| Qwen3-VL-Embedding-8B-Q8_0.gguf | 7.5 GB | **Recommended**: minimal quality loss |
| Qwen3-VL-Embedding-8B-Q6_K.gguf | 5.8 GB | High quality, good balance |
| Qwen3-VL-Embedding-8B-Q5_K_M.gguf | 5.1 GB | Good quality, balanced size |
| Qwen3-VL-Embedding-8B-Q5_K_S.gguf | 4.9 GB | Good quality, smaller variant |
| Qwen3-VL-Embedding-8B-Q4_K_M.gguf | 4.4 GB | Decent quality, smaller size |
| Qwen3-VL-Embedding-8B-Q4_K_S.gguf | 4.2 GB | Decent quality, more compressed |
| Qwen3-VL-Embedding-8B-Q3_K_M.gguf | 3.6 GB | Lower quality, significant compression |
| Qwen3-VL-Embedding-8B-Q2_K.gguf | 2.9 GB | Lowest quality, maximum compression |

Recommendation: Start with Q8_0 for production use. Use Q4_K_M or Q5_K_M for resource-constrained environments.
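
One way to act on this recommendation programmatically is to pick the largest quantization that fits the available RAM. The thresholds below are illustrative, derived from the file sizes above plus rough headroom for context buffers:

```python
# Illustrative mapping from available RAM (GiB) to a quantization choice.
# Thresholds are assumptions (file size + headroom), not official guidance.
QUANTS = [  # (min_ram_gib, filename), largest first
    (18.0, "Qwen3-VL-Embedding-8B-F16.gguf"),
    (10.0, "Qwen3-VL-Embedding-8B-Q8_0.gguf"),
    (7.0, "Qwen3-VL-Embedding-8B-Q5_K_M.gguf"),
    (6.0, "Qwen3-VL-Embedding-8B-Q4_K_M.gguf"),
    (4.0, "Qwen3-VL-Embedding-8B-Q2_K.gguf"),
]

def pick_quant(ram_gib):
    """Return the highest-quality quantization that fits in `ram_gib`."""
    for threshold, name in QUANTS:
        if ram_gib >= threshold:
            return name
    raise ValueError("Not enough RAM for any quantization")
```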

## Usage with llama.cpp

### Installation

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
```

### Download Model

```bash
huggingface-cli download dam2452/Qwen3-VL-Embedding-8B-GGUF \
  Qwen3-VL-Embedding-8B-Q8_0.gguf \
  --local-dir ./models
```

### Run Embedding Server

```bash
./build/bin/llama-server \
  -m models/Qwen3-VL-Embedding-8B-Q8_0.gguf \
  --embedding \
  --port 8080 \
  --host 0.0.0.0
```
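
Before sending requests, it can help to wait until the server is ready (llama-server exposes a `/health` endpoint). A small polling sketch, parameterized by a probe callable so the HTTP check itself (e.g. a GET to `http://localhost:8080/health`) is supplied by the caller:

```python
import time

def wait_for_server(probe, timeout=30.0, interval=0.5):
    """Poll `probe()` until it returns True or `timeout` seconds elapse.

    `probe` is any zero-argument callable, e.g. one that GETs the
    server's /health endpoint and returns True on HTTP 200.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False
```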

### Generate Embeddings (API)

```bash
curl http://localhost:8080/embedding \
  -H "Content-Type: application/json" \
  -d '{
    "content": "Your text or image data here"
  }'
```

### Generate Embeddings (Python)

```python
import requests

response = requests.post(
    "http://localhost:8080/embedding",
    json={"content": "A woman playing with her dog on a beach"},
)

embedding = response.json()["embedding"]
print(f"Embedding dimension: {len(embedding)}")
```
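
For retrieval, embeddings returned by the server are typically compared with cosine similarity. A self-contained helper (no server required):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Rank candidate documents by `cosine_similarity(query_embedding, doc_embedding)`, highest first.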

## Performance

Original model performance on benchmarks:

  • MMEB-V2: 77.9 overall score
  • MMTEB: 67.88 mean task score
  • Retrieval: 81.08

Note: Quantized models may show slightly reduced performance, with Q8_0 typically having less than 1% degradation.

## License

Apache 2.0 (inherited from original model)

## Citation

```bibtex
@article{qwen3vlembedding,
  title={Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking},
  author={Li, Mingxin and Zhang, Yanzhao and Long, Dingkun and Chen, Keqin and Song, Sibo and Bai, Shuai and Yang, Zhibo and Xie, Pengjun and Yang, An and Liu, Dayiheng and Zhou, Jingren and Lin, Junyang},
  journal={arXiv},
  year={2026}
}
```
