---
license: apache-2.0
base_model: Qwen/Qwen3-VL-Embedding-8B
tags:
  - multimodal
  - embedding
  - gguf
  - llama.cpp
  - quantized
library_name: llama.cpp
language:
  - en
  - zh
  - multilingual
---
# Qwen3-VL-Embedding-8B GGUF
GGUF quantizations of Qwen/Qwen3-VL-Embedding-8B for efficient CPU inference with llama.cpp.
## Model Description
Qwen3-VL-Embedding-8B is a multimodal embedding model for information retrieval and cross-modal understanding. It supports text, images, screenshots, videos, and mixed multimodal inputs.
Original model specs:
- Parameters: 8B
- Context Length: 32K tokens
- Embedding Dimension: 64-4096 (configurable)
- Languages: 30+
- Input Modalities: Text, Images, Videos
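The configurable output dimension listed above is commonly implemented by truncating the full embedding vector and re-normalizing it to unit length (a Matryoshka-style scheme); whether the server or the client does this depends on your setup. A minimal client-side sketch in pure Python (the 8-component vector below is a toy stand-in for a real 4096-dim embedding):

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and re-normalize to unit
    length, so cosine similarity remains meaningful."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.1, -0.2, 0.3, 0.4, -0.5, 0.6, 0.7, -0.8]  # toy stand-in
short = truncate_embedding(full, 4)
print(len(short))                           # 4
print(round(sum(x * x for x in short), 6))  # 1.0 (unit length)
```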
## Available Quantizations
| File | Size | Use Case |
|---|---|---|
| Qwen3-VL-Embedding-8B-F16.gguf | 14.1 GB | Maximum quality, baseline reference |
| Qwen3-VL-Embedding-8B-Q8_0.gguf | 7.5 GB | Recommended - minimal quality loss |
| Qwen3-VL-Embedding-8B-Q6_K.gguf | 5.79 GB | High quality, good balance |
| Qwen3-VL-Embedding-8B-Q5_K_M.gguf | 5.05 GB | Good quality, balanced size |
| Qwen3-VL-Embedding-8B-Q5_K_S.gguf | 4.93 GB | Good quality, smaller variant |
| Qwen3-VL-Embedding-8B-Q4_K_M.gguf | 4.36 GB | Decent quality, smaller size |
| Qwen3-VL-Embedding-8B-Q4_K_S.gguf | 4.15 GB | Decent quality, more compressed |
| Qwen3-VL-Embedding-8B-Q3_K_M.gguf | 3.59 GB | Lower quality, significant compression |
| Qwen3-VL-Embedding-8B-Q2_K.gguf | 2.87 GB | Lowest quality, maximum compression |
## Usage with llama.cpp
### Installation

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
# Binaries land in build/bin/
```
### Download Model

```bash
huggingface-cli download dam2452/Qwen3-VL-Embedding-8B-GGUF \
  Qwen3-VL-Embedding-8B-Q8_0.gguf \
  --local-dir ./models
```
### Run Embedding Server

```bash
./build/bin/llama-server \
  -m models/Qwen3-VL-Embedding-8B-Q8_0.gguf \
  --embedding \
  --port 8080 \
  --host 0.0.0.0
```
### Generate Embeddings (API)

```bash
curl http://localhost:8080/embedding \
  -H "Content-Type: application/json" \
  -d '{
    "content": "Your text or image data here"
  }'
```
### Generate Embeddings (Python)

```python
import requests

# Request an embedding from the running llama-server instance
response = requests.post(
    "http://localhost:8080/embedding",
    json={"content": "A woman playing with her dog on a beach"},
)
embedding = response.json()["embedding"]
print(f"Embedding dimension: {len(embedding)}")
```
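For retrieval, embeddings are typically compared with cosine similarity and documents ranked by score. A self-contained sketch (the vectors here are tiny toy stand-ins, not real model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 3-dim vectors standing in for real embeddings
query_vec = [0.2, 0.1, 0.9]
doc_vecs = {
    "beach photo": [0.3, 0.0, 0.95],
    "tax form": [0.9, 0.4, 0.1],
}

# Rank documents by similarity to the query, best first
ranked = sorted(doc_vecs,
                key=lambda d: cosine_similarity(query_vec, doc_vecs[d]),
                reverse=True)
print(ranked[0])  # beach photo
```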
## Performance
Original model performance on benchmarks:
- MMEB-V2: 77.9 overall score
- MMTEB: 67.88 mean task score
- Retrieval: 81.08
Note: Quantized models may show slightly reduced performance, with Q8_0 typically having less than 1% degradation.
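One way to build intuition for that degradation: rounding vector components to a coarser grid (a crude stand-in for weight quantization) shifts the resulting embedding slightly, and cosine similarity against the full-precision reference quantifies the drift. This is an illustration, not a measurement of the actual GGUF quantization scheme:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def fake_quantize(vec, step):
    """Round each component to a grid of size `step` -
    a toy model of quantization precision loss."""
    return [round(x / step) * step for x in vec]

ref = [0.12, -0.53, 0.31, 0.78, -0.44]  # toy reference vector
for step in (0.01, 0.1, 0.5):
    q = fake_quantize(ref, step)
    print(step, round(cosine(ref, q), 4))  # similarity falls as step grows
```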
## License
Apache 2.0 (inherited from original model)
## Citation

```bibtex
@article{qwen3vlembedding,
  title={Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking},
  author={Li, Mingxin and Zhang, Yanzhao and Long, Dingkun and Chen, Keqin and Song, Sibo and Bai, Shuai and Yang, Zhibo and Xie, Pengjun and Yang, An and Liu, Dayiheng and Zhou, Jingren and Lin, Junyang},
  journal={arXiv},
  year={2026}
}
```
## Resources

- Original model: [Qwen/Qwen3-VL-Embedding-8B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-8B)
- llama.cpp: [https://github.com/ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp)