---
license: apache-2.0
tags:
  - gguf
  - qwen
  - qwen3
  - qwen3-4b
  - qwen3-4b-gguf
  - llama.cpp
  - quantized
  - text-generation
  - reasoning
  - agent
  - chat
  - multilingual
  - imatrix
  - q3_hifi
  - q4_hifi
  - q5_hifi
base_model: Qwen/Qwen3-4B
author: geoffmunn
pipeline_tag: text-generation
language:
  - en
  - zh
  - es
  - fr
  - de
  - ru
  - ar
  - ja
  - ko
  - hi
---
Qwen3-4B-f16-GGUF
Model Description
This is a GGUF-quantized version of the Qwen/Qwen3-4B language model, a powerful 4-billion-parameter LLM from Alibaba's Qwen series, designed for strong reasoning, agentic workflows, and multilingual fluency on consumer-grade hardware.
Converted for use with llama.cpp, LM Studio, OpenWebUI, GPT4All, and more.
Why Use a 4B Model?
The Qwen3-4B model strikes a powerful balance between capability and efficiency, offering:
- Strong reasoning and language understanding, significantly more capable than sub-1B models
- Smooth CPU inference with moderate hardware (no high-end GPU required)
- Memory footprint under ~8GB when quantized (e.g., GGUF Q4_K_M or AWQ)
- Excellent price-to-performance ratio for local or edge deployment
It's ideal for:
- Local chatbots with contextual memory and richer responses
- On-device AI on laptops or mid-tier edge servers
- Lightweight RAG (Retrieval-Augmented Generation) applications
- Developers needing a capable yet manageable open-weight model
Choose Qwen3-4B when you need more intelligence than a tiny model can provide, but still want to run offline, avoid cloud costs, or maintain full control over your AI stack.
Qwen3 4B Quantization Guide: Cross-Bit Summary & Recommendations
Executive Summary
At 4B scale, quantization quality is exceptional: multiple variants achieve near-lossless or even better-than-F16 perplexity under specific conditions. However, imatrix interactions are uniquely counterintuitive at this scale: it harms certain variants (Q4_K_HIFI, Q5_K_S) while helping others. This makes quantization selection critically dependent on whether imatrix is used.
| Quantization | Best Variant (+ imatrix) | Quality vs F16 | File Size | Speed | Memory |
|---|---|---|---|---|---|
| Q5_K | Q5_K_HIFI + imatrix | -0.76% (better than F16!) | 2.67 GiB | 182.7 TPS | 2,734 MiB |
| Q4_K | Q4_K_M + imatrix | +2.75% | 2.32 GiB | 200.2 TPS | 2,376 MiB |
| Q3_K | Q3_K_HIFI + imatrix | +5.9% | 2.15 GiB | 151.3 TPS | 2,202 MiB |
Bit-Width Recommendations by Use Case
🎯 Quality-Critical Applications
✅ Q5_K_HIFI + imatrix
- Best perplexity at 14.2321 PPL (-0.76% vs F16), statistically indistinguishable from (or better than) F16
- Only 1.4% slower than the fastest variant (182.7 TPS)
- Requires a custom llama.cpp build with Q5_K_HIFI_RES8 support
- ⚠️ Never use Q5_K_S + imatrix: quality degrades severely (+0.94% vs F16)
⚖️ Best Overall Balance (Recommended Default)
✅ Q4_K_M + imatrix
- Excellent quality: only +2.75% PPL vs F16 (14.2865)
- Strong speed (200.2 TPS, +143% vs F16)
- Compact size (2.32 GiB, 69% smaller than F16)
- Standard llama.cpp compatibility: no custom build required
- Ideal for most development and production scenarios
🚀 Maximum Speed / Minimum Size
✅ Q5_K_S (no imatrix)
- Fastest at 184.65 TPS (+124% vs F16)
- Smallest at 2.62 GiB (5.60 BPW)
- Best quality without imatrix at -0.68% vs F16 (beats F16!)
- ⚠️ Critical: do NOT use imatrix with Q5_K_S; it degrades quality by 1.63%
📱 Extreme Memory Constraints (< 2.2 GiB)
✅ Q3_K_S + imatrix
- Absolute smallest footprint (1.75 GiB file, 1,792 MiB runtime)
- Acceptable +16.6% precision loss with imatrix
- Fastest Q3 variant (223.5 TPS)
- Only viable Q3 option under 1.8 GiB VRAM
Critical Warnings for 4B Scale
⚠️ imatrix is NOT universally beneficial at 4B scale; it exhibits paradoxical behavior:
| Variant | imatrix Effect | Recommendation |
|---|---|---|
| Q5_K_S | ❌ Harmful: +1.63% PPL degradation | Never use imatrix; quality drops from -0.68% to +0.94% vs F16 |
| Q4_K_HIFI | ❌ Severely harmful: +4.4% PPL degradation | Never use imatrix; quality drops from +0.29% to +4.72% vs F16 |
| Q4_K_M | ✅ Beneficial: -0.34% PPL improvement | Always use imatrix; best Q4 quality at +2.75% vs F16 |
| Q5_K_HIFI | ✅ Beneficial: -0.80% PPL improvement | Always use imatrix; achieves -0.76% vs F16 (best overall) |
| Q3_K variants | ✅ Beneficial: 6-12% PPL improvement | Always use imatrix; essential for production quality |
⚠️ Q5_K_S without imatrix is the 4B anomaly: it wins all three dimensions simultaneously (quality, speed, size) without imatrix, a rare quantization "free lunch" that only occurs at this specific model scale.
Decision Flowchart
```
Need best quality?
├─ Yes → Using imatrix?
│   ├─ Yes → Q5_K_HIFI + imatrix (-0.76% vs F16)
│   └─ No  → Q4_K_HIFI (no imatrix, +0.29% vs F16)

Need best balance?
├─ Yes → Using imatrix?
│   ├─ Yes → Q4_K_M + imatrix (+2.75% vs F16, standard build)
│   └─ No  → Q5_K_S (no imatrix, -0.68% vs F16, fastest/smallest)

Need max speed?
├─ Yes → Q5_K_S (no imatrix), 184.65 TPS
│        ⚠️ Never pair with imatrix!

Memory constrained (< 2.2 GiB)?
└─ Yes → Q3_K_S + imatrix, 1,792 MiB runtime
         Accept +16.6% quality loss for extreme footprint reduction
```
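For scripted deployments, the flowchart can be condensed into a small helper. A minimal sketch, assuming you only need the four priorities above (the `pick_variant` function and its priority labels are ours, not part of any published API):

```python
# Hypothetical helper encoding the decision flowchart above.
# Variant names and imatrix rules come straight from this guide's tables.

def pick_variant(priority: str, have_imatrix: bool = True) -> str:
    """Return a recommended quant for 'quality', 'balance', 'speed', or 'memory'."""
    if priority == "quality":
        return "Q5_K_HIFI + imatrix" if have_imatrix else "Q4_K_HIFI (no imatrix)"
    if priority == "balance":
        return "Q4_K_M + imatrix" if have_imatrix else "Q5_K_S (no imatrix)"
    if priority == "speed":
        # Q5_K_S must never be paired with imatrix at 4B scale.
        return "Q5_K_S (no imatrix)"
    if priority == "memory":
        return "Q3_K_S + imatrix"
    raise ValueError(f"unknown priority: {priority}")
```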
Cross-Bit Performance Comparison
| Priority | Q3_K Best | Q4_K Best | Q5_K Best | Winner |
|---|---|---|---|---|
| Quality (no imat) | Q3_K_HIFI (+12.6%) | Q4_K_HIFI (+0.29%) ⭐ | Q5_K_S (-0.68%) ⭐⭐ | Q5_K_S |
| Quality (with imat) | Q3_K_HIFI (+5.9%) | Q4_K_M (+2.75%) | Q5_K_HIFI (-0.76%) ⭐ | Q5_K_HIFI |
| Speed | Q3_K_S (223.5 TPS) | Q4_K_S (206.7 TPS) | Q5_K_S (184.6 TPS) | Q3_K_S |
| Smallest Size | Q3_K_S (1.75 GiB) ⭐ | Q4_K_S (2.21 GiB) | Q5_K_S (2.62 GiB) | Q3_K_S |
| Best Balance | Q3_K_M + imat | Q4_K_M + imat ⭐ | Q5_K_M + imat | Q4_K_M |
Memory Budget Guide
| Available VRAM | Recommended Variant | Expected Quality | Why |
|---|---|---|---|
| < 1.8 GiB | Q3_K_S + imatrix | +16.6% loss | Only option that fits; quality acceptable for non-critical tasks |
| 1.8 – 2.5 GiB | Q4_K_S (no imatrix) | +4.9% loss | Good speed/size balance; avoid imatrix (degrades Q4_K_S slightly) |
| 2.5 – 3.0 GiB | Q4_K_M + imatrix ⭐ | +2.75% loss | Best balance of quality/speed/size; standard compatibility |
| 3.0 – 4.0 GiB | Q5_K_HIFI + imatrix ⭐ | -0.76% loss | Near-F16 quality; requires custom build |
| > 7.5 GiB | F16 or Q5_K_HIFI + imatrix | 0% or -0.76% loss | F16 if absolute precision required; Q5_K_HIFI if speed/memory matter |
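The VRAM bands above can be expressed as a lookup for automation. A sketch under the assumption that the unlisted 4.0–7.5 GiB range also maps to Q5_K_HIFI (the `variant_for_vram` function is hypothetical, not part of any tool):

```python
# Hypothetical lookup mirroring the memory-budget table above.
# Thresholds are the table's VRAM bands, in GiB.

def variant_for_vram(vram_gib: float) -> str:
    if vram_gib < 1.8:
        return "Q3_K_S + imatrix"
    if vram_gib < 2.5:
        return "Q4_K_S (no imatrix)"
    if vram_gib < 3.0:
        return "Q4_K_M + imatrix"
    if vram_gib <= 7.5:
        # Assumption: the table's 3.0-4.0 GiB row extends up to the F16 band.
        return "Q5_K_HIFI + imatrix"
    return "F16 or Q5_K_HIFI + imatrix"
```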
Bottom Line Recommendations
For Most Users
✅ Q4_K_M + imatrix: delivers excellent quality (+2.75% vs F16), strong speed (200 TPS), compact size (2.32 GiB), and universal llama.cpp compatibility. The safe, practical choice for 90% of deployments.
For Quality-Critical Work
✅ Q5_K_HIFI + imatrix: achieves perplexity better than F16 (-0.76% vs F16) with a 64% memory reduction. Requires a custom build but delivers maximum fidelity.
For Speed-Critical Work
✅ Q5_K_S (no imatrix): fastest (184.7 TPS) AND highest quality (-0.68% vs F16) AND smallest size (2.62 GiB), but never use imatrix with this variant.
For Edge/Mobile Deployment
✅ Q3_K_M + imatrix: best Q3 balance: +11.0% quality loss but 40% faster and 10% smaller than Q3_K_HIFI. Fits in ~2.0 GiB with comfortable headroom.
Critical Implementation Notes
⚠️ imatrix paradox at 4B scale: unlike other model sizes, where imatrix universally improves quality, at 4B:
- Q5_K_S and Q4_K_HIFI suffer quality degradation with imatrix
- This is caused by interference between imatrix's importance weighting and these variants' outlier/residual preservation strategies
- Always verify imatrix impact before deploying; never assume it helps
⚠️ Build requirements:
- Q5_K_HIFI: requires llama.cpp build 8037+ with Q5_K_HIFI_RES8 support
- Q4_K_HIFI: requires build 8025+ with Q4_K_HIFI / Q5_K_HIFI_RES8 support
- Q4_K_M / Q5_K_S / Q3_K variants: work with any recent standard llama.cpp build
⚠️ The 4B "sweet spot": this model size uniquely benefits from uniform quantization (Q5_K_S) without imatrix guidance. Larger models (8B+) require imatrix for optimal quality; smaller models (1.7B and below) suffer severe degradation without imatrix. 4B sits in a Goldilocks zone where the weight distribution aligns well with 5-bit uniform quantization.
Quick Reference Card
| Scenario | Variant | PPL | vs F16 | Speed | Size | Memory |
|---|---|---|---|---|---|---|
| Best quality | Q5_K_HIFI + imat | 14.2321 | -0.76% ⭐ | 182.7 TPS | 2.67 GiB | 2,734 MiB |
| Best balance | Q4_K_M + imat | 14.2865 | +2.75% ⭐ | 200.2 TPS | 2.32 GiB | 2,376 MiB |
| Fastest/smallest | Q5_K_S (no imat) | 14.2439 | -0.68% ⭐⭐ | 184.7 TPS | 2.62 GiB | 2,683 MiB |
| Near-lossless Q4 | Q4_K_HIFI (no imat) | 14.3832 | +0.29% ⭐ | 184.1 TPS | 2.50 GiB | 2,560 MiB |
| Smallest footprint | Q3_K_S + imat | 16.7282 | +16.6% | 223.5 TPS | 1.75 GiB | 1,792 MiB |
Golden rule for 4B:
- Q5_K_S → never use imatrix
- Q4_K_HIFI → never use imatrix
- Q4_K_M / Q5_K_HIFI → always use imatrix
- Q3_K variants → always use imatrix
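If you script your quantization pipeline, the golden rule is worth enforcing in code. A minimal sketch (the dictionary and function are ours, not part of llama.cpp):

```python
# Hypothetical encoding of the 4B golden rule above:
# True = always use imatrix, False = never use imatrix.
USE_IMATRIX = {
    "Q5_K_S": False,     # imatrix degrades quality by 1.63%
    "Q4_K_HIFI": False,  # imatrix degrades quality by 4.4%
    "Q4_K_M": True,
    "Q5_K_HIFI": True,
    "Q3_K_S": True,
    "Q3_K_M": True,
    "Q3_K_HIFI": True,
}

def should_use_imatrix(variant: str) -> bool:
    """Raise KeyError for variants this guide does not cover."""
    return USE_IMATRIX[variant]
```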
Non-technical model analysis and rankings
NOTE: This analysis does not include the HIFI models.
I have run each of these models across 6 questions and ranked them all based on the quality of the answers.
Qwen3-4B-f16:Q3_K_M (or Qwen3-4B-f16:Q3_K_HIFI) is the best model across all question types, but if you want to play it safe with a higher-precision model, then you could consider using Qwen3-4B-f16:Q8_0.
You can read the results here: Qwen3-4b-f16-analysis.md
If you find this useful, please give the project a ❤️ like.
Non-HIFI recommendation table based on output
| Level | Speed | Size | Recommendation |
|---|---|---|---|
| Q2_K | ⚡ Fastest | 1.9 GB | 🚨 DO NOT USE. Worst results of all the 4B models. |
| 🥈 Q3_K_S | ⚡ Fast | 2.2 GB | 🥈 Runner-up. A very good model for a wide range of queries. |
| 🥇 Q3_K_M | ⚡ Fast | 2.4 GB | 🥇 Best overall model. Highly recommended for all query types. |
| Q4_K_S | 🚀 Fast | 2.7 GB | A late showing in low-temperature queries. Probably not recommended. |
| Q4_K_M | 🚀 Fast | 2.9 GB | A late showing in high-temperature queries. Probably not recommended. |
| Q5_K_S | 🟢 Medium | 3.3 GB | Did not appear in the top 3 for any question. Not recommended. |
| Q5_K_M | 🟢 Medium | 3.4 GB | A second place for a high-temperature question; probably not recommended. |
| Q6_K | 🐢 Slow | 3.9 GB | Did not appear in the top 3 for any question. Not recommended. |
| 🥉 Q8_0 | 🐢 Slow | 5.1 GB | 🥉 If you want to play it safe, this is a good option. Good results across a variety of questions. |
Build notes
All of these models were built using these commands:
```shell
mkdir build
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON -DGGML_AVX=ON -DGGML_AVX2=ON -DGGML_CUDA=ON -DGGML_VULKAN=OFF -DLLAMA_CURL=OFF
cmake --build build --config Release -j
```
NOTE: Vulkan support is specifically turned off here. Vulkan performance was much worse, so if you want Vulkan support you can rebuild these models yourself.
The HIFI quantizations also used a massive 9,343-chunk imatrix file for extra precision. You can re-use it here: Qwen3-4B-f16-imatrix-9343-generic.gguf
The imatrix was created as a generic mix of Wikipedia, mathematics, and coding examples.
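If you want to regenerate or customise the imatrix, llama.cpp ships `llama-imatrix` and `llama-quantize` tools. A hedged sketch, assuming a local F16 GGUF and your own calibration text (the file names are placeholders, and flag details may differ between builds; check `--help` first):

```shell
# Sketch only: file names below are placeholders, not files shipped with this repo.
MODEL=Qwen3-4B-f16.gguf                 # the F16 source model
CALIB=calibration.txt                   # your calibration corpus
IMAT=Qwen3-4B-f16-imatrix-custom.gguf   # output importance matrix

# Collect importance statistics over the calibration text
llama-imatrix -m "$MODEL" -f "$CALIB" -o "$IMAT"

# Bake the imatrix into a quantized variant (Q4_K_M, per the golden rule above)
llama-quantize --imatrix "$IMAT" "$MODEL" Qwen3-4B-f16:Q4_K_M.gguf Q4_K_M
```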
Source code
You can use the HIFI GitHub repository to build it from source if you're interested: https://github.com/geoffmunn/llama.cpp.
Build notes: HIFI_BUILD_GUIDE.md
Improvements and feedback are welcome.
Usage
Load this model using:
- OpenWebUI: self-hosted AI interface with RAG & tools
- LM Studio: desktop app with GPU support and chat templates
- GPT4All: private, local AI chatbot (offline-first)
- Or directly via `llama.cpp`
Each quantized model includes its own README.md and shares a common MODELFILE for optimal configuration.
Importing directly into Ollama should work, but you might encounter this error: `Error: invalid character '<' looking for beginning of value`.
In this case try these steps:
- `wget https://huggingface.co/geoffmunn/Qwen3-4B-f16/resolve/main/Qwen3-4B-f16%3AQ3_K_M.gguf` (replace the quantised version with the one you want)
- `nano Modelfile` and enter these details (again, replacing Q3_K_M with the version you want):
```
FROM ./Qwen3-4B-f16:Q3_K_M.gguf

# Chat template using ChatML (used by Qwen)
SYSTEM You are a helpful assistant

TEMPLATE "{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"

PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>

# Default sampling
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER top_k 20
PARAMETER min_p 0.0
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 4096
```
The num_ctx value has been lowered to increase speed significantly.
- Then run this command:

```shell
ollama create Qwen3-4B-f16:Q3_K_M -f Modelfile
```

You will now see "Qwen3-4B-f16:Q3_K_M" in your Ollama model list.
These import steps are also useful if you want to customise the default parameters or system prompt.
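If you drive `llama.cpp` directly instead of going through Ollama, the ChatML template from the Modelfile above is easy to reproduce in code. A minimal sketch (the `chatml_prompt` helper is ours, not part of any library):

```python
# Hypothetical helper that renders a prompt in the ChatML format
# used by Qwen (the same format as the Modelfile template above).

def chatml_prompt(user: str, system: str = "You are a helpful assistant") -> str:
    """Build a single-turn ChatML prompt string, leaving the assistant turn open."""
    parts = []
    if system:
        parts.append(f"<|im_start|>system\n{system}<|im_end|>\n")
    parts.append(f"<|im_start|>user\n{user}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # generation continues from here
    return "".join(parts)
```

Pass the result as the prompt to your llama.cpp frontend, with `<|im_end|>` and `<|im_start|>` as stop strings, matching the PARAMETER stop lines above.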
Author
Geoff Munn (@geoffmunn)
Hugging Face Profile
Disclaimer
This is a community conversion for local inference. Not affiliated with Alibaba Cloud or the Qwen team.
GGUF File List

| Filename | Size | Download |
|---|---|---|
| `Qwen3-4B-f16-imatrix-8843-coder.gguf` | 3.69 MB | Download |
| `Qwen3-4B-f16-imatrix-9343-generic.gguf` | 3.69 MB | Download |
| `Qwen3-4B-f16-imatrix:Q3_K_HIFI.gguf` | 2.15 GB | Download |
| `Qwen3-4B-f16-imatrix:Q3_K_M.gguf` | 1.93 GB | Download |
| `Qwen3-4B-f16-imatrix:Q3_K_S.gguf` | 1.76 GB | Download |
| `Qwen3-4B-f16-imatrix:Q4_K_HIFI.gguf` | 2.51 GB | Download |
| `Qwen3-4B-f16-imatrix:Q4_K_M.gguf` (Recommended) | 2.33 GB | Download |
| `Qwen3-4B-f16-imatrix:Q4_K_S.gguf` | 2.22 GB | Download |
| `Qwen3-4B-f16-imatrix:Q5_K_HIFI.gguf` | 2.67 GB | Download |
| `Qwen3-4B-f16-imatrix:Q5_K_M.gguf` | 2.69 GB | Download |
| `Qwen3-4B-f16-imatrix:Q5_K_S.gguf` | 2.63 GB | Download |
| `Qwen3-4B-f16:Q2_K.gguf` | 1.55 GB | Download |
| `Qwen3-4B-f16:Q3_K_HIFI.gguf` | 2.15 GB | Download |
| `Qwen3-4B-f16:Q3_K_M.gguf` | 1.93 GB | Download |
| `Qwen3-4B-f16:Q3_K_S.gguf` | 1.76 GB | Download |
| `Qwen3-4B-f16:Q4_K_HIFI.gguf` | 2.51 GB | Download |
| `Qwen3-4B-f16:Q4_K_M.gguf` | 2.33 GB | Download |
| `Qwen3-4B-f16:Q4_K_S.gguf` | 2.22 GB | Download |
| `Qwen3-4B-f16:Q5_K_HIFI.gguf` | 2.67 GB | Download |
| `Qwen3-4B-f16:Q5_K_M.gguf` | 2.69 GB | Download |
| `Qwen3-4B-f16:Q5_K_S.gguf` | 2.63 GB | Download |
| `Qwen3-4B-f16:Q6_K.gguf` | 3.08 GB | Download |
| `Qwen3-4B-f16:Q8_0.gguf` | 3.99 GB | Download |