---
base_model: Nanbeige/Nanbeige4.1-3B
tags:
- gguf
- llama
- nanbeige
- quantized
license: apache-2.0
language:
- en
- zh
---

# Nanbeige4.1-3B-GGUF

GGUF quantizations of Nanbeige/Nanbeige4.1-3B for use with llama.cpp, Ollama, and other GGUF-compatible tools.

## Available Quantizations

| File | Quant | Size | Description |
|------|-------|------|-------------|
| nanbeige4.1-3b-f16.gguf | F16 | 7.4 GB | Full precision (no quantization) |
| nanbeige4.1-3b-Q8_0.gguf | Q8_0 | 3.9 GB | Best quality, largest quantized size |
| nanbeige4.1-3b-Q6_K.gguf | Q6_K | 3.1 GB | Very high quality |
| nanbeige4.1-3b-Q5_K_M.gguf | Q5_K_M | 2.7 GB | High quality |
| nanbeige4.1-3b-Q4_K_M.gguf | Q4_K_M | 2.3 GB | Good quality, recommended for most users |
| nanbeige4.1-3b-Q3_K_M.gguf | Q3_K_M | 1.9 GB | Medium quality |
| nanbeige4.1-3b-Q2_K.gguf | Q2_K | 1.6 GB | Smallest size, lower quality (a user has reported it getting stuck in repetition loops) |
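As a rough guide to the size/quality tradeoff, the effective bits per weight of each file can be estimated from the file sizes and the 3.93 B total parameter count reported by llama-bench. A minimal sketch (sizes in GiB, taken from the benchmark tables below):

```python
# Estimate effective bits per weight: file size in bits / parameter count.
GIB = 1024 ** 3
N_PARAMS = 3.93e9  # total params as reported by llama-bench

sizes_gib = {
    "Q2_K": 1.51, "Q3_K_M": 1.87, "Q4_K_M": 2.27,
    "Q5_K_M": 2.63, "Q6_K": 3.01, "Q8_0": 3.89, "F16": 7.33,
}

for quant, size in sizes_gib.items():
    bits = size * GIB * 8 / N_PARAMS
    print(f"{quant}: {bits:.2f} bits/weight")
```

Q4_K_M works out to roughly 5 bits per weight, which is why it is usually the sweet spot between quality and size.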

## Usage

### Ollama

```shell
# Run a specific quantization directly from Hugging Face (e.g. Q4_K_M)
ollama run hf.co/tantk/Nanbeige4.1-3B-GGUF:Q4_K_M

# Or create a model from a downloaded file
ollama create nanbeige4.1-3b -f Modelfile
```
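For the local-file route, a Modelfile along these lines should work. This is a sketch, not the repository's official Modelfile: the chat template is embedded in the GGUF metadata and recent Ollama versions can pick it up automatically, but an explicit ChatML template is shown here for completeness.

```
FROM ./nanbeige4.1-3b-Q4_K_M.gguf
TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER stop <|im_end|>
```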

### llama.cpp

```shell
llama-cli -m nanbeige4.1-3b-Q4_K_M.gguf -p "Your prompt here" --temp 0.6 --top-p 0.95
```

## Model Details

- **Base Model:** Nanbeige/Nanbeige4.1-3B
- **Architecture:** LlamaForCausalLM
- **Parameters:** 3B-class (3.93 B total, as reported by llama.cpp)
- **Context Length:** 131,072 tokens
- **Chat Template:** ChatML (`<|im_start|>` / `<|im_end|>`)
- **License:** Apache 2.0
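llama.cpp and Ollama apply the ChatML template automatically from the GGUF metadata, but for clients that assemble prompts by hand, the general shape of ChatML looks like this (an illustrative sketch, not the model's exact template string):

```python
# Render a list of {role, content} messages into a ChatML prompt string.
def format_chatml(messages):
    prompt = ""
    for msg in messages:
        prompt += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    # Open an assistant turn to cue the model to respond.
    prompt += "<|im_start|>assistant\n"
    return prompt

print(format_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]))
```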

## Recommended Settings

- Temperature: 0.6
- Top-p: 0.95
- Repeat penalty: 1.0

## Benchmark Results

### Test Hardware

| Component | Spec |
|-----------|------|
| CPU | AMD Ryzen 5 5600G (6 cores / 12 threads, 3.9 GHz) |
| RAM | 32 GB DDR4-3200 (4× 8 GB Kingston) |
| GPU | NVIDIA GeForce RTX 4070 Ti (12 GB VRAM) |
| OS | Windows 11 Pro |

### CPU Benchmark (llama-bench)

- Backend: CPU
- Threads: 6
- Prompt tokens: 512 (pp512)
- Generation tokens: 128 (tg128)
- Repetitions: 3
- Tool: llama-bench (llama.cpp build 0c1f39a)

| Quant | Size | Params | Prompt (t/s) | Generation (t/s) |
|-------|------|--------|--------------|------------------|
| Q2_K | 1.51 GiB | 3.93 B | 47.14 ± 0.71 | 20.99 ± 1.04 |
| Q3_K_M | 1.87 GiB | 3.93 B | 40.23 ± 1.01 | 17.65 ± 0.25 |
| Q4_K_M | 2.27 GiB | 3.93 B | 67.80 ± 1.14 | 14.35 ± 0.52 |
| Q5_K_M | 2.63 GiB | 3.93 B | 29.68 ± 0.24 | 13.75 ± 0.17 |
| Q6_K | 3.01 GiB | 3.93 B | 33.76 ± 2.41 | 12.28 ± 0.06 |
| Q8_0 | 3.89 GiB | 3.93 B | 45.07 ± 0.41 | 9.07 ± 0.47 |
| F16 | 7.33 GiB | 3.93 B | 31.08 ± 0.75 | 5.22 ± 0.05 |

### GPU Benchmark (llama-bench)

- Backend: CUDA (RTX 4070 Ti, 100% GPU offload, ngl=99)
- Prompt tokens: 512 (pp512)
- Generation tokens: 128 (tg128)
- Repetitions: 3
- Tool: llama-bench (llama.cpp build 0c1f39a)

| Quant | Size | Params | Prompt (t/s) | Generation (t/s) |
|-------|------|--------|--------------|------------------|
| Q2_K | 1.51 GiB | 3.93 B | 7,904.89 ± 44.44 | 194.47 ± 1.68 |
| Q3_K_M | 1.87 GiB | 3.93 B | 9,233.97 ± 132.75 | 162.72 ± 1.04 |
| Q4_K_M | 2.27 GiB | 3.93 B | 9,977.17 ± 123.83 | 155.27 ± 0.21 |
| Q5_K_M | 2.63 GiB | 3.93 B | 8,060.71 ± 1484.42 | 139.18 ± 0.44 |
| Q6_K | 3.01 GiB | 3.93 B | 7,794.85 ± 1023.17 | 126.49 ± 0.83 |
| Q8_0 | 3.89 GiB | 3.93 B | 6,349.76 ± 698.63 | 102.88 ± 0.32 |
| F16 | 7.33 GiB | 3.93 B | 8,946.09 ± 230.61 | 60.75 ± 0.20 |
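One way to sanity-check the generation numbers: during token generation every weight is read roughly once per token, so generation rate × model size approximates the effective memory-read bandwidth. A quick sketch using two rows from the tables above (the resulting ~35 GiB/s CPU and ~400 GiB/s GPU figures are plausible for dual-channel DDR4-3200 and an RTX 4070 Ti, respectively, which is why generation speed scales almost inversely with file size):

```python
# (size_gib, cpu_tg, gpu_tg) per quant, taken from the benchmark tables.
results = {
    "Q4_K_M": (2.27, 14.35, 155.27),
    "Q8_0":   (3.89,  9.07, 102.88),
}

for quant, (size, cpu_tg, gpu_tg) in results.items():
    cpu_bw = size * cpu_tg  # approx. GiB/s read on the Ryzen 5 5600G
    gpu_bw = size * gpu_tg  # approx. GiB/s read on the RTX 4070 Ti
    print(f"{quant}: ~{cpu_bw:.0f} GiB/s CPU, ~{gpu_bw:.0f} GiB/s GPU")
```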

## Credits

Original model by Nanbeige. Quantized with llama.cpp.
