---
license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-32B-Instruct
tags:
  - bitnet
  - quantization
  - ternary
  - 1.58-bit
  - qwen
  - qwen2.5
  - code
  - experimental
  - 32b-architecture
library_name: safetensors
pipeline_tag: text-generation
language:
  - en
  - zh
model_name: Qwen2.5-Coder-32B-BitNet-1.58b
datasets: []
metrics: []
---

# Qwen2.5-Coder-32B-Instruct-BitNet-1.58b

**Architecture: 32 Billion Parameters | BitNet 1.58-bit Ternary Quantization**


> **IMPORTANT: Parameter Count Display**
>
> HuggingFace displays "9B params" because it counts packed bytes, not actual parameters.
> This model has the full 32B parameter Qwen2.5-Coder architecture. The weights are stored
> as ternary values ({-1, 0, +1}) packed 4 per byte, which reduces storage to 9.6 GB but
> preserves all 32 billion parameters.
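The 9.6 GB figure follows from the packing scheme. A back-of-envelope check (illustrative only; it assumes one FP16 scale per 64-weight group, which is not stated in this card, and ignores unquantized tensors such as embeddings and norms):

```python
# Rough storage estimate for 32B ternary weights packed 4 per byte,
# plus an assumed 2-byte (FP16) scale per 64-weight group.
PARAMS = 32e9
GROUP_SIZE = 64

packed_weights_gb = PARAMS / 4 / 1e9       # 4 ternary values per byte
scales_gb = PARAMS / GROUP_SIZE * 2 / 1e9  # one assumed 2-byte scale per group

print(f"packed weights: {packed_weights_gb:.1f} GB")  # 8.0 GB
print(f"group scales:   {scales_gb:.1f} GB")          # 1.0 GB
print(f"total:          {packed_weights_gb + scales_gb:.1f} GB")  # ~9.0 GB
```

The remaining fraction of the 9.6 GB is plausibly metadata and higher-precision tensors.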


## Overview

This is an experimental BitNet 1.58-bit quantization of the Qwen2.5-Coder-32B-Instruct model using absmean scaling with group-wise quantization. The model stores weights as ternary values ({-1, 0, +1}), packed 4 values per byte.

**This is research/experimental work. Quality and performance have not been formally benchmarked.**

## Specifications

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-Coder-32B-Instruct |
| Architecture | Qwen2 (Qwen2ForCausalLM) |
| Parameters | 32B (full architecture preserved) |
| Quantization | BitNet 1.58-bit ternary |
| Bits per Weight | ~1.58 |
| Group Size | 64 |
| Original Size | 65.53 GB (BF16) |
| Quantized Size | 9.6 GB (SafeTensors) |
| GGUF Size | 11 GB (TQ2_0) |
| Compression | ~6.4x |

## Formats

| Format | File | Description |
|---|---|---|
| SafeTensors | `model-*.safetensors` | Sharded quantized weights + scales |
| GGUF | `qwen2.5-coder-32b-TQ2_0.gguf` | llama.cpp TQ2_0 format (experimental) |

**GGUF Compatibility Note:** The GGUF conversion is experimental. Our BitNet quantization uses group size 64, while TQ2_0 uses 256-element blocks. This may cause compatibility issues with some inference engines. The SafeTensors format is the primary supported format.

## Quantization Method

### Algorithm

1. Reshape weights into groups of 64
2. Compute per-group scale: `scale = mean(|weights|)`
3. Normalize and round to nearest ternary: `q = round(w / scale)`, clamped to {-1, 0, +1}
4. Map to unsigned: {-1, 0, +1} → {0, 1, 2}
5. Pack 4 values per byte: `v0 + v1*3 + v2*9 + v3*27`
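The steps above can be sketched in pure Python (a minimal illustration only; the actual tool is a Rust/Candle implementation, and its rounding tie-break behavior may differ from Python's `round`):

```python
def quantize_group(weights):
    """Quantize one group of weights to ternary values plus an absmean scale."""
    scale = sum(abs(w) for w in weights) / len(weights)  # absmean scale
    if scale == 0:
        return [0] * len(weights), 0.0
    # round(w / scale), clamped to {-1, 0, +1}
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale

def pack_ternary(q):
    """Pack ternary values 4 per byte: v0 + v1*3 + v2*9 + v3*27."""
    packed = []
    for i in range(0, len(q), 4):
        chunk = q[i:i + 4]
        chunk += [0] * (4 - len(chunk))        # zero-pad a short final chunk
        u = [v + 1 for v in chunk]             # {-1,0,+1} -> {0,1,2}
        packed.append(u[0] + u[1] * 3 + u[2] * 9 + u[3] * 27)
    return bytes(packed)
```

For example, `quantize_group([0.9, -1.1, 0.05, 2.0])` yields ternary values `[1, -1, 0, 1]` with scale ≈ 1.0125, and packing those four values produces the single byte 65 (base-3 digits 2, 0, 1, 2).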

### Tooling

- Quantization: custom Rust tool using Candle
- GGUF conversion: llama.cpp `convert_hf_to_gguf.py`

### Hardware Used

- GPU: NVIDIA RTX 5080 (16GB VRAM)
- Quantization time: ~369 seconds (streaming mode)
- Memory: streaming mode with CPU fallback for large tensors (>3 GB threshold)

## Usage

### With Ollama/llama.cpp (experimental)

```bash
# llama.cpp (GGUF format - experimental, may have issues)
./llama-cli -m qwen2.5-coder-32b-TQ2_0.gguf -p "Write a Python function:"
```

### Unpacking Weights (Python)

```python
def unpack_ternary(packed_byte):
    """Unpack 4 ternary values from one byte (base-3 encoding)."""
    values = []
    val = packed_byte
    for _ in range(4):
        values.append((val % 3) - 1)  # {0,1,2} → {-1,0,+1}
        val //= 3
    return values
```
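To recover approximate weights, the unpacked digits must also be multiplied by the group's scale. A self-contained dequantization sketch (the function name is illustrative, not this repo's API):

```python
def dequantize_group(packed, scale, n):
    """Recover n approximate weights from base-3 packed bytes and a group scale."""
    out = []
    for byte in packed:
        val = byte
        for _ in range(4):
            out.append(((val % 3) - 1) * scale)  # {0,1,2} -> {-1,0,+1}, then rescale
            val //= 3
    return out[:n]  # drop any padding values from the last byte

print(dequantize_group(bytes([65]), 0.5, 4))  # [0.5, -0.5, 0.0, 0.5]
```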

## Limitations

- **Quality not benchmarked** - may show significant degradation vs. the original model
- **Requires a custom runtime** - standard `transformers` does not support ternary weights
- **Experimental** - not intended for production use without evaluation
- GGUF keeps embeddings/lm_head at F16, hence larger than SafeTensors
- HuggingFace may show an incorrect parameter count due to packed storage

## License

Apache 2.0 (inherited from Qwen2.5-Coder-32B-Instruct)

## Citation

```bibtex
@misc{qwen-coder-32b-bitnet-2025,
  title={Qwen2.5-Coder-32B-BitNet-1.58b: Experimental BitNet Quantization},
  author={Tzervas},
  year={2025},
  url={https://huggingface.co/tzervas/qwen2.5-coder-32b-bitnet-1.58b}
}
```
