---
license: mit
base_model: MiniMaxAI/MiniMax-M2.5
tags:
- gguf
- moe
- minimax
- llama.cpp
- applesilicon
- reasoning
- conversational
---

# MiniMax-M2.5-GGUF (230B MoE)

## Model Description
High-precision GGUF quants of the MiniMax-M2.5 (230B parameters) Mixture of Experts model. These versions are specifically optimized for local inference on high-RAM setups, particularly Apple Silicon (M3 Max/Ultra).
## Perplexity Validation (WikiText-2)

- Final PPL: 8.2213 +/- 0.09
- Context: 4096 tokens, 32 chunks
- Outcome: The Q3KL quantization maintains high logical coherence while boosting speed to 28.7 t/s, with minimal quality degradation for a ~20 GB size reduction versus Q4KM.
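The figures above can be reproduced with llama.cpp's `llama-perplexity` tool; a minimal sketch, assuming the Q3KL file and a local WikiText-2 raw test split (the paths shown are placeholders, not part of this release):

```shell
# Perplexity run matching this card's settings: 4096-token context, 32 chunks.
# Adjust -m and -f to wherever the model and dataset live on your machine.
./llama-perplexity \
  -m minimax-m2.5-Q3KL.gguf \
  -f wikitext-2-raw/wiki.test.raw \
  -c 4096 \
  --chunks 32
```

Expect the reported PPL to land near the value above; small run-to-run differences are normal across hardware and builds.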
## Available Quants

| File Name | Method | Size | Use Case |
|---|---|---|---|
| minimax-m2.5-Q4KM.gguf | Q4KM | 138 GB | Highest logic preservation. Requires >128 GB RAM or SSD swap. |
| minimax-m2.5-Q3KL.gguf | Q3KL | ~110 GB | Sweet spot for 128 GB Macs. Runs natively in RAM at ~28 t/s on an M3 Max. |
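As a quick sanity check on the table above, the effective bits per weight of each quant can be derived from its file size and the ~230B parameter count given on this card; a minimal sketch:

```python
def bits_per_weight(file_size_gb: float, n_params_b: float) -> float:
    """Effective bits per weight: file bytes * 8 / total parameter count."""
    return file_size_gb * 1e9 * 8 / (n_params_b * 1e9)

# Sizes from the table above; 230B total parameters.
for name, size_gb in [("Q4KM", 138), ("Q3KL", 110)]:
    print(f"{name}: {bits_per_weight(size_gb, 230):.2f} bits/weight")
# → Q4KM: 4.80 bits/weight
# → Q3KL: 3.83 bits/weight
```

These effective figures sit slightly above the nominal bit widths of the quant methods because GGUF files also carry higher-precision tensors (embeddings, norms) and metadata.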
## Model Details
- Architecture: MiniMax-M2 (Mixture of Experts) with 256 experts (8 active per token).
- Parameters: ~230B total.
- Quantization Process: Generated directly from a full F16 GGUF master (457 GB) rather than via automated conversion scripts, minimizing error accumulation during K-quantization.
- Context Window: Up to 196k tokens (Native support).
- Chat Template: Includes the official Jinja chat template for proper handling of interleaved reasoning tags, separating the model's reasoning from its final response.
## Usage
Requires llama.cpp build 8022 or higher.
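A typical interactive invocation might look like the sketch below; the context size and flag choices are assumptions to adapt to your setup, not part of this release:

```shell
# Interactive chat with the Q3KL quant on Apple Silicon.
# --jinja applies the chat template embedded in the GGUF (keeps reasoning
# separate from the final reply); -ngl 99 offloads all layers to Metal.
# -c 32768 is a conservative context; raise it toward 196k if RAM allows.
./llama-cli -m minimax-m2.5-Q3KL.gguf --jinja -c 32768 -ngl 99 --color -i
```

For serving an OpenAI-compatible endpoint instead, `llama-server` accepts the same `-m`, `--jinja`, `-c`, and `-ngl` flags.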