---
license: apache-2.0
language:
- en
base_model:
- PowerInfer/SmallThinker-21BA3B-Instruct
---

# SmallThinker-21BA3B-Instruct-GGUF

## Model Description

- GGUF models with the `.gguf` suffix can be used with the llama.cpp framework.
- GGUF models with the `.powerinfer.gguf` suffix are integrated with fused sparse FFN operators and sparse LM head operators. These models are only compatible with the PowerInfer framework.
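As a sketch of the llama.cpp path (assuming `llama-cli` is built and on your `PATH`, and that the recommended Q4_0 file from the list below has already been downloaded to the working directory; the local filename and generation flags here are illustrative):

```shell
# Sketch: running a .gguf file with llama.cpp's CLI.
MODEL=SmallThinker-21B-A3B-Instruct-QAT.Q4_0.gguf

if command -v llama-cli >/dev/null 2>&1; then
  # -m: model file, -p: prompt, -n: max number of tokens to generate
  llama-cli -m "$MODEL" -p "Give me a short introduction to large language models." -n 256
else
  echo "llama-cli not found; build llama.cpp first"
fi
```

The `.powerinfer.gguf` variants should instead be loaded with the PowerInfer runtime, since their fused sparse operators are not understood by stock llama.cpp.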
## Introduction

🤗 Hugging Face | 🤖 ModelScope | 📑 Technical Report
SmallThinker is a family of on-device native Mixture-of-Experts (MoE) language models specially designed for local deployment,
co-developed by the IPADS and School of AI at Shanghai Jiao Tong University and Zenergize AI.
Designed from the ground up for resource-constrained environments,
SmallThinker brings powerful, private, and low-latency AI directly to your personal devices,
without relying on the cloud.
## Performance
Note: The model is trained mainly on English.
| Model | MMLU | GPQA-diamond | MATH-500 | IFEVAL | LIVEBENCH | HUMANEVAL | Average |
|---|---|---|---|---|---|---|---|
| SmallThinker-21BA3B-Instruct | 84.43 | 55.05 | 82.4 | 85.77 | 60.3 | 89.63 | 76.26 |
| Gemma3-12b-it | 78.52 | 34.85 | 82.4 | 74.68 | 44.5 | 82.93 | 66.31 |
| Qwen3-14B | 84.82 | 50 | 84.6 | 85.21 | 59.5 | 88.41 | 75.42 |
| Qwen3-30BA3B | 85.1 | 44.4 | 84.4 | 84.29 | 58.8 | 90.24 | 74.54 |
| Qwen3-8B | 81.79 | 38.89 | 81.6 | 83.92 | 49.5 | 85.9 | 70.26 |
| Phi-4-14B | 84.58 | 55.45 | 80.2 | 63.22 | 42.4 | 87.2 | 68.84 |
All models are evaluated in non-thinking mode.
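The "Average" column is the unweighted mean of the six benchmark scores. A small sketch recomputing it for a few rows (scores transcribed from the table above):

```python
# Recompute the "Average" column from the six per-benchmark scores
# (MMLU, GPQA-diamond, MATH-500, IFEVAL, LIVEBENCH, HUMANEVAL).
scores = {
    "SmallThinker-21BA3B-Instruct": [84.43, 55.05, 82.4, 85.77, 60.3, 89.63],
    "Qwen3-14B": [84.82, 50, 84.6, 85.21, 59.5, 88.41],
    "Qwen3-30BA3B": [85.1, 44.4, 84.4, 84.29, 58.8, 90.24],
}

def average(vals):
    """Unweighted mean, rounded to two decimals as in the table."""
    return round(sum(vals) / len(vals), 2)

for model, vals in scores.items():
    print(f"{model}: {average(vals)}")
```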
## Speed

Decoding speed (tokens/s) on representative devices:

| Model | Memory (GiB) | i9-14900 | OnePlus 13 (8 Gen 4) | RK3588 (16 GB) | Raspberry Pi 5 |
|---|---|---|---|---|---|
| SmallThinker 21B+sparse | 11.47 | 30.19 | 23.03 | 10.84 | 6.61 |
| SmallThinker 21B+sparse+limited memory | limit 8G | 20.30 | 15.50 | 8.56 | - |
| Qwen3 30B A3B | 16.20 | 33.52 | 20.18 | 9.07 | - |
| Qwen3 30B A3B+limited memory | limit 8G | 10.11 | 0.18 | 6.32 | - |
| Gemma 3n E2B | 1G, theoretically | 36.88 | 27.06 | 12.50 | 6.66 |
| Gemma 3n E4B | 2G, theoretically | 21.93 | 16.58 | 7.37 | 4.01 |
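One way to read the table: the sparse operators let SmallThinker keep most of its speed when memory is capped at 8 GiB, while a dense-offloaded MoE degrades sharply. A back-of-envelope sketch using the i9-14900 column (numbers transcribed from the table):

```python
# Fraction of full-memory decode speed retained under the 8 GiB limit,
# i9-14900 column of the speed table above.
full = {"SmallThinker 21B+sparse": 30.19, "Qwen3 30B A3B": 33.52}
limited = {"SmallThinker 21B+sparse": 20.30, "Qwen3 30B A3B": 10.11}

def retention(model):
    """Percent of unconstrained speed kept with memory limited to 8 GiB."""
    return round(limited[model] / full[model] * 100, 1)

for m in full:
    print(f"{m}: {retention(m)}% of full speed")
```

SmallThinker retains roughly two thirds of its throughput under the cap, versus under a third for Qwen3 30B A3B.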
## Model Card
| Architecture | Mixture-of-Experts (MoE) |
|---|---|
| Total Parameters | 21B |
| Activated Parameters | 3B |
| Number of Layers | 52 |
| Attention Hidden Dimension | 2560 |
| MoE Hidden Dimension (per Expert) | 768 |
| Number of Attention Heads | 28 |
| Number of KV Heads | 4 |
| Number of Experts | 64 |
| Selected Experts per Token | 6 |
| Vocabulary Size | 151,936 |
| Context Length | 16K |
| Attention Mechanism | GQA |
| Activation Function | ReGLU |
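The total-vs-activated parameter gap follows directly from the routing numbers above: only 6 of 64 experts run per token, so most MoE FFN weights sit idle on any given step. A back-of-envelope sketch (note the activated count is ~3B rather than 21B × 6/64 because attention, embeddings, and the LM head are always active):

```python
# Back-of-envelope: fraction of expert weights touched per token,
# from the model-card routing configuration.
total_experts = 64
selected_experts_per_token = 6

expert_fraction = selected_experts_per_token / total_experts
print(f"Active expert fraction per token: {expert_fraction:.4f}")
```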
## How to Run

### Transformers

`transformers==4.53.3` is required; we are actively working to support the latest version.

The following code snippet illustrates how to use the model to generate content from given inputs.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

path = "PowerInfer/SmallThinker-21BA3B-Instruct"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True
)

messages = [
    {"role": "user", "content": "Give me a short introduction to large language model."},
]
model_inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(device)

model_outputs = model.generate(
    model_inputs,
    do_sample=True,
    max_new_tokens=1024,
)
# Strip the prompt tokens so only the newly generated text is decoded.
output_token_ids = [
    model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs))
]
response = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(response)
```
### ModelScope

ModelScope provides a Python API similar to (though not entirely identical to) Transformers. For basic usage, simply change the first line of the code above to:

```python
from modelscope import AutoModelForCausalLM, AutoTokenizer
```
## Statement

- Due to the constraints of its model size and the limitations of its training data, SmallThinker's responses may contain factual inaccuracies, biases, or outdated information.
- Users bear full responsibility for independently evaluating and verifying the accuracy and appropriateness of all generated content.
- SmallThinker does not possess genuine comprehension or consciousness and cannot express personal opinions or value judgments.
## GGUF File List

| Filename | Quantization | Size |
|---|---|---|
| SmallThinker-21B-A3B-Instruct-QAT.Q4_0.gguf (recommended) | Q4 | 11.39 GB |
| SmallThinker-21B-A3B-Instruct-QAT.Q4_K_M.gguf | Q4 | 12.19 GB |
| SmallThinker-21B-A3B-Instruct-QAT.Q4_K_S.gguf | Q4 | 11.48 GB |
| SmallThinker-21B-A3B-Instruct.F16.gguf | FP16 | 40.08 GB |
| SmallThinker-21B-A3B-Instruct.IQ4_NL.gguf | Q4 | 11.49 GB |
| SmallThinker-21B-A3B-Instruct.IQ4_XS.gguf | Q4 | 10.9 GB |
| SmallThinker-21B-A3B-Instruct.IQ4_XS.imatrix.gguf | Q4 | 10.79 GB |
| SmallThinker-21B-A3B-Instruct.Q3_K.gguf | Q3 | 9.7 GB |
| SmallThinker-21B-A3B-Instruct.Q3_K.imatrix.gguf | Q3 | 9.7 GB |
| SmallThinker-21B-A3B-Instruct.Q3_K_S.gguf | Q3 | 8.78 GB |
| SmallThinker-21B-A3B-Instruct.Q3_K_S.imatrix.gguf | Q3 | 8.78 GB |
| SmallThinker-21B-A3B-Instruct.Q4_0.gguf | Q4 | 11.39 GB |
| SmallThinker-21B-A3B-Instruct.Q4_0.imatrix.gguf | Q4 | 11.44 GB |
| SmallThinker-21B-A3B-Instruct.Q4_0.powerinfer.gguf | Q4 | 11.31 GB |
| SmallThinker-21B-A3B-Instruct.Q4_1.gguf | Q4 | 12.62 GB |
| SmallThinker-21B-A3B-Instruct.Q4_1.imatrix.gguf | Q4 | 12.62 GB |
| SmallThinker-21B-A3B-Instruct.Q4_K.gguf | Q4 | 12.19 GB |
| SmallThinker-21B-A3B-Instruct.Q4_K.imatrix.gguf | Q4 | 12.19 GB |
| SmallThinker-21B-A3B-Instruct.Q4_K_S.gguf | Q4 | 11.48 GB |
| SmallThinker-21B-A3B-Instruct.Q4_K_S.imatrix.gguf | Q4 | 11.48 GB |
| SmallThinker-21B-A3B-Instruct.Q5_K.gguf | Q5 | 14.26 GB |
| SmallThinker-21B-A3B-Instruct.Q5_K.imatrix.gguf | Q5 | 14.26 GB |
| SmallThinker-21B-A3B-Instruct.Q6_K.gguf | Q6 | 16.46 GB |
| SmallThinker-21B-A3B-Instruct.Q6_K.imatrix.gguf | Q6 | 16.46 GB |
| SmallThinker-21B-A3B-Instruct.Q8_0.gguf | Q8 | 21.31 GB |
| SmallThinker-21B-A3B-Instruct.Q8_0.imatrix.gguf | Q8 | 21.31 GB |