---
license: apache-2.0
language:
- en
- zh
tags:
- qwen3
- unsloth
- text-generation
- reasoning
- math
- grpo
- sft
- distillation
- conversational
---

# DASD-4B-Thinking-2507-stage2

## Model Description

DASD-4B-Thinking-2507-stage2 is the final model in a three-stage training pipeline built upon Qwen/Qwen3-4B-Thinking-2507. It combines reinforcement learning via GRPO with a two-stage Supervised Fine-Tuning (SFT) strategy inspired by the Distribution-Aligned Sequence Distillation (DASD) methodology introduced by Alibaba Cloud Apsara Lab, resulting in a compact 4B model with enhanced mathematical reasoning and long chain-of-thought capabilities.
## 🧬 Training Pipeline Overview

This model is the culmination of three sequential training stages:

```
Qwen/Qwen3-4B-Thinking-2507
        │
        ▼  Stage 0: GRPO (RL on math & reasoning)
DASD-4B-Thinking-2507-GRPO-v2
        │
        ▼  Stage 1: SFT with low-temperature (T=0.6) distillation data
DASD-4B-Thinking-2507-stage1
        │
        ▼  Stage 2: SFT with default-temperature (T=1.0) distillation data
DASD-4B-Thinking-2507-stage2   ← (this model)
```
## 📚 Stage Details

### Stage 0 — GRPO Reinforcement Learning: DASD-4B-Thinking-2507-GRPO-v2
Starting from the base model Qwen/Qwen3-4B-Thinking-2507, Group Relative Policy Optimization (GRPO) was applied using a high-quality mathematical reasoning dataset distilled from DeepSeek-R1. This stage significantly improved the model's:
- Correctness on math problem solving
- Step-by-step logical reasoning
- Reward signal alignment for verifiable tasks
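The heart of GRPO is its critic-free, group-relative advantage: for each prompt, several completions are sampled, and each completion's advantage is its reward standardized against the group's mean and deviation. Below is a minimal sketch of just that normalization step (illustrative only; production trainers such as TRL's GRPO implementation add policy-ratio clipping and KL regularization on top):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: z-score each completion's reward
    against the mean and standard deviation of its own sampling group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mu) / sigma for r in rewards]

# Toy group of 4 completions sampled for one math prompt, rewarded 1.0
# when the final answer verifies as correct and 0.0 otherwise.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # → [1.0, -1.0, 1.0, -1.0]
```

Because rewards on verifiable math tasks are often binary, this within-group standardization is what turns a sparse correctness signal into a usable learning signal.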
### Stage 1 — Low-Temperature SFT: DASD-4B-Thinking-2507-stage1
Inspired by the Distribution-Aligned Sequence Distillation (DASD) pipeline from Alibaba-Apsara, Stage 1 SFT was performed using the low-temperature subset (T=0.6) of the Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b dataset.
#### 💡 Why Low-Temperature Distillation for Small Models?
Low-temperature sampling from the teacher model (gpt-oss-120b) produces sharper, more deterministic output distributions, which are significantly easier for small student models to imitate and internalize. This "cold-start" strategy:
- Reduces distributional mismatch between teacher and student — the cleaner, more peaked distributions generated at low temperature align better with what a small model can currently express
- Provides a stable foundation — the model first learns the most consistent and representative reasoning patterns before being exposed to more diverse trajectories
- Boosts early performance rapidly — low-temperature data provides an efficient jump-start for math and scientific reasoning benchmarks
- Mitigates exposure bias — by gradually introducing complexity, the model avoids overfitting to noisy or outlier reasoning traces
This is the key insight behind DASD's temperature-scheduled learning: start cold for stability, then warm up for diversity.
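The sharpening effect is visible directly in the softmax: dividing the teacher's logits by T < 1 concentrates probability mass on the top tokens. A toy illustration with made-up logits (no real model involved):

```python
import math

def softmax_with_temperature(logits, t):
    """Temperature-scaled softmax; t < 1 sharpens, t > 1 flattens."""
    scaled = [x / t for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5, 0.1]  # arbitrary example logits
p_cold = softmax_with_temperature(logits, 0.6)
p_warm = softmax_with_temperature(logits, 1.0)
# The cold distribution concentrates more mass on its top token,
# giving the small student a cleaner target to imitate.
print(round(max(p_cold), 3), round(max(p_warm), 3))
```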
Dataset used:
`Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b` — Stage 1 split (low temperature, T=0.6)
### Stage 2 — Default-Temperature SFT: DASD-4B-Thinking-2507-stage2 (this model)
Building on DASD-4B-Thinking-2507-stage1, Stage 2 SFT was performed using the default-temperature subset (T=1.0) of the same dataset. Higher-temperature data introduces greater lexical diversity and broader mode coverage, enabling the model to generalize better across diverse reasoning patterns and problem domains.
Dataset used:
`Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b` — Stage 2 split (default temperature, T=1.0)
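The diversity gain from moving to T=1.0 can be framed in terms of entropy: raising the temperature flattens the teacher's token distribution and increases its Shannon entropy, so sampled trajectories cover more modes. A toy illustration with arbitrary logits (for intuition only, not measured on any model):

```python
import math

def softmax(logits, t=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(x / t for x in logits)
    exps = [math.exp(x / t - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    """Shannon entropy in nats; higher means flatter, more diverse sampling."""
    return -sum(x * math.log(x) for x in p if x > 0)

logits = [2.0, 1.0, 0.5, 0.1]
h_cold = entropy(softmax(logits, t=0.6))  # Stage 1 regime
h_warm = entropy(softmax(logits, t=1.0))  # Stage 2 regime
print(h_cold < h_warm)  # → True: the T=1.0 distribution is more diverse
```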
## 🗂️ All Datasets Used
| Stage | Dataset | Purpose |
|---|---|---|
| GRPO (RL) | a-m-team/AM-DeepSeek-R1-Distilled-1.4M | Math & reasoning RL training via GRPO |
| SFT Stage 1 | Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b (Stage 1, T=0.6) | Low-temp distillation, stable cold-start |
| SFT Stage 2 | Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b (Stage 2, T=1.0) | High-temp distillation, diversity & generalization |
The Superior-Reasoning-SFT-gpt-oss-120b dataset itself is built from the following upstream question sources:

- nvidia/AceReason-1.1-SFT
- nvidia/OpenCodeReasoning
- nvidia/OpenScienceReasoning-2
- a-m-team/AM-DeepSeek-R1-Distilled-1.4M
## 🏃 Quickstart
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Jackrong/DASD-4B-Thinking-2507-stage2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "user", "content": "Solve: find all real solutions to x^3 - 6x^2 + 11x - 6 = 0."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=4096)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
Tip: This model naturally generates reasoning traces before the final answer. You can parse these to inspect the chain of thought.
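Qwen3 thinking models conventionally end the reasoning section with a `</think>` tag (the opening `<think>` is typically injected by the chat template, so it may not appear in the decoded output). A small helper along these lines can separate the two parts (`split_reasoning` is a hypothetical name, not part of any library):

```python
def split_reasoning(text, close_tag="</think>"):
    """Split a generation into (reasoning, answer), assuming the model
    closes its chain-of-thought with `close_tag`. If the tag is absent,
    the whole text is treated as the answer."""
    head, sep, tail = text.partition(close_tag)
    if not sep:
        return "", text.strip()
    return head.strip(), tail.strip()

reasoning, answer = split_reasoning("Factor the cubic.</think>The roots are 1, 2, 3.")
print(answer)  # → The roots are 1, 2, 3.
```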
## 📋 Model Details
| Attribute | Value |
|---|---|
| Base Model | Qwen/Qwen3-4B-Thinking-2507 |
| Architecture | Qwen3 (4B Dense) |
| License | Apache 2.0 |
| Language(s) | English, Chinese |
| Training Framework | Unsloth + Hugging Face TRL |
| RL Algorithm | GRPO (Group Relative Policy Optimization) |
| Fine-tuning Method | SFT (Two-stage temperature-scheduled distillation) |
| Developed by | Jackrong |
## ⚠️ Limitations & Intended Use
- This model is intended for research and educational purposes related to reasoning and mathematical problem-solving.
- While mathematical and logical reasoning capabilities have been enhanced, the model may still produce incorrect answers — always verify outputs on critical tasks.
- The model inherits the capabilities and limitations of the underlying `Qwen3-4B-Thinking-2507` architecture.
- Not intended for deployment in high-stakes applications without additional safety evaluation.
## 📎 Related Models
| Model | Description |
|---|---|
| Qwen/Qwen3-4B-Thinking-2507 | Base model |
| Jackrong/DASD-4B-Thinking-2507-GRPO-v2 | After GRPO RL training |
| Jackrong/DASD-4B-Thinking-2507-stage1 | After low-temperature SFT |
| Jackrong/DASD-4B-Thinking-2507-stage2 | This model — final stage |
## 🙏 Acknowledgements

- Alibaba Cloud Apsara Lab for the DASD methodology and the `Superior-Reasoning-SFT-gpt-oss-120b` dataset
- AM-Team for the DeepSeek-R1 distilled dataset
- NVIDIA for open reasoning datasets
- Unsloth for efficient fine-tuning infrastructure
- Qwen Team for the excellent base model
## GGUF File List

| 📁 Filename | Quantization | 📦 Size |
|---|---|---|
| DASD-4B-Thinking-2507-stage2-BF16.gguf | BF16 | 7.5 GB |
| DASD-4B-Thinking-2507-stage2-IQ4_XS.gguf | IQ4_XS | 2.13 GB |
| DASD-4B-Thinking-2507-stage2-Q2_K.gguf | Q2_K | 1.55 GB |
| DASD-4B-Thinking-2507-stage2-Q3_K_L.gguf | Q3_K_L | 2.09 GB |
| DASD-4B-Thinking-2507-stage2-Q3_K_M.gguf | Q3_K_M | 1.93 GB |
| DASD-4B-Thinking-2507-stage2-Q3_K_S.gguf | Q3_K_S | 1.76 GB |
| DASD-4B-Thinking-2507-stage2-Q4_K_M.gguf (recommended) | Q4_K_M | 2.33 GB |
| DASD-4B-Thinking-2507-stage2-Q4_K_S.gguf | Q4_K_S | 2.22 GB |
| DASD-4B-Thinking-2507-stage2-Q5_K_M.gguf | Q5_K_M | 2.69 GB |
| DASD-4B-Thinking-2507-stage2-Q5_K_S.gguf | Q5_K_S | 2.63 GB |
| DASD-4B-Thinking-2507-stage2-Q6_K.gguf | Q6_K | 3.08 GB |
| DASD-4B-Thinking-2507-stage2-Q8_0.gguf | Q8_0 | 3.99 GB |