---
license: apache-2.0
language:
- en
- zh
tags:
- qwen3
- unsloth
- text-generation
- reasoning
- math
- grpo
- sft
- distillation
- conversational
---

# DASD-4B-Thinking-2507-stage2

## Model Description

DASD-4B-Thinking-2507-stage2 is the final model in a three-stage training pipeline built upon Qwen/Qwen3-4B-Thinking-2507. It combines reinforcement learning via GRPO with a two-stage Supervised Fine-Tuning (SFT) strategy inspired by the Distribution-Aligned Sequence Distillation (DASD) methodology introduced by Alibaba Cloud Apsara Lab, resulting in a compact 4B model with enhanced mathematical reasoning and long chain-of-thought capabilities.
## 🧬 Training Pipeline Overview

This model is the culmination of three sequential training stages:

```
Qwen/Qwen3-4B-Thinking-2507
        │
        ▼  Stage 0: GRPO (RL on math & reasoning)
DASD-4B-Thinking-2507-GRPO-v2
        │
        ▼  Stage 1: SFT with low-temperature (T=0.6) distillation data
DASD-4B-Thinking-2507-stage1
        │
        ▼  Stage 2: SFT with default-temperature (T=1.0) distillation data
DASD-4B-Thinking-2507-stage2   ← (this model)
```
## 📚 Stage Details

### Stage 0 — GRPO Reinforcement Learning: DASD-4B-Thinking-2507-GRPO-v2
Starting from the base model Qwen/Qwen3-4B-Thinking-2507, Group Relative Policy Optimization (GRPO) was applied using a high-quality mathematical reasoning dataset distilled from DeepSeek-R1. This stage significantly improved the model's:
- Correctness on math problem solving
- Step-by-step logical reasoning
- Reward signal alignment for verifiable tasks
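The heart of GRPO is its critic-free, group-relative advantage: for each prompt, several completions are sampled, and each completion's advantage is its reward standardized against the group's mean and deviation. Below is a minimal sketch of just that normalization step (illustrative only; production trainers such as TRL's GRPO implementation add policy-ratio clipping and KL regularization on top):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: z-score each completion's reward
    against the mean and standard deviation of its own sampling group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mu) / sigma for r in rewards]

# Toy group of 4 completions sampled for one math prompt, rewarded 1.0
# when the final answer verifies as correct and 0.0 otherwise.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # → [1.0, -1.0, 1.0, -1.0]
```

Because rewards on verifiable math tasks are often binary, this within-group standardization is what turns a sparse correctness signal into a usable learning signal.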
### Stage 1 — Low-Temperature SFT: DASD-4B-Thinking-2507-stage1
Inspired by the Distribution-Aligned Sequence Distillation (DASD) pipeline from Alibaba-Apsara, Stage 1 SFT was performed using the low-temperature subset (T=0.6) of the Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b dataset.
#### 💡 Why Low-Temperature Distillation for Small Models?
Low-temperature sampling from the teacher model (gpt-oss-120b) produces sharper, more deterministic output distributions, which are significantly easier for small student models to imitate and internalize. This "cold-start" strategy:
- Reduces distributional mismatch between teacher and student — the cleaner, more peaked distributions generated at low temperature align better with what a small model can currently express
- Provides a stable foundation — the model first learns the most consistent and representative reasoning patterns before being exposed to more diverse trajectories
- Boosts early performance rapidly — low-temperature data provides an efficient jump-start for math and scientific reasoning benchmarks
- Mitigates exposure bias — by gradually introducing complexity, the model avoids overfitting to noisy or outlier reasoning traces
This is the key insight behind DASD's temperature-scheduled learning: start cold for stability, then warm up for diversity.
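The sharpening effect is visible directly in the softmax: dividing the teacher's logits by T < 1 concentrates probability mass on the top tokens. A toy illustration with made-up logits (no real model involved):

```python
import math

def softmax_with_temperature(logits, t):
    """Temperature-scaled softmax; t < 1 sharpens, t > 1 flattens."""
    scaled = [x / t for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5, 0.1]  # arbitrary example logits
p_cold = softmax_with_temperature(logits, 0.6)
p_warm = softmax_with_temperature(logits, 1.0)
# The cold distribution concentrates more mass on its top token,
# giving the small student a cleaner target to imitate.
print(round(max(p_cold), 3), round(max(p_warm), 3))
```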
Dataset used:
`Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b` — Stage 1 split (low temperature, T=0.6)
### Stage 2 — Default-Temperature SFT: DASD-4B-Thinking-2507-stage2 (this model)
Building on DASD-4B-Thinking-2507-stage1, Stage 2 SFT was performed using the default-temperature subset (T=1.0) of the same dataset. Higher-temperature data introduces greater lexical diversity and broader mode coverage, enabling the model to generalize better across diverse reasoning patterns and problem domains.
Dataset used:
`Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b` — Stage 2 split (default temperature, T=1.0)
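The diversity gain from moving to T=1.0 can be framed in terms of entropy: raising the temperature flattens the teacher's token distribution and increases its Shannon entropy, so sampled trajectories cover more modes. A toy illustration with arbitrary logits (for intuition only, not measured on any model):

```python
import math

def softmax(logits, t=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(x / t for x in logits)
    exps = [math.exp(x / t - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    """Shannon entropy in nats; higher means flatter, more diverse sampling."""
    return -sum(x * math.log(x) for x in p if x > 0)

logits = [2.0, 1.0, 0.5, 0.1]
h_cold = entropy(softmax(logits, t=0.6))  # Stage 1 regime
h_warm = entropy(softmax(logits, t=1.0))  # Stage 2 regime
print(h_cold < h_warm)  # → True: the T=1.0 distribution is more diverse
```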
## 🗂️ All Datasets Used
| Stage | Dataset | Purpose |
|---|---|---|
| GRPO (RL) | a-m-team/AM-DeepSeek-R1-Distilled-1.4M | Math & reasoning RL training via GRPO |
| SFT Stage 1 | Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b (Stage 1, T=0.6) | Low-temp distillation, stable cold-start |
| SFT Stage 2 | Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b (Stage 2, T=1.0) | High-temp distillation, diversity & generalization |
The Superior-Reasoning-SFT-gpt-oss-120b dataset itself is built from the following upstream question sources:

- nvidia/AceReason-1.1-SFT
- nvidia/OpenCodeReasoning
- nvidia/OpenScienceReasoning-2
- a-m-team/AM-DeepSeek-R1-Distilled-1.4M
## 🏃 Quickstart
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Jackrong/DASD-4B-Thinking-2507-stage2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "user", "content": "Solve: find all real solutions to x^3 - 6x^2 + 11x - 6 = 0."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=4096)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
Tip: This model naturally generates reasoning traces before the final answer. You can parse these to inspect the chain of thought.
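Qwen3 thinking models conventionally end the reasoning section with a `</think>` tag (the opening `<think>` is typically injected by the chat template, so it may not appear in the decoded output). A small helper along these lines can separate the two parts (`split_reasoning` is a hypothetical name, not part of any library):

```python
def split_reasoning(text, close_tag="</think>"):
    """Split a generation into (reasoning, answer), assuming the model
    closes its chain-of-thought with `close_tag`. If the tag is absent,
    the whole text is treated as the answer."""
    head, sep, tail = text.partition(close_tag)
    if not sep:
        return "", text.strip()
    return head.strip(), tail.strip()

reasoning, answer = split_reasoning("Factor the cubic.</think>The roots are 1, 2, 3.")
print(answer)  # → The roots are 1, 2, 3.
```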
## 📋 Model Details
| Attribute | Value |
|---|---|
| Base Model | Qwen/Qwen3-4B-Thinking-2507 |
| Architecture | Qwen3 (4B Dense) |
| License | Apache 2.0 |
| Language(s) | English, Chinese |
| Training Framework | Unsloth + Hugging Face TRL |
| RL Algorithm | GRPO (Group Relative Policy Optimization) |
| Fine-tuning Method | SFT (Two-stage temperature-scheduled distillation) |
| Developed by | Jackrong |
## ⚠️ Limitations & Intended Use
- This model is intended for research and educational purposes related to reasoning and mathematical problem-solving.
- While mathematical and logical reasoning capabilities have been enhanced, the model may still produce incorrect answers — always verify outputs on critical tasks.
- The model inherits the capabilities and limitations of the underlying `Qwen3-4B-Thinking-2507` architecture.
- Not intended for deployment in high-stakes applications without additional safety evaluation.
## 📎 Related Models
| Model | Description |
|---|---|
| Qwen/Qwen3-4B-Thinking-2507 | Base model |
| Jackrong/DASD-4B-Thinking-2507-GRPO-v2 | After GRPO RL training |
| Jackrong/DASD-4B-Thinking-2507-stage1 | After low-temperature SFT |
| Jackrong/DASD-4B-Thinking-2507-stage2 | This model — final stage |
## 🙏 Acknowledgements

- Alibaba Cloud Apsara Lab for the DASD methodology and the `Superior-Reasoning-SFT-gpt-oss-120b` dataset
- AM-Team for the DeepSeek-R1 distilled dataset
- NVIDIA for open reasoning datasets
- Unsloth for efficient fine-tuning infrastructure
- Qwen Team for the excellent base model
## GGUF File List

| 📁 Filename | Quantization | 📦 Size |
|---|---|---|
| DASD-4B-Thinking-2507-stage2-BF16.gguf | BF16 | 7.5 GB |
| DASD-4B-Thinking-2507-stage2-IQ4_XS.gguf | IQ4_XS | 2.13 GB |
| DASD-4B-Thinking-2507-stage2-Q2_K.gguf | Q2_K | 1.55 GB |
| DASD-4B-Thinking-2507-stage2-Q3_K_L.gguf | Q3_K_L | 2.09 GB |
| DASD-4B-Thinking-2507-stage2-Q3_K_M.gguf | Q3_K_M | 1.93 GB |
| DASD-4B-Thinking-2507-stage2-Q3_K_S.gguf | Q3_K_S | 1.76 GB |
| DASD-4B-Thinking-2507-stage2-Q4_K_M.gguf (recommended) | Q4_K_M | 2.33 GB |
| DASD-4B-Thinking-2507-stage2-Q4_K_S.gguf | Q4_K_S | 2.22 GB |
| DASD-4B-Thinking-2507-stage2-Q5_K_M.gguf | Q5_K_M | 2.69 GB |
| DASD-4B-Thinking-2507-stage2-Q5_K_S.gguf | Q5_K_S | 2.63 GB |
| DASD-4B-Thinking-2507-stage2-Q6_K.gguf | Q6_K | 3.08 GB |
| DASD-4B-Thinking-2507-stage2-Q8_0.gguf | Q8_0 | 3.99 GB |