Model Description
library_name: vllm
language: en, fr, es, de, it, pt, nl, zh, ja, ko, ar
base_model: mistralai/Ministral-3-14B-Reasoning-2512
tags: mistral-common, mistral, unsloth
See our Ministral 3 collection for all versions including GGUF, 4-bit & FP8 formats.
Learn to run Ministral correctly - Read our Guide.
See Unsloth Dynamic 2.0 GGUFs for our quantization benchmarks.
✨ Read our Ministral 3 Guide here!
- Fine-tune Ministral 3 for free using our Google Colab notebook
- Or train Ministral 3 with reinforcement learning (GSPO) with our free notebook.
- View the rest of our notebooks in our docs here.
Ministral 3 14B Reasoning 2512
The largest model in the Ministral 3 family, Ministral 3 14B offers frontier capabilities and performance comparable to its larger Mistral Small 3.2 24B counterpart. It is a powerful and efficient language model with vision capabilities.
This model is the reasoning post-trained version, making it ideal for math, coding, and other STEM-related use cases.
The Ministral 3 family is designed for edge deployment and is capable of running on a wide range of hardware. Ministral 3 14B can even be deployed locally, fitting in 32 GB of VRAM in BF16 and in less than 24 GB of RAM/VRAM when quantized.
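As a rough sanity check on those numbers (a back-of-the-envelope sketch, not an official sizing guide), weight memory is roughly parameter count times bytes per parameter, using the 13.5B + 0.4B split listed under Key Features; the ~4.5 bits/parameter figure for 4-bit quantization is an assumption, since real quantized files keep some tensors at higher precision:

```python
# Back-of-the-envelope weight-memory estimate for Ministral 3 14B
# (13.5B language model + 0.4B vision encoder). Ignores KV cache,
# activations and runtime overhead, so real usage is somewhat higher.
PARAMS = 13.5e9 + 0.4e9  # total parameters

def weight_gib(params: float, bits_per_param: float) -> float:
    """Approximate weight memory in GiB at a given precision."""
    return params * bits_per_param / 8 / 1024**3

print(f"BF16 (16 bits/param): ~{weight_gib(PARAMS, 16):.1f} GiB")            # ~25.9 GiB -> fits in 32 GB of VRAM
print(f"4-bit (~4.5 bits/param, assumed): ~{weight_gib(PARAMS, 4.5):.1f} GiB")  # ~7.3 GiB -> well under 24 GB
```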
Key Features
Ministral 3 14B consists of two main architectural components:
- 13.5B Language Model
- 0.4B Vision Encoder
The Ministral 3 14B Reasoning model offers the following capabilities:
- Vision: Enables the model to analyze images and provide insights based on visual content, in addition to text.
- Multilingual: Supports dozens of languages, including English, French, Spanish, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, Arabic.
- System Prompt: Maintains strong adherence and support for system prompts.
- Agentic: Offers best-in-class agentic capabilities with native function calling and JSON output (see the sketch after this list).
- Reasoning: Excels at complex, multi-step reasoning and dynamic problem-solving.
- Edge-Optimized: Delivers best-in-class performance at a small scale, deployable anywhere.
- Apache 2.0 License: Open-source license allowing usage and modification for both commercial and non-commercial purposes.
- Large Context Window: Supports a 256k context window.
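To make the agentic bullet above concrete, here is a minimal, hedged sketch of native function calling through vLLM's OpenAI-compatible API. It assumes a server launched with `--enable-auto-tool-choice --tool-call-parser mistral` as shown in the vLLM section below; the `get_weather` tool and its schema are hypothetical, purely for illustration:

```python
from openai import OpenAI

# Assumes a local vLLM server started as in the "Serve" section below,
# i.e. with --enable-auto-tool-choice --tool-call-parser mistral.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id

# Hypothetical tool schema, purely for illustration.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "What is the weather in Paris right now?"}],
    tools=tools,
    tool_choice="auto",
)

# If the model decides to call the tool, the structured call (name + JSON
# arguments) comes back in message.tool_calls instead of plain text content.
message = response.choices[0].message
if message.tool_calls:
    for call in message.tool_calls:
        print(call.function.name, call.function.arguments)
else:
    print(message.content)
```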
Use Cases
Private AI deployments where advanced capabilities meet practical hardware constraints:
- Private/custom chat and AI assistant deployments in constrained environments
- Advanced local agentic use cases
- Fine-tuning and specialization
- And more...
Bringing advanced AI capabilities to most environments.
Ministral 3 Family
| Model Name | Type | Precision | Link |
|---|---|---|---|
| Ministral 3 3B Base 2512 | Base pre-trained | BF16 | Hugging Face |
| Ministral 3 3B Instruct 2512 | Instruct post-trained | FP8 | Hugging Face |
| Ministral 3 3B Reasoning 2512 | Reasoning capable | BF16 | Hugging Face |
| Ministral 3 8B Base 2512 | Base pre-trained | BF16 | Hugging Face |
| Ministral 3 8B Instruct 2512 | Instruct post-trained | FP8 | Hugging Face |
| Ministral 3 8B Reasoning 2512 | Reasoning capable | BF16 | Hugging Face |
| Ministral 3 14B Base 2512 | Base pre-trained | BF16 | Hugging Face |
| Ministral 3 14B Instruct 2512 | Instruct post-trained | FP8 | Hugging Face |
| Ministral 3 14B Reasoning 2512 | Reasoning capable | BF16 | Hugging Face |
Benchmark Results
We compare Ministral 3 to similarly sized models.
Reasoning
| Model | AIME25 | AIME24 | GPQA Diamond | LiveCodeBench |
|---|---|---|---|---|
| Ministral 3 14B | 0.850 | 0.898 | 0.712 | 0.646 |
| Qwen3-14B (Thinking) | 0.737 | 0.837 | 0.663 | 0.593 |
| Ministral 3 8B | 0.787 | 0.860 | 0.668 | 0.616 |
| Qwen3-VL-8B-Thinking | 0.798 | 0.860 | 0.671 | 0.580 |
| Ministral 3 3B | 0.721 | 0.775 | 0.534 | 0.548 |
| Qwen3-VL-4B-Thinking | 0.697 | 0.729 | 0.601 | 0.513 |
Instruct
| Model | Arena Hard | WildBench | MATH Maj@1 | MM MTBench |
|---|---|---|---|---|
| Ministral 3 14B | 0.551 | 68.5 | 0.904 | 8.49 |
| Qwen3 14B (Non-Thinking) | 0.427 | 65.1 | 0.870 | Not multimodal |
| Gemma3-12B-Instruct | 0.436 | 63.2 | 0.854 | 6.70 |
| Ministral 3 8B | 0.509 | 66.8 | 0.876 | 8.08 |
| Qwen3-VL-8B-Instruct | 0.528 | 66.3 | 0.946 | 8.00 |
| Ministral 3 3B | 0.305 | 56.8 | 0.830 | 7.83 |
| Qwen3-VL-4B-Instruct | 0.438 | 56.8 | 0.900 | 8.01 |
| Qwen3-VL-2B-Instruct | 0.163 | 42.2 | 0.786 | 6.36 |
| Gemma3-4B-Instruct | 0.318 | 49.1 | 0.759 | 5.23 |
Base
| Model | Multilingual MMLU | MATH CoT 2-Shot | AGIEval 5-shot | MMLU Redux 5-shot | MMLU 5-shot | TriviaQA 5-shot |
|---|---|---|---|---|---|---|
| Ministral 3 14B | 0.742 | 0.676 | 0.648 | 0.820 | 0.794 | 0.749 |
| Qwen3 14B Base | 0.754 | 0.620 | 0.661 | 0.837 | 0.804 | 0.703 |
| Gemma 3 12B Base | 0.690 | 0.487 | 0.587 | 0.766 | 0.745 | 0.788 |
| Ministral 3 8B | 0.706 | 0.626 | 0.591 | 0.793 | 0.761 | 0.681 |
| Qwen 3 8B Base | 0.700 | 0.576 | 0.596 | 0.794 | 0.760 | 0.639 |
| Ministral 3 3B | 0.652 | 0.601 | 0.511 | 0.735 | 0.707 | 0.592 |
| Qwen 3 4B Base | 0.677 | 0.405 | 0.570 | 0.759 | 0.713 | 0.530 |
| Gemma 3 4B Base | 0.516 | 0.294 | 0.430 | 0.626 | 0.589 | 0.640 |
Usage
The model can be used with the following frameworks:
- vllm: See here
- transformers: See here
vLLM
We recommend using this model with vLLM.
#### Installation
Make sure to install vLLM >= 0.12.0:
pip install vllm --upgrade
Doing so should automatically install mistral-common >= 1.8.6.
To check:
python -c "import mistral_common; print(mistral_common.__version__)"
You can also make use of a ready-to-go Docker image available on Docker Hub.
#### Serve
To fully exploit Ministral-3-14B-Reasoning-2512, we recommend using 2xH200 GPUs for deployment due to its large context window. However, if you don't need a large context, you can fall back to a single GPU.
A simple launch command is:
vllm serve mistralai/Ministral-3-14B-Reasoning-2512-FP8 \
--tensor-parallel-size 2 \
--enable-auto-tool-choice --tool-call-parser mistral \
--reasoning-parser mistral
Key parameter notes:
- `--enable-auto-tool-choice`: Required when enabling tool usage.
- `--tool-call-parser mistral`: Required when enabling tool usage.
- `--reasoning-parser mistral`: Required when enabling reasoning (a minimal request example follows these notes).
Additional flags:
- You can set `--max-model-len` to preserve memory. By default it is set to 262144, which is quite large but not necessary for most scenarios.
- You can set `--max-num-batched-tokens` to balance throughput and latency; higher values mean higher throughput but also higher latency.
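Before the full streaming examples below, here is a minimal non-streaming sketch (assuming the server launched above is reachable at localhost:8000) showing how `--reasoning-parser mistral` exposes the reasoning trace as a separate `reasoning_content` field next to the final `content`:

```python
from openai import OpenAI

# Assumes the vLLM server from the command above is running locally.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
    temperature=0.7,
    top_p=0.95,
    max_tokens=4096,
)

message = response.choices[0].message
# With --reasoning-parser mistral, the reasoning trace is split out of the answer.
print("Reasoning:", getattr(message, "reasoning_content", None))
print("Answer:", message.content)
```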
#### Usage of the model
Here we assume that the model is served as shown above and reachable at localhost on port 8000, the default for vLLM.
Vision Reasoning
Let's see if the Ministral 3 model knows when to pick a fight!
from typing import Any

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.7
TOP_P = 0.95
MAX_TOK = 262144

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> dict[str, Any]:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()

    index_begin_think = system_prompt.find("[THINK]")
    index_end_think = system_prompt.find("[/THINK]")

    return {
        "role": "system",
        "content": [
            {"type": "text", "text": system_prompt[:index_begin_think]},
            {
                "type": "thinking",
                "thinking": system_prompt[
                    index_begin_think + len("[THINK]") : index_end_think
                ],
                "closed": True,
            },
            {
                "type": "text",
                "text": system_prompt[index_end_think + len("[/THINK]") :],
            },
        ],
    }


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

messages = [
    SYSTEM_PROMPT,
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

stream = client.chat.completions.create(
    model=model,
    messages=messages,
    stream=True,
    temperature=TEMP,
    top_p=TOP_P,
    max_tokens=MAX_TOK,
)

print("client: Start streaming chat completions...:\n")
printed_reasoning_content = False
answer = []

for chunk in stream:
    reasoning_content = None
    content = None
    # Check whether the delta carries reasoning_content or regular content.
    if hasattr(chunk.choices[0].delta, "reasoning_content"):
        reasoning_content = chunk.choices[0].delta.reasoning_content
    if hasattr(chunk.choices[0].delta, "content"):
        content = chunk.choices[0].delta.content

    if reasoning_content is not None:
        if not printed_reasoning_content:
            printed_reasoning_content = True
            print("Start reasoning:\n", end="", flush=True)
        print(reasoning_content, end="", flush=True)
    elif content is not None:
        # Extract and print the content once reasoning is done.
        if not reasoning_content and printed_reasoning_content:
            answer.extend(content)
        print(content, end="", flush=True)

if answer:
    print("\n\n=============\nAnswer\n=============\n")
    print("".join(answer))
else:
    print("\n\n=============\nNo Answer\n=============\n")
    print(
        "No answer was generated by the model, probably because the maximum number of tokens was reached."
    )
Now we'll make it compute some maths!
from typing import Any

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.7
TOP_P = 0.95
MAX_TOK = 262144

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> dict[str, Any]:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()

    index_begin_think = system_prompt.find("[THINK]")
    index_end_think = system_prompt.find("[/THINK]")

    return {
        "role": "system",
        "content": [
            {"type": "text", "text": system_prompt[:index_begin_think]},
            {
                "type": "thinking",
                "thinking": system_prompt[
                    index_begin_think + len("[THINK]") : index_end_think
                ],
                "closed": True,
            },
            {
                "type": "text",
                "text": system_prompt[index_end_think + len("[/THINK]") :],
            },
        ],
    }


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

image_url = "https://i.ytimg.com/vi/5Y3xLHeyKZU/hqdefault.jpg"

messages = [
    SYSTEM_PROMPT,
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Solve the equations. If they contain only numbers, use your calculator, else only think. Answer in the language of the image.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

stream = client.chat.completions.create(
    model=model,
    messages=messages,
    stream=True,
    temperature=TEMP,
    top_p=TOP_P,
    max_tokens=MAX_TOK,
)

print("client: Start streaming chat completions...:\n")
printed_reasoning_content = False
answer = []

for chunk in stream:
    reasoning_content = None
    content = None
    # Check whether the delta carries reasoning_content or regular content.
    if hasattr(chunk.choices[0].delta, "reasoning_content"):
        reasoning_content = chunk.choices[0].delta.reasoning_content
    if hasattr(chunk.choices[0].delta, "content"):
        content = chunk.choices[0].delta.content

    if reasoning_content is not None:
        if not printed_reasoning_content:
            printed_reasoning_content = True
            print("Start reasoning:\n", end="", flush=True)
        print(reasoning_content, end="", flush=True)
    if content is not None:
        # Extract and print the content once reasoning is done.
        if not reasoning_content and printed_reasoning_content:
            answer.extend(content)
        print(content, end="", flush=True)

if answer:
    print("\n\n=============\nAnswer\n=============\n")
    print("".join(answer))
else:
    print("\n\n=============\nNo Answer\n=============\n")
    print(
        "No answer was generated by the model, probably because the maximum number of tokens was reached."
    )
Text-Only Request
Let's do more maths and leave it up to the model to figure out how to achieve a result.
from typing import Any

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.7
TOP_P = 0.95
MAX_TOK = 262144

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> dict[str, Any]:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()

    index_begin_think = system_prompt.find("[THINK]")
    index_end_think = system_prompt.find("[/THINK]")

    return {
        "role": "system",
        "content": [
            {"type": "text", "text": system_prompt[:index_begin_think]},
            {
                "type": "thinking",
                "thinking": system_prompt[
                    index_begin_think + len("[THINK]") : index_end_think
                ],
                "closed": True,
            },
            {
                "type": "text",
                "text": system_prompt[index_end_think + len("[/THINK]") :],
            },
        ],
    }


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

query = "Use each number in 2,5,6,3 exactly once, along with any combination of +, -, ×, ÷ (and parentheses for grouping), to make the number 24."

messages = [
    SYSTEM_PROMPT,
    {"role": "user", "content": query},
]

stream = client.chat.completions.create(
    model=model,
    messages=messages,
    stream=True,
    temperature=TEMP,
    top_p=TOP_P,
    max_tokens=MAX_TOK,
)

print("client: Start streaming chat completions...:\n")
printed_reasoning_content = False
answer = []

for chunk in stream:
    reasoning_content = None
    content = None
    # Check whether the delta carries reasoning_content or regular content.
    if hasattr(chunk.choices[0].delta, "reasoning_content"):
        reasoning_content = chunk.choices[0].delta.reasoning_content
    if hasattr(chunk.choices[0].delta, "content"):
        content = chunk.choices[0].delta.content

    if reasoning_content is not None:
        if not printed_reasoning_content:
            printed_reasoning_content = True
            print("Start reasoning:\n", end="", flush=True)
        print(reasoning_content, end="", flush=True)
    if content is not None:
        # Extract and print the content once reasoning is done.
        if not reasoning_content and printed_reasoning_content:
            answer.extend(content)
        print(content, end="", flush=True)

if answer:
    print("\n\n=============\nAnswer\n=============\n")
    print("".join(answer))
else:
    print("\n\n=============\nNo Answer\n=============\n")
    print("No answer was generated by the model, probably because the maximum number of tokens was reached.")
Transformers
You can also use Ministral 3 14B Reasoning 2512 with Transformers!
Make sure to install Transformers from its first v5 release candidate or from "main":
pip install transformers==5.0.0rc0
To make the best use of our model with Transformers, make sure to have mistral-common >= 1.8.6 installed to use our tokenizer.
pip install mistral-common --upgrade
Then load our tokenizer along with the model and generate:
Python snippet
import torch
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend

model_id = "mistralai/Ministral-3-14B-Reasoning-2512"

tokenizer = MistralCommonBackend.from_pretrained(model_id)
model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

tokenized = tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True)
tokenized["input_ids"] = tokenized["input_ids"].to(device="cuda")
tokenized["pixel_values"] = tokenized["pixel_values"].to(dtype=torch.bfloat16, device="cuda")
image_sizes = [tokenized["pixel_values"].shape[-2:]]

output = model.generate(
    **tokenized,
    image_sizes=image_sizes,
    max_new_tokens=8092,
)[0]

decoded_output = tokenizer.decode(output[len(tokenized["input_ids"][0]):])
print(decoded_output)
License
This model is licensed under the Apache 2.0 License.
You must not use this model in a manner that infringes, misappropriates, or otherwise violates any third party's rights, including intellectual property rights.
GGUF File List
| Filename | Size | Download |
|---|---|---|
| Ministral-3-14B-Reasoning-2512-BF16.gguf | 25.17 GB | Download |
| Ministral-3-14B-Reasoning-2512-IQ4_NL.gguf | 7.27 GB | Download |
| Ministral-3-14B-Reasoning-2512-IQ4_XS.gguf | 6.92 GB | Download |
| Ministral-3-14B-Reasoning-2512-Q2_K.gguf | 4.89 GB | Download |
| Ministral-3-14B-Reasoning-2512-Q2_K_L.gguf | 5.03 GB | Download |
| Ministral-3-14B-Reasoning-2512-Q3_K_M.gguf | 6.22 GB | Download |
| Ministral-3-14B-Reasoning-2512-Q3_K_S.gguf | 5.66 GB | Download |
| Ministral-3-14B-Reasoning-2512-Q4_0.gguf (Recommended) | 7.27 GB | Download |
| Ministral-3-14B-Reasoning-2512-Q4_1.gguf | 7.99 GB | Download |
| Ministral-3-14B-Reasoning-2512-Q4_K_M.gguf | 7.67 GB | Download |
| Ministral-3-14B-Reasoning-2512-Q4_K_S.gguf | 7.3 GB | Download |
| Ministral-3-14B-Reasoning-2512-Q5_K_M.gguf | 8.96 GB | Download |
| Ministral-3-14B-Reasoning-2512-Q5_K_S.gguf | 8.74 GB | Download |
| Ministral-3-14B-Reasoning-2512-Q6_K.gguf | 10.33 GB | Download |
| Ministral-3-14B-Reasoning-2512-Q8_0.gguf | 13.37 GB | Download |
| Ministral-3-14B-Reasoning-2512-UD-IQ1_M.gguf | 3.42 GB | Download |
| Ministral-3-14B-Reasoning-2512-UD-IQ1_S.gguf | 3.21 GB | Download |
| Ministral-3-14B-Reasoning-2512-UD-IQ2_M.gguf | 4.57 GB | Download |
| Ministral-3-14B-Reasoning-2512-UD-IQ2_XXS.gguf | 3.78 GB | Download |
| Ministral-3-14B-Reasoning-2512-UD-IQ3_XXS.gguf | 5.12 GB | Download |
| Ministral-3-14B-Reasoning-2512-UD-Q2_K_XL.gguf | 5.15 GB | Download |
| Ministral-3-14B-Reasoning-2512-UD-Q3_K_XL.gguf | 6.46 GB | Download |
| Ministral-3-14B-Reasoning-2512-UD-Q4_K_XL.gguf | 7.79 GB | Download |
| Ministral-3-14B-Reasoning-2512-UD-Q5_K_XL.gguf | 8.98 GB | Download |
| Ministral-3-14B-Reasoning-2512-UD-Q6_K_XL.gguf | 11.29 GB | Download |
| Ministral-3-14B-Reasoning-2512-UD-Q8_K_XL.gguf | 15.94 GB | Download |
| mmproj-BF16.gguf | 838.53 MB | Download |
| mmproj-F16.gguf | 837.38 MB | Download |
| mmproj-F32.gguf | 1.64 GB | Download |
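The GGUF files above are intended for llama.cpp-based runtimes. As a rough illustration only (not an official recipe from this card), here is a minimal text-only sketch using the llama-cpp-python bindings; the local file path, context size, and sampling values are placeholder assumptions, and vision usage would additionally require one of the mmproj files and a multimodal-capable runtime, which this sketch does not cover:

```python
# Minimal text-only sketch with llama-cpp-python (pip install llama-cpp-python).
# Assumptions: the Q4_0 GGUF has been downloaded to the path below, and your
# llama.cpp build is recent enough to support this model's architecture and
# chat template. Adjust n_ctx / n_gpu_layers to your hardware.
from llama_cpp import Llama

llm = Llama(
    model_path="Ministral-3-14B-Reasoning-2512-Q4_0.gguf",  # local path (placeholder)
    n_ctx=32768,       # placeholder context size, well below the 256k maximum
    n_gpu_layers=-1,   # offload all layers to GPU if one is available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Briefly explain why the sky is blue."}],
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```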