Model Description
library_name: vllm
language: en, fr, es, de, it, pt, nl, zh, ja, ko, ar
base_model: mistralai/Ministral-3-14B-Reasoning-2512
tags: mistral-common, mistral, unsloth
See our Ministral 3 collection for all versions including GGUF, 4-bit & FP8 formats.
Learn to run Ministral correctly - Read our Guide.
See Unsloth Dynamic 2.0 GGUFs for our quantization benchmarks.
✨ Read our Ministral 3 Guide here!
- Fine-tune Ministral 3 for free using our Google Colab notebook
- Or train Ministral 3 with reinforcement learning (GSPO) with our free notebook.
- View the rest of our notebooks in our docs here.
Ministral 3 14B Reasoning 2512
The largest model in the Ministral 3 family, Ministral 3 14B offers frontier capabilities and performance comparable to its larger Mistral Small 3.2 24B counterpart. It is a powerful and efficient language model with vision capabilities.
This model is the reasoning post-trained version, making it ideal for math, coding, and other STEM-related use cases.
The Ministral 3 family is designed for edge deployment and is capable of running on a wide range of hardware. Ministral 3 14B can even be deployed locally, fitting in 32 GB of VRAM in BF16 and in less than 24 GB of RAM/VRAM when quantized.
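As a rough sanity check on those numbers (a back-of-the-envelope sketch, not an official sizing guide), weight memory is roughly parameter count times bytes per parameter, using the 13.5B + 0.4B split listed under Key Features; the ~4.5 bits/parameter figure for 4-bit quantization is an assumption, since real quantized files keep some tensors at higher precision:

```python
# Back-of-the-envelope weight-memory estimate for Ministral 3 14B
# (13.5B language model + 0.4B vision encoder). Ignores KV cache,
# activations and runtime overhead, so real usage is somewhat higher.
PARAMS = 13.5e9 + 0.4e9  # total parameters

def weight_gib(params: float, bits_per_param: float) -> float:
    """Approximate weight memory in GiB at a given precision."""
    return params * bits_per_param / 8 / 1024**3

print(f"BF16 (16 bits/param): ~{weight_gib(PARAMS, 16):.1f} GiB")            # ~25.9 GiB -> fits in 32 GB of VRAM
print(f"4-bit (~4.5 bits/param, assumed): ~{weight_gib(PARAMS, 4.5):.1f} GiB")  # ~7.3 GiB -> well under 24 GB
```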
Key Features
Ministral 3 14B consists of two main architectural components:
- 13.5B Language Model
- 0.4B Vision Encoder
The Ministral 3 14B Reasoning model offers the following capabilities:
- Vision: Enables the model to analyze images and provide insights based on visual content, in addition to text.
- Multilingual: Supports dozens of languages, including English, French, Spanish, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, Arabic.
- System Prompt: Maintains strong adherence and support for system prompts.
- Agentic: Offers best-in-class agentic capabilities with native function calling and JSON output (see the sketch after this list).
- Reasoning: Excels at complex, multi-step reasoning and dynamic problem-solving.
- Edge-Optimized: Delivers best-in-class performance at a small scale, deployable anywhere.
- Apache 2.0 License: Open-source license allowing usage and modification for both commercial and non-commercial purposes.
- Large Context Window: Supports a 256k context window.
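To make the agentic bullet above concrete, here is a minimal, hedged sketch of native function calling through vLLM's OpenAI-compatible API. It assumes a server launched with `--enable-auto-tool-choice --tool-call-parser mistral` as shown in the vLLM section below; the `get_weather` tool and its schema are hypothetical, purely for illustration:

```python
from openai import OpenAI

# Assumes a local vLLM server started as in the "Serve" section below,
# i.e. with --enable-auto-tool-choice --tool-call-parser mistral.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id

# Hypothetical tool schema, purely for illustration.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "What is the weather in Paris right now?"}],
    tools=tools,
    tool_choice="auto",
)

# If the model decides to call the tool, the structured call (name + JSON
# arguments) comes back in message.tool_calls instead of plain text content.
message = response.choices[0].message
if message.tool_calls:
    for call in message.tool_calls:
        print(call.function.name, call.function.arguments)
else:
    print(message.content)
```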
Use Cases
Private AI deployments where advanced capabilities meet practical hardware constraints:
- Private/custom chat and AI assistant deployments in constrained environments
- Advanced local agentic use cases
- Fine-tuning and specialization
- And more...
Bringing advanced AI capabilities to most environments.
Ministral 3 Family
| Model Name | Type | Precision | Link |
|---|---|---|---|
| Ministral 3 3B Base 2512 | Base pre-trained | BF16 | Hugging Face |
| Ministral 3 3B Instruct 2512 | Instruct post-trained | FP8 | Hugging Face |
| Ministral 3 3B Reasoning 2512 | Reasoning capable | BF16 | Hugging Face |
| Ministral 3 8B Base 2512 | Base pre-trained | BF16 | Hugging Face |
| Ministral 3 8B Instruct 2512 | Instruct post-trained | FP8 | Hugging Face |
| Ministral 3 8B Reasoning 2512 | Reasoning capable | BF16 | Hugging Face |
| Ministral 3 14B Base 2512 | Base pre-trained | BF16 | Hugging Face |
| Ministral 3 14B Instruct 2512 | Instruct post-trained | FP8 | Hugging Face |
| Ministral 3 14B Reasoning 2512 | Reasoning capable | BF16 | Hugging Face |
Benchmark Results
We compare Ministral 3 to similarly sized models.
Reasoning
| Model | AIME25 | AIME24 | GPQA Diamond | LiveCodeBench |
|---|---|---|---|---|
| Ministral 3 14B | 0.850 | 0.898 | 0.712 | 0.646 |
| Qwen3-14B (Thinking) | 0.737 | 0.837 | 0.663 | 0.593 |
| Ministral 3 8B | 0.787 | 0.860 | 0.668 | 0.616 |
| Qwen3-VL-8B-Thinking | 0.798 | 0.860 | 0.671 | 0.580 |
| Ministral 3 3B | 0.721 | 0.775 | 0.534 | 0.548 |
| Qwen3-VL-4B-Thinking | 0.697 | 0.729 | 0.601 | 0.513 |
Instruct
| Model | Arena Hard | WildBench | MATH Maj@1 | MM MTBench |
|---|---|---|---|---|
| Ministral 3 14B | 0.551 | 68.5 | 0.904 | 8.49 |
| Qwen3 14B (Non-Thinking) | 0.427 | 65.1 | 0.870 | Not multimodal |
| Gemma3-12B-Instruct | 0.436 | 63.2 | 0.854 | 6.70 |
| Ministral 3 8B | 0.509 | 66.8 | 0.876 | 8.08 |
| Qwen3-VL-8B-Instruct | 0.528 | 66.3 | 0.946 | 8.00 |
| Ministral 3 3B | 0.305 | 56.8 | 0.830 | 7.83 |
| Qwen3-VL-4B-Instruct | 0.438 | 56.8 | 0.900 | 8.01 |
| Qwen3-VL-2B-Instruct | 0.163 | 42.2 | 0.786 | 6.36 |
| Gemma3-4B-Instruct | 0.318 | 49.1 | 0.759 | 5.23 |
Base
| Model | Multilingual MMLU | MATH CoT 2-Shot | AGIEval 5-shot | MMLU Redux 5-shot | MMLU 5-shot | TriviaQA 5-shot |
|---|---|---|---|---|---|---|
| Ministral 3 14B | 0.742 | 0.676 | 0.648 | 0.820 | 0.794 | 0.749 |
| Qwen3 14B Base | 0.754 | 0.620 | 0.661 | 0.837 | 0.804 | 0.703 |
| Gemma 3 12B Base | 0.690 | 0.487 | 0.587 | 0.766 | 0.745 | 0.788 |
| Ministral 3 8B | 0.706 | 0.626 | 0.591 | 0.793 | 0.761 | 0.681 |
| Qwen 3 8B Base | 0.700 | 0.576 | 0.596 | 0.794 | 0.760 | 0.639 |
| Ministral 3 3B | 0.652 | 0.601 | 0.511 | 0.735 | 0.707 | 0.592 |
| Qwen 3 4B Base | 0.677 | 0.405 | 0.570 | 0.759 | 0.713 | 0.530 |
| Gemma 3 4B Base | 0.516 | 0.294 | 0.430 | 0.626 | 0.589 | 0.640 |
Usage
The model can be used with the following frameworks:
- vllm: See here
- transformers: See here
vLLM
We recommend using this model with vLLM.
#### Installation
Make sure to install vLLM >= 0.12.0:
pip install vllm --upgrade
Doing so should automatically install mistral-common >= 1.8.6.
To check:
python -c "import mistral_common; print(mistral_common.__version__)"
You can also make use of a ready-to-go Docker image available on Docker Hub.
#### Serve
To fully exploit Ministral-3-14B-Reasoning-2512, we recommend using 2xH200 GPUs for deployment due to its large context window. However, if you don't need a large context, you can fall back to a single GPU.
A simple launch command is:
vllm serve mistralai/Ministral-3-14B-Reasoning-2512-FP8 \
--tensor-parallel-size 2 \
--enable-auto-tool-choice --tool-call-parser mistral \
--reasoning-parser mistral
Key parameter notes:
- `--enable-auto-tool-choice`: Required when enabling tool usage.
- `--tool-call-parser mistral`: Required when enabling tool usage.
- `--reasoning-parser mistral`: Required when enabling reasoning (a minimal request example follows these notes).
Additional flags:
- You can set `--max-model-len` to preserve memory. By default it is set to 262144, which is quite large but not necessary for most scenarios.
- You can set `--max-num-batched-tokens` to balance throughput and latency; higher values mean higher throughput but also higher latency.
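Before the full streaming examples below, here is a minimal non-streaming sketch (assuming the server launched above is reachable at localhost:8000) showing how `--reasoning-parser mistral` exposes the reasoning trace as a separate `reasoning_content` field next to the final `content`:

```python
from openai import OpenAI

# Assumes the vLLM server from the command above is running locally.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
    temperature=0.7,
    top_p=0.95,
    max_tokens=4096,
)

message = response.choices[0].message
# With --reasoning-parser mistral, the reasoning trace is split out of the answer.
print("Reasoning:", getattr(message, "reasoning_content", None))
print("Answer:", message.content)
```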
#### Usage of the model
Here we assume that the model is served as shown above and reachable at localhost on port 8000, the default for vLLM.
Vision Reasoning
Let's see if the Ministral 3 model knows when to pick a fight!
from typing import Any

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.7
TOP_P = 0.95
MAX_TOK = 262144

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> dict[str, Any]:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()

    index_begin_think = system_prompt.find("[THINK]")
    index_end_think = system_prompt.find("[/THINK]")

    return {
        "role": "system",
        "content": [
            {"type": "text", "text": system_prompt[:index_begin_think]},
            {
                "type": "thinking",
                "thinking": system_prompt[
                    index_begin_think + len("[THINK]") : index_end_think
                ],
                "closed": True,
            },
            {
                "type": "text",
                "text": system_prompt[index_end_think + len("[/THINK]") :],
            },
        ],
    }


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

messages = [
    SYSTEM_PROMPT,
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

stream = client.chat.completions.create(
    model=model,
    messages=messages,
    stream=True,
    temperature=TEMP,
    top_p=TOP_P,
    max_tokens=MAX_TOK,
)

print("client: Start streaming chat completions...:\n")
printed_reasoning_content = False
answer = []

for chunk in stream:
    reasoning_content = None
    content = None
    # Check whether the delta carries reasoning_content or regular content.
    if hasattr(chunk.choices[0].delta, "reasoning_content"):
        reasoning_content = chunk.choices[0].delta.reasoning_content
    if hasattr(chunk.choices[0].delta, "content"):
        content = chunk.choices[0].delta.content

    if reasoning_content is not None:
        if not printed_reasoning_content:
            printed_reasoning_content = True
            print("Start reasoning:\n", end="", flush=True)
        print(reasoning_content, end="", flush=True)
    elif content is not None:
        # Extract and print the content once reasoning is done.
        if not reasoning_content and printed_reasoning_content:
            answer.extend(content)
        print(content, end="", flush=True)

if answer:
    print("\n\n=============\nAnswer\n=============\n")
    print("".join(answer))
else:
    print("\n\n=============\nNo Answer\n=============\n")
    print(
        "No answer was generated by the model, probably because the maximum number of tokens was reached."
    )
Now we'll make it compute some maths!
from typing import Any

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.7
TOP_P = 0.95
MAX_TOK = 262144

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> dict[str, Any]:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()

    index_begin_think = system_prompt.find("[THINK]")
    index_end_think = system_prompt.find("[/THINK]")

    return {
        "role": "system",
        "content": [
            {"type": "text", "text": system_prompt[:index_begin_think]},
            {
                "type": "thinking",
                "thinking": system_prompt[
                    index_begin_think + len("[THINK]") : index_end_think
                ],
                "closed": True,
            },
            {
                "type": "text",
                "text": system_prompt[index_end_think + len("[/THINK]") :],
            },
        ],
    }


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

image_url = "https://i.ytimg.com/vi/5Y3xLHeyKZU/hqdefault.jpg"

messages = [
    SYSTEM_PROMPT,
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Solve the equations. If they contain only numbers, use your calculator, else only think. Answer in the language of the image.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

stream = client.chat.completions.create(
    model=model,
    messages=messages,
    stream=True,
    temperature=TEMP,
    top_p=TOP_P,
    max_tokens=MAX_TOK,
)

print("client: Start streaming chat completions...:\n")
printed_reasoning_content = False
answer = []

for chunk in stream:
    reasoning_content = None
    content = None
    # Check whether the delta carries reasoning_content or regular content.
    if hasattr(chunk.choices[0].delta, "reasoning_content"):
        reasoning_content = chunk.choices[0].delta.reasoning_content
    if hasattr(chunk.choices[0].delta, "content"):
        content = chunk.choices[0].delta.content

    if reasoning_content is not None:
        if not printed_reasoning_content:
            printed_reasoning_content = True
            print("Start reasoning:\n", end="", flush=True)
        print(reasoning_content, end="", flush=True)
    if content is not None:
        # Extract and print the content once reasoning is done.
        if not reasoning_content and printed_reasoning_content:
            answer.extend(content)
        print(content, end="", flush=True)

if answer:
    print("\n\n=============\nAnswer\n=============\n")
    print("".join(answer))
else:
    print("\n\n=============\nNo Answer\n=============\n")
    print(
        "No answer was generated by the model, probably because the maximum number of tokens was reached."
    )
Text-Only Request
Let's do more maths and leave it up to the model to figure out how to achieve a result.
from typing import Any

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.7
TOP_P = 0.95
MAX_TOK = 262144

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> dict[str, Any]:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()

    index_begin_think = system_prompt.find("[THINK]")
    index_end_think = system_prompt.find("[/THINK]")

    return {
        "role": "system",
        "content": [
            {"type": "text", "text": system_prompt[:index_begin_think]},
            {
                "type": "thinking",
                "thinking": system_prompt[
                    index_begin_think + len("[THINK]") : index_end_think
                ],
                "closed": True,
            },
            {
                "type": "text",
                "text": system_prompt[index_end_think + len("[/THINK]") :],
            },
        ],
    }


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

query = "Use each number in 2,5,6,3 exactly once, along with any combination of +, -, ×, ÷ (and parentheses for grouping), to make the number 24."

messages = [
    SYSTEM_PROMPT,
    {"role": "user", "content": query},
]

stream = client.chat.completions.create(
    model=model,
    messages=messages,
    stream=True,
    temperature=TEMP,
    top_p=TOP_P,
    max_tokens=MAX_TOK,
)

print("client: Start streaming chat completions...:\n")
printed_reasoning_content = False
answer = []

for chunk in stream:
    reasoning_content = None
    content = None
    # Check whether the delta carries reasoning_content or regular content.
    if hasattr(chunk.choices[0].delta, "reasoning_content"):
        reasoning_content = chunk.choices[0].delta.reasoning_content
    if hasattr(chunk.choices[0].delta, "content"):
        content = chunk.choices[0].delta.content

    if reasoning_content is not None:
        if not printed_reasoning_content:
            printed_reasoning_content = True
            print("Start reasoning:\n", end="", flush=True)
        print(reasoning_content, end="", flush=True)
    if content is not None:
        # Extract and print the content once reasoning is done.
        if not reasoning_content and printed_reasoning_content:
            answer.extend(content)
        print(content, end="", flush=True)

if answer:
    print("\n\n=============\nAnswer\n=============\n")
    print("".join(answer))
else:
    print("\n\n=============\nNo Answer\n=============\n")
    print("No answer was generated by the model, probably because the maximum number of tokens was reached.")
Transformers
You can also use Ministral 3 14B Reasoning 2512 with Transformers!
Make sure to install Transformers from its first v5 release candidate or from "main":
pip install transformers==5.0.0rc0
To make the best use of our model with Transformers, make sure to have mistral-common >= 1.8.6 installed to use our tokenizer.
pip install mistral-common --upgrade
Then load our tokenizer along with the model and generate:
Python snippet
import torch
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend

model_id = "mistralai/Ministral-3-14B-Reasoning-2512"

tokenizer = MistralCommonBackend.from_pretrained(model_id)
model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

tokenized = tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True)
tokenized["input_ids"] = tokenized["input_ids"].to(device="cuda")
tokenized["pixel_values"] = tokenized["pixel_values"].to(dtype=torch.bfloat16, device="cuda")
image_sizes = [tokenized["pixel_values"].shape[-2:]]

output = model.generate(
    **tokenized,
    image_sizes=image_sizes,
    max_new_tokens=8092,
)[0]

decoded_output = tokenizer.decode(output[len(tokenized["input_ids"][0]):])
print(decoded_output)
License
This model is licensed under the Apache 2.0 License.
You must not use this model in a manner that infringes, misappropriates, or otherwise violates any third party's rights, including intellectual property rights.
GGUF File List
| Filename | Size | Download |
|---|---|---|
| Ministral-3-14B-Reasoning-2512-BF16.gguf | 25.17 GB | Download |
| Ministral-3-14B-Reasoning-2512-IQ4_NL.gguf | 7.27 GB | Download |
| Ministral-3-14B-Reasoning-2512-IQ4_XS.gguf | 6.92 GB | Download |
| Ministral-3-14B-Reasoning-2512-Q2_K.gguf | 4.89 GB | Download |
| Ministral-3-14B-Reasoning-2512-Q2_K_L.gguf | 5.03 GB | Download |
| Ministral-3-14B-Reasoning-2512-Q3_K_M.gguf | 6.22 GB | Download |
| Ministral-3-14B-Reasoning-2512-Q3_K_S.gguf | 5.66 GB | Download |
| Ministral-3-14B-Reasoning-2512-Q4_0.gguf (Recommended) | 7.27 GB | Download |
| Ministral-3-14B-Reasoning-2512-Q4_1.gguf | 7.99 GB | Download |
| Ministral-3-14B-Reasoning-2512-Q4_K_M.gguf | 7.67 GB | Download |
| Ministral-3-14B-Reasoning-2512-Q4_K_S.gguf | 7.3 GB | Download |
| Ministral-3-14B-Reasoning-2512-Q5_K_M.gguf | 8.96 GB | Download |
| Ministral-3-14B-Reasoning-2512-Q5_K_S.gguf | 8.74 GB | Download |
| Ministral-3-14B-Reasoning-2512-Q6_K.gguf | 10.33 GB | Download |
| Ministral-3-14B-Reasoning-2512-Q8_0.gguf | 13.37 GB | Download |
| Ministral-3-14B-Reasoning-2512-UD-IQ1_M.gguf | 3.42 GB | Download |
| Ministral-3-14B-Reasoning-2512-UD-IQ1_S.gguf | 3.21 GB | Download |
| Ministral-3-14B-Reasoning-2512-UD-IQ2_M.gguf | 4.57 GB | Download |
| Ministral-3-14B-Reasoning-2512-UD-IQ2_XXS.gguf | 3.78 GB | Download |
| Ministral-3-14B-Reasoning-2512-UD-IQ3_XXS.gguf | 5.12 GB | Download |
| Ministral-3-14B-Reasoning-2512-UD-Q2_K_XL.gguf | 5.15 GB | Download |
| Ministral-3-14B-Reasoning-2512-UD-Q3_K_XL.gguf | 6.46 GB | Download |
| Ministral-3-14B-Reasoning-2512-UD-Q4_K_XL.gguf | 7.79 GB | Download |
| Ministral-3-14B-Reasoning-2512-UD-Q5_K_XL.gguf | 8.98 GB | Download |
| Ministral-3-14B-Reasoning-2512-UD-Q6_K_XL.gguf | 11.29 GB | Download |
| Ministral-3-14B-Reasoning-2512-UD-Q8_K_XL.gguf | 15.94 GB | Download |
| mmproj-BF16.gguf | 838.53 MB | Download |
| mmproj-F16.gguf | 837.38 MB | Download |
| mmproj-F32.gguf | 1.64 GB | Download |
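The GGUF files above are intended for llama.cpp-based runtimes. As a rough illustration only (not an official recipe from this card), here is a minimal text-only sketch using the llama-cpp-python bindings; the local file path, context size, and sampling values are placeholder assumptions, and vision usage would additionally require one of the mmproj files and a multimodal-capable runtime, which this sketch does not cover:

```python
# Minimal text-only sketch with llama-cpp-python (pip install llama-cpp-python).
# Assumptions: the Q4_0 GGUF has been downloaded to the path below, and your
# llama.cpp build is recent enough to support this model's architecture and
# chat template. Adjust n_ctx / n_gpu_layers to your hardware.
from llama_cpp import Llama

llm = Llama(
    model_path="Ministral-3-14B-Reasoning-2512-Q4_0.gguf",  # local path (placeholder)
    n_ctx=32768,       # placeholder context size, well below the 256k maximum
    n_gpu_layers=-1,   # offload all layers to GPU if one is available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Briefly explain why the sky is blue."}],
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```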