📋 Model Description


license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
pipeline_tag: text-generation
datasets:
  • nvidia/Nemotron-Post-Training-Dataset-v1
  • nvidia/Nemotron-Post-Training-Dataset-v2
  • nvidia/Nemotron-Pretraining-Dataset-sample
  • nvidia/Nemotron-CC-v2
  • nvidia/Nemotron-CC-Math-v1
  • nvidia/Nemotron-Pretraining-SFT-v1
language:
  • en
  • es
  • fr
  • de
  • it
  • ja
library_name: transformers
tags:
  • nvidia
  • pytorch
track_downloads: true
base_model:
  • nvidia/NVIDIA-Nemotron-Nano-12B-v2-Base

NVIDIA-Nemotron-Nano-12B-v2 GGUF Models

Model Generation Details

This model was generated using llama.cpp at commit 4fd1242b.


Quantization Beyond the IMatrix

I've been experimenting with a new quantization approach that selectively elevates the precision of key layers beyond what the default IMatrix configuration provides.

In my testing, standard IMatrix quantization underperforms at lower bit depths, especially with Mixture of Experts (MoE) models. To address this, I'm using the --tensor-type option in llama.cpp to manually "bump" important layers to higher precision. You can see the implementation here:
👉 Layer bumping with llama.cpp

While this does increase model file size, it significantly improves precision for a given quantization level.

I'd love your feedback—have you tried this? How does it perform for you?



Click here to get info on choosing the right GGUF model format


NVIDIA-Nemotron-Nano-12B-v2

Model Developer: NVIDIA Corporation

Model Dates:

June 2025 - August 2025

Data Freshness:

September 2024

The pretraining data has a cutoff date of September 2024.

Model Overview

NVIDIA-Nemotron-Nano-12B-v2 is a large language model (LLM) trained from scratch by NVIDIA, and designed as a unified model for both reasoning and non-reasoning tasks. It responds to user queries and tasks by first generating a reasoning trace and then concluding with a final response. The model's reasoning capabilities can be controlled via a system prompt. If the user prefers the model to provide its final answer without intermediate reasoning traces, it can be configured to do so, albeit with a slight decrease in accuracy for harder prompts that require reasoning. Conversely, allowing the model to generate reasoning traces first generally results in higher-quality final solutions to queries and tasks. The model was fine-tuned from NVIDIA-Nemotron-Nano-12B-v2-Base and was further compressed into NVIDIA-Nemotron-Nano-9B-v2.

The model uses a hybrid architecture consisting primarily of Mamba-2 and MLP layers combined with just four Attention layers. For the architecture, please refer to the Nemotron-H tech report.
The model was trained using Megatron-LM and NeMo-RL.

The supported languages include: English, German, Spanish, French, Italian, and Japanese. Improved using Qwen.

This model is ready for commercial use.

License/Terms of Use

GOVERNING TERMS: Use of this model is governed by the NVIDIA Open Model License Agreement.

Evaluation Results

Benchmark Results (Reasoning On)

We evaluated our model in Reasoning-On mode across all benchmarks, except RULER, which is evaluated in Reasoning-Off mode.

| Benchmark | NVIDIA-Nemotron-Nano-12B-v2 |
| --- | --- |
| AIME25 | 76.25% |
| MATH500 | 97.75% |
| GPQA | 64.48% |
| LCB | 70.79% |
| BFCL v3 | 66.98% |
| IFEVAL-Prompt | 84.70% |
| IFEVAL-Instruction | 89.81% |

All evaluations were done using NeMo-Skills.
We published a tutorial with all details necessary to reproduce our evaluation results.

Reasoning Budget Control

This model supports runtime “thinking” budget control. During inference, the user can specify how many tokens the model is allowed to "think".

Model Architecture

  • Architecture Type: Mamba2-Transformer Hybrid
  • Network Architecture: Nemotron-Hybrid

Deployment Geography: Global

Use Case

NVIDIA-Nemotron-Nano-12B-v2 is a general-purpose reasoning and chat model intended to be used in English and coding languages. Other non-English languages (German, French, Italian, Spanish, and Japanese) are also supported. It is intended for developers designing AI agent systems, chatbots, RAG systems, and other AI-powered applications, and is also suitable for typical instruction-following tasks.

Release Date: 08/29/2025

  • Huggingface 08/29/2025 via https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2

References

Input

  • Input Type(s): Text
  • Input Format(s): String
  • Input Parameters: One-Dimensional (1D): Sequences
  • Other Properties Related to Input: Context length up to 128K. Supported languages include German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese and English.

Output

  • Output Type(s): Text
  • Output Format: String
  • Output Parameters: One-Dimensional (1D): Sequences up to 128K

Our models are designed and optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration

  • Runtime Engine(s): NeMo 25.07.nemotron-nano-v2
  • Supported Hardware Microarchitecture Compatibility: NVIDIA A10G, NVIDIA H100-80GB, NVIDIA A100
  • Operating System(s): Linux

Use it with Transformers

The snippet below shows how to use this model with Huggingface Transformers (tested on version 4.48.3).

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("nvidia/NVIDIA-Nemotron-Nano-12B-v2")
model = AutoModelForCausalLM.from_pretrained(
    "nvidia/NVIDIA-Nemotron-Nano-12B-v2",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# Case 1: if /think is given, or no reasoning signal is provided in the system prompt, reasoning will be set to True

messages = [
    {"role": "system", "content": "/think"},
    {"role": "user", "content": "Write a haiku about GPUs"},
]

# Case 2: if /no_think is provided, reasoning will be set to False

messages = [
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": "Write a haiku about GPUs"},
]

Note: /think or /no_think keywords can also be provided in “user” messages for turn-level reasoning control.
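For example, turn-level control can look like this (an illustrative conversation, not taken from the upstream card):

messages = [
    {"role": "system", "content": "/think"},
    {"role": "user", "content": "Write a haiku about GPUs"},
    {"role": "assistant", "content": "Rivers of matmuls\nstream through silicon gardens;\nanswers bloom in light."},
    # /no_think in the latest user turn switches reasoning off for the next reply
    {"role": "user", "content": "Now translate it into German. /no_think"},
]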

The rest of the inference snippet remains the same

tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    tokenized_chat,
    max_new_tokens=32,
    eos_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(outputs[0]))

We recommend setting temperature to 0.6 and top_p to 0.95 for reasoning True, using greedy search for reasoning False, and increasing max_new_tokens to 1024 or higher for reasoning True.
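Expressed with the Transformers generate API, those recommendations look roughly like this (the exact token budgets are illustrative):

# Reasoning True: sample with temperature 0.6 / top_p 0.95 and a larger budget
outputs = model.generate(
    tokenized_chat,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    max_new_tokens=1024,
    eos_token_id=tokenizer.eos_token_id,
)

# Reasoning False: greedy search with a smaller budget (illustrative)
outputs = model.generate(
    tokenized_chat,
    do_sample=False,
    max_new_tokens=256,
    eos_token_id=tokenizer.eos_token_id,
)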

Use it with TRT-LLM

The snippet below shows how to use this model with TRT-LLM. We tested this on the following commit and followed these instructions to build and install TRT-LLM in a docker container.

from tensorrt_llm import SamplingParams
from tensorrt_llm._torch import LLM
from tensorrt_llm._torch.pyexecutor.config import PyTorchConfig
from tensorrt_llm.llmapi import KvCacheConfig
from transformers import AutoTokenizer

pytorch_config = PyTorchConfig(
    disable_overlap_scheduler=True, enable_trtllm_decoder=True
)
kv_cache_config = KvCacheConfig(
    enable_block_reuse=False,
)
model_id = "nvidia/NVIDIA-Nemotron-Nano-12B-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)

llm = LLM(
    model=model_id,
    max_seq_len=32678,
    max_batch_size=4,
    pytorch_backend_config=pytorch_config,
    kv_cache_config=kv_cache_config,
    tensor_parallel_size=8,
)
messages = [
    {"role": "system", "content": "/think"},
    {"role": "user", "content": "Write a haiku about GPUs"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
sampling_params = SamplingParams(
    max_tokens=512,
    temperature=0.6,
    top_p=0.95,
    add_special_tokens=False,
)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)

Use it with vLLM

The snippet below shows how to use this model with vLLM. Use the latest version of vLLM and follow these instructions to build and install vLLM.

pip install -U "vllm>=0.10.1"

Now you can run the server with:

vllm serve nvidia/NVIDIA-Nemotron-Nano-12B-v2 \
    --trust-remote-code \
    --max-num-seqs 64 \
    --mamba_ssm_cache_dtype float32

Note:

  • Remember to add --mamba_ssm_cache_dtype float32 for accurate quality. Without this option, the model’s accuracy may degrade.
  • If you encounter a CUDA OOM issue, try --max-num-seqs 64, and consider lowering the value further if the error persists.
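Once the server is up, it exposes an OpenAI-compatible API (on port 8000 by default); a minimal client sketch using the openai package:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Nano-12B-v2",
    messages=[
        {"role": "system", "content": "/think"},
        {"role": "user", "content": "Write a haiku about GPUs"},
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=1024,
)
print(completion.choices[0].message.content)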

Alternatively, you can use Docker to launch a vLLM server.

export TP_SIZE=1  # Adjust this value based on the number of GPUs you want to use
docker run --runtime nvidia --gpus all \
           -v ~/.cache/huggingface:/root/.cache/huggingface \
           --env "HUGGINGFACEHUBTOKEN=$HFTOKEN" \
           -p 8000:8000 \
           --ipc=host \
           vllm/vllm-openai:v0.10.1 \
           --model nvidia/NVIDIA-Nemotron-Nano-12B-v2 \
           --tensor-parallel-size ${TP_SIZE} \
           --max-num-seqs 64 \
           --max-model-len 131072 \
           --trust-remote-code \
           --mamba_ssm_cache_dtype float32

#### Using Budget Control with a vLLM Server

The thinking budget allows developers to keep accuracy high and meet response‑time targets, which is especially crucial for customer support, autonomous agent steps, and edge devices where every millisecond counts.

With budget control, you can set a limit for internal reasoning:

  • max_thinking_tokens: This is a threshold that will attempt to end the reasoning trace at the next newline encountered in the reasoning trace. If no newline is encountered within 500 tokens, it will abruptly end the reasoning trace at max_thinking_tokens + 500.

Start a vLLM server:

vllm serve nvidia/NVIDIA-Nemotron-Nano-12B-v2 \
    --trust-remote-code \
    --mamba_ssm_cache_dtype float32

Client for supporting budget control:

from typing import Any, Dict, List

import openai
from transformers import AutoTokenizer

class ThinkingBudgetClient:
    def __init__(self, base_url: str, api_key: str, tokenizer_name_or_path: str):
        self.base_url = base_url
        self.api_key = api_key
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path)
        self.client = openai.OpenAI(base_url=self.base_url, api_key=self.api_key)

    def chat_completion(
        self,
        model: str,
        messages: List[Dict[str, Any]],
        max_thinking_budget: int = 512,
        max_tokens: int = 1024,
        **kwargs,
    ) -> Dict[str, Any]:
        assert (
            max_tokens > max_thinking_budget
        ), f"thinking budget must be smaller than maximum new tokens. Given {max_tokens=} and {max_thinking_budget=}"

        # 1. first call chat completion to get reasoning content
        response = self.client.chat.completions.create(
            model=model, messages=messages, max_tokens=max_thinking_budget, **kwargs
        )
        content = response.choices[0].message.content

        reasoning_content = content
        if "</think>" not in reasoning_content:
            # reasoning content is too long, closed with a period (.)
            reasoning_content = f"{reasoning_content}.\n</think>\n\n"
        reasoning_tokens_len = len(
            self.tokenizer.encode(reasoning_content, add_special_tokens=False)
        )
        remaining_tokens = max_tokens - reasoning_tokens_len
        assert (
            remaining_tokens > 0
        ), f"remaining tokens must be positive. Given {remaining_tokens=}. Increase the max_tokens or lower the max_thinking_budget."

        # 2. append reasoning content to messages and call completion
        messages.append({"role": "assistant", "content": reasoning_content})
        prompt = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            continue_final_message=True,
        )
        response = self.client.completions.create(
            model=model, prompt=prompt, max_tokens=remaining_tokens, **kwargs
        )

        response_data = {
            "reasoning_content": reasoning_content.strip().strip("</think>").strip(),
            "content": response.choices[0].text,
            "finish_reason": response.choices[0].finish_reason,
        }
        return response_data

Calling the server with a budget (Restricted to 32 tokens here as an example)

tokenizer_name_or_path = "nvidia/NVIDIA-Nemotron-Nano-12B-v2"
client = ThinkingBudgetClient(
    base_url="http://localhost:8000/v1",  # Nano 12B v2 deployed in thinking mode
    api_key="EMPTY",
    tokenizer_name_or_path=tokenizer_name_or_path,
)

result = client.chat_completion(
    model="nvidia/NVIDIA-Nemotron-Nano-12B-v2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant. /think"},
        {"role": "user", "content": "What is 2+2?"},
    ],
    max_thinking_budget=32,
    max_tokens=512,
    temperature=0.6,
    top_p=0.95,
)
print(result)

You should see output similar to the following:

{'reasoning_content': "Okay, the user asked, What is 2+2? Let me think. Well, 2 plus 2 equals 4. That's a basic.", 'content': '2 + 2 equals 4.\n', 'finish_reason': 'stop'}

#### Using Tool-Calling with a vLLM Server

Start a vLLM server with native tool-calling:

git clone https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2

vllm serve nvidia/NVIDIA-Nemotron-Nano-12B-v2 \
--trust-remote-code \
--mamba_ssm_cache_dtype float32 \
--enable-auto-tool-choice \
--tool-parser-plugin "NVIDIA-Nemotron-Nano-12B-v2/nemotron_toolcall_parser_no_streaming.py" \
--tool-call-parser "nemotron_json"

After launching a vLLM server, you can call the server with tool-call support using a Python script like below:

from openai import OpenAI

client = OpenAI(
    base_url="http://0.0.0.0:5000/v1",
    api_key="dummy",
)

completion = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Nano-12B-v2",
    messages=[
        {"role": "system", "content": ""},
        {"role": "user", "content": "My bill is $100. What will be the amount for 18% tip?"}
    ],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "calculate_tip",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "bill_total": {
                            "type": "integer",
                            "description": "The total amount of the bill"
                        },
                        "tip_percentage": {
                            "type": "integer",
                            "description": "The percentage of tip to be applied"
                        }
                    },
                    "required": ["bill_total", "tip_percentage"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "convert_currency",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "amount": {
                            "type": "integer",
                            "description": "The amount to be converted"
                        },
                        "from_currency": {
                            "type": "string",
                            "description": "The currency code to convert from"
                        },
                        "to_currency": {
                            "type": "string",
                            "description": "The currency code to convert to"
                        }
                    },
                    "required": ["from_currency", "amount", "to_currency"]
                }
            }
        }
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=32768,
    stream=False
)

print(completion.choices[0].message.content)
print(completion.choices[0].message.tool_calls)

You should see output similar to the following:

<think>
Okay, let's see. The user has a bill of $100 and wants to know the amount for an 18% tip. Hmm, I need to calculate the tip based on the bill total and the percentage. The tools provided include calculate_tip, which takes bill_total and tip_percentage as parameters. So the bill_total here is 100, and the tip_percentage is 18. I should call the calculate_tip function with these values. Wait, do I need to check if the parameters are integers? The bill is $100, which is an integer, and 18% is also an integer. So that fits the function's requirements. I don't need to convert any currency here because the user is asking about a tip in the same currency. So the correct tool to use is calculate_tip with those parameters.
</think>

[ChatCompletionMessageToolCall(id='chatcmpl-tool-e341c6954d2c48c2a0e9071c7bdefd8b', function=Function(arguments='{"bill_total": 100, "tip_percentage": 18}', name='calculate_tip'), type='function')]

Model Version

  • v1.0

Prompt Format

We follow the jinja chat template provided below. This template conditionally adds <think>\n to the start of the Assistant response if /think is found in either the system prompt or any user message. If no reasoning signal is added, the model defaults to reasoning "on" mode. The chat template adds <think></think> to the start of the Assistant response if /no_think is found in the system prompt, thus enforcing reasoning on/off behavior.

{%- set ns = namespace(enable_thinking = true) %}

{%- for message in messages -%}
{%- set content = message['content'] -%}
{%- if message['role'] == 'user' or message['role'] == 'system' -%}
{%- if '/think' in content -%}
{%- set ns.enable_thinking = true -%}
{%- elif '/no_think' in content -%}
{%- set ns.enable_thinking = false -%}
{%- endif -%}
{%- endif -%}
{%- endfor -%}

{%- if messages[0]['role'] != 'system' -%}
{%- set ns.non_tool_system_content = '' -%}
{{- '<SPECIAL_10>System\n' -}}
{%- else -%}
{%- set ns.non_tool_system_content = messages[0]['content']
.replace('/think', '')
.replace('/no_think', '')
.strip()
-%}
{{- '<SPECIAL_10>System\n' + ns.non_tool_system_content }}
{%- endif -%}

{%- if tools -%}
{%- if ns.non_tool_system_content is defined and ns.non_tool_system_content != '' -%}
{{- '\n\n' -}}
{%- endif -%}

{{- 'You can use the following tools to assist the user if required:' -}}
{{- '\n<AVAILABLE_TOOLS>[' -}}
{%- for tool in tools -%}
{{- (tool.function if tool.function is defined else tool) | tojson -}}
{{- ', ' if not loop.last else '' -}}
{%- endfor -%}
{{- ']</AVAILABLE_TOOLS>\n\n' -}}

{{- 'If you decide to call any tool(s), use the following format:\n' -}}
{{- '<TOOLCALL>[{{"name": "toolname1", "arguments": "toolargs1"}}, ' -}}
{{- '{{"name": "toolname2", "arguments": "toolargs2"}}]</TOOLCALL>\n\n' -}}

{{- 'The user will execute tool-calls and return responses from tool(s) in this format:\n' -}}
{{- '<TOOL_RESPONSE>[{{"tool_response1"}}, {{"tool_response2"}}]</TOOL_RESPONSE>\n\n' -}}

{{- 'Based on the tool responses, you can call additional tools if needed, correct tool calls if any errors are found, or just respond to the user.' -}}
{%- endif -%}

{{- '\n' -}}

{%- set messages = messages[1:] if messages[0]['role'] == 'system' else messages -%}

{%- if messages[-1]['role'] == 'assistant' -%}
{%- set ns.last_turn_assistant_content = messages[-1]['content'].strip() -%}
{%- set messages = messages[:-1] -%}
{%- endif -%}

{%- for message in messages -%}
{%- set content = message['content'] -%}

{%- if message['role'] == 'user' -%}
{{- '<SPECIAL_11>User\n' + content.replace('/think', '').replace('/no_think', '').strip() + '\n' }}

{%- elif message['role'] == 'tool' -%}
{%- if loop.first or (messages[loop.index0 - 1].role != 'tool') -%}
{{- '<SPECIAL_11>User\n' + '<TOOL_RESPONSE>[' }}
{%- endif -%}
{{- message['content'] -}}
{{- ', ' if not loop.last and (messages[loop.index0 + 1].role == 'tool') else '' -}}
{%- if loop.last or (messages[loop.index0 + 1].role != 'tool') -%}
{{- ']</TOOL_RESPONSE>\n' -}}
{%- endif -%}

{%- elif message['role'] == 'assistant' -%}
{%- if '</think>' in content -%}
{%- set content = content.split('</think>')[1].strip() %}
{%- endif -%}

{{- '<SPECIAL_11>Assistant\n' + content.strip() }}

{%- if message.tool_calls -%}
{%- if content.strip() != '' -%}
{{- '\n\n' -}}
{%- endif -%}
{{- '<TOOLCALL>[' -}}
{%- for call in message.tool_calls -%}
{%- set fn = call.function if call.function is defined else call -%}
{{- '{"name": "' + fn.name + '", "arguments": ' -}}
{%- if fn.arguments is string -%}
{{- fn.arguments -}}
{%- else -%}
{{- fn.arguments | tojson -}}
{%- endif -%}
{{- '}' + (', ' if not loop.last else '') -}}
{%- endfor -%}
{{- ']</TOOLCALL>' -}}
{%- endif -%}

{{- '\n<SPECIAL_12>\n' -}}
{%- endif -%}
{%- endfor -%}

{%- if add_generation_prompt -%}
{{- '<SPECIAL_11>Assistant\n' -}}
{%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}
{{- '<think></think>' -}}
{%- else -%}
{{- '<think>\n' -}}
{%- endif -%}
{%- if ns.last_turn_assistant_content is defined and ns.last_turn_assistant_content != '' -%}
{{- ns.last_turn_assistant_content -}}
{%- endif -%}

{%- else -%}
{%- if ns.last_turn_assistant_content is defined and ns.last_turn_assistant_content != '' -%}
{{- '<SPECIAL_11>Assistant\n' -}}
{%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}
{{- '<think></think>' -}}
{%- else -%}
{{- '<think>\n' -}}
{%- endif -%}
{{- ns.last_turn_assistant_content -}}

{%- if continue_final_message is defined -%}
{%- if continue_final_message is false -%}
{{- '\n<SPECIAL_12>\n' -}}
{%- endif -%}
{%- else -%}
{{- '\n<SPECIAL_12>\n' -}}
{%- endif -%}
{%- endif -%}
{%- endif -%}
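To sanity-check the on/off behavior, the rendered prompt can be inspected directly; a minimal illustrative sketch:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nvidia/NVIDIA-Nemotron-Nano-12B-v2", trust_remote_code=True)

for signal in ("/think", "/no_think"):
    prompt = tokenizer.apply_chat_template(
        [
            {"role": "system", "content": signal},
            {"role": "user", "content": "Write a haiku about GPUs"},
        ],
        tokenize=False,
        add_generation_prompt=True,
    )
    # With /no_think the assistant turn should end with an empty <think></think> block;
    # with /think it should end with an open <think> tag.
    print(signal, repr(prompt[-40:]))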


Training, Testing, and Evaluation Datasets

Training datasets

  • Data Modality: Text
  • Text Training Data Size: More than 10 Trillion Tokens
  • Train/Test/Valid Split: We used 100% of the corpus for pre-training and relied on external benchmarks for testing.
  • Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
  • Labeling Method by dataset: Hybrid: Automated, Human, Synthetic

Properties: The post-training corpus for NVIDIA-Nemotron-Nano-12B-v2 consists of English and multilingual text (German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese and English). Our sources cover a variety of document types such as: webpages, dialogue, articles, and other written materials. The corpus spans domains including code, legal, math, science, finance, and more. We also include a small portion of question-answering, and alignment style data to improve model accuracies. For several of the domains listed above we used synthetic data, specifically reasoning traces, from DeepSeek R1/R1-0528, Qwen3-235B-A22B, Nemotron 4 340B, Qwen2.5-32B-Instruct-AWQ, Qwen2.5-14B-Instruct, Qwen 2.5 72B.

The pre-training corpus for NVIDIA-Nemotron-Nano-12B-v2 consists of high-quality curated and synthetically-generated data. It is trained in the English language, as well as 15 multilingual languages and 43 programming languages. Our sources cover a variety of document types such as: webpages, dialogue, articles, and other written materials. The corpus spans domains including legal, math, science, finance, and more. We also include a small portion of question-answering, and alignment style data to improve model accuracy. The model was pre-trained for approximately twenty trillion tokens.

Alongside the model, we release our final pretraining data, as outlined in this section. For ease of analysis, there is a sample set that is ungated. For all remaining code, math and multilingual data, gating and approval is required, and the dataset is permissively licensed for model training purposes.

More details on the datasets and synthetic data generation methods can be found in the technical report NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model.

Public Datasets

| Dataset | Collection Period |
| --- | --- |
| Problems in Elementary Mathematics for Home Study | 4/23/2025 |
| GSM8K | 4/23/2025 |
| PRM800K | 4/23/2025 |
| CC-NEWS | 4/23/2025 |
| Common Crawl | 4/23/2025 |
| Wikimedia | 4/23/2025 |
| Bespoke-Stratos-17k | 4/23/2025 |
| tigerbot-kaggle-leetcodesolutions-en-2k | 4/23/2025 |
| glaive-function-calling-v2 | 4/23/2025 |
| APIGen Function-Calling | 4/23/2025 |
| LMSYS-Chat-1M | 4/23/2025 |
| Open Textbook Library - CC BY-SA & GNU subset and OpenStax - CC BY-SA subset | 4/23/2025 |
| Advanced Reasoning Benchmark, tigerbot-kaggle-leetcodesolutions-en-2k, PRM800K, and SciBench | 4/23/2025 |
| FineWeb-2 | 4/23/2025 |
| Court Listener | Legacy Download |
| peS2o | Legacy Download |
| OpenWebMath | Legacy Download |
| BioRxiv | Legacy Download |
| PMC Open Access Subset | Legacy Download |
| OpenWebText2 | Legacy Download |
| Stack Exchange Data Dump | Legacy Download |
| PubMed Abstracts | Legacy Download |
| NIH ExPorter | Legacy Download |
| arXiv | Legacy Download |
| BigScience Workshop Datasets | Legacy Download |
| Reddit Dataset | Legacy Download |
| SEC's Electronic Data Gathering, Analysis, and Retrieval (EDGAR) | Legacy Download |
| Public Software Heritage S3 | Legacy Download |
| The Stack | Legacy Download |
| mC4 | Legacy Download |
| Advanced Mathematical Problem Solving | Legacy Download |
| MathPile | Legacy Download |
| NuminaMath CoT | Legacy Download |
| PMC Article | Legacy Download |
| FLAN | Legacy Download |
| Advanced Reasoning Benchmark | Legacy Download |
| SciBench | Legacy Download |
| WikiTableQuestions | Legacy Download |
| FinQA | Legacy Download |
| Riddles | Legacy Download |
| Problems in Elementary Mathematics for Home Study | Legacy Download |
| MedMCQA | Legacy Download |
| Cosmos QA | Legacy Download |
| MCTest | Legacy Download |
| AI2's Reasoning Challenge | Legacy Download |
| OpenBookQA | Legacy Download |
| MMLU Auxiliary Train | Legacy Download |
| social-chemestry-101 | Legacy Download |
| Moral Stories | Legacy Download |
| The Common Pile v0.1 | Legacy Download |
| FineMath | Legacy Download |
| MegaMath | Legacy Download |
| FastChat | 6/30/2025 |

Private Non-publicly Accessible Datasets of Third Parties

| Dataset |
| --- |
| Global Regulation |
| Workbench |

Online Dataset Sources

The English Common Crawl data was downloaded from the Common Crawl Foundation (see their FAQ for details on their crawling) and includes the snapshots CC-MAIN-2013-20 through CC-MAIN-2025-13. The data was subsequently deduplicated and filtered in various ways described in the Nemotron-CC paper.

Additionally, we extracted data for fifteen languages from the following three Common Crawl snapshots: CC-MAIN-2024-51, CC-MAIN-2025-08, CC-MAIN-2025-18. The fifteen languages included were Arabic, Chinese, Danish, Dutch, French, German, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Swedish, and Thai. As we did not have reliable multilingual model-based quality classifiers available, we applied just heuristic filtering instead—similar to what we did for lower quality English data in the Nemotron-CC pipeline, but selectively removing some filters for some languages that did not work well. Deduplication was done in the same way as for Nemotron-CC.

The GitHub Crawl was collected using the GitHub REST API and the Amazon S3 API. Each crawl was operated in accordance with the rate limits set by its respective source, either GitHub or S3. We collect raw source code and subsequently remove any having a license which does not exist in our permissive-license set (for additional details, refer to the technical report).

| Dataset | Modality | Dataset Size (Tokens) | Collection Period |
| --- | --- | --- | --- |
| English Common Crawl | Text | 3.360T | 4/8/2025 |
| Multilingual Common Crawl | Text | 812.7B | 5/1/2025 |
| GitHub Crawl | Text | 747.4B | 4/29/2025 |

NVIDIA-Sourced Synthetic Datasets

| Dataset | Modality | Dataset Size (Tokens) | Seed Dataset | Model(s) used for generation |
| --- | --- | --- | --- | --- |
| Synthetic Art of Problem Solving from DeepSeek-R1 | Text | 25.5B | Art of Problem Solving; American Mathematics Competitions 8; American Mathematics Competitions 10 | DeepSeek-R1 |
| Synthetic Moral Stories and Social Chemistry from Mixtral-8x22B-v0.1 | Text | 327M | social-chemestry-101; Moral Stories | Mixtral-8x22B-v0.1 |
| Synthetic Social Sciences seeded with OpenStax from DeepSeek-V3, Mixtral-8x22B-v0.1, and Qwen2.5-72B | Text | 83.6M | OpenStax - CC BY-SA subset | DeepSeek-V3; Mixtral-8x22B-v0.1; Qwen2.5-72B |
| Synthetic Health Sciences seeded with OpenStax from DeepSeek-V3, Mixtral-8x22B-v0.1, and Qwen2.5-72B | Text | 9.7M | OpenStax - CC BY-SA subset | DeepSeek-V3; Mixtral-8x22B-v0.1; Qwen2.5-72B |
| Synthetic STEM seeded with OpenStax, Open Textbook Library, and GSM8K from DeepSeek-R1, DeepSeek-V3, DeepSeek-V3-0324, and Qwen2.5-72B | Text | 175M | OpenStax - CC BY-SA subset; GSM8K; Open Textbook Library - CC BY-SA & GNU subset | DeepSeek-R1; DeepSeek-V3; DeepSeek-V3-0324; Qwen2.5-72B |
| Nemotron-PrismMath | Text | 4.6B | Big-Math-RL-Verified; OpenR1-Math-220k | Qwen2.5-0.5B-instruct, Qwen2.5-72B-Instruct; DeepSeek-R1-Distill-Qwen-32B |
| Synthetic Question Answering Data from Papers and Permissible Books from Qwen2.5-72B-Instruct | Text | 350M | arXiv; National Institutes of Health ExPorter; BioRxiv; PMC Article; USPTO Backgrounds; peS2o; Global Regulation; CORE; PG-19; DOAB CC BY & CC BY-SA subset; NDLTD | Qwen2.5-72B-Instruct |
| Synthetic FineMath-4+ Reprocessed from DeepSeek-V3 | Text | 9.2B | Common Crawl | DeepSeek-V3 |
| Synthetic FineMath-3+ Reprocessed from phi-4 | Text | 27.6B | Common Crawl | phi-4 |
| Synthetic Union-3+ Reprocessed from phi-4 | Text | 93.1B | Common Crawl | phi-4 |
| Refreshed Nemotron-MIND from phi-4 | Text | 73B | Common Crawl | phi-4 |
| Synthetic Union-4+ Reprocessed from phi-4 | Text | 14.12B | Common Crawl | phi-4 |
| Synthetic Union-3+ minus 4+ Reprocessed from phi-4 | Text | 78.95B | Common Crawl | phi-4 |
| Synthetic Union-3 Refreshed from phi-4 | Text | 80.94B | Common Crawl | phi-4 |
| Synthetic Union-4+ Refreshed from phi-4 | Text | 52.32B | Common Crawl | phi-4 |
| Synthetic AGIEval seeded with AQUA-RAT, LogiQA, and AR-LSAT from DeepSeek-V3 and DeepSeek-V3-0324 | Text | 4.0B | AQUA-RAT; LogiQA; AR-LSAT | DeepSeek-V3; DeepSeek-V3-0324 |
| Synthetic AGIEval seeded with AQUA-RAT, LogiQA, and AR-LSAT from Qwen3-30B-A3B | Text | 4.2B | AQUA-RAT; LogiQA; AR-LSAT | Qwen3-30B-A3B |
| Synthetic Art of Problem Solving from Qwen2.5-32B-Instruct, Qwen2.5-Math-72B, Qwen2.5-Math-7B, and Qwen2.5-72B-Instruct | Text | 83.1B | Art of Problem Solving; American Mathematics Competitions 8; American Mathematics Competitions 10; GSM8K; PRM800K | Qwen2.5-32B-Instruct; Qwen2.5-Math-72B; Qwen2.5-Math-7B; Qwen2.5-72B-Instruct |
| Synthetic MMLU Auxiliary Train from DeepSeek-R1 | Text | 0.5B | MMLU Auxiliary Train | DeepSeek-R1 |
| Synthetic Long Context Continued Post-Training Data from Papers and Permissible Books from Qwen2.5-72B-Instruct | Text | 5.4B | arXiv; National Institutes of Health ExPorter; BioRxiv; PMC Article; USPTO Backgrounds; peS2o; Global Regulation; CORE; PG-19; DOAB CC BY & CC BY-SA subset; NDLTD | Qwen2.5-72B-Instruct |
| Synthetic Common Crawl from Qwen3-30B-A3B and Mistral-Nemo-12B-Instruct | Text | 1.949T | Common Crawl | Qwen3-30B-A3B; Mistral-NeMo-12B-Instruct |
| Synthetic Multilingual Data from Common Crawl from Qwen3-30B-A3B | Text | 997.3B | Common Crawl | Qwen3-30B-A3B |
| Synthetic Multilingual Data from Wikimedia from Qwen3-30B-A3B | Text | 55.1B | Wikimedia | Qwen3-30B-A3B |
| Synthetic OpenMathReasoning from DeepSeek-R1-0528 | Text | 1.5M | OpenMathReasoning | DeepSeek-R1-0528 |
| Synthetic OpenCodeReasoning from DeepSeek-R1-0528 | Text | 1.1M | OpenCodeReasoning | DeepSeek-R1-0528 |
| Synthetic Science Data from DeepSeek-R1-0528 | Text | 1.5M | - | DeepSeek-R1-0528 |
| Synthetic Humanity's Last Exam from DeepSeek-R1-0528 | Text | 460K | Humanity's Last Exam | DeepSeek-R1-0528 |
| Synthetic ToolBench from Qwen3-235B-A22B | Text | 400K | ToolBench | Qwen3-235B-A22B |
| Synthetic Nemotron Content Safety Dataset V2, eval-safety, Gretel Synthetic Safety Alignment, and RedTeam_2K from DeepSeek-R1-0528 | Text | 52K | Nemotron Content Safety Dataset V2; eval-safety; Gretel Synthetic Safety Alignment; RedTeam_2K | DeepSeek-R1-0528 |
| Synthetic HelpSteer from Qwen3-235B-A22B | Text | 120K | HelpSteer3; HelpSteer2 | Qwen3-235B-A22B |
| Synthetic Alignment data from Mixtral-8x22B-Instruct-v0.1, Mixtral-8x7B-Instruct-v0.1, and Nemotron-4 Family | Text | 400K | HelpSteer2; C4; LMSYS-Chat-1M; ShareGPT52K; tigerbot-kaggle-leetcodesolutions-en-2k; GSM8K; PRM800K; lm_identity (NVIDIA internal); FinQA; WikiTableQuestions; Riddles; ChatQA nvolve-multiturn (NVIDIA internal); glaive-function-calling-v2; SciBench; OpenBookQA; Advanced Reasoning Benchmark; Public Software Heritage S3; Khan Academy Math Keywords | Nemotron-4-15B-Base (NVIDIA internal); Nemotron-4-15B-Instruct (NVIDIA internal); Nemotron-4-340B-Base; Nemotron-4-340B-Instruct; Nemotron-4-340B-Reward; Mixtral-8x7B-Instruct-v0.1; Mixtral-8x22B-Instruct-v0.1 |
| Synthetic LMSYS-Chat-1M from Qwen3-235B-A22B | Text | 1M | LMSYS-Chat-1M | Qwen3-235B-A22B |
| Synthetic Multilingual Reasoning data from DeepSeek-R1-0528, Qwen2.5-32B-Instruct-AWQ, and Qwen2.5-14B-Instruct | Text | 25M | OpenMathReasoning; OpenCodeReasoning | DeepSeek-R1-0528; Qwen2.5-32B-Instruct-AWQ (translation); Qwen2.5-14B-Instruct (translation) |
| Synthetic Multilingual Reasoning data from Qwen3-235B-A22B and Gemma 3 Post-Trained models | Text | 5M | WildChat | Qwen3-235B-A22B; Gemma 3 PT 12B; Gemma 3 PT 27B |

Evaluation Dataset:

  • Data Collection Method by dataset: Hybrid: Human, Synthetic
  • Labeling Method by dataset: Hybrid: Automated, Human, Synthetic

Inference

  • Engines: HF, vLLM, TRT-LLM
  • Test Hardware: NVIDIA A10G 24GB, H100 80GB

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our Trustworthy AI terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards.

Please report security vulnerabilities or NVIDIA AI Concerns here.

Citation

@misc{nvidia2025nvidianemotronnano2,
      title={NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model},
      author={NVIDIA},
      year={2025},
      eprint={2508.14444},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.14444},
}


🚀 If you find these models useful

Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks:

👉 Quantum Network Monitor

The full open-source code for the Quantum Network Monitor Service is available in my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor. You will also find the code I use to quantize the models, if you want to do it yourself, in GGUFModelBuilder.

💬 How to test:
Choose an AI assistant type:
- TurboLLM (GPT-4.1-mini)
- HugLLM (Hugging Face open-source models)
- TestLLM (Experimental CPU-only)

What I’m Testing

I’m pushing the limits of small open-source models for AI network monitoring, specifically:
  • Function calling against live network services
  • How small can a model go while still handling:
    - Automated Nmap security scans
    - Quantum-readiness checks
    - Network Monitoring tasks

🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on huggingface docker space):

  • Zero-configuration setup
  • ⏳ 30s load time (slow inference but no API costs). No token limit, as the cost is low.
  • 🔧 Help wanted! If you’re into edge-device AI, let’s collaborate!

Other Assistants

🟢 TurboLLM – Uses gpt-4.1-mini :
  • It performs very well but unfortunately OpenAI charges per token. For this reason token usage is limited.
  • Create custom cmd processors to run .net code on Quantum Network Monitor Agents
  • Real-time network diagnostics and monitoring
  • Security Audits
  • Penetration testing (Nmap/Metasploit)

🔵 HugLLM – Latest Open-source models:

  • 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.

💡 Example commands you could test:

  1. "Give me info on my websites SSL certificate"
  2. "Check if my server is using quantum safe encyption for communication"
  3. "Run a comprehensive security audit on my server"
  4. '"Create a cmd processor to .. (what ever you want)" Note you need to install a Quantum Network Monitor Agent to run the .net code on. This is a very flexible and powerful feature. Use with caution!

Final Word

I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source. Feel free to use whatever you find helpful.

If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.

I'm also open to job opportunities or sponsorship.

Thank you! 😊

📂 GGUF File List

| 📁 Filename | Type | 📦 Size |
| --- | --- | --- |
| NVIDIA-Nemotron-Nano-12B-v2-bf16.gguf | FP16 | 22.94 GB |
| NVIDIA-Nemotron-Nano-12B-v2-bf16_q8_0.gguf | Q8 | 16.1 GB |
| NVIDIA-Nemotron-Nano-12B-v2-f16_q8_0.gguf | Q8 | 16.1 GB |
| NVIDIA-Nemotron-Nano-12B-v2-imatrix.gguf | imatrix | 4.86 MB |
| NVIDIA-Nemotron-Nano-12B-v2-iq1_m.gguf | IQ1 | 3.57 GB |
| NVIDIA-Nemotron-Nano-12B-v2-iq1_s.gguf | IQ1 | 3.17 GB |
| NVIDIA-Nemotron-Nano-12B-v2-iq2_m.gguf | Q2 | 4.39 GB |
| NVIDIA-Nemotron-Nano-12B-v2-iq2_s.gguf | Q2 | 4.16 GB |
| NVIDIA-Nemotron-Nano-12B-v2-iq2_xs.gguf | Q2 | 4.07 GB |
| NVIDIA-Nemotron-Nano-12B-v2-iq2_xxs.gguf | Q2 | 3.77 GB |
| NVIDIA-Nemotron-Nano-12B-v2-iq3_m.gguf | Q3 | 5.73 GB |
| NVIDIA-Nemotron-Nano-12B-v2-iq3_xs.gguf | Q3 | 5.39 GB |
| NVIDIA-Nemotron-Nano-12B-v2-iq3_xxs.gguf | Q3 | 5.15 GB |
| NVIDIA-Nemotron-Nano-12B-v2-iq4_nl.gguf | Q4 | 6.46 GB |
| NVIDIA-Nemotron-Nano-12B-v2-iq4_xs.gguf | Q4 | 6.47 GB |
| NVIDIA-Nemotron-Nano-12B-v2-q2_k_m.gguf | Q2 | 4.52 GB |
| NVIDIA-Nemotron-Nano-12B-v2-q2_k_s.gguf | Q2 | 4.36 GB |
| NVIDIA-Nemotron-Nano-12B-v2-q3_k_m.gguf | Q3 | 5.81 GB |
| NVIDIA-Nemotron-Nano-12B-v2-q3_k_s.gguf | Q3 | 5.65 GB |
| NVIDIA-Nemotron-Nano-12B-v2-q4_0.gguf | Q4 (Recommended) | 7.09 GB |
| NVIDIA-Nemotron-Nano-12B-v2-q4_1.gguf | Q4 | 7.26 GB |
| NVIDIA-Nemotron-Nano-12B-v2-q4_k_m.gguf | Q4 | 7.4 GB |
| NVIDIA-Nemotron-Nano-12B-v2-q4_k_s.gguf | Q4 | 6.92 GB |
| NVIDIA-Nemotron-Nano-12B-v2-q5_0.gguf | Q5 | 8.37 GB |
| NVIDIA-Nemotron-Nano-12B-v2-q5_1.gguf | Q5 | 9 GB |
| NVIDIA-Nemotron-Nano-12B-v2-q5_k_m.gguf | Q5 | 8.57 GB |
| NVIDIA-Nemotron-Nano-12B-v2-q6_k_m.gguf | Q6 | 9.72 GB |
| NVIDIA-Nemotron-Nano-12B-v2-q8_0.gguf | Q8 | 12.19 GB |
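The GGUF files above are not covered by NVIDIA's own usage snippets; as one hedged example of running them locally, assuming llama-cpp-python with a llama.cpp build recent enough to support this hybrid architecture, a quantized file can be loaded like this:

from llama_cpp import Llama

# Assumes the recommended q4_0 file from the table above has been downloaded locally.
llm = Llama(
    model_path="NVIDIA-Nemotron-Nano-12B-v2-q4_0.gguf",
    n_ctx=8192,        # context window used here; the model itself supports far longer contexts
    n_gpu_layers=-1,   # offload all layers to GPU if one is available
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "/think"},
        {"role": "user", "content": "Write a haiku about GPUs"},
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=1024,
)
print(out["choices"][0]["message"]["content"])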