norallm/normistral-7b-warm-instruct

Name: norallm/normistral-7b-warm-instruct
Author: norallm

High-quality GGUF model

1.8K 📥 Downloads

28 ❤️ Likes

5 📁 GGUF Files

24.85 GB 💾 Total Size

1 day ago 🔄 Last Updated

📋 Model Description

language:
- 'no'
- nb
- nn
inference: true
tags:
- mistral
- norwegian
- instruction
- chat
license: apache-2.0
pipeline_tag: text-generation
datasets:
- CohereForAI/aya_dataset
- OpenAssistant/oasst1
- OpenAssistant/oasst2
- laion/OIG
- HuggingFaceH4/no_robots
- databricks/databricks-dolly-15k
- glaiveai/glaive-code-assistant-v2

Instruction-tuned NorMistral-7b-warm

This model is instruction-tuned on open datasets released under the permissive apache-2.0 licence (in other words, we don't use any datasets generated by ChatGPT), so we can release it under the same license and make it openly available for commercial applications. The model was finetuned with a context length of 4096 tokens, twice the context length of the base model.

The released weights are still a work in progress and they might change in the future. This is the first iteration of instruction-tuning our NorMistral models and it currently uses only the SFT phase without any preference optimization. Please let us know your feedback to improve the model in future releases.

Finetuning corpus

The corpus was compiled by this process:

1. We gathered all openly available datasets: Aya, OASST 1, OASST 2, OIG-small-chip2, No Robots, Dolly and Glaive code assistant.
2. These were first manually inspected and filtered, and then automatically filtered with Mixtral-8x7B to remove incorrect, offensive, non-English and American-centric responses.
3. The responses were augmented to be more descriptive by Mixtral-8x7B.
4. Since most of that dataset contains only a single dialogue turn, we generated more turns using Mixtral-8x7B.
5. Finally, we translated the resulting dataset into Bokmål and Nynorsk using NorMistral-7b-warm.

How to run the model?

1. Prompt format

NorMistral uses a ChatML-like format for structuring (multi-turn) conversations. An example prompt in this format looks like the following (note the special <|im_start|> and <|im_end|> tokens).

<|im_start|> user
Hva er hovedstaden i Norge?<|im_end|>
<|im_start|> assistant
Hovedstaden i Norge er Oslo. Denne byen ligger i den sørøstlige delen av landet, ved Oslofjorden. Oslo er en av de raskest voksende byene i Europa, og den er kjent for sin rike historie, kultur og moderne arkitektur. Noen populære turistattraksjoner i Oslo inkluderer Vigelandsparken, som viser mer enn 200 skulpturer laget av den berømte norske skulptøren Gustav Vigeland, og det kongelige slott, som er den offisielle residensen til Norges kongefamilie. Oslo er også hjemsted for mange museer, gallerier og teatre, samt mange restauranter og barer som tilbyr et bredt utvalg av kulinariske og kulturelle opplevelser.<|im_end|>
<|im_start|> user
Gi meg en liste over de beste stedene å besøke i hovedstaden<|im_end|>
<|im_start|> assistant

This prompt format is available as a chat template in NorMistral's tokenizer, so it can be applied by calling the
tokenizer.apply_chat_template() method:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-7b-warm-instruct")

messages = [
    {"role": "user", "content": "Hva er hovedstaden i Norge?"},
    {"role": "assistant", "content": "Hovedstaden i Norge er Oslo. Denne byen ligger i den sørøstlige delen av landet, ved Oslofjorden. Oslo er en av de raskest voksende byene i Europa, og den er kjent for sin rike historie, kultur og moderne arkitektur. Noen populære turistattraksjoner i Oslo inkluderer Vigelandsparken, som viser mer enn 200 skulpturer laget av den berømte norske skulptøren Gustav Vigeland, og det kongelige slott, som er den offisielle residensen til Norges kongefamilie. Oslo er også hjemsted for mange museer, gallerier og teatre, samt mange restauranter og barer som tilbyr et bredt utvalg av kulinariske og kulturelle opplevelser."},
    {"role": "user", "content": "Gi meg en liste over de beste stedene å besøke i hovedstaden"}
]
gen_input = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

When tokenizing messages for generation, set add_generation_prompt=True when calling apply_chat_template(). This will append <|im_start|>assistant\n to your prompt, to ensure
that the model continues with an assistant response.
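
When the tokenizer is not at hand (for example, when building a raw prompt string for another runtime), the layout of the example prompt above can be assembled by hand. The following is a minimal sketch only: `format_chatml` is a hypothetical helper, not part of transformers, and it copies the spacing of the example prompt (a space after <|im_start|>) as an assumption. The tokenizer's own chat template remains the authoritative definition of the format.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Build a ChatML-like prompt string from a list of {"role", "content"} dicts."""
    parts = []
    for message in messages:
        # One turn: start token, role, newline, content, end token.
        parts.append(f"<|im_start|> {message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues with a reply.
        parts.append("<|im_start|> assistant\n")
    return "".join(parts)


prompt = format_chatml([{"role": "user", "content": "Hva er hovedstaden i Norge?"}])
print(prompt)
```

Comparing this string with the decoded output of the tokenizer's template is a quick way to check whether a hand-built prompt matches what the model was trained on.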

2. Generation parameters

The model is quite sensitive to generation parameters, so it is important to set them correctly. We give an example of a reasonable generation setting below. Note that other libraries have different defaults, so check them before use.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("norallm/normistral-7b-warm-instruct", torch_dtype=torch.bfloat16)

model.generate(
    gen_input,
    max_new_tokens=1024,
    top_k=64,  # top-k sampling
    top_p=0.9,  # nucleus sampling
    temperature=0.3,  # a low temperature to make the outputs less chaotic
    repetition_penalty=1.0,  # turn the repetition penalty off, having it on can lead to very bad outputs
    do_sample=True,  # randomly sample the outputs
    use_cache=True  # speed-up generation
)
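
For a decoder-only model such as this one, model.generate() returns the prompt tokens followed by the newly generated ones, so the reply is usually obtained by slicing off the prompt before decoding. The following is a minimal sketch, assuming `tokenizer` and `gen_input` come from the prompt-format step; `decode_reply` is a hypothetical helper name, not a transformers function.

```python
def decode_reply(tokenizer, gen_input, output_ids):
    """Decode only the tokens generated after the prompt (first sequence in the batch)."""
    prompt_length = len(gen_input[0])  # number of prompt tokens
    new_tokens = output_ids[0][prompt_length:]  # drop the echoed prompt
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Usage (assumes the model, tokenizer and gen_input objects created earlier):
# output_ids = model.generate(gen_input, max_new_tokens=1024, do_sample=True, temperature=0.3)
# print(decode_reply(tokenizer, gen_input, output_ids))
```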

About the base model

NorMistral-7b-warm is a large Norwegian language model initialized from Mistral-7b-v0.1 and
continuously pretrained on a total of 260 billion subword tokens (using six repetitions of open Norwegian texts).

This model is a part of the NORA.LLM family developed in collaboration between the Language Technology Group at the University of Oslo, the High Performance Language Technologies (HPLT) project, the National Library of Norway, and the University of Turku.
All the models are pre-trained on the same dataset and with the same tokenizer.
NorMistral-7b-warm has over 7 billion parameters and is based on the Mistral architecture.

The NORA.LLM language model family includes (as of now):

NorMistral-7b-warm -- an LLM initialized from Mistral-7b-v0.1 and continuously pretrained on Norwegian data;
NorMistral-7b-scratch -- a Mistral-based LLM pretrained from scratch on Norwegian data;
NorBLOOM-7b-scratch -- a BLOOM-based LLM pretrained from scratch on Norwegian data.

Quantization

Provided files

Name	Quant method	Bits per weight	Size	Max RAM/VRAM required	Use case
normistral-7b-warm-instruct.Q3_K_M.gguf	Q3_K_M	3.89	3.28 GB	5.37 GB	very small, high loss of quality
normistral-7b-warm-instruct.Q4_K_M.gguf	Q4_K_M	4.83	4.07 GB	6.16 GB	medium, balanced quality
normistral-7b-warm-instruct.Q5_K_M.gguf	Q5_K_M	5.67	4.78 GB	6.87 GB	large, very low quality loss
normistral-7b-warm-instruct.Q6_K.gguf	Q6_K	6.56	5.54 GB	7.63 GB	very large, extremely low quality loss
normistral-7b-warm-instruct.Q8_0.gguf	Q8_0	8.50	7.17 GB	9.26 GB	very large, extremely low quality loss
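
To choose among these files programmatically, one option is to pick the largest quantization whose "Max RAM/VRAM required" figure fits the memory you have. The sketch below hard-codes the values from the table above; `pick_quant` and the idea of a fixed memory budget are illustrative assumptions, not part of any library.

```python
# (quant method, max RAM/VRAM required in GB), copied from the table above
# and kept in ascending order of memory use.
QUANTS = [
    ("Q3_K_M", 5.37),
    ("Q4_K_M", 6.16),
    ("Q5_K_M", 6.87),
    ("Q6_K", 7.63),
    ("Q8_0", 9.26),
]


def pick_quant(available_gb):
    """Return the largest quant that fits in available_gb, or None if none fits."""
    fitting = [name for name, required_gb in QUANTS if required_gb <= available_gb]
    return fitting[-1] if fitting else None


print(pick_quant(7.0))  # Q5_K_M
```

The table's memory figures are estimates, so leaving some headroom below the real amount of free memory is a sensible precaution.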

How to run from Python code

You can use GGUF models from Python with, for example, the llama-cpp-python library.

#### How to load this model in Python code, using llama-cpp-python

For full documentation, please see: llama-cpp-python docs.

#### First install the package

Run one of the following commands, according to your system:

# Base llama-cpp-python with no GPU acceleration
pip install llama-cpp-python
# With NVidia CUDA acceleration
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
# Or with OpenBLAS acceleration
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
# Or with CLBlast acceleration
CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python
# Or with AMD ROCm GPU acceleration (Linux only)
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
# Or with Metal GPU acceleration for macOS systems only
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

On Windows, to set the CMAKE_ARGS variable in PowerShell, follow this format; e.g. for NVidia CUDA:
$env:CMAKE_ARGS = "-DLLAMA_CUBLAS=on"
pip install llama-cpp-python

#### Simple llama-cpp-python example code

from llama_cpp import Llama

# Directly from huggingface-hub (requires huggingface-hub to be installed)
# Set n_gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = Llama.from_pretrained(
  repo_id="norallm/normistral-7b-warm-instruct",  # HuggingFace repository containing the GGUF files.
  filename="*Q4_K_M.gguf", # suffix of the filename containing the level of quantization.
  n_ctx=32768,  # The max sequence length to use - note that longer sequence lengths require much more resources
  n_threads=8,            # The number of CPU threads to use, tailor to your system and the resulting performance
  n_gpu_layers=35         # The number of layers to offload to GPU, if you have GPU acceleration available
)

# Simple inference example
output = llm(
  """<s><|im_start|> user
Hva kan jeg bruke einstape til?<|im_end|>
<|im_start|> assistant
""", # Prompt
  max_tokens=512,  # Generate up to 512 tokens
  stop=["<|im_end|>"],   # Example stop token
  echo=True,       # Whether to echo the prompt
  temperature=0.3  # Temperature to set, for Q3_K_M, Q4_K_M, Q5_K_M, and Q6_K it is recommended to set it relatively low.
)

# Chat Completion API

llm.create_chat_completion(
    messages = [
        {
            "role": "user",
            "content": "Hva kan jeg bruke einstape til?"
        }
    ]
)
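
Both calls return an OpenAI-style response dictionary in llama-cpp-python: a plain completion stores its text under choices[0]["text"], and a chat completion stores the reply under choices[0]["message"]["content"]. The helpers below are a small sketch with hypothetical names, and the sample dictionaries are hand-written stand-ins for real outputs, trimmed to the fields used here.

```python
def completion_text(output):
    """Text of the first choice returned by a plain llm(...) completion call."""
    return output["choices"][0]["text"]


def chat_reply(response):
    """Assistant message of the first choice returned by create_chat_completion()."""
    return response["choices"][0]["message"]["content"]


# Hand-written stand-ins for real responses (illustrative only).
fake_completion = {"choices": [{"text": "Oslo"}]}
fake_chat = {"choices": [{"message": {"role": "assistant", "content": "Oslo"}}]}

print(completion_text(fake_completion))
print(chat_reply(fake_chat))
```

With echo=True, as in the inference example, the returned text should also contain the prompt, so set echo=False if only the generated reply is wanted.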

Citation

@inproceedings{samuel-etal-2025-small,
    title = "Small Languages, Big Models: {A} Study of Continual Training on Languages of {Norway}",
    author = "Samuel, David  and
      Mikhailov, Vladislav  and
      Velldal, Erik  and
      {\O}vrelid, Lilja  and
      Charpentier, Lucas Georges Gabriel  and
      Kutuzov, Andrey  and
      Oepen, Stephan",
    editor = "Johansson, Richard  and
      Stymne, Sara",
    booktitle = "Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)",
    month = mar,
    year = "2025",
    address = "Tallinn, Estonia",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2025.nodalida-1.61/",
    pages = "573--608",
    ISBN = "978-9908-53-109-0"
}

📂 GGUF File List

📁 Filename	📦 Size
normistral-7b-warm-instruct.Q3_K_M.gguf	3.28 GB
normistral-7b-warm-instruct.Q4_K_M.gguf (recommended)	4.07 GB
normistral-7b-warm-instruct.Q5_K_M.gguf	4.78 GB
normistral-7b-warm-instruct.Q6_K.gguf	5.54 GB
normistral-7b-warm-instruct.Q8_0.gguf	7.17 GB

📊 Model Information

📅 Created: 2 years ago

🎯 Difficulty: Intermediate

⚙️ Quantization: Q3, Q4, Q5, Q6, Q8

🏷️ Tags

transformers, pytorch, safetensors, gguf, mistral, text-generation, norwegian, instruction, chat, conversational, no, nb, nn, dataset:CohereForAI/aya_dataset, dataset:OpenAssistant/oasst1, dataset:OpenAssistant/oasst2, dataset:laion/OIG, dataset:HuggingFaceH4/no_robots, dataset:databricks/databricks-dolly-15k, dataset:glaiveai/glaive-code-assistant-v2, license:apache-2.0, text-generation-inference, endpoints_compatible, region:us
