πŸ“‹ Model Description


---
license: llama3.3
base_model: meta-llama/Llama-3.3-70B-Instruct
base_model_relation: quantized
tags:
- Llama 3.3 70B
- GGUF
- quantized
- 4-bit
- 3-bit
---

Llama.cpp hybrid layer quantization of Llama 3.3 70B Instruct by meta-llama

Original model: https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct

The hybrid quant employs different quantization levels on a per-layer basis to enable
both high performance and small file size at the same time. The quants employed are
all K quants, to avoid the slow processing of IQ quants on CPUs and older GPUs. Three
quants are available for the model, as follows:

Q3SH : Smallest Q3_K-based quant available

```
LAYER_TYPES='[
[0 ,"Q4KM"],[1 ,"Q3KL"],[2 ,"Q3KM"],[3 ,"Q3KS"],[4 ,"Q3KS"],[5 ,"Q3KS"],[6 ,"Q3KS"],[7 ,"Q3KS"],
[8 ,"Q3KS"],[9 ,"Q3KS"],[10,"Q3KS"],[11,"Q3KS"],[12,"Q3KS"],[13,"Q3KS"],[14,"Q3KS"],[15,"Q3KS"],
[16,"Q3KS"],[17,"Q3KS"],[18,"Q3KS"],[19,"Q3KS"],[20,"Q3KS"],[21,"Q3KS"],[22,"Q3KS"],[23,"Q3KS"],
[24,"Q3KS"],[25,"Q3KS"],[26,"Q3KS"],[27,"Q3KS"],[28,"Q3KS"],[29,"Q3KS"],[30,"Q3KS"],[31,"Q3KS"],
[32,"Q3KS"],[33,"Q3KS"],[34,"Q3KS"],[35,"Q3KS"],[36,"Q3KS"],[37,"Q3KS"],[38,"Q3KS"],[39,"Q3KS"],
[40,"Q3KM"],[41,"Q3KS"],[42,"Q3KM"],[43,"Q3KS"],[44,"Q3KM"],[45,"Q3KS"],[46,"Q3KM"],[47,"Q3KS"],
[48,"Q3KM"],[49,"Q3KS"],[50,"Q3KM"],[51,"Q3KS"],[52,"Q3KM"],[53,"Q3KS"],[54,"Q3KM"],[55,"Q3KS"],
[56,"Q3KM"],[57,"Q3KS"],[58,"Q3KM"],[59,"Q3KS"],[60,"Q3KM"],[61,"Q3KS"],[62,"Q3KM"],[63,"Q3KS"],
[64,"Q3KM"],[65,"Q3KM"],[66,"Q3KM"],[67,"Q3KM"],[68,"Q3KM"],[69,"Q3KM"],[70,"Q3KM"],[71,"Q3KM"],
[72,"Q3KM"],[73,"Q3KM"],[74,"Q3KM"],[75,"Q3KM"],[76,"Q3KM"],[77,"Q3KL"],[78,"Q4KS"],[79,"Q4KM"]
]'
FLAGS="--token-embedding-type Q4K --output-tensor-type Q5K --layer-types-high"

Q3KH : Slightly larger Q3_K-based quant

```
LAYER_TYPES='[
[0 ,"Q4KM"],[1 ,"Q3KL"],[2 ,"Q3KM"],[3 ,"Q3KM"],[4 ,"Q3KS"],[5 ,"Q3KM"],[6 ,"Q3KS"],[7 ,"Q3KM"],
[8 ,"Q3KS"],[9 ,"Q3KM"],[10,"Q3KS"],[11,"Q3KM"],[12,"Q3KS"],[13,"Q3KM"],[14,"Q3KS"],[15,"Q3KM"],
[16,"Q3KM"],[17,"Q3KS"],[18,"Q3KM"],[19,"Q3KS"],[20,"Q3KM"],[21,"Q3KS"],[22,"Q3KM"],[23,"Q3KS"],
[24,"Q3KM"],[25,"Q3KS"],[26,"Q3KM"],[27,"Q3KS"],[28,"Q3KM"],[29,"Q3KS"],[30,"Q3KM"],[31,"Q3KS"],
[32,"Q3KM"],[33,"Q3KS"],[34,"Q3KM"],[35,"Q3KS"],[36,"Q3KM"],[37,"Q3KS"],[38,"Q3KM"],[39,"Q3KS"],
[40,"Q3KM"],[41,"Q3KS"],[42,"Q3KM"],[43,"Q3KS"],[44,"Q3KM"],[45,"Q3KS"],[46,"Q3KM"],[47,"Q3KS"],
[48,"Q3KM"],[49,"Q3KS"],[50,"Q3KM"],[51,"Q3KS"],[52,"Q3KM"],[53,"Q3KS"],[54,"Q3KM"],[55,"Q3KS"],
[56,"Q3KM"],[57,"Q3KS"],[58,"Q3KM"],[59,"Q3KS"],[60,"Q3KM"],[61,"Q3KS"],[62,"Q3KM"],[63,"Q3KS"],
[64,"Q3KM"],[65,"Q3KM"],[66,"Q3KM"],[67,"Q3KM"],[68,"Q3KM"],[69,"Q3KM"],[70,"Q3KM"],[71,"Q3KM"],
[72,"Q3KM"],[73,"Q3KM"],[74,"Q3KM"],[75,"Q3KM"],[76,"Q3KL"],[77,"Q3KL"],[78,"Q4KS"],[79,"Q4KM"]
]'
FLAGS="--token-embedding-type Q4K --output-tensor-type Q5K --layer-types-high"

Q4KH : Largest and best-performing quant

```
LAYER_TYPES='[
[0 ,"Q4KM"],[1 ,"Q4KM"],[2 ,"Q4KS"],[3 ,"Q4KS"],[4 ,"Q3KM"],[5 ,"Q3KL"],[6 ,"Q3KM"],[7 ,"Q3KL"],
[8 ,"Q3KM"],[9 ,"Q3KL"],[10,"Q3KM"],[11,"Q3KL"],[12,"Q3KM"],[13,"Q3KL"],[14,"Q3KM"],[15,"Q3KL"],
[16,"Q3KL"],[17,"Q3KM"],[18,"Q3KL"],[19,"Q3KM"],[20,"Q3KL"],[21,"Q3KM"],[22,"Q3KL"],[23,"Q3KM"],
[24,"Q3KL"],[25,"Q3KM"],[26,"Q3KL"],[27,"Q3KM"],[28,"Q3KL"],[29,"Q3KM"],[30,"Q3KL"],[31,"Q3KM"],
[32,"Q3KL"],[33,"Q3KM"],[34,"Q3KL"],[35,"Q3KM"],[36,"Q3KL"],[37,"Q3KM"],[38,"Q3KL"],[39,"Q3KM"],
[40,"Q3KL"],[41,"Q3KM"],[42,"Q3KL"],[43,"Q3KM"],[44,"Q3KL"],[45,"Q3KM"],[46,"Q3KL"],[47,"Q3KM"],
[48,"Q3KL"],[49,"Q3KM"],[50,"Q3KL"],[51,"Q3KM"],[52,"Q3KL"],[53,"Q3KM"],[54,"Q3KL"],[55,"Q3KM"],
[56,"Q3KL"],[57,"Q3KM"],[58,"Q3KL"],[59,"Q3KM"],[60,"Q3KL"],[61,"Q3KM"],[62,"Q3KL"],[63,"Q3KM"],
[64,"Q4KS"],[65,"Q3KL"],[66,"Q4KS"],[67,"Q3KL"],[68,"Q4KS"],[69,"Q3KL"],[70,"Q4KS"],[71,"Q3KL"],
[72,"Q4KS"],[73,"Q4KS"],[74,"Q4KM"],[75,"Q4KS"],[76,"Q4KM"],[77,"Q5KS"],[78,"Q5KM"],[79,"Q6_K" ]
]'
FLAGS="--token-embedding-type Q4K --output-tensor-type Q6K"

All three quants were optimized to maintain knowledge preservation and reasoning performance using a small set
of curated test/evaluation prompts. All three quants score 100% on the eval prompts, but the Q3 quants sometimes get a little
goofy, giving a wrong answer and then correcting themselves with the right one, or adding a non sequitur to the answer.
Q4KH is rock solid. Note that Q2K or Q2K_S could not be used with this model, since any Q2 use, even at deep layers,
immediately threw the model into either incoherence or large knowledge loss.

Comparison:

Quant | Size (bytes) | PPL | Comment
------|--------------|-----|--------
Q3SH  | 32.6e9 | 4.8 | Q3K dominant with Q4K embedding
Q3KH  | 33.4e9 | 4.8 | Q3K dominant with Q4K embedding
Q3KM  | 34.3e9 | 4.9 | Fails parts of eval prompt set
Q4KH  | 37.5e9 | 4.5 | Best available quant
IQ4XS | 38.3e9 | 4.4 | Q4K embedding, Q6_K output
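
PPL figures like these can be measured with llama.cpp's perplexity tool; the exact test corpus used for the table above is not stated, so the invocation below, using the common WikiText-2 test set, is only illustrative:

```
# Illustrative PPL measurement with the upstream llama.cpp tool; the
# corpus used for the table above is an assumption on my part.
./llama-perplexity -m Llama-3.3-70B-Instruct.Q4KH.gguf \
    -f wikitext-2-raw/wiki.test.raw
```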

Usage:

This model may be used together with fixie-ai ultravox-v05-llama-33-70b or ultravox-v06-llama-33-70b to process audio
(.mp3 and .wav files) and text inputs and generate text outputs. The mmproj files are made available here:
https://huggingface.co/steampunque/ultravox-v05-llama-33-70b-Hybrid-GGUF , https://huggingface.co/steampunque/ultravox-v06-llama-33-70b-Hybrid-GGUF
More information about running multimodal models can be found in the mtmd README in the tools directory of the llama.cpp source tree:
https://github.com/ggml-org/llama.cpp/blob/master/tools/mtmd/README.md
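
As a minimal sketch (exact flags can vary across llama.cpp versions; check the mtmd README above for your build), an audio query might look like:

```
# Sketch: audio + text input through the mtmd CLI. The --audio flag,
# input file, and prompt are illustrative; see the mtmd README.
./llama-mtmd-cli -m Llama-3.3-70B-Instruct.Q4KH.gguf \
    --mmproj ultravox-v05-llama-3_3-70b.mmproj.gguf \
    --audio input.wav \
    -p "Summarize this audio clip."
```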

Benchmarks:

A partial set of benchmarks for the model will eventually be given here: https://huggingface.co/spaces/steampunque/benchlm

Download the files from the table below:

Link | Type | Size | Notes
-----|------|------|------
Llama-3.3-70B-Instruct.Q3SH.gguf | Q3SH | 32.6e9 B | 1.7B smaller than Q3K_M
Llama-3.3-70B-Instruct.Q3KH.gguf | Q3KH | 33.4e9 B | 0.9B smaller than Q3K_M
Llama-3.3-70B-Instruct.Q4KH.gguf | Q4KH | 37.5e9 B | 0.8B smaller than IQ4XS
ultravox-v05-llama-3_3-70b.mmproj.gguf | mmproj | 1.38e9 B | multimedia projector
ultravox-v06-llama-3_3-70b.mmproj.gguf | mmproj | 1.38e9 B | multimedia projector
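
For example, the mmproj files can be fetched from their repos with the Hugging Face CLI (the model GGUFs can be fetched the same way from this repo):

```
# Example fetch of the v05 multimedia projector with the Hugging Face CLI.
huggingface-cli download steampunque/ultravox-v05-llama-33-70b-Hybrid-GGUF \
    ultravox-v05-llama-3_3-70b.mmproj.gguf
```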
A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository:

https://github.com/ggml-org/llama.cpp/discussions/13040
