---
license: llama3.3
base_model: meta-llama/Llama-3.3-70B-Instruct
base_model_relation: quantized
tags:
- Llama 3.3 70B
- GGUF
- quantized
- 4-bit
- 3-bit
---

Model Description
Llama.cpp hybrid layer quantization of Llama 3.3 70B Instruct by meta-llama
Original model: https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct
The hybrid quant employs different quantization levels on a per-layer basis to enable
both high performance and small file size at the same time. The quants employed are all
K-quants, which avoids the slow CPU and older-GPU processing paths associated with IQ quants.
Three quants are available for the model, as follows:
Q3SH : Smallest available Q3_K-based quant
```bash
LAYER_TYPES='[
[0 ,"Q4KM"],[1 ,"Q3KL"],[2 ,"Q3KM"],[3 ,"Q3KS"],[4 ,"Q3KS"],[5 ,"Q3KS"],[6 ,"Q3KS"],[7 ,"Q3KS"],
[8 ,"Q3KS"],[9 ,"Q3KS"],[10,"Q3KS"],[11,"Q3KS"],[12,"Q3KS"],[13,"Q3KS"],[14,"Q3KS"],[15,"Q3KS"],
[16,"Q3KS"],[17,"Q3KS"],[18,"Q3KS"],[19,"Q3KS"],[20,"Q3KS"],[21,"Q3KS"],[22,"Q3KS"],[23,"Q3KS"],
[24,"Q3KS"],[25,"Q3KS"],[26,"Q3KS"],[27,"Q3KS"],[28,"Q3KS"],[29,"Q3KS"],[30,"Q3KS"],[31,"Q3KS"],
[32,"Q3KS"],[33,"Q3KS"],[34,"Q3KS"],[35,"Q3KS"],[36,"Q3KS"],[37,"Q3KS"],[38,"Q3KS"],[39,"Q3KS"],
[40,"Q3KM"],[41,"Q3KS"],[42,"Q3KM"],[43,"Q3KS"],[44,"Q3KM"],[45,"Q3KS"],[46,"Q3KM"],[47,"Q3KS"],
[48,"Q3KM"],[49,"Q3KS"],[50,"Q3KM"],[51,"Q3KS"],[52,"Q3KM"],[53,"Q3KS"],[54,"Q3KM"],[55,"Q3KS"],
[56,"Q3KM"],[57,"Q3KS"],[58,"Q3KM"],[59,"Q3KS"],[60,"Q3KM"],[61,"Q3KS"],[62,"Q3KM"],[63,"Q3KS"],
[64,"Q3KM"],[65,"Q3KM"],[66,"Q3KM"],[67,"Q3KM"],[68,"Q3KM"],[69,"Q3KM"],[70,"Q3KM"],[71,"Q3KM"],
[72,"Q3KM"],[73,"Q3KM"],[74,"Q3KM"],[75,"Q3KM"],[76,"Q3KM"],[77,"Q3KL"],[78,"Q4KS"],[79,"Q4KM"]
]'
FLAGS="--token-embedding-type Q4K --output-tensor-type Q5K --layer-types-high"
```
Q3KH : Slightly larger Q3_K-based quant
```bash
LAYER_TYPES='[
[0 ,"Q4KM"],[1 ,"Q3KL"],[2 ,"Q3KM"],[3 ,"Q3KM"],[4 ,"Q3KS"],[5 ,"Q3KM"],[6 ,"Q3KS"],[7 ,"Q3KM"],
[8 ,"Q3KS"],[9 ,"Q3KM"],[10,"Q3KS"],[11,"Q3KM"],[12,"Q3KS"],[13,"Q3KM"],[14,"Q3KS"],[15,"Q3KM"],
[16,"Q3KM"],[17,"Q3KS"],[18,"Q3KM"],[19,"Q3KS"],[20,"Q3KM"],[21,"Q3KS"],[22,"Q3KM"],[23,"Q3KS"],
[24,"Q3KM"],[25,"Q3KS"],[26,"Q3KM"],[27,"Q3KS"],[28,"Q3KM"],[29,"Q3KS"],[30,"Q3KM"],[31,"Q3KS"],
[32,"Q3KM"],[33,"Q3KS"],[34,"Q3KM"],[35,"Q3KS"],[36,"Q3KM"],[37,"Q3KS"],[38,"Q3KM"],[39,"Q3KS"],
[40,"Q3KM"],[41,"Q3KS"],[42,"Q3KM"],[43,"Q3KS"],[44,"Q3KM"],[45,"Q3KS"],[46,"Q3KM"],[47,"Q3KS"],
[48,"Q3KM"],[49,"Q3KS"],[50,"Q3KM"],[51,"Q3KS"],[52,"Q3KM"],[53,"Q3KS"],[54,"Q3KM"],[55,"Q3KS"],
[56,"Q3KM"],[57,"Q3KS"],[58,"Q3KM"],[59,"Q3KS"],[60,"Q3KM"],[61,"Q3KS"],[62,"Q3KM"],[63,"Q3KS"],
[64,"Q3KM"],[65,"Q3KM"],[66,"Q3KM"],[67,"Q3KM"],[68,"Q3KM"],[69,"Q3KM"],[70,"Q3KM"],[71,"Q3KM"],
[72,"Q3KM"],[73,"Q3KM"],[74,"Q3KM"],[75,"Q3KM"],[76,"Q3KL"],[77,"Q3KL"],[78,"Q4KS"],[79,"Q4KM"]
]'
FLAGS="--token-embedding-type Q4K --output-tensor-type Q5K --layer-types-high"
```
Q4KH : Largest and best-performing quant
```bash
LAYER_TYPES='[
[0 ,"Q4KM"],[1 ,"Q4KM"],[2 ,"Q4KS"],[3 ,"Q4KS"],[4 ,"Q3KM"],[5 ,"Q3KL"],[6 ,"Q3KM"],[7 ,"Q3KL"],
[8 ,"Q3KM"],[9 ,"Q3KL"],[10,"Q3KM"],[11,"Q3KL"],[12,"Q3KM"],[13,"Q3KL"],[14,"Q3KM"],[15,"Q3KL"],
[16,"Q3KL"],[17,"Q3KM"],[18,"Q3KL"],[19,"Q3KM"],[20,"Q3KL"],[21,"Q3KM"],[22,"Q3KL"],[23,"Q3KM"],
[24,"Q3KL"],[25,"Q3KM"],[26,"Q3KL"],[27,"Q3KM"],[28,"Q3KL"],[29,"Q3KM"],[30,"Q3KL"],[31,"Q3KM"],
[32,"Q3KL"],[33,"Q3KM"],[34,"Q3KL"],[35,"Q3KM"],[36,"Q3KL"],[37,"Q3KM"],[38,"Q3KL"],[39,"Q3KM"],
[40,"Q3KL"],[41,"Q3KM"],[42,"Q3KL"],[43,"Q3KM"],[44,"Q3KL"],[45,"Q3KM"],[46,"Q3KL"],[47,"Q3KM"],
[48,"Q3KL"],[49,"Q3KM"],[50,"Q3KL"],[51,"Q3KM"],[52,"Q3KL"],[53,"Q3KM"],[54,"Q3KL"],[55,"Q3KM"],
[56,"Q3KL"],[57,"Q3KM"],[58,"Q3KL"],[59,"Q3KM"],[60,"Q3KL"],[61,"Q3KM"],[62,"Q3KL"],[63,"Q3KM"],
[64,"Q4KS"],[65,"Q3KL"],[66,"Q4KS"],[67,"Q3KL"],[68,"Q4KS"],[69,"Q3KL"],[70,"Q4KS"],[71,"Q3KL"],
[72,"Q4KS"],[73,"Q4KS"],[74,"Q4KM"],[75,"Q4KS"],[76,"Q4KM"],[77,"Q5KS"],[78,"Q5KM"],[79,"Q6_K" ]
]'
FLAGS="--token-embedding-type Q4K --output-tensor-type Q6K"
```
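For reference, below is a minimal sketch of how one of these recipes could be applied. It assumes a llama-quantize build patched to read the LAYER_TYPES environment variable and to accept the --layer-types-high flag; neither is part of stock llama.cpp, and the input/output filenames are illustrative:

```bash
# Sketch only: assumes a llama-quantize build patched to honor the
# LAYER_TYPES environment variable and the --layer-types-high flag
# (not present in stock llama.cpp). Filenames are illustrative.
# LAYER_TYPES and FLAGS are set as in the Q4KH recipe above.
export LAYER_TYPES
./llama-quantize $FLAGS \
    Llama-3.3-70B-Instruct.BF16.gguf \
    Llama-3.3-70B-Instruct.Q4KH.gguf \
    Q4_K_M
```

The trailing base type (Q4_K_M here) acts as the default; the per-layer table then overrides it layer by layer.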
All three quants were optimized to maintain knowledge preservation and reasoning performance using a small set
of curated test/evaluation prompts. All three quants score 100% on the eval prompts, but the Q3 quants sometimes get a little
goofy, giving a wrong answer and then correcting it with the right one, or adding a non sequitur to the answer, etc.
Q4KH is rock solid. Note that use of Q2K or Q2K_S was not possible with this model, since any Q2 use, even at deep layers,
immediately threw the model into either incoherence or large knowledge loss.
Comparison:

Quant | Size | PPL | Comment
---------|-----------|------|-----------
Q3SH | 32.6e9 B | 4.8 | Q3K dominant with Q4K embedding
Q3KH | 33.4e9 B | 4.8 | Q3K dominant with Q4K embedding
Q3KM | 34.3e9 B | 4.9 | Fails parts of the eval prompt set
Q4KH | 37.5e9 B | 4.5 | Best available quant
IQ4XS | 38.3e9 B | 4.4 | Q4K embedding, Q6_K output
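Perplexity numbers like these are typically produced with llama.cpp's llama-perplexity tool. A minimal sketch follows; the test corpus (wikitext-2) and settings are assumptions, not necessarily the exact configuration behind the table above:

```bash
# Minimal sketch of a perplexity run using llama.cpp's llama-perplexity tool.
# The corpus and context/offload settings are assumptions, not the exact
# setup used to generate the comparison table.
./llama-perplexity -m Llama-3.3-70B-Instruct.Q4KH.gguf \
    -f wiki.test.raw -c 512 -ngl 99
```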
Usage:

This model may be used together with fixie-ai ultravox-v05-llama-33-70b or ultravox-v06-llama-33-70b to enable it to process audio
(.mp3 and .wav files) and text inputs and generate text outputs. The mmproj files are made available here:
https://huggingface.co/steampunque/ultravox-v05-llama-33-70b-Hybrid-GGUF , https://huggingface.co/steampunque/ultravox-v06-llama-33-70b-Hybrid-GGUF
More information about running multimedia may be found in the mtmd README in the tools directory of the llama.cpp source tree:
https://github.com/ggml-org/llama.cpp/blob/master/tools/mtmd/README.md
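As a concrete starting point, here is a hedged example of an audio+text run with llama-mtmd-cli. It assumes a recent llama.cpp build with audio support in the mtmd tool; the exact flags depend on your llama.cpp version, and the audio file name and prompt are illustrative:

```bash
# Hedged example: assumes a recent llama.cpp build whose llama-mtmd-cli
# supports audio input (see the mtmd README linked above).
# The audio file and prompt are illustrative placeholders.
./llama-mtmd-cli \
    -m Llama-3.3-70B-Instruct.Q4KH.gguf \
    --mmproj ultravox-v05-llama-3_3-70b.mmproj.gguf \
    --audio question.wav \
    -p "Answer the question asked in the audio clip."
```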
Benchmarks:
A partial set of benchmarks for the model will eventually be given here: https://huggingface.co/spaces/steampunque/benchlm
Download the files from below:
Link | Type | Size | Notes |
---|---|---|---|
Llama-3.3-70B-Instruct.Q3SH.gguf | Q3SH | 32.6e9 B | 1.7B smaller than Q3K_M |
Llama-3.3-70B-Instruct.Q3KH.gguf | Q3KH | 33.4e9 B | 0.9B smaller than Q3K_M |
Llama-3.3-70B-Instruct.Q4KH.gguf | Q4KH | 37.5e9 B | 0.8B smaller than IQ4XS |
ultravox-v05-llama-3_3-70b.mmproj.gguf | mmproj | 1.38e9 B | multimedia projector |
ultravox-v06-llama-3_3-70b.mmproj.gguf | mmproj | 1.38e9 B | multimedia projector |
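One possible way to fetch a quant is with the huggingface_hub CLI. The repo id below is a hypothetical placeholder patterned after the mmproj repo names above; substitute the actual repo this card lives in:

```bash
# Sketch using the huggingface_hub CLI (pip install -U huggingface_hub).
# The repo id is a hypothetical placeholder; use this card's actual repo.
huggingface-cli download steampunque/Llama-3.3-70B-Instruct-Hybrid-GGUF \
    Llama-3.3-70B-Instruct.Q4KH.gguf --local-dir .
```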
A discussion thread on the hybrid layer quant approach may be found here: https://github.com/ggml-org/llama.cpp/discussions/13040