---
license: llama3.3
base_model: meta-llama/Llama-3.3-70B-Instruct
base_model_relation: quantized
tags:
- Llama 3.3 70B
- GGUF
- quantized
- 4-bit
- 3-bit
---

Model Description
Llama.cpp hybrid layer quantization of Llama 3.3 70B Instruct by meta-llama
Original model: https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct
The hybrid quant employs different quantization levels on a per-layer basis to enable
both high performance and small file size at the same time. The quants employed are all
K-quants, which avoids the slow CPU and older-GPU processing paths associated with IQ quants.
Three quants are available for the model, as follows:
Q3SH : Smallest available Q3_K-based quant
```bash
LAYER_TYPES='[
[0 ,"Q4KM"],[1 ,"Q3KL"],[2 ,"Q3KM"],[3 ,"Q3KS"],[4 ,"Q3KS"],[5 ,"Q3KS"],[6 ,"Q3KS"],[7 ,"Q3KS"],
[8 ,"Q3KS"],[9 ,"Q3KS"],[10,"Q3KS"],[11,"Q3KS"],[12,"Q3KS"],[13,"Q3KS"],[14,"Q3KS"],[15,"Q3KS"],
[16,"Q3KS"],[17,"Q3KS"],[18,"Q3KS"],[19,"Q3KS"],[20,"Q3KS"],[21,"Q3KS"],[22,"Q3KS"],[23,"Q3KS"],
[24,"Q3KS"],[25,"Q3KS"],[26,"Q3KS"],[27,"Q3KS"],[28,"Q3KS"],[29,"Q3KS"],[30,"Q3KS"],[31,"Q3KS"],
[32,"Q3KS"],[33,"Q3KS"],[34,"Q3KS"],[35,"Q3KS"],[36,"Q3KS"],[37,"Q3KS"],[38,"Q3KS"],[39,"Q3KS"],
[40,"Q3KM"],[41,"Q3KS"],[42,"Q3KM"],[43,"Q3KS"],[44,"Q3KM"],[45,"Q3KS"],[46,"Q3KM"],[47,"Q3KS"],
[48,"Q3KM"],[49,"Q3KS"],[50,"Q3KM"],[51,"Q3KS"],[52,"Q3KM"],[53,"Q3KS"],[54,"Q3KM"],[55,"Q3KS"],
[56,"Q3KM"],[57,"Q3KS"],[58,"Q3KM"],[59,"Q3KS"],[60,"Q3KM"],[61,"Q3KS"],[62,"Q3KM"],[63,"Q3KS"],
[64,"Q3KM"],[65,"Q3KM"],[66,"Q3KM"],[67,"Q3KM"],[68,"Q3KM"],[69,"Q3KM"],[70,"Q3KM"],[71,"Q3KM"],
[72,"Q3KM"],[73,"Q3KM"],[74,"Q3KM"],[75,"Q3KM"],[76,"Q3KM"],[77,"Q3KL"],[78,"Q4KS"],[79,"Q4KM"]
]'
FLAGS="--token-embedding-type Q4K --output-tensor-type Q5K --layer-types-high"
```
Q3KH : Slightly larger Q3_K-based quant
```bash
LAYER_TYPES='[
[0 ,"Q4KM"],[1 ,"Q3KL"],[2 ,"Q3KM"],[3 ,"Q3KM"],[4 ,"Q3KS"],[5 ,"Q3KM"],[6 ,"Q3KS"],[7 ,"Q3KM"],
[8 ,"Q3KS"],[9 ,"Q3KM"],[10,"Q3KS"],[11,"Q3KM"],[12,"Q3KS"],[13,"Q3KM"],[14,"Q3KS"],[15,"Q3KM"],
[16,"Q3KM"],[17,"Q3KS"],[18,"Q3KM"],[19,"Q3KS"],[20,"Q3KM"],[21,"Q3KS"],[22,"Q3KM"],[23,"Q3KS"],
[24,"Q3KM"],[25,"Q3KS"],[26,"Q3KM"],[27,"Q3KS"],[28,"Q3KM"],[29,"Q3KS"],[30,"Q3KM"],[31,"Q3KS"],
[32,"Q3KM"],[33,"Q3KS"],[34,"Q3KM"],[35,"Q3KS"],[36,"Q3KM"],[37,"Q3KS"],[38,"Q3KM"],[39,"Q3KS"],
[40,"Q3KM"],[41,"Q3KS"],[42,"Q3KM"],[43,"Q3KS"],[44,"Q3KM"],[45,"Q3KS"],[46,"Q3KM"],[47,"Q3KS"],
[48,"Q3KM"],[49,"Q3KS"],[50,"Q3KM"],[51,"Q3KS"],[52,"Q3KM"],[53,"Q3KS"],[54,"Q3KM"],[55,"Q3KS"],
[56,"Q3KM"],[57,"Q3KS"],[58,"Q3KM"],[59,"Q3KS"],[60,"Q3KM"],[61,"Q3KS"],[62,"Q3KM"],[63,"Q3KS"],
[64,"Q3KM"],[65,"Q3KM"],[66,"Q3KM"],[67,"Q3KM"],[68,"Q3KM"],[69,"Q3KM"],[70,"Q3KM"],[71,"Q3KM"],
[72,"Q3KM"],[73,"Q3KM"],[74,"Q3KM"],[75,"Q3KM"],[76,"Q3KL"],[77,"Q3KL"],[78,"Q4KS"],[79,"Q4KM"]
]'
FLAGS="--token-embedding-type Q4K --output-tensor-type Q5K --layer-types-high"
```
Q4KH : Largest and best-performing quant
```bash
LAYER_TYPES='[
[0 ,"Q4KM"],[1 ,"Q4KM"],[2 ,"Q4KS"],[3 ,"Q4KS"],[4 ,"Q3KM"],[5 ,"Q3KL"],[6 ,"Q3KM"],[7 ,"Q3KL"],
[8 ,"Q3KM"],[9 ,"Q3KL"],[10,"Q3KM"],[11,"Q3KL"],[12,"Q3KM"],[13,"Q3KL"],[14,"Q3KM"],[15,"Q3KL"],
[16,"Q3KL"],[17,"Q3KM"],[18,"Q3KL"],[19,"Q3KM"],[20,"Q3KL"],[21,"Q3KM"],[22,"Q3KL"],[23,"Q3KM"],
[24,"Q3KL"],[25,"Q3KM"],[26,"Q3KL"],[27,"Q3KM"],[28,"Q3KL"],[29,"Q3KM"],[30,"Q3KL"],[31,"Q3KM"],
[32,"Q3KL"],[33,"Q3KM"],[34,"Q3KL"],[35,"Q3KM"],[36,"Q3KL"],[37,"Q3KM"],[38,"Q3KL"],[39,"Q3KM"],
[40,"Q3KL"],[41,"Q3KM"],[42,"Q3KL"],[43,"Q3KM"],[44,"Q3KL"],[45,"Q3KM"],[46,"Q3KL"],[47,"Q3KM"],
[48,"Q3KL"],[49,"Q3KM"],[50,"Q3KL"],[51,"Q3KM"],[52,"Q3KL"],[53,"Q3KM"],[54,"Q3KL"],[55,"Q3KM"],
[56,"Q3KL"],[57,"Q3KM"],[58,"Q3KL"],[59,"Q3KM"],[60,"Q3KL"],[61,"Q3KM"],[62,"Q3KL"],[63,"Q3KM"],
[64,"Q4KS"],[65,"Q3KL"],[66,"Q4KS"],[67,"Q3KL"],[68,"Q4KS"],[69,"Q3KL"],[70,"Q4KS"],[71,"Q3KL"],
[72,"Q4KS"],[73,"Q4KS"],[74,"Q4KM"],[75,"Q4KS"],[76,"Q4KM"],[77,"Q5KS"],[78,"Q5KM"],[79,"Q6_K" ]
]'
FLAGS="--token-embedding-type Q4K --output-tensor-type Q6K"
```
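For reference, below is a minimal sketch of how one of these recipes could be applied. It assumes a llama-quantize build patched to read the LAYER_TYPES environment variable and to accept the --layer-types-high flag; neither is part of stock llama.cpp, and the input/output filenames are illustrative:

```bash
# Sketch only: assumes a llama-quantize build patched to honor the
# LAYER_TYPES environment variable and the --layer-types-high flag
# (not present in stock llama.cpp). Filenames are illustrative.
# LAYER_TYPES and FLAGS are set as in the Q4KH recipe above.
export LAYER_TYPES
./llama-quantize $FLAGS \
    Llama-3.3-70B-Instruct.BF16.gguf \
    Llama-3.3-70B-Instruct.Q4KH.gguf \
    Q4_K_M
```

The trailing base type (Q4_K_M here) acts as the default; the per-layer table then overrides it layer by layer.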
All three quants were optimized to maintain knowledge preservation and reasoning performance using a small set
of curated test/evaluation prompts. All three quants score 100% on the eval prompts, but the Q3 quants sometimes get a little
goofy, giving a wrong answer and then correcting it with the right one, or adding a non sequitur to the answer, etc.
Q4KH is rock solid. Note that use of Q2K or Q2K_S was not possible with this model, since any Q2 use, even at deep layers,
immediately threw the model into either incoherence or large knowledge loss.
Comparison:

Quant | Size | PPL | Comment
---------|-----------|------|-----------
Q3SH | 32.6e9 B | 4.8 | Q3K dominant with Q4K embedding
Q3KH | 33.4e9 B | 4.8 | Q3K dominant with Q4K embedding
Q3KM | 34.3e9 B | 4.9 | Fails parts of the eval prompt set
Q4KH | 37.5e9 B | 4.5 | Best available quant
IQ4XS | 38.3e9 B | 4.4 | Q4K embedding, Q6_K output
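Perplexity numbers like these are typically produced with llama.cpp's llama-perplexity tool. A minimal sketch follows; the test corpus (wikitext-2) and settings are assumptions, not necessarily the exact configuration behind the table above:

```bash
# Minimal sketch of a perplexity run using llama.cpp's llama-perplexity tool.
# The corpus and context/offload settings are assumptions, not the exact
# setup used to generate the comparison table.
./llama-perplexity -m Llama-3.3-70B-Instruct.Q4KH.gguf \
    -f wiki.test.raw -c 512 -ngl 99
```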
Usage:

This model may be used together with fixie-ai ultravox-v05-llama-33-70b or ultravox-v06-llama-33-70b to enable it to process audio
(.mp3 and .wav files) and text inputs and generate text outputs. The mmproj files are made available here:
https://huggingface.co/steampunque/ultravox-v05-llama-33-70b-Hybrid-GGUF , https://huggingface.co/steampunque/ultravox-v06-llama-33-70b-Hybrid-GGUF
More information about running multimedia may be found in the mtmd README in the tools directory of the llama.cpp source tree:
https://github.com/ggml-org/llama.cpp/blob/master/tools/mtmd/README.md
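As a concrete starting point, here is a hedged example of an audio+text run with llama-mtmd-cli. It assumes a recent llama.cpp build with audio support in the mtmd tool; the exact flags depend on your llama.cpp version, and the audio file name and prompt are illustrative:

```bash
# Hedged example: assumes a recent llama.cpp build whose llama-mtmd-cli
# supports audio input (see the mtmd README linked above).
# The audio file and prompt are illustrative placeholders.
./llama-mtmd-cli \
    -m Llama-3.3-70B-Instruct.Q4KH.gguf \
    --mmproj ultravox-v05-llama-3_3-70b.mmproj.gguf \
    --audio question.wav \
    -p "Answer the question asked in the audio clip."
```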
Benchmarks:
A partial set of benchmarks for the model will eventually be given here: https://huggingface.co/spaces/steampunque/benchlm
Download the files from below:
Link | Type | Size | Notes |
---|---|---|---|
Llama-3.3-70B-Instruct.Q3SH.gguf | Q3SH | 32.6e9 B | 1.7B smaller than Q3K_M |
Llama-3.3-70B-Instruct.Q3KH.gguf | Q3KH | 33.4e9 B | 0.9B smaller than Q3K_M |
Llama-3.3-70B-Instruct.Q4KH.gguf | Q4KH | 37.5e9 B | 0.8B smaller than IQ4XS |
ultravox-v05-llama-3_3-70b.mmproj.gguf | mmproj | 1.38e9 B | multimedia projector |
ultravox-v06-llama-3_3-70b.mmproj.gguf | mmproj | 1.38e9 B | multimedia projector |
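One possible way to fetch a quant is with the huggingface_hub CLI. The repo id below is a hypothetical placeholder patterned after the mmproj repo names above; substitute the actual repo this card lives in:

```bash
# Sketch using the huggingface_hub CLI (pip install -U huggingface_hub).
# The repo id is a hypothetical placeholder; use this card's actual repo.
huggingface-cli download steampunque/Llama-3.3-70B-Instruct-Hybrid-GGUF \
    Llama-3.3-70B-Instruct.Q4KH.gguf --local-dir .
```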
A discussion thread on the hybrid layer quant approach may be found here: https://github.com/ggml-org/llama.cpp/discussions/13040