---
license: gemma
language:
- en
- zh
tags:
- Taiwan
- SLM
- GGUF
- agent
datasets:
- lianghsun/tw-reasoning-instruct
- lianghsun/tw-contract-review-chat
- minyichen/tw-instruct-R1-200k
- minyichen/twmmR1
- minyichen/LongPapermultitaskzhtwR1
- nvidia/Nemotron-Instruction-Following-Chat-v1
metrics:
- accuracy
model_name: gemma-3-4B-T1-it
---
# Gemma 3 4B T1-it GGUF Collection
GGUF quantized models converted from twinkle-ai/gemma-3-4B-T1-it for use with llama.cpp.
## About
Gemma 3 4B T1-it is a small language model fine-tuned on Taiwan-focused datasets, supporting both English and Traditional Chinese. This repository provides multiple quantization formats optimized for different use cases.
## Available Models

| Model | Size | Use Case |
|---|---|---|
| twinkle-ai-gemma-3-4B-T1-it-BF16.gguf | Largest | Best quality, highest precision |
| twinkle-ai-gemma-3-4B-T1-it-F16.gguf | Large | High quality, good precision |
| twinkle-ai-gemma-3-4B-T1-it-Q8_0.gguf | Medium | Balanced quality and speed |
| twinkle-ai-gemma-3-4b-t1-it-q4_k_m.gguf | Smallest | Fastest inference, lowest memory |
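If you prefer to fetch a file up front rather than letting llama.cpp download it on first run, the Hugging Face CLI can do so; a minimal sketch using the repo and file names from the Quick Start below:

```shell
# Pre-download a single GGUF into ./models so first inference
# doesn't block on the network.
pip install -U "huggingface_hub[cli]"
huggingface-cli download thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  gemma-3-4b-t1-it-q8_0.gguf \
  --local-dir ./models
```

You can then point llama-cli at the local file with `-m ./models/gemma-3-4b-t1-it-q8_0.gguf` instead of the `--hf-repo`/`--hf-file` flags.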
## Quick Start

### Option 1: Using Hugging Face Hub (Recommended)

Install llama.cpp via Homebrew:

```shell
brew install llama.cpp
```

Run inference directly from Hugging Face:

```shell
llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -p "Your prompt here"
```

Start as a server:

```shell
llama-server --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -c 2048
```
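Once the server is up, you can query it over HTTP; a minimal sketch assuming the default port 8080 and llama-server's OpenAI-compatible chat endpoint:

```shell
# Send a chat request to the local llama-server instance.
# llama-server serves the single loaded model, so no model name is needed.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Briefly introduce Taiwan in Traditional Chinese."}
        ],
        "temperature": 0.7
      }'
```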
### Option 2: Build from Source
#### Step 1: Clone the llama.cpp repository

```shell
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```
#### Step 2: Build llama.cpp

Basic build (CPU only):

```shell
LLAMA_CURL=1 make
```
Hardware-specific build options:

- NVIDIA GPU (Linux):

  ```shell
  LLAMA_CUDA=1 LLAMA_CURL=1 make
  ```

- Apple Silicon (Mac):

  ```shell
  LLAMA_METAL=1 LLAMA_CURL=1 make
  ```

- AMD GPU (ROCm):

  ```shell
  LLAMA_HIPBLAS=1 LLAMA_CURL=1 make
  ```
#### Step 3: Run inference

```shell
./llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -p "Your prompt here"
```
#### Step 4: Start the server (optional)

```shell
./llama-server --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -c 2048
```
## Advanced Usage

### Choosing the Right Model

Select a model based on your needs:

- Best quality: use the BF16 or F16 versions (requires more memory)
- Balanced: use the Q8_0 version (recommended for most users)
- Resource constrained: use the Q4_K_M version (suitable for devices with limited memory)
### Common Parameters

- `-p "prompt"`: Input text for the model to respond to
- `-c 2048`: Context length (maximum number of tokens that can be processed)
- `--hf-repo`: Hugging Face repository name
- `--hf-file`: Model file name to use
### Adjusting Generation Parameters

```shell
llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -p "Your prompt here" \
  --temp 0.7 \
  --top-p 0.9 \
  --repeat-penalty 1.1
```
Parameter explanations:

- `--temp`: Temperature (0.0-2.0); higher values produce more random output
- `--top-p`: Nucleus sampling parameter (0.0-1.0)
- `--repeat-penalty`: Repetition penalty to avoid repetitive content
## Model Information
- Base Model: twinkle-ai/gemma-3-4B-T1-it
- Languages: English, Traditional Chinese
- License: Gemma
- Format: GGUF (converted via GGUF-my-repo)
## Training Data
- Taiwan reasoning and instruction datasets
- Contract review and legal documents
- Multimodal and long-form content
- Instruction-following examples
## Benchmarks
- TMMLU+: 47.44% accuracy
- MMLU: 59.13% accuracy
- TW Legal Benchmark: 44.18% accuracy
## Troubleshooting

### Common Issues
**Q: Getting out-of-memory errors?**

A: Try a smaller quantized version such as Q4_K_M, or reduce the context length with `-c`.
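For example, the Quick Start command with a halved context window (the value is illustrative; tune it to your hardware):

```shell
# Reduced-memory run: a smaller context window (-c) shrinks the KV cache.
llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -c 1024 \
  -p "Your prompt here"
```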
**Q: How can I speed up inference?**

A:

- Use GPU acceleration (add the hardware-specific flags during compilation)
- Choose a smaller quantized model (such as Q4_K_M)
- Reduce the context length
**Q: What prompt format does the model support?**

A: This is an instruction-tuned model. Use a clear instruction format, for example:

> Please analyze the main clauses of the following contract: [contract content]
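For multi-turn use, recent llama.cpp builds can apply the Gemma chat template embedded in the GGUF automatically; a sketch assuming the `-cnv` (conversation mode) flag of current llama-cli builds:

```shell
# Interactive chat: -cnv applies the GGUF's embedded chat template,
# wrapping your input in Gemma's turn markers automatically.
llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -cnv
```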
## Contributing
If you have any questions or suggestions, please feel free to open a discussion in the Hugging Face repository.
Note: On first run, llama.cpp will automatically download the model file from Hugging Face. Please ensure you have a stable internet connection.
## GGUF File List

| Filename | Quantization | Size |
|---|---|---|
| gemma-3-4b-t1-it-q8_0.gguf | Q8_0 | 4.51 GB |
| twinkle-ai-gemma-3-4B-T1-it-BF16.gguf | BF16 | 8.48 GB |
| twinkle-ai-gemma-3-4B-T1-it-F16.gguf | F16 | 8.48 GB |
| twinkle-ai-gemma-3-4B-T1-it-Q8_0.gguf | Q8_0 | 4.51 GB |
| twinkle-ai-gemma-3-4b-t1-it-q4_k_m.gguf (recommended) | Q4_K_M | 2.67 GB |