πŸ“‹ Model Description


license: gemma
language:
  - en
  - zh
base_model: twinkle-ai/gemma-3-4B-T1-it
library_name: transformers
tags:
  - Taiwan
  - SLM
  - GGUF
  - agent
datasets:
  - lianghsun/tw-reasoning-instruct
  - lianghsun/tw-contract-review-chat
  - minyichen/tw-instruct-R1-200k
  - minyichen/twmmR1
  - minyichen/LongPapermultitaskzhtwR1
  - nvidia/Nemotron-Instruction-Following-Chat-v1
metrics:
  - accuracy
model-index:
  - name: gemma-3-4B-T1-it
    results:
      - task:
          type: question-answering
          name: Single Choice Question
        dataset:
          name: tmmlu+
          type: ikala/tmmluplus
          config: all
          split: test
          revision: c0e8ae955997300d5dbf0e382bf0ba5115f85e8c
        metrics:
          - type: accuracy
            value: 47.44
            name: single choice
      - task:
          type: question-answering
          name: Single Choice Question
        dataset:
          name: mmlu
          type: cais/mmlu
          config: all
          split: test
          revision: c30699e
        metrics:
          - type: accuracy
            value: 59.13
            name: single choice
      - task:
          type: question-answering
          name: Single Choice Question
        dataset:
          name: tw-legal-benchmark-v1
          type: lianghsun/tw-legal-benchmark-v1
          config: all
          split: test
          revision: 66c3a5f
        metrics:
          - type: accuracy
            value: 44.18
            name: single choice
pipeline_tag: text-generation

Gemma 3 4B T1-it GGUF Collection

GGUF quantized models converted from twinkle-ai/gemma-3-4B-T1-it for use with llama.cpp.


About

Gemma 3 4B T1-it is a small language model fine-tuned on Taiwan-focused datasets, supporting both English and Traditional Chinese. This repository provides multiple quantization formats optimized for different use cases.

Available Models

| Model | Size | Use Case |
| --- | --- | --- |
| twinkle-ai-gemma-3-4B-T1-it-BF16.gguf | Largest | Best quality, highest precision |
| twinkle-ai-gemma-3-4B-T1-it-F16.gguf | Large | High quality, good precision |
| twinkle-ai-gemma-3-4B-T1-it-Q8_0.gguf | Medium | Balanced quality and speed |
| twinkle-ai-gemma-3-4b-t1-it-q4_k_m.gguf | Smallest | Fastest inference, lower memory |

Quick Start

Option 1: Using Hugging Face Hub (Recommended)

Install llama.cpp via Homebrew:

brew install llama.cpp

Run inference directly from Hugging Face:

llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -p "Your prompt here"

Start as a server:

llama-server --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -c 2048

Option 2: Build from Source

#### Step 1: Clone llama.cpp repository

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

#### Step 2: Build llama.cpp

Basic build (CPU only):

LLAMA_CURL=1 make

Hardware-specific build options:

  • NVIDIA GPU (Linux):
LLAMA_CUDA=1 LLAMA_CURL=1 make
  • Apple Silicon (Mac):
LLAMA_METAL=1 LLAMA_CURL=1 make
  • AMD GPU (ROCm):
LLAMA_HIPBLAS=1 LLAMA_CURL=1 make

#### Step 3: Run inference

./llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -p "Your prompt here"

#### Step 4: Start server (optional)

./llama-server --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -c 2048

Advanced Usage

Choosing the Right Model

Select a model based on your needs:

  • Best Quality: Use the BF16 or F16 version (requires the most memory)
  • Balanced: Use the Q8_0 version (recommended for most users)
  • Resource Constrained: Use the Q4_K_M version (suitable for devices with limited memory)
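As a rough guide to what these choices cost, file size (and therefore the memory needed to load the weights) scales with bits per weight. The sketch below is a back-of-the-envelope estimate, not an official formula; the 4.3B total parameter count and the effective bits-per-weight values are assumptions for illustration.

```python
def approx_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate model file size in GB: parameters * bits / 8.

    Real GGUF files add metadata overhead, and K-quants mix block
    types, so treat the result as an estimate only.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Assumed values: ~4.3B total parameters; effective bits per weight
# of ~8.5 for Q8_0 and ~4.85 for Q4_K_M.
for name, bits in [("BF16/F16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.85)]:
    print(f"{name}: ~{approx_size_gb(4.3, bits):.1f} GB")
```

The estimates land close to the actual file sizes in this repository (8.48 GB for F16, 4.51 GB for Q8_0, 2.67 GB for Q4_K_M), which is a useful sanity check when picking a quantization for a given memory budget.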

Common Parameters

  • -p "prompt": Your input text for the model to respond to
  • -c 2048: Context length (maximum number of tokens that can be processed)
  • --hf-repo: Hugging Face repository name
  • --hf-file: Model file name to use

Adjusting Generation Parameters

llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -p "Your prompt here" \
  --temp 0.7 \
  --top-p 0.9 \
  --repeat-penalty 1.1

Parameter explanations:

  • --temp: Temperature (0.0-2.0), higher values produce more random output
  • --top-p: Nucleus sampling parameter (0.0-1.0)
  • --repeat-penalty: Repetition penalty to avoid repetitive content
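To make the temperature and top-p knobs concrete, here is a minimal sketch of how they act on a token distribution. This is an illustration, not llama.cpp's actual sampler implementation.

```python
import math
import random

def sample_next_token(logits, temperature=0.7, top_p=0.9):
    """Illustrative temperature + nucleus (top-p) sampling over raw logits."""
    # Temperature scales the logits: <1.0 sharpens, >1.0 flattens the distribution.
    scaled = [l / temperature for l in logits]
    # Softmax to probabilities (shifted by the max for numerical stability).
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus sampling: keep the smallest set of tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Draw from the truncated, renormalized distribution.
    r = random.random() * sum(probs[i] for i in kept)
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

With a low top-p, only the highest-probability tokens survive the cut, which is why lowering it makes output more focused and deterministic.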

Model Information

  • Base Model: twinkle-ai/gemma-3-4B-T1-it
  • Languages: English, Traditional Chinese
  • License: Gemma
  • Format: GGUF (converted via GGUF-my-repo)

Training Data

  • Taiwan reasoning and instruction datasets
  • Contract review and legal documents
  • Multimodal and long-form content
  • Instruction-following examples

Benchmarks

  • TMMLU+: 47.44% accuracy
  • MMLU: 59.13% accuracy
  • TW Legal Benchmark: 44.18% accuracy

Troubleshooting

Common Issues

Q: Getting out of memory errors?

A: Try a smaller quantized version such as Q4_K_M, or reduce the context length with the -c parameter.

Q: How can I speed up inference?

A:

  1. Use GPU acceleration (add hardware-specific flags during compilation)
  2. Choose a smaller quantized model (such as Q4_K_M)
  3. Reduce context length

Q: What prompt format does the model support?

A: This is an instruction-tuned model. Use a clear instruction format, for example:

Please analyze the main clauses of the following contract: [contract content]
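In conversation mode, llama.cpp applies the chat template stored in the GGUF metadata automatically. If you build prompts by hand instead, Gemma-family instruction models use a turn-based format with `<start_of_turn>`/`<end_of_turn>` markers; the helper below is a single-turn sketch of that assumed template, not taken from this repository.

```python
def gemma_prompt(user_message: str) -> str:
    """Build a single-turn prompt in the Gemma chat format (assumed template)."""
    return (
        "<start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

print(gemma_prompt("Please analyze the main clauses of the following contract: ..."))
```

The trailing `<start_of_turn>model\n` cues the model to begin its reply; generation then continues until it emits `<end_of_turn>`.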


Contributing

If you have any questions or suggestions, please feel free to open a discussion in the Hugging Face repository.


Note: On first run, llama.cpp will automatically download the model file from Hugging Face. Please ensure you have a stable internet connection.

πŸ“‚ GGUF File List

πŸ“ Filename πŸ“¦ Size ⚑ Download
gemma-3-4b-t1-it-q8_0.gguf
LFS Q8
4.51 GB Download
twinkle-ai-gemma-3-4B-T1-it-BF16.gguf
LFS FP16
8.48 GB Download
twinkle-ai-gemma-3-4B-T1-it-F16.gguf
LFS FP16
8.48 GB Download
twinkle-ai-gemma-3-4B-T1-it-Q8_0.gguf
LFS Q8
4.51 GB Download
twinkle-ai-gemma-3-4b-t1-it-q4_k_m.gguf
Recommended LFS Q4
2.67 GB Download