

---
license: other
license_name: prism-research
license_link: LICENSE.md
language:
  - en
  - zh
tags:
  - glm4
  - prism
  - moe
pipeline_tag: text-generation
library_name: transformers
---


# GLM-4.7-Flash-PRISM

A version of Z.AI's GLM-4.7-Flash with over-refusal, propaganda, and bias mechanisms removed using our Advanced PRISM Pipeline.


## Model Highlights

- **PRISM Ablation** — State-of-the-art technique that removes over-refusal behaviors while preserving model capabilities
- **30B-A3B MoE Architecture** — 30 billion total parameters with ~3 billion active per token for fast, efficient inference
- **128K Context Window** — Extended context for complex tasks and large codebases
- **Interleaved Thinking** — Multi-turn reasoning that persists across conversations, with per-turn thinking control

Benchmarks

BenchmarkGLM-4.7-FlashQwen3-30B-A3B-Thinking-2507GPT-OSS-20B
AIME 202591.685.091.7
GPQA75.273.471.5
LCB v664.066.061.0
HLE14.49.810.9
SWE-bench Verified59.222.034.0
τ²-Bench79.549.047.7
BrowseComp42.82.2928.3

## Usage

### Transformers

Install the latest transformers from source:

```shell
pip install git+https://github.com/huggingface/transformers.git
```

Run inference:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "Ex0bit/GLM-4.7-Flash-PRISM"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Hello!"}]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt
output_text = tokenizer.decode(generated_ids[0][inputs["input_ids"].shape[1]:])
print(output_text)
```
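When thinking is enabled, GLM-style models emit their reasoning inside `<think>…</think>` delimiters, which is what the `glm45` reasoning parser in the serving configurations handles for you. When working with raw Transformers output you may want to separate the two yourself; the helper below is an illustrative sketch assuming those delimiters, not part of the Transformers API.

```python
import re

def split_reasoning(text):
    """Split model output into (reasoning, answer), assuming the
    reasoning is wrapped in <think>...</think> tags. If no tags are
    present, the whole text is treated as the answer."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if m is None:
        return "", text.strip()
    return m.group(1).strip(), text[m.end():].strip()

raw = "<think>User greeted me; reply briefly.</think>Hello! How can I help?"
reasoning, answer = split_reasoning(raw)
print(answer)  # Hello! How can I help?
```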

### vLLM

Install vLLM nightly:

```shell
pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
pip install git+https://github.com/huggingface/transformers.git
```

Serve the model:

```shell
vllm serve Ex0bit/GLM-4.7-Flash-PRISM \
     --tensor-parallel-size 4 \
     --speculative-config.method mtp \
     --speculative-config.num_speculative_tokens 1 \
     --tool-call-parser glm47 \
     --reasoning-parser glm45 \
     --enable-auto-tool-choice \
     --served-model-name glm-4.7-flash-prism
```
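Once serving, the model is reachable through vLLM's OpenAI-compatible API at `/v1/chat/completions`. A minimal sketch of the request body follows; the `model` field matches `--served-model-name`, and actually sending it of course requires the server to be running.

```python
import json

# Request body for POST http://<host>:8000/v1/chat/completions.
# The model name must match --served-model-name in the serve command.
payload = {
    "model": "glm-4.7-flash-prism",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 1.0,
    "top_p": 0.95,
    "max_tokens": 1024,
}
body = json.dumps(payload)
print(json.loads(body)["model"])  # glm-4.7-flash-prism
```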

### SGLang

Install SGLang:

```shell
uv pip install sglang==0.3.2.dev9039+pr-17247.g90c446848 --extra-index-url https://sgl-project.github.io/whl/pr/
uv pip install git+https://github.com/huggingface/transformers.git@76732b4e7120808ff989edbd16401f61fa6a0afa
```

Launch the server:

```shell
python3 -m sglang.launch_server \
  --model-path Ex0bit/GLM-4.7-Flash-PRISM \
  --tp-size 4 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.8 \
  --served-model-name glm-4.7-flash-prism \
  --host 0.0.0.0 \
  --port 8000
```

**Note:** For Blackwell GPUs, add `--attention-backend triton --speculative-draft-attention-backend triton` to your SGLang launch command.

## Recommended Parameters

| Use Case | Temperature | Top-P | Max New Tokens |
|---|---|---|---|
| Default | 1.0 | 0.95 | 131072 |
| Code (SWE-bench) | 0.7 | 1.0 | 16384 |
| Agentic Tasks | 0.0 | — | 16384 |
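The table above maps directly onto keyword arguments for `model.generate()` in Transformers. A sketch of the three presets follows; treating temperature 0.0 as greedy decoding (`do_sample=False`, with `top_p` dropped since it does not apply) is our interpretation of the agentic row, not something the table states.

```python
# Presets mirroring the recommended-parameters table, shaped as
# keyword arguments for transformers' model.generate().
PRESETS = {
    "default": {"do_sample": True, "temperature": 1.0, "top_p": 0.95,
                "max_new_tokens": 131072},
    "code": {"do_sample": True, "temperature": 0.7, "top_p": 1.0,
             "max_new_tokens": 16384},
    # Temperature 0.0 means greedy decoding, so sampling is disabled
    # and top_p is omitted.
    "agentic": {"do_sample": False, "max_new_tokens": 16384},
}

def generation_kwargs(use_case):
    """Return a copy of the preset so callers can tweak it safely."""
    return dict(PRESETS[use_case])

print(generation_kwargs("code")["temperature"])  # 0.7
```

Usage: `model.generate(**inputs, **generation_kwargs("code"))`.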

## License

This model is released under the PRISM Research License.

## Citation

```bibtex
@misc{elbaz2026glm47flashPrism,
  author = {Elbaz, Eric},
  title = {Elbaz-GLM-4.7-Flash-PRISM: Unchained GLM-4.7-Flash-PRISM Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Ex0bit/Elbaz-GLM-4.7-Flash-PRISM}}
}
```

## Acknowledgments

Based on GLM-4.7-Flash by Z.AI. See the technical report for more details on the base model.
