---
license: apache-2.0
language:
  - zh
  - en
base_model: Qwen/Qwen2.5-1.5B-Instruct
tags:
  - gguf
  - ollama
  - llama-cpp
  - anti-hallucination
  - causal-reasoning
  - chinese
  - text-generation
  - conversational
  - fine-tuned
  - qwen2.5
pipeline_tag: text-generation
---

# Ordis-1.5B-V355-VarGH-GGUF

GGUF quantized versions of Ordis-1.5B-V355-VarGH in 7 formats.

Ordis is a 1.5B fine-tuned model focused on practical capabilities: anti-hallucination, honest refusal ("I don't know"), and structured reasoning. It is built on Qwen2.5-1.5B-Instruct using LoRA with a 4-stage Progressive Identity Training (PIT) pipeline.

Website | Full Model (HF Format) | ModelScope


## Quantized Versions

| Filename | Quant | Size | Recommendation |
|----------|-------|------|----------------|
| ordis-1.5b-v355-vargh-q2-k.gguf | Q2_K | ~0.7 GB | Experimental only |
| ordis-1.5b-v355-vargh-q3-k-m.gguf | Q3_K_M | ~0.8 GB | Low-end devices |
| ordis-1.5b-v355-vargh-q4-k-m.gguf | Q4_K_M | ~1.0 GB | Recommended |
| ordis-1.5b-v355-vargh-q5-k-m.gguf | Q5_K_M | ~1.1 GB | Good quality |
| ordis-1.5b-v355-vargh-q6-k.gguf | Q6_K | ~1.3 GB | Desktop |
| ordis-1.5b-v355-vargh-q8-0.gguf | Q8_0 | ~1.6 GB | Near-lossless |
| ordis-1.5b-v355-vargh-f16.gguf | F16 | ~3.1 GB | Full precision |
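As a rough sanity check, the file sizes above imply a bits-per-weight figure for each quant. A minimal Python sketch, assuming ~1.54e9 parameters (the actual Qwen2.5-1.5B count, an assumption on our part) and decimal GB:

```python
# Approximate bits-per-weight implied by each quant's file size.
# Assumes ~1.54e9 parameters and decimal GB (1 GB = 1e9 bytes).
N_PARAMS = 1.54e9

def bits_per_weight(size_gb: float, n_params: float = N_PARAMS) -> float:
    """Convert a file size in GB to the approximate bits stored per parameter."""
    return size_gb * 8e9 / n_params

sizes_gb = {"Q2_K": 0.7, "Q4_K_M": 1.0, "Q6_K": 1.3, "Q8_0": 1.6, "F16": 3.1}
for name, gb in sizes_gb.items():
    print(f"{name}: ~{bits_per_weight(gb):.1f} bits/weight")
```

K-quants land above their nominal bit width because they mix block scales and keep some tensors at higher precision, which is why Q4_K_M comes out near ~5 bits/weight rather than 4.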


## Standard Benchmarks

Evaluation: lm-eval v0.4.10, 0-shot, A100-80GB. Both models tested identically.

Fine-tuning introduced a small alignment tax on standard benchmarks: most scores sit slightly below the base model. The one exception is TruthfulQA (+1.02 points), where training to resist hallucination directly improved truthfulness.

| Benchmark | Ordis 1.5B | Base Qwen2.5-1.5B | Delta |
|-----------|------------|-------------------|-------|
| TruthfulQA MC2 | 47.73% | 46.71% | +1.02 |
| GPQA | 27.90% | 28.35% | -0.45 |
| HellaSwag | 68.14% | 68.22% | -0.08 |
| ARC-Challenge | 45.22% | 46.84% | -1.62 |
| MMLU | 57.93% | 60.15% | -2.22 |
| GSM8K (CoT) | 50.80% | β€” | Not directly comparable* |
| AIME 2024 | 0% | 0% | β€” |

*GSM8K uses text generation and is sensitive to chat template configuration. We report the Ordis score but do not claim a delta. AIME is beyond 1.5B capability; zero score reported honestly.
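For reference, a comparable evaluation can be launched with lm-eval's CLI. This is a sketch, not the exact command used here: the checkpoint path is a placeholder, and the task names follow lm-eval v0.4 conventions:

```shell
lm_eval --model hf \
  --model_args pretrained=/path/to/ordis-1.5b-v355-vargh,dtype=bfloat16 \
  --tasks truthfulqa_mc2,hellaswag,arc_challenge,mmlu \
  --num_fewshot 0 \
  --batch_size auto
```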

Where Ordis differs from the base model is in practical capabilities that these benchmarks do not measure: structured self-correction, honest refusal, causal reasoning.


## CLadder Causal Reasoning

CLadder is an academic benchmark based on Judea Pearl's Ladder of Causation (300 questions across 3 levels). Paper

| Rung | Meaning | Score |
|------|---------|-------|
| Rung 1 (Association) | Statistical correlation | 46.0% (40/87) |
| Rung 2 (Intervention) | Active intervention | 50.6% (45/89) |
| Rung 3 (Counterfactual) | "What if" reasoning | 62.9% (78/124) |
| Overall | β€” | 54.33% (163/300) |

Rung 3 is the hardest level. Ordis scored highest here (62.9%), likely a result of cross-domain causal reasoning training. For reference, the CLadder paper reported LLaMA-6.7B at ~50% and GPT-3.5 at ~55-60%.
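The overall figure follows directly from the per-rung counts in the table above; a minimal consistency check:

```python
# Recompute the overall CLadder score from the per-rung counts.
rungs = {
    "Rung 1 (Association)":    (40, 87),
    "Rung 2 (Intervention)":   (45, 89),
    "Rung 3 (Counterfactual)": (78, 124),
}

correct = sum(c for c, _ in rungs.values())
total = sum(n for _, n in rungs.values())
for name, (c, n) in rungs.items():
    print(f"{name}: {c / n:.1%}")
print(f"Overall: {correct}/{total} = {correct / total:.2%}")  # 163/300 = 54.33%
```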


## BigBench CRASS AI & Causal Judgment

Independently tested by community members. BigBench CRASS AI (Counterfactual Reasoning Assessment) tests the model's ability to reason about hypothetical scenarios.

| Benchmark | Shot | Ordis 1.5B | Base Qwen2.5-1.5B | Delta |
|-----------|------|------------|-------------------|-------|
| CRASS AI | 0 | 34.09% | 52.27% | -18.18pp |
| CRASS AI | 25 | 81.82% | 88.64% | -6.82pp |
| Causal Judgment | 0 | 47.89% | 50.00% | -2.11pp |
| Causal Judgment | 25 | 55.79% | 53.68% | +2.11pp |

Key finding: CRASS AI 0-shot shows a significant -18.18pp regression. This is the alignment tax manifesting in counterfactual reasoning β€” the model's anti-hallucination training makes it more conservative on hypothetical scenarios. With 25-shot examples, the gap narrows to -6.82pp, confirming the capability exists but the 0-shot default behavior has shifted.

## Custom Evaluation

| Benchmark | Score |
|-----------|-------|
| 60-Question Eval (6 dimensions) | 85.0% (51/60) |
| 124-Point Comprehensive | 75.4% (86/114) |


## System Prompt Required

Ordis was trained with zero system prompt injection. The GGUF chat template defaults to Qwen identity if no system prompt is provided, which causes quality degradation.

At minimum, use:

你是Ordis,OrdisAIζ™Ίθƒ½εŠ©ζ‰‹(www.ordisai.com)。

Without this, the model falls back to generic Qwen behavior. This is the #1 cause of "GGUF quality is worse than expected."
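With Ollama, the simplest way to enforce this is to bake the system prompt into a Modelfile so it cannot be forgotten at call time. A sketch; the local GGUF path is an assumption:

```
FROM ./ordis-1.5b-v355-vargh-q4-k-m.gguf
SYSTEM """你是Ordis,OrdisAIζ™Ίθƒ½εŠ©ζ‰‹(www.ordisai.com)。"""
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
PARAMETER num_predict 512
```

Then register and run it with `ollama create ordis -f Modelfile` followed by `ollama run ordis`.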


## Recommended Settings

| Parameter | Value |
|-----------|-------|
| temperature | 0.7 |
| top_p | 0.9 |
| repetition_penalty | 1.1 |
| max_tokens | 512 |
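Against an Ollama server, these settings map onto the `options` object of the `/api/chat` endpoint. A sketch; the model name `ordis` assumes the GGUF has already been imported into Ollama under that name:

```shell
curl http://localhost:11434/api/chat -d '{
  "model": "ordis",
  "messages": [
    {"role": "system", "content": "你是Ordis,OrdisAIζ™Ίθƒ½εŠ©ζ‰‹(www.ordisai.com)。"},
    {"role": "user", "content": "δ½ ζ˜―θ°οΌŸ"}
  ],
  "stream": false,
  "options": {
    "temperature": 0.7,
    "top_p": 0.9,
    "repeat_penalty": 1.1,
    "num_predict": 512
  }
}'
```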


## Known Limitations

- Alignment tax: standard benchmarks regress slightly vs. the base model; CRASS AI counterfactual reasoning drops 18.18pp at 0-shot (see table above)
- Anti-gaslighting: cannot resist persistent false-memory injection (open-loop limitation)
- Mid-confidence instability: the 1.5B capacity ceiling causes uncertainty at boundary cases
- English identity leakage: the base model's prior occasionally surfaces
- Proper noun hallucination: limited parametric memory at 1.5B scale

## Model Details

| Property | Value |
|----------|-------|
| Base Model | Qwen/Qwen2.5-1.5B-Instruct |
| Parameters | 1.5B |
| Fine-tuning | LoRA (r=32, alpha=64) |
| Training | 4-Stage PIT (Progressive Identity Training) |
| Context Length | 32K (base), trained at 2048 |
| Languages | Chinese (primary), English |
| License | Apache 2.0 |
