📋 Model Description


library_name: transformers
license: llama3.2
base_model: meta-llama/Llama-3.2-3B-Instruct

Schematron-3B GGUF Models

Model Generation Details

This model was generated using llama.cpp at commit 8872ad212.


Quantization Beyond the IMatrix

I've been experimenting with a new quantization approach that selectively elevates the precision of key layers beyond what the default IMatrix configuration provides.

In my testing, standard IMatrix quantization underperforms at lower bit depths, especially with Mixture of Experts (MoE) models. To address this, I'm using the --tensor-type option in llama.cpp to manually "bump" important layers to higher precision. You can see the implementation here:
👉 Layer bumping with llama.cpp

While this does increase model file size, it significantly improves precision for a given quantization level.

I'd love your feedback—have you tried this? How does it perform for you?



Click here to get info on choosing the right GGUF model format



Schematron


Documentation ·
Serverless API ·
Announcement blog


Model Overview

Welcome to the Schematron series, Inference.net's long‑context extraction models specialized in converting noisy HTML into clean, typed JSON that conforms to your custom schema. The Schematron series was purpose‑trained for web scraping, data ingestion, and transforming arbitrary pages into structured records.

We're releasing these models in two different sizes:

  • Schematron‑8B — marginal quality lift on harder/longer pages
  • Schematron‑3B — recommended default; near‑parity quality at ~50% cost of Schematron-8B

[!NOTE]

This model card is dedicated to the smaller Schematron-3B model. Check out Schematron-8B for the larger model.

I/O at a glance

  • Input: Cleaned HTML + JSON Schema (can be generated from typed models like Pydantic or Zod)
  • Output: Strictly valid JSON conforming to the provided schema (no narration)

[!NOTE]

The JSON Schema passed as input needs to conform to the standard JSON Schema specification.
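
For concreteness, here is a small sketch of the input/output contract. The schema and field names below are illustrative examples, not taken from the model card:

```python
import json

# Hypothetical JSON Schema for a product page; all field names are illustrative.
product_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "description": "Product title as shown on the page"},
        "price": {"type": "number", "description": "Current price shown on the page"},
        "in_stock": {"type": "boolean"},
    },
    "required": ["title", "price"],
}

# The model's reply is strictly valid JSON conforming to the schema -- no narration.
model_reply = '{"title": "Acme Widget", "price": 19.99, "in_stock": true}'
record = json.loads(model_reply)
print(sorted(record))  # ['in_stock', 'price', 'title']
```

Because the output is plain JSON with no surrounding prose, `json.loads` can parse it directly.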

Highlights

  • Schema-first extraction: 100% schema‑conformant JSON outputs
  • Long context: Robust to lengthy, noisy HTML (up to 128K tokens)
  • Variants: 3B (default, most cost‑efficient) · 8B (marginal quality lift at ~2× cost)

Model Details

  • Family: Schematron (3B and 8B)
  • Context window: Up to 128K tokens
  • Input: Cleaned or raw HTML and a JSON Schema
  • Output: Strict JSON that conforms to the provided schema

Benchmarks

HTML-to-JSON Extraction Quality

We evaluated extraction quality using Gemini 2.5 Pro as a judge, scoring extractions from 1-5 where 5 represents perfect extraction.

| Model | LLM-as-Judge Score |
|---|---|
| GPT-4.1 | 4.74 |
| Schematron-8B | 4.64 |
| Schematron-3B | 4.41 |
| Gemini-3B-Base | 2.24 |

Web-Augmented Factuality on SimpleQA

We evaluated Schematron's real-world impact on LLM factuality using SimpleQA.

Test Pipeline:

  1. Query Generation: Primary LLM (GPT-5 Nano or GPT-4.1) generates search queries and defines extraction schema
  2. Web Search: Search provider (SERP or Exa) retrieves relevant pages
  3. Structured Extraction: Schematron extracts JSON data from retrieved pages using the schema
  4. Answer Synthesis: Primary LLM produces final answer from structured data
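
The four steps above can be sketched as a single function; every component callable here is a hypothetical stand-in for the pipeline pieces described (not a real API):

```python
# Minimal sketch of the SimpleQA test pipeline; all callables are toy stand-ins.
def answer_with_web_extraction(question, schema, gen_queries, search, extract, synthesize):
    queries = gen_queries(question)                    # 1. query generation (primary LLM)
    pages = [p for q in queries for p in search(q)]    # 2. web search (SERP or Exa)
    records = [extract(schema, p) for p in pages]      # 3. schema-guided extraction (Schematron)
    return synthesize(question, records)               # 4. answer synthesis (primary LLM)

# Toy stand-ins just to show the data flow end to end:
answer = answer_with_web_extraction(
    "Who wrote Dracula?",
    {"type": "object", "properties": {"author": {"type": "string"}}},
    gen_queries=lambda q: ["dracula author"],
    search=lambda q: ["<html>...Bram Stoker...</html>"],
    extract=lambda schema, html: {"author": "Bram Stoker"},
    synthesize=lambda q, recs: recs[0]["author"],
)
print(answer)  # Bram Stoker
```

The key design point is step 3: the primary LLM never sees raw HTML, only compact structured records, which is what keeps token usage low.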


Key findings:

  • Web search paired with JSON extraction improves factuality: Adding Schematron with web retrieval improves GPT-5 Nano's accuracy from 8.54% to 82.87%—nearly a 10x improvement
  • Search provider matters: Exa (82.9%) significantly outperforms SERP (64.2%) for factual retrieval, while also being more cost-effective
  • Structured extraction beats raw HTML: Processing raw HTML would require 100k+ tokens for 10 searches; Schematron's JSON extraction reduces this by orders of magnitude
  • Small specialized models win: Schematron-8B (82.87%) outperforms the much larger Gemini 2.5 Flash (80.61%) on this task, showing that fine-tuning for well-defined tasks beats general purpose models
  • Performance scales with model quality: When paired with GPT-4.1, Schematron achieves 85.58% accuracy, showing the approach benefits from stronger base models

Minimal Quickstart

Use these local snippets to prepare HTML and compose a schema‑guided prompt. The model returns strictly valid JSON; validate it against your schema downstream.

```python
# Requires lxml (on lxml >= 5.2 the cleaner lives in the separate
# lxml_html_clean package).
from lxml.html.clean import Cleaner
import lxml.html as LH

HTML_CLEANER = Cleaner(
    scripts=True,
    javascript=True,
    style=True,
    inline_style=True,
    safe_attrs_only=False,
)

def strip_noise(html: str) -> str:
    """Remove scripts, styles, and JavaScript from HTML using lxml."""
    if not html or not html.strip():
        return ""
    try:
        doc = LH.fromstring(html)
        cleaned = HTML_CLEANER.clean_html(doc)
        return LH.tostring(cleaned, encoding="unicode")
    except Exception:
        return ""
```

Compose messages with your schema and cleaned HTML:

```python
def construct_messages(schema: str, html: str):
    """Construct messages for a schema-guided extraction request."""
    response_prompt = {
        "prompt_part_one": (
            "You are going to be given a JSON schema following the standardized JSON "
            "Schema format. You are going to be given a HTML page and you are going "
            "to apply the schema to the HTML page however you see it as applicable "
            "and return the results in a JSON object. The schema is as follows:"
        ),
        "prompt_part_two": "Here is the HTML page:",
        "prompt_part_three": "MAKE SURE ITS VALID JSON.",
    }

    user_prompt = (
        response_prompt["prompt_part_one"]
        + "\n\n" + schema + "\n\n"
        + response_prompt["prompt_part_two"]
        + "\n\n" + html + "\n\n"
        + response_prompt["prompt_part_three"]
    )

    return [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": user_prompt},
    ]
```

[!NOTE]

In the serverless API there's no need to pass anything but the HTML. We handle the prompt formatting for you.

Recommendations

  • Temperature 0 and JSON mode for deterministic, parseable output
  • Validate responses against your schema (e.g., Pydantic or Zod)
  • Pre‑clean HTML (remove scripts/styles) when possible; avoid over‑aggressive removal
  • Using lxml to clean the HTML is not required but is recommended, as it matches the training data
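
As a concrete illustration of the validation recommendation, here is a minimal stdlib-only sketch (in practice you would use Pydantic or Zod; the function name and schema are hypothetical):

```python
import json

# Stdlib-only validation sketch: parse the model's reply and check required
# fields plus basic types against a (subset of) JSON Schema.
def validate_extraction(raw: str, schema: dict) -> dict:
    record = json.loads(raw)  # raises ValueError if the reply is not valid JSON
    type_map = {"string": str, "number": (int, float), "boolean": bool}
    for field in schema.get("required", []):
        if field not in record:
            raise ValueError(f"missing required field: {field}")
    for field, spec in schema.get("properties", {}).items():
        if field in record and not isinstance(record[field], type_map[spec["type"]]):
            raise TypeError(f"wrong type for field: {field}")
    return record

schema = {
    "type": "object",
    "properties": {"title": {"type": "string"}, "price": {"type": "number"}},
    "required": ["title"],
}
print(validate_extraction('{"title": "Acme Widget", "price": 9.5}', schema))
```

A full validator would also handle nested objects, arrays, and formats; libraries like `jsonschema` or Pydantic cover those cases.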

Limitations

  • Static HTML only; render client‑side content upstream
  • Very large pages may require truncation
  • Ambiguous fields depend on schema clarity; be explicit in field descriptions
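
For the truncation caveat above, a rough character-budget sketch is often enough as a first pass (the 4-characters-per-token ratio is an assumption; real counts depend on the tokenizer):

```python
# Rough truncation sketch for very large pages, assuming ~4 chars per token.
# For exact budgeting, count tokens with the model's actual tokenizer instead.
def truncate_html(html: str, max_tokens: int = 128_000, chars_per_token: int = 4) -> str:
    budget = max_tokens * chars_per_token
    return html if len(html) <= budget else html[:budget]

page = "<div>" + "x" * 600_000 + "</div>"
print(len(truncate_html(page, max_tokens=100_000)))  # 400000
```

Note that naive slicing can cut a tag in half; cleaning the HTML first (as in the quickstart) reduces how much truncation is needed at all.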

Safety and Responsible Use

  • Extracted data may include personal or sensitive information present in the page—handle and store responsibly
  • Respect site terms, robots.txt, and applicable laws
  • Use downstream validation and guardrails for compliance

License

See license in the metadata above.

Support


🚀 If you find these models useful

Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks:

👉 Quantum Network Monitor

The full open source code for the Quantum Network Monitor Service is available in my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor. You will also find the code I use to quantize the models, if you want to do it yourself: GGUFModelBuilder

💬 How to test:
Choose an AI assistant type:
- TurboLLM (GPT-4.1-mini)
- HugLLM (Hugging Face open-source models)
- TestLLM (Experimental CPU-only)

What I’m Testing

I’m pushing the limits of small open-source models for AI network monitoring, specifically:
  • Function calling against live network services
  • How small can a model go while still handling:
    - Automated Nmap security scans
    - Quantum-readiness checks
    - Network Monitoring tasks

🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on huggingface docker space):

  • Zero-configuration setup
  • ⏳ 30s load time (slow inference but no API costs). No token limits, as the cost is low.
  • 🔧 Help wanted! If you’re into edge-device AI, let’s collaborate!

Other Assistants

🟢 TurboLLM – Uses gpt-4.1-mini :
  • It performs very well, but unfortunately OpenAI charges per token. For this reason, token usage is limited.
  • Create custom cmd processors to run .net code on Quantum Network Monitor Agents
  • Real-time network diagnostics and monitoring
  • Security Audits
  • Penetration testing (Nmap/Metasploit)

🔵 HugLLM – Latest Open-source models:

  • 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.

💡 Example commands you could test:

  1. "Give me info on my website's SSL certificate"
  2. "Check if my server is using quantum safe encryption for communication"
  3. "Run a comprehensive security audit on my server"
  4. "Create a cmd processor to .. (whatever you want)" Note: you need to install a Quantum Network Monitor Agent to run the .net code on. This is a very flexible and powerful feature. Use with caution!

Final Word

I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source. Feel free to use whatever you find helpful.

If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.

I'm also open to job opportunities or sponsorship.

Thank you! 😊

📂 GGUF File List

| 📁 Filename | Quant | 📦 Size |
|---|---|---|
| Schematron-3B-bf16.gguf | FP16 | 5.99 GB |
| Schematron-3B-bf16_q8_0.gguf | Q8 | 4.45 GB |
| Schematron-3B-f16.gguf | FP16 | 5.99 GB |
| Schematron-3B-f16_q8_0.gguf | Q8 | 4.45 GB |
| Schematron-3B-imatrix.gguf | – | 2.87 MB |
| Schematron-3B-iq2_m.gguf | Q2 | 1.46 GB |
| Schematron-3B-iq2_s.gguf | Q2 | 1.46 GB |
| Schematron-3B-iq2_xs.gguf | Q2 | 1.42 GB |
| Schematron-3B-iq3_m.gguf | Q3 | 1.5 GB |
| Schematron-3B-iq3_s.gguf | Q3 | 1.5 GB |
| Schematron-3B-iq3_xs.gguf | Q3 | 1.43 GB |
| Schematron-3B-iq3_xxs.gguf | Q3 | 1.39 GB |
| Schematron-3B-iq4_nl.gguf | Q4 | 1.79 GB |
| Schematron-3B-iq4_xs.gguf | Q4 | 1.7 GB |
| Schematron-3B-q2_k_l.gguf | Q2 | 1.48 GB |
| Schematron-3B-q2_k_m.gguf | Q2 | 1.4 GB |
| Schematron-3B-q2_k_s.gguf | Q2 | 1.35 GB |
| Schematron-3B-q3_k_l.gguf | Q3 | 1.78 GB |
| Schematron-3B-q3_k_m.gguf | Q3 | 1.69 GB |
| Schematron-3B-q3_k_s.gguf | Q3 | 1.64 GB |
| Schematron-3B-q4_0.gguf | Q4 (recommended) | 1.69 GB |
| Schematron-3B-q4_0_l.gguf | Q4 | 1.87 GB |
| Schematron-3B-q4_1.gguf | Q4 | 1.88 GB |
| Schematron-3B-q4_1_l.gguf | Q4 | 2.04 GB |
| Schematron-3B-q4_k_l.gguf | Q4 | 1.99 GB |
| Schematron-3B-q4_k_m.gguf | Q4 | 1.9 GB |
| Schematron-3B-q4_k_s.gguf | Q4 | 1.85 GB |
| Schematron-3B-q5_0.gguf | Q5 | 2.06 GB |
| Schematron-3B-q5_0_l.gguf | Q5 | 2.2 GB |
| Schematron-3B-q5_1.gguf | Q5 | 2.25 GB |
| Schematron-3B-q5_1_l.gguf | Q5 | 2.37 GB |
| Schematron-3B-q5_k_l.gguf | Q5 | 2.3 GB |
| Schematron-3B-q5_k_m.gguf | Q5 | 2.21 GB |
| Schematron-3B-q5_k_s.gguf | Q5 | 2.18 GB |
| Schematron-3B-q6_k_l.gguf | Q6 | 2.55 GB |
| Schematron-3B-q6_k_m.gguf | Q6 | 2.46 GB |
| Schematron-3B-q8_0.gguf | Q8 | 3.19 GB |

All files are stored via Git LFS.