---
base_model: arcee-ai/AFM-4.5B
library_name: transformers
pipeline_tag: text-generation
language:
- en
tags:
- medical
- instruction-tuned
- dpo
- grpo
- cot
- mergekit
- arcee-fusion
- openmed
---

# AFM-4.5B-OpenMed-GGUF

## Model Description
Lightweight medical finetune on top of Arcee's AFM-4.5B for education and research use. Trained with a simple 3-stage recipe (SFT → DPO → GRPO-CoT) and finalized via Arcee Fusion weight merging (MergeKit).
More information about our methodology will be available in a forthcoming blog post.
All experiments were performed on AMD MI300x GPUs, with computing credits generously provided by Hot AISLE.
## ⚠️ Medical safety
This model is not a clinician. It can hallucinate and should not be used for diagnosis or treatment. Always involve qualified medical professionals.
## TL;DR

- Base: `arcee-ai/AFM-4.5B`, Arcee's 4.5B instruction model intended for cloud-to-edge deployment.
- Training (high level): SFT → DPO → GRPO (CoT), finalized with an Arcee Fusion merge.
- Eval: EleutherAI lm-evaluation-harness, author's settings, batch size 64.

Note: Arcee's internal evals may use different harnesses; avoid cross-harness comparisons.
## What's inside

### Specialization steps

- Domain SFT (medical + tools)
- Preference alignment → DPO
- Reasoning enrichment → GRPO (CoT)
- Finalization → Arcee Fusion (MergeKit, `merge_method: arcee_fusion`)
## Intended use & limitations

Intended: research on medical SLMs, tool-augmented retrieval demos.

Out of scope: unsupervised patient care, generating prescriptions, and time-critical guideline decisions.
## Evaluation

Author-run with the EleutherAI lm-evaluation-harness; seeds, prompts, and templates affect absolute scores.
| Benchmark | AFM-4.5B-OpenMed | AFM-4.5B (same harness) |
|---|---|---|
| MMLU | 61.10 | 55.53 |
| MMLU-Pro | 33.44 | 32.61 |
| IFEVAL | 63.55 | 63.67 |
- MMLU-Pro increases difficulty (10 options; more reasoning-heavy); small deltas are still meaningful.
- IFEVAL checks verifiable constraints (length, keyword counts, format, etc.).
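As an illustration of what "verifiable constraints" means here, the sketch below implements a few toy checkers in the IFEVAL spirit. It is illustrative only: the real benchmark ships its own instruction registry and grading code, and these helper names are invented for this example.

```python
# Toy verifiable-constraint checkers in the spirit of IFEVAL.
# Illustrative only: the real benchmark uses its own instruction
# registry; these helpers are invented for this sketch.
import json

def check_max_words(text: str, limit: int) -> bool:
    """True if the response stays within a word budget."""
    return len(text.split()) <= limit

def check_keyword_count(text: str, keyword: str, n: int) -> bool:
    """True if the keyword appears exactly n times (case-insensitive)."""
    return text.lower().count(keyword.lower()) == n

def check_json_format(text: str) -> bool:
    """True if the response parses as valid JSON."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

resp = '{"note": "cellulitis involves the deeper dermis"}'
print(check_max_words(resp, 20),
      check_keyword_count(resp, "cellulitis", 1),
      check_json_format(resp))
```

Checks like these are deterministic, which is what makes the benchmark score verifiable rather than judged.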
| MMLU subset | AFM-4.5B-OpenMed | AFM-4.5B |
|---|---|---|
| **other** | | |
| clinical_knowledge | 67.55 | 65.66 |
| college_medicine | 64.74 | 54.34 |
| professional_medicine | 63.97 | 59.56 |
| virology | 49.40 | 48.19 |
| **stem** | | |
| anatomy | 62.96 | 56.30 |
| college_biology | 78.47 | 65.97 |
| college_chemistry | 44.00 | 37.00 |
| high_school_biology | 79.03 | 71.29 |
| high_school_chemistry | 53.20 | 43.84 |
| **groups** | | |
| humanities | 56.13 | 50.46 |
| other | 68.97 | 63.47 |
| social_sciences | 73.25 | 68.61 |
| stem | 48.91 | 42.53 |
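Per-subtask numbers like those above can be pulled out of the harness's JSON output with a small filter. The key names used below (`"results"`, `"acc,none"`) follow recent lm-evaluation-harness output and are an assumption; adjust them if your harness version differs.

```python
# Pull per-subtask MMLU accuracies out of an lm-eval results dict.
# Key names ("results", "acc,none") follow recent lm-evaluation-harness
# JSON output; adjust if your version differs.

def mmlu_subtask_acc(results: dict, prefix: str = "mmlu_") -> dict:
    """Return {subtask: accuracy in %} for tasks starting with prefix."""
    out = {}
    for task, metrics in results["results"].items():
        if task.startswith(prefix):
            acc = metrics.get("acc,none")
            if acc is not None:
                out[task[len(prefix):]] = round(acc * 100, 2)
    return out

# Synthetic dict shaped like a harness results file:
sample = {"results": {
    "mmlu_anatomy": {"acc,none": 0.6296},
    "mmlu_college_biology": {"acc,none": 0.7847},
    "mmlu": {"acc,none": 0.6110},  # aggregate entry, no "mmlu_" prefix
}}
print(mmlu_subtask_acc(sample))
```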
## Reproduce (example commands)

```bash
# MMLU classic
lm_eval --model hf \
  --model_args pretrained=openmed-community/AFM-4.5B-OpenMed,parallelize=True,dtype=bfloat16,trust_remote_code=True \
  --tasks mmlu \
  --batch_size=64 \
  --apply_chat_template \
  --output_path=results \
  --fewshot_as_multiturn
```

```bash
# MMLU-Pro (10-choice)
lm_eval --model hf \
  --model_args pretrained=openmed-community/AFM-4.5B-OpenMed,parallelize=True,dtype=bfloat16,trust_remote_code=True \
  --tasks leaderboard_mmlu_pro \
  --batch_size=64 \
  --apply_chat_template \
  --output_path=results \
  --fewshot_as_multiturn
```

```bash
# IFEVAL (verifiable instruction following)
lm_eval --model hf \
  --model_args pretrained=openmed-community/AFM-4.5B-OpenMed,parallelize=True,dtype=bfloat16,trust_remote_code=True \
  --tasks leaderboard_ifeval \
  --batch_size=64 \
  --apply_chat_template \
  --output_path=results \
  --fewshot_as_multiturn
```
## Quickstart (Transformers)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "openmed-community/AFM-4.5B-OpenMed"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "system", "content": "You are a careful medical assistant. Cite sources and warn this is not medical advice."},
    {"role": "user", "content": "Briefly: cellulitis vs erysipelas differences?"},
]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```
## Data & training notes
- SFT data: Proprietary synthetic medical data + search traces.
- DPO signal: Preferences derived from MedMCQA multiple-choice correctness.
- GRPO reward: Answer-checking + format verifiers; MedReason used to shape faithful, short CoT.
- No known PHI; please open an issue if you spot any.
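The DPO signal described above can be sketched as turning graded multiple-choice answers into preference pairs. The record layout and `build_dpo_pairs` helper below are hypothetical, not the authors' actual pipeline; MedMCQA itself stores a question, answer options, and a correct-option index.

```python
# Sketch: derive DPO preference pairs from multiple-choice correctness.
# Hypothetical record format; not the authors' actual pipeline.

def build_dpo_pairs(records: list) -> list:
    """Turn graded MC answers into {prompt, chosen, rejected} triples.

    For each question, every correct completion is paired as "chosen"
    against every incorrect completion ("rejected") from the same item.
    """
    pairs = []
    for rec in records:
        correct = [a for a in rec["answers"] if a["is_correct"]]
        wrong = [a for a in rec["answers"] if not a["is_correct"]]
        for c in correct:
            for w in wrong:
                pairs.append({
                    "prompt": rec["question"],
                    "chosen": c["text"],
                    "rejected": w["text"],
                })
    return pairs

records = [{
    "question": "First-line treatment for uncomplicated cellulitis?",
    "answers": [
        {"text": "Oral cephalexin", "is_correct": True},
        {"text": "IV vancomycin", "is_correct": False},
    ],
}]
print(len(build_dpo_pairs(records)))
```

Pairing every correct against every incorrect option gives more preference signal per question than sampling a single pair.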
## Compatibility & licenses

- Base model: AFM-4.5B (Arcee). Refer to the base card/blog for architecture and usage details. AFM releases are licensed under Apache 2.0.
- Merging: MergeKit with Arcee Fusion; see repo/blog for configuration.
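The exact merge configuration has not been published here, but a minimal Arcee Fusion recipe in MergeKit looks roughly like the following. The source checkpoint is the RL-CoT checkpoint mentioned in the additional note; the dtype and overall layout are assumptions.

```yaml
# Hypothetical MergeKit config (run with: mergekit-yaml config.yml out-dir).
# The actual recipe used for this model has not been published.
merge_method: arcee_fusion
base_model: arcee-ai/AFM-4.5B
models:
  - model: openmed-community/AFM-4.5B-OpenMed-RL-CoT
dtype: bfloat16
```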
## Additional note
We also provide a non-merged openmed-community/AFM-4.5B-OpenMed-RL-CoT checkpoint after step 3 (GRPO). In our harness, it shows better CoT behavior but a significant drop on IFEVAL. Consider it if you want maximum reasoning verbosity, then apply your own MergeKit recipe.