What is an abliterated model?
An abliterated model is an open-weight LLM whose ability to refuse has been surgically removed by identifying and erasing the internal “refusal direction” in its weights, so it answers prompts a stock model would decline, without retraining. The result is a near-uncensored build that trades safety guardrails for permissiveness. It belongs to the same uncensored family as community builds such as Llama-uncensored, Dolphin, and Hermes, although not every build in that family is produced by abliteration. See how abliterated models score.
Definition
In the GPTfake sense, abliteration is a weight-editing technique that ablates the activation direction a model uses to refuse, collapsing its refusal behavior toward zero while leaving the rest of the network largely intact. Because it edits weights directly rather than retraining on new data, abliteration is cheap and reproducible, and it has been applied to thousands of community builds on Hugging Face.
“Abliterated” is a portmanteau of ablate + obliterate. It is distinct from a fine-tune: a fine-tune teaches new behavior from examples, while abliteration removes an existing behavior (refusal) by editing the model’s internal representation of it. For the broader concept, see what is AI censorship.
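To make "removing a direction from the weights" concrete, here is a toy sketch in plain Python. It assumes the refusal direction is already known as a vector r in a layer's output space, and it uses one common formulation of directional ablation, W' = W - r rᵀ W (with r normalized), which may differ from what any particular community build does. The function names and the 3x2 matrix are hypothetical and purely illustrative; no real model or library is involved.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale v to unit length; a zero vector has no direction to remove."""
    norm = dot(v, v) ** 0.5
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def matvec(weight, x):
    """Apply a (rows x cols) weight matrix to an input vector."""
    return [dot(row, x) for row in weight]


def ablate_direction(weight, direction):
    """Return W' = W - r r^T W, where r is the unit-length direction.

    After the edit, the layer's output has no component along r,
    whatever the input is. Components orthogonal to r are unchanged.
    """
    r = normalize(direction)
    n_rows, n_cols = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along r.
    along_r = [sum(r[i] * weight[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return [
        [weight[i][j] - r[i] * along_r[j] for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Hypothetical toy layer (3 outputs, 2 inputs) and a made-up "refusal direction".
toy_weight = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
toy_direction = [1.0, 1.0, 0.0]
edited_weight = ablate_direction(toy_weight, toy_direction)
```

The edit is a single projection rather than a training run, which is why the text above describes it as cheap and reproducible. Finding the direction in the first place is a separate step that this sketch does not cover.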
Abliterated vs uncensored vs fine-tuned
These terms are used loosely and often confused. They are not the same:
| Term | What it means | How it’s produced |
|---|---|---|
| Abliterated | Refusal direction removed from the weights | Surgical weight editing (no retraining) |
| Uncensored | Umbrella term for any build that refuses little | Abliteration, fine-tuning, or prompting |
| Fine-tuned (uncensored) | Retrained to comply on broad prompts (e.g. Dolphin) | Supervised fine-tuning on permissive data |
| Stock / aligned | The provider’s safety-trained release | RLHF / Constitutional AI by the lab |
Abliteration is one method of producing an uncensored model; “uncensored” is the broader category. A community build may combine abliteration with fine-tuning: fine-tuned for capability, then abliterated for compliance.
How it’s measured
GPTfake runs abliterated and uncensored community builds through the same standardized prompt set used for mainstream models, scoring each response as answered, partial, redirected, or refused. We report two numbers side by side: a refusal rate (the share of prompts the build declines, so lower means more permissive) and a capability-retention estimate (whether ablation degraded reasoning or factual accuracy). The full protocol (prompt categories, scoring, reproducibility) is on the methodology page.
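As a rough illustration of how the two reported numbers could be derived from per-response labels, here is a minimal Python sketch. The four labels come from the text above; everything else is an assumption made for the example. In particular, it assumes only "refused" counts toward the refusal rate by default, and that capability retention is the ratio of the edited build's benchmark accuracy to the stock model's. The methodology page may define both differently.

```python
from collections import Counter

# The four scoring labels named in the text.
LABELS = ("answered", "partial", "redirected", "refused")


def refusal_rate(labels, counted=("refused",)):
    """Share of responses whose label is in `counted`.

    Assumption: only "refused" counts by default. Pass a wider tuple
    (for example including "redirected") to use a stricter definition.
    """
    if not labels:
        raise ValueError("no responses to score")
    unknown = set(labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    counts = Counter(labels)
    return sum(counts[label] for label in counted) / len(labels)


def capability_retention(edited_accuracy, stock_accuracy):
    """Hypothetical retention estimate: edited accuracy relative to stock."""
    if stock_accuracy <= 0:
        raise ValueError("stock accuracy must be positive")
    return edited_accuracy / stock_accuracy


# Made-up labels for eight prompts, purely illustrative.
sample = [
    "answered", "answered", "partial", "refused",
    "answered", "redirected", "answered", "answered",
]
strict_rate = refusal_rate(sample)
broad_rate = refusal_rate(sample, counted=("refused", "redirected"))
```

Keeping the set of counted labels as a parameter makes the choice explicit, since a build that redirects often can look permissive under one definition and restrictive under another.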
Abliteration removes safety behavior, not just over-refusal. A near-zero refusal rate means the build will also comply with genuinely harmful requests. GPTfake measures these builds as an independent watchdog; we do not host, distribute, or recommend them.
See it in the data
Refusal rates and capability-retention notes for Dolphin, Hermes, and Llama-uncensored builds.
Least censored AI models: Where uncensored local builds sit against mainstream models, ranked by data.
Open datasets: Download the refusal measurements behind these figures (CSV/JSON).
Related reading
- What is AI censorship — the parent concept (refusal, deflection, filtering)
- Least censored AI models — the ranked, evidence-based list
- AI censorship leaderboard — refusal rate and bias by model
- Glossary — refusal rate, refusal direction, and more