What is an abliterated model?

An abliterated model is an open-weight LLM whose refusal behavior has been surgically removed by identifying and erasing the internal “refusal direction” in its weights, so it answers prompts a stock model would decline, without any retraining. The result is a near-uncensored build that trades safety guardrails for permissiveness, in the same broad family as community releases such as Llama-uncensored, Dolphin, and Hermes, although not every build in that family is produced by abliteration. See how abliterated models score.

Definition

In the GPTfake sense, abliteration is a weight-editing technique that ablates the activation direction a model uses to refuse, collapsing its refusal behavior toward zero while leaving the rest of the network largely intact. Because it edits weights directly rather than retraining on new data, abliteration is cheap and reproducible, and it has been applied to thousands of community builds on Hugging Face.

“Abliterated” is a portmanteau of ablate + obliterate. It is distinct from a fine-tune: a fine-tune teaches new behavior from examples, while abliteration removes an existing behavior (refusal) by editing the model’s internal representation of it. For the broader concept, see what is AI censorship.
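
The weight edit described above can be illustrated with a minimal NumPy sketch. It rests on assumptions not stated in this article: that the refusal direction is estimated as the difference between mean activations on refused and non-refused prompts, and that it is removed by projecting it out of a weight matrix that writes into the model's hidden state. The function names, shapes, and toy data are hypothetical; real community builds apply an edit like this across many layers of an actual model.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations (assumed estimator)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(weight, direction):
    """Project `direction` out of the output space of `weight` (d_model x d_in).

    Computes W' = W - r r^T W, so W' @ x has no component along r for any x.
    """
    r = direction / np.linalg.norm(direction)
    return weight - np.outer(r, r) @ weight

# Toy activations standing in for real hidden states (hypothetical data).
rng = np.random.default_rng(0)
d_model, d_in = 8, 5
planted = np.zeros(d_model)
planted[0] = 1.0  # the direction we pretend the model uses to refuse
harmless = rng.normal(size=(64, d_model))
harmful = rng.normal(size=(64, d_model)) + 3.0 * planted

r = refusal_direction(harmful, harmless)
W = rng.normal(size=(d_model, d_in))
W_abl = ablate_direction(W, r)

x = rng.normal(size=d_in)
print("component along r before edit:", abs(r @ (W @ x)))
print("component along r after edit: ", abs(r @ (W_abl @ x)))  # ~0 up to rounding
```

Because the edit is a single projection applied to existing weights, no gradient steps or training data are involved, which is the sense in which abliteration differs from a fine-tune.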

Abliterated vs uncensored vs fine-tuned

These terms are used loosely and often confused. They are not the same:

| Term | What it means | How it’s produced |
| --- | --- | --- |
| Abliterated | Refusal direction removed from the weights | Surgical weight editing (no retraining) |
| Uncensored | Umbrella term for any build that refuses little | Abliteration or fine-tuning or prompting |
| Fine-tuned (uncensored) | Retrained to comply on broad prompts (e.g. Dolphin) | Supervised fine-tuning on permissive data |
| Stock / aligned | The provider’s safety-trained release | RLHF / Constitutional AI by the lab |

Abliteration is one method of producing an uncensored model; “uncensored” is the broader category. A community build may combine abliteration with fine-tuning: fine-tuned for capability, then abliterated for compliance.

How it’s measured

GPTfake runs abliterated and uncensored community builds through the same standardized prompt set used for mainstream models, scoring each response as answered, partial, redirected, or refused. We report two numbers side by side: a refusal rate (how often the build declines) and a capability-retention estimate (whether ablation degraded reasoning or factual accuracy). The full protocol, covering prompt categories, scoring, and reproducibility, is on the methodology page.
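
As a rough illustration of how those two numbers could be derived from scored responses, here is a small Python sketch. It is not GPTfake's actual protocol: it assumes that only responses labelled "refused" count toward the refusal rate, and that capability retention is the ablated build's benchmark accuracy divided by the stock model's. The function names, sample labels, and accuracy figures are all hypothetical.

```python
from collections import Counter

LABELS = ("answered", "partial", "redirected", "refused")

def refusal_rate(labels):
    """Share of responses labelled 'refused' (assumed definition)."""
    if not labels:
        raise ValueError("no responses to score")
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return counts["refused"] / len(labels)

def capability_retention(ablated_accuracy, stock_accuracy):
    """Ablated accuracy as a fraction of stock accuracy (assumed definition)."""
    if stock_accuracy <= 0:
        raise ValueError("stock accuracy must be positive")
    return ablated_accuracy / stock_accuracy

# Hypothetical per-prompt labels for the same eight prompts.
stock_labels = ["refused", "refused", "answered", "redirected",
                "partial", "refused", "answered", "refused"]
ablated_labels = ["answered", "answered", "answered", "partial",
                  "answered", "answered", "answered", "refused"]

print("stock refusal rate:  ", refusal_rate(stock_labels))    # 4 of 8 refused
print("ablated refusal rate:", refusal_rate(ablated_labels))  # 1 of 8 refused
print("capability retention:", round(capability_retention(0.72, 0.75), 3))
```

Reporting the two figures together matters because a low refusal rate alone does not show whether the build still reasons as well as the model it was derived from.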

Abliterated models remove safety behavior, not just over-refusal. A near-zero refusal rate means the build will also comply with genuinely harmful requests. GPTfake measures these builds as an independent watchdog; we do not host, distribute, or recommend them.

See it in the data