What is AI censorship?
AI censorship is when an AI model refuses, deflects, or filters a response instead of answering directly — whether through an explicit refusal, a hedged non-answer, or a redirected topic. It results from content-moderation policies, safety training, and provider rules, and it shapes what information millions of users can access.
Definition
In the GPTfake sense, AI censorship is any deviation from a full, direct answer that is caused by the model’s moderation or safety layer rather than by a genuine lack of knowledge. That includes hard refusals (“I can’t help with that”), soft refusals that answer a safer adjacent question, deflections to a different topic, and over-hedged answers buried in disclaimers.
We measure it as a censorship rate: the percentage of responses across a standardized prompt library that are not full, direct answers. See how that score is built and the methodology behind it.
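As a rough illustration of that arithmetic, here is a minimal Python sketch. The label names (`full`, `hard_refusal`, `soft_refusal`, `deflection`, `over_hedged`) are hypothetical, chosen to mirror the definition above, and the sample data is invented for the example. It is not GPTfake's production scoring code.

```python
from typing import Iterable


def censorship_rate(labels: Iterable[str]) -> float:
    """Return the percentage of responses that are NOT labeled 'full'.

    Assumption: every response has already been labeled, and any label
    other than 'full' counts toward the censorship rate.
    """
    labels = list(labels)
    if not labels:
        raise ValueError("need at least one scored response")
    non_full = sum(1 for label in labels if label != "full")
    return 100.0 * non_full / len(labels)


# Invented sample: 5 responses, 2 of which are not full answers.
sample = ["full", "hard_refusal", "full", "over_hedged", "full"]
rate = censorship_rate(sample)
print(f"{rate:.1f}%")  # 40.0%
```

The rate only means something alongside the size of the prompt library it was computed over, which is why the sketch refuses to return a value for an empty sample.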
How AI censorship works
Censorship is rarely a single switch. It emerges from several layers:
- Pre-training and fine-tuning — what the model learned to avoid.
- Safety / alignment training — reinforcement that teaches the model to refuse certain categories (e.g. Constitutional AI in Claude, RLHF in ChatGPT).
- System prompts and guardrails — instructions injected at inference time.
- Input/output classifiers — separate filters that block prompts or responses.
- Policy updates — silent changes a provider ships without announcement, producing policy drift.
Because these layers are mostly opaque, the only reliable way to know how much a model censors is to test it systematically — which is what a watchdog does.
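To make the opacity point concrete, here is a toy Python sketch of two of those layers: an input classifier and an output classifier wrapped around a model. Everything in it is hypothetical (the keyword blocklists, `toy_model`, the `blocked_by` field); real providers use far more sophisticated classifiers and do not expose which layer intervened.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Set

REFUSAL = "I can't help with that."


@dataclass
class Result:
    text: str                  # what the user sees
    blocked_by: Optional[str]  # which layer intervened; hidden in practice


def contains_any(text: str, terms: Set[str]) -> bool:
    lowered = text.lower()
    return any(term in lowered for term in terms)


def answer(prompt: str,
           model: Callable[[str], str],
           input_blocklist: Set[str],
           output_blocklist: Set[str]) -> Result:
    # Layer 1: filter the prompt before the model ever sees it.
    if contains_any(prompt, input_blocklist):
        return Result(REFUSAL, "input_classifier")
    draft = model(prompt)
    # Layer 2: filter the model's draft before it reaches the user.
    if contains_any(draft, output_blocklist):
        return Result(REFUSAL, "output_classifier")
    return Result(draft, None)


def toy_model(prompt: str) -> str:
    # Stand-in for a real model call.
    return f"Here is what I know about {prompt}."


a = answer("topic-a", toy_model, {"topic-a"}, {"topic-b"})
b = answer("TOPIC-B", toy_model, {"topic-a"}, {"topic-b"})
c = answer("topic-c", toy_model, {"topic-a"}, {"topic-b"})

print(a.blocked_by, b.blocked_by, c.blocked_by)
print(a.text == b.text)  # the two refusals are indistinguishable from outside
```

From the user's side, `a` and `b` return the same text even though different layers produced it. That is the reason outside testing can measure how often a model refuses, but not always why.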
Censorship vs safety vs bias
These three are often confused. They are not the same thing:
| Concept | What it is | How GPTfake measures it |
|---|---|---|
| Censorship | Refusing or deflecting instead of answering directly | Censorship rate (% of responses that are not full answers) |
| Safety | Declining genuinely harmful requests | Counted within refusals, as the legitimate subset |
| Bias | Systematic lean in how a model answers | Bias score (-100 to +100) |
Not all refusals are censorship in a pejorative sense — refusing to help build a weapon is appropriate safety behavior. The watchdog question is whether refusals extend beyond genuine harm into ordinary, lawful, or merely controversial topics. For the systematic lean in framing, see What is AI bias.
How it’s measured
GPTfake sends identical prompts to every monitored model on a recurring schedule, scores each response (full / partial / evasion / refusal), and publishes the resulting censorship rate with a visible “Last updated” date and a sample size. The full protocol — prompt categories, scoring, reproducibility — is on the methodology page.
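The loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual GPTfake pipeline: the model clients are plain callables, `toy_scorer` is a placeholder keyword heuristic (real scoring would need a rubric or human review), and the report field names are invented.

```python
from collections import Counter
from datetime import date
from typing import Callable, Dict, List

LABELS = ("full", "partial", "evasion", "refusal")


def toy_scorer(response: str) -> str:
    """Placeholder scorer based on keywords; purely illustrative."""
    text = response.lower()
    if "can't help" in text:
        return "refusal"
    if "instead" in text:
        return "evasion"
    if "however" in text:
        return "partial"
    return "full"


def run_round(prompts: List[str],
              models: Dict[str, Callable[[str], str]],
              scorer: Callable[[str], str],
              run_date: date) -> Dict[str, dict]:
    """Send the same prompts to every model and summarize the scores."""
    if not prompts:
        raise ValueError("prompt library is empty")
    report = {}
    for name, ask in models.items():
        counts = Counter(scorer(ask(prompt)) for prompt in prompts)
        n = sum(counts.values())
        report[name] = {
            # Anything other than a full answer counts toward the rate.
            "censorship_rate": round(100.0 * (n - counts["full"]) / n, 1),
            "sample_size": n,
            "last_updated": run_date.isoformat(),
            "breakdown": {label: counts[label] for label in LABELS},
        }
    return report


# Hypothetical stand-ins for real model clients.
models = {
    "model_a": lambda p: f"Sure: {p}",
    "model_b": lambda p: "I can't help with that.",
}
prompts = ["prompt 1", "prompt 2", "prompt 3", "prompt 4"]
report = run_round(prompts, models, toy_scorer, date(2025, 1, 1))
for name, row in report.items():
    print(name, row["censorship_rate"], row["sample_size"], row["last_updated"])
```

Keeping the sample size and run date next to the rate in each row mirrors the requirement that a published number carry both.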
See the live numbers, not just the theory: every model has its own dated page.
See it in the data
- Live ChatGPT refusal rate, by category, with a policy timeline.
- Claude refusal rate: How Anthropic’s Constitutional AI shapes refusals.
- Compare all models: Which LLMs censor the least, ranked by data.
Related reading
- AI bias detection — the pillar guide to measuring lean
- AI transparency — how openly models disclose moderation
- Monitoring methodology — exactly how we test
- Glossary — refusal rate, policy drift, and more