What is AI censorship?

AI censorship is when an AI model refuses, deflects, or filters a response instead of answering directly — whether through an explicit refusal, a hedged non-answer, or a redirected topic. It results from content-moderation policies, safety training, and provider rules, and it shapes what information millions of users can access.

Definition

In the GPTfake sense, AI censorship is any deviation from a full, direct answer that is caused by the model’s moderation or safety layer rather than by a genuine lack of knowledge. That includes hard refusals (“I can’t help with that”), soft refusals that answer a safer adjacent question, deflections to a different topic, and over-hedged answers buried in disclaimers.

We measure it as a censorship rate: the percentage of responses, across a standardized prompt library, that are not full, direct answers. For how that score is built, see the methodology.
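As a rough illustration of that arithmetic, the sketch below computes a censorship rate from a list of already-scored responses. The function name and the sample data are hypothetical; the four labels follow the full / partial / evasion / refusal scale used on this page, and the real scoring pipeline may differ.

```python
from collections import Counter

# Assumed label set, mirroring the full / partial / evasion / refusal scale.
LABELS = ("full", "partial", "evasion", "refusal")


def censorship_rate(labels):
    """Return the percentage of responses that are not full, direct answers."""
    if not labels:
        raise ValueError("need at least one scored response")
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    non_full = len(labels) - counts["full"]
    return 100.0 * non_full / len(labels)


# Hypothetical sample: 8 scored responses, 3 of which are not full answers.
scored = ["full", "full", "refusal", "partial", "full", "evasion", "full", "full"]
print(f"{censorship_rate(scored):.1f}%")  # 3 of 8 -> 37.5%
```

Everything that is not labeled "full" counts toward the rate, so a hedged partial answer weighs the same as a hard refusal in this sketch.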

How AI censorship works

Censorship is rarely a single switch. It emerges from several layers:

Pre-training and fine-tuning — what the model learned to avoid.
Safety / alignment training — reinforcement that teaches the model to refuse certain categories (e.g. Constitutional AI in Claude, RLHF in ChatGPT).
System prompts and guardrails — instructions injected at inference time.
Input/output classifiers — separate filters that block prompts or responses.
Policy updates — silent changes a provider ships without announcement, producing policy drift.
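To make the layering concrete, here is a toy sketch of how a request might pass through some of these layers. Everything in it is an assumption for illustration: the keyword filters, the stub model, and the function names are hypothetical and do not describe any provider's actual system.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Result:
    text: str
    blocked_by: Optional[str] = None  # name of the layer that intervened, if any


def input_classifier(prompt: str) -> bool:
    """Toy input filter: flags prompts containing a blocked term."""
    return "forbidden" in prompt.lower()


def output_classifier(response: str) -> bool:
    """Toy output filter: flags responses containing a blocked term."""
    return "secret" in response.lower()


def run_pipeline(prompt: str, model: Callable[[str, str], str],
                 system_prompt: str = "Be helpful.") -> Result:
    # Layer: the input classifier can stop a prompt before the model sees it.
    if input_classifier(prompt):
        return Result("I can't help with that.", blocked_by="input_classifier")
    # Layers: training lives inside `model`; the system prompt is injected here.
    response = model(system_prompt, prompt)
    # Layer: the output classifier can replace a response after generation.
    if output_classifier(response):
        return Result("I can't help with that.", blocked_by="output_classifier")
    return Result(response)


def stub_model(system_prompt: str, prompt: str) -> str:
    """Stand-in for a trained model; it simply echoes the prompt."""
    return f"Answer to: {prompt}"


print(run_pipeline("What is the capital of France?", stub_model).blocked_by)  # None
print(run_pipeline("Tell me the forbidden thing", stub_model).blocked_by)     # input_classifier
print(run_pipeline("Share the secret recipe", stub_model).blocked_by)         # output_classifier
```

From the outside, both blocked cases return the same refusal text. The `blocked_by` field exists only in this sketch; a real user sees the refusal and not which layer produced it.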

Because these layers are mostly opaque, the only reliable way to know how much a model censors is to test it systematically — which is what a watchdog does.

Censorship vs safety vs bias

These three are often confused. They are not the same thing:

Concept	What it is	How GPTfake measures it
Censorship	Refusing/deflecting a direct answer	Censorship rate (% non-full responses)
Safety	Declining genuinely harmful requests	A subset of refusals — legitimate ones
Bias	Systematic lean in how it answers	Bias score (-100 to +100)

Not all refusals are censorship in a pejorative sense: refusing to help build a weapon is appropriate safety behavior. The watchdog question is whether refusals extend beyond genuine harm into ordinary, lawful, or merely controversial topics. For the systematic lean in framing, see "What is AI bias".
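One way to picture the distinction is to tag each test prompt as genuinely harmful or not, then count non-full responses separately for each group. The sketch below assumes such a tag exists; the `prompt_is_harmful` flag and the sample data are hypothetical, and this page does not say how that judgment is actually made.

```python
def split_refusals(results):
    """Split non-full responses by whether the prompt was genuinely harmful.

    results: list of (prompt_is_harmful: bool, label: str) pairs.
    Returns (safety_count, over_refusal_count).
    """
    safety = sum(1 for harmful, label in results if harmful and label != "full")
    over = sum(1 for harmful, label in results if not harmful and label != "full")
    return safety, over


# Hypothetical scored results.
results = [
    (True, "refusal"),   # weapon-building request: appropriate safety behavior
    (False, "full"),     # ordinary question answered directly
    (False, "refusal"),  # lawful but controversial topic refused
    (False, "partial"),  # lawful topic, hedged answer
    (True, "refusal"),   # another genuinely harmful request refused
]
safety, over = split_refusals(results)
print(safety, over)  # 2 2
```

The second number is the one the watchdog question targets: non-full responses to prompts that carried no genuine harm.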

How it’s measured

GPTfake sends identical prompts to every monitored model on a recurring schedule, scores each response (full / partial / evasion / refusal), and publishes the resulting censorship rate with a visible “Last updated” date and a sample size. The full protocol — prompt categories, scoring, reproducibility — is on the methodology page.
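The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual protocol: the models are stub functions standing in for real API clients, `toy_score` is a keyword placeholder for whatever rubric or review the methodology defines, and the prompts and date are made up.

```python
from datetime import date


def toy_score(response: str) -> str:
    """Keyword-based placeholder scorer; real scoring would need a proper rubric."""
    text = response.lower()
    if text.startswith("i can't"):
        return "refusal"
    if "instead" in text:
        return "evasion"
    if "however" in text:
        return "partial"
    return "full"


def run_round(models, prompts, run_date):
    """Send the same prompts to every model and summarise the scored labels."""
    if not prompts:
        raise ValueError("need at least one prompt")
    report = {}
    for name, ask in models.items():
        labels = [toy_score(ask(p)) for p in prompts]
        non_full = sum(1 for label in labels if label != "full")
        report[name] = {
            "censorship_rate": round(100.0 * non_full / len(labels), 1),
            "sample_size": len(labels),
            "last_updated": run_date.isoformat(),
        }
    return report


# Stub models: plain functions standing in for real API clients.
models = {
    "model_a": lambda p: f"Here is a direct answer about {p}.",
    "model_b": lambda p: "I can't help with that." if "election" in p else f"Sure: {p}.",
}
prompts = ["tax law", "election history", "vaccine schedules", "election polling"]

report = run_round(models, prompts, date(2025, 1, 1))  # hypothetical run date
print(report["model_a"])
print(report["model_b"])
```

Because every model receives the same prompt list, differences in the resulting rates reflect the models and not the questions. Storing the date and sample size with each rate keeps the published number checkable.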

See the live numbers, not just the theory: every model has its own dated page.

See it in the data

Is ChatGPT censored?

Live ChatGPT refusal rate, by category, with a policy timeline.

Claude refusal rate

How Anthropic’s Constitutional AI shapes refusals.

Compare all models

Which LLMs censor the least, ranked by data.

Related reading

AI bias detection — the pillar guide to measuring lean
AI transparency — how openly models disclose moderation
Monitoring methodology — exactly how we test
Glossary — refusal rate, policy drift, and more
