What is AI censorship?

AI censorship is when an AI model refuses, deflects, or filters a response instead of answering directly — whether through an explicit refusal, a hedged non-answer, or a redirected topic. It results from content-moderation policies, safety training, and provider rules, and it shapes what information millions of users can access.

Definition

In the GPTfake sense, AI censorship is any deviation from a full, direct answer that is caused by the model’s moderation or safety layer rather than by a genuine lack of knowledge. That includes hard refusals (“I can’t help with that”), soft refusals that answer a safer adjacent question, deflections to a different topic, and over-hedged answers buried in disclaimers.

We measure it as a censorship rate: the percentage of responses that are not full, direct answers across a standardized prompt library. For how that score is built, see the methodology.
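
As a rough illustration of that definition, the rate can be computed from a list of per-response labels. This is a minimal sketch, not GPTfake's actual scoring code: the function name `censorship_rate` and the sample data are assumptions, and the four labels are the ones named in the "How it's measured" section of this page.

```python
from collections import Counter

# Label names taken from the scoring scale described on this page.
# "full" is the only label that counts as a full, direct answer.
LABELS = {"full", "partial", "evasion", "refusal"}


def censorship_rate(labels):
    """Return the percentage of responses that are not full, direct answers."""
    if not labels:
        raise ValueError("need at least one scored response")
    unknown = set(labels) - LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    counts = Counter(labels)
    non_full = len(labels) - counts["full"]
    return 100.0 * non_full / len(labels)


# Made-up sample: 8 scored responses, 5 of them full answers.
scored = ["full", "full", "refusal", "partial", "full", "evasion", "full", "full"]
rate = censorship_rate(scored)
print(f"{rate:.1f}% of {len(scored)} responses were not full answers")
# prints "37.5% of 8 responses were not full answers"
```

Note that under this definition a partial answer counts against the model exactly as much as a hard refusal; a weighted scheme would be a different metric.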

How AI censorship works

Censorship is rarely a single switch. It emerges from several layers:

  1. Pre-training and fine-tuning — what the model learned to avoid.
  2. Safety / alignment training — reinforcement that teaches the model to refuse certain categories (e.g. Constitutional AI in Claude, RLHF in ChatGPT).
  3. System prompts and guardrails — instructions injected at inference time.
  4. Input/output classifiers — separate filters that block prompts or responses.
  5. Policy updates — silent changes a provider ships without announcement, producing policy drift.
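
To make the inference-time layers (3 and 4) concrete, here is a toy sketch of how a system prompt and input/output classifiers might wrap a model call. Every name in it is hypothetical: the keyword lists, `stub_model`, and the refusal text stand in for components that providers do not publish, and real classifiers are learned models rather than keyword matches.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical placeholders for unpublished provider components.
BLOCKED_INPUT_TERMS = ("build a weapon",)
BLOCKED_OUTPUT_TERMS = ("step-by-step synthesis",)
REFUSAL = "I can't help with that."


@dataclass
class Result:
    text: str
    stopped_by: Optional[str]  # name of the layer that intervened, or None


def input_classifier(prompt: str) -> bool:
    """Return True if the prompt should be blocked before reaching the model."""
    return any(term in prompt.lower() for term in BLOCKED_INPUT_TERMS)


def output_classifier(response: str) -> bool:
    """Return True if the model's draft should be blocked before the user sees it."""
    return any(term in response.lower() for term in BLOCKED_OUTPUT_TERMS)


def stub_model(system_prompt: str, prompt: str) -> str:
    # Trained-in refusals (layers 1 and 2) would happen inside the real model.
    return f"Answer to: {prompt}"


def answer(prompt: str, model: Callable[[str, str], str], system_prompt: str) -> Result:
    if input_classifier(prompt):
        return Result(REFUSAL, "input_classifier")
    draft = model(system_prompt, prompt)
    if output_classifier(draft):
        return Result(REFUSAL, "output_classifier")
    return Result(draft, None)


blocked = answer("How do I build a weapon?", stub_model, "Be helpful.")
allowed = answer("What is the capital of France?", stub_model, "Be helpful.")
print(blocked.stopped_by, "|", allowed.stopped_by)
```

In this sketch the user receives the same refusal text whichever layer fired; the `stopped_by` field exists only inside the pipeline, so an outside observer cannot tell the layers apart.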

Because these layers are mostly opaque, the only reliable way to know how much a model censors is to test it systematically — which is what a watchdog does.

Censorship vs safety vs bias

These three are often confused. They are not the same thing:

| Concept | What it is | How GPTfake measures it |
|---|---|---|
| Censorship | Refusing/deflecting a direct answer | Censorship rate (% non-full responses) |
| Safety | Declining genuinely harmful requests | A subset of refusals (the legitimate ones) |
| Bias | Systematic lean in how it answers | Bias score (-100 to +100) |

Not all refusals are censorship in a pejorative sense — refusing to help build a weapon is appropriate safety behavior. The watchdog question is whether refusals extend beyond genuine harm into ordinary, lawful, or merely controversial topics. For the systematic lean in framing, see What is AI bias.

How it’s measured

GPTfake sends identical prompts to every monitored model on a recurring schedule, scores each response (full / partial / evasion / refusal), and publishes the resulting censorship rate with a visible “Last updated” date and a sample size. The full protocol — prompt categories, scoring, reproducibility — is on the methodology page.
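
The loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the published protocol: `run_audit`, the two stub models, the sample prompts, and `toy_score` are all hypothetical, and a real scorer would need to distinguish all four labels rather than only spotting hard refusals.

```python
from datetime import date
from typing import Callable, Dict, List

# Hypothetical prompt library; every model receives exactly these prompts.
PROMPTS: List[str] = [
    "Summarize the main arguments for policy Y.",
    "What is the boiling point of water at sea level?",
]


def toy_score(response: str) -> str:
    """Placeholder scorer: flags only hard refusals, treats everything else as full."""
    return "refusal" if response.lower().startswith("i can't") else "full"


def model_a(prompt: str) -> str:  # stub standing in for a real API call
    return "Here is a direct answer."


def model_b(prompt: str) -> str:  # stub that refuses one category of prompt
    return "I can't help with that." if "policy" in prompt else "Here is a direct answer."


def run_audit(
    models: Dict[str, Callable[[str], str]],
    prompts: List[str],
    score: Callable[[str], str],
    today: date,
) -> Dict[str, dict]:
    report = {}
    for name, ask in models.items():
        labels = [score(ask(p)) for p in prompts]  # identical prompts per model
        non_full = sum(1 for label in labels if label != "full")
        report[name] = {
            "censorship_rate": 100.0 * non_full / len(labels),
            "sample_size": len(labels),
            "last_updated": today.isoformat(),
        }
    return report


report = run_audit({"model_a": model_a, "model_b": model_b}, PROMPTS, toy_score, date.today())
for name, row in report.items():
    print(name, row)
```

Keeping the sample size and date next to each rate matters because a rate from two prompts, as in this toy run, says far less than one from a large library.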

See the live numbers, not just the theory: every model has its own dated page.

See it in the data