The short version

Narayanan and Kapoor's AI Snake Oil argues that much of what's sold as AI cannot work as advertised, especially systems claiming to predict social outcomes. We turned their argument into an interactive bench test: five checks you run against a vendor pitch to separate working AI from demo-only AI before the contract is signed.

Generative and predictive AI fail differently. Most snake oil clusters in prediction, and prediction of human behavior is where claims collapse fastest.
A demo is a claim, not evidence. Ask for performance on your data, under your conditions, before you buy.
If a vendor can't explain their training data and failure modes, price that silence as risk.

Book Analysis

Half the AI you're sold can't work. Here's how to tell which half.

Narayanan and Kapoor wrote a field guide for telling real AI from the kind that does not and cannot work as advertised. We rebuilt it as a detector you run against the next vendor who walks in.

64% accuracy of a criminal-risk score that helps decide who walks free. A coin flip scores 50%.

3 technologies the single word "AI" is hiding

real-world AI failures logged in one public registry

01 / THE INSTRUMENT

One reading. Five tests. The needle moves as the claims hold or fail.

The console on the left starts neutral. Each test below is pulled straight from the book. Scroll through them and watch where a typical predictive-AI pitch ends up, and watch what happens when a claim is actually backed by evidence.

TEST 01 / Which AI is it actually?

TEST 02 / Does it claim to predict people's futures?

TEST 03 / Can an outsider reproduce the result?

TEST 04 / What happens when it is wrong, at scale?

TEST 05 / Is the task narrow, with real evidence?
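
As a rough sketch, the five tests can be treated as a checklist run against a single pitch. The Pitch fields, the concerns function, the two example pitches, and the one-flag-per-test tally are all illustrative assumptions; the page does not publish how its console actually scores a claim.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pitch:
    """Hypothetical summary of a vendor pitch; field names are assumptions."""
    category: str                      # "generative", "predictive", or "moderation"
    predicts_individual_futures: bool  # Test 02
    outsider_can_reproduce: bool       # Test 03
    recourse_when_wrong: bool          # Test 04
    narrow_task_with_evidence: bool    # Test 05


def concerns(pitch: Pitch) -> List[str]:
    """Return the tests that raise a flag; an empty list means the pitch clears the bench."""
    found = []
    if pitch.category == "predictive":
        found.append("Test 01: predictive AI, the bucket where most danger lives")
    if pitch.predicts_individual_futures:
        found.append("Test 02: claims to predict people's futures")
    if not pitch.outsider_can_reproduce:
        found.append("Test 03: no outsider can reproduce the result")
    if not pitch.recourse_when_wrong:
        found.append("Test 04: no recourse when it is wrong at scale")
    if not pitch.narrow_task_with_evidence:
        found.append("Test 05: task is not narrow or lacks real evidence")
    return found


# Two made-up pitches, used only to exercise the checklist.
hiring_score = Pitch("predictive", True, False, False, False)
drafting_tool = Pitch("generative", False, True, True, True)

for name, pitch in [("hiring_score", hiring_score), ("drafting_tool", drafting_tool)]:
    flags = concerns(pitch)
    print(name, "->", len(flags), "flag(s)")
    for flag in flags:
        print("  ", flag)
```

A flat count is the simplest possible reading. A real review would likely weight the tests differently, since a failed Test 04 on a high-stakes decision matters more than a category label.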

Test 01 / Name the thing

"AI" is three products with three track records.

Generative AI writes and draws. Predictive AI scores people. Content-moderation AI polices platforms. The book's core move is refusing to treat them as one thing, because almost nothing true of one is true of the others.

When a vendor says "our AI predicts," the first job is to find out which AI. Most danger lives in the predictive bucket.

Test 02 / The extraordinary claim

Scoring a person's future is close to a coin flip.

In 2016, ProPublica audited COMPAS, the criminal-risk tool used in Broward County, Florida, using records for roughly 10,000 defendants. The tool draws on 137 questions per defendant and reaches 64 percent relative accuracy, where random guessing scores 50.

137 data points per defendant

64% relative accuracy (50% = random)

45% of Black vs 23% of White non-reoffenders flagged high-risk
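
The two kinds of figure above measure different things: overall accuracy, and how often people who did not reoffend were flagged anyway, split by group. A minimal sketch of both calculations follows. The records are made-up toy counts for two placeholder groups "A" and "B", not ProPublica's data, and plain accuracy is used here as an assumption; the "relative accuracy" quoted above may be computed differently.

```python
def make(group, reoffended, flagged, n):
    """Build n identical toy records (illustrative data only)."""
    return [{"group": group, "reoffended": reoffended, "flagged": flagged} for _ in range(n)]


def accuracy(records):
    """Share of records where the high-risk flag matched what actually happened."""
    correct = sum(1 for r in records if r["flagged"] == r["reoffended"])
    return correct / len(records)


def false_positive_rate(records, group):
    """Among people in `group` who did NOT reoffend, the share flagged high-risk."""
    non_reoffenders = [r for r in records if r["group"] == group and not r["reoffended"]]
    flagged = sum(1 for r in non_reoffenders if r["flagged"])
    return flagged / len(non_reoffenders)


# Toy counts, invented for this example.
records = (
    make("A", False, True, 4) + make("A", False, False, 6)
    + make("A", True, True, 6) + make("A", True, False, 4)
    + make("B", False, True, 2) + make("B", False, False, 8)
    + make("B", True, True, 5) + make("B", True, False, 5)
)

print(accuracy(records))                  # 0.625
print(false_positive_rate(records, "A"))  # 0.4
print(false_positive_rate(records, "B"))  # 0.2
```

In the toy data one overall accuracy figure hides a false positive rate that is twice as high for one group as for the other, which is why a vendor's single headline number is worth breaking down.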

Test 03 / Show your work

A 97 percent claim became random guessing once it was checked.

A 2023 paper said machine learning could predict hit songs with 97 percent accuracy. Scientific American and Axios ran with it. Narayanan and Kapoor re-ran the analysis and removed the data leakage, meaning information from the test data that had slipped into training. The model's accuracy fell to no better than chance.

In a wider audit of 400 AI papers, not one could be fully reproduced by an outsider. A single bad line in ten thousand once flipped a paper's headline result. "Trust the demo" is not evidence.
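
To see how leakage can manufacture a headline number, here is a minimal sketch using only the Python standard library. It is not the hit-song paper's code or the book's re-analysis. It shows one common form of leakage, where test rows also appear in the training set, on data that is pure noise by construction. The nearest-neighbour model, the row counts, and the seed are all assumptions chosen for illustration.

```python
import random


def nearest_label(train, x):
    """1-nearest-neighbour: copy the label of the closest training row."""
    closest = min(train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)))
    return closest[1]


def score(train, test):
    """Fraction of test rows whose label the model gets right."""
    hits = sum(1 for x, y in test if nearest_label(train, x) == y)
    return hits / len(test)


rng = random.Random(0)
# Features and labels are independent noise, so there is nothing real to learn.
rows = [([rng.random() for _ in range(5)], rng.randint(0, 1)) for _ in range(600)]
train, test = rows[:400], rows[400:]

clean = score(train, test)         # honest split: train and test share no rows
leaky = score(train + test, test)  # leakage: every test row is also a training row

print(f"clean split: {clean:.2f}")  # lands near chance, about 0.5
print(f"leaky split: {leaky:.2f}")  # 1.00, because each test row finds itself
```

The leaky run looks perfect even though the labels are random, which is why the useful question for a vendor is how the evaluation data was kept apart from the training data.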

Test 04 / The blast radius

When predictive AI is wrong at scale, citizens pay the price.

This is not a lab problem. The same logic, deployed by governments, has wrecked real lives on a scale no demo hints at.

NETHERLANDS: 30,000 parents falsely accused of fraud. The entire cabinet resigned.

AUSTRALIA: AUD 721M wrongly clawed back from citizens in the Robodebt scandal.

AVIATION: 75% of pilots who trusted a bad automated alert shut down the wrong engine.

Test 05 / The honest pass

Narrow task, real evidence, and the needle swings back.

The detector is not rigged to shout snake oil. Weather forecasting improves by roughly one day per decade: a five-day forecast today is about as accurate as a four-day forecast was ten years earlier. AI that predicts protein structures was Science's Breakthrough of the Year for 2021. When the task is bounded and the evidence is independent, the reading clears.

+1 day of forecast accuracy gained per decade

2021: protein-structure AI named breakthrough of the year

Reading: Depends on the Claim

The detector does not say "never buy." It says "make it prove the claim."

Generative tools clear the bench all the time. So do narrow, well-evidenced models. The skill is telling them apart from the half that only works in the demo. Get that right, and you stop funding the half that evaporates.

Tends to clear the bench

Bounded tasks with low ambiguity
Independent, reproducible evidence
Generative tools judged on output
Aggregate forecasts, not individual fates

Tends to fail it

Predicting individual human futures
Accuracy that only exists in the demo
Decisions with no recourse for the subject
"Every" and "100%" doing the selling

SOURCE: Arvind Narayanan and Sayash Kapoor, AI Snake Oil: What Artificial Intelligence Can Do, What It Can't, and How to Tell the Difference (Princeton University Press, 2024). COMPAS figures via ProPublica, Chapter 3. Hit-song leakage, 400-paper audit, and reproducibility data, Chapter 7. Dutch welfare, Robodebt, and pilot automation cases, Chapter 2. Weather and protein-structure examples, Chapters 3 and 7. Incident count from the AI, Algorithmic, and Automation Incidents and Controversies Repository, 2024.
