Abstraction × Judgment: A Theory of Why AI Projects Fail

Lior Zelering

6/30/2024 · 11 min read

The Theory

Every technology that reshaped how people work over the last half-century arrived as an extension of a physical or mental activity it replaced. The personal computer replaced the paper folder, the messy physical desktop, and the office trash can. Its success came from piggybacking on the analogue office, so the mental model transferred for free. E-commerce was a store you didn’t drive to, but it too extended a physical experience; even the shopping cart was designed to match the existing mental model and semiotics. The internet search engine was a faster yellow pages. Each of these innovations carried what we can call a referent practice: a thing people already did, ported into a new medium. Adoption was easy, not because the technology was simple, but because nobody had to learn a new kind of activity, only a new venue for an old one. The tool told you what it was for.

This is the first category of technology, and the internet, the personal computer, and the mobile phone all belong to it. We should resist the temptation to call their adoption frictionless: there were many missteps along the way (and there still are), as the dot-com crash, the ERP graveyard, and a decade of abandoned enterprise apps attest. But those failures were failures of execution, timing, and business model. Nobody in 1999 was confused about what the internet was or what it replaced. The referent was clear; only the building was hard.

AI breaks this pattern, and the break is the whole theory. It carries no referent practice. Ask what it replaces, and the answer fractures: a task? a role? the analyst team? an entire function? It is the same technology at every level, which means it arrives unscoped. The buyer, not the tool, must decide what it is for. This is why AI does not belong in the first category at all. It belongs with the more abstract, open-ended innovations, such as electricity.

It is important to frame this claim correctly. We use AI all the time to summarise meetings, conduct research, and write e-mails; in everyday use it does these tasks very well, and each one has an obvious referent. Even agentic AI, used this way, arguably fits the referent-practice pattern. But businesses want more: they rightly believe that AI, as a concept, can and should transform how they operate, create new revenue streams, restructure their companies, and so on. In other words, they are asking an open question with no referent behind it: “What can AI do for me now?”

The first axis: abstraction

Electricity replaced nothing in particular and everything in general. It had no single referent; it displaced candles, steam engines, and iceboxes at once, and for roughly forty years, it produced almost no measurable productivity gain, because firms had to physically reorganise themselves around it before the value appeared. Economic historians call this the dynamo paradox (Paul David); its modern restatement is Erik Brynjolfsson’s “productivity J-curve,” in which general-purpose technologies cost before they pay, because the technology is the easy part and the reorganisation is the hard part. AI is a general-purpose technology in exactly this sense, an abstract substrate waiting to be poured into practices we have not yet designed. This is the first axis of the theory, and I will call it abstraction. It explains why AI is hard to outline and slow to land.

Abstraction cannot be the whole story, and here is the test that proves it. Take an AI task and scope it completely; kill the abstraction. A low-judgment, scoped task, such as transcription or classification, works beautifully. But a high-judgment, scoped task — “draft first-pass contract summaries,” perfectly bounded, clear deliverable — still forces you to read every output to catch the clause it invented. You removed the ambiguity, and a residue remained. That residue is the second axis, the part of this theory that electricity cannot explain.

The second axis: judgment

Electricity, once wired in, ran unattended and deterministically. A motor does not need you to check whether it turned the right number of times. AI is the first general-purpose substrate whose outputs require judgment to trust, and therefore the first that cannot be wired in and left alone. I will call this axis judgment-content. It is not fixable by scoping because it scales with the amount of judgment the task itself carries, regardless of how cleanly the task is bounded.

The finished model has two terms: failure ≈ abstraction × judgment-content. Low-abstraction, low-judgment work succeeds (transcription). Low-abstraction, high-judgment work limps and demands constant supervision (scoped drafting). High-abstraction, low-judgment work lags but recovers (the broad rollout whose outputs are still checkable). And high-abstraction, high-judgment work (“replace the analyst team with AI”), unscoped and unverifiable, is where things take an ugly turn. Note that none of this yet touches token economics, infrastructure, teams, training, or deployment; we are still on the theoretical side of the project.

Aligning with what is already known

The judgment axis is not new to AI as a phenomenon; it is new to AI as a technology. Lisanne Bainbridge described its shape in 1983 in “Ironies of Automation”: automating a task leaves the human with the harder residual job of monitoring the automation and catching its errors, while eroding the skill needed to do so. To put it simply, humans have outsourced map-reading to apps to the point that we can no longer read maps very well.

Principal-agent theory in economics provides the formal cost: a principal who delegates to an agent incurs monitoring costs and bears the residual loss between what was wanted and what was delivered. That apparatus was never applied to technology, because a hammer is not an agent. AI is the first tool to land on the agent side of that line, which is exactly what words like “advises” and “collaborates” are reaching for. And the routine-biased technical change literature (Autor, Levy, Murnane) predicted that non-routine cognitive work was the ground machines could not touch. AI is disruptive precisely because it inverts that prediction and reaches into the generative layer.

The sharpest objection is the human consultant, who looks as though he should fail the same way and manifestly does not. You delegate genuine judgment to an analyst, cannot fully predict the output, must review the work, yet consulting is a trillion-dollar industry that works. The resolution reveals the theory rather than wounding it: the consultant is scoped (a statement of work) and sits inside a trust scaffold — reputation, liability, credentials, a repeated game — that substitutes for verifying every deliverable from scratch. AI has the judgment without the scaffold: no reputation at stake, no liability, no accountability, and no reliable read of what you actually meant.

Agents: the manufactured referent

If the consultant works because of a scaffold the AI lacks, the rise of AI agents is best read as the market trying to manufacture the missing referent rather than the missing scaffold. An agent wraps the abstract substrate in a familiar costume — a worker, a teammate, a named persona with a job title — so the mental model transfers for free, exactly as the desktop and the shopping cart once did. The promise is legible because it borrows a practice every manager already owns: hire someone and hand them the work. This is why agent companies lean so hard on human names and titles; the anthropomorphism is not decoration, it is the referent.

But here the manufactured referent betrays itself, and the betrayal is instructive. The desktop metaphor was honest — a digital folder really does behave like a folder, so the borrowed model fit the underlying mechanics. The agent-as-colleague metaphor is not honest in the same way. It imports the expectations of human delegation — that you can offload the task and trust the loop to close — without importing the two things that make delegation work: the trust scaffold and a reliable read of what you meant. The costume triggers the instinct to offload and forget at precisely the moment that instinct is most dangerous. So the metaphor does its damage in both directions at once: it lowers the perceived judgment-content at adoption, making the agent feel as easy to trust as a hire, while leaving the actual judgment-content untouched at delivery, where every output still has to be checked. The very thing that makes agents easy to adopt is what sets them up to disappoint — and however capable they become, an agent remains a presentation layer over the substrate, not the whole of its power.

The open variable

One variable remains open, and it determines the theory’s reach. Is the judgment axis temporary, so that better models, citation layers, and accountability structures slowly shrink it until AI behaves like electricity that mostly runs itself? Or is it structural, a permanent property of handing generative cognition to a non-agent that cannot be held to account? If temporary, the theory describes a hard transition. If structural, it describes a permanent condition, and the top-right quadrant never empties.

Either way, the theory earns its keep by predicting where failure concentrates: not around the most data-hungry systems or the largest models, but around the tasks where no automated signal can ever confirm that the output is good, so a human must, forever. That is the difference between technology that replaces the yellow pages and technology that, like electricity, we are still learning how to wire in.

PART TWO

The Operational Instrument

Operationalising the judgment axis: a scoring instrument and test protocol

The theory’s weak point is that “judgment-content” has so far been defined by example. To stop explaining wreckage after the fact and start predicting it, the construct must be scoreable from the task description alone — before anyone knows whether the project failed. This document turns the judgment axis into a measurable score, does the same for abstraction, and specifies the protocol that makes the result a genuine test rather than a post-hoc story.

1. The spine: the verification ratio

The single most useful quantity is the ratio of the cost to verify one output to the cost to produce it:

Verification ratio = (effort for a competent human to confirm one output is acceptable) ÷ (effort to produce that output from scratch)

Three regimes follow, and they map directly onto observed behaviour:

Ratio near zero. Verification is trivial relative to production. The calculator hands you a number you trust without redoing the arithmetic; transcription is confirmed by a quick read-along. The loop closes cheaply. These tasks work.

Ratio near one. Verification costs roughly what production did. To know whether the strategy memo is good, you have to re-reason it. The “saving” from outsourcing is illusory because the work merely relocated from production to verification.

Ratio above one. Verification costs more than production, because you must reconstruct reasoning you never performed. Here outsourcing is net-negative: using the tool increases total cost. This regime gives the instrument its sharpest practical output (Section 7).
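
The three regimes can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: the function names and the effort figures are hypothetical, and `net_saving_fraction` rests on an assumption the text does not state, namely that the tool's own generation cost is negligible, so verification is the only human cost left.

```python
def verification_ratio(verify_minutes: float, produce_minutes: float) -> float:
    """Effort to confirm one output divided by effort to produce it from scratch."""
    if produce_minutes <= 0:
        raise ValueError("produce_minutes must be positive")
    if verify_minutes < 0:
        raise ValueError("verify_minutes must be non-negative")
    return verify_minutes / produce_minutes


def net_saving_fraction(ratio: float) -> float:
    """Share of the original production effort saved by delegating.

    Assumption (not from the text): generation cost is treated as zero,
    so the remaining human cost is verification alone.
    """
    return 1.0 - ratio


# Hypothetical effort figures in minutes: (verify, produce).
examples = {
    "transcription": (5, 60),
    "strategy memo": (110, 120),
    "unfamiliar analysis": (150, 120),
}

for task, (verify, produce) in examples.items():
    r = verification_ratio(verify, produce)
    print(f"{task}: ratio={r:.2f}, net saving={net_saving_fraction(r):+.0%}")
```

Under that assumption the saving is simply one minus the ratio, which is why a ratio near one saves almost nothing and a ratio above one turns the saving negative.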

The verification ratio is the spine, but judgment-content has more structure than one number captures. The next section decomposes it into four named dimensions, each scoreable before the outcome is known.

2. Scoring the judgment axis

Score each dimension 0, 1, or 2 by answering the question about the task, never about the result. Add the four for a judgment score (0–8); divide by 8 to put it on a 0–1 scale.

| Dimension | Question (asked before any output exists) | Score 0 | Score 1 | Score 2 |
|---|---|---|---|---|
| Verification ratio | What share of production effort must a reviewer spend to confirm one output? | <20% (skim, spot-check, run the test) | ~20–80% (close re-read, recompute key parts) | ≥80% (essentially re-derive it) |
| Ground-truth signal | Is there an objective signal that confirms correctness without human judgment, and how fast? | Yes, automatic and fast (tests pass, conversion measured, checksum) | Exists but delayed and/or noisy (sales next quarter, slow A/B) | None; only human judgment ever arbitrates |
| Silent-failure risk | When the output is wrong, how obvious is the error on inspection? | Glaring (garbled text, broken code) | Findable with deliberate effort | Plausible and silent (fabricated citation, subtly wrong figure, omitted clause) |
| Intent gap | Can “correct” be fully specified in advance, or does acceptability depend on tacit context the requester holds but didn’t state? | Fully specifiable; a rubric exists | Mostly, with human calls at the edges | “I’ll know it when I see it”; acceptability is tacit |

The ground-truth dimension hides a lever worth emphasising: its delay. A task can have a ground-truth signal and still fail if that signal arrives too late to close the loop in time — the operator flies blind in the interval, accumulating systematic error before reality reports back. The Zillow case in Section 6 is exactly this failure.
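
As a minimal sketch of how the judgment rubric could be applied mechanically, the code below sums the four 0–2 dimension scores and normalises by 8, as described above. The identifier names are my own, and the example task scores are hypothetical placeholders rather than measured values.

```python
JUDGMENT_DIMENSIONS = (
    "verification_ratio",
    "ground_truth_signal",
    "silent_failure_risk",
    "intent_gap",
)


def verification_band(share_of_production_effort: float) -> int:
    """Map the reviewer's share of production effort to the rubric's 0/1/2 band."""
    if share_of_production_effort < 0:
        raise ValueError("share must be non-negative")
    if share_of_production_effort < 0.20:
        return 0
    if share_of_production_effort < 0.80:
        return 1
    return 2


def judgment_score(scores: dict[str, int]) -> float:
    """Sum the four 0-2 dimension scores and normalise to a 0-1 scale."""
    if set(scores) != set(JUDGMENT_DIMENSIONS):
        raise ValueError(f"expected exactly these dimensions: {JUDGMENT_DIMENSIONS}")
    if any(s not in (0, 1, 2) for s in scores.values()):
        raise ValueError("each dimension must be scored 0, 1, or 2")
    return sum(scores.values()) / 8


# Hypothetical scoring of a scoped drafting task, made from the task
# description alone (illustrative numbers, not data).
drafting = {
    "verification_ratio": verification_band(0.6),  # close re-read -> 1
    "ground_truth_signal": 2,                      # only human judgment arbitrates
    "silent_failure_risk": 2,                      # an invented clause reads plausibly
    "intent_gap": 1,                               # mostly specifiable
}
print(judgment_score(drafting))
```

Keeping the inputs restricted to the integers 0, 1, and 2 is deliberate: it forces the scorer to commit to a band rather than hedge with fractions.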

3. Scoring the abstraction axis

The other axis, scored the same way. Add the three for an abstraction score (0–6); divide by 6 to normalise.

| Dimension | Question | Score 0 | Score 1 | Score 2 |
|---|---|---|---|---|
| Scope determinacy | Is the unit of analysis given, or must the org invent it? | Given (this task, this field) | Suggested; org refines boundaries | Org must invent the unit (“transform our operations”) |
| Referent availability | Is there a prior practice this maps onto? | Clean analog | Loose analog, partly misleading | No prior practice to anchor to |
| Reorganisation load | What must change to capture the value? | Drops into an existing role/owner | Some workflow change | New roles, ownership, and acceptance criteria that don’t yet exist |
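
The abstraction score follows the same arithmetic with three dimensions instead of four. The sketch below is a hedged illustration: the function and parameter names are mine, and the example scorings are hypothetical readings of the rubric, not results.

```python
def abstraction_score(
    scope_determinacy: int,
    referent_availability: int,
    reorganisation_load: int,
) -> float:
    """Sum the three 0-2 abstraction dimensions and normalise to a 0-1 scale."""
    parts = (scope_determinacy, referent_availability, reorganisation_load)
    if any(p not in (0, 1, 2) for p in parts):
        raise ValueError("each dimension must be scored 0, 1, or 2")
    return sum(parts) / 6


# Hypothetical scorings (illustrative only).
print(abstraction_score(0, 0, 0))  # a given task, clean analog, existing owner
print(abstraction_score(1, 1, 0))  # suggested scope, loose analog, no reorganisation
print(abstraction_score(2, 2, 2))  # "transform our operations"
```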

4. Placing the project: the risk grid

Plot the project on the two axes. The prediction is ordinal — failure risk rises as both scores rise — and the four corners give the qualitative map:

Low abstraction, low judgment — Works. Transcription, classification, autocomplete. The loop closes on its own.

Low abstraction, high judgment — Limps; needs babysitting. Scoped drafting, contract review. Bounded, but no output ships unchecked.

High abstraction, low judgment — Lags, then recovers. Broad rollouts whose outputs remain checkable. The electricity pattern: slow, then fine.

High abstraction, high judgment — The graveyard. “Replace the analyst team.” Unscoped and unverifiable. Predicted near-certain failure.
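
Placing a project on the grid can be sketched as follows. Two things here are assumptions of mine rather than part of the instrument: the 0.5 cutoff that separates "low" from "high" on each normalised axis, and the use of the raw product as a rough risk index. Because the prediction is ordinal, the index is only meaningful for ranking projects against each other, not as a probability.

```python
QUADRANTS = {
    # (high abstraction?, high judgment?) -> qualitative label from the grid
    (False, False): "Works",
    (False, True): "Limps; needs babysitting",
    (True, False): "Lags, then recovers",
    (True, True): "The graveyard",
}


def place_project(
    abstraction: float, judgment: float, cutoff: float = 0.5
) -> tuple[str, float]:
    """Return the quadrant label and an ordinal risk index for two 0-1 scores.

    The cutoff of 0.5 is an assumed midpoint, not a calibrated threshold.
    """
    for name, value in (("abstraction", abstraction), ("judgment", judgment)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    quadrant = QUADRANTS[(abstraction >= cutoff, judgment >= cutoff)]
    risk_index = abstraction * judgment  # abstraction x judgment-content
    return quadrant, risk_index


# Hypothetical normalised scores (illustrative only).
for label, a, j in [
    ("transcription", 0.0, 0.125),
    ("scoped drafting", 0.17, 0.75),
    ("replace the analyst team", 1.0, 1.0),
]:
    quadrant, risk = place_project(a, j)
    print(f"{label}: {quadrant} (risk index {risk:.2f})")
```

The quadrant label and the product can disagree at the edges: a fully scoped but judgment-heavy task has a low product yet still lands in the babysitting corner, which is a reason to read the two outputs together.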

5. Making it predictive, not post-hoc

A rubric scored after you already know the outcome proves nothing. Four rules convert it into a real test:

Pre-register or score blind. Scores must be assigned from the task description by someone who does not know whether the project succeeded — or recorded before the outcome is observed. This is the single rule that separates prediction from rationalisation.

Use two independent raters and report agreement. If two scorers diverge widely on the judgment score, the construct is still too vague, and the disagreement is itself the finding — it tells you which dimension needs tighter anchors. High agreement is a precondition for taking any correlation seriously.

Control for the boring rivals. The ordinary failure literature blames data quality, budget, model capability, and executive sponsorship. The theory only earns its keep if the two scores predict failure after holding those constant — that is, if they add explanatory power beyond them. Collect them as covariates.

State the falsifiers up front. The theory is wrong if: failures scatter randomly across the grid; or they cluster by model size and data volume rather than by the two scores; or low-abstraction, low-judgment projects fail at the same rate as high-abstraction, high-judgment ones. Any of these kills it, and that is the point.
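
The second rule asks for reported agreement without naming a statistic. One common choice is Cohen's kappa, which corrects raw agreement for the agreement two raters would reach by chance; using it here is my suggestion, not something the protocol prescribes. The sketch below implements the unweighted form from scratch, and the two rating lists are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the identical score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's own distribution of scores.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical 0/1/2 scores on one dimension for eight task descriptions.
rater_a = [0, 1, 2, 2, 1, 0, 2, 1]
rater_b = [0, 1, 2, 1, 1, 0, 2, 2]
print(round(cohens_kappa(rater_a, rater_b), 3))
```

Computing the statistic per dimension, rather than on the summed score, is what lets the disagreement point to the dimension whose anchors need tightening. Since the scores are ordered, a weighted variant that penalises a 0-versus-2 split more than a 1-versus-2 split may suit the rubric better.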

6. Worked demonstration

The table below applies the rubric to recognisable cases, reporting each as Low / Moderate / High / Very high on the two axes rather than raw decimals, for readability. Read it as a demonstration of the instrument, not as evidence for the theory — these are cases whose outcomes I already knew, which is precisely the post-hoc trap Section 5 exists to avoid. Real support requires blind scoring of cases whose outcomes the scorer does not know.

| Case | Abstraction | Judgment | Predicted | What happened |
|---|---|---|---|---|
| Speech-to-text / transcription | Low | Low | Works | Ubiquitous |
| Code autocomplete (tests in loop) | Low | Low | Works | Widely adopted |
| Recommendation engine | Moderate | Low | Works | Embedded, mature |
| Algorithmic iBuying (Zillow Offers) | Moderate | High | Fragile | Shut down 2021 |
| Tier-1 support chatbot (agent replacement) | High | High | Troubled | Often rolled back |
| AI contract review / redlining | Moderate | Very high | Needs babysitting | Rarely left unattended |
| Watson for Oncology | Very high | Very high | Fails | Shelved / divested |
| “AI strategy advisor” replacing the analyst team | Very high | Very high | Fails badly | The canonical graveyard |

Two things stand out even in a rigged demonstration. First, Zillow lands a “Fragile” rather than “Fails” prediction and looks only moderately abstract, yet it collapsed — because its ground-truth signal, though real, arrived too late. Resale prices confirmed the mispricing only months downstream, by which point the losses had compounded. Delay is its own failure mode, and the instrument catches it through the ground-truth dimension even when the headline bands look survivable. Second, the cases that work are not the ones with less data or smaller models; transcription and recommendation engines are enormously data-hungry. They succeed because their loops close without a human — exactly what the theory claims, and what the data-volume rival hypothesis cannot explain.

7. The practical payoff: a decision rule

Beyond explanation, the verification ratio yields an actionable rule of a kind that a generic tool or strategy framework cannot provide:

Verification ratio under ~0.3 — delegate freely; the loop closes cheaply.

Between ~0.3 and 1 — delegate, but staff the verification explicitly and budget for it; the “saving” is smaller than it looks.

At or above 1 — do not outsource the task to AI as-is. Either decompose it into sub-tasks with a lower ratio, or supply a trust scaffold that substitutes for per-output verification — the same thing that lets firms delegate judgment to consultants: credentials, liability, an audit trail, a track record, a repeated relationship. Absent that scaffold, outsourcing a task whose verification costs as much as the work is net-negative by construction, and the project belongs in the graveyard quadrant whether or not anyone has admitted it yet.
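
The rule above translates into a small function. This is a sketch under stated assumptions: the thresholds are the approximate ones given above (about 0.3 and 1) treated as hard cutoffs, and the `has_trust_scaffold` flag is a hypothetical simplification of what is really a matter of degree.

```python
def delegation_rule(ratio: float, has_trust_scaffold: bool = False) -> str:
    """Turn a verification ratio into the recommended course of action.

    Thresholds follow the approximate bands in the text (~0.3 and 1);
    treat them as starting points, not calibrated constants.
    """
    if ratio < 0:
        raise ValueError("ratio must be non-negative")
    if ratio < 0.3:
        return "delegate freely"
    if ratio < 1.0:
        return "delegate, but staff and budget the verification"
    if has_trust_scaffold:
        return "delegate under the trust scaffold"
    return "do not outsource as-is: decompose the task or build a scaffold"


# Hypothetical ratios (illustrative only).
for ratio in (0.1, 0.6, 1.2):
    print(f"{ratio}: {delegation_rule(ratio)}")
print(f"1.2 with scaffold: {delegation_rule(1.2, has_trust_scaffold=True)}")
```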

This is where the theory stops being a lens and becomes a procedure: score the task before you build, and if it lands top-right, the correct move is not a better model but a smaller unit of analysis or a deliberately engineered scaffold of trust.
