The Chinese Room Goes to Work

In 1980 the philosopher John Searle published a thought experiment that has been annoying computer scientists ever since.¹

Imagine a room. Inside the room sits a person who speaks no Chinese. Through a slot, slips of paper come in covered in Chinese characters. The person consults a thick rulebook — written in English — that tells them, for any given input, exactly which Chinese characters to write down and pass back out. They follow the rules. The output, to anyone outside the room, is fluent, contextually appropriate Chinese.

The question Searle asks is simple. Does the person in the room understand Chinese?

His answer is no. The person is shuffling symbols according to instructions. They have no idea what the symbols mean. Whatever appears to be happening in the room — a conversation, a translation, the comprehension of language — is happening in the rulebook, and the rulebook is just a stack of paper. There is no understanding anywhere. There is only the appearance of understanding, which Searle takes to be a different and much less interesting thing.

The argument has been picked at for forty-five years. The standard counter-moves are the systems reply (the person doesn’t understand Chinese, but the system as a whole — person plus rulebook plus room — does), the robot reply (give the room sensors and effectors so it can interact with the world, and now it understands), and the brain simulation reply (write a rulebook that simulates the firing of every neuron in a Chinese speaker’s brain, and you’ve reproduced understanding). Searle has answers to all of them, and his answers are, depending on your taste in philosophy of mind, either devastating or stubborn.

What I want to do is take the room out of the academy and put it in an office.

Family resemblance

The Chinese Room and a large language model are not the same thing, but the family resemblance is hard to miss.

Both are systems that produce fluent linguistic output without being conscious in any sense we’d recognize. Both work by transforming symbols according to rules. Both pass a surface-level Turing test for many tasks and fail it for others in characteristic ways.

The differences matter, though, and they’re more interesting than the similarities.

Searle’s rulebook is written by hand. Some unspecified team of linguists has worked out, in advance, what Chinese sentence to produce in response to every Chinese input. The rulebook is finite, deterministic, explicit. If you want to know why the room said what it said, you can look up the rule.

A language model’s rulebook — to the extent the metaphor holds — is learned from data. Nobody wrote it. It emerged from gradient descent over a few trillion tokens of text. It is enormous, statistical, opaque. If you want to know why the model said what it said, you can’t look up the rule. Nobody wrote one.

That distinction sounds technical, but it has a practical consequence. Searle’s room is incredibly expensive to build (someone had to write the rulebook) and almost free to run. A language model is also incredibly expensive to build and very cheap, per query, to run, but the expense is of a different kind: compute and text rather than authors. The genuinely surprising trick of deep learning is that no human has to sit and write the rulebook. You feed in the text and the rulebook compiles itself.

This is the move that makes the Chinese Room feel suddenly relevant again, because it answers the obvious objection to the original thought experiment. Who could possibly write that rulebook? Turns out — nobody had to. We made one anyway.

Now make it do knowledge work

Searle’s rulebook converts Chinese in to Chinese out. Useful for a translator. Less useful for an analyst. So let’s update the experiment. The person in the room (let’s call her Alice, because traditionally at least one person in physics hypotheticals is called Alice) receives a Chinese-character version of an email from her boss saying “Pull together a competitive brief on three players in the smart sensor market by Friday, focus on industrial deployments, no fluff.” Alice looks up the rule. The rulebook tells her exactly which Chinese characters to put on paper and pass back out. The output is a competitive brief.

Plausible? In principle, yes. We just need a much larger rulebook. The rulebook now has to encode not only how to translate the request, but how to do the work the request describes. It needs entries for what counts as a competitive brief at this company, which three players are interesting given recent conversations Alice had with the sales team, what “no fluff” means coming from this particular boss, what level of confidence to use given that nobody has actually run the numbers. It needs entries for the dozens of micro-judgments that turn an instruction into a deliverable.

This is the version of the Chinese Room that maps onto the discussion we’re actually having about AI.

The current commentary is, roughly, that we have built, or are about to build, a rulebook large enough to do white-collar work. Mustafa Suleyman told the Financial Times in February that essentially all white-collar tasks would be fully automated within eighteen months. Dario Amodei has talked about half of entry-level white-collar jobs being eliminated. Sam Altman is fond of the word “category”: entire categories of work, gone. The capital flowing into the sector (the figure I’ve seen most often is six hundred billion dollars in 2026 alone) assumes the rulebook is essentially in hand.²

The demos support the claim. A current frontier model can draft a memo, build a financial model, assemble a deck, summarize a hundred pages, navigate a spreadsheet, fill out a form, click through a web app at something approaching human reliability for routine tasks. If you watched a screen recording of a competent agent working through a task, with the part of the screen showing the agent system blurred out, you would not always be able to tell it from a junior employee.

So why hasn’t the average company moved its org chart? Why has the headcount not dropped at the 2,000-person manufacturer or the 500-person consultancy or the regional bank?

Because the specific rulebook problem is harder than the general rulebook makes it look.

The bit you can’t see

Here is the awkward shape of the argument. The model itself is generic. It has been trained on the public internet. It knows in general what a competitive brief is, in general what good prose looks like, in general how a spreadsheet works. It does not know what this company calls a competitive brief, which competitors are interesting to us, which sources we trust and which we ignore, who reads our briefs, what they want to take away in three minutes, and what color green the brand uses in chart bars.

For the model to do Alice’s job, somebody has to tell it all of that. Someone has to write that part of the rulebook.

Some of it is straightforward to write. Brand guidelines, templates, spec sheets, examples of past work — all of that can be extracted, cleaned, and given to the model as context. There is a whole stack of products being built to make this faster: tooling for connecting models to enterprise systems (the integration layer), tooling for managing agents like employees (the platform layer), tooling for letting them operate ordinary applications by clicking and typing (the computer-use layer). The plumbing is real and it is improving fast.³

But a lot of the rulebook is in nobody’s documentation. It is in Alice’s head. It is in her muscle memory, her relationships, her sense of what the VP of Engineering will react to, her knowledge that “approved” from her boss means “I read it” while “approved” from his boss means “I saw the subject line.” It is in the dozens of small calibrations she has made over four years that, taken together, distinguish her work from work that is technically correct and organizationally useless.

To get the Chinese Room to write Alice’s brief at Alice’s standard, someone has to write that part of the rulebook down. Either Alice does it herself (by which I mean she sits down and tries to articulate, in writing, how she does her job) or someone else does it by observing her and inferring the rules. Both are extraordinarily expensive. The first is bounded by Alice’s ability to introspect on processes she has never had to make explicit. The second costs the bill rate of the consultancy you hired to do business process documentation, multiplied by the number of months it takes them to produce something Alice will say is “not quite right, but I can’t put my finger on why.”

This is what people mean, or should mean, when they say AI has a human problem. The model is the “cheap” bit. The rulebook for any specific deployment, in any specific organization, with any specific person’s job in mind, is the expensive bit. And the rulebook can only be written by humans, because the humans are the only ones who have ever had it.

So is anyone doing this work?

Yes — and the way they are doing it is informative.

Within enterprises, there is a small but growing population of people whose job is essentially to write the missing rulebook. Their titles vary: AI strategist, automation lead, internal AI advisor, prompt engineer (a phrase that should be retired but isn’t). Their work is part technical, part organizational. They sit with subject-matter experts, watch them do the job, ask the questions a new hire would ask and a few a new hire wouldn’t, and try to externalize what is tacit. Then they encode that into prompts, workflows, and agents.

Outside enterprises, the consulting firms are positioning themselves as the industrial-scale version of the same role. McKinsey, KPMG, Deloitte, BCG, and Accenture have all built agentic-organization practices in the last eighteen months. The pitch is roughly: we will help you redesign your business so that agents can do the work. The deliverables are slide decks describing M-shaped supervisors and T-shaped experts and the new organizational shapes the AI era requires. They are often interesting. They are also, almost by definition, strategic: written at the level above the level where anyone actually has to write a rulebook.⁴

Then there are the platform vendors, who are building tools that embed a default rulebook. OpenAI’s Frontier explicitly models the agent as an employee — the agent gets onboarded, it gets a permission scope, it gets feedback, it gets reviewed. Anthropic’s Cowork ships pre-built plugins for sales, finance, legal, marketing, HR, engineering, design — bundles of skills that approximate what someone in that function might do. These are useful starting points. They are not Alice.

The honest thing to say is that the rulebook for any specific job at any specific company is not going to be written by McKinsey, by Frontier, by Cowork, or by the platform’s pre-built plugin. It is going to be written by someone inside the company, because only someone inside the company knows the things that need writing down. The platform vendors will provide the editor. The consultants will provide the framework. The actual sentences will get written by the AI strategist who has been there for two years and has finally got the team they are working with to explain how the month-end close actually runs.

That work is real and it is interesting. It is also, on a per-job basis, slow.

The arithmetic question

Now we can ask the question that has been hovering since the start.

If the rulebook has to be written, and if writing it requires a competent human to spend weeks observing a job and translating tacit work into explicit instructions, then for any given job we can run a simple comparison.

Cost to write the rulebook. Cost of having the human keep doing the work.

For some jobs the comparison goes one way. If the work is high-volume, repetitive, already semi-formal (someone wrote a procedure for it ten years ago and it’s mostly still right), the rulebook investment pays back fast. Customer-service triage, invoice processing, contract review against a template, first-pass code review — all of these have been pulling in this direction for years, and AI extends the reach. The replacement stories you read about — IBM, Klarna, BT, Duolingo — tend to cluster here.⁵ They are not random.

For other jobs the comparison goes the other way, and it goes the other way hard. If the work is bespoke, low-volume, judgment-heavy, and changes meaningfully with each engagement — most of what a senior individual contributor does — the rulebook is large, expensive to write, and obsolete six months after you finish. By the time you have documented how Alice writes a competitive brief well enough to automate it, Alice has changed how she writes them, because the market changed and the boss’s preferences changed and the competitor list changed. The rulebook is an anachronism, a Polaroid of a moving target.
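
The comparison can be put into a few lines of arithmetic. What follows is a rough sketch, not a costing model: the field names and every number in it are my own illustrative assumptions, and the one idea it encodes is that a rulebook is worth writing only if it pays for itself before it goes stale.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical inputs for one job; every figure is an illustrative assumption."""
    name: str
    rulebook_cost: float       # one-off cost of writing the rulebook
    monthly_human_cost: float  # what the human doing this work costs per month
    automatable_share: float   # fraction of the work an agent can actually take over (0 to 1)
    monthly_run_cost: float    # model usage plus rulebook upkeep, per month
    shelf_life_months: float   # how long before the rulebook is obsolete


def monthly_saving(job: Job) -> float:
    """Human cost displaced each month, net of what it costs to run the agent."""
    return job.monthly_human_cost * job.automatable_share - job.monthly_run_cost


def payback_months(job: Job) -> float:
    """Months until the rulebook has paid for itself (infinite if it never does)."""
    saving = monthly_saving(job)
    return float("inf") if saving <= 0 else job.rulebook_cost / saving


def worth_writing(job: Job) -> bool:
    """The rulebook only pays off if it is repaid before it goes out of date."""
    return payback_months(job) < job.shelf_life_months


# Made-up numbers: a stable, high-volume process versus Alice's bespoke briefs.
invoices = Job("invoice processing", 60_000, 20_000, 0.75, 3_000, 36)
briefs = Job("competitive briefs", 150_000, 12_000, 0.5, 1_000, 6)

for job in (invoices, briefs):
    print(f"{job.name}: payback {payback_months(job):.1f} months, "
          f"shelf life {job.shelf_life_months} months, "
          f"worth writing: {worth_writing(job)}")
```

With these invented figures the invoice rulebook pays back in five months against a three-year shelf life, while the rulebook for Alice's briefs would need thirty months to pay back and is stale after six. The numbers are arbitrary. The shape of the comparison is the point: shelf life and automatable share matter as much as the headline cost.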

In between is the interesting middle. The work that might pay back, depending on how stable the organization is, how disciplined the documentation, how patient the leadership, and how much of the job is actually doable by an agent with a thoroughly written rulebook versus how much is the residue that the rulebook will never quite capture.

I think the honest position, in 2026, is that we don’t know where the line is. We are in the early years of finding out, the way we were in the early years of finding out which workflows belonged in spreadsheets and which didn’t. Some companies will spend a fortune writing rulebooks for jobs that should have stayed in human heads. Some companies will save a fortune by writing rulebooks for jobs that everyone else still does by hand. The ratio between those two kinds of company is what the next five years are going to settle.

Back to the room

I started with Searle, and I should end there, because the original thought experiment has one more thing to say.

Searle’s claim was that the room doesn’t understand Chinese. The interesting move, for our purposes, is not whether he was right about understanding. It is that the room worked — produced fluent, appropriate Chinese — only because someone had written the rulebook. The rulebook was the load-bearing thing. The room was just a way to execute it.

The current AI conversation has, I think, mistaken the room for the load-bearing thing. We talk about model capability as if the model is what does the work. The model is the room. The work is in the rulebook, and most of the rulebook hasn’t been written, and most of the people who could write it have other things to do. In the language of AI blog posts, the rulebook is “context,” and if we are smart about context engineering we can have the model do all sorts of things. But that still amounts to writing a rulebook for a Chinese Room.

Maybe we will write it anyway, because the prize is large enough and the tooling will keep getting better. Maybe we will discover that for a surprising amount of knowledge work the rulebook costs more than the work. I don’t know which. I suspect both, in different proportions in different places, for the next several years.

What I am fairly sure of is that the question is not “can the model do the job?” The model can do an alarming number of jobs. The question is who is going to write the rulebook, what that costs, and whether it is cheaper than just letting Alice keep doing it. That is a much less exciting question than the one we keep asking. It is also the one with the money in it.

¹ John Searle, “Minds, Brains, and Programs,” Behavioral and Brain Sciences 3, no. 3 (1980): 417–424.
² The Suleyman quote is from his February 2026 Financial Times interview; the Amodei and Altman positions have been articulated in multiple venues over the last year. The $600B 2026 figure is the one most often quoted in industry coverage; methodologies vary and I would treat it as directionally rather than precisely accurate.
³ Anthropic released the Model Context Protocol in November 2024; it had reached 97 million monthly SDK downloads by March 2026 across the major providers, and was donated to the Linux Foundation’s Agentic AI Foundation in December 2025.
⁴ McKinsey’s “The Agentic Organization” (September 2025) and the February 2026 “State of Organizations” report are the most-cited entries in this category. Their five-pillar framework is genuinely useful as a map. It is also a map.
⁵ KPMG’s Q4 2025 AI Pulse Survey reports that 64% of organisations have already altered entry-level hiring approaches due to agent adoption.