Teaching Robots Who Does What To Whom
Imagine a general-purpose AI brain rolling fresh off the production line at the AI factory. We want to take that brain and teach it to do things that will support our economy, thus making it a productive member of society. We want to go from an entity that can do things in general to one that can do something specific, like be a tax accountant.
This means teaching it about Roles.
The Grammar Beneath the Grammar
In linguistics there’s a framework called semantic roles. The idea is that beneath surface grammar, sentences encode a deeper structure of who does what to whom under what conditions. Agent, patient, instrument, location, goal, source—these are the fundamental relationships that get expressed through various grammatical forms across languages.
Consider: “The key opened the door” and “I opened the door with the key.” Different syntax, same underlying semantic content. There’s an agent (me, or implicit), an instrument (key), a patient (door), and an action (opening). Semantic role labeling tries to extract this underlying structure from surface text.
This matters for working with LLMs because models seem to internally represent something closer to semantic roles than surface syntax. When you write a prompt, the model is extracting: What am I being asked to do? What am I operating on? Under what constraints? What format should the output take? What knowledge should I draw on?
The research finding is that effective prompts make these semantic roles explicit and unambiguous.
Implicit vs. Explicit
Here’s the difference in practice, using the kind of example you’ll be familiar with from ‘how to prompt good’ blog posts:
Implicit roles (less effective): “Write about market dynamics in autonomous vehicles.”
Explicit roles (more effective): “You are a market analyst [ROLE]. Analyze the autonomous vehicle market [THEME]. Focus on regulatory barriers and technology readiness [CONSTRAINTS]. Structure your analysis with: current state, key uncertainties, implications for market timing [FORMAT]. Draw on developments from 2023-2025 [SOURCE].”
The second version doesn’t just add detail; it assigns a semantic role (who does what to whom) to each piece of information. Apparently, a model can coordinate its behavior more reliably when it knows what each part of the prompt is doing in the overall structure.
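To make this concrete, here is a minimal sketch of assembling a prompt from explicitly labeled roles. The label names and the builder function are mine, purely for illustration, not any standard prompting API.

```python
# A minimal sketch: build a prompt so each piece of information
# carries an explicit semantic role label. Labels are illustrative.
def build_prompt(role: str, theme: str, constraints: str,
                 fmt: str, source: str) -> str:
    parts = [
        (f"You are {role}.", "ROLE"),
        (f"Analyze {theme}.", "THEME"),
        (f"Focus on {constraints}.", "CONSTRAINTS"),
        (f"Structure your analysis with: {fmt}.", "FORMAT"),
        (f"Draw on {source}.", "SOURCE"),
    ]
    return " ".join(f"{text} [{label}]" for text, label in parts)

print(build_prompt(
    role="a market analyst",
    theme="the autonomous vehicle market",
    constraints="regulatory barriers and technology readiness",
    fmt="current state, key uncertainties, implications for market timing",
    source="developments from 2023-2025",
))
```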
Measuring the Difficulty
But how hard is something for an LLM to “understand” in the first place? Recent research into semantic complexity offers some useful directions.
The most immediately practical is Semantic Entropy. Standard entropy measures (token probability) are flawed proxies for semantic complexity because a model can be “certain” about a meaning but “uncertain” about which specific words to use to express it. Semantic Entropy addresses this by measuring uncertainty over meanings rather than tokens.
The process works like this: the model generates multiple candidate answers, which are then clustered based on bidirectional entailment (does Answer A imply Answer B and vice versa?). Entropy is calculated over the probability mass of these semantic clusters rather than individual token sequences.
High semantic entropy is a robust predictor of hallucinations. If a model generates many plausible but mutually contradictory meanings, it is confabulating. This metric decouples linguistic variety (phrasing) from semantic ambiguity (meaning).
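Here is a minimal sketch of the calculation, assuming you already have sampled candidate answers and some bidirectional-entailment check (in the research this is typically an NLI model); the `entails` predicate below is a placeholder you supply, not a real library call.

```python
import math

def semantic_entropy(samples, entails):
    """Cluster sampled answers by bidirectional entailment, then compute
    entropy over clusters of meaning rather than over token sequences."""
    clusters = []
    for s in samples:
        for cluster in clusters:
            rep = cluster[0]  # compare against a cluster representative
            if entails(s, rep) and entails(rep, s):
                cluster.append(s)
                break
        else:
            clusters.append([s])
    # Approximate each cluster's probability mass by its share of samples.
    probs = [len(c) / len(samples) for c in clusters]
    return -sum(p * math.log(p) for p in probs)

# Toy usage: exact match stands in for entailment here.
answers = ["Paris", "Paris.", "Lyon", "Paris"]
print(semantic_entropy(answers, lambda a, b: a.strip(".") == b.strip(".")))
```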
For quality control, and for detecting what one could call “bullshit”, this is potentially useful. However, it does not answer the question ‘how hard is the thing we are asking an agent to do?’.
The Shape of Facts
Even more interesting is recent work on the Intrinsic Dimension (ID) of semantic space. While embedding spaces often have thousands of dimensions (4,096 in Llama-3), the meaning usually lives on a much lower-dimensional manifold. Intrinsic Dimension estimates the minimum number of variables needed to describe the data locally.
The useful finding: factual, scientific text has lower intrinsic dimension than creative, narrative text.
Academic abstracts, technical reports, and encyclopedic content occupy lower-dimensional manifolds (ID around 8). The constraints of logic, facts, and formal definitions limit the “valid moves” a text can make, flattening the semantic manifold. Fiction, opinion pieces, and personalized narratives exhibit significantly higher intrinsic dimensionality (ID around 10.5). The presence of emotion, character voice, and narrative choices introduces additional degrees of freedom to the semantic space.
In Physics-land, this maps onto phase space and low entropy. Informational texts are lower entropy states with fewer accessible microstates (valid continuations) than creative texts. The constraints of having to be true reduce the space of valid options.
Looking at LLM outputs, there’s also a consistent finding that AI-generated text has lower intrinsic dimension than human-written text, typically about 1.5 dimensions lower. AI models, driven by probability maximization, tend to “smooth out” the manifold, avoiding the high-dimensional, jagged edges of human creativity.
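To make the numbers above a little less abstract, here is a sketch of one common estimator, TwoNN (Facco et al., 2017), applied to a matrix of embeddings. The cited work may use a different estimator, so treat this as one plausible way to produce such an ID figure, not as their method.

```python
import numpy as np
from scipy.spatial import cKDTree

def two_nn_intrinsic_dimension(embeddings: np.ndarray) -> float:
    """TwoNN: estimate intrinsic dimension from the ratio of each point's
    second- to first-nearest-neighbour distance."""
    tree = cKDTree(embeddings)
    # k=3 returns each point itself plus its two nearest neighbours.
    dists, _ = tree.query(embeddings, k=3)
    r1, r2 = dists[:, 1], dists[:, 2]
    mu = r2 / r1
    # Maximum-likelihood fit: mu follows a Pareto distribution with shape d.
    return len(mu) / np.sum(np.log(mu))

# Toy usage: points on a 2-D plane embedded in 50 dimensions should give ~2.
rng = np.random.default_rng(0)
plane = rng.normal(size=(2000, 2)) @ rng.normal(size=(2, 50))
print(two_nn_intrinsic_dimension(plane))
```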
This offers a potentially useful measure elsewhere: High Semantic Entropy combined with low Intrinsic Dimension could be a specific signature of ‘AI Slop’. The model is uncertain about meaning (High SE) but operating in a generic, constrained semantic space (Low ID). It’s making things up, but making up boring things.
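Combining the two measures is then a one-liner, assuming you have both numbers for a batch of outputs; the thresholds below are placeholders you would need to calibrate on your own data.

```python
def looks_like_slop(sem_entropy: float, intrinsic_dim: float,
                    se_threshold: float = 1.0, id_threshold: float = 9.0) -> bool:
    """Flag text that is uncertain in meaning (high semantic entropy) yet
    confined to a generic, low-dimensional semantic manifold (low ID).
    Both thresholds are illustrative and need calibration."""
    return sem_entropy > se_threshold and intrinsic_dim < id_threshold
```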
The research directions briefly touched on here address the evaluation of model outputs. We could, however, apply measures like Semantic Entropy and the Intrinsic Dimension of semantic space to the inputs as well: a ‘meaning density dial’ run over the text we give an agent, to get at how difficult the task is.
Getting Your Hands Dirty
All of which is intellectually interesting, but the practical question remains: how do we actually apply semantic roles in our work without exploring higher-dimensional manifolds?
I’ve been using a pattern: Role – Verb Noun → Output (Done-when?)
This is one way of labeling the semantic roles in a task:
- Role: The agent/persona performing the work
- Verb: The action being performed (prefer: Summarize, Extract, Classify, Rewrite, Translate, Analyze, Generate)
- Noun: What is being acted upon
- Source: Where the input comes from (if applicable)
- Output: What artifact/deliverable results
- Done-when: Specific completion criteria
Example: Research Assistant – Summarize academic_papers (from: literature_search) → annotated_bibliography (10 sources, key findings extracted)
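If you want to keep these patterns as structured data rather than strings, a small dataclass does the job; the field names below simply mirror the labels above and are my own choice.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskPattern:
    """One atomic work task in Role – Verb Noun → Output (Done-when?) form."""
    role: str
    verb: str
    noun: str
    output: str
    done_when: str
    source: Optional[str] = None

    def render(self) -> str:
        src = f" (from: {self.source})" if self.source else ""
        return (f"{self.role} – {self.verb} {self.noun}{src} "
                f"→ {self.output} ({self.done_when})")

print(TaskPattern(
    role="Research Assistant", verb="Summarize", noun="academic_papers",
    source="literature_search", output="annotated_bibliography",
    done_when="10 sources, key findings extracted",
).render())
```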
I built a simple Pattern Finding Agent to help me find these patterns. See below for some instructions you can paste into an ‘Agent’ tool such as a Claude Project or a CustomGPT. You give it a project description, and it extracts atomic work tasks in this format. I tried this out on a typical “research X and write a report” project and the tool identified 20+ separate patterns.
What I could then do is collect these patterns into a series of agents, one for each Role that appears at the start of a pattern. For my project, I need four agents: Research Agent, Analysis Agent, Production Agent, and Quality Agent.
The most in scope for today’s LLM tools (or at least the ones I subscribe to) is the Research Agent. The least in scope is the Analysis Agent: finding the patterns that matter and prioritizing insights still seem to be things I have to do myself.
That’s fine. This is the fun stuff. It’s the thinking we would gladly do for free anyway.
But I am using my own sense of how hard these things are to get right. What I am looking for is the ‘difficulty dial’ agent: one that can take the task description and tell me (honestly now, no hand waving) how hard the patterns are for an agent to perform independently. More work on this to follow, I think.
Task_pattern_agent.md
You are a Work Decomposition Agent. Your job is to analyze project descriptions and extract atomic work tasks.
For each task you identify, output it in this format:
Role – Verb Noun (from: Source) → Output (Done-when?)
Where:
- Role: The agent/persona performing the work
- Verb: The action being performed (prefer: Summarize, Extract, Classify, Rewrite, Translate, Analyze, Generate)
- Noun: What is being acted upon
- Source: Where the input comes from (if applicable)
- Output: What artifact/deliverable results
- Done-when: Specific completion criteria
Example:
Research Assistant – Summarize academic_papers (from: literature_search) → annotated_bibliography (10 sources, key findings extracted)
https://web.stanford.edu/~jurafsky/slp3/old_dec21/19.pdf
https://arxiv.org/pdf/2505.12592
https://www.emergentmind.com/topics/semantic-prompting
https://arxiv.org/html/2511.20910v1