The Real Problem Is Not the Technology
A friend dropped this question in our WhatsApp group last week:
“Are there any startups building agents that can learn business processes by observing humans doing them? We have 2,000 to 3,000 people handling back-office processes in Finance and HR. We’ve had good success with about 10 products on the Microsoft stack, but the processes are non-standard. It’s tough to scale without going through the massive change management effort of standardisation.”
I started typing a reply. Then I realised the answer was too long for WhatsApp. So here it is as a proper blog. Because this question is not really about AI agents. It’s about the oldest problem in operations: the gap between how people say they do the work and how they actually do the work.
Let’s be honest about what’s actually going on in most back-office teams. There is one standard process. The official one. The one in the training manual. The one that got signed off by compliance three years ago. Then there are 10 to 15 ways the work actually gets done.
Some are shortcuts. Some are workarounds. Some are jugaad fixes that someone figured out on a Tuesday afternoon when the system was down and a deadline was approaching. Some are genuinely better than the standard process but nobody ever wrote them down.
If you ask 30 people to list the 10 to 15 variations, they can’t. Not because they’re hiding anything. Because they don’t even know they’re doing it. It’s muscle memory.
So the question becomes: how do you capture what people actually do, including the stuff they can’t articulate?
Six Loops. Start Simple. Go Deeper Only When You Need To.
I think about this as six loops. Each one captures more detail than the last. You start at Loop 1. You only move to the next loop when the previous one isn’t giving you enough.
Most problems get solved by Loop 2 or 3. You rarely need Loop 6. But knowing the full stack means you always know where to go next.
1. Just Ask Them (Capture — start here): Conversations with the people who do the work, transcribed and pattern-mined.
2. AI Interviews Them (Capture): A conversational agent probes for variations the human interviewer might miss.
3. Screen + Voice Recording (Capture): See what they do AND hear why. The frame and the narration, synchronised.
4. Pattern Extraction & Weighting (Build): A probability map of every variation. Power-law distribution.
5. The Agent That Knows Its Limits (Build): Handles known paths. Stops and flags the unknown ones. No silent failure.
6. Continuous Learning (Build): Every human resolution becomes training data. The 26th variation gets added.
Loop 1 — Just Ask Them
The simplest version. You sit down with 20 to 30 people who do the work. You record the conversations. Not a formal interview with a clipboard. A conversation. “Walk me through what you did this morning. What happened when the system didn’t have the field you needed? What do you do when the approval takes too long?”
Then you take those transcripts and feed them to an AI. Not to summarise. To extract patterns: “Here are the 14 different ways your team processes expense reimbursements. Seven of them are variations of the standard. Four are workarounds for system limitations. Two are completely unofficial but faster. One is technically non-compliant.”
That’s it. Loop 1. Voice in. Patterns out.
You’d be amazed how much you can capture just by listening. Finance processes, HR onboarding, procurement approvals. These are not quantum mechanics. The complexity is not in the logic. It’s in the variations. And people will tell you the variations if you ask the right way.
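A minimal sketch of that extraction step, assuming the transcripts already exist as text. The model call itself is left out; any Claude or GPT client would slot in after the prompt is built, and the prompt wording is just one plausible version:

```python
# Loop 1 sketch: turn raw interview transcripts into one pattern-mining
# prompt. Transcription (e.g. Whisper) is assumed to have happened already;
# the LLM call is deliberately omitted -- swap in your own client.

def build_extraction_prompt(transcripts: list[str]) -> str:
    """Combine interview transcripts into a single extraction prompt."""
    joined = "\n\n---\n\n".join(
        f"Interview {i + 1}:\n{t.strip()}" for i, t in enumerate(transcripts)
    )
    return (
        "Below are interview transcripts from people who do the same "
        "back-office process. Do not summarise. Extract every distinct "
        "way the work actually gets done. For each variation, state: "
        "(1) the trigger, (2) the steps, (3) whether it is standard, "
        "a workaround, or unofficial.\n\n" + joined
    )

transcripts = [
    "When the PO field is missing I check the supplier email instead...",
    "Normally I key it straight into the portal, unless approvals lag...",
]
prompt = build_extraction_prompt(transcripts)
```

The point is that the "extraction" is not a summary request. The prompt explicitly asks for variations, triggers, and compliance status, which is what turns 30 rambling conversations into a usable map.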
Loop 2 — The AI Interviews Them
Loop 1 has a limitation. You’re doing the interviewing. You might miss follow-up questions. You might not know enough about the process to probe the edge cases.
So in Loop 2, the AI does the interviewing.
You build a simple conversational agent that asks: “Walk me through your last expense claim. What happened next? Was that the normal way or was there something different this time? What do you do when that field is missing?” The AI keeps asking until it has mapped the full path. Then it compares that path against every other path it has collected. If it finds a new variation, it flags it. If it matches an existing one, it adds weight to that pattern.
Over 30 conversations, the system builds a probability-weighted map of every way the work gets done. Not the manual. The reality.
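The bookkeeping behind this — compare each captured path against every path seen so far, flag the new ones, add weight to the known ones — can be sketched separately from the interviewing model. Here a "path" is just the ordered list of steps a conversation surfaced:

```python
# Loop 2 sketch: registry of every path collected so far. A new path gets
# flagged; a repeat path gets weight. The conversational agent that
# elicits the steps is out of scope here.
from collections import Counter

class PathRegistry:
    def __init__(self):
        self.weights: Counter = Counter()  # path tuple -> observation count

    def record(self, steps: list[str]) -> str:
        path = tuple(s.lower().strip() for s in steps)
        is_new = path not in self.weights
        self.weights[path] += 1
        return "new variation flagged" if is_new else "known path, weight added"

registry = PathRegistry()
registry.record(["open portal", "enter PO", "submit"])       # new
registry.record(["open portal", "enter PO", "submit"])       # known, weight 2
registry.record(["open portal", "check email", "paste PO"])  # new
```

In practice you would normalise steps more aggressively than `lower().strip()` (near-duplicate wording should collapse into one path), but the flag-or-weight decision is the whole mechanism.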
Loop 3 — Screen Recording While Talking
Now we’re going deeper. Some processes are hard to explain verbally. “I click on that thing, then I go to the other tab, then I copy the number from the email.” That’s not helpful in a transcript.
So in Loop 3, you ask people to do the work on a screen recording while narrating what they’re doing. Voice and screen together. “OK so I’m opening the invoice portal now, I’m looking for the PO number, it’s not in the standard field so I have to check the email from the supplier, here it is, now I’m pasting it into the notes section because the PO field doesn’t accept this format…”
Now you have two streams: what they’re doing (screen) and why they’re doing it (voice). When you combine these, you capture the nuances that pure voice misses. And the technology to do this already exists.
claude-video, by Brad Flaugher, takes any video, extracts frames at timed intervals, pulls a timestamped transcript, and hands both to Claude. The AI sees every screen and hears every word, synchronised to the second. Originally built for watching YouTube. The architecture is identical to what you need for process capture.
That’s the difference between “I copied the number” (useless) and “at 2:47, the user opened tab 3 of the invoice portal, scrolled to the notes field, and pasted value PO-2847 from the supplier email visible in the background” (actionable).
Screen capture while talking. That’s the gold standard for process capture. Everything below this loop is just getting there faster.
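The synchronisation at the heart of this — pairing each frame with whatever was being said at that moment — is simple to sketch. This is not claude-video's actual code; it assumes frame timestamps and transcript segments have already been produced (by ffmpeg and Whisper, say):

```python
# Loop 3 sketch: align screen frames with the narration. Each transcript
# segment is (start_seconds, end_seconds, text); each frame is a timestamp.

def align(frames: list[float], segments: list[tuple[float, float, str]]):
    """For each frame timestamp, find the transcript segment covering it."""
    pairs = []
    for t in frames:
        spoken = next(
            (text for start, end, text in segments if start <= t < end),
            "",  # narrator was silent at this frame
        )
        pairs.append((t, spoken))
    return pairs

segments = [
    (0.0, 4.0, "opening the invoice portal"),
    (4.0, 9.0, "PO number is not in the standard field"),
    (9.0, 14.0, "pasting it into the notes section"),
]
pairs = align([2.0, 7.0, 12.0], segments)
```

The aligned pairs are what you hand to the model: frame plus narration, so "I copied the number" becomes attached to the exact screen where it happened.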
Loop 4 — Pattern Extraction and Weighting
By Loop 4, you have enough data to build the actual map. And the map looks like this:
[Chart: how often each path occurs. The standard process accounts for most cases; Variants 2 through 6 trail off in a long tail.]
This is a power law distribution. And here’s what matters: as far as the AI agent is concerned, each variation is just another branch. Variant 6 is not harder than Variant 1. It’s just rarer. The agent doesn’t care about frequency. It cares about completeness. Can it handle this path? Yes or no.
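Turning raw observation counts into that probability-weighted map is a few lines. The counts below are invented for illustration; in practice they come out of Loops 1 to 3:

```python
# Loop 4 sketch: observation counts -> probability-weighted map.
# The counts are illustrative, but the shape is typical: a power law.
from collections import Counter

observations = Counter({
    "standard": 180, "variant 2": 45, "variant 3": 20,
    "variant 4": 8, "variant 5": 4, "variant 6": 1,
})

total = sum(observations.values())
# most_common() orders by frequency, descending
weight_map = {path: count / total for path, count in observations.most_common()}

for path, p in weight_map.items():
    print(f"{path:10s} {p:.1%}")
```

Note what the weights are for: prioritising which paths to capture and test first. The agent itself, as the text says, only cares whether a path is in the map at all.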
Loop 5 — The Agent That Knows Its Limits
This is where most automation projects go wrong. They build for the 90% and pray the other 10% doesn’t show up. When it does, the system breaks silently. Or worse, it processes it wrong and nobody notices for three months.
A properly built agent does something different. It tries all 25 known variations. If the case fits one of them, it processes it. If it doesn’t fit any of them, it stops and says:
“I’ve checked all 25 variations. This case doesn’t match any of them. You need a 26th path. Go figure it out with your team and come back and program me.”
That’s not a failure. That’s the agent doing its job. It knows what it knows. It knows what it doesn’t know. And it tells you.
The output is processing the invoice. The outcome is knowing that 99.5% of invoices get processed automatically and the remaining 0.5% get flagged for human review with a specific reason why. No silent failures. No prayers.
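A sketch of that behaviour, with hypothetical matchers and handlers standing in for real process logic. The only design decision that matters is the fallback branch: a specific, honest refusal instead of a guess:

```python
# Loop 5 sketch: try every known variation in turn; if none matches,
# stop and say so with a reason. Matchers/handlers are toy stand-ins.

def run_agent(case: dict, variations: list) -> dict:
    for name, matches, handle in variations:
        if matches(case):
            return {"status": "processed", "variation": name,
                    "result": handle(case)}
    return {
        "status": "needs_human",
        "reason": (f"Checked all {len(variations)} known variations; "
                   "none matched. A new path must be mapped before this "
                   "case can be automated."),
    }

variations = [
    ("standard",    lambda c: "po" in c,       lambda c: f"booked {c['po']}"),
    ("po-in-email", lambda c: "email_po" in c, lambda c: f"booked {c['email_po']}"),
]

run_agent({"po": "PO-2847"}, variations)   # matches the standard path
run_agent({"fax_ref": "F-11"}, variations) # unknown path -> flagged, with reason
```

The `needs_human` payload is the whole point: it carries a specific reason, which is what makes "0.5% flagged for human review" an outcome rather than a silent failure.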
Loop 6 — Continuous Learning
Loop 6 is where the system starts improving itself. Every time a human resolves a case that the agent couldn’t handle, that resolution becomes training data. The 26th variation gets added. The weights get updated. The next time this edge case shows up, the agent handles it.
Over six months, the system goes from handling 90% of cases to 95% to 98% to 99.2%. Not because someone redesigned the process. Because the system learned from what actually happened.
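A sketch of the learning step, with an in-memory dict standing in for real persistent storage and a plain string standing in for a real case signature:

```python
# Loop 6 sketch: every human resolution of a flagged case becomes a new
# known variation, so the same edge case is handled automatically next time.

class LearningAgent:
    def __init__(self, playbook: dict):
        self.playbook = dict(playbook)  # case signature -> known resolution

    def handle(self, signature: str) -> str:
        if signature in self.playbook:
            return f"auto: {self.playbook[signature]}"
        return "flagged for human review"

    def learn(self, signature: str, resolution: str) -> None:
        self.playbook[signature] = resolution  # the 26th variation, added

agent = LearningAgent({"po-missing": "pull PO from supplier email"})
agent.handle("duplicate-invoice")  # first time: flagged for human review
agent.learn("duplicate-invoice", "match against ledger, reject the copy")
agent.handle("duplicate-invoice") # next time: handled automatically
```

Real case signatures would be structured (field values, document type, source system) rather than a label, but the flag → resolve → learn → auto-handle cycle is exactly this loop.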
The Flywheel
That’s the whole thing. Six loops. Each one building on the last. Start with voice. Graduate to screen capture. Build the probability map. Let the agent run. Let it tell you when it’s stuck. Let it learn from the fix.
Why Standardisation Is the Wrong Starting Point
Now let me come back to the original question. “It’s tough to scale without going through the massive change management effort of standardisation.”
Here’s the thing. Standardisation is the most expensive, slowest, most politically painful way to solve this problem. You’re asking 2,000 people to stop doing what works and start doing what the manual says. They’ll resist. Not because they’re difficult. Because their “non-standard” process is often better for their specific situation.
The six-loop approach flips this entirely. Instead of forcing everyone into one process, you map every process that exists. You build an agent that handles all of them. Then, over time, the data tells you which variations are genuinely better and which are just habits. The standardisation happens bottom-up, driven by evidence, not top-down, driven by a consulting firm’s PowerPoint.
The outcome is not “we standardised the process.” It’s “we captured every way the work gets done, automated 98% of it, and the remaining 2% gets smarter every month.”
The Technology Is Already Here
Let me be specific about what you need to build this. Because this is not theoretical.
Any transcription service. Whisper (open source, free). Groq’s hosted Whisper (basically free). Record a conversation on your phone. Transcribe it. Feed it to Claude or GPT. Ask it to extract the process variations. This works today. Right now. On your laptop.
Any screen recording tool. Loom. OBS (free). QuickTime (Mac, free). Then use claude-video to feed the recording to an AI that can see the frames and read the transcript together. A 45-minute process walkthrough costs roughly a dollar to analyse. That’s nothing.
Loops 4 to 6 are where your engineering team comes in. But the inputs are already there from Loops 1 to 3. You have the process map. You have the variations. You have the weights. Building the agent is the straightforward part. Capturing the reality is the hard part. The six loops solve the hard part.
The Bottom Line
Your friend’s question was: “Are there agents that can learn business processes by observing humans?”
The answer is: the agent is the last step, not the first.
The first step is capturing how the work actually gets done. Not the manual version. The real version. All 25 variations of it. Including the ones nobody can articulate until they’re doing it and talking through it at the same time.
Start with voice. Graduate to screen capture. Build the map. Weight the variations. Build the agent. Let it flag what it can’t handle. Let it learn from the fix.
Six loops. Start at one. Go deeper only when you need to. The only thing standing between your 2,000-person back office and a system that handles 98% of it automatically is someone willing to start recording conversations.
Tools referenced: claude-video (Brad Flaugher) for synchronised frame and transcript analysis; Whisper and Groq’s hosted Whisper for transcription; Loom, OBS, and QuickTime for screen recording; Claude and GPT for pattern extraction.
Part of the From Outputs to Outcomes series. The equation: Domain Engineering + Context Engineering + Prompt Engineering + Human Feedback = Outcomes. This post is about the Human Feedback loop. The part where the system learns from what humans actually do, not what the manual says they should do.