The Arena Is the Product.
How Karpathy's Autoresearch Validates the Domain Operating Frame Thesis — From a Completely Different Direction
Mohamed Anis, Founder, Human-Edge.AI
The Development
On March 7, 2026 — the same week this paper was being finalised — Andrej Karpathy open-sourced autoresearch, a system that runs approximately 100 ML experiments on a single GPU overnight with no human intervention after initial setup. The system accumulated 83 experiments and 15 kept improvements in its first published run. A larger version running on 8xH100 logged 276 experiments and 29 kept improvements. Karpathy described it as "part code, part sci-fi, and a pinch of psychosis."
The system has exactly three files that matter. prepare.py handles one-time data preparation and is never modified. train.py contains the full model, optimiser, and training loop — the agent edits this freely. program.md is where the human shapes the agent's research strategy. That third file is the only thing the human writes.
- prepare.py: one-time data prep
- train.py: agent edits freely
- program.md: human writes the frame
The agent operates in an autonomous loop: it modifies the training code, trains for exactly five minutes, checks the score (validation bits per byte), keeps or discards the result, commits improvements to a git branch, and repeats. All night. Without the human.
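The loop above can be sketched in a few lines. This is a minimal illustration, not Karpathy's actual code: the agent's edit and the five-minute training run are stubbed out with a toy noisy objective, and all names are illustrative.

```python
import random

def propose_edit(code):
    """Stand-in for the agent editing train.py: perturb one hyperparameter."""
    return {**code, "lr": code["lr"] * random.uniform(0.8, 1.25)}

def train_and_eval(code):
    """Stand-in for a five-minute training run; returns val bits per byte.
    A toy noisy bowl, minimised near lr = 0.01 (purely illustrative)."""
    return (code["lr"] - 0.01) ** 2 * 1e4 + 1.0 + random.gauss(0, 0.01)

def research_loop(n_experiments=100):
    best_code = {"lr": 0.05}              # initial train.py "state"
    best_bpb = train_and_eval(best_code)  # baseline score
    kept = []
    for i in range(n_experiments):
        candidate = propose_edit(best_code)   # agent modifies the code
        bpb = train_and_eval(candidate)       # train, then check the score
        if bpb < best_bpb:                    # keep only strict improvements
            best_code, best_bpb = candidate, bpb
            kept.append((i, bpb))             # stand-in for a git commit
    return best_bpb, kept

final_bpb, improvements = research_loop()
print(f"kept {len(improvements)} improvements, final val_bpb={final_bpb:.3f}")
```

The essential property is that the keep/discard gate makes the score monotonically non-increasing across kept commits, which is what lets improvements compound overnight.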
When asked why the agent doesn't iterate on the prompt too, Karpathy responded: "definitely. the current one is already 90% AI written I ain't writing all that."
- 1x GPU: 83 experiments, 15 improvements kept
- 8x H100: 276 experiments, 29 improvements kept
Garry Tan, president of Y Combinator, wrote the most incisive analysis of the pattern: "You can't just ask the agent to self-improve. You have to design the arena and give it a push based on your knowledge of how it all works." His closing line: "The best program.md wins."
Why This Matters for Human-Edge
At first glance, autoresearch has nothing to do with what we are building. Karpathy is optimising neural network hyperparameters. We are building context infrastructure for business domains. The domains are unrelated.
The architecture is identical.
The Structural Mapping
| Karpathy's Autoresearch | Human-Edge.AI |
| --- | --- |
| program.md | Domain Operating Frame |
| train.py | Human Graph + Job Graph |
| val_bpb | Outcome telemetry (reply rates, meetings booked, etc.) |
| 5-minute training runs | Individual actions (emails, introductions, etc.) |
| Git commits of improvements | Calibration signals |
| The autonomous loop | The Execution and Feedback Loop |
Karpathy built a Domain Operating Frame for ML research and called it program.md. He closed the feedback loop with a clean metric. He let agents iterate autonomously. The system compounds improvements without human intervention.
That is exactly what we described in Sections 4, 9, and 10 of the main paper. We arrived at the same architecture from the business domain direction. Karpathy arrived at it from the ML research direction. The convergence is not coincidental. It is structural.
Three-Layer Validation
Frame + Model + Feedback Loop = Compounding Improvement
What happens when each is removed:
- Without the frame: the agent has no constraints; it seed-hacks and overfits.
- Without the model: the frame is inert; there is no execution layer.
- Without the loop: random search, not research; nothing compounds.
The paper's core equations validate this:
Domain Operating Frame + Frontier Model = Aligned Decisions
Aligned Decisions + Execution Loop + Feedback Loop = Business Outcomes
Karpathy proved the architecture works. The open question is whether it transfers from domains with clean loss functions to domains with multi-dimensional objective functions. That is our research and engineering challenge.
Clean Metrics vs. Messy Objectives
Karpathy's arena has one metric: validation bits per byte. Lower is better. Unambiguous, immediate, comparable.
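For concreteness, validation bits per byte is the model's cross-entropy on held-out data, converted from nats to bits and normalised per byte. A minimal sketch (function and variable names are illustrative):

```python
import math

def val_bpb(total_nll_nats: float, num_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) over a held-out
    split into bits per byte. Lower means better prediction."""
    return total_nll_nats / (math.log(2) * num_bytes)

# e.g. a total NLL of 2*ln(2) nats over 2 bytes is exactly 1 bit per byte
print(val_bpb(2 * math.log(2), 2))
```

A single scalar like this is trivially comparable across experiments, which is exactly what makes the keep/discard gate clean.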
Business domains are messier. Fundraising success is multi-dimensional: speed to close, valuation achieved, dilution accepted, strategic fit of the investor, founder time invested, relationship capital preserved or depleted, and signalling effects on future rounds. These dimensions trade off against one another.
This is precisely why the Domain Operating Frame must be engineered, not just written in a Markdown file.
The Domain Operating Frame must encode six components:
The ontology
The workflow state
The constraints
The actor archetypes
The operational norms
The multi-dimensional objective function
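The contrast can be made concrete. Where autoresearch minimises one scalar, a fundraising frame must score an outcome along several trading-off dimensions. A minimal sketch with hypothetical dimensions, normalisations, and hand-set weights; a real objective function would be calibrated against observed outcomes, not written by hand:

```python
from dataclasses import dataclass

@dataclass
class RoundOutcome:
    days_to_close: float   # speed to close
    valuation_usd: float   # valuation achieved
    dilution_pct: float    # dilution accepted
    investor_fit: float    # strategic fit, 0..1
    founder_hours: float   # founder time invested

# Illustrative weights; in practice these would be learned, not hand-set.
WEIGHTS = {"speed": 0.2, "valuation": 0.3, "dilution": 0.2,
           "fit": 0.2, "founder_time": 0.1}

def score(o: RoundOutcome) -> float:
    """Higher is better; each dimension is clipped to roughly 0..1."""
    return (
        WEIGHTS["speed"]        * max(0.0, 1 - o.days_to_close / 180)
        + WEIGHTS["valuation"]  * min(1.0, o.valuation_usd / 50e6)
        + WEIGHTS["dilution"]   * max(0.0, 1 - o.dilution_pct / 30)
        + WEIGHTS["fit"]        * o.investor_fit
        + WEIGHTS["founder_time"] * max(0.0, 1 - o.founder_hours / 400)
    )
```

Even this toy version shows the engineering burden: every weight and normalisation constant is a judgment call that a single clean metric like val_bpb never forces.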
A program.md for ML research is roughly 2,000 words of structured guidance. A Domain Operating Frame for fundraising is a living knowledge graph with hundreds of typed nodes, weighted edges, temporal dynamics, and an evolving objective function calibrated against real outcomes.
The architecture is the same. The arena is harder. That is the opportunity.
The Progression Timeline
- Pre-2025. Gen 1: Raw Prompting. The human writes code.
- Feb 2025. Gen 2: Prompt Engineering. The human directs the work.
- Feb 2026. Transition: Agentic Engineering. The human orchestrates agents.
- Mar 2026. Gen 3: Context-Driven Interaction. The human designs the arena; agents execute autonomously.
Each step removes one layer of human involvement. The human goes from writing code, to directing an agent that writes code, to designing the frame within which agents operate autonomously. The human's value migrates from execution to judgment to architecture.
This is exactly the shift Human-Edge is building for.
The Convergent Pattern
Three independent developments converging:
- Mid-2025. The Ralph Wiggum Technique (Geoffrey Huntley): a bash loop, one agent, specs as success criteria. A $50,000 contract delivered for $297 in API costs.
- January 2026. Gas Town (Steve Yegge): 20-30 coding agents in parallel. 75,000 lines of code, 2,000 commits, 17 days.
- March 2026. Autoresearch (Andrej Karpathy): one agent, ML experiments, a structured research frame. Roughly 100 experiments overnight.
The pattern:
1. A structured frame defines the arena.
2. An autonomous loop executes within the frame.
3. A feedback mechanism determines what is kept and what is discarded.
4. The human designs the arena, then steps back.
The only difference is the domain.
The Investor Implication
A year from now, the companies winning won't be the ones with the most engineers or the most compute. They'll be the ones whose agents never stopped running. The best program.md wins.
— Garry Tan, President, Y Combinator
Translated into our language: the best Domain Operating Frame wins.
The model is commodity. The agents are commodity. The arena is the product. The frame is the moat.
That is what Human-Edge.AI is building.
What This Changes in Our Architecture
The autonomy gradient: how much freedom the agent has depends on the stakes.
- Full autonomy: research, analysis, drafts, tracking.
- Human-in-the-loop: outreach sequencing, deck revisions, follow-up timing.
- Human approval required: investor comms, data room sharing, term sheet responses.
The frame governs not just what the agent does, but how much freedom the agent has at each stage.
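A gradient like this can be encoded directly in the frame as a gating policy. The sketch below is a hypothetical illustration of the idea (the action names and policy mapping are ours, not a real API), with the safety default that unknown action types fall to the most restrictive level:

```python
from enum import Enum

class Autonomy(Enum):
    FULL = "act autonomously"
    HUMAN_IN_LOOP = "act, but human can intervene"
    APPROVAL = "human approval required before acting"

# Illustrative policy: the frame maps action types to autonomy by stakes.
AUTONOMY_POLICY = {
    "research": Autonomy.FULL,
    "analysis": Autonomy.FULL,
    "draft": Autonomy.FULL,
    "outreach_sequencing": Autonomy.HUMAN_IN_LOOP,
    "deck_revision": Autonomy.HUMAN_IN_LOOP,
    "investor_comms": Autonomy.APPROVAL,
    "data_room_sharing": Autonomy.APPROVAL,
    "term_sheet_response": Autonomy.APPROVAL,
}

def gate(action_type: str) -> Autonomy:
    """Unknown action types default to the most restrictive level."""
    return AUTONOMY_POLICY.get(action_type, Autonomy.APPROVAL)
```

The point of the default is that the arena fails closed: an agent can never gain autonomy over an action the frame's designer has not explicitly considered.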
Conclusion
Karpathy did not set out to validate the Human-Edge thesis. He set out to automate ML research. But the architecture he built — a structured frame, an autonomous execution loop, a clear feedback mechanism, and improvements that compound without human intervention — is structurally identical to what this paper proposes for business domains.
The convergence from independent directions is the strongest possible signal that the pattern is real.
The arena is the product. The frame is the moat. The best Domain Operating Frame wins.
We intend to build it.
This addendum should be read alongside the main paper:
From Outputs to Outcomes: Context Is Not Enough

References
[8] Andrej Karpathy, "autoresearch," GitHub repository, March 7, 2026. github.com/karpathy/autoresearch
[9] Garry Tan, "Karpathy Just Turned One GPU Into a Research Lab," March 8, 2026.
[10] Geoffrey Huntley, "The Ralph Wiggum Technique," ghuntley.com, 2025.
[11] Steve Yegge, "Welcome to Gas Town," Medium, January 2026.
[12] Andrej Karpathy, @karpathy, Twitter/X, "agentic engineering" post, February 8, 2026.
Human-Edge.AI — March 2026
All rights reserved. Research paper.