Meta-Harness
A minimal outer loop that delegates selection AND mutation to a skill-steered proposer over an append-only candidate history, returning a (score x cost) Pareto frontier.
Meta-Harness (Stanford IRIS Lab) strips the evolutionary outer loop to its bare minimum: there is no parent selection, no archive policy, and no mutation operator. The entire search lives in a skill-steered coding-agent proposer that reads the whole candidate history — every prior program, its score, its report, and its execution trace — and writes a fixed number of brand-new full programs each round. The outer loop's only jobs are to validate each candidate's interface, evaluate the valid ones, append their outcomes to a running summary, and recompute a Pareto frontier. The run's product is that frontier, not a single best program.
The proposer is constrained by near-verbatim steering rules carried in an editable `SKILL.md` file rather than in code, because the paper's practical-tips appendix found that editing the skill text moved results more than any loop constant. Those rules forbid parameter-only variants ("identical except constants => rewrite"), forbid dataset-specific hardcoding, forbid early stopping, cap each candidate's report at 30 lines, and rotate the search across six exploitation axes so successive rounds explore different mechanism families instead of clustering on one.
This is a faithful port of the reference implementation (the canonical `text_classification` example), with the code treated as ground truth wherever paper and code diverge. The original proposer is a Claude Code session with filesystem tools steered through `--append-system-prompt`; a chat proposer cannot browse, so this port serializes the exact slice the skill's reading list points the agent at — the evolution-summary table, the Pareto frontier, recent reports, errors-first trace excerpts, and the full source of the top frontier members — into a single prompt and FIFO-dispenses the parsed candidates one per Galapagos iteration.
Because Galapagos runs one child per iteration while the reference evaluates k candidates per proposer session, a reference run of N iterations maps to N*k Galapagos iterations. The bundled budget of 60 is 20 reference iterations times k=3 candidates. The cost axis of the frontier is the genome character count by default — the universal analogue of the reference's injected-context character count — though any task metric key may be named instead.
The six components this scaffold snaps together. Each block names its concrete implementation.
The set of candidate solutions in play — the gene pool the search evolves over.
Decides which genomes survive and reproduce — tournament, elitism, novelty, or your own policy.
Assembles the context handed to the model — parents, feedback, instructions, examples.
The LLM-driven variation operator — proposes new candidates by mutation and crossover.
Scores each candidate against the task — the fitness signal that drives selection.
Persists discoveries across generations — archives, islands, and lineage for the search.
Meta-Harness: End-to-End Optimization of Model Harnesses (Stanford IRIS Lab, arXiv:2603.28052); reference implementation in references/test_time_search_scaffolds/meta_harness (text_classification example)