Test-time search · MIT

# Meta-Harness (`meta_harness`)

> A minimal outer loop that delegates selection AND mutation to a skill-steered proposer over an append-only candidate history, returning a (score x cost) Pareto frontier.

## Overview

Meta-Harness (Stanford IRIS Lab) strips the evolutionary outer loop to its bare minimum: there is no parent selection, no archive policy, and no mutation operator. The entire search lives in a skill-steered coding-agent proposer that reads the whole candidate history (every prior program, its score, its report, and its execution trace) and writes a fixed number of brand-new full programs each round. The outer loop's only jobs are to validate each candidate's interface, evaluate the valid ones, append their outcomes to a running summary, and recompute a Pareto frontier. The run's product is that frontier, not a single best program.

The proposer is constrained by near-verbatim steering rules carried in an editable `SKILL.md` file rather than in code, because the paper's practical-tips appendix found that editing the skill text moved results more than any loop constant. Those rules forbid parameter-only variants ("identical except constants => rewrite"), forbid dataset-specific hardcoding, forbid early stopping, cap each candidate's report at 30 lines, and rotate the search across six exploitation axes so successive rounds explore different mechanism families instead of clustering on one.

This is a faithful port of the reference implementation (the canonical `text_classification` example), with the code treated as ground truth wherever paper and code diverge.

The original proposer is a Claude Code session with filesystem tools steered through `--append-system-prompt`. A chat proposer cannot browse, so this port serializes the exact slice the skill's reading list points the agent at (the evolution-summary table, the Pareto frontier, recent reports, errors-first trace excerpts, and the full source of the top frontier members) into a single prompt, then dispenses the parsed candidates FIFO, one per Galapagos iteration. Because Galapagos runs one child per iteration while the reference evaluates k candidates per proposer session, a reference run of N iterations maps to N*k Galapagos iterations. The bundled budget of 60 is 20 reference iterations times k=3 candidates. The cost axis of the frontier is the genome character count by default, the universal analogue of the reference's injected-context character count, though any task metric key may be named instead.

## Algorithm

Each Galapagos iteration dispenses one candidate from an internal queue. The queue is refilled by a single model call only when it is empty; that call is one "proposer round" and corresponds to one reference iteration. Selection is nominal: the policy hands back `population.best()` (the frontier's top member) purely as the lineage anchor and diff target, and tells the proposer in the prompt that every prior candidate is an equally valid base. The proposer parses k candidate sections from the response, compile-gates each one, queues the valid ones, and records the invalid ones as failed rows immediately. The population is append-only (nothing is ever pruned), and the Pareto frontier is recomputed after every admitted add.

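
The parse, compile-gate, and FIFO steps described above might look roughly like the following Python sketch. Only the `### CANDIDATE i: name` header shape and the "last fenced python block" rule come from this document; the regexes, the function names (`parse_candidates`, `compiles`, `refill`), and the shape of the failed row are assumptions made for illustration and are not the scaffold's actual code.

```python
import re
from collections import deque
from typing import Deque, Dict, List, Tuple

FENCE = "`" * 3  # built at runtime so this sketch contains no literal fence
# Hypothetical patterns: a candidate header line and a fenced python block.
SECTION_RE = re.compile(r"^### CANDIDATE (\d+): (.+)$", re.MULTILINE)
BLOCK_RE = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def parse_candidates(response: str) -> List[Tuple[str, str]]:
    """Return (name, source) per section, taking the LAST fenced python block."""
    headers = list(SECTION_RE.finditer(response))
    parsed = []
    for i, header in enumerate(headers):
        end = headers[i + 1].start() if i + 1 < len(headers) else len(response)
        blocks = BLOCK_RE.findall(response[header.end():end])
        parsed.append((header.group(2).strip(), blocks[-1] if blocks else ""))
    return parsed


def compiles(source: str) -> bool:
    """Compile gate: syntax check only, the candidate is never executed here."""
    if not source.strip():
        return False
    try:
        compile(source, "<candidate>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def refill(response: str, queue: Deque[Tuple[str, str]],
           failed_rows: List[Dict[str, str]]) -> None:
    """Queue valid candidates; record invalid ones as failed rows immediately."""
    for name, source in parse_candidates(response):
        if compiles(source):
            queue.append((name, source))
        else:
            failed_rows.append({"name": name, "outcome": "failed"})


# A made-up proposer response with k=3 sections, one of which is broken.
response = "\n".join([
    "### CANDIDATE 1: tfidf_vote", FENCE + "python",
    "def predict(x):", "    return 0", FENCE,
    "### CANDIDATE 2: broken", FENCE + "python",
    "def predict(x:", FENCE,
    "### CANDIDATE 3: knn_cache", FENCE + "python",
    "def predict(x):", "    return 1", FENCE,
])

queue: Deque[Tuple[str, str]] = deque()
failed: List[Dict[str, str]] = []
refill(response, queue, failed)
first = queue.popleft()  # FIFO: exactly one candidate leaves per iteration
```

In this sketch `deque.popleft()` plays the role of a FIFO pop. Because a compile failure never enters the queue, a round with an invalid section yields fewer than k evaluable candidates.
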
```
# one proposer round = one model call = k candidates = k Galapagos iterations
on each iteration:
    parent   <- population.best()        # frontier top; nominal anchor only
    frontier <- population.frontier()    # the copy-then-edit source pool
    publish signals: iteration t of N, k, frontier table,
                     axis_hint = AXES[(round) % 6]   # rotates once per k iterations

    if proposer queue is empty:          # refill: the "proposer session"
        prompt <- SKILL.md steering (system) + serialized filesystem view (user):
                  task context, "iteration t of N",
                  EVOLUTION SUMMARY table (every evaluated candidate),
                  Pareto frontier (score, cost),
                  R most recent <=30-line reports,
                  sampled traces (errors first, then successes),
                  full source of top_k_sources frontier members,
                  current best program as the LAST fenced python block
        resp <- model.generate(prompt); record cost
        for each "### CANDIDATE i: name" section:
            source <- last fenced python block
            if compile(source) fails:
                memory.write(failed row, outcome="failed")   # never evaluated
            else:
                queue.append(candidate)

    cand  <- queue.pop(0)                # FIFO: exactly one per iteration
    child <- parent.child(cand.source)   # stamp candidate_name, report
    evaluate child (task-supplied evaluator)

    admit:
        if first add (seed): admit unconditionally            # Phase-0 baseline
        elif eval-failure (validity in {0,-1} or hard error): reject, update nothing
        else: append (no eviction); recompute Pareto frontier

    after_step:
        if a gated child hijacked state.best: restore population.best(), tick staleness
        write evolution-summary row {name, iteration, score, cost, outcome}
            with the evaluator's text_feedback as the replayable trace
        store the candidate's <=30-line report

finalize:
    summary["frontier"] <- [(name, combined_score, cost) for g in frontier]
```

The Pareto frontier is the exact port of the reference's `compute_pareto_frontier`: sort points by `(-score, cost)`, then sweep keeping every point whose cost is `<=` the running minimum.
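
The sort-and-sweep just described can be sketched in a few lines of Python. This is a minimal illustration written from the description in this document, not a copy of the reference's `compute_pareto_frontier`; the `(name, score, cost)` tuple layout, the function name `pareto_frontier`, and the sample candidates are assumptions.

```python
from typing import List, Tuple

# Hypothetical point layout: (name, score to maximize, cost to minimize).
Point = Tuple[str, float, float]


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Sort by (-score, cost), then keep points at or below the running min cost."""
    ordered = sorted(points, key=lambda p: (-p[1], p[2]))
    frontier: List[Point] = []
    min_cost = float("inf")
    for point in ordered:
        if point[2] <= min_cost:  # "<=" keeps exact ties on cost
            frontier.append(point)
            min_cost = point[2]
    return frontier


# Made-up candidates; cost stands in for genome character count.
candidates: List[Point] = [
    ("seed", 0.60, 900.0),
    ("wide", 0.80, 2400.0),
    ("lean", 0.75, 1200.0),
    ("bloat", 0.70, 3000.0),  # lower score and higher cost than "lean"
    ("twin", 0.75, 1200.0),   # exact tie with "lean" on both axes
]

frontier = pareto_frontier(candidates)
best = frontier[0]  # highest score; equal scores are ordered by lower cost
```

Here `bloat` is dropped, `lean` and `twin` both survive as an exact tie, and `seed` stays on the frontier because it is the cheapest point. Note that the sweep as literally described also drops a point that ties an earlier point on score but costs strictly more; whether the reference behaves identically in that corner case has not been verified here.
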
Dominance is strict on both axes: a point survives unless another has strictly higher score AND strictly lower cost, and exact ties on both axes are all kept. `best()` is `pareto[0]`, the highest-score member with ties broken by lower cost.

## Components

A Galapagos scaffold composes six components: Population, SelectionPolicy, PromptBuilder, Proposer, Evaluator, and Memory. Meta-Harness's distinctive move is that the SelectionPolicy carries no real selection rule and there is no mutation operator: both are delegated to the skill-steered Proposer, leaving the outer loop to only validate, evaluate, record, and recompute the frontier.

| Slot | Implementation | Role |
|---|---|---|
| Population | `MetaHarnessPopulation` (append_only_pareto) | Append-only candidate store; recomputes a strict-dominance Pareto frontier over (maximize combined_score, minimize cost) after every add; eval-failure admission gate. |
| SelectionPolicy | `MetaHarnessPolicy` (proposer_delegated_frontier_anchor) | No selection rule; returns the frontier top as a nominal parent/anchor and the full frontier as inspirations; publishes the proposer-facing signal bus (iteration, k, frontier table, rotating axis hint). |
| PromptBuilder | `MetaHarnessPromptBuilder` (skill_steered_filesystem_view) | Injects the `SKILL.md` steering as the system prompt and serializes the filesystem view (summary table, frontier, reports, sampled traces, frontier sources) as the user prompt, with the current best program as the last fenced block. |
| Proposer | `MetaHarnessProposer` (k_candidate_queue) | One model call parses k `### CANDIDATE` sections; compile-gates each; FIFO-dispenses one child per iteration; records compile failures as failed rows at proposal time. |
| Evaluator | task-supplied | The Galapagos task's evaluator; its `text_feedback` is persisted as the candidate's replayable trace. |
| Memory | `MetaHarnessMemory` (evolution_summary_reports) | Append-only `evolution_summary.jsonl` rows plus the `reports/` store of <=30-line per-candidate analyses; read back to render the summary table and reports. |

## Configuration

Keys this scaffold actually reads, from `config.yaml` and `build_components`:

- `general.max_iterations` (60): total Galapagos iterations; reference N iterations times k (20 x 3). Published on the signal bus as the prompt's "iteration t of N".
- `seed` (0): seeds the policy rng and the PromptBuilder's trace-sampling rng; identical seeds render byte-identical prompts.
- `population.cost_metric` (`genome_chars`): the frontier's minimize axis; `genome_chars` = `len(content)`, else any task score key (falls back to `genome_chars` when absent).
- `proposer.candidates_per_proposal` (3): k, the number of candidates produced per proposer round. Enforced only by the steering text rather than by code; the Policy and PromptBuilder reuse the value.
- `prompt_builder.skill` (`skills/meta-harness/SKILL.md`): the proposer steering, the search's primary hyperparameter; resolved against the scaffold dir first, then cwd, with absolute paths used as-is; `{candidates_per_proposal}` / `{exploitation_axes}` are substituted at load.
- `prompt_builder.top_k_sources` (3): number of frontier members whose full source enters the prompt (the copy-then-edit pool).
- `prompt_builder.reports_in_prompt` (6): number of most recent <=30-line candidate reports replayed.
- `prompt_builder.trace_errors` (2) / `prompt_builder.trace_successes` (1): trace excerpts sampled errors-first, then successes.
- `prompt_builder.trace_max_chars` (1500): per-excerpt clip.
- `prompt_builder.summary_max_rows` (200): evolution-summary rows rendered (most recent kept, never fewer than the most recent 50).

## When to use

Reach for Meta-Harness when the bottleneck is producing genuinely different full programs rather than nudging an existing one, and when you want a (score x cost) frontier instead of a single winner.
It shines when a capable coding-agent proposer can be trusted to read the whole history and choose its own bases, and when the search is best steered by editing prose rules (`SKILL.md`) rather than tuning loop constants. If you want explicit, mechanistic parent selection or a structured archive, a sibling fits better: `topk` keeps a greedy top-1-plus-context loop, `best_of_n` reuses a fixed parent under a counter, and `adaevolve`/`evox` carry real selection and (for evox) strategy co-evolution. Choose Meta-Harness precisely when you want the outer loop to get out of the way.

## Source

Meta-Harness: End-to-End Optimization of Model Harnesses (Stanford IRIS Lab, arXiv:2603.28052). This is a faithful Galapagos port of the reference implementation's `text_classification` example, with the two domain SKILL.md files (text_classification + terminal_bench_2) bundled verbatim under `skills/meta-harness/references/`.