Meta-Harness
A minimal outer loop that delegates selection AND mutation to a skill-steered proposer over an append-only candidate history, returning a (score x cost) Pareto frontier.
"""Meta-Harness — a faithful port of "Meta-Harness: End-to-End Optimization of Model Harnesses"
(Stanford IRIS Lab), following its reference implementation (the canonical
``reference_examples/text_classification`` example) wherever paper and code diverge. One module
per component:

  population.py       -> MetaHarnessPopulation (append-only filesystem-D analogue + Pareto frontier)
  selection_policy.py -> MetaHarnessPolicy (no selection rule: nominal frontier-top parent + signals)
  prompt_builder.py   -> MetaHarnessPromptBuilder (SKILL.md steering + serialized filesystem view)
  proposer.py         -> MetaHarnessProposer (one call -> k candidates, FIFO-dispensed; compile gate)
  evaluator.py        -> MetaHarnessEvaluator (task-supplied)
  memory.py           -> MetaHarnessMemory (evolution_summary.jsonl + reports analogue)
  scaffold.py         -> MetaHarnessScaffold (the orchestrator that composes the six)

The method is a deliberately MINIMAL outer loop: no parent selection, no archive policy, no
mutation operator — everything is delegated to a skill-steered proposer that reads the whole
candidate history and writes k new full programs per round. The outer loop only validates,
evaluates, appends rows to the evolution summary, and recomputes the Pareto frontier
(maximize combined_score, minimize cost); the run's product is the frontier, not just the best.
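
A minimal sketch of the two-objective frontier described above, assuming a hypothetical
``Candidate`` record in place of the port's ``Genome`` and a hypothetical ``pareto_frontier``
helper. The real ``MetaHarnessPopulation.frontier()`` may differ in tie-breaking and ordering:

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    # Hypothetical record for illustration; the port itself stores Genome objects.
    name: str
    score: float  # combined_score, higher is better
    cost: float   # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    # a dominates b when it is no worse on both objectives and strictly better on one.
    no_worse = a.score >= b.score and a.cost <= b.cost
    strictly_better = a.score > b.score or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(history: List[Candidate]) -> List[Candidate]:
    # Keep every candidate that no other candidate dominates, best score first.
    front = [c for c in history if not any(dominates(o, c) for o in history)]
    return sorted(front, key=lambda c: (-c.score, c.cost))


history = [
    Candidate("seed", 0.60, 10.0),
    Candidate("a", 0.72, 25.0),
    Candidate("b", 0.70, 30.0),  # dominated by "a": lower score and higher cost
    Candidate("c", 0.55, 4.0),
]
frontier = pareto_frontier(history)
print([c.name for c in frontier])  # ['a', 'seed', 'c']
```

Note that the cheap, low-scoring candidate "c" stays on the frontier, which is why the run's
product is the whole frontier rather than only the top scorer.
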

The scaffold adds the pieces of reference control flow that sit outside the components:

* ``before_step``: publishes ``general.max_iterations`` on the signal bus (the prompt's
  "iteration t of N"), snapshots ``_stale`` for the gated-child repair below, and, once at
  iteration 1, appends the task seed's evolution-summary row. NOTE: this row is a declared
  deviation (adaptation 5 below). Reference Phase 0 benchmarks the baselines into ``D``
  (``val.json`` + frontier) but writes NO ``evolution_summary.jsonl`` rows for them.
* ``after_step``: the gated-child best repair plus the evolution-summary bookkeeping.
  Eval-gated children (``eval_failed`` metadata OR an invalid ``EvalResult``) are rejected by
  the population's admission gate and must update nothing, including ``state.best``. The base
  loop promotes any fitter child *before* ``after_step``, so when a gated child has hijacked
  ``state.best`` the hook restores ``population.best()`` and redoes the staleness tick exactly.
  Every evaluated child then gets its summary row ({name, iteration, score, cost, outcome};
  score 0 and ``"failed"`` for gated children, like the reference's crashed candidates) with
  its evaluator ``text_feedback`` persisted as the replayable trace. The child is also stamped
  with ``metadata["iteration"]`` / ``metadata["outcome"]`` (completing the genome metadata
  contract; both are stripped again at dispense time), and its <=30-line report is stored.
  ``result is None`` (NO_DIFF / abandoned proposal) records nothing, because the reference
  writes nothing for a failed proposer session either.
* ``_finalize``: adds ``summary["frontier"]`` (names + both objective values), because the
  reference's Algorithm 1 returns the Pareto frontier, not a single best program.
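
The gated-child repair can be illustrated with a small stand-alone sketch. ``Child``,
``LoopState``, ``base_step`` and ``repair_if_gated`` are hypothetical stand-ins, not the
galapagos API; the sketch only assumes, as stated above, that the base loop promotes a fitter
child and resets staleness before the hook runs:

```python
from dataclasses import dataclass, field


@dataclass
class Child:
    # Hypothetical stand-in for a Genome.
    name: str
    fitness: float
    metadata: dict = field(default_factory=dict)


@dataclass
class LoopState:
    # Hypothetical stand-in for the loop state: current best plus a staleness counter.
    best: Child
    stale: int = 0


def base_step(state: LoopState, child: Child) -> None:
    # Assumed base-loop behaviour: promote any fitter child and reset staleness.
    if child.fitness > state.best.fitness:
        state.best, state.stale = child, 0
    else:
        state.stale += 1


def repair_if_gated(state: LoopState, child: Child, admitted_best: Child,
                    stale_before: int, valid: bool) -> bool:
    # Undo a promotion that an eval-gated child should never have received.
    gated = bool(child.metadata.get("eval_failed")) or not valid
    if gated and state.best is child:
        state.best = admitted_best      # best among admitted candidates
        state.stale = stale_before + 1  # the tick the iteration should have produced
    return gated


seed = Child("seed", 0.6)
state = LoopState(best=seed, stale=2)
bad = Child("bad", 0.9, {"eval_failed": True})

stale_before = state.stale  # snapshot taken before the step
base_step(state, bad)       # the gated child hijacks state.best here
gated = repair_if_gated(state, bad, admitted_best=seed,
                        stale_before=stale_before, valid=True)
print(state.best.name, state.stale, gated)  # seed 3 True
```

Snapshotting the counter before the step is what lets the repair redo the tick exactly instead
of guessing how stale the run was.
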

Sanctioned adaptations (each documented at its implementation site):

1. **k-candidates-per-proposal queue**: the galapagos base loop is one-child-per-iteration, so
   the proposer FIFO-dispenses ONE of the k parsed candidates per loop iteration and re-calls
   the model only on an empty queue; reference N iterations x k candidates = N*k galapagos
   iterations (bundled budget 60 = 20 x 3). (proposer.py)
2. **compile() interface validation**: the safe in-process analogue of the reference's 30 s
   subprocess import-check, with failed candidates recorded as ``outcome: "failed"`` rows at
   proposal time even when sibling candidates were valid (the reference silently drops those
   rows; an information-loss quirk not reproduced). (proposer.py)
3. **Trace persistence**: the reference's shipped val-only config computes per-example traces
   and discards them; this port persists what galapagos evaluators emit (``text_feedback``)
   and replays stratified errors-first excerpts, aligned with the paper's ablation that raw
   traces beat summaries. (memory.py / prompt_builder.py)
4. **Minimal-diff skill instantiation**: the bundled SKILL.md instantiates, for galapagos task
   programs, the UNION of the official repo's two domain skills (text_classification +
   terminal_bench_2, both bundled VERBATIM, byte-identical, under
   ``skills/meta-harness/references/``). Only domain nouns are substituted (memory
   systems/agents → programs, the ``predict()``/``learn_from_batch()`` method pair → the
   EVOLVE-BLOCK region, the six exploitation-axis labels) and mechanically-impossible I/O steps
   rewritten (pending_eval.json/``/tmp`` scripts/subagents → the in-response chat contract);
   domain-neutral text is kept verbatim, including the published-approach list
   "(DSPy, OPRO, Reflexion, CEIL, etc.)" and terminal_bench_2's "**Never mention task names**"
   anti-overfitting rule. (selection_policy.py / prompt_builder.py)
5. **Seed summary row**: reference Phase 0 writes NO ``evolution_summary.jsonl`` rows for the
   baselines (their scores reach the proposer only via the frontier); the chat port appends
   one ``{name: "seed", iteration: 0}`` row at iteration 1 so the serialized summary table
   (the chat proposer's primary "what's been tried" channel) covers every evaluated candidate,
   seed included. (scaffold.py ``before_step``)
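
Adaptation 1 can be sketched as a small queue wrapper. ``QueueingProposer`` and its
``propose_batch`` callable are hypothetical names for illustration, not the interface of
``MetaHarnessProposer``; the sketch only shows the FIFO dispensing and the refill on an empty
queue:

```python
from collections import deque
from typing import Callable, List, Optional


class QueueingProposer:
    # Hypothetical stand-in: one model call yields k candidates, dispensed one per iteration.
    def __init__(self, propose_batch: Callable[[], List[str]]) -> None:
        self._propose_batch = propose_batch  # assumed to return the k parsed candidates
        self._queue = deque()                # FIFO buffer of not-yet-dispensed candidates
        self.model_calls = 0

    def next_candidate(self) -> Optional[str]:
        # Call the model again only when the queue has run dry.
        if not self._queue:
            self.model_calls += 1
            self._queue.extend(self._propose_batch())
        # An empty batch corresponds to an abandoned proposal: nothing to dispense.
        return self._queue.popleft() if self._queue else None


# Two proposal rounds of k = 3 candidates each cover 2 x 3 = 6 loop iterations.
batches = iter([["a1", "a2", "a3"], ["b1", "b2", "b3"]])
proposer = QueueingProposer(lambda: next(batches))
dispensed = [proposer.next_candidate() for _ in range(6)]
print(dispensed, proposer.model_calls)
```

The same arithmetic gives the bundled budget: 20 proposal rounds of 3 candidates each consume
60 loop iterations.
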
"""
from __future__ import annotations
import logging
from ...config import GalapagosConfig
from ...models import GalapagosModel
from ...records import Genome, RunResult
from ..base_scaffold import GalapagosScaffold
from ..registry import register_scaffold
# one module per component (the Meta-Harness scaffold method)
from .memory import MetaHarnessMemory
from .population import MetaHarnessPopulation, display_name
from .prompt_builder import DEFAULT_SKILL, MetaHarnessPromptBuilder
from .proposer import MetaHarnessProposer
from .selection_policy import MetaHarnessPolicy
log = logging.getLogger(__name__)


@register_scaffold("meta_harness")
class MetaHarnessScaffold(GalapagosScaffold):
    name = "meta_harness"

    @classmethod
    def build_components(cls, config: GalapagosConfig, model: GalapagosModel | None) -> dict:
        seed = int(config.seed)
        k = int(config.proposer.candidates_per_proposal)
        pb = config.prompt_builder
        return {
            "population": MetaHarnessPopulation(
                cost_metric=str(config.population.cost_metric),
            ),
            "selection_policy": MetaHarnessPolicy(
                seed=seed,
                candidates_per_proposal=k,
            ),
            "prompt_builder": MetaHarnessPromptBuilder(
                candidates_per_proposal=k,
                top_k_sources=int(pb.top_k_sources),
                reports_in_prompt=int(pb.reports_in_prompt),
                trace_errors=int(pb.trace_errors),
                trace_successes=int(pb.trace_successes),
                trace_max_chars=int(pb.trace_max_chars),
                summary_max_rows=int(pb.summary_max_rows),
                seed=seed,
                skill=str(pb.skill if pb.skill is not None else DEFAULT_SKILL),
            ),
            "proposer": MetaHarnessProposer(candidates_per_proposal=k),
            "memory": MetaHarnessMemory(),
        }

    # ---- before_step: signal bus + the Phase-0 seed row ------------------------------------------
    def before_step(self) -> None:
        """Publish the budget for the prompt's "iteration t of N" header, snapshot ``_stale`` for
        the gated-child repair, and, once at iteration 1, record the task seed's summary row
        (the open mirror of Phase 0 seeding ``D`` with the evaluated baselines)."""
        self._stale_before = self._stale
        sig = self.state.signals.setdefault("meta_harness", {})
        sig["max_iterations"] = self.general.max_iterations
        memory = self.memory
        if self.state.iteration == 1 and hasattr(memory, "rows") and not memory.rows():
            for genome in self.population.all():
                memory.write("", kind="row", name=display_name(genome), iteration=0,
                             score=(genome.fitness if genome.fitness != float("-inf") else 0.0),
                             cost=self.population.cost_of(genome), outcome="evaluated",
                             trace=str(genome.artifacts.get("text_feedback") or ""))

    # ---- after_step: gated-best repair + evolution-summary bookkeeping ----------------------------
    def after_step(self, child: Genome, result) -> None:
        if result is None:  # NO_DIFF / abandoned proposal: the reference records nothing either
            return
        # eval-gated children must update NOTHING, including state.best: the base loop promotes
        # any fitter child and resets _stale BEFORE after_step (base_scaffold.step), so restore
        # the population's true best and redo the staleness tick exactly
        gated = bool(child.metadata.get("eval_failed")) or not result.valid
        if gated and self.state.best is child:
            population_best = self.population.best()
            if population_best is not child:
                self.state.best = population_best
                self._stale = getattr(self, "_stale_before", 0) + 1
        # one evolution-summary row per evaluated candidate (update_evolution_summary): score 0 /
        # "failed" for gated children (a reference failed row ALWAYS carries 0; outcome is the
        # literal "failed" exactly when avg_val == 0), the evaluator's text_feedback persisted as
        # the trace (the EvalResult field first; artifacts mirror as the fallback)
        memory = self.memory
        name = child.metadata.get("candidate_name") or display_name(child)
        outcome = "failed" if gated else "evaluated"
        score = 0.0 if gated else (child.fitness if child.fitness != float("-inf") else 0.0)
        # the genome carries its own bookkeeping (the mapping's metadata contract); children strip
        # both keys at dispense time (proposer._STALE_KEYS)
        child.metadata.update(iteration=self.state.iteration, outcome=outcome)
        memory.write("", kind="row", name=name, iteration=self.state.iteration, score=score,
                     cost=self.population.cost_of(child), outcome=outcome,
                     trace=str((result.text_feedback
                                or child.artifacts.get("text_feedback")) or ""))
        report = str(child.metadata.get("report") or "")
        if report:  # the <=30-line per-candidate report (the reports/ compression layer)
            memory.write("", kind="report", name=name, iteration=self.state.iteration,
                         report=report)

    # ---- run end: the Pareto frontier is the run's product ----------------------------------------
    def _finalize(self) -> RunResult:
        """Algorithm 1 "Return Pareto frontier": the summary carries the frontier (names + BOTH
        objective values, score-desc), not just the best program."""
        result = super()._finalize()
        result.summary["frontier"] = [
            {"name": display_name(g), "combined_score": g.fitness,
             "cost": self.population.cost_of(g)}
            for g in self.population.frontier()
        ]
        log.debug("pareto frontier — %d program(s)", len(result.summary["frontier"]))
        return result