Meta-Harness/meta_harness

Meta-Harness

A minimal outer loop that delegates selection AND mutation to a skill-steered proposer over an append-only candidate history, returning a (score x cost) Pareto frontier.
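
The (score x cost) frontier can be illustrated with a minimal sketch. It assumes each candidate is reduced to a name, a score to maximize, and a cost to minimize; the `Candidate` record and the `dominates` / `pareto_frontier` helpers are hypothetical names chosen for this example, not the port's actual `MetaHarnessPopulation` API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    # Hypothetical record for illustration only (not the port's Genome type).
    name: str
    score: float  # higher is better
    cost: float   # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is no worse than `b` on both objectives and strictly better on one."""
    no_worse = a.score >= b.score and a.cost <= b.cost
    strictly_better = a.score > b.score or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(history: List[Candidate]) -> List[Candidate]:
    """Return the non-dominated candidates of an append-only history, score-descending."""
    front = [c for c in history if not any(dominates(o, c) for o in history)]
    return sorted(front, key=lambda c: c.score, reverse=True)


history = [
    Candidate("seed", 0.60, 1.0),
    Candidate("a", 0.72, 3.0),
    Candidate("b", 0.70, 4.0),  # dominated by "a": lower score and higher cost
    Candidate("c", 0.80, 9.0),
]
frontier = pareto_frontier(history)
print([c.name for c in frontier])
```

Because the history is append-only, recomputing the frontier from scratch each round is the simplest correct choice in this sketch; an incremental update would give the same result.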

Test-time search · MIT

"""Meta-Harness — a faithful port of "Meta-Harness: End-to-End Optimization of Model Harnesses" (Stanford IRIS Lab), following its reference implementation (the canonical ``reference_examples/text_classification`` example) wherever paper and code diverge. One module per component: population.py -> MetaHarnessPopulation (append-only filesystem-D analogue + Pareto frontier) selection_policy.py -> MetaHarnessPolicy (no selection rule: nominal frontier-top parent + signals) prompt_builder.py -> MetaHarnessPromptBuilder (SKILL.md steering + serialized filesystem view) proposer.py -> MetaHarnessProposer (one call -> k candidates, FIFO-dispensed; compile gate) evaluator.py -> MetaHarnessEvaluator (task-supplied) memory.py -> MetaHarnessMemory (evolution_summary.jsonl + reports analogue) scaffold.py -> MetaHarnessScaffold (the orchestrator that composes the six) The method is a deliberately MINIMAL outer loop: no parent selection, no archive policy, no mutation operator — everything is delegated to a skill-steered proposer that reads the whole candidate history and writes k new full programs per round. The outer loop only validates, evaluates, appends rows to the evolution summary, and recomputes the Pareto frontier (maximize combined_score, minimize cost); the run's product is the frontier, not just the best. The scaffold adds the pieces of reference control flow that sit outside the components: * ``before_step`` — publishes ``general.max_iterations`` on the signal bus (the prompt's "iteration t of N") and, once at iteration 1, appends the task seed's evolution-summary row. NOTE this row is a declared deviation (adaptation 5 below): reference Phase 0 benchmarks the baselines into ``D`` (``val.json`` + frontier) but writes NO ``evolution_summary.jsonl`` rows for them. Also snapshots ``_stale`` for the exact gated-child repair below. * ``after_step`` — the gated-child best repair plus the evolution-summary bookkeeping. 
Eval-gated children (``eval_failed`` metadata OR an invalid ``EvalResult``) are rejected by the population's admission gate and must update nothing — including ``state.best``: the base loop promotes any fitter child *before* ``after_step``, so when a gated child hijacked ``state.best`` the hook restores ``population.best()`` and redoes the staleness tick exactly. Every evaluated child then gets its summary row ({name, iteration, score, cost, outcome} — score 0 and ``"failed"`` for gated children, like the reference's crashed candidates) with its evaluator ``text_feedback`` persisted as the replayable trace, is stamped ``metadata["iteration"]`` / ``metadata["outcome"]`` (completing the genome metadata contract; both stripped again at dispense time), and its <=30-line report is stored. ``result is None`` (NO_DIFF / abandoned proposal) records nothing — the reference writes nothing for a failed proposer session either. * ``_finalize`` — adds ``summary["frontier"]`` (names + both objective values): the reference's Algorithm 1 returns the Pareto frontier, not a single best program. Sanctioned adaptations (each documented at its implementation site): 1. **k-candidates-per-proposal queue** — the galapagos base loop is one-child-per-iteration, so the proposer FIFO-dispenses ONE of the k parsed candidates per loop iteration and re-calls the model only on an empty queue; reference N iterations x k candidates = N*k galapagos iterations (bundled budget 60 = 20 x 3). (proposer.py) 2. **compile() interface validation** — the safe in-process analogue of the reference's 30 s subprocess import-check, with failed candidates recorded as ``outcome: "failed"`` rows at proposal time even when sibling candidates were valid (the reference silently drops those rows; an information-loss quirk not reproduced). (proposer.py) 3. 
**Trace persistence** — the reference's shipped val-only config computes per-example traces and discards them; this port persists what galapagos evaluators emit (``text_feedback``) and replays stratified errors-first excerpts, aligned with the paper's ablation that raw traces beat summaries. (memory.py / prompt_builder.py) 4. **Minimal-diff skill instantiation** — the bundled SKILL.md instantiates, for galapagos task programs, the UNION of the official repo's two domain skills (text_classification + terminal_bench_2 — both bundled VERBATIM, byte-identical, under ``skills/meta-harness/references/``). Only domain nouns are substituted (memory systems/agents → programs, the ``predict()``/``learn_from_batch()`` method pair → the EVOLVE-BLOCK region, the six exploitation-axis labels) and mechanically-impossible I/O steps rewritten (pending_eval.json/``/tmp`` scripts/subagents → the in-response chat contract); domain-neutral text is kept verbatim — including the published-approach list "(DSPy, OPRO, Reflexion, CEIL, etc.)" and terminal_bench_2's "**Never mention task names**" anti-overfitting rule. (selection_policy.py / prompt_builder.py) 5. **Seed summary row** — reference Phase 0 writes NO ``evolution_summary.jsonl`` rows for the baselines (their scores reach the proposer only via the frontier); the chat port appends one ``{name: "seed", iteration: 0}`` row at iteration 1 so the serialized summary table — the chat proposer's primary "what's been tried" channel — covers every evaluated candidate, seed included. 
(scaffold.py ``before_step``) """ from __future__ import annotations import logging from ...config import GalapagosConfig from ...models import GalapagosModel from ...records import Genome, RunResult from ..base_scaffold import GalapagosScaffold from ..registry import register_scaffold # one module per component (the Meta-Harness scaffold method) from .memory import MetaHarnessMemory from .population import MetaHarnessPopulation, display_name from .prompt_builder import DEFAULT_SKILL, MetaHarnessPromptBuilder from .proposer import MetaHarnessProposer from .selection_policy import MetaHarnessPolicy log = logging.getLogger(__name__) @register_scaffold("meta_harness") class MetaHarnessScaffold(GalapagosScaffold): name = "meta_harness" @classmethod def build_components(cls, config: GalapagosConfig, model: GalapagosModel | None) -> dict: seed = int(config.seed) k = int(config.proposer.candidates_per_proposal) pb = config.prompt_builder return { "population": MetaHarnessPopulation( cost_metric=str(config.population.cost_metric), ), "selection_policy": MetaHarnessPolicy( seed=seed, candidates_per_proposal=k, ), "prompt_builder": MetaHarnessPromptBuilder( candidates_per_proposal=k, top_k_sources=int(pb.top_k_sources), reports_in_prompt=int(pb.reports_in_prompt), trace_errors=int(pb.trace_errors), trace_successes=int(pb.trace_successes), trace_max_chars=int(pb.trace_max_chars), summary_max_rows=int(pb.summary_max_rows), seed=seed, skill=str(pb.skill if pb.skill is not None else DEFAULT_SKILL), ), "proposer": MetaHarnessProposer(candidates_per_proposal=k), "memory": MetaHarnessMemory(), } # ---- before_step: signal bus + the Phase-0 seed row ------------------------------------------ def before_step(self) -> None: """Publish the budget for the prompt's "iteration t of N" header, snapshot ``_stale`` for the gated-child repair, and — once, at iteration 1 — record the task seed's summary row (the open mirror of Phase 0 seeding ``D`` with the evaluated baselines).""" 
self._stale_before = self._stale sig = self.state.signals.setdefault("meta_harness", {}) sig["max_iterations"] = self.general.max_iterations memory = self.memory if self.state.iteration == 1 and hasattr(memory, "rows") and not memory.rows(): for genome in self.population.all(): memory.write("", kind="row", name=display_name(genome), iteration=0, score=(genome.fitness if genome.fitness != float("-inf") else 0.0), cost=self.population.cost_of(genome), outcome="evaluated", trace=str(genome.artifacts.get("text_feedback") or "")) # ---- after_step: gated-best repair + evolution-summary bookkeeping ---------------------------- def after_step(self, child: Genome, result) -> None: if result is None: # NO_DIFF / abandoned proposal — the reference records nothing either return # eval-gated children must update NOTHING — including state.best: the base loop promotes # any fitter child and resets _stale BEFORE after_step (base_scaffold.step), so restore # the population's true best and redo the staleness tick exactly gated = bool(child.metadata.get("eval_failed")) or not result.valid if gated and self.state.best is child: population_best = self.population.best() if population_best is not child: self.state.best = population_best self._stale = getattr(self, "_stale_before", 0) + 1 # one evolution-summary row per evaluated candidate (update_evolution_summary): score 0 / # "failed" for gated children (a reference failed row ALWAYS carries 0 — outcome is the # literal "failed" exactly when avg_val == 0), the evaluator's text_feedback persisted as # the trace (the EvalResult field first; artifacts mirror as the fallback) memory = self.memory name = child.metadata.get("candidate_name") or display_name(child) outcome = "failed" if gated else "evaluated" score = 0.0 if gated else (child.fitness if child.fitness != float("-inf") else 0.0) # the genome carries its own bookkeeping (the mapping's metadata contract); children strip # both keys at dispense time (proposer._STALE_KEYS) 
child.metadata.update(iteration=self.state.iteration, outcome=outcome) memory.write("", kind="row", name=name, iteration=self.state.iteration, score=score, cost=self.population.cost_of(child), outcome=outcome, trace=str((result.text_feedback or child.artifacts.get("text_feedback")) or "")) report = str(child.metadata.get("report") or "") if report: # the <=30-line per-candidate report (the reports/ compression layer) memory.write("", kind="report", name=name, iteration=self.state.iteration, report=report) # ---- run end: the Pareto frontier is the run's product ---------------------------------------- def _finalize(self) -> RunResult: """Algorithm 1 "Return Pareto frontier": the summary carries the frontier (names + BOTH objective values, score-desc), not just the best program.""" result = super()._finalize() result.summary["frontier"] = [ {"name": display_name(g), "combined_score": g.fitness, "cost": self.population.cost_of(g)} for g in self.population.frontier() ] log.debug("pareto frontier — %d program(s)", len(result.summary["frontier"])) return result
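
Adaptation 1 in the module docstring (one model call yields k candidates, dispensed one per loop iteration, with the model re-called only on an empty queue) can be sketched as below. This is an illustrative stand-in under stated assumptions, not the code in `proposer.py`: `FifoDispenser`, `next_candidate`, and `fake_propose` are hypothetical names, and candidates are plain strings rather than parsed programs.

```python
from collections import deque
from typing import Callable, List, Optional


class FifoDispenser:
    """Hypothetical queue wrapper: hands out one candidate per outer-loop iteration."""

    def __init__(self, propose: Callable[[], List[str]]) -> None:
        self._propose = propose  # assumed: one model call returning k candidate programs
        self._queue = deque()    # parsed candidates awaiting dispense, oldest first
        self.calls = 0           # number of model calls made so far

    def next_candidate(self) -> Optional[str]:
        # Re-call the model only when the queue is empty.
        if not self._queue:
            self.calls += 1
            self._queue.extend(self._propose())
        # An empty proposal leaves nothing to dispense for this iteration.
        return self._queue.popleft() if self._queue else None


# Fake proposer: round r yields three named candidates (k = 3).
rounds = iter(range(20))


def fake_propose() -> List[str]:
    r = next(rounds)
    return ["r%d_c%d" % (r, i) for i in range(3)]


dispenser = FifoDispenser(fake_propose)
names = [dispenser.next_candidate() for _ in range(60)]  # 60 loop iterations
print(dispenser.calls, names[:4])
```

With k = 3, the 60 loop iterations consume exactly 20 proposer calls, matching the docstring's "bundled budget 60 = 20 x 3".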