Meta-Harness/meta_harness

Meta-Harness

A minimal outer loop that delegates selection AND mutation to a skill-steered proposer over an append-only candidate history, returning a (score x cost) Pareto frontier.

Test-time search · MIT
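The (score x cost) Pareto frontier named in the summary above can be illustrated with a small sketch. It assumes score is maximized and cost is minimized (the config's cost axis is described as "minimize"); the `Candidate` type and both function names are hypothetical and are not taken from the package.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    # Hypothetical record type, for illustration only.
    name: str
    score: float  # higher is better
    cost: int     # lower is better, e.g. a genome_chars-style character count


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.score >= b.score and a.cost <= b.cost
    strictly_better = a.score > b.score or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(candidates: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates (input order kept)."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


history = [
    Candidate("a", score=0.8, cost=100),
    Candidate("b", score=0.9, cost=300),  # dominated by d: same score, higher cost
    Candidate("c", score=0.7, cost=200),  # dominated by a: lower score, higher cost
    Candidate("d", score=0.9, cost=250),
]
frontier_names = [c.name for c in pareto_frontier(history)]
print(frontier_names)  # ['a', 'd']
```

The quadratic scan is adequate for a history of a few dozen candidates; how the actual population class maintains its frontier is not shown in this page.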

# Meta-Harness: faithful port of the Meta-Harness reference defaults (stanford-iris-lab/
# meta-harness, reference_examples/text_classification, the canonical example; the code is the
# ground truth wherever it diverges from the paper, e.g. k=3 per the shipped SKILL.md).
# Sections mirror the six core components.

seed: 0

general:
  max_iterations: 60          # reference --iterations 20 x k=3 candidates: the galapagos loop is
                              # one-candidate-per-iteration, so 20 reference iters = 60 here

population:                   # MetaHarnessPopulation (Pareto frontier)
  cost_metric: genome_chars   # frontier cost axis (minimize). genome_chars = len(genome content),
                              # the universal analogue of the reference's memory_context_chars
                              # (characters, not tokens); may name any task metric key instead
                              # (falls back to genome_chars when the key is absent)

proposer:                     # MetaHarnessProposer
  candidates_per_proposal: 3  # k: SKILL.md "implement 3 new memory systems every iteration"
                              # (enforced by the steering; the Policy and PromptBuilder reuse k)

prompt_builder:               # MetaHarnessPromptBuilder
  skill: skills/meta-harness/SKILL.md
                              # the proposer steering, the search's PRIMARY HYPERPARAMETER (the
                              # paper's practical-tips appendix: editing the skill text moved
                              # results more than any loop constant). A real Agent-Skills SKILL.md,
                              # dir-per-skill like the reference's .claude/skills/meta-harness/;
                              # we ship it under skills/ (NOT .claude/) because the package tree
                              # and the hub Files tab exclude dot-directories. Relative paths
                              # resolve against the scaffold package dir first, then the cwd;
                              # absolute paths are taken as-is, so the researcher workflow is:
                              # copy the bundled SKILL.md, edit it, point this key at the copy.
                              # Body tokens {candidates_per_proposal} / {exploitation_axes} are
                              # substituted at load time (str.replace on that documented set only).
  top_k_sources: 3            # full sources of this many frontier members in the prompt: the
                              # skill Step 3 copy-then-edit pool (the agent reads them from agents/)
  reports_in_prompt: 6        # most recent <=30-line candidate reports replayed (reports/ analogue)
  trace_errors: 2             # execution-trace excerpts sampled errors-first ...
  trace_successes: 1          # ... then successes ("deep-read failed AND successful trajectories")
  trace_max_chars: 1500       # clip per excerpt (the chat-port trace budget)
  summary_max_rows: 200       # evolution-summary rows rendered into the prompt: ALL rows up to
                              # this cap (most recent kept, never fewer than the most recent 50)

# NOTE (deviations from the upstream config surface, all documented in scaffold.py):
#   - the reference's 30 s subprocess import-check is compile(source) in-process (safe for arbitrary
#     task programs); its 2400 s proposer timeout / 7200 s benchmark timeout are owned by the model
#     host and the task evaluator in galapagos and are not re-exposed here.
#   - the held-out test phase (Phase Final) is not ported: galapagos tasks own their split
#     discipline; the run returns the Pareto frontier in result.summary instead.
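The comments on the `skill` key describe two load-time behaviors: path resolution (absolute paths as-is; relative paths tried against the package dir first, then the cwd) and token substitution restricted to a documented set. A minimal sketch of both follows. The function names, signatures, and error handling are assumptions made for illustration; the real logic lives in scaffold.py and may differ.

```python
from pathlib import Path
from typing import Mapping, Union

# The two body tokens the config comments document as substituted.
DOCUMENTED_TOKENS = ("candidates_per_proposal", "exploitation_axes")


def resolve_skill_path(skill: str, package_dir: Union[str, Path],
                       cwd: Union[str, Path]) -> Path:
    """Hypothetical helper: absolute paths are returned unchanged; relative
    paths are tried against the package dir first, then the cwd."""
    path = Path(skill)
    if path.is_absolute():
        return path
    for base in (package_dir, cwd):
        candidate = Path(base) / path
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"skill file not found: {skill}")


def render_skill(body: str, values: Mapping[str, object]) -> str:
    """Hypothetical helper: plain str.replace on the documented tokens only,
    so any other braces in the skill text are left untouched."""
    for key in DOCUMENTED_TOKENS:
        if key in values:
            body = body.replace("{" + key + "}", str(values[key]))
    return body


template = "Implement {candidates_per_proposal} systems; keep {literal} braces."
rendered = render_skill(template, {"candidates_per_proposal": 3})
print(rendered)  # Implement 3 systems; keep {literal} braces.
```

Using `str.replace` rather than `str.format` means an edited SKILL.md can contain literal braces (JSON snippets, code samples) without escaping, which fits the copy-edit-repoint workflow the config describes.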