"""Meta-Harness — a faithful port of "Meta-Harness: End-to-End Optimization of Model Harnesses"
(Stanford IRIS Lab), following its reference implementation (the canonical
``reference_examples/text_classification`` example) wherever paper and code diverge. One module
per component:

  population.py        -> MetaHarnessPopulation    (append-only filesystem-D analogue + Pareto frontier)
  selection_policy.py  -> MetaHarnessPolicy        (no selection rule: nominal frontier-top parent + signals)
  prompt_builder.py    -> MetaHarnessPromptBuilder (SKILL.md steering + serialized filesystem view)
  proposer.py          -> MetaHarnessProposer      (one call -> k candidates, FIFO-dispensed; compile gate)
  evaluator.py         -> MetaHarnessEvaluator     (task-supplied)
  memory.py            -> MetaHarnessMemory        (evolution_summary.jsonl + reports analogue)
  scaffold.py          -> MetaHarnessScaffold      (the orchestrator that composes the six)

The method is a deliberately MINIMAL outer loop: no parent selection, no archive policy, no
mutation operator — everything is delegated to a skill-steered proposer that reads the whole
candidate history and writes k new full programs per round. The outer loop only validates,
evaluates, appends rows to the evolution summary, and recomputes the Pareto frontier
(maximize combined_score, minimize cost); the run's product is the frontier, not just the best.
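
As an illustration of the two-objective frontier described above, here is a minimal sketch. The
``Candidate`` record, ``dominates`` and ``pareto_frontier`` are hypothetical names used only for
this example (they are not the API of population.py), and the scores and costs are made up:

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    # Hypothetical record for illustration; real genomes carry more fields.
    name: str
    combined_score: float  # objective to maximize
    cost: float            # objective to minimize


def dominates(a: Candidate, b: Candidate) -> bool:
    # a dominates b when it is no worse on both objectives and strictly better on one.
    no_worse = a.combined_score >= b.combined_score and a.cost <= b.cost
    strictly_better = a.combined_score > b.combined_score or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(candidates: list[Candidate]) -> list[Candidate]:
    # Keep every candidate that no other candidate dominates, best score first.
    frontier = [c for c in candidates if not any(dominates(o, c) for o in candidates)]
    return sorted(frontier, key=lambda c: c.combined_score, reverse=True)


candidates = [
    Candidate("seed", 0.60, 10.0),
    Candidate("cheap", 0.55, 4.0),
    Candidate("strong", 0.80, 25.0),
    Candidate("wasteful", 0.58, 30.0),  # dominated by "seed": lower score, higher cost
]
frontier_names = [c.name for c in pareto_frontier(candidates)]
```

In this toy data the frontier keeps three of the four programs, which is why the run reports the
whole frontier rather than a single best: "cheap" and "strong" are both defensible trade-offs.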

The scaffold adds the pieces of reference control flow that sit outside the components:

* ``before_step`` — publishes ``general.max_iterations`` on the signal bus (the prompt's
  "iteration t of N") and, once at iteration 1, appends the task seed's evolution-summary row.
  NOTE this row is a declared deviation (adaptation 5 below): reference Phase 0 benchmarks the
  baselines into ``D`` (``val.json`` + frontier) but writes NO ``evolution_summary.jsonl`` rows
  for them. Also snapshots ``_stale`` for the exact gated-child repair below.
* ``after_step`` — the gated-child best repair plus the evolution-summary bookkeeping. Eval-gated
  children (``eval_failed`` metadata OR an invalid ``EvalResult``) are rejected by the
  population's admission gate and must update nothing, including ``state.best``. The base loop
  promotes any fitter child *before* ``after_step``, so when a gated child has hijacked
  ``state.best`` the hook restores ``population.best()`` and redoes the staleness tick exactly.
  Every evaluated child then (a) gets its summary row ({name, iteration, score, cost, outcome};
  score 0 and ``"failed"`` for gated children, like the reference's crashed candidates) with its
  evaluator ``text_feedback`` persisted as the replayable trace, (b) is stamped with
  ``metadata["iteration"]`` / ``metadata["outcome"]`` (completing the genome metadata contract;
  both are stripped again at dispense time), and (c) has its <=30-line report stored.
  ``result is None`` (NO_DIFF / abandoned proposal) records nothing, as the reference writes
  nothing for a failed proposer session either.
* ``_finalize`` — adds ``summary["frontier"]`` (names + both objective values): the reference's
  Algorithm 1 returns the Pareto frontier, not a single best program.

Sanctioned adaptations (each documented at its implementation site):

1. **k-candidates-per-proposal queue** — the galapagos base loop is one-child-per-iteration, so
   the proposer FIFO-dispenses ONE of the k parsed candidates per loop iteration and re-calls the
   model only on an empty queue; reference N iterations x k candidates = N*k galapagos iterations
   (bundled budget 60 = 20 x 3). (proposer.py)
2. **compile() interface validation** — the safe in-process analogue of the reference's 30 s
   subprocess import-check, with failed candidates recorded as ``outcome: "failed"`` rows at
   proposal time even when sibling candidates were valid (the reference silently drops those
   rows; an information-loss quirk not reproduced). (proposer.py)
3. **Trace persistence** — the reference's shipped val-only config computes per-example traces
   and discards them; this port persists what galapagos evaluators emit (``text_feedback``) and
   replays stratified errors-first excerpts, aligned with the paper's ablation that raw traces
   beat summaries. (memory.py / prompt_builder.py)
4. **Minimal-diff skill instantiation** — the bundled SKILL.md instantiates, for galapagos task
   programs, the UNION of the official repo's two domain skills (text_classification +
   terminal_bench_2 — both bundled VERBATIM, byte-identical, under
   ``skills/meta-harness/references/``). Only domain nouns are substituted (memory
   systems/agents → programs, the ``predict()``/``learn_from_batch()`` method pair → the
   EVOLVE-BLOCK region, the six exploitation-axis labels) and mechanically-impossible I/O steps
   rewritten (pending_eval.json/``/tmp`` scripts/subagents → the in-response chat contract);
   domain-neutral text is kept verbatim — including the published-approach list
   "(DSPy, OPRO, Reflexion, CEIL, etc.)" and terminal_bench_2's "**Never mention task names**"
   anti-overfitting rule. (selection_policy.py / prompt_builder.py)
5. **Seed summary row** — reference Phase 0 writes NO ``evolution_summary.jsonl`` rows for the
   baselines (their scores reach the proposer only via the frontier); the chat port appends one
   ``{name: "seed", iteration: 0}`` row at iteration 1 so the serialized summary table — the chat
   proposer's primary "what's been tried" channel — covers every evaluated candidate, seed
   included. (scaffold.py ``before_step``)
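
Adaptation 1 can be pictured with the following minimal sketch. ``FifoDispenser`` and
``propose_batch`` are hypothetical stand-ins, not the real MetaHarnessProposer interface; the
sketch only shows how one model call that yields k candidates is spread over k loop iterations:

```python
from collections import deque
from typing import Callable, Optional


class FifoDispenser:
    # Hypothetical stand-in for the proposer's candidate queue (illustration only).
    def __init__(self, propose_batch: Callable[[], list[str]]) -> None:
        self._propose_batch = propose_batch  # one model call -> k candidate programs
        self._queue: deque[str] = deque()
        self.model_calls = 0

    def next_candidate(self) -> Optional[str]:
        # Re-call the model only when the queue is empty.
        if not self._queue:
            self.model_calls += 1
            self._queue.extend(self._propose_batch())
        # An empty batch (nothing parsed) yields no child for this iteration.
        return self._queue.popleft() if self._queue else None


K = 3  # candidates per proposal
ROUNDS = 20  # reference iterations
batches = iter([[f"round{r}_cand{i}" for i in range(K)] for r in range(ROUNDS)])
dispenser = FifoDispenser(lambda: next(batches))

# The bundled budget: 20 reference iterations x 3 candidates = 60 loop iterations.
children = [dispenser.next_candidate() for _ in range(ROUNDS * K)]
```

Under these assumptions the 60 loop iterations trigger exactly 20 model calls, and candidates
leave the queue in the order they were parsed.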
"""
from __future__ import annotations

import logging

from ...config import GalapagosConfig
from ...models import GalapagosModel
from ...records import Genome, RunResult
from ..base_scaffold import GalapagosScaffold
from ..registry import register_scaffold
# one module per component (the Meta-Harness scaffold method)
from .memory import MetaHarnessMemory
from .population import MetaHarnessPopulation, display_name
from .prompt_builder import DEFAULT_SKILL, MetaHarnessPromptBuilder
from .proposer import MetaHarnessProposer
from .selection_policy import MetaHarnessPolicy

log = logging.getLogger(__name__)


@register_scaffold("meta_harness")
class MetaHarnessScaffold(GalapagosScaffold):
    name = "meta_harness"

    @classmethod
    def build_components(cls, config: GalapagosConfig, model: GalapagosModel | None) -> dict:
        seed = int(config.seed)
        k = int(config.proposer.candidates_per_proposal)
        pb = config.prompt_builder
        return {
            "population": MetaHarnessPopulation(
                cost_metric=str(config.population.cost_metric),
            ),
            "selection_policy": MetaHarnessPolicy(
                seed=seed,
                candidates_per_proposal=k,
            ),
            "prompt_builder": MetaHarnessPromptBuilder(
                candidates_per_proposal=k,
                top_k_sources=int(pb.top_k_sources),
                reports_in_prompt=int(pb.reports_in_prompt),
                trace_errors=int(pb.trace_errors),
                trace_successes=int(pb.trace_successes),
                trace_max_chars=int(pb.trace_max_chars),
                summary_max_rows=int(pb.summary_max_rows),
                seed=seed,
                skill=str(pb.skill if pb.skill is not None else DEFAULT_SKILL),
            ),
            "proposer": MetaHarnessProposer(candidates_per_proposal=k),
            "memory": MetaHarnessMemory(),
        }

    # ---- before_step: signal bus + the Phase-0 seed row ------------------------------------------
    def before_step(self) -> None:
        """Publish the budget for the prompt's "iteration t of N" header, snapshot ``_stale`` for
        the gated-child repair, and, once at iteration 1, record a summary row for each seeded
        genome (the port's declared analogue of Phase 0 seeding ``D`` with the evaluated
        baselines; the reference itself writes no such rows)."""
        self._stale_before = self._stale
        sig = self.state.signals.setdefault("meta_harness", {})
        sig["max_iterations"] = self.general.max_iterations
        memory = self.memory
        if self.state.iteration == 1 and hasattr(memory, "rows") and not memory.rows():
            for genome in self.population.all():
                memory.write("", kind="row", name=display_name(genome), iteration=0,
                             score=(genome.fitness if genome.fitness != float("-inf") else 0.0),
                             cost=self.population.cost_of(genome), outcome="evaluated",
                             trace=str(genome.artifacts.get("text_feedback") or ""))

    # ---- after_step: gated-best repair + evolution-summary bookkeeping ----------------------------
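    def after_step(self, child: Genome, result) -> None:
        """Repair ``state.best`` after an eval-gated child, then append the child's
        evolution-summary row and, when present, its report. A ``None`` result is a no-op."""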
        if result is None:  # NO_DIFF / abandoned proposal — the reference records nothing either
            return
        # eval-gated children must update NOTHING — including state.best: the base loop promotes
        # any fitter child and resets _stale BEFORE after_step (base_scaffold.step), so restore
        # the population's true best and redo the staleness tick exactly
        gated = bool(child.metadata.get("eval_failed")) or not result.valid
        if gated and self.state.best is child:
            population_best = self.population.best()
            if population_best is not child:
                self.state.best = population_best
                self._stale = getattr(self, "_stale_before", 0) + 1

        # one evolution-summary row per evaluated candidate (update_evolution_summary): score 0 /
        # "failed" for gated children (a reference failed row ALWAYS carries 0 — outcome is the
        # literal "failed" exactly when avg_val == 0), the evaluator's text_feedback persisted as
        # the trace (the EvalResult field first; artifacts mirror as the fallback)
        memory = self.memory
        name = child.metadata.get("candidate_name") or display_name(child)
        outcome = "failed" if gated else "evaluated"
        score = 0.0 if gated else (child.fitness if child.fitness != float("-inf") else 0.0)
        # the genome carries its own bookkeeping (the genome metadata contract); both keys are
        # stripped from children again at dispense time (proposer._STALE_KEYS)
        child.metadata.update(iteration=self.state.iteration, outcome=outcome)
        memory.write("", kind="row", name=name, iteration=self.state.iteration, score=score,
                     cost=self.population.cost_of(child), outcome=outcome,
                     trace=str((result.text_feedback
                                or child.artifacts.get("text_feedback")) or ""))
        report = str(child.metadata.get("report") or "")
        if report:  # the <=30-line per-candidate report (the reports/ compression layer)
            memory.write("", kind="report", name=name, iteration=self.state.iteration,
                         report=report)

    # ---- run end: the Pareto frontier is the run's product ----------------------------------------
    def _finalize(self) -> RunResult:
        """Algorithm 1 "Return Pareto frontier": the summary carries the frontier (names + BOTH
        objective values, score-desc), not just the best program."""
        result = super()._finalize()
        result.summary["frontier"] = [
            {"name": display_name(g), "combined_score": g.fitness,
             "cost": self.population.cost_of(g)}
            for g in self.population.frontier()
        ]
        log.debug("pareto frontier — %d program(s)", len(result.summary["frontier"]))
        return result