© 2026 Galapagos. Licensed under Apache-2.0.


SkyDiscover/evox

EvoX

Co-evolves the search strategy with the solutions: the parent/context selection policy is itself LLM-written code, scored by windowed improvement and hot-swapped on stagnation.
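
A minimal sketch of the "hot-swapped on stagnation" trigger, assuming the behaviour documented in the scaffold source on this page: a gain of at most 0.01 over the previous iteration's best counts as stagnation, and the trigger fires after `switch_interval` consecutive stagnant iterations (by default 10% of the iteration budget, at least 1). `StagnationTrigger` and its method names are hypothetical and are not part of galapagos.

```python
from __future__ import annotations


class StagnationTrigger:
    """Hypothetical helper: fire after `switch_interval` consecutive non-improving steps."""

    def __init__(self, total_iterations: int, threshold: float = 0.01,
                 switch_interval: int | None = None) -> None:
        # Default interval: 10% of the iteration budget, never below 1.
        self.switch_interval = switch_interval or max(1, int(0.10 * total_iterations))
        self.threshold = threshold
        self._last_best: float | None = None
        self._stagnant = 0

    def step(self, best_score: float) -> bool:
        """Record this iteration's best score; return True when a swap should fire."""
        if self._last_best is None or (best_score - self._last_best) > self.threshold:
            self._stagnant = 0      # first observation, or a gain above the threshold
        else:
            self._stagnant += 1     # gain <= threshold counts as a stagnant iteration
        self._last_best = best_score
        if self._stagnant >= self.switch_interval:
            self._stagnant = 0      # reset so the next trigger is a full interval away
            return True
        return False


# Made-up score trace: the 0.005 gain is below the threshold, so it still counts as stagnation.
trigger = StagnationTrigger(total_iterations=30, switch_interval=3)
scores = [0.50, 0.50, 0.505, 0.505, 0.60, 0.60]
fired = [trigger.step(s) for s in scores]
print(fired)
```

Each step is compared with the previous step's best rather than with the best at the start of the window, so a slow creep of sub-threshold gains still accumulates toward a swap.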

Test-time search · Apache-2.0
evox/scaffold.py
704 lines · 38.9 KB · python
"""EvoX — a faithful port of "EvoX: Meta-Evolution for Automated Discovery" (UC Berkeley),
following its reference implementation in SkyDiscover (``search/evox/``) wherever paper and code
diverge: J uses the code's ``(1 + ln(1 + max(0, start)))`` weight, stagnation is the per-iteration
consecutive-counter (not the paper's fixed windows), the meta-parent is the deterministic argmax-J
strategy, and the horizon normalizer is fixed at ``switch_interval`` even when a strategy runs
longer. One module per component:

  population.py        -> EvoXPopulation        (hosts the ACTIVE evolved strategy + φ statistics)
  selection_policy.py  -> EvoXPolicy            (thin adapter over the strategy's sample())
  prompt_builder.py    -> EvoXPromptBuilder     (operator-labeled default template + all prompts)
  proposer.py          -> EvoXProposer          (diff proposer + parent_info/context_ids stamps)
  evaluator.py         -> EvoXEvaluator         (task-supplied)
  memory.py            -> EvoXStrategyMemory    (the strategy history H)
  scaffold.py          -> EvoXScaffold          (the orchestrator that composes the six)

plus two non-component infra modules:

  strategy.py          -> StrategyBase + load_strategy_from_source + validate_strategy (Valid(·))
  seed_strategy.py     -> the S0 GENOME (read as text by the meta loop AND executed as the
                          initial strategy)

The scaffold owns the ``CoEvolutionController`` control flow that sits outside select/observe:

* ``setup`` — resolve ``switch_interval`` (explicit config or ``max(1, int(0.10 * T))``), snapshot
  the start φ, reset the scoring window, and run the ONE-time variation-operator generation
  (one ``self.model`` call with the verbatim COMBINED/EXPLORE/EXPLOIT prompts; ANY failure → both
  labels ``""``, the reference free-form-only fallback; ``auto_generate_variation_operators:
  false`` → the verbatim static DEFAULT templates).
* ``after_step`` — scoring-window + budget accounting. Upstream's ``_run_iteration(retry_times=3)``
  retries a parse/eval failure up to 3x against the SAME parent and reports ``attempts_used`` (1..3),
  which steps the J-scoring window AND advances the solution-iteration counter by that many (a
  3-retry failure burns 3 budget units + 3 window ticks). The base ``_attempt`` performs that
  in-iteration retry (``inner_retry_times=3`` + the fence-free ``feedback_section`` override) and
  stamps the attempts it used on ``child.metadata['inner_attempts_used']``; this hook ticks the
  window that many times and advances ``state.iteration`` by ``attempts_used - 1`` (the base step
  already counted +1). Best-tracking is admission-gated by the base loop, exactly like upstream's
  ``if was_added``, so no separate gated-best repair is needed.
* ``periodic`` — runs every iteration: detect runtime strategy fallbacks (count the failed
  evolution, drop the pending entry), then the verbatim stagnation counter (consecutive
  iterations with best-score gain <= 0.01 absolute; reset on improvement; trigger at
  ``switch_interval``; never on the final iteration). On trigger: finalize the pending strategy
  (deferred J + end φ → memory; the FIRST event scores + inserts the seed strategy first), build
  the meta prompt (adapted ``evox_search_sys_prompt`` system + the five-section user message with
  guide-LLM compression, each call degrading gracefully to raw text), ONE meta model call per
  attempt with up to ``meta_max_retries`` validation retries feeding failures back as
  "## Previous Failed Attempts"; all failures keep the current strategy (the counter was reset at
  trigger, so the next trigger is naturally >= switch_interval away). On success:
  ``population.swap_strategy`` (full migration + fallback) and a fresh scoring window.
* ``_finalize`` — a run-end pending strategy is finalized into H (upstream end of
  ``run_discovery``).

Sanctioned adaptations (documented per the port contract): the meta system/user prompts are
adapted to the galapagos contract (``EvolvedStrategy(StrategyBase)`` + Genome field names) since
the LLM codes against OUR base class; the meta user message renders the inspiration strategies
BEFORE the current strategy and replaces the trailing fenced response example with an unfenced
instruction, so the CURRENT STRATEGY SOURCE is the LAST fenced python block (the
``parse_full_rewrite`` target); package discovery for operator generation is a static
standard-scientific-stack line; the upstream ``search_horizon``/``search_window_horizon`` key
mismatch is resolved by storing both keys on every strategy entry.

Further sanctioned deviations:

* meta "Focus areas" baseline — the port compares the current strategy's J against the PREVIOUS
  strategy entry in H. Upstream compares J against the newest SOLUTION program's
  ``combined_score`` (the meta loop reuses the solution database's ``previous_programs``), an
  apples-to-oranges quirk that renders a "declined" line at essentially every event; that upstream
  bug is deliberately NOT reproduced.
* the three guide-LLM compression calls run sequentially through ``_guide`` (the galapagos
  framework is synchronous) rather than via ``asyncio.gather`` — identical prompt content and
  graceful degradation; the only difference is latency.
* the upstream ``gap_to_SOTA`` φ lines never render: galapagos exposes no SOTA knob, so
  ``db_stats`` never carries an ``SOTA_score`` (upstream only renders them when it does).
* solution-side retry-with-failure-feedback is now IN-iteration, faithful to upstream: the base
  ``_attempt`` retries a parse/eval failure up to ``inner_retry_times`` (=3) times WITHIN the
  iteration against the SAME sampled parent, folding each failed attempt into the retry prompt via
  the fence-free :meth:`EvoXPromptBuilder.feedback_section` (rendered without code fences so the
  current program stays the last fenced block). ``attempts_used`` (1..3) advances both the
  solution-iteration counter and the J-scoring window (``after_step``), matching
  ``CoEvolutionController``'s ``iteration += attempts_used`` and per-attempt window stepping.
"""
from __future__ import annotations

import hashlib
import logging
import math
import random
import re
from pathlib import Path

from ...config import GalapagosConfig
from ...models import GalapagosModel
from ...models.base import Prompt
from ...records import Genome
from ..base_scaffold import GalapagosScaffold
from ..registry import register_scaffold
# one module per component (the EvoX scaffold method)
from .memory import EvoXStrategyMemory
from .population import EvoXPopulation
from .prompt_builder import (BATCH_INSTRUCTIONS, BATCH_PER_PROGRAM, BATCH_SUMMARY_SYSTEM,
                             COMBINED_SYSTEM_PROMPT, DEFAULT_DIVERGE_TEMPLATE,
                             DEFAULT_REFINE_TEMPLATE, DIVERGE_TEMPLATE, META_SYSTEM_PROMPT,
                             META_TASK_TAIL, PROBLEM_SUMMARY_SYSTEM, PROBLEM_TEMPLATE,
                             REFINE_TEMPLATE, STATS_INSIGHT_SYSTEM, EvoXPromptBuilder, _defence,
                             filter_stats_by_horizon, format_current_strategy,
                             format_population_state, format_search_window_context,
                             format_stats_diff, format_strategy_inspirations,
                             identify_strategy_focus_areas, parse_batch_summaries)
from .proposer import EvoXProposer
from .selection_policy import EvoXPolicy
from .strategy import validate_strategy

log = logging.getLogger(__name__)

_FENCE = re.compile(r"```(?:python|py)?\s*\n(.*?)```", re.DOTALL)
_MAX_EVALUATOR_CODE_CHARS = 12000

# package discovery adapted to galapagos (upstream reads requirements.txt/pyproject/uv pip list)
_PACKAGES_LINE = ("Standard scientific Python stack (numpy, scipy, pandas, sympy, networkx, "
                  "scikit-learn). Do not assume packages that require extra installation.")

# variation_operator_generator._build_operator_prompt — verbatim structure
_OPERATOR_USER = """\
Please analyze this problem and generate BOTH guidance blocks.

## Problem Description:
```
{system_message}
```

## Available Packages in Environment
The following packages are available in the current uv environment:
```
{packages_list}
```

## Evaluator Code:
```python
{evaluator_code}
```

Generate BOTH the EXPLORATION (different approaches) and EXPLOITATION (refinement/intensification) guidance blocks now.

For EXPLORATION guidance block, focus on DIFFERENT algorithmic approaches and structural changes.
For EXPLOITATION guidance block, focus on INTENSIFYING within existing approaches - e.g., computational budget (e.g., increase max iterations), better seeds, tighter tolerances, local polish stages.
"""


# ---------------------------------------------------------------------------------------------
# Operator-response parsing — variation_operator_generator.py ports
# ---------------------------------------------------------------------------------------------

def _extract_examples(response: str, is_diverge: bool = True) -> str:
    """``_extract_examples``: keep from the ``EXAMPLES OF ...`` line onward (fences stripped),
    falling back to the whole section."""
    needle = "EXAMPLES OF DIFFERENT" if is_diverge else "EXAMPLES OF REFINEMENT"
    examples_lines: list[str] = []
    in_examples = False
    for line in response.strip().split("\n"):
        if needle in line.upper():
            in_examples = True
            examples_lines.append(line)
        elif in_examples:
            if line.strip().startswith(("Format:", "Your solution")):
                break
            if line.strip() == "```":
                continue
            examples_lines.append(line)
    if examples_lines:
        while examples_lines and not examples_lines[-1].strip():
            examples_lines.pop()
        return "\n".join(examples_lines)
    return response.strip()


def _parse_combined_response(response: str) -> tuple[str, str]:
    """``_parse_combined_response``: split on the EXPLORATION/EXPLOITATION headers."""
    exploration = exploitation = ""
    current_section: str | None = None
    current_lines: list[str] = []
    for line in response.split("\n"):
        upper = line.upper().strip()
        if "### EXPLORATION" in upper or "EXPLORATION (DIVERGE" in upper:
            if current_section == "exploitation":
                exploitation = "\n".join(current_lines)
            current_section, current_lines = "exploration", []
        elif "### EXPLOITATION" in upper or "EXPLOITATION (REFINE" in upper:
            if current_section == "exploration":
                exploration = "\n".join(current_lines)
            current_section, current_lines = "exploitation", []
        elif current_section:
            current_lines.append(line)
    if current_section == "exploration":
        exploration = "\n".join(current_lines)
    elif current_section == "exploitation":
        exploitation = "\n".join(current_lines)
    return _extract_examples(exploration, True), _extract_examples(exploitation, False)


# ---------------------------------------------------------------------------------------------
# LogWindowScorer — search/evox/utils/search_scorer.py, verbatim semantics
# ---------------------------------------------------------------------------------------------

class LogWindowScorer:
    """J = (running_best - start) * (1 + ln(1 + max(0, start))) / sqrt(int(horizon)) — the CODE
    formula (the paper omits the ``1 +``); the normalizer stays ``switch_interval`` even when the
    strategy ran longer (an intentional bonus for long-lived improving strategies)."""

    def __init__(self) -> None:
        self._start_score: float | None = None
        self._start_iteration: int | None = None
        self._best_scores: list[float] = []

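    def reset_window(self, start_score: float | None, start_iteration: int | None = None) -> None:
        """Open a fresh scoring window; a missing ``start_score`` is treated as 0.0."""
        self._start_score = float(start_score) if start_score is not None else 0.0
        self._start_iteration = start_iteration
        self._best_scores = []

    def record_step(self, best_score: float | None) -> None:
        """Append one window tick. With no window open yet, the first call opens one at
        ``best_score``; a ``None`` score carries the last recorded best (or the window start)
        forward."""
        if self._start_score is None:
            self.reset_window(best_score)
        if best_score is None:
            best_score = self._best_scores[-1] if self._best_scores else self._start_score
        self._best_scores.append(float(best_score))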

    def get_window_size(self) -> int:
        return len(self._best_scores)

    def get_start_score(self) -> float | None:
        return self._start_score

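    def compute_metrics(self, start_score: float | None = None,
                        best_scores: list[float] | None = None, horizon: int | None = None,
                        start_iteration: int | None = None) -> dict:
        """Return J (``combined_score``) plus window metadata. Explicit arguments override the
        recorded window state; when ``horizon`` is missing or 0 the normalizer falls back to the
        number of observed steps (at least 1)."""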
        if start_iteration is None:
            start_iteration = self._start_iteration
        start = float(start_score if start_score is not None else (self._start_score or 0.0))
        scores_to_use = best_scores if best_scores is not None else self._best_scores
        observed = len(scores_to_use) if scores_to_use else 0
        horizon_int = int(horizon) if horizon else max(1, observed)
        running_best = start
        for score in scores_to_use:
            running_best = max(running_best, float(score))
        improvement = running_best - start
        log_weight = 1.0 + math.log(1.0 + max(0.0, start))
        combined_score = improvement * log_weight / math.sqrt(horizon_int)
        return {
            "combined_score": combined_score,
            "window_start_iteration": start_iteration,
            "search_window_start_score": start,
            "search_window_end_score": running_best,
            "search_horizon": horizon_int,
        }
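

# Illustrative worked example (an editorial sketch, not upstream code): the J formula from the
# LogWindowScorer docstring evaluated standalone. ``_example_window_j`` is a hypothetical name
# and the inputs quoted below are made-up numbers chosen for easy arithmetic.
def _example_window_j(start: float, best_scores: list[float], horizon: int) -> float:
    running_best = max([start, *best_scores])             # never below the window start
    log_weight = 1.0 + math.log(1.0 + max(0.0, start))    # the code's ``1 +`` variant
    return (running_best - start) * log_weight / math.sqrt(int(horizon))
# With start=0.0, best_scores=[0.0, 0.3, 0.2], horizon=9: the improvement is 0.3, the weight is
# 1.0 (ln 1 = 0) and sqrt(9) = 3, so J is 0.1 up to float rounding. A window that never beats
# its start score (for example start=1.0, best_scores=[0.5]) scores J = 0.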


# ---------------------------------------------------------------------------------------------
# The scaffold
# ---------------------------------------------------------------------------------------------

@register_scaffold("evox")
class EvoXScaffold(GalapagosScaffold):
    name = "evox"

    DEFAULT_SWITCH_RATIO = 0.10           # default switch_interval = 10% of max_iterations
    DEFAULT_IMPROVEMENT_THRESHOLD = 0.01  # τ

    @classmethod
    def build_components(cls, config: GalapagosConfig, model: GalapagosModel | None) -> dict:
        seed = int(config.seed)
        pop = config.population
        sel = config.selection_policy

        policy = EvoXPolicy(seed=seed,
                            num_context_programs=int(sel.num_context_programs))
        seed_source = (Path(__file__).parent / "seed_strategy.py").read_text(encoding="utf-8")
        return {
            "population": EvoXPopulation(
                seed_source=seed_source,
                rng=policy.rng,  # determinism: the strategy's randomness IS the policy's rng
                statistics_k=int(pop.statistics_k),
                improvement_threshold=float(pop.improvement_threshold),
            ),
            "selection_policy": policy,
            "prompt_builder": EvoXPromptBuilder(),
            "proposer": EvoXProposer(),
            "memory": EvoXStrategyMemory(),
        }

    def __init__(self, **kw):
        super().__init__(**kw)
        cfg = self.config
        self.improvement_threshold = float(cfg.population.improvement_threshold)
        self._statistics_k = int(cfg.population.statistics_k)
        self._meta_num_context = int(cfg.meta.meta_num_context_programs)
        self._meta_max_retries = int(cfg.meta.meta_max_retries)
        self._auto_operators = bool(cfg.meta.auto_generate_variation_operators)
        self._use_stats_insight = bool(cfg.meta.use_llm_stats_insight)
        self._use_problem_summary = bool(cfg.meta.use_problem_summary)
        self._use_batch_summaries = bool(cfg.meta.use_batch_summaries)
        self._max_strategy_chars = int(cfg.meta.max_strategy_chars)
        self._scorer = LogWindowScorer()
        self._switch_interval: int = 1            # resolved in setup (needs the final budget)
        self._completed_solution_iter = 0         # last completed solution iter (pre-advance; for _evolve_search)
        self._stagnant_count = 0
        self._last_tracked_best_score: float | None = None
        self._pending: dict | None = None         # deployed-but-unscored strategy
        self._best_search_score: float | None = None
        self._num_strategy_evolutions = 0
        self._counted_fallbacks = 0
        self._diverge_label = ""
        self._refine_label = ""
        self._start_stats: dict = {}
        self._problem_summary_cache: dict[str, str] = {}

    # ---- setup: window + one-time variation operators ------------------------------------------
    def setup(self, task) -> None:
        # EvoX co-evolution is strictly sequential: after_step advances state.iteration by
        # attempts_used and steps the shared J-scoring window / stagnation counter / pending-strategy
        # state, none of which is concurrency-safe. The base parallel loop would race these, so clamp.
        if int(self.general.max_parallel_iteration) > 1:
            log.warning("evox requires sequential iterations; forcing max_parallel_iteration=1 "
                        "(was %d)", self.general.max_parallel_iteration)
            self.general.max_parallel_iteration = 1
        super().setup(task)
        explicit = self.config.meta.switch_interval
        self._switch_interval = (int(explicit) if explicit else max(
            1, int(self.general.max_iterations * self.DEFAULT_SWITCH_RATIO)))
        self._start_stats = self._statistics()
        self._reset_window(start_iteration=0)
        self._generate_variation_operators()

    # ---- per-iteration hooks ---------------------------------------------------------------------
    def after_step(self, child: Genome, result) -> None:
        """Per-iteration scoring-window + budget accounting (``CoEvolutionController.run_discovery``'s
        ``for _ in range(attempts_used): _record_search_window_step()`` and ``iteration +=
        attempts_used``). Upstream's ``_run_iteration(retry_times=3)`` retries a parse/eval failure up
        to 3x against the SAME parent (the base ``_attempt`` now does exactly this via
        ``inner_retry_times=3`` + the fence-free :meth:`EvoXPromptBuilder.feedback_section`), and its
        ``attempts_used`` (1..3) advances BOTH the solution-iteration counter and the J-scoring window:
        a 3-retry failure burns 3 budget units and 3 window ticks. The base ``_attempt`` stamps the
        attempts it used on ``child.metadata['inner_attempts_used']``; this hook replays both. Best /
        ``state.best`` tracking is already admission-gated by the base loop, exactly like upstream's
        ``if was_added`` — no separate repair is needed."""
        sig = self.state.signals.setdefault("evox", {})
        attempts_used = 1
        if child is not None:
            try:
                attempts_used = max(1, int(child.metadata.get("inner_attempts_used", 1)))
            except (TypeError, ValueError):
                attempts_used = 1
        best_now = self._get_best_score()       # unchanged across the attempts of this iteration
        for _ in range(attempts_used):          # one window step per attempt (upstream attempts_used)
            self._scorer.record_step(best_now)
        # completed_solution_iter = the iteration BEFORE the counter advances (upstream passes it to
        # _evolve_search). NO offset: galapagos's state.iteration is ALREADY 1-based over solution
        # steps and so is upstream's, so the two window_start values match step-for-step. Galapagos
        # seeds the population OUTSIDE the loop and the base loop does state.iteration += 1 BEFORE each
        # step (first solution step -> 1). Upstream is identical: runner.py adds the seed at
        # iteration_found=0 OUTSIDE the loop, then starts the solution loop at
        # discovery_start = start_iteration + 1 = 1 (runner.py:167, should_add_initial=True for a
        # fresh run), so its first completed_solution_iter is 1 too (controller.py:125/150). The seed
        # STRATEGY window separately uses start_iteration=0 in both (_register_seed_strategy ==
        # controller.py:237). Do NOT subtract 1 here — that would make this carrier (and the two
        # display sinks it feeds: the meta-prompt "start at iteration N" line and the
        # window_start_iteration metric "ran from iteration X to Y") 0-based and DIVERGE from upstream.
        # Then advance state.iteration by the extra attempts the base loop has not yet counted (base
        # step already did +1; upstream advances by attempts_used).
        self._completed_solution_iter = self.state.iteration
        self.state.iteration += attempts_used - 1
        if self.state.best is not None:
            sig["global_best"] = self.state.best.fitness

    def periodic(self) -> None:
        sig = self.state.signals.setdefault("evox", {})
        # runtime strategy errors (policy sample / population add restored the fallback): count
        # the failed evolution and drop the pending entry — the broken strategy is never scored
        while self._counted_fallbacks < self.population.runtime_fallbacks:
            self._counted_fallbacks += 1
            self._num_strategy_evolutions += 1
            self._pending = None
            sig["last_runtime_error"] = self.population.last_runtime_error
        # stagnation check every iteration; never trigger on an iteration the run stops on —
        # for ANY stop reason (max_iterations, target_score, max_usd, wallclock, patience), not
        # just the iteration bound: a meta event fired here would burn meta/guide calls and
        # finalize a never-run J=0 ghost strategy into H. The base scaffold checks the same
        # side-effect-free _should_stop() right after periodic(), so this is exact (upstream
        # short-circuits before _should_evolve_search on the last loop pass).
        if not self._should_stop() and self._should_evolve_search():
            self._evolve_search(self._completed_solution_iter)
        sig.update(num_strategy_evolutions=self._num_strategy_evolutions,
                   strategies_in_memory=len(self.memory),
                   stagnant_count=self._stagnant_count,
                   switch_interval=self._switch_interval)

    def _finalize(self):
        # upstream run_discovery scores a still-pending strategy at the end of the run
        if self._pending is not None:
            self._finalize_pending()
        return super()._finalize()

    # ---- stagnation + scoring window ---------------------------------------------------------
    def _get_best_score(self) -> float:
        """``_get_best_score``: the best ``combined_score`` (0.0 when absent/non-numeric)."""
        best = self.population.best()
        if best is not None:
            score = best.scores.get("combined_score")
            if isinstance(score, (int, float)):
                return float(score)
        return 0.0

    def _reset_window(self, start_iteration: int | None = None) -> None:
        self._scorer.reset_window(self._get_best_score(), start_iteration=start_iteration)

    def _should_evolve_search(self) -> bool:
        """The verbatim consecutive-stagnation counter: gain > 0.01 absolute resets it, anything
        else increments; firing resets it again (so a failed generation event is naturally
        re-tried no sooner than ``switch_interval`` iterations later)."""
        current = self._get_best_score()
        if self._last_tracked_best_score is None:
            self._stagnant_count = 0
        elif (current - self._last_tracked_best_score) > self.improvement_threshold:
            self._stagnant_count = 0
        else:
            self._stagnant_count += 1
        self._last_tracked_best_score = current
        if self._stagnant_count >= self._switch_interval:
            self._stagnant_count = 0
            return True
        return False

    def _statistics(self) -> dict:
        """Current population φ statistics under the configured threshold and ``statistics_k``."""
        return self.population.statistics(improvement_threshold=self.improvement_threshold,
                                          k=self._statistics_k)

    # ---- the strategy-evolution event ------------------------------------------------------------
    def _evolve_search(self, solution_iter: int) -> None:
        """``_evolve_search``: register/score the seed on the first event, finalize any pending
        strategy, then generate + validate + hot-swap a new one."""
        if self.model is None:
            return
        log.info("strategy evolution triggered at iter %d (stagnation=%d)", solution_iter,
                 self._switch_interval)
        if len(self.memory) == 0:
            self._register_seed_strategy()
        elif self._pending is not None:
            self._finalize_pending()
        self._reset_window()
        self._generate_and_validate(solution_iter)

    def _make_entry(self, source: str, metrics: dict, start_stats: dict) -> dict:
        """Build one strategy-history entry: source, J metrics, and the start/end φ snapshots."""
        metrics = dict(metrics)
        # upstream uses both key names (scorer: search_horizon; formatters: search_window_horizon)
        metrics["search_window_horizon"] = metrics.get("search_horizon")
        return {"source": source, "metrics": metrics, "start_stats": start_stats,
                "end_stats": self._statistics()}

    def _register_seed_strategy(self) -> None:
        """``_initialize_first_search_program``: score the seed strategy over the window recorded
        so far and insert it into H before the first generation."""
        start = self._scorer.get_start_score() or 0.0
        metrics = self._scorer.compute_metrics(start_score=start, best_scores=None,
                                               horizon=self._switch_interval, start_iteration=0)
        entry = self._make_entry(self.population.seed_source, metrics, self._start_stats)
        self.memory.add_strategy(entry)
        self._best_search_score = float(metrics.get("combined_score") or 0.0)
        self._num_strategy_evolutions += 1

    def _finalize_pending(self) -> None:
        """``_finalize_pending_search`` + ``_assign_search_score``: deferred J over the deployed
        strategy's window + end-of-window φ, then into H."""
        if self._pending is None:
            return
        if self._scorer.get_window_size() > 0:
            metrics = self._scorer.compute_metrics(horizon=self._switch_interval)
        else:
            start = self._scorer.get_start_score() or 0.0
            metrics = self._scorer.compute_metrics(start_score=start,
                                                   best_scores=[self._get_best_score()],
                                                   horizon=self._switch_interval)
        score = float(metrics.get("combined_score", 0.0) or 0.0)
        entry = self._make_entry(self._pending["source"], metrics, self._pending["start_stats"])
        self.memory.add_strategy(entry)
        if self._best_search_score is None or score > self._best_search_score:
            self._best_search_score = score
        self._pending = None
        self._num_strategy_evolutions += 1

    def _generate_and_validate(self, solution_iter: int) -> None:
        """One meta event: argmax-J parent + <=2 random inspirations from H → meta prompt →
        ONE model call per attempt, up to ``meta_max_retries`` with failure feedback → Valid(·)
        → hot-swap. All failures keep the current strategy."""
        sig = self.state.signals.setdefault("evox", {})
        parent_entry = self.memory.best()
        if parent_entry is None:  # defensive; the seed is always registered first
            return
        # deterministic per-event rng for meta inspirations + the validator's behavioral tests
        event_rng = random.Random(self.seed * 7919 + self._num_strategy_evolutions * 101 + 17)
        inspirations = self.memory.inspirations(self._meta_num_context, event_rng)
        db_stats = self._statistics()

        failed: list[tuple[str, str]] = []
        source: str | None = None
        new_class = None
        for _ in range(max(1, self._meta_max_retries)):  # index unused; one meta call per pass
            user = self._build_meta_user(solution_iter, parent_entry, inspirations, db_stats,
                                         failed)
            try:
                gen = self.model.generate(Prompt(system=META_SYSTEM_PROMPT, user=user))
            except Exception as e:  # noqa: BLE001 — a flaky model call is a failed attempt
                failed.append((f"model call failed: {e}", ""))
                continue
            self.state.record_cost(gen.cost_usd, gen.prompt_tokens, gen.completion_tokens)
            blocks = _FENCE.findall(gen.text or "")
            code = blocks[-1].strip("\n") if blocks else ""
            if not code.strip():
                failed.append(("no fenced python code block found in the reply", ""))
                continue
            if len(code) > self._max_strategy_chars:  # max_solution_length (meta side)
                failed.append((f"generated strategy exceeds max_strategy_chars="
                               f"{self._max_strategy_chars}", code[:500]))
                continue
            ok, error, cls = validate_strategy(code, event_rng)
            if not ok:
                failed.append((error, code))
                continue
            # deploy the ALREADY-loaded class returned by Valid(·): re-exec'ing the source would
            # run the candidate's module level a second time, unguarded — any module-level
            # exception there would abort the run instead of counting as a failed attempt
            source, new_class = code, cls
            break

        if source is None:  # all attempts failed → keep the current strategy, count the event
            self._num_strategy_evolutions += 1
            sig["last_evolution"] = "generation_failed"
            return
        if not self.population.swap_strategy(new_class, source, rng=self.selection_policy.rng,
                                             diverge_label=self._diverge_label,
                                             refine_label=self._refine_label):
            self._num_strategy_evolutions += 1
            sig["last_evolution"] = "swap_failed"
            return
        self._pending = {"source": source, "start_stats": db_stats}
        self._reset_window(start_iteration=solution_iter)
        sig["last_evolution"] = "swapped"
        sig["strategy_swaps"] = int(sig.get("strategy_swaps", 0)) + 1

    # ---- meta prompt assembly --------------------------------------------------------------------
    def _build_meta_user(self, solution_iter: int, parent_entry: dict,
                         inspiration_entries: list[dict], db_stats: dict,
                         failed: list[tuple[str, str]]) -> str:
        """The ``search_evolution_user_message.txt`` assembly. Section order is adapted so the
        current strategy source is the LAST fenced python block (see module docstring)."""
        horizon = self._switch_interval
        filtered = filter_stats_by_horizon(db_stats, horizon)

        # population state: raw φ rendering, replaced by the guide-LLM insight when enabled
        stats_text = format_population_state(filtered)
        population_state = stats_text
        if self._use_stats_insight and stats_text:
            insight = self._guide(STATS_INSIGHT_SYSTEM,
                                  f"Population Statistics:\n\n{stats_text}")
            if insight.strip():
                population_state = _defence(insight)

        window = format_search_window_context(solution_iter, int(self.general.max_iterations),
                                              horizon, self.improvement_threshold)
        entries = self.memory.entries
        previous = entries[-1] if entries else None
        focus = identify_strategy_focus_areas(parent_entry, previous)

        parts = [
            "# DOWNSTREAM PROBLEM CONTEXT\n"
            "Your search algorithm evolves solutions for the downstream problem. Use this to "
            "inform your search strategy.\n\n" + self._problem_template(),
            "---------------------------",
            "# Search Algorithm Information\n\n## Your Algorithm's Search Window\n" + window,
            "## Solution Population Statistics\nThis describes the current state of the solution "
            "database your algorithm will work with.\n\n" + population_state,
        ]
        other_section = self._render_inspirations(inspiration_entries)
        if other_section:
            parts.append(other_section)
        if failed:
            lines = ["## Previous Failed Attempts",
                     "These rewrites were rejected during this evolution event. "
                     "Fix the cause; do not repeat the mistake:"]
            for i, (error, code) in enumerate(failed, 1):
                lines.append(f"### Attempt {i}\nError: {_defence(error)}")
                if code:
                    lines.append("Rejected code (truncated, fences stripped):\n"
                                 + _defence(code[:2000]))
            parts.append("\n".join(lines))
        parts.append("# What You Are Writing\n\n"
                     + format_current_strategy(parent_entry, focus))
        parts.append(META_TASK_TAIL)
        return "\n\n".join(parts)

    def _render_inspirations(self, entries: list[dict]) -> str:
        """Prior strategies from H: batch guide-LLM ``[PROGRAM N]`` summaries when enabled
        (full-source fallback on parse failure), each with its start→end φ diff."""
        if not entries:
            return ""
        summaries: dict[int, str] = {}
        if self._use_batch_summaries:
            blocks = []
            for idx, entry in enumerate(entries, start=1):
                start_stats, end_stats = entry.get("start_stats"), entry.get("end_stats")
                if not (start_stats and end_stats):
                    continue
                metrics = entry.get("metrics", {})
                h = int(metrics.get("search_window_horizon") or 0)
                diff_text = format_stats_diff(filter_stats_by_horizon(start_stats, h),
                                              filter_stats_by_horizon(end_stats, h), horizon=h)
                improvement = (float(metrics.get("search_window_end_score") or 0.0)
                               - float(metrics.get("search_window_start_score") or 0.0))
                blocks.append(
                    f"=== PROGRAM {idx} (score={float(metrics.get('combined_score') or 0.0):.4f},"
                    f" improvement={improvement:.4f}) ===\n"
                    + BATCH_PER_PROGRAM.format(task_description=META_SYSTEM_PROMPT,
                                               solution=entry.get("source", ""),
                                               db_stats_text=diff_text))
            if blocks:
                user = BATCH_INSTRUCTIONS.format(num_programs=len(blocks),
                                                 combined_content="\n".join(blocks))
                response = self._guide(BATCH_SUMMARY_SYSTEM, user)
                summaries = parse_batch_summaries(response, len(entries))
        return format_strategy_inspirations(entries, summaries)

    def _problem_template(self) -> str:
        """Problem context for the meta prompt: guide-LLM summary (cached per problem) when
        enabled, degrading to the raw description + evaluator source."""
        description = self.state.task_context or "(No problem description provided)"
        evaluator_code = self._evaluator_source()
        evaluator_context = (f"```python\n{evaluator_code}\n```" if evaluator_code
                             else "(No evaluator context provided)")
        raw = PROBLEM_TEMPLATE.format(problem_description=description,
                                      evaluator_context=evaluator_context)
        if not self._use_problem_summary:
            return raw
        key = hashlib.sha256(f"{description}|||{evaluator_context}".encode()).hexdigest()
        if key in self._problem_summary_cache:
            return self._problem_summary_cache[key]
        summary = self._guide(PROBLEM_SUMMARY_SYSTEM, raw)
        if summary.strip():  # cache ONLY a successful summary (upstream never caches failures)
            result = _defence(summary.strip())
            self._problem_summary_cache[key] = result
            return result
        return raw  # transient guide failure: degrade to raw, UNcached, so later events retry

    def _guide(self, system: str, user: str) -> str:
        """One guide-LLM call through ``self.model``. Returns ``""`` when no model is configured,
        when ``generate`` raises, or when the reply has no text (every caller then degrades to
        raw text). Cost is recorded only after ``generate`` returns."""
        if self.model is None:
            return ""
        try:
            gen = self.model.generate(Prompt(system=system, user=user))
        except Exception:  # noqa: BLE001 — guide compression must never crash the search
            return ""
        self.state.record_cost(gen.cost_usd, gen.prompt_tokens, gen.completion_tokens)
        return gen.text or ""

    def _evaluator_source(self) -> str:
        """The task evaluator's SOURCE CODE (``SubprocessEvaluator.evaluator_path``), capped and
        fence-sanitized; ``""`` on any attribute/IO failure (the section degrades gracefully)."""
        try:
            path = self.evaluator.evaluator_path
            with open(path, encoding="utf-8") as f:
                code = f.read()
        except Exception:  # noqa: BLE001 — no evaluator_path / unreadable file
            return ""
        if len(code) > _MAX_EVALUATOR_CODE_CHARS:
            code = code[:_MAX_EVALUATOR_CODE_CHARS] + "\n# ... (truncated)"
        return _defence(code)

    # ---- one-time variation-operator generation ---------------------------------------------------
    def _generate_variation_operators(self) -> None:
        """``_generate_variation_operators``: ONE guide call with the verbatim COMBINED prompt;
        parse the EXPLORATION/EXPLOITATION sections into the DIVERGE/REFINE templates. ANY
        failure → both labels ``""`` (free-form-only). ``auto_generate_variation_operators:
        false`` → the verbatim static DEFAULT templates (no LLM)."""
        # cached across setups; empty fallback labels are falsy, so a failed generation is retried
        if self._diverge_label and self._refine_label:
            self.population.assign_labels(self._diverge_label, self._refine_label)
            return
        if not self._auto_operators:
            self._diverge_label = DEFAULT_DIVERGE_TEMPLATE
            self._refine_label = DEFAULT_REFINE_TEMPLATE
            self.population.assign_labels(self._diverge_label, self._refine_label)
            return
        try:
            if self.model is None:
                raise RuntimeError("no model available for operator generation")
            user = _OPERATOR_USER.format(
                system_message=self.state.task_context or "",
                packages_list=_PACKAGES_LINE,
                evaluator_code=self._evaluator_source() or "(no evaluator source available)")
            gen = self.model.generate(Prompt(system=COMBINED_SYSTEM_PROMPT, user=user))
            self.state.record_cost(gen.cost_usd, gen.prompt_tokens, gen.completion_tokens)
            explore_examples, refine_examples = _parse_combined_response(gen.text or "")
            self._diverge_label = DIVERGE_TEMPLATE.replace("{GENERATED_EXAMPLES}",
                                                           explore_examples)
            self._refine_label = REFINE_TEMPLATE.replace("{GENERATED_EXAMPLES}", refine_examples)
        except Exception:  # noqa: BLE001 — reference fallback: empty labels, free-form only
            self._diverge_label = ""
            self._refine_label = ""
        self.population.assign_labels(self._diverge_label, self._refine_label)