EvoX
Co-evolves the search strategy with the solutions: the parent/context selection policy is itself LLM-written code, scored by windowed improvement and hot-swapped on stagnation.
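The idea of a selection policy that is itself swappable code can be sketched in a few lines. This is a toy illustration only: `load_strategy`, the two source strings, and the dict-based population are hypothetical stand-ins, not the port's `StrategyBase`, `load_strategy_from_source`, or `Genome` types.

```python
import random

# Two toy strategy genomes kept as plain text, the way an LLM would return them.
SEED_SOURCE = '''
class EvolvedStrategy:
    def sample(self, population, rng):
        return rng.choice(population)  # uniform random parent
'''

GREEDY_SOURCE = '''
class EvolvedStrategy:
    def sample(self, population, rng):
        return max(population, key=lambda p: p["score"])  # best-scoring parent
'''


def load_strategy(source):
    # Execute the source in a fresh namespace and instantiate its EvolvedStrategy class.
    namespace = {}
    exec(source, namespace)
    return namespace["EvolvedStrategy"]()


population = [{"id": "a", "score": 0.2}, {"id": "b", "score": 0.9}]
rng = random.Random(0)

active = load_strategy(SEED_SOURCE)
first_parent = active.sample(population, rng)

active = load_strategy(GREEDY_SOURCE)  # hot-swap: same sample() interface, new policy
second_parent = active.sample(population, rng)
print(first_parent["id"], second_parent["id"])
```

Because both genomes expose the same `sample()` method, the caller does not change when the policy is replaced; only the object bound to `active` does.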
"""EvoX — a faithful port of "EvoX: Meta-Evolution for Automated Discovery" (UC Berkeley),
following its reference implementation in SkyDiscover (``search/evox/``) wherever paper and code
diverge: J uses the code's ``(1 + ln(1 + max(0, start)))`` weight, stagnation is the per-iteration
consecutive-counter (not the paper's fixed windows), the meta-parent is the deterministic argmax-J
strategy, and the horizon normalizer is fixed at ``switch_interval`` even when a strategy runs
longer. One module per component:
    population.py       -> EvoXPopulation     (hosts the ACTIVE evolved strategy + φ statistics)
    selection_policy.py -> EvoXPolicy         (thin adapter over the strategy's sample())
    prompt_builder.py   -> EvoXPromptBuilder  (operator-labeled default template + all prompts)
    proposer.py         -> EvoXProposer       (diff proposer + parent_info/context_ids stamps)
    evaluator.py        -> EvoXEvaluator      (task-supplied)
    memory.py           -> EvoXStrategyMemory (the strategy history H)
    scaffold.py         -> EvoXScaffold       (the orchestrator that composes the six)
plus two non-component infra modules:
    strategy.py         -> StrategyBase + load_strategy_from_source + validate_strategy (Valid(·))
    seed_strategy.py    -> the S0 GENOME (read as text by the meta loop AND executed as the
                           initial strategy)
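The J weight named at the top of this docstring can be checked by hand with a minimal sketch. ``window_score`` and its ``include_one`` flag are hypothetical names for illustration, and the "paper" variant assumes that omitting the ``1 +`` leaves a weight of ``ln(1 + max(0, start))``.

```python
import math


def window_score(start, best_scores, horizon, include_one=True):
    # J = (running_best - start) * weight / sqrt(horizon)
    running_best = max([start, *best_scores])
    log_term = math.log(1.0 + max(0.0, start))
    weight = (1.0 + log_term) if include_one else log_term
    return (running_best - start) * weight / math.sqrt(int(horizon))


# Same +0.5 gain from a zero start, horizon fixed at 4 even though 6 steps were recorded.
steps = [0.0, 0.1, 0.5, 0.5, 0.5, 0.5]
code_j = window_score(0.0, steps, horizon=4)                      # weight = 1 + ln(1) = 1
paper_j = window_score(0.0, steps, horizon=4, include_one=False)  # weight = ln(1) = 0
print(code_j, paper_j)
```

Under that assumption, a strategy that starts from a best score of 0 would score J = 0 however much it improved, which is one reason a port might prefer the code's weight.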
The scaffold owns the ``CoEvolutionController`` control flow that sits outside select/observe:
* ``setup`` — resolve ``switch_interval`` (explicit config or ``max(1, int(0.10 * T))``), snapshot
the start φ, reset the scoring window, and run the ONE-time variation-operator generation
(one ``self.model`` call with the verbatim COMBINED/EXPLORE/EXPLOIT prompts; ANY failure → both
labels ``""``, the reference free-form-only fallback; ``auto_generate_variation_operators:
false`` → the verbatim static DEFAULT templates).
* ``after_step`` — scoring-window + budget accounting. Upstream's ``_run_iteration(retry_times=3)``
retries a parse/eval failure up to 3x against the SAME parent and reports ``attempts_used`` (1..3),
which steps the J-scoring window AND advances the solution-iteration counter by that many (a
3-retry failure burns 3 budget units + 3 window ticks). The base ``_attempt`` performs that
in-iteration retry (``inner_retry_times=3`` + the fence-free ``feedback_section`` override) and
stamps the attempts it used on ``child.metadata['inner_attempts_used']``; this hook ticks the
window that many times and advances ``state.iteration`` by ``attempts_used - 1`` (the base step
already counted +1). Best-tracking is admission-gated by the base loop, exactly like upstream's
``if was_added``, so no separate gated-best repair is needed.
* ``periodic`` — runs every iteration: detect runtime strategy fallbacks (count the failed
evolution, drop the pending entry), then the verbatim stagnation counter (consecutive
iterations with best-score gain <= 0.01 absolute; reset on improvement; trigger at
``switch_interval``; never on the final iteration). On trigger: finalize the pending strategy
(deferred J + end φ → memory; the FIRST event scores + inserts the seed strategy first), build
the meta prompt (adapted ``evox_search_sys_prompt`` system + the five-section user message with
guide-LLM compression, each call degrading gracefully to raw text), ONE meta model call per
attempt with up to ``meta_max_retries`` validation retries feeding failures back as
"## Previous Failed Attempts"; all failures keep the current strategy (the counter was reset at
trigger, so the next trigger is naturally >= switch_interval away). On success:
``population.swap_strategy`` (full migration + fallback) and a fresh scoring window.
* ``_finalize`` — a run-end pending strategy is finalized into H (upstream end of
``run_discovery``).
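The consecutive-stagnation counter described under ``periodic`` can be sketched in isolation. ``StagnationCounter`` is a hypothetical stand-alone class for illustration; in the scaffold the same logic lives in ``_should_evolve_search``.

```python
class StagnationCounter:
    def __init__(self, switch_interval, threshold=0.01):
        self.switch_interval = switch_interval
        self.threshold = threshold
        self.count = 0
        self.last_best = None

    def step(self, best):
        # The first call only records a baseline; a gain above the threshold resets the count.
        if self.last_best is None or best - self.last_best > self.threshold:
            self.count = 0
        else:
            self.count += 1
        self.last_best = best  # the gain is always measured against the previous step
        if self.count >= self.switch_interval:
            self.count = 0  # firing resets the counter, spacing out the next trigger
            return True
        return False


counter = StagnationCounter(switch_interval=3)
fired = [counter.step(b) for b in [0.5, 0.5, 0.505, 0.6, 0.6, 0.6, 0.6]]
print(fired)
```

Only the last step fires: the +0.005 creep counts as stagnation, the jump to 0.6 resets the count, and three flat steps after it reach the interval.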
Sanctioned adaptations (documented per the port contract): the meta system/user prompts are
adapted to the galapagos contract (``EvolvedStrategy(StrategyBase)`` + Genome field names) since
the LLM codes against OUR base class; the meta user message renders the inspiration strategies
BEFORE the current strategy and replaces the trailing fenced response example with an unfenced
instruction, so the CURRENT STRATEGY SOURCE is the LAST fenced python block (the
``parse_full_rewrite`` target); package discovery for operator generation is a static
standard-scientific-stack line; the upstream ``search_horizon``/``search_window_horizon`` key
mismatch is resolved by storing both keys on every strategy entry.
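Why the section order matters can be seen with a toy "last fenced block wins" rule. This is a simplified sketch, not the port's ``parse_full_rewrite`` or its ``_FENCE`` regex: ``last_python_block`` is a hypothetical helper that takes a list of lines and recognizes only ``python``/``py`` fences.

```python
FENCE = "`" * 3  # built at runtime so this sketch holds no literal fence


def last_python_block(lines):
    # Return the lines of the LAST fenced python block, or [] when there is none.
    last, current, inside = [], [], False
    for line in lines:
        marker = line.strip()
        if not inside and marker in (FENCE + "python", FENCE + "py"):
            inside, current = True, []
        elif inside and marker == FENCE:
            inside, last = False, current
        elif inside:
            current.append(line)
    return last


reply = [
    "## Inspiration strategy",
    FENCE + "python",
    "class Older: pass",
    FENCE,
    "## Current strategy",
    FENCE + "python",
    "class EvolvedStrategy: pass",
    FENCE,
    "Return the full rewritten class.",
]
print(last_python_block(reply))
```

With the inspirations rendered first and the trailing example left unfenced, the current strategy is the block such a rule picks up; a fenced response example at the end would take its place.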
Further sanctioned deviations:
* meta "Focus areas" baseline — the port compares the current strategy's J against the PREVIOUS
strategy entry in H. Upstream compares J against the newest SOLUTION program's
``combined_score`` (the meta loop reuses the solution database's ``previous_programs``), an
apples-to-oranges quirk that renders a "declined" line at essentially every event; that upstream
bug is deliberately NOT reproduced.
* the three guide-LLM compression calls run sequentially through ``_guide`` (the galapagos
framework is synchronous) rather than via ``asyncio.gather`` — identical prompt content and
graceful degradation; the only difference is latency.
* the upstream ``gap_to_SOTA`` φ lines never render: galapagos exposes no SOTA knob, so
``db_stats`` never carries an ``SOTA_score`` (upstream only renders them when it does).
* solution-side retry-with-failure-feedback runs IN-iteration, faithful to upstream: the base
``_attempt`` retries a parse/eval failure up to ``inner_retry_times`` (=3) times WITHIN the
iteration against the SAME sampled parent, folding each failed attempt into the retry prompt via
the fence-free :meth:`EvoXPromptBuilder.feedback_section` (rendered without code fences so the
current program stays the last fenced block). ``attempts_used`` (1..3) advances both the
solution-iteration counter and the J-scoring window (``after_step``), matching
``CoEvolutionController``'s ``iteration += attempts_used`` and per-attempt window stepping.
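The budget arithmetic behind ``attempts_used`` can be sketched as a pure function. ``account_for_attempts`` is a hypothetical helper for illustration; it assumes, as described above, that the base loop has already counted the iteration once before the hook runs.

```python
def account_for_attempts(iteration, window, best_now, attempts_used):
    # Clamp to at least one attempt, mirroring the defensive default in the hook.
    attempts_used = max(1, int(attempts_used))
    window = window + [best_now] * attempts_used  # one scoring-window tick per attempt
    iteration = iteration + (attempts_used - 1)   # the base step already added +1
    return iteration, window


iteration, window = account_for_attempts(
    iteration=5, window=[0.4], best_now=0.4, attempts_used=3)
print(iteration, window)
```

A step that used all three attempts therefore moves the iteration counter from 5 to 7 and adds three entries to the window, while a first-try success leaves the counter where the base loop put it.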
"""
from __future__ import annotations
import hashlib
import logging
import math
import random
import re
from pathlib import Path
from ...config import GalapagosConfig
from ...models import GalapagosModel
from ...models.base import Prompt
from ...records import Genome
from ..base_scaffold import GalapagosScaffold
from ..registry import register_scaffold
# one module per component (the EvoX scaffold method)
from .memory import EvoXStrategyMemory
from .population import EvoXPopulation
from .prompt_builder import (BATCH_INSTRUCTIONS, BATCH_PER_PROGRAM, BATCH_SUMMARY_SYSTEM,
COMBINED_SYSTEM_PROMPT, DEFAULT_DIVERGE_TEMPLATE,
DEFAULT_REFINE_TEMPLATE, DIVERGE_TEMPLATE, META_SYSTEM_PROMPT,
META_TASK_TAIL, PROBLEM_SUMMARY_SYSTEM, PROBLEM_TEMPLATE,
REFINE_TEMPLATE, STATS_INSIGHT_SYSTEM, EvoXPromptBuilder, _defence,
filter_stats_by_horizon, format_current_strategy,
format_population_state, format_search_window_context,
format_stats_diff, format_strategy_inspirations,
identify_strategy_focus_areas, parse_batch_summaries)
from .proposer import EvoXProposer
from .selection_policy import EvoXPolicy
from .strategy import validate_strategy
log = logging.getLogger(__name__)
_FENCE = re.compile(r"```(?:python|py)?\s*\n(.*?)```", re.DOTALL)
_MAX_EVALUATOR_CODE_CHARS = 12000
# package discovery adapted to galapagos (upstream reads requirements.txt/pyproject/uv pip list)
_PACKAGES_LINE = ("Standard scientific Python stack (numpy, scipy, pandas, sympy, networkx, "
"scikit-learn). Do not assume packages that require extra installation.")
# variation_operator_generator._build_operator_prompt — verbatim structure
_OPERATOR_USER = """\
Please analyze this problem and generate BOTH guidance blocks.
## Problem Description:
```
{system_message}
```
## Available Packages in Environment
The following packages are available in the current uv environment:
```
{packages_list}
```
## Evaluator Code:
```python
{evaluator_code}
```
Generate BOTH the EXPLORATION (different approaches) and EXPLOITATION (refinement/intensification) guidance blocks now.
For EXPLORATION guidance block, focus on DIFFERENT algorithmic approaches and structural changes.
For EXPLOITATION guidance block, focus on INTENSIFYING within existing approaches - e.g., computational budget (e.g., increase max iterations), better seeds, tighter tolerances, local polish stages.
"""
# ---------------------------------------------------------------------------------------------
# Operator-response parsing — variation_operator_generator.py ports
# ---------------------------------------------------------------------------------------------
def _extract_examples(response: str, is_diverge: bool = True) -> str:
    """``_extract_examples``: keep from the ``EXAMPLES OF ...`` line onward (fences stripped),
    falling back to the whole section."""
    needle = "EXAMPLES OF DIFFERENT" if is_diverge else "EXAMPLES OF REFINEMENT"
    examples_lines: list[str] = []
    in_examples = False
    for line in response.strip().split("\n"):
        if needle in line.upper():
            in_examples = True
            examples_lines.append(line)
        elif in_examples:
            if line.strip().startswith("Format:") or line.strip().startswith("Your solution"):
                break
            if line.strip() == "```":
                continue
            examples_lines.append(line)
    if examples_lines:
        while examples_lines and not examples_lines[-1].strip():
            examples_lines.pop()
        return "\n".join(examples_lines)
    return response.strip()


def _parse_combined_response(response: str) -> tuple[str, str]:
    """``_parse_combined_response``: split on the EXPLORATION/EXPLOITATION headers."""
    exploration = exploitation = ""
    current_section: str | None = None
    current_lines: list[str] = []
    for line in response.split("\n"):
        upper = line.upper().strip()
        if "### EXPLORATION" in upper or "EXPLORATION (DIVERGE" in upper:
            if current_section == "exploitation":
                exploitation = "\n".join(current_lines)
            current_section, current_lines = "exploration", []
        elif "### EXPLOITATION" in upper or "EXPLOITATION (REFINE" in upper:
            if current_section == "exploration":
                exploration = "\n".join(current_lines)
            current_section, current_lines = "exploitation", []
        elif current_section:
            current_lines.append(line)
    if current_section == "exploration":
        exploration = "\n".join(current_lines)
    elif current_section == "exploitation":
        exploitation = "\n".join(current_lines)
    return _extract_examples(exploration, True), _extract_examples(exploitation, False)
# ---------------------------------------------------------------------------------------------
# LogWindowScorer — search/evox/utils/search_scorer.py, verbatim semantics
# ---------------------------------------------------------------------------------------------
class LogWindowScorer:
    """J = (running_best - start) * (1 + ln(1 + max(0, start))) / sqrt(int(horizon)). This is
    the CODE formula (the paper omits the ``1 +``); the normalizer stays ``switch_interval`` even
    when the strategy ran longer (an intentional bonus for long-lived improving strategies)."""

    def __init__(self) -> None:
        self._start_score: float | None = None
        self._start_iteration: int | None = None
        self._best_scores: list[float] = []

    def reset_window(self, start_score: float | None, start_iteration: int | None = None) -> None:
        self._start_score = float(start_score) if start_score is not None else 0.0
        self._start_iteration = start_iteration
        self._best_scores = []

    def record_step(self, best_score: float | None) -> None:
        if self._start_score is None:
            self.reset_window(best_score)
        if best_score is None:
            best_score = self._best_scores[-1] if self._best_scores else self._start_score
        self._best_scores.append(float(best_score))

    def get_window_size(self) -> int:
        return len(self._best_scores)

    def get_start_score(self) -> float | None:
        return self._start_score

    def compute_metrics(self, start_score: float | None = None,
                        best_scores: list[float] | None = None, horizon: int | None = None,
                        start_iteration: int | None = None) -> dict:
        if start_iteration is None:
            start_iteration = self._start_iteration
        start = float(start_score if start_score is not None else (self._start_score or 0.0))
        scores_to_use = best_scores if best_scores is not None else self._best_scores
        observed = len(scores_to_use) if scores_to_use else 0
        horizon_int = int(horizon) if horizon else max(1, observed)
        running_best = start
        for score in scores_to_use:
            running_best = max(running_best, float(score))
        improvement = running_best - start
        log_weight = 1.0 + math.log(1.0 + max(0.0, start))
        combined_score = improvement * log_weight / math.sqrt(horizon_int)
        return {
            "combined_score": combined_score,
            "window_start_iteration": start_iteration,
            "search_window_start_score": start,
            "search_window_end_score": running_best,
            "search_horizon": horizon_int,
        }
# ---------------------------------------------------------------------------------------------
# The scaffold
# ---------------------------------------------------------------------------------------------
@register_scaffold("evox")
class EvoXScaffold(GalapagosScaffold):
name = "evox"
    DEFAULT_SWITCH_RATIO = 0.10  # evolve the search after stagnating for 10% of total iterations
DEFAULT_IMPROVEMENT_THRESHOLD = 0.01 # τ
@classmethod
def build_components(cls, config: GalapagosConfig, model: GalapagosModel | None) -> dict:
seed = int(config.seed)
pop = config.population
sel = config.selection_policy
policy = EvoXPolicy(seed=seed,
num_context_programs=int(sel.num_context_programs))
seed_source = (Path(__file__).parent / "seed_strategy.py").read_text(encoding="utf-8")
return {
"population": EvoXPopulation(
seed_source=seed_source,
rng=policy.rng, # determinism: the strategy's randomness IS the policy's rng
statistics_k=int(pop.statistics_k),
improvement_threshold=float(pop.improvement_threshold),
),
"selection_policy": policy,
"prompt_builder": EvoXPromptBuilder(),
"proposer": EvoXProposer(),
"memory": EvoXStrategyMemory(),
}
def __init__(self, **kw):
super().__init__(**kw)
cfg = self.config
self.improvement_threshold = float(cfg.population.improvement_threshold)
self._statistics_k = int(cfg.population.statistics_k)
self._meta_num_context = int(cfg.meta.meta_num_context_programs)
self._meta_max_retries = int(cfg.meta.meta_max_retries)
self._auto_operators = bool(cfg.meta.auto_generate_variation_operators)
self._use_stats_insight = bool(cfg.meta.use_llm_stats_insight)
self._use_problem_summary = bool(cfg.meta.use_problem_summary)
self._use_batch_summaries = bool(cfg.meta.use_batch_summaries)
self._max_strategy_chars = int(cfg.meta.max_strategy_chars)
self._scorer = LogWindowScorer()
self._switch_interval: int = 1 # resolved in setup (needs the final budget)
self._completed_solution_iter = 0 # last completed solution iter (pre-advance; for _evolve_search)
self._stagnant_count = 0
self._last_tracked_best_score: float | None = None
self._pending: dict | None = None # deployed-but-unscored strategy
self._best_search_score: float | None = None
self._num_strategy_evolutions = 0
self._counted_fallbacks = 0
self._diverge_label = ""
self._refine_label = ""
self._start_stats: dict = {}
self._problem_summary_cache: dict[str, str] = {}
# ---- setup: window + one-time variation operators ------------------------------------------
def setup(self, task) -> None:
# EvoX co-evolution is strictly sequential: after_step advances state.iteration by
# attempts_used and steps the shared J-scoring window / stagnation counter / pending-strategy
# state, none of which is concurrency-safe. The base parallel loop would race these, so clamp.
if int(self.general.max_parallel_iteration) > 1:
log.warning("evox requires sequential iterations; forcing max_parallel_iteration=1 "
"(was %d)", self.general.max_parallel_iteration)
self.general.max_parallel_iteration = 1
super().setup(task)
explicit = self.config.meta.switch_interval
self._switch_interval = (int(explicit) if explicit else max(
1, int(self.general.max_iterations * self.DEFAULT_SWITCH_RATIO)))
self._start_stats = self._statistics()
self._reset_window(start_iteration=0)
self._generate_variation_operators()
# ---- per-iteration hooks ---------------------------------------------------------------------
def after_step(self, child: Genome, result) -> None:
"""Per-iteration scoring-window + budget accounting (``CoEvolutionController.run_discovery``'s
``for _ in range(attempts_used): _record_search_window_step()`` and ``iteration +=
attempts_used``). Upstream's ``_run_iteration(retry_times=3)`` retries a parse/eval failure up
to 3x against the SAME parent (the base ``_attempt`` now does exactly this via
``inner_retry_times=3`` + the fence-free :meth:`EvoXPromptBuilder.feedback_section`), and its
``attempts_used`` (1..3) advances BOTH the solution-iteration counter and the J-scoring window:
a 3-retry failure burns 3 budget units and 3 window ticks. The base ``_attempt`` stamps the
attempts it used on ``child.metadata['inner_attempts_used']``; this hook replays both. Best /
``state.best`` tracking is already admission-gated by the base loop, exactly like upstream's
``if was_added`` — no separate repair is needed."""
sig = self.state.signals.setdefault("evox", {})
attempts_used = 1
if child is not None:
try:
attempts_used = max(1, int(child.metadata.get("inner_attempts_used", 1)))
except (TypeError, ValueError):
attempts_used = 1
best_now = self._get_best_score() # unchanged across the attempts of this iteration
for _ in range(attempts_used): # one window step per attempt (upstream attempts_used)
self._scorer.record_step(best_now)
# completed_solution_iter = the iteration BEFORE the counter advances (upstream passes it to
# _evolve_search). NO offset: galapagos's state.iteration is ALREADY 1-based over solution
# steps and so is upstream's, so the two window_start values match step-for-step. Galapagos
# seeds the population OUTSIDE the loop and the base loop does state.iteration += 1 BEFORE each
# step (first solution step -> 1). Upstream is identical: runner.py adds the seed at
# iteration_found=0 OUTSIDE the loop, then starts the solution loop at
# discovery_start = start_iteration + 1 = 1 (runner.py:167, should_add_initial=True for a
# fresh run), so its first completed_solution_iter is 1 too (controller.py:125/150). The seed
# STRATEGY window separately uses start_iteration=0 in both (_register_seed_strategy ==
# controller.py:237). Do NOT subtract 1 here — that would make this carrier (and the two
# display sinks it feeds: the meta-prompt "start at iteration N" line and the
# window_start_iteration metric "ran from iteration X to Y") 0-based and DIVERGE from upstream.
# Then advance state.iteration by the extra attempts the base loop has not yet counted (base
# step already did +1; upstream advances by attempts_used).
self._completed_solution_iter = self.state.iteration
self.state.iteration += attempts_used - 1
if self.state.best is not None:
sig["global_best"] = self.state.best.fitness
def periodic(self) -> None:
sig = self.state.signals.setdefault("evox", {})
# runtime strategy errors (policy sample / population add restored the fallback): count
# the failed evolution and drop the pending entry — the broken strategy is never scored
while self._counted_fallbacks < self.population.runtime_fallbacks:
self._counted_fallbacks += 1
self._num_strategy_evolutions += 1
self._pending = None
sig["last_runtime_error"] = self.population.last_runtime_error
# stagnation check every iteration; never trigger on an iteration the run stops on —
# for ANY stop reason (max_iterations, target_score, max_usd, wallclock, patience), not
# just the iteration bound: a meta event fired here would burn meta/guide calls and
# finalize a never-run J=0 ghost strategy into H. The base scaffold checks the same
# side-effect-free _should_stop() right after periodic(), so this is exact (upstream
# short-circuits before _should_evolve_search on the last loop pass).
if not self._should_stop() and self._should_evolve_search():
self._evolve_search(self._completed_solution_iter)
sig.update(num_strategy_evolutions=self._num_strategy_evolutions,
strategies_in_memory=len(self.memory),
stagnant_count=self._stagnant_count,
switch_interval=self._switch_interval)
def _finalize(self):
# upstream run_discovery scores a still-pending strategy at the end of the run
if self._pending is not None:
self._finalize_pending()
return super()._finalize()
# ---- stagnation + scoring window ---------------------------------------------------------
def _get_best_score(self) -> float:
"""``_get_best_score``: the best ``combined_score`` (0.0 when absent/non-numeric)."""
best = self.population.best()
if best is not None:
score = best.scores.get("combined_score")
if isinstance(score, (int, float)):
return float(score)
return 0.0
def _reset_window(self, start_iteration: int | None = None) -> None:
self._scorer.reset_window(self._get_best_score(), start_iteration=start_iteration)
def _should_evolve_search(self) -> bool:
"""The verbatim consecutive-stagnation counter: gain > 0.01 absolute resets it, anything
else increments; firing resets it again (so a failed generation event is naturally
re-tried no sooner than ``switch_interval`` iterations later)."""
current = self._get_best_score()
if self._last_tracked_best_score is None:
self._stagnant_count = 0
elif (current - self._last_tracked_best_score) > self.improvement_threshold:
self._stagnant_count = 0
else:
self._stagnant_count += 1
self._last_tracked_best_score = current
if self._stagnant_count >= self._switch_interval:
self._stagnant_count = 0
return True
return False
def _statistics(self) -> dict:
return self.population.statistics(improvement_threshold=self.improvement_threshold,
k=self._statistics_k)
# ---- the strategy-evolution event ------------------------------------------------------------
def _evolve_search(self, solution_iter: int) -> None:
"""``_evolve_search``: register/score the seed on the first event, finalize any pending
strategy, then generate + validate + hot-swap a new one."""
if self.model is None:
return
log.info("strategy evolution triggered at iter %d (stagnation=%d)", solution_iter,
self._switch_interval)
if len(self.memory) == 0:
self._register_seed_strategy()
elif self._pending is not None:
self._finalize_pending()
self._reset_window()
self._generate_and_validate(solution_iter)
def _make_entry(self, source: str, metrics: dict, start_stats: dict) -> dict:
metrics = dict(metrics)
# upstream uses both key names (scorer: search_horizon; formatters: search_window_horizon)
metrics["search_window_horizon"] = metrics.get("search_horizon")
return {"source": source, "metrics": metrics, "start_stats": start_stats,
"end_stats": self._statistics()}
def _register_seed_strategy(self) -> None:
"""``_initialize_first_search_program``: score the seed strategy over the window recorded
so far and insert it into H before the first generation."""
start = self._scorer.get_start_score() or 0.0
metrics = self._scorer.compute_metrics(start_score=start, best_scores=None,
horizon=self._switch_interval, start_iteration=0)
entry = self._make_entry(self.population.seed_source, metrics, self._start_stats)
self.memory.add_strategy(entry)
self._best_search_score = float(metrics.get("combined_score") or 0.0)
self._num_strategy_evolutions += 1
def _finalize_pending(self) -> None:
"""``_finalize_pending_search`` + ``_assign_search_score``: deferred J over the deployed
strategy's window + end-of-window φ, then into H."""
if self._pending is None:
return
if self._scorer.get_window_size() > 0:
metrics = self._scorer.compute_metrics(horizon=self._switch_interval)
else:
start = self._scorer.get_start_score() or 0.0
metrics = self._scorer.compute_metrics(start_score=start,
best_scores=[self._get_best_score()],
horizon=self._switch_interval)
score = float(metrics.get("combined_score", 0.0) or 0.0)
entry = self._make_entry(self._pending["source"], metrics, self._pending["start_stats"])
self.memory.add_strategy(entry)
if self._best_search_score is None or score > self._best_search_score:
self._best_search_score = score
self._pending = None
self._num_strategy_evolutions += 1
def _generate_and_validate(self, solution_iter: int) -> None:
"""One meta event: argmax-J parent + <=2 random inspirations from H → meta prompt →
ONE model call per attempt, up to ``meta_max_retries`` with failure feedback → Valid(·)
→ hot-swap. All failures keep the current strategy."""
sig = self.state.signals.setdefault("evox", {})
parent_entry = self.memory.best()
if parent_entry is None: # defensive; the seed is always registered first
return
# deterministic per-event rng for meta inspirations + the validator's behavioral tests
event_rng = random.Random(self.seed * 7919 + self._num_strategy_evolutions * 101 + 17)
inspirations = self.memory.inspirations(self._meta_num_context, event_rng)
db_stats = self._statistics()
failed: list[tuple[str, str]] = []
source: str | None = None
new_class = None
for _attempt in range(max(1, self._meta_max_retries)):
user = self._build_meta_user(solution_iter, parent_entry, inspirations, db_stats,
failed)
try:
gen = self.model.generate(Prompt(system=META_SYSTEM_PROMPT, user=user))
except Exception as e: # noqa: BLE001 — a flaky model call is a failed attempt
failed.append((f"model call failed: {e}", ""))
continue
self.state.record_cost(gen.cost_usd, gen.prompt_tokens, gen.completion_tokens)
blocks = _FENCE.findall(gen.text or "")
code = blocks[-1].strip("\n") if blocks else ""
if not code.strip():
failed.append(("no fenced python code block found in the reply", ""))
continue
if len(code) > self._max_strategy_chars: # max_solution_length (meta side)
failed.append((f"generated strategy exceeds max_strategy_chars="
f"{self._max_strategy_chars}", code[:500]))
continue
ok, error, cls = validate_strategy(code, event_rng)
if not ok:
failed.append((error, code))
continue
# deploy the ALREADY-loaded class returned by Valid(·): re-exec'ing the source would
# run the candidate's module level a second time, unguarded — any module-level
# exception there would abort the run instead of counting as a failed attempt
source, new_class = code, cls
break
if source is None: # all attempts failed → keep the current strategy, count the event
self._num_strategy_evolutions += 1
sig["last_evolution"] = "generation_failed"
return
if not self.population.swap_strategy(new_class, source, rng=self.selection_policy.rng,
diverge_label=self._diverge_label,
refine_label=self._refine_label):
self._num_strategy_evolutions += 1
sig["last_evolution"] = "swap_failed"
return
self._pending = {"source": source, "start_stats": db_stats}
self._reset_window(start_iteration=solution_iter)
sig["last_evolution"] = "swapped"
sig["strategy_swaps"] = int(sig.get("strategy_swaps", 0)) + 1
# ---- meta prompt assembly --------------------------------------------------------------------
def _build_meta_user(self, solution_iter: int, parent_entry: dict,
inspiration_entries: list[dict], db_stats: dict,
failed: list[tuple[str, str]]) -> str:
"""The ``search_evolution_user_message.txt`` assembly. Section order is adapted so the
current strategy source is the LAST fenced python block (see module docstring)."""
horizon = self._switch_interval
filtered = filter_stats_by_horizon(db_stats, horizon)
# population state: raw φ rendering, replaced by the guide-LLM insight when enabled
stats_text = format_population_state(filtered)
population_state = stats_text
if self._use_stats_insight and stats_text:
insight = self._guide(STATS_INSIGHT_SYSTEM,
f"Population Statistics:\n\n{stats_text}")
if insight.strip():
population_state = _defence(insight)
window = format_search_window_context(solution_iter, int(self.general.max_iterations),
horizon, self.improvement_threshold)
entries = self.memory.entries
previous = entries[-1] if entries else None
focus = identify_strategy_focus_areas(parent_entry, previous)
parts = [
"# DOWNSTREAM PROBLEM CONTEXT\n"
"Your search algorithm evolves solutions for the downstream problem. Use this to "
"inform your search strategy.\n\n" + self._problem_template(),
"---------------------------",
"# Search Algorithm Information\n\n## Your Algorithm's Search Window\n" + window,
"## Solution Population Statistics\nThis describes the current state of the solution "
"database your algorithm will work with.\n\n" + population_state,
]
other_section = self._render_inspirations(inspiration_entries)
if other_section:
parts.append(other_section)
if failed:
lines = ["## Previous Failed Attempts",
"These rewrites were rejected during this evolution event. "
"Fix the cause; do not repeat the mistake:"]
for i, (error, code) in enumerate(failed, 1):
lines.append(f"### Attempt {i}\nError: {_defence(error)}")
if code:
lines.append("Rejected code (truncated, fences stripped):\n"
+ _defence(code[:2000]))
parts.append("\n".join(lines))
parts.append("# What You Are Writing\n\n"
+ format_current_strategy(parent_entry, focus))
parts.append(META_TASK_TAIL)
return "\n\n".join(parts)
def _render_inspirations(self, entries: list[dict]) -> str:
"""Prior strategies from H: batch guide-LLM ``[PROGRAM N]`` summaries when enabled
(full-source fallback on parse failure), each with its start→end φ diff."""
if not entries:
return ""
summaries: dict[int, str] = {}
if self._use_batch_summaries:
blocks = []
for idx, entry in enumerate(entries, start=1):
start_stats, end_stats = entry.get("start_stats"), entry.get("end_stats")
if not (start_stats and end_stats):
continue
metrics = entry.get("metrics", {})
h = int(metrics.get("search_window_horizon") or 0)
diff_text = format_stats_diff(filter_stats_by_horizon(start_stats, h),
filter_stats_by_horizon(end_stats, h), horizon=h)
improvement = (float(metrics.get("search_window_end_score") or 0.0)
- float(metrics.get("search_window_start_score") or 0.0))
blocks.append(
f"=== PROGRAM {idx} (score={float(metrics.get('combined_score') or 0.0):.4f},"
f" improvement={improvement:.4f}) ===\n"
+ BATCH_PER_PROGRAM.format(task_description=META_SYSTEM_PROMPT,
solution=entry.get("source", ""),
db_stats_text=diff_text))
if blocks:
user = BATCH_INSTRUCTIONS.format(num_programs=len(blocks),
combined_content="\n".join(blocks))
response = self._guide(BATCH_SUMMARY_SYSTEM, user)
summaries = parse_batch_summaries(response, len(entries))
return format_strategy_inspirations(entries, summaries)
def _problem_template(self) -> str:
"""Problem context for the meta prompt: guide-LLM summary (cached per problem) when
enabled, degrading to the raw description + evaluator source."""
description = self.state.task_context or "(No problem description provided)"
evaluator_code = self._evaluator_source()
evaluator_context = (f"```python\n{evaluator_code}\n```" if evaluator_code
else "(No evaluator context provided)")
raw = PROBLEM_TEMPLATE.format(problem_description=description,
evaluator_context=evaluator_context)
if not self._use_problem_summary:
return raw
key = hashlib.sha256(f"{description}|||{evaluator_context}".encode()).hexdigest()
if key in self._problem_summary_cache:
return self._problem_summary_cache[key]
summary = self._guide(PROBLEM_SUMMARY_SYSTEM, raw)
        if summary.strip():  # cache ONLY a successful summary (upstream never caches failures)
result = _defence(summary.strip())
self._problem_summary_cache[key] = result
return result
return raw # transient guide failure: degrade to raw, UNcached, so later events retry
def _guide(self, system: str, user: str) -> str:
"""One guide-LLM call through ``self.model``; ``""`` on ANY failure (the callers all
degrade gracefully to raw text). Cost is recorded on success."""
if self.model is None:
return ""
try:
gen = self.model.generate(Prompt(system=system, user=user))
except Exception: # noqa: BLE001 — guide compression must never crash the search
return ""
self.state.record_cost(gen.cost_usd, gen.prompt_tokens, gen.completion_tokens)
return gen.text or ""
def _evaluator_source(self) -> str:
"""The task evaluator's SOURCE CODE (``SubprocessEvaluator.evaluator_path``), capped and
fence-sanitized; ``""`` on any attribute/IO failure (the section degrades gracefully)."""
try:
path = self.evaluator.evaluator_path
with open(path, encoding="utf-8") as f:
code = f.read()
except Exception: # noqa: BLE001 — no evaluator_path / unreadable file
return ""
if len(code) > _MAX_EVALUATOR_CODE_CHARS:
code = code[:_MAX_EVALUATOR_CODE_CHARS] + "\n# ... (truncated)"
return _defence(code)
# ---- one-time variation-operator generation ---------------------------------------------------
def _generate_variation_operators(self) -> None:
"""``_generate_variation_operators``: ONE guide call with the verbatim COMBINED prompt;
parse the EXPLORATION/EXPLOITATION sections into the DIVERGE/REFINE templates. ANY
failure → both labels ``""`` (free-form-only). ``auto_generate_variation_operators:
false`` → the verbatim static DEFAULT templates (no LLM)."""
if self._diverge_label and self._refine_label: # cached across setups
self.population.assign_labels(self._diverge_label, self._refine_label)
return
if not self._auto_operators:
self._diverge_label = DEFAULT_DIVERGE_TEMPLATE
self._refine_label = DEFAULT_REFINE_TEMPLATE
self.population.assign_labels(self._diverge_label, self._refine_label)
return
try:
if self.model is None:
raise RuntimeError("no model available for operator generation")
user = _OPERATOR_USER.format(
system_message=self.state.task_context or "",
packages_list=_PACKAGES_LINE,
evaluator_code=self._evaluator_source() or "(no evaluator source available)")
gen = self.model.generate(Prompt(system=COMBINED_SYSTEM_PROMPT, user=user))
self.state.record_cost(gen.cost_usd, gen.prompt_tokens, gen.completion_tokens)
explore_examples, refine_examples = _parse_combined_response(gen.text or "")
self._diverge_label = DIVERGE_TEMPLATE.replace("{GENERATED_EXAMPLES}",
explore_examples)
self._refine_label = REFINE_TEMPLATE.replace("{GENERATED_EXAMPLES}", refine_examples)
except Exception: # noqa: BLE001 — reference fallback: empty labels, free-form only
self._diverge_label = ""
self._refine_label = ""
self.population.assign_labels(self._diverge_label, self._refine_label)