SkyDiscover/evox

EvoX

Co-evolves the search strategy with the solutions: the parent/context selection policy is itself LLM-written code, scored by windowed improvement and hot-swapped on stagnation.

Test-time search · Apache-2.0
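As a rough illustration of "scored by windowed improvement and hot-swapped on stagnation", the sketch below recomputes the window score J and the consecutive-stagnation trigger that the module source describes. The names `window_score` and `StagnationCounter` are hypothetical helpers written for this example and are not part of the package; the formula and the 0.01 threshold are taken from the module itself.

```python
import math


def window_score(start: float, best_scores: list, horizon: int) -> float:
    """J = (running_best - start) * (1 + ln(1 + max(0, start))) / sqrt(horizon)."""
    running_best = max([start, *best_scores])
    log_weight = 1.0 + math.log(1.0 + max(0.0, start))
    return (running_best - start) * log_weight / math.sqrt(max(1, int(horizon)))


class StagnationCounter:
    """Hypothetical stand-in for the scaffold's consecutive-stagnation check."""

    def __init__(self, switch_interval: int, threshold: float = 0.01) -> None:
        self.switch_interval = switch_interval
        self.threshold = threshold
        self.count = 0
        self.last_best = None  # no best score observed yet

    def step(self, best: float) -> bool:
        """Return True when a strategy-evolution event should fire."""
        if self.last_best is None or best - self.last_best > self.threshold:
            self.count = 0  # first observation or a real improvement
        else:
            self.count += 1  # gain <= threshold counts as a stagnant iteration
        self.last_best = best
        if self.count >= self.switch_interval:
            self.count = 0  # firing resets the counter
            return True
        return False


# A window that starts at 0.5 and peaks at 0.8, normalized by a horizon of 4.
j = window_score(0.5, [0.5, 0.7, 0.8, 0.8], horizon=4)

# The best score improves once and then stays flat for three iterations.
counter = StagnationCounter(switch_interval=3)
fired = [counter.step(b) for b in [0.5, 0.8, 0.8, 0.8, 0.8]]
```

Because the counter is reset when it fires, a failed strategy-generation event is not retried until at least `switch_interval` further stagnant iterations have passed.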

"""EvoX - a faithful port of "EvoX: Meta-Evolution for Automated Discovery" (UC Berkeley),
following its reference implementation in SkyDiscover (``search/evox/``) wherever paper and code
diverge: J uses the code's ``(1 + ln(1 + max(0, start)))`` weight, stagnation is the
per-iteration consecutive-counter (not the paper's fixed windows), the meta-parent is the
deterministic argmax-J strategy, and the horizon normalizer is fixed at ``switch_interval`` even
when a strategy runs longer.

One module per component:

    population.py       -> EvoXPopulation     (hosts the ACTIVE evolved strategy + φ statistics)
    selection_policy.py -> EvoXPolicy         (thin adapter over the strategy's sample())
    prompt_builder.py   -> EvoXPromptBuilder  (operator-labeled default template + all prompts)
    proposer.py         -> EvoXProposer       (diff proposer + parent_info/context_ids stamps)
    evaluator.py        -> EvoXEvaluator      (task-supplied)
    memory.py           -> EvoXStrategyMemory (the strategy history H)
    scaffold.py         -> EvoXScaffold       (the orchestrator that composes the six)

plus two non-component infra modules:

    strategy.py      -> StrategyBase + load_strategy_from_source + validate_strategy (Valid(·))
    seed_strategy.py -> the S0 GENOME (read as text by the meta loop AND executed as the
                        initial strategy)

The scaffold owns the ``CoEvolutionController`` control flow that sits outside select/observe:

* ``setup`` - resolve ``switch_interval`` (explicit config or ``max(1, int(0.10 * T))``),
  snapshot the start φ, reset the scoring window, and run the ONE-time variation-operator
  generation (one ``self.model`` call with the verbatim COMBINED/EXPLORE/EXPLOIT prompts; ANY
  failure → both labels ``""``, the reference free-form-only fallback;
  ``auto_generate_variation_operators: false`` → the verbatim static DEFAULT templates).
* ``after_step`` - scoring-window + budget accounting. Upstream's
  ``_run_iteration(retry_times=3)`` retries a parse/eval failure up to 3x against the SAME parent
  and reports ``attempts_used`` (1..3), which steps the J-scoring window AND advances the
  solution-iteration counter by that many (a 3-retry failure burns 3 budget units + 3 window
  ticks). The base ``_attempt`` performs that in-iteration retry (``inner_retry_times=3`` + the
  fence-free ``feedback_section`` override) and stamps the attempts it used on
  ``child.metadata['inner_attempts_used']``; this hook ticks the window that many times and
  advances ``state.iteration`` by ``attempts_used - 1`` (the base step already counted +1).
  Best-tracking is admission-gated by the base loop, exactly like upstream's ``if was_added``,
  so no separate gated-best repair is needed.
* ``periodic`` - runs every iteration: detect runtime strategy fallbacks (count the failed
  evolution, drop the pending entry), then the verbatim stagnation counter (consecutive
  iterations with best-score gain <= 0.01 absolute; reset on improvement; trigger at
  ``switch_interval``; never on the final iteration). On trigger: finalize the pending strategy
  (deferred J + end φ → memory; the FIRST event scores + inserts the seed strategy first), build
  the meta prompt (adapted ``evox_search_sys_prompt`` system + the five-section user message with
  guide-LLM compression, each call degrading gracefully to raw text), ONE meta model call per
  attempt with up to ``meta_max_retries`` validation retries feeding failures back as
  "## Previous Failed Attempts"; all failures keep the current strategy (the counter was reset at
  trigger, so the next trigger is naturally >= switch_interval away). On success:
  ``population.swap_strategy`` (full migration + fallback) and a fresh scoring window.
* ``_finalize`` - a run-end pending strategy is finalized into H (upstream end of
  ``run_discovery``).

Sanctioned adaptations (documented per the port contract): the meta system/user prompts are
adapted to the galapagos contract (``EvolvedStrategy(StrategyBase)`` + Genome field names) since
the LLM codes against OUR base class; the meta user message renders the inspiration strategies
BEFORE the current strategy and replaces the trailing fenced response example with an unfenced
instruction, so the CURRENT STRATEGY SOURCE is the LAST fenced python block (the
``parse_full_rewrite`` target); package discovery for operator generation is a static
standard-scientific-stack line; the upstream ``search_horizon``/``search_window_horizon`` key
mismatch is resolved by storing both keys on every strategy entry.

Further sanctioned deviations:

* meta "Focus areas" baseline - the port compares the current strategy's J against the PREVIOUS
  strategy entry in H. Upstream compares J against the newest SOLUTION program's
  ``combined_score`` (the meta loop reuses the solution database's ``previous_programs``), an
  apples-to-oranges quirk that renders a "declined" line at essentially every event; that
  upstream bug is deliberately NOT reproduced.
* the three guide-LLM compression calls run sequentially through ``_guide`` (the galapagos
  framework is synchronous) rather than via ``asyncio.gather`` - identical prompt content and
  graceful degradation; the only difference is latency.
* the upstream ``gap_to_SOTA`` φ lines never render: galapagos exposes no SOTA knob, so
  ``db_stats`` never carries an ``SOTA_score`` (upstream only renders them when it does).
* solution-side retry-with-failure-feedback is now IN-iteration, faithful to upstream: the base
  ``_attempt`` retries a parse/eval failure up to ``inner_retry_times`` (=3) times WITHIN the
  iteration against the SAME sampled parent, folding each failed attempt into the retry prompt
  via the fence-free :meth:`EvoXPromptBuilder.feedback_section` (rendered without code fences so
  the current program stays the last fenced block). ``attempts_used`` (1..3) advances both the
  solution-iteration counter and the J-scoring window (``after_step``), matching
  ``CoEvolutionController``'s ``iteration += attempts_used`` and per-attempt window stepping.
"""
from __future__ import annotations

import hashlib
import logging
import math
import random
import re
from pathlib import Path

from ...config import GalapagosConfig
from ...models import GalapagosModel
from ...models.base import Prompt
from ...records import Genome
from ..base_scaffold import GalapagosScaffold
from ..registry import register_scaffold
# one module per component (the EvoX scaffold method)
from .memory import EvoXStrategyMemory
from .population import EvoXPopulation
from .prompt_builder import (BATCH_INSTRUCTIONS, BATCH_PER_PROGRAM, BATCH_SUMMARY_SYSTEM,
                             COMBINED_SYSTEM_PROMPT, DEFAULT_DIVERGE_TEMPLATE,
                             DEFAULT_REFINE_TEMPLATE, DIVERGE_TEMPLATE, META_SYSTEM_PROMPT,
                             META_TASK_TAIL, PROBLEM_SUMMARY_SYSTEM, PROBLEM_TEMPLATE,
                             REFINE_TEMPLATE, STATS_INSIGHT_SYSTEM, EvoXPromptBuilder, _defence,
                             filter_stats_by_horizon, format_current_strategy,
                             format_population_state, format_search_window_context,
                             format_stats_diff, format_strategy_inspirations,
                             identify_strategy_focus_areas, parse_batch_summaries)
from .proposer import EvoXProposer
from .selection_policy import EvoXPolicy
from .strategy import validate_strategy

log = logging.getLogger(__name__)

_FENCE = re.compile(r"```(?:python|py)?\s*\n(.*?)```", re.DOTALL)
_MAX_EVALUATOR_CODE_CHARS = 12000

# package discovery adapted to galapagos (upstream reads requirements.txt/pyproject/uv pip list)
_PACKAGES_LINE = ("Standard scientific Python stack (numpy, scipy, pandas, sympy, networkx, "
                  "scikit-learn). Do not assume packages that require extra installation.")

# variation_operator_generator._build_operator_prompt - verbatim structure
_OPERATOR_USER = """\
Please analyze this problem and generate BOTH guidance blocks.

## Problem Description:
```
{system_message}
```

## Available Packages in Environment
The following packages are available in the current uv environment:
```
{packages_list}
```

## Evaluator Code:
```python
{evaluator_code}
```

Generate BOTH the EXPLORATION (different approaches) and EXPLOITATION (refinement/intensification) guidance blocks now.

For EXPLORATION guidance block, focus on DIFFERENT algorithmic approaches and structural changes.

For EXPLOITATION guidance block, focus on INTENSIFYING within existing approaches - e.g., computational budget (e.g., increase max iterations), better seeds, tighter tolerances, local polish stages.
"""


# ---------------------------------------------------------------------------------------------
# Operator-response parsing - variation_operator_generator.py ports
# ---------------------------------------------------------------------------------------------

def _extract_examples(response: str, is_diverge: bool = True) -> str:
    """``_extract_examples``: keep from the ``EXAMPLES OF ...`` line onward (fences stripped),
    falling back to the whole section."""
    needle = "EXAMPLES OF DIFFERENT" if is_diverge else "EXAMPLES OF REFINEMENT"
    examples_lines: list[str] = []
    in_examples = False
    for line in response.strip().split("\n"):
        if needle in line.upper():
            in_examples = True
            examples_lines.append(line)
        elif in_examples:
            if line.strip().startswith("Format:") or line.strip().startswith("Your solution"):
                break
            if line.strip() == "```":
                continue
            examples_lines.append(line)
    if examples_lines:
        while examples_lines and not examples_lines[-1].strip():
            examples_lines.pop()
        return "\n".join(examples_lines)
    return response.strip()


def _parse_combined_response(response: str) -> tuple[str, str]:
    """``_parse_combined_response``: split on the EXPLORATION/EXPLOITATION headers."""
    exploration = exploitation = ""
    current_section: str | None = None
    current_lines: list[str] = []
    for line in response.split("\n"):
        upper = line.upper().strip()
        if "### EXPLORATION" in upper or "EXPLORATION (DIVERGE" in upper:
            if current_section == "exploitation":
                exploitation = "\n".join(current_lines)
            current_section, current_lines = "exploration", []
        elif "### EXPLOITATION" in upper or "EXPLOITATION (REFINE" in upper:
            if current_section == "exploration":
                exploration = "\n".join(current_lines)
            current_section, current_lines = "exploitation", []
        elif current_section:
            current_lines.append(line)
    if current_section == "exploration":
        exploration = "\n".join(current_lines)
    elif current_section == "exploitation":
        exploitation = "\n".join(current_lines)
    return _extract_examples(exploration, True), _extract_examples(exploitation, False)


# ---------------------------------------------------------------------------------------------
# LogWindowScorer - search/evox/utils/search_scorer.py, verbatim semantics
# ---------------------------------------------------------------------------------------------

class LogWindowScorer:
    """J = (running_best - start) * (1 + ln(1 + max(0, start))) / sqrt(int(horizon)) - the CODE
    formula (the paper omits the ``1 +``); the normalizer stays ``switch_interval`` even when the
    strategy ran longer (an intentional bonus for long-lived improving strategies)."""

    def __init__(self) -> None:
        self._start_score: float | None = None
        self._start_iteration: int | None = None
        self._best_scores: list[float] = []

    def reset_window(self, start_score: float | None,
                     start_iteration: int | None = None) -> None:
        self._start_score = float(start_score) if start_score is not None else 0.0
        self._start_iteration = start_iteration
        self._best_scores = []

    def record_step(self, best_score: float | None) -> None:
        if self._start_score is None:
            self.reset_window(best_score)
        if best_score is None:
            best_score = self._best_scores[-1] if self._best_scores else self._start_score
        self._best_scores.append(float(best_score))

    def get_window_size(self) -> int:
        return len(self._best_scores)

    def get_start_score(self) -> float | None:
        return self._start_score

    def compute_metrics(self, start_score: float | None = None,
                        best_scores: list[float] | None = None,
                        horizon: int | None = None,
                        start_iteration: int | None = None) -> dict:
        if start_iteration is None:
            start_iteration = self._start_iteration
        start = float(start_score if start_score is not None else (self._start_score or 0.0))
        scores_to_use = best_scores if best_scores is not None else self._best_scores
        observed = len(scores_to_use) if scores_to_use else 0
        horizon_int = int(horizon) if horizon else max(1, observed)
        running_best = start
        for score in scores_to_use:
            running_best = max(running_best, float(score))
        improvement = running_best - start
        log_weight = 1.0 + math.log(1.0 + max(0.0, start))
        combined_score = improvement * log_weight / math.sqrt(horizon_int)
        return {
            "combined_score": combined_score,
            "window_start_iteration": start_iteration,
            "search_window_start_score": start,
            "search_window_end_score": running_best,
            "search_horizon": horizon_int,
        }


# ---------------------------------------------------------------------------------------------
# The scaffold
# ---------------------------------------------------------------------------------------------

@register_scaffold("evox")
class EvoXScaffold(GalapagosScaffold):
    name = "evox"
    DEFAULT_SWITCH_RATIO = 0.10  # evolve search after 10% of total iterations stagnate
    DEFAULT_IMPROVEMENT_THRESHOLD = 0.01  # τ

    @classmethod
    def build_components(cls, config: GalapagosConfig, model: GalapagosModel | None) -> dict:
        seed = int(config.seed)
        pop = config.population
        sel = config.selection_policy
        policy = EvoXPolicy(seed=seed, num_context_programs=int(sel.num_context_programs))
        seed_source = (Path(__file__).parent / "seed_strategy.py").read_text(encoding="utf-8")
        return {
            "population": EvoXPopulation(
                seed_source=seed_source,
                rng=policy.rng,  # determinism: the strategy's randomness IS the policy's rng
                statistics_k=int(pop.statistics_k),
                improvement_threshold=float(pop.improvement_threshold),
            ),
            "selection_policy": policy,
            "prompt_builder": EvoXPromptBuilder(),
            "proposer": EvoXProposer(),
            "memory": EvoXStrategyMemory(),
        }

    def __init__(self, **kw):
        super().__init__(**kw)
        cfg = self.config
        self.improvement_threshold = float(cfg.population.improvement_threshold)
        self._statistics_k = int(cfg.population.statistics_k)
        self._meta_num_context = int(cfg.meta.meta_num_context_programs)
        self._meta_max_retries = int(cfg.meta.meta_max_retries)
        self._auto_operators = bool(cfg.meta.auto_generate_variation_operators)
        self._use_stats_insight = bool(cfg.meta.use_llm_stats_insight)
        self._use_problem_summary = bool(cfg.meta.use_problem_summary)
        self._use_batch_summaries = bool(cfg.meta.use_batch_summaries)
        self._max_strategy_chars = int(cfg.meta.max_strategy_chars)
        self._scorer = LogWindowScorer()
        self._switch_interval: int = 1  # resolved in setup (needs the final budget)
        # last completed solution iter (pre-advance; for _evolve_search)
        self._completed_solution_iter = 0
        self._stagnant_count = 0
        self._last_tracked_best_score: float | None = None
        self._pending: dict | None = None  # deployed-but-unscored strategy
        self._best_search_score: float | None = None
        self._num_strategy_evolutions = 0
        self._counted_fallbacks = 0
        self._diverge_label = ""
        self._refine_label = ""
        self._start_stats: dict = {}
        self._problem_summary_cache: dict[str, str] = {}

    # ---- setup: window + one-time variation operators ------------------------------------------

    def setup(self, task) -> None:
        # EvoX co-evolution is strictly sequential: after_step advances state.iteration by
        # attempts_used and steps the shared J-scoring window / stagnation counter / pending-strategy
        # state, none of which is concurrency-safe. The base parallel loop would race these, so clamp.
        if int(self.general.max_parallel_iteration) > 1:
            log.warning("evox requires sequential iterations; forcing max_parallel_iteration=1 "
                        "(was %d)", self.general.max_parallel_iteration)
            self.general.max_parallel_iteration = 1
        super().setup(task)
        explicit = self.config.meta.switch_interval
        self._switch_interval = (int(explicit) if explicit else max(
            1, int(self.general.max_iterations * self.DEFAULT_SWITCH_RATIO)))
        self._start_stats = self._statistics()
        self._reset_window(start_iteration=0)
        self._generate_variation_operators()

    # ---- per-iteration hooks ---------------------------------------------------------------------

    def after_step(self, child: Genome, result) -> None:
        """Per-iteration scoring-window + budget accounting
        (``CoEvolutionController.run_discovery``'s
        ``for _ in range(attempts_used): _record_search_window_step()`` and
        ``iteration += attempts_used``).

        Upstream's ``_run_iteration(retry_times=3)`` retries a parse/eval failure up to 3x against
        the SAME parent (the base ``_attempt`` now does exactly this via ``inner_retry_times=3`` +
        the fence-free :meth:`EvoXPromptBuilder.feedback_section`), and its ``attempts_used``
        (1..3) advances BOTH the solution-iteration counter and the J-scoring window: a 3-retry
        failure burns 3 budget units and 3 window ticks. The base ``_attempt`` stamps the attempts
        it used on ``child.metadata['inner_attempts_used']``; this hook replays both.

        Best / ``state.best`` tracking is already admission-gated by the base loop, exactly like
        upstream's ``if was_added`` - no separate repair is needed."""
        sig = self.state.signals.setdefault("evox", {})
        attempts_used = 1
        if child is not None:
            try:
                attempts_used = max(1, int(child.metadata.get("inner_attempts_used", 1)))
            except (TypeError, ValueError):
                attempts_used = 1
        best_now = self._get_best_score()  # unchanged across the attempts of this iteration
        for _ in range(attempts_used):  # one window step per attempt (upstream attempts_used)
            self._scorer.record_step(best_now)
        # completed_solution_iter = the iteration BEFORE the counter advances (upstream passes it to
        # _evolve_search). NO offset: galapagos's state.iteration is ALREADY 1-based over solution
        # steps and so is upstream's, so the two window_start values match step-for-step. Galapagos
        # seeds the population OUTSIDE the loop and the base loop does state.iteration += 1 BEFORE each
        # step (first solution step -> 1). Upstream is identical: runner.py adds the seed at
        # iteration_found=0 OUTSIDE the loop, then starts the solution loop at
        # discovery_start = start_iteration + 1 = 1 (runner.py:167, should_add_initial=True for a
        # fresh run), so its first completed_solution_iter is 1 too (controller.py:125/150). The seed
        # STRATEGY window separately uses start_iteration=0 in both (_register_seed_strategy ==
        # controller.py:237). Do NOT subtract 1 here - that would make this carrier (and the two
        # display sinks it feeds: the meta-prompt "start at iteration N" line and the
        # window_start_iteration metric "ran from iteration X to Y") 0-based and DIVERGE from upstream.
        # Then advance state.iteration by the extra attempts the base loop has not yet counted (base
        # step already did +1; upstream advances by attempts_used).
        self._completed_solution_iter = self.state.iteration
        self.state.iteration += attempts_used - 1
        if self.state.best is not None:
            sig["global_best"] = self.state.best.fitness

    def periodic(self) -> None:
        sig = self.state.signals.setdefault("evox", {})
        # runtime strategy errors (policy sample / population add restored the fallback): count
        # the failed evolution and drop the pending entry - the broken strategy is never scored
        while self._counted_fallbacks < self.population.runtime_fallbacks:
            self._counted_fallbacks += 1
            self._num_strategy_evolutions += 1
            self._pending = None
            sig["last_runtime_error"] = self.population.last_runtime_error
        # stagnation check every iteration; never trigger on an iteration the run stops on -
        # for ANY stop reason (max_iterations, target_score, max_usd, wallclock, patience), not
        # just the iteration bound: a meta event fired here would burn meta/guide calls and
        # finalize a never-run J=0 ghost strategy into H. The base scaffold checks the same
        # side-effect-free _should_stop() right after periodic(), so this is exact (upstream
        # short-circuits before _should_evolve_search on the last loop pass).
        if not self._should_stop() and self._should_evolve_search():
            self._evolve_search(self._completed_solution_iter)
        sig.update(num_strategy_evolutions=self._num_strategy_evolutions,
                   strategies_in_memory=len(self.memory),
                   stagnant_count=self._stagnant_count,
                   switch_interval=self._switch_interval)

    def _finalize(self):
        # upstream run_discovery scores a still-pending strategy at the end of the run
        if self._pending is not None:
            self._finalize_pending()
        return super()._finalize()

    # ---- stagnation + scoring window ---------------------------------------------------------

    def _get_best_score(self) -> float:
        """``_get_best_score``: the best ``combined_score`` (0.0 when absent/non-numeric)."""
        best = self.population.best()
        if best is not None:
            score = best.scores.get("combined_score")
            if isinstance(score, (int, float)):
                return float(score)
        return 0.0

    def _reset_window(self, start_iteration: int | None = None) -> None:
        self._scorer.reset_window(self._get_best_score(), start_iteration=start_iteration)

    def _should_evolve_search(self) -> bool:
        """The verbatim consecutive-stagnation counter: gain > 0.01 absolute resets it, anything
        else increments; firing resets it again (so a failed generation event is naturally
        re-tried no sooner than ``switch_interval`` iterations later)."""
        current = self._get_best_score()
        if self._last_tracked_best_score is None:
            self._stagnant_count = 0
        elif (current - self._last_tracked_best_score) > self.improvement_threshold:
            self._stagnant_count = 0
        else:
            self._stagnant_count += 1
        self._last_tracked_best_score = current
        if self._stagnant_count >= self._switch_interval:
            self._stagnant_count = 0
            return True
        return False

    def _statistics(self) -> dict:
        return self.population.statistics(improvement_threshold=self.improvement_threshold,
                                          k=self._statistics_k)

    # ---- the strategy-evolution event ------------------------------------------------------------

    def _evolve_search(self, solution_iter: int) -> None:
        """``_evolve_search``: register/score the seed on the first event, finalize any pending
        strategy, then generate + validate + hot-swap a new one."""
        if self.model is None:
            return
        log.info("strategy evolution triggered at iter %d (stagnation=%d)",
                 solution_iter, self._switch_interval)
        if len(self.memory) == 0:
            self._register_seed_strategy()
        elif self._pending is not None:
            self._finalize_pending()
        self._reset_window()
        self._generate_and_validate(solution_iter)

    def _make_entry(self, source: str, metrics: dict, start_stats: dict) -> dict:
        metrics = dict(metrics)
        # upstream uses both key names (scorer: search_horizon; formatters: search_window_horizon)
        metrics["search_window_horizon"] = metrics.get("search_horizon")
        return {"source": source, "metrics": metrics, "start_stats": start_stats,
                "end_stats": self._statistics()}

    def _register_seed_strategy(self) -> None:
        """``_initialize_first_search_program``: score the seed strategy over the window recorded
        so far and insert it into H before the first generation."""
        start = self._scorer.get_start_score() or 0.0
        metrics = self._scorer.compute_metrics(start_score=start, best_scores=None,
                                               horizon=self._switch_interval, start_iteration=0)
        entry = self._make_entry(self.population.seed_source, metrics, self._start_stats)
        self.memory.add_strategy(entry)
        self._best_search_score = float(metrics.get("combined_score") or 0.0)
        self._num_strategy_evolutions += 1

    def _finalize_pending(self) -> None:
        """``_finalize_pending_search`` + ``_assign_search_score``: deferred J over the deployed
        strategy's window + end-of-window φ, then into H."""
        if self._pending is None:
            return
        if self._scorer.get_window_size() > 0:
            metrics = self._scorer.compute_metrics(horizon=self._switch_interval)
        else:
            start = self._scorer.get_start_score() or 0.0
            metrics = self._scorer.compute_metrics(start_score=start,
                                                   best_scores=[self._get_best_score()],
                                                   horizon=self._switch_interval)
        score = float(metrics.get("combined_score", 0.0) or 0.0)
        entry = self._make_entry(self._pending["source"], metrics, self._pending["start_stats"])
        self.memory.add_strategy(entry)
        if self._best_search_score is None or score > self._best_search_score:
            self._best_search_score = score
        self._pending = None
        self._num_strategy_evolutions += 1

    def _generate_and_validate(self, solution_iter: int) -> None:
        """One meta event: argmax-J parent + <=2 random inspirations from H → meta prompt → ONE
        model call per attempt, up to ``meta_max_retries`` with failure feedback → Valid(·) →
        hot-swap. All failures keep the current strategy."""
        sig = self.state.signals.setdefault("evox", {})
        parent_entry = self.memory.best()
        if parent_entry is None:  # defensive; the seed is always registered first
            return
        # deterministic per-event rng for meta inspirations + the validator's behavioral tests
        event_rng = random.Random(self.seed * 7919 + self._num_strategy_evolutions * 101 + 17)
        inspirations = self.memory.inspirations(self._meta_num_context, event_rng)
        db_stats = self._statistics()
        failed: list[tuple[str, str]] = []
        source: str | None = None
        new_class = None
        for _attempt in range(max(1, self._meta_max_retries)):
            user = self._build_meta_user(solution_iter, parent_entry, inspirations, db_stats,
                                         failed)
            try:
                gen = self.model.generate(Prompt(system=META_SYSTEM_PROMPT, user=user))
            except Exception as e:  # noqa: BLE001 - a flaky model call is a failed attempt
                failed.append((f"model call failed: {e}", ""))
                continue
            self.state.record_cost(gen.cost_usd, gen.prompt_tokens, gen.completion_tokens)
            blocks = _FENCE.findall(gen.text or "")
            code = blocks[-1].strip("\n") if blocks else ""
            if not code.strip():
                failed.append(("no fenced python code block found in the reply", ""))
                continue
            if len(code) > self._max_strategy_chars:  # max_solution_length (meta side)
                failed.append((f"generated strategy exceeds max_strategy_chars="
                               f"{self._max_strategy_chars}", code[:500]))
                continue
            ok, error, cls = validate_strategy(code, event_rng)
            if not ok:
                failed.append((error, code))
                continue
            # deploy the ALREADY-loaded class returned by Valid(·): re-exec'ing the source would
            # run the candidate's module level a second time, unguarded - any module-level
            # exception there would abort the run instead of counting as a failed attempt
            source, new_class = code, cls
            break
        if source is None:  # all attempts failed → keep the current strategy, count the event
            self._num_strategy_evolutions += 1
            sig["last_evolution"] = "generation_failed"
            return
        if not self.population.swap_strategy(new_class, source, rng=self.selection_policy.rng,
                                             diverge_label=self._diverge_label,
                                             refine_label=self._refine_label):
            self._num_strategy_evolutions += 1
            sig["last_evolution"] = "swap_failed"
            return
        self._pending = {"source": source, "start_stats": db_stats}
        self._reset_window(start_iteration=solution_iter)
        sig["last_evolution"] = "swapped"
        sig["strategy_swaps"] = int(sig.get("strategy_swaps", 0)) + 1

    # ---- meta prompt assembly --------------------------------------------------------------------

    def _build_meta_user(self, solution_iter: int, parent_entry: dict,
                         inspiration_entries: list[dict], db_stats: dict,
                         failed: list[tuple[str, str]]) -> str:
        """The ``search_evolution_user_message.txt`` assembly. Section order is adapted so the
        current strategy source is the LAST fenced python block (see module docstring)."""
        horizon = self._switch_interval
        filtered = filter_stats_by_horizon(db_stats, horizon)
        # population state: raw φ rendering, replaced by the guide-LLM insight when enabled
        stats_text = format_population_state(filtered)
        population_state = stats_text
        if self._use_stats_insight and stats_text:
            insight = self._guide(STATS_INSIGHT_SYSTEM, f"Population Statistics:\n\n{stats_text}")
            if insight.strip():
                population_state = _defence(insight)
        window = format_search_window_context(solution_iter, int(self.general.max_iterations),
                                              horizon, self.improvement_threshold)
        entries = self.memory.entries
        previous = entries[-1] if entries else None
        focus = identify_strategy_focus_areas(parent_entry, previous)
        parts = [
            "# DOWNSTREAM PROBLEM CONTEXT\n"
            "Your search algorithm evolves solutions for the downstream problem. Use this to "
            "inform your search strategy.\n\n" + self._problem_template(),
            "---------------------------",
            "# Search Algorithm Information\n\n## Your Algorithm's Search Window\n" + window,
            "## Solution Population Statistics\nThis describes the current state of the solution "
            "database your algorithm will work with.\n\n" + population_state,
        ]
        other_section = self._render_inspirations(inspiration_entries)
        if other_section:
            parts.append(other_section)
        if failed:
            lines = ["## Previous Failed Attempts",
                     "These rewrites were rejected during this evolution event. "
                     "Fix the cause; do not repeat the mistake:"]
            for i, (error, code) in enumerate(failed, 1):
                lines.append(f"### Attempt {i}\nError: {_defence(error)}")
                if code:
                    lines.append("Rejected code (truncated, fences stripped):\n"
                                 + _defence(code[:2000]))
            parts.append("\n".join(lines))
        parts.append("# What You Are Writing\n\n" + format_current_strategy(parent_entry, focus))
        parts.append(META_TASK_TAIL)
        return "\n\n".join(parts)

    def _render_inspirations(self, entries: list[dict]) -> str:
        """Prior strategies from H: batch guide-LLM ``[PROGRAM N]`` summaries when enabled
        (full-source fallback on parse failure), each with its start→end φ diff."""
        if not entries:
            return ""
        summaries: dict[int, str] = {}
        if self._use_batch_summaries:
            blocks = []
            for idx, entry in enumerate(entries, start=1):
                start_stats, end_stats = entry.get("start_stats"), entry.get("end_stats")
                if not (start_stats and end_stats):
                    continue
                metrics = entry.get("metrics", {})
                h = int(metrics.get("search_window_horizon") or 0)
                diff_text = format_stats_diff(filter_stats_by_horizon(start_stats, h),
                                              filter_stats_by_horizon(end_stats, h), horizon=h)
                improvement = (float(metrics.get("search_window_end_score") or 0.0)
                               - float(metrics.get("search_window_start_score") or 0.0))
                blocks.append(
                    f"=== PROGRAM {idx} (score={float(metrics.get('combined_score') or 0.0):.4f},"
                    f" improvement={improvement:.4f}) ===\n"
                    + BATCH_PER_PROGRAM.format(task_description=META_SYSTEM_PROMPT,
                                               solution=entry.get("source", ""),
                                               db_stats_text=diff_text))
            if blocks:
                user = BATCH_INSTRUCTIONS.format(num_programs=len(blocks),
                                                 combined_content="\n".join(blocks))
                response = self._guide(BATCH_SUMMARY_SYSTEM, user)
                summaries = parse_batch_summaries(response, len(entries))
        return format_strategy_inspirations(entries, summaries)

    def _problem_template(self) -> str:
        """Problem context for the meta prompt: guide-LLM summary (cached per problem) when
        enabled, degrading to the raw description + evaluator source."""
        description = self.state.task_context or "(No problem description provided)"
        evaluator_code = self._evaluator_source()
        evaluator_context = (f"```python\n{evaluator_code}\n```" if evaluator_code
                             else "(No evaluator context provided)")
        raw = PROBLEM_TEMPLATE.format(problem_description=description,
                                      evaluator_context=evaluator_context)
        if not self._use_problem_summary:
            return raw
        key = hashlib.sha256(f"{description}|||{evaluator_context}".encode()).hexdigest()
        if key in self._problem_summary_cache:
            return self._problem_summary_cache[key]
        summary = self._guide(PROBLEM_SUMMARY_SYSTEM, raw)
        if summary.strip():  # cache ONLY a successful summary (upstream never caches failures);
            result = _defence(summary.strip())
            self._problem_summary_cache[key] = result
            return result
        return raw  # transient guide failure: degrade to raw, UNcached, so later events retry

    def _guide(self, system: str, user: str) -> str:
        """One guide-LLM call through ``self.model``; ``""`` on ANY failure (the callers all
        degrade gracefully to raw text). Cost is recorded on success."""
        if self.model is None:
            return ""
        try:
            gen = self.model.generate(Prompt(system=system, user=user))
        except Exception:  # noqa: BLE001 - guide compression must never crash the search
            return ""
        self.state.record_cost(gen.cost_usd, gen.prompt_tokens, gen.completion_tokens)
        return gen.text or ""

    def _evaluator_source(self) -> str:
        """The task evaluator's SOURCE CODE (``SubprocessEvaluator.evaluator_path``), capped and
        fence-sanitized; ``""`` on any attribute/IO failure (the section degrades gracefully)."""
        try:
            path = self.evaluator.evaluator_path
            with open(path, encoding="utf-8") as f:
                code = f.read()
        except Exception:  # noqa: BLE001 - no evaluator_path / unreadable file
            return ""
        if len(code) > _MAX_EVALUATOR_CODE_CHARS:
            code = code[:_MAX_EVALUATOR_CODE_CHARS] + "\n# ... (truncated)"
        return _defence(code)

    # ---- one-time variation-operator generation ---------------------------------------------------

    def _generate_variation_operators(self) -> None:
        """``_generate_variation_operators``: ONE guide call with the verbatim COMBINED prompt;
        parse the EXPLORATION/EXPLOITATION sections into the DIVERGE/REFINE templates. ANY failure
        → both labels ``""`` (free-form-only). ``auto_generate_variation_operators: false`` → the
        verbatim static DEFAULT templates (no LLM)."""
        if self._diverge_label and self._refine_label:  # cached across setups
            self.population.assign_labels(self._diverge_label, self._refine_label)
            return
        if not self._auto_operators:
            self._diverge_label = DEFAULT_DIVERGE_TEMPLATE
            self._refine_label = DEFAULT_REFINE_TEMPLATE
            self.population.assign_labels(self._diverge_label, self._refine_label)
            return
        try:
            if self.model is None:
                raise RuntimeError("no model available for operator generation")
            user = _OPERATOR_USER.format(
                system_message=self.state.task_context or "",
                packages_list=_PACKAGES_LINE,
                evaluator_code=self._evaluator_source() or "(no evaluator source available)")
            gen = self.model.generate(Prompt(system=COMBINED_SYSTEM_PROMPT, user=user))
            self.state.record_cost(gen.cost_usd, gen.prompt_tokens, gen.completion_tokens)
            explore_examples, refine_examples = _parse_combined_response(gen.text or "")
            self._diverge_label = DIVERGE_TEMPLATE.replace("{GENERATED_EXAMPLES}",
                                                           explore_examples)
            self._refine_label = REFINE_TEMPLATE.replace("{GENERATED_EXAMPLES}", refine_examples)
        except Exception:  # noqa: BLE001 - reference fallback: empty labels, free-form only
            self._diverge_label = ""
            self._refine_label = ""
        self.population.assign_labels(self._diverge_label, self._refine_label)
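
To make the EXPLORATION/EXPLOITATION split used by the one-time operator generation more concrete, here is a simplified, standalone sketch of header-based parsing in the spirit of `_parse_combined_response`. The function `split_guidance` and the sample reply are invented for illustration and are not part of the module; the real parser also accepts the `EXPLORATION (DIVERGE` / `EXPLOITATION (REFINE` header variants and trims each section to its `EXAMPLES OF ...` block.

```python
def split_guidance(response: str) -> tuple:
    """Split a reply on its EXPLORATION / EXPLOITATION headers (simplified sketch)."""
    sections = {"exploration": [], "exploitation": []}
    current = None
    for line in response.split("\n"):
        upper = line.upper().strip()
        if "### EXPLORATION" in upper:
            current = "exploration"
        elif "### EXPLOITATION" in upper:
            current = "exploitation"
        elif current is not None:
            sections[current].append(line)  # body line of the active section
    return ("\n".join(sections["exploration"]).strip(),
            "\n".join(sections["exploitation"]).strip())


# Invented sample reply; a real model reply will be longer and less tidy.
sample = """\
### EXPLORATION (DIVERGE)
EXAMPLES OF DIFFERENT approaches:
- switch from greedy search to simulated annealing
### EXPLOITATION (REFINE)
EXAMPLES OF REFINEMENT:
- raise the iteration cap and tighten tolerances
"""

explore, refine = split_guidance(sample)

# Mirror the documented fallback: any failure leaves both labels empty.
try:
    bad_explore, bad_refine = split_guidance(None)  # None has no .split, so this raises
except Exception:
    bad_explore, bad_refine = "", ""
```

Falling back to empty labels rather than raising keeps a malformed guide reply from aborting the run; the search then proceeds with free-form prompts only.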