default/best_of_n

Best-of-N

Give the LLM N valid attempts at the same parent before committing to the global best, then repeat.

Test-time searchApache-2.0

# Best-of-N (`best_of_n`) > Give the LLM N valid attempts at the same parent before committing to the global best, then repeat. ## Overview Best-of-N is a test-time search baseline that deliberately exploits one program state at a time. It picks a parent and reuses it until N valid children have been produced from it — N independent variations from a single starting point — and only then commits to the current global best and repeats the cycle. Inspirations (context programs shown alongside the parent) are re-sampled fresh from the top pool at every step, regardless of where the reuse cycle stands. This scaffold is a faithful port of SkyDiscover's `BestOfNDatabase` from the UC Berkeley Sky Computing Lab. In Galapagos that single class is split along the standard component seam: a flat keep-all `InMemoryPopulation` stores every scored program and re-derives the global best on demand, while a stateful `BestOfNPolicy` owns the parent-reuse counter. Faithful to the original, the counter is advanced only by a validly-scored child — SkyDiscover increments it inside `add()`, which never runs for an error result — so a parse or evaluation failure is a free retry that does not spend the per-parent budget. The single tuning knob is N. Larger N deepens exploitation of one program state, spending more of the budget refining variations before moving on; N=1 advances to a new best after every valid child and so approaches the behavior of Top-K. If you instead want a strictly fixed per-parent budget where every attempt counts whether or not it scored, the `best_of_n_attempts` sibling spends one budget unit per selection rather than per valid child. ## Algorithm Each iteration selects a parent, builds a multi-section prompt from it plus a fresh sample of inspirations, proposes one diff-based edit, evaluates it, and admits the scored child back into the keep-all archive. The distinctive mechanism is the reuse counter `uses`: the policy keeps returning the same `current_parent_id` until `uses` reaches N, then jumps to the population's current global best and resets. Crucially, `uses` is incremented in `observe()` (after a child is admitted) and only when that child is valid — so failed attempts are retried for free against the same parent and never advance the cycle. ``` state: current_parent_id = none, uses = 0 each iteration: members = population.all() # choose parent: reuse, or commit to the global best if current_parent_id not in members or uses >= N: parent = argmax(members, key=fitness) # current global best current_parent_id = parent.id uses = 0 else: parent = members[current_parent_id] # reuse for another attempt # inspirations: random sample from the top pool, excluding the parent, re-drawn each step pool = population.query(top = max(2 * num_inspirations, 10)) inspirations = sample(pool \ {parent}, num_inspirations) prompt = build(parent, inspirations) child = propose(prompt) # SEARCH/REPLACE diff (full-rewrite fallback) result = evaluate(child) # parse/eval failure => free retry, uses unchanged population.add(child) if child.valid: # observe(): advance the per-parent budget uses += 1 ``` ## Components A Galapagos scaffold composes six components around the select -> prompt -> propose -> evaluate -> admit loop. Best-of-N keeps every part canonical except the SelectionPolicy, which carries the parent-reuse state machine. | Slot | Implementation | Role | |---|---|---| | Population | `BestOfNPopulation` (keep-all `InMemoryPopulation`) | Fitness-sorted archive of every scored program; source of the global best and the inspiration pool. | | SelectionPolicy | `BestOfNPolicy` (`best_of_n_reuse`) | Reuses one parent until N valid children, then switches to the global best; re-samples inspirations each step. | | PromptBuilder | `BestOfNPromptBuilder` (`DefaultPromptBuilder`) | Renders the canonical task -> metrics -> feedback -> inspirations -> current-program message. | | Proposer | `BestOfNProposer` (`DiffProposer`) | One LLM call yielding a SEARCH/REPLACE diff (full-rewrite fallback) with no-op detection. | | Evaluator | task-supplied (`SubprocessEvaluator`) | Scores the child via the task's own evaluator; sets the `valid` flag the policy counts. | | Memory | none (`NullMemory`) | Best-of-N keeps no cross-candidate free-form knowledge. | ## Configuration - `population.capacity` (null) — uncapped keep-all by default (matches SkyDiscover, no cap); set an int to bound the `BestOfNPopulation` archive. - `selection_policy.best_of_n` (5) — N: number of valid children to draw from one parent before switching to the global best. - `selection_policy.num_inspirations` (4) — context programs sampled from the top pool each step (SkyDiscover's `num_context_programs`). - `general.max_iterations` (100) — total evaluated candidates / budget. - `general.inner_retry_times` (1) — proposal resamples per iteration on a parse/eval failure, with the parent held fixed; failures here cost no per-parent budget. ## When to use Reach for Best-of-N when a promising program state rewards sustained, repeated refinement and you want a simple, faithful exploitation baseline with one knob. Larger N concentrates the budget on deepening a single state; N=1 effectively becomes greedy best-first and is close to `topk`. If you need every attempt to count against a fixed per-parent budget — for example to bound wasted calls on a brittle task — prefer `best_of_n_attempts`, which counts selections rather than valid children. For broader exploration across the frontier rather than depth on one parent, an island or explore/exploit policy (e.g. `openevolve`) is the better fit. ## Source Ported from SkyDiscover's `best_of_n` search strategy (`BestOfNDatabase`, UC Berkeley Sky Computing Lab); this is the Galapagos six-component decomposition of that strategy.