Best-of-N
Give the LLM N valid attempts at the same parent before committing to the global best, then repeat.
# Best-of-N (`best_of_n`)
> Give the LLM N valid attempts at the same parent before committing to the global best, then repeat.
## Overview
Best-of-N is a test-time search baseline that deliberately exploits one program state at a time. It picks a parent and reuses it until N valid children have been produced from it — N independent variations from a single starting point — and only then commits to the current global best and repeats the cycle. Inspirations (context programs shown alongside the parent) are re-sampled fresh from the top pool at every step, regardless of where the reuse cycle stands.
This scaffold is a faithful port of SkyDiscover's `BestOfNDatabase` from the UC Berkeley Sky Computing Lab. In Galapagos that single class is split along the standard component seam: a flat keep-all `InMemoryPopulation` stores every scored program and re-derives the global best on demand, while a stateful `BestOfNPolicy` owns the parent-reuse counter. Faithful to the original, the counter is advanced only by a validly-scored child — SkyDiscover increments it inside `add()`, which never runs for an error result — so a parse or evaluation failure is a free retry that does not spend the per-parent budget.
The single tuning knob is N. Larger N deepens exploitation of one program state, spending more of the budget refining variations before moving on; N=1 advances to a new best after every valid child and so approaches the behavior of Top-K. If you instead want a strictly fixed per-parent budget where every attempt counts whether or not it scored, the `best_of_n_attempts` sibling spends one budget unit per selection rather than per valid child.
## Algorithm
Each iteration selects a parent, builds a multi-section prompt from it plus a fresh sample of inspirations, proposes one diff-based edit, evaluates it, and admits the scored child back into the keep-all archive. The distinctive mechanism is the reuse counter `uses`: the policy keeps returning the same `current_parent_id` until `uses` reaches N, then jumps to the population's current global best and resets. Crucially, `uses` is incremented in `observe()` (after a child is admitted) and only when that child is valid — so failed attempts are retried for free against the same parent and never advance the cycle.
```
state: current_parent_id = none, uses = 0
each iteration:
members = population.all()
# choose parent: reuse, or commit to the global best
if current_parent_id not in members or uses >= N:
parent = argmax(members, key=fitness) # current global best
current_parent_id = parent.id
uses = 0
else:
parent = members[current_parent_id] # reuse for another attempt
# inspirations: random sample from the top pool, excluding the parent, re-drawn each step
pool = population.query(top = max(2 * num_inspirations, 10))
inspirations = sample(pool \ {parent}, num_inspirations)
prompt = build(parent, inspirations)
child = propose(prompt) # SEARCH/REPLACE diff (full-rewrite fallback)
result = evaluate(child) # parse/eval failure => free retry, uses unchanged
population.add(child)
if child.valid: # observe(): advance the per-parent budget
uses += 1
```
## Components
A Galapagos scaffold composes six components around the select -> prompt -> propose -> evaluate -> admit loop. Best-of-N keeps every part canonical except the SelectionPolicy, which carries the parent-reuse state machine.
| Slot | Implementation | Role |
|---|---|---|
| Population | `BestOfNPopulation` (keep-all `InMemoryPopulation`) | Fitness-sorted archive of every scored program; source of the global best and the inspiration pool. |
| SelectionPolicy | `BestOfNPolicy` (`best_of_n_reuse`) | Reuses one parent until N valid children, then switches to the global best; re-samples inspirations each step. |
| PromptBuilder | `BestOfNPromptBuilder` (`DefaultPromptBuilder`) | Renders the canonical task -> metrics -> feedback -> inspirations -> current-program message. |
| Proposer | `BestOfNProposer` (`DiffProposer`) | One LLM call yielding a SEARCH/REPLACE diff (full-rewrite fallback) with no-op detection. |
| Evaluator | task-supplied (`SubprocessEvaluator`) | Scores the child via the task's own evaluator; sets the `valid` flag the policy counts. |
| Memory | none (`NullMemory`) | Best-of-N keeps no cross-candidate free-form knowledge. |
## Configuration
- `population.capacity` (null) — uncapped keep-all by default (matches SkyDiscover, no cap); set an int to bound the `BestOfNPopulation` archive.
- `selection_policy.best_of_n` (5) — N: number of valid children to draw from one parent before switching to the global best.
- `selection_policy.num_inspirations` (4) — context programs sampled from the top pool each step (SkyDiscover's `num_context_programs`).
- `general.max_iterations` (100) — total evaluated candidates / budget.
- `general.inner_retry_times` (1) — proposal resamples per iteration on a parse/eval failure, with the parent held fixed; failures here cost no per-parent budget.
## When to use
Reach for Best-of-N when a promising program state rewards sustained, repeated refinement and you want a simple, faithful exploitation baseline with one knob. Larger N concentrates the budget on deepening a single state; N=1 effectively becomes greedy best-first and is close to `topk`. If you need every attempt to count against a fixed per-parent budget — for example to bound wasted calls on a brittle task — prefer `best_of_n_attempts`, which counts selections rather than valid children. For broader exploration across the frontier rather than depth on one parent, an island or explore/exploit policy (e.g. `openevolve`) is the better fit.
## Source
Ported from SkyDiscover's `best_of_n` search strategy (`BestOfNDatabase`, UC Berkeley Sky Computing Lab); this is the Galapagos six-component decomposition of that strategy.