Best-of-N
Give the LLM N valid attempts at the same parent before committing to the global best, then repeat.
Best-of-N is a test-time search baseline that deliberately exploits one program state at a time. It picks a parent and reuses it until N valid children have been produced from it — N independent variations from a single starting point — and only then commits to the current global best and repeats the cycle. Inspirations (context programs shown alongside the parent) are re-sampled fresh from the top pool at every step, regardless of where the reuse cycle stands.
This scaffold is a faithful port of SkyDiscover's `BestOfNDatabase` from the UC Berkeley Sky Computing Lab. In Galapagos that single class is split along the standard component seam: a flat keep-all `InMemoryPopulation` stores every scored program and re-derives the global best on demand, while a stateful `BestOfNPolicy` owns the parent-reuse counter. As in the original, only a validly scored child advances the counter: SkyDiscover increments it inside `add()`, which never runs for an error result, so a parse or evaluation failure is a free retry that does not spend the per-parent budget.
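The split between a keep-all store and a stateful reuse counter can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Galapagos or SkyDiscover source: `Program`, `SketchPopulation`, `SketchBestOfNPolicy`, and `run` are hypothetical names, a failed attempt is modeled as a score of `None`, and inspiration sampling is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Program:
    code: str
    score: float


@dataclass
class SketchPopulation:
    """Hypothetical flat keep-all store; the best is re-derived on demand."""
    programs: List[Program] = field(default_factory=list)

    def add(self, program: Program) -> None:
        self.programs.append(program)

    def best(self) -> Program:
        return max(self.programs, key=lambda p: p.score)


@dataclass
class SketchBestOfNPolicy:
    """Hypothetical policy that owns the parent-reuse counter."""
    n: int
    parent: Optional[Program] = None
    valid_children: int = 0

    def select_parent(self, population: SketchPopulation) -> Program:
        # Start a new cycle from the current global best once the
        # previous parent has produced n valid children.
        if self.parent is None or self.valid_children >= self.n:
            self.parent = population.best()
            self.valid_children = 0
        return self.parent

    def record_valid_child(self) -> None:
        self.valid_children += 1


def run(outcomes: List[Optional[float]], n: int) -> List[str]:
    """Replay a list of attempt outcomes and return the parent used each step."""
    population = SketchPopulation([Program("seed", 0.0)])
    policy = SketchBestOfNPolicy(n=n)
    parents_used = []
    for i, score in enumerate(outcomes):
        parent = policy.select_parent(population)
        parents_used.append(parent.code)
        if score is None:
            continue  # failure: free retry, counter untouched
        population.add(Program(f"child{i}", score))
        policy.record_valid_child()
    return parents_used


print(run([0.5, None, 0.3, 0.9, 0.1], n=2))
```

With N=2, the failed second attempt does not count, so the seed is selected three times before the policy moves on to the best program found so far.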
The single tuning knob is N. Larger N deepens exploitation of one program state, spending more of the budget refining variations before moving on; N=1 re-selects the global best after every valid child and so approaches the behavior of Top-K. If you instead want a strictly fixed per-parent budget where every attempt counts whether or not it scored, the `best_of_n_attempts` sibling spends one budget unit per selection rather than per valid child.
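The difference between the two counting rules is easiest to see by replaying the same sequence of attempts under each. The function below is an illustrative sketch only: `cycle_lengths` and its `count_failures` flag are hypothetical and do not correspond to any configuration option in either scaffold.

```python
from typing import List


def cycle_lengths(valid: List[bool], n: int, count_failures: bool) -> List[int]:
    """Return how many selections each completed parent cycle consumed.

    valid[i] is True when attempt i produced a scored child.
    count_failures=False models the rule described for this scaffold
    (only valid children spend the per-parent budget).
    count_failures=True models a fixed per-selection budget, as
    described for the best_of_n_attempts sibling.
    """
    lengths = []
    spent = 0       # budget units spent on the current parent
    selections = 0  # selections made from the current parent
    for ok in valid:
        selections += 1
        if ok or count_failures:
            spent += 1
        if spent >= n:
            lengths.append(selections)
            spent = 0
            selections = 0
    return lengths


attempts = [True, False, False, True, True, True]
print(cycle_lengths(attempts, n=2, count_failures=False))  # per valid child
print(cycle_lengths(attempts, n=2, count_failures=True))   # per selection
```

Under the per-valid-child rule the first parent absorbs both failures and is selected four times; under the per-selection rule every parent is selected exactly N times, however many of those attempts scored.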
The six components this scaffold snaps together. Each block names its concrete implementation.
The set of candidate solutions in play — the gene pool the search evolves over.
Decides which genomes survive and reproduce — tournament, elitism, novelty, or your own policy.
Assembles the context handed to the model — parents, feedback, instructions, examples.
The LLM-driven variation operator — proposes new candidates by mutation and crossover.
Scores each candidate against the task — the fitness signal that drives selection.
SkyDiscover (UC Berkeley Sky Computing Lab) — best_of_n search strategy