galapagos
DocsHubLeaderboardPlaygroundNews
galapagos

six blocks · any task ·
better solutions emerge.

Platform

  • Hub
  • Leaderboard
  • Playground

Resources

  • Docs
  • API reference
  • Card spec

Community

  • GitHub
  • Contribute

Updates

  • News
  • Releases

© 2026 Galapagos. Licensed under Apache-2.0.

Build your own scaffold.

Hub/Scaffolds/default/Best-of-N

default/best_of_n

Best-of-N

Give the LLM N valid attempts at the same parent before committing to the global best, then repeat.

Test-time searchApache-2.0
Scaffold cardFiles and versions
best_of_n/README.md
70 lines · 5.9 KBmarkdownDownload
# Best-of-N (`best_of_n`)

> Give the LLM N valid attempts at the same parent before committing to the global best, then repeat.

## Overview

Best-of-N is a test-time search baseline that deliberately exploits one program state at a time. It picks a parent and reuses it until N valid children have been produced from it — N independent variations from a single starting point — and only then commits to the current global best and repeats the cycle. Inspirations (context programs shown alongside the parent) are re-sampled fresh from the top pool at every step, regardless of where the reuse cycle stands.

This scaffold is a faithful port of SkyDiscover's `BestOfNDatabase` from the UC Berkeley Sky Computing Lab. In Galapagos that single class is split along the standard component seam: a flat keep-all `InMemoryPopulation` stores every scored program and re-derives the global best on demand, while a stateful `BestOfNPolicy` owns the parent-reuse counter. Faithful to the original, the counter is advanced only by a validly-scored child — SkyDiscover increments it inside `add()`, which never runs for an error result — so a parse or evaluation failure is a free retry that does not spend the per-parent budget.

The single tuning knob is N. Larger N deepens exploitation of one program state, spending more of the budget refining variations before moving on; N=1 advances to a new best after every valid child and so approaches the behavior of Top-K. If you instead want a strictly fixed per-parent budget where every attempt counts whether or not it scored, the `best_of_n_attempts` sibling spends one budget unit per selection rather than per valid child.

## Algorithm

Each iteration selects a parent, builds a multi-section prompt from it plus a fresh sample of inspirations, proposes one diff-based edit, evaluates it, and admits the scored child back into the keep-all archive. The distinctive mechanism is the reuse counter `uses`: the policy keeps returning the same `current_parent_id` until `uses` reaches N, then jumps to the population's current global best and resets. Crucially, `uses` is incremented in `observe()` (after a child is admitted) and only when that child is valid — so failed attempts are retried for free against the same parent and never advance the cycle.

```
state: current_parent_id = none, uses = 0

each iteration:
    members = population.all()
    # choose parent: reuse, or commit to the global best
    if current_parent_id not in members or uses >= N:
        parent = argmax(members, key=fitness)   # current global best
        current_parent_id = parent.id
        uses = 0
    else:
        parent = members[current_parent_id]      # reuse for another attempt

    # inspirations: random sample from the top pool, excluding the parent, re-drawn each step
    pool         = population.query(top = max(2 * num_inspirations, 10))
    inspirations = sample(pool \ {parent}, num_inspirations)

    prompt = build(parent, inspirations)
    child  = propose(prompt)                      # SEARCH/REPLACE diff (full-rewrite fallback)
    result = evaluate(child)                       # parse/eval failure => free retry, uses unchanged

    population.add(child)
    if child.valid:                                # observe(): advance the per-parent budget
        uses += 1
```

## Components

A Galapagos scaffold composes six components around the select -> prompt -> propose -> evaluate -> admit loop. Best-of-N keeps every part canonical except the SelectionPolicy, which carries the parent-reuse state machine.

| Slot | Implementation | Role |
|---|---|---|
| Population | `BestOfNPopulation` (keep-all `InMemoryPopulation`) | Fitness-sorted archive of every scored program; source of the global best and the inspiration pool. |
| SelectionPolicy | `BestOfNPolicy` (`best_of_n_reuse`) | Reuses one parent until N valid children, then switches to the global best; re-samples inspirations each step. |
| PromptBuilder | `BestOfNPromptBuilder` (`DefaultPromptBuilder`) | Renders the canonical task -> metrics -> feedback -> inspirations -> current-program message. |
| Proposer | `BestOfNProposer` (`DiffProposer`) | One LLM call yielding a SEARCH/REPLACE diff (full-rewrite fallback) with no-op detection. |
| Evaluator | task-supplied (`SubprocessEvaluator`) | Scores the child via the task's own evaluator; sets the `valid` flag the policy counts. |
| Memory | none (`NullMemory`) | Best-of-N keeps no cross-candidate free-form knowledge. |

## Configuration

- `population.capacity` (null) — uncapped keep-all by default (matches SkyDiscover, no cap); set an int to bound the `BestOfNPopulation` archive.
- `selection_policy.best_of_n` (5) — N: number of valid children to draw from one parent before switching to the global best.
- `selection_policy.num_inspirations` (4) — context programs sampled from the top pool each step (SkyDiscover's `num_context_programs`).
- `general.max_iterations` (100) — total evaluated candidates / budget.
- `general.inner_retry_times` (1) — proposal resamples per iteration on a parse/eval failure, with the parent held fixed; failures here cost no per-parent budget.

## When to use

Reach for Best-of-N when a promising program state rewards sustained, repeated refinement and you want a simple, faithful exploitation baseline with one knob. Larger N concentrates the budget on deepening a single state; N=1 effectively becomes greedy best-first and is close to `topk`. If you need every attempt to count against a fixed per-parent budget — for example to bound wasted calls on a brittle task — prefer `best_of_n_attempts`, which counts selections rather than valid children. For broader exploration across the frontier rather than depth on one parent, an island or explore/exploit policy (e.g. `openevolve`) is the better fit.

## Source

Ported from SkyDiscover's `best_of_n` search strategy (`BestOfNDatabase`, UC Berkeley Sky Computing Lab); this is the Galapagos six-component decomposition of that strategy.