Why Galapagos

LLM-driven evolutionary search — using a frontier model as a guided variation operator inside an evolutionary loop — has become one of the most productive recipes in automated scientific discovery. This page is the why: the explosion of methods that followed AlphaEvolve, the evaluation crisis it created, the strengths and gaps of the systems built to address it, and the thesis that motivates Galapagos.

The post-AlphaEvolve explosion

AlphaEvolve showed that a simple idea scales remarkably: keep a population of candidate solutions, ask a strong LLM to mutate the best ones, score the results with a deterministic evaluator, and repeat. It produced state-of-the-art results across mathematics, GPU kernels, algorithms, and systems.

What followed was an explosion of methods, each refining a different part of that loop:

OpenEvolve — an open reproduction of the island + MAP-Elites evolutionary loop.
ShinkaEvolve — UCB bandit routing over an LLM ensemble, crossover, novelty rejection, and a meta-scratchpad.
GEPA — reflective, Pareto-frontier prompt evolution from execution traces.
DGM (Darwin Gödel Machine) — agents that edit their own codebase.
DeepEvolve — deep-research planning fused with an evolutionary loop for drug-discovery tasks.
PAC-Evolve, AdaEvolve, EvoX, SeaEvo, Meta-Harness, CORAL, HyperAgents, and a steady stream of others: each adds an adaptive controller, a memory layer, a multi-agent twist, or a strategy that is itself evolved.

Every one of these is, structurally, the same loop — select, prompt, propose, evaluate, repeat — with a different implementation in one or two slots. But each ships as its own incompatible codebase, with its own config format, its own task harness, and its own notion of a "run."
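To make the shared structure concrete, here is a minimal sketch of the select, prompt, propose, evaluate, repeat loop. It is not the code of any system named above: the function names (`evolve`, `toy_evaluate`, `make_toy_proposer`), the top-k selection rule, and the prompt text are all assumptions chosen for illustration, and the "LLM" is replaced by a toy callable so the example runs offline.

```python
import random
from typing import Callable, List, Optional, Tuple

Scored = Tuple[str, float]  # (candidate, score); candidates are plain strings here


def evolve(
    seed: str,
    propose: Callable[[str], str],
    evaluate: Callable[[str], float],
    iterations: int = 50,
    top_k: int = 3,
    rng: Optional[random.Random] = None,
) -> Scored:
    """Run a bare select/prompt/propose/evaluate loop and return the best pair."""
    rng = rng or random.Random(0)
    population: List[Scored] = [(seed, evaluate(seed))]
    for _ in range(iterations):
        # Select: sample a parent uniformly from the current top-k.
        elite = sorted(population, key=lambda p: p[1], reverse=True)[:top_k]
        parent, parent_score = rng.choice(elite)
        # Prompt: build the text the proposer sees (parent goes on the last line).
        prompt = f"Improve this candidate (score {parent_score:.4f}):\n{parent}"
        # Propose: in a real system this is an LLM call; here it is any callable.
        child = propose(prompt)
        # Evaluate: score the child deterministically and keep it.
        population.append((child, evaluate(child)))
    return max(population, key=lambda p: p[1])


# Toy stand-ins so the sketch runs without a model: a candidate is a number
# written as text, and the evaluator rewards being close to 3.0.
def toy_evaluate(candidate: str) -> float:
    return -(float(candidate) - 3.0) ** 2


def make_toy_proposer(rng: random.Random) -> Callable[[str], str]:
    def propose(prompt: str) -> str:
        parent = float(prompt.splitlines()[-1])  # parent is the prompt's last line
        return repr(parent + rng.uniform(-1.0, 1.0))  # random local mutation
    return propose


if __name__ == "__main__":
    best, score = evolve("0.0", make_toy_proposer(random.Random(1)), toy_evaluate)
    print(best, score)
```

In this framing, the methods listed above differ mainly in what replaces the selection line, the prompt construction, or the proposer, while the outer loop stays the same.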

The evaluation crisis

The proliferation created a measurement problem. When two papers both report a number on "circle packing," it is rarely an apples-to-apples comparison, because the score is enormously sensitive to variables that are not the method:

Island size and population structure.
Maximum iterations / total budget.
Number of trials and which trial's best is reported.
The underlying model and its sampling temperature.
The seed, the wall-clock limit, the early-stopping patience.

Scores vary widely with these settings, and each paper tunes them independently. As a result, the literature's headline numbers are not directly comparable, and it is hard to tell whether a new method is better or merely better tuned. Reproducing a result means reverse-engineering a bespoke harness; comparing two methods means re-implementing both.
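As a small illustration of why "which trial's best is reported" matters, the sketch below summarizes repeated trials of two methods. The scores are made up for illustration and are not results from any paper or system; `summarize_trials` is a hypothetical helper, not part of any framework mentioned here.

```python
import statistics
from typing import Dict, Sequence


def summarize_trials(scores: Sequence[float]) -> Dict[str, float]:
    """Summarize final scores from repeated trials of one method on one task."""
    return {
        "best": max(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        # Sample standard deviation needs at least two trials.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }


# Made-up scores for illustration only (higher is better); not real results.
method_a = [0.50, 0.52, 0.90]  # one lucky trial
method_b = [0.70, 0.71, 0.72]  # consistently decent

summary_a = summarize_trials(method_a)
summary_b = summarize_trials(method_b)

if __name__ == "__main__":
    print("A:", summary_a)
    print("B:", summary_b)
```

With these toy numbers, reporting the best of three trials ranks method A first, while the mean and the median both rank method B first. Two papers that choose different reporting conventions could therefore reach opposite conclusions from the same runs.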

The core problem

An explosion of methods, but no consistent, fair way to evaluate them. Progress is real, but it is hard to measure, hard to reproduce, and hard to build on.

SkyDiscover: strengths and gaps

The community's most serious attempt at unification is SkyDiscover — a single framework that factors several of these methods into shared abstractions. Galapagos owes it a real debt, and adopts its central insight: that these methods share a common structure worth naming.

SkyDiscover's strengths:

It demonstrated that a single framework can express multiple discovery methods.
Its abstractions (a context builder, a sampled program context, a database) are a clean minimal core — Galapagos's PromptBuilder is modeled directly on SkyDiscover's ContextBuilder.

But it left real gaps:

Hard to use. The surface area and setup cost are high for a newcomer who just wants to run a method on a task.
Hard to cover the frontier. Newer agentic and meta-level methods — Meta-Harness, Meta-N, CORAL — do not fit cleanly. Agent-driven selection, skill files, and shared multi-agent memory have no natural home.
Not easily extensible or scalable. Adding a method or a backend is more invasive than swapping one component should be.

Galapagos keeps SkyDiscover's good idea — one composition, many methods — and closes these gaps with a six-component decomposition over a single Genome, in which agent-driven methods are just the identity selection policy and meta-scaffolds are just nesting on the same loop. See Core components for the full mapping and the per-method coverage matrix.
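One plausible reading of "identity selection policy" is a policy that hands the whole population through unchanged, so that an agent downstream decides what to build on. The sketch below contrasts that with a conventional top-k policy. The interface (`SelectionPolicy.select`) and the class names are assumptions made for this example; the actual Galapagos signatures are documented in Core components and may differ.

```python
from typing import List, Protocol, Sequence, Tuple

Scored = Tuple[str, float]  # (candidate, score)


class SelectionPolicy(Protocol):
    """Hypothetical interface: choose which candidates the next step may see."""

    def select(self, population: Sequence[Scored]) -> List[Scored]: ...


class TopK:
    """Conventional policy: keep only the k highest-scoring candidates."""

    def __init__(self, k: int) -> None:
        self.k = k

    def select(self, population: Sequence[Scored]) -> List[Scored]:
        return sorted(population, key=lambda p: p[1], reverse=True)[: self.k]


class Identity:
    """Pass everything through; an agent downstream makes its own choice."""

    def select(self, population: Sequence[Scored]) -> List[Scored]:
        return list(population)


population = [("a", 1.0), ("b", 3.0), ("c", 2.0)]
top_two = TopK(2).select(population)         # the two best candidates, best first
everything = Identity().select(population)   # unchanged copy of the population
```

Under this reading, an agent-driven method needs no special case in the loop: the selection slot is still filled, just by a policy that does no filtering.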

The thesis: scale up the tasks

The deeper lesson from machine learning is that the task suite is the lever. Capability followed the scaling of standardized, shared benchmarks:

ImageNet turned vision into a measurable, competitive field.
T5 and FLAN showed that scaling up the number and diversity of tasks — cast in one consistent format — is what produced general, transferable capability.

Discovery has no equivalent. The tasks exist, but they are scattered across incompatible repositories and formats:

OpenProblems (single-cell biology),
EinsteinArena (physics / reasoning),
DeepEvolve drug-discovery tasks,
Erdős problems (open mathematics),
GPU Mode (kernel optimization),

…and many more, each with its own harness, scoring convention, and submission process. There is no unified format in which a method evaluated on one is comparable to a method evaluated on another.

The Galapagos thesis

If scaling standardized tasks drove progress in vision and language, then the way to drive progress in LLM-driven discovery is to gather scattered discovery tasks into one consistent schema, run every method against them in one consistent loop, and rank the results on one consistent, fair leaderboard — with a verification step so the numbers can be trusted.

What Galapagos adds

Galapagos's design follows directly from these three problems (incompatible methods, unfair evaluation, scattered tasks):

One loop, many methods. A six-component decomposition (Population, SelectionPolicy, PromptBuilder, Proposer, Evaluator, Memory) over a single Genome, so a method is just a choice of which implementation fills each slot — including agent-driven methods.
One schema for tasks. A Harbor-style task card — seed, evaluator, metric, requirements, evaluation mode — so any task plugs into any scaffold, with a Docker sandbox when a task needs it.
One protocol: the card. Scaffolds, tasks, models, and discoveries are all cards — versioned YAML that is the single source of truth for the library and the Hub.
A consistent, fair leaderboard with verification. A discovery is submitted as a verification card — trajectory plus best solution — and is re-scored by the task's own evaluator and reviewed by a domain expert before it counts. Leaderboard entries are checked, not self-reported.
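The card fields and the re-scoring step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Galapagos schema: the class names, field names, the `"local"` default, and `rescore_matches` are hypothetical, and real cards are versioned YAML in which the evaluator would be a reference rather than an in-memory Python callable. Expert review is a human step and is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class TaskCard:
    """Hypothetical in-memory view of a task card."""
    name: str
    seed: str                          # starting solution handed to a scaffold
    evaluator: Callable[[str], float]  # the task's own scoring function
    metric: str                        # name of the reported metric
    requirements: Tuple[str, ...] = ()
    evaluation_mode: str = "local"     # assumed value, for illustration only


@dataclass(frozen=True)
class VerificationCard:
    """Hypothetical submission: trajectory plus best solution and claimed score."""
    task_name: str
    best_solution: str
    claimed_score: float
    trajectory: Tuple[str, ...] = ()


def rescore_matches(task: TaskCard, sub: VerificationCard, tol: float = 1e-9) -> bool:
    """Re-run the task's evaluator and compare against the claimed score."""
    if sub.task_name != task.name:
        return False
    actual = task.evaluator(sub.best_solution)
    return abs(actual - sub.claimed_score) <= tol


# Toy task: the score is simply the length of the solution string.
toy_task = TaskCard(name="toy", seed="", evaluator=lambda s: float(len(s)), metric="length")
honest = VerificationCard(task_name="toy", best_solution="abcd", claimed_score=4.0)
inflated = VerificationCard(task_name="toy", best_solution="abcd", claimed_score=9.0)
```

Here `rescore_matches(toy_task, honest)` passes and `rescore_matches(toy_task, inflated)` fails, which is the property the leaderboard relies on: a claimed score counts only if the task's own evaluator reproduces it.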

Honest about scope

This page describes the platform's vision and the architecture that supports it. What ships and runs today is deliberately small — seven runnable scaffolds, 64 runnable tasks. The point of the architecture is that growing the catalog is additive: a new method is a card, a new task is a card, and neither changes the shape of the loop. See the scope note on the home page.

Continue to Installation, the Quickstart, or the Concepts.