Blog
Paper8 min read

Evolution Fine-Tuning: Learning to Discover Across 371 Optimization Tasks

A mid-training “practice phase” that turns evolutionary search trajectories into supervision, teaching small open-source LLMs how to evolve solutions before they ever see a new problem.

Evolution Fine-Tuning (EFT) converts evolutionary search trajectories into supervision, giving small open-source LLMs a practice phase that teaches them how to evolve solutions before they ever see a new problem. Trained on the 156K-trajectory Finch Collection, our Finch models generalize discovery skill across 22 held-out tasks (+10.22% over base), compose strategies across domains, and reach state-of-the-art on circle-packing when paired with test-time RL.

EFT acts as mid-training: cross-discovery transfer
EFT acts as mid-training. Finch lifts discovery on the Erdős minimum-overlap problem under both test-time search and learning (left); on NP-hard competitive programming it composes strategies across domains, while the base model repeats a single one (right).

The problem: discovery skill lives in the scaffold, not the model

LLMs integrated into evolutionary search have recently produced state-of-the-art solutions on optimization tasks — open mathematical conjectures, GPU kernel design, scientific-law discovery, and combinatorial puzzles. But prior work applies a search scaffold to one target task at a time, so every new problem is approached from scratch and the experience accumulated during search is discarded once the model finishes.

This leaves the capability of iteratively evolving a solution — knowing which part to mutate and how, deciding when to backtrack — entirely in the scaffold rather than in the model itself. Test-time search needs an expensive proprietary mutation operator; test-time learning over-fits a single task and throws the strategy away.

The idea: move discovery skill into the model

EFT is a mid-training paradigm that teaches LLMs to evolve solutions across tasks by distilling the discovery behavior itself into a small model — which then plugs into either scaffold. We think of it as a “practice phase” for general-purpose discovery agents: instead of rebuilding discovery skill from scratch inside every search run, the model practices before deployment rather than solving each new problem from zero.

The Evolution Fine-Tuning idea in one picture
The EFT idea in one picture. Rather than expensive prompting (test-time search) or single-task RL (test-time learning), EFT distills discovery tasks into the model — producing Finch, which then works inside either scaffold with frozen weights or further adaptation.
  • A mid-training practice phase — EFT teaches the LLM how to mutate, what to keep, and when to backtrack, before it is ever deployed.
  • Trajectories as supervision — optimization tasks are NP-hard and lack ground-truth optima, so (problem, answer) pairs are unavailable. EFT instead treats the trajectories of search runs — parent → child transitions with scores — as the training signal.
  • Orthogonal to the scaffold — an EFT model can serve as a frozen mutation operator inside test-time search, or be further adapted by test-time RL. It is a layer beneath both branches, not a replacement for either.
  • Emergent cross-domain transfer — trained across many domains at once, Finch composes strategies it learned elsewhere when tackling a new problem, behavior the base model never exhibits.

The Finch Collection: 156K trajectories, 371 tasks, 10 domains

156K
filtered trajectories
371
tasks · 10 domains
+10.2%
avg. gain on 22 held-out
2–9B
Finch model sizes

Optimization training data is hard to synthesize, so we source 371 seed tasks from 10 existing benchmarks — each requiring nontrivial search, with a deterministic continuous-score evaluator — and harvest the search itself. The construction pipeline runs in three stages:

Finch Collection construction pipeline
The construction pipeline. Collect seed optimization tasks → run an evolutionary scaffold (OpenEvolve) with a strong teacher mutation operator to harvest parent-to-child trajectories → filter out broken, systematic-error, and overlong cases, yielding ~156K trajectories over 371 tasks.
  • Seed task collection — 371 tasks across 10 domains, led by competitive programming (172) and numerical algorithm optimization (47), chosen to require real search rather than ground-truth matching.
  • Trajectory collection — OpenEvolve with a Qwen3.5-397B-A17B teacher runs each task under diff-edit and full-rewrite strategies, yielding 172,997 raw trajectories.
  • Filtering & labeling — removing systematic errors, hard-negative breakages, and overlong inputs retains 90.6%, then labels each trajectory by its score delta.

Of the filtered trajectories, 39.4% improve the parent, 19.2% leave it unchanged, and 41.3% regress — supplying both imitation and preference (good-vs-bad) signal. The collection is balanced across languages (68.5% Python / 31.5% C++) and strategies (50.3% diff-edit / 49.7% full-rewrite). We fine-tune the Qwen3.5 (2B/4B/9B) and Qwen3-8B bases via full SFT on improved trajectories from 355 tasks (16 held out), producing the Finch family.

Task distribution across 10 domains
371 tasks across 10 domains. Bubble size shows each group's task count, led by competitive programming and numerical algorithm optimization.
Improvement distribution of trajectories
Improvement breakdown. 39.4% of trajectories improve the parent, 19.2% leave it unchanged, 41.3% regress — supplying both imitation and preference (good-vs-bad) signal.

Results: cross-task discovery generalization

Used as a mutation operator inside test-time search, Finch beats its base models across 22 held-out tasks by +10.22% on average — and lets small models rival non-EFT models twice their size. Gains reach +290% on ahc058 and +74% on Transaction, and Finch-4B reaches 0.3865 on the Erdős minimum-overlap problem, comparable to Qwen3-8B's 0.4036 (lower is better) at half the size.

  • Test-time search — Finch lifts the held-out average at every scale, up to +10.24% at 9B, and matches strong proprietary operators (Claude-Opus-4.6, Gemini-3-Pro, GPT-5) on several metrics with a far smaller open backbone.
  • Offline RL (KTO) — further training Finch on improved + regressed trajectories teaches it to tell good solutions from bad; Finch-8B + KTO surpasses the best human score on two algorithm-engineering metrics.
  • Online RL (test-time RL) — as the policy inside nanodiscover, Finch-8B matches state-of-the-art on both circle-packing tasks (n=26 & n=32) and edges out the Qwen3-8B base on the Erdős problem.
  • Positive task-scaling — as the Finch Collection grows from 15 to 355 training tasks, held-out performance rises monotonically, an average +14.1% improvement that shows the gains come from task diversity rather than any single task.
Held-out performance scaling with number of training tasks
Positive task-scaling. As the Finch Collection grows from 15 to 355 training tasks, held-out performance rises monotonically on AC2, CP, and PRISM — an average +14.1% improvement, evidence that EFT gains come from task diversity rather than any single task.

Why it matters

EFT serves as a practice phase for general-purpose discovery agents that doesn't solve new problems from scratch. By moving discovery skill out of the scaffold and into the model, a single small open-source LLM can plug into test-time search with frozen weights or be further adapted by test-time RL — turning search compute that was once discarded into a reusable, transferable discovery skill. The paper, code, the Finch Collection dataset, and the Finch model family are all publicly available.