Tasks
A task is the evaluation problem a scaffold is run against. The catalog ships 64 runnable tasks,
each a folder under src/galapagos/tasks/<name>/ that holds a card.yaml card plus a seed program
and an evaluator.py. List them at runtime:
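A minimal sketch of one way to enumerate the catalog from the folder layout alone, assuming a source checkout where each task folder contains a card.yaml. The helper list_task_names is hypothetical and is not part of the galapagos API.

```python
from pathlib import Path


def list_task_names(tasks_root):
    """Return sorted names of subfolders that contain a card.yaml.

    Hypothetical helper: it only inspects the directory layout and does
    not parse the cards or check for a seed or evaluator.
    """
    root = Path(tasks_root)
    return sorted(
        child.name
        for child in root.iterdir()
        if child.is_dir() and (child / "card.yaml").is_file()
    )


if __name__ == "__main__":
    # Path taken from the layout described above; adjust to your checkout.
    root = Path("src/galapagos/tasks")
    if root.is_dir():
        for name in list_task_names(root):
            print(name)
```

This reads nothing but folder names, so it says nothing about a task's status or whether its external dependencies are installed.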
Catalog
The 64 tasks span open mathematics (the *_autocorr_ineq inequalities, erdos_min_overlap,
kissing_number, the Heilbronn and packing families), algorithms and speedups (the eight
algotune_* tasks, matmul, tsp_tour_minimization), systems research (the five adrs_* tasks —
GPU model placement, transaction scheduling, multi-cloud broadcast, prefix-cache column reordering,
expert-parallelism load balancing), GPU kernels (the four gpu_mode_* Triton tasks, kernelbench,
attention_optimization, mlx_metal_kernel_opt, rust_adaptive_sort), competitive programming
(the ten ale_bench_ahc* AtCoder heuristic contests, online_judge_programming), and prompt/ML
tasks (llm_prompt_optimization, hotpot_qa, lm_eval, symbolic_regression, arc_benchmark,
sky_festival). Statuses are stable (34), external (28), and experimental (2). Every task
ships a seed and an evaluator, so every task runs; external tasks additionally need their heavy
dependency stack (GPU, judge packages, or API keys) and degrade to a structured zero score
without it.
The three canonical quickstart tasks:
| Name | Display | Domain | Status | Metric | Summary |
|---|---|---|---|---|---|
| circle_packing | Circle Packing (n=26) | math | stable | combined_score (maximize) | Pack 26 circles in the unit square; maximize the sum of radii. |
| function_minimization | Function Minimization | math | stable | combined_score (maximize) | Find (x, y) minimizing f(x,y) = sin(x)·cos(y) + sin(x·y) + (x² + y²)/20. |
| playground_sphere | Sphere (Playground) | playground | stable | combined_score (maximize) | Tune a 6-vector to minimize the sum of squares (a smooth convex toy). |
All three are pure-Python, single-file, EVOLVE-BLOCK tasks that run locally with no GPU or Docker.
playground_sphere is the fastest and smallest — the recommended task for a quick first run.
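The two closed-form objectives in the table can be written out directly. The sketch below assumes only the formulas quoted on this page: f(x, y) from the function_minimization summary, and the playground_sphere score mapping 1/(1+loss) from the per-task notes. The function names are illustrative, and how function_minimization turns f into combined_score is not specified here, so it is left out.

```python
import math


def f(x, y):
    """Objective from the function_minimization summary."""
    return math.sin(x) * math.cos(y) + math.sin(x * y) + (x**2 + y**2) / 20


def sphere_loss(params):
    """Sum of squares of the parameter vector (playground_sphere)."""
    return sum(p * p for p in params)


def sphere_score(params):
    """Score mapping 1/(1+loss) quoted in the per-task notes."""
    return 1.0 / (1.0 + sphere_loss(params))


print(f(0.0, 0.0))              # 0.0
print(sphere_score([0.0] * 6))  # 1.0
print(sphere_score([1.0] * 6))  # 1/7, roughly 0.1429
```

The sphere score reaches its maximum of 1 only at the all-zero vector, which is why the task converges almost instantly.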
Notes per task
- circle_packing: the classic AlphaEvolve benchmark. You evolve construct_packing() inside the EVOLVE-BLOCK; the fixed entry point run_packing() -> (centers, radii, sum_radii) calls it. The score and validity are recomputed independently from the returned geometry to guard against reward hacking. Best known ≈ 2.635.
- function_minimization: a non-convex 2-D objective with a central basin. You evolve the search inside the EVOLVE-BLOCK (search_algorithm); the fixed entry point run_search() returns (x, y), and the score rises as f decreases. Global minimum ≈ −1.52, near (x, y) ≈ (−1.70, 0.68).
- playground_sphere: a smooth convex toy. Tune a length-6 PARAMS vector to minimize sum(x_i^2); score = 1/(1+loss) → 1 as the vector → 0. Instant; used for the Hub Playground and the fastest first runs.
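The independent recomputation mentioned for circle_packing can be sketched as follows. This is not the task's actual evaluator.py; it assumes only the stated problem (circles inside the unit square, no overlaps, score derived from the sum of radii). The function name verify_packing and the tolerance tol are assumptions.

```python
import math


def verify_packing(centers, radii, tol=1e-9):
    """Recompute validity and the sum of radii from raw geometry.

    Returns (is_valid, sum_radii). The reported sum_radii from the
    candidate is deliberately ignored; tol is an assumed numerical slack.
    """
    if len(centers) != len(radii):
        return False, 0.0
    for (x, y), r in zip(centers, radii):
        # Reject NaN/inf so they cannot slip past the comparisons below.
        if not all(math.isfinite(v) for v in (x, y, r)):
            return False, 0.0
        # Each circle needs a non-negative radius and must lie inside the unit square.
        if r < 0 or x - r < -tol or x + r > 1 + tol or y - r < -tol or y + r > 1 + tol:
            return False, 0.0
    for i in range(len(radii)):
        for j in range(i + 1, len(radii)):
            # Circles may touch but must not overlap.
            if math.dist(centers[i], centers[j]) < radii[i] + radii[j] - tol:
                return False, 0.0
    return True, float(sum(radii))


print(verify_packing([(0.25, 0.25), (0.75, 0.75)], [0.25, 0.25]))  # (True, 0.5)
print(verify_packing([(0.25, 0.25), (0.50, 0.25)], [0.25, 0.25]))  # (False, 0.0)
```

Scoring from the geometry rather than from the returned sum_radii means a candidate cannot raise its score by misreporting the total.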
Working with a task
```python
import galapagos as gx

task = gx.GalapagosTask.from_card(name="circle_packing")  # or path="tasks/circle_packing/card.yaml"
task.context           # the problem statement injected into prompts
task.status            # 'stable'
task.runnable          # True iff it ships a seed + evaluator.py
task.initial_genome()  # the seed Genome (content = the seed program)
task.evaluator         # the Evaluator from the card's evaluation.mode: a SubprocessEvaluator
                       # (mode: local, default) or a ContainerEvaluator (mode: container)

# score any candidate directly:
seed = task.initial_genome()
task.evaluator.evaluate(seed).combined_score

task.set_eval_mode("docker")  # force the Docker sandbox regardless of the card (None = card default)
```
Local vs. docker. With evaluation.mode: local (the default) the task's evaluator.py runs
in a host subprocess; with mode: docker (alias container) the same evaluator.py runs inside a
self-contained, Harbor-style Docker sandbox with identical scoring. The whole run can be forced
either way via general.evaluation_mode (for example, --set general.evaluation_mode=docker). See
Write your own task → Docker evaluation.
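The precedence described above (a forced mode wins over the card, the card falls back to local, and container is an alias for docker) can be sketched as a small pure function. This illustrates the documented rules only and is not the library's implementation; resolve_eval_mode and its parameter names are hypothetical.

```python
def resolve_eval_mode(card_mode=None, forced_mode=None):
    """Pick the effective evaluation mode.

    forced_mode stands in for a run-level override (None = use the card);
    card_mode stands in for the card's evaluation.mode (None = default local).
    """
    aliases = {"container": "docker"}  # documented alias
    chosen = forced_mode if forced_mode is not None else (card_mode or "local")
    chosen = aliases.get(chosen, chosen)
    if chosen not in ("local", "docker"):
        raise ValueError(f"unknown evaluation mode: {chosen!r}")
    return chosen


print(resolve_eval_mode())                       # local
print(resolve_eval_mode(card_mode="container"))  # docker
print(resolve_eval_mode("local", "docker"))      # docker
```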
Card fields
Each task card (a TaskCard) records: name, display_name, domain,
family, status (stable | experimental | spec | external), summary, description, metric
({key, direction, type}), components ({initial_program, evaluator} file pointers —
initial_program has the alias seed), evaluation ({format, modes, mode: local|docker, and
for docker mode: dockerfile, base_image, requirements, python_bin, env}), constraint,
seed, references, and metadata.
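To make the field list concrete, here is a partial card written as a plain Python dict, with a few sanity checks on the enumerated values. The field values come from the circle_packing row and the components shown on this page; the fields not listed here are omitted because their contents are not given. The check_card helper, the choice of required fields, and the "minimize" direction are assumptions, not the TaskCard schema.

```python
ALLOWED_STATUS = {"stable", "experimental", "spec", "external"}
ALLOWED_MODES = {"local", "docker", "container"}  # container is the documented alias
ALLOWED_DIRECTIONS = {"maximize", "minimize"}     # "minimize" is assumed


def check_card(card):
    """Return a list of problems found in a card-like dict (empty if none)."""
    problems = []
    for key in ("name", "status", "metric", "components"):  # assumed required set
        if key not in card:
            problems.append(f"missing field: {key}")
    if card.get("status") not in ALLOWED_STATUS:
        problems.append(f"unknown status: {card.get('status')!r}")
    direction = card.get("metric", {}).get("direction")
    if direction not in ALLOWED_DIRECTIONS:
        problems.append(f"unknown metric direction: {direction!r}")
    mode = card.get("evaluation", {}).get("mode", "local")
    if mode not in ALLOWED_MODES:
        problems.append(f"unknown evaluation mode: {mode!r}")
    return problems


example_card = {
    "name": "circle_packing",
    "display_name": "Circle Packing (n=26)",
    "domain": "math",
    "status": "stable",
    "summary": "Pack 26 circles in the unit square; maximize the sum of radii.",
    "metric": {"key": "combined_score", "direction": "maximize"},
    "components": {"initial_program": "initial_program.py", "evaluator": "evaluator.py"},
    "evaluation": {"mode": "local"},
}

print(check_card(example_card))  # []
```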
```python
from galapagos.cards.registry import load_task_card

card = load_task_card("circle_packing")
card.metric.key        # 'combined_score'
card.metric.direction  # 'maximize'
card.components        # {'initial_program': 'initial_program.py', 'evaluator': 'evaluator.py'}
```
The description becomes task.context. The metric block declares the headline number and
direction; the leaderboard ranks verified discoveries by it.
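How a metric's key and direction could drive a ranking is sketched below. This is an illustration of the idea, not the leaderboard's code: the result records, the verified flag, and rank_by_metric are all hypothetical.

```python
def rank_by_metric(results, key="combined_score", direction="maximize"):
    """Order verified results best-first according to the metric direction."""
    if direction not in ("maximize", "minimize"):
        raise ValueError(f"unknown direction: {direction!r}")
    verified = [r for r in results if r.get("verified", False)]
    # For maximize, larger values rank first; for minimize, smaller values do.
    return sorted(verified, key=lambda r: r[key], reverse=(direction == "maximize"))


results = [
    {"name": "run_a", "combined_score": 0.50, "verified": True},
    {"name": "run_b", "combined_score": 0.90, "verified": True},
    {"name": "run_c", "combined_score": 0.99, "verified": False},  # unverified, skipped
]

print([r["name"] for r in rank_by_metric(results)])  # ['run_b', 'run_a']
```

Because all three quickstart tasks declare combined_score (maximize), the same ordering rule applies to each of them.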
To add a task, see Write your own task.