
Tasks

A task is the evaluation problem a scaffold is run against. The catalog ships 64 runnable tasks, each a folder under src/galapagos/tasks/<name>/ containing a card.yaml task card, a seed program, and an evaluator.py. List them at runtime:

import galapagos as gx
gx.available_tasks()    # all 64 bundled task cards

Or, from the command line:

galapagos task list

Catalog

The 64 tasks span six areas:

  • Open mathematics: the *_autocorr_ineq inequalities, erdos_min_overlap, kissing_number, and the Heilbronn and packing families.
  • Algorithms and speedups: the eight algotune_* tasks, matmul, and tsp_tour_minimization.
  • Systems research: the five adrs_* tasks (GPU model placement, transaction scheduling, multi-cloud broadcast, prefix-cache column reordering, and expert-parallelism load balancing).
  • GPU kernels: the four gpu_mode_* Triton tasks, kernelbench, attention_optimization, mlx_metal_kernel_opt, and rust_adaptive_sort.
  • Competitive programming: the ten ale_bench_ahc* AtCoder heuristic contests and online_judge_programming.
  • Prompt/ML tasks: llm_prompt_optimization, hotpot_qa, lm_eval, symbolic_regression, arc_benchmark, and sky_festival.

Each task has one of three statuses: stable (34), external (28), or experimental (2). Every task ships a seed and an evaluator, so every one runs. External tasks additionally need their heavy dependency stack (GPU, judge packages, or API keys) and degrade to a structured zero score without it.
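
The status breakdown above can be reproduced by tallying the status field of each card. The following is a minimal sketch: count_by_status and the demo cards are hypothetical, and it assumes that each item returned by gx.available_tasks() exposes a status attribute, which this page does not confirm.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_status(cards):
    """Tally cards by their `status` attribute (hypothetical helper)."""
    return Counter(card.status for card in cards)


# Stand-in cards with made-up names, for illustration only. In practice the
# input would be gx.available_tasks(), ASSUMING each item exposes `.status`.
demo_cards = [
    SimpleNamespace(name="demo_a", status="stable"),
    SimpleNamespace(name="demo_b", status="stable"),
    SimpleNamespace(name="demo_c", status="external"),
]

for status, count in sorted(count_by_status(demo_cards).items()):
    print(f"{status}: {count}")
```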

The three canonical quickstart tasks:

| Name | Display | Domain | Status | Metric | Summary |
|---|---|---|---|---|---|
| circle_packing | Circle Packing (n=26) | math | stable | combined_score (maximize) | Pack 26 circles in the unit square; maximize the sum of radii. |
| function_minimization | Function Minimization | math | stable | combined_score (maximize) | Find (x, y) minimizing f(x,y) = sin(x)·cos(y) + sin(x·y) + (x² + y²)/20. |
| playground_sphere | Sphere (Playground) | playground | stable | combined_score (maximize) | Tune a 6-vector to minimize the sum of squares (a smooth convex toy). |

All three are pure-Python, single-file, EVOLVE-BLOCK tasks that run locally with no GPU or Docker. playground_sphere is the fastest and smallest — the recommended task for a quick first run.
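
To make the two numeric objectives concrete, they can be written out directly from the summaries above. This is an illustrative sketch, not the bundled evaluator code: the function names are hypothetical, the sphere score uses the 1/(1+loss) formula from the per-task notes, and how function_minimization maps f to combined_score is not specified here, so it is left out.

```python
import math


def f(x, y):
    """Objective from the function_minimization summary (lower is better)."""
    return math.sin(x) * math.cos(y) + math.sin(x * y) + (x**2 + y**2) / 20


def sphere_loss(params):
    """Sum of squares, the playground_sphere loss."""
    return sum(p * p for p in params)


def sphere_score(params):
    """score = 1 / (1 + loss): approaches 1 as the vector approaches 0."""
    return 1.0 / (1.0 + sphere_loss(params))


print(f(0.0, 0.0))              # both sine terms and the quadratic term vanish
print(sphere_score([0.0] * 6))  # zero loss gives the maximum score
print(sphere_score([1.0] * 6))  # loss = 6, so the score is 1/7
```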

Notes per task

  • circle_packing: the classic AlphaEvolve benchmark. You evolve construct_packing() inside the EVOLVE-BLOCK; the fixed entry point run_packing() -> (centers, radii, sum_radii) calls it. The score and validity are recomputed independently from the returned geometry, which guards against reward hacking. Best known ≈ 2.635.
  • function_minimization: a non-convex 2-D objective with a central basin. You evolve the search inside the EVOLVE-BLOCK (search_algorithm); the fixed entry point run_search() returns (x, y), and the score rises as f decreases. The global minimum is f ≈ −1.52, near (x, y) ≈ (−1.70, 0.68).
  • playground_sphere: a smooth convex toy. Tune a length-6 PARAMS vector to minimize sum(x_i^2); score = 1/(1+loss) → 1 as the vector → 0. It evaluates near-instantly and is used for the Hub Playground and the fastest first runs.
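
Recomputing the circle_packing score from the returned geometry, rather than trusting the reported sum_radii, might look like the sketch below. This is an assumption about the general approach, not the bundled evaluator.py: verify_packing and its tol parameter are hypothetical, and the real evaluator may apply different checks or tolerances.

```python
import math


def verify_packing(centers, radii, tol=1e-9):
    """Return (is_valid, sum_of_radii) recomputed from the geometry alone."""
    if len(centers) != len(radii):
        return False, 0.0
    for (x, y), r in zip(centers, radii):
        # Each circle needs a non-negative radius and must lie inside [0, 1]^2.
        if r < 0 or x - r < -tol or x + r > 1 + tol or y - r < -tol or y + r > 1 + tol:
            return False, 0.0
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            # Two circles overlap if their centers are closer than r_i + r_j.
            if math.dist(centers[i], centers[j]) < radii[i] + radii[j] - tol:
                return False, 0.0
    return True, sum(radii)


# Four equal circles in a 2x2 grid: tangent but not overlapping.
centers = [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)]
radii = [0.25] * 4
print(verify_packing(centers, radii))

# Two coincident circles are rejected, whatever sum the program reports.
print(verify_packing([(0.5, 0.5), (0.5, 0.5)], [0.5, 0.5]))
```

Because the check depends only on centers and radii, a program that returns an inflated sum_radii gains nothing from it.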

Working with a task

import galapagos as gx
task = gx.GalapagosTask.from_card(name="circle_packing")   # or path="tasks/circle_packing/card.yaml"

task.context           # the problem statement injected into prompts
task.status            # 'stable'
task.runnable          # True iff it ships a seed + evaluator.py
task.initial_genome()  # the seed Genome (content = the seed program)
task.evaluator         # the Evaluator from the card's evaluation.mode: a SubprocessEvaluator
                       # (mode: local, default) or a ContainerEvaluator (mode: container)

# score any candidate directly:
seed = task.initial_genome()
task.evaluator.evaluate(seed).combined_score

task.set_eval_mode("docker")      # force the Docker sandbox regardless of the card (None = card default)

Local vs. docker. With evaluation.mode: local (the default), the task's evaluator.py runs in a host subprocess. With mode: docker (alias container), the same evaluator.py runs inside a self-contained, Harbor-style Docker sandbox; the scoring is identical and only the execution environment changes. To force one mode for a whole run, set general.evaluation_mode (for example, --set general.evaluation_mode=docker). See Write your own task → Docker evaluation.
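
The calls shown above can be combined into a small helper that scores a task's seed in a chosen mode. This is a sketch under stated assumptions: score_seed is a hypothetical name, it uses only the calls that appear on this page, and it assumes that task.evaluator reflects the mode set by set_eval_mode().

```python
try:
    import galapagos as gx  # the package documented on this page
except ImportError:
    gx = None  # keeps the sketch importable where galapagos is not installed


def score_seed(task_name, mode=None):
    """Score a task's seed program; mode=None keeps the card's default.

    Returns None when galapagos is not installed.
    """
    if gx is None:
        return None
    task = gx.GalapagosTask.from_card(name=task_name)
    if mode is not None:
        # ASSUMPTION: task.evaluator picks up the mode set here.
        task.set_eval_mode(mode)
    seed = task.initial_genome()
    return task.evaluator.evaluate(seed).combined_score


# Example calls (the second needs a working Docker setup):
#   score_seed("playground_sphere")
#   score_seed("playground_sphere", mode="docker")
```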

Card fields

Each task card (a TaskCard) records:

  • name, display_name, domain, family
  • status: stable | experimental | spec | external
  • summary, description
  • metric: {key, direction, type}
  • components: {initial_program, evaluator} file pointers (initial_program has the alias seed)
  • evaluation: {format, modes, mode: local|docker}, plus, for docker mode, dockerfile, base_image, requirements, python_bin, and env
  • constraint, seed, references, metadata

from galapagos.cards.registry import load_task_card
card = load_task_card("circle_packing")
card.metric.key        # 'combined_score'
card.metric.direction  # 'maximize'
card.components        # {'initial_program': 'initial_program.py', 'evaluator': 'evaluator.py'}

The description becomes task.context. The metric block declares the headline number and direction; the leaderboard ranks verified discoveries by it.
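
Ranking by a metric's key and direction can be sketched as follows. This is illustrative only, not the leaderboard's implementation: rank_by_metric and the result entries are made up, and "minimize" as the opposite direction value is an assumption, since this page only shows 'maximize'.

```python
def rank_by_metric(results, key="combined_score", direction="maximize"):
    """Sort result dicts best-first according to the metric's direction."""
    if direction not in ("maximize", "minimize"):
        raise ValueError(f"unknown direction: {direction!r}")
    return sorted(results, key=lambda r: r[key], reverse=(direction == "maximize"))


# Made-up entries for illustration only.
results = [
    {"run": "a", "combined_score": 0.71},
    {"run": "b", "combined_score": 0.93},
    {"run": "c", "combined_score": 0.58},
]

print([r["run"] for r in rank_by_metric(results)])  # ['b', 'a', 'c']
```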

To add a task, see Write your own task.