Write your own task

A task is three files in a directory under tasks/<name>/:

tasks/my_task/
├── card.yaml            # the task card (metadata, metric, file pointers)
├── initial_program.py   # the seed program, with an EVOLVE-BLOCK
└── evaluator.py         # evaluate(program_path) -> dict

The task owns the problem statement (task.context), the seed (task.initial_genome()), and the Evaluator (task.evaluator). Because Galapagos supplies the Evaluator from the task, any scaffold can run against it unchanged.

Below is a complete, minimal task: tune three numbers to minimize their sum of squares.

1. `card.yaml`

name: my_task
display_name: My Task
domain: math
summary: "Tune a 3-vector to minimize the sum of squares."
description: |
  Tune a length-3 PARAMS vector to MINIMIZE sum(x_i^2). Implement `solve() -> list` returning the
  three numbers. Only the code inside the EVOLVE-BLOCK is modified by the search. The score is
  1/(1+loss), recomputed independently from the returned values (anti reward-hacking).
metrics:
  - metric_name: combined_score
    metric_direction: maximize
    metric_description: "1 / (1 + sum of squares); higher is better."
    metric_computation: "Recompute sum(x_i^2) from the returned vector; score = 1/(1+loss)."
components: {initial_program: initial_program.py, evaluator: evaluator.py}
evaluation: {format: python, mode: local}        # mode: local | docker (see "Docker evaluation")
language: python
modality: text
library: [numpy]
constraint: {gpu: none, docker: optional}
references: {best_known: 1.0, note: "score -> 1 as the vector -> 0"}

The description is the text injected into prompts as task.context. components.initial_program (alias seed) and components.evaluator point at the two files; if either is omitted, the defaults initial_program.py and evaluator.py are assumed. The metrics list declares the headline metric and its direction (here combined_score, maximized).
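The defaulting rule for the two file pointers can be sketched in a few lines. This is an illustration only: `resolve_components` and `DEFAULTS` are hypothetical names, the card is assumed to be already parsed into a dict, and the precedence between `seed` and `initial_program` when both are present is an assumption, not documented behavior.

```python
from pathlib import Path

# Defaults stated in the text above; used when a key is omitted from the card.
DEFAULTS = {"initial_program": "initial_program.py", "evaluator": "evaluator.py"}


def resolve_components(card: dict, task_dir: str) -> dict:
    """Return the seed and evaluator paths for a parsed card (hypothetical helper)."""
    components = dict(card.get("components") or {})
    # `seed` is described as an alias for `initial_program`.
    # Assumption: an explicit `initial_program` wins if both are given.
    if "seed" in components and "initial_program" not in components:
        components["initial_program"] = components.pop("seed")
    return {key: Path(task_dir) / components.get(key, default)
            for key, default in DEFAULTS.items()}


paths = resolve_components({"components": {"seed": "start.py"}}, "tasks/my_task")
print(paths["initial_program"], paths["evaluator"])
```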

2. `initial_program.py`

The seed must mark the editable region with # EVOLVE-BLOCK-START … # EVOLVE-BLOCK-END. The Proposer only touches code inside those markers.

"""Tune the PARAMS vector to MINIMIZE the sphere loss sum(x_i^2) (score = 1/(1+loss))."""


def solve():
    # EVOLVE-BLOCK-START
    PARAMS = [0.9, -0.8, 0.7]
    # EVOLVE-BLOCK-END
    return PARAMS


if __name__ == "__main__":
    print(solve())

Keep the seed runnable and self-contained — it is evaluated once at setup to seed the population.
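To make the marker convention concrete, here is a minimal sketch of how an editable region could be read and swapped out while everything outside the markers stays untouched. `get_block` and `replace_block` are hypothetical helpers written for this page; they are not how the Proposer is implemented.

```python
import re

# Matches the START marker line, the body (lazily), and the END marker line.
BLOCK = re.compile(
    r"(?P<start>^[ \t]*# EVOLVE-BLOCK-START[^\n]*\n)"
    r"(?P<body>.*?)"
    r"(?P<end>^[ \t]*# EVOLVE-BLOCK-END)",
    re.DOTALL | re.MULTILINE,
)


def get_block(source: str) -> str:
    """Return the text between the markers (hypothetical helper)."""
    match = BLOCK.search(source)
    if match is None:
        raise ValueError("no EVOLVE-BLOCK markers found")
    return match.group("body")


def replace_block(source: str, new_body: str) -> str:
    """Swap the first editable region for new_body, keeping the markers."""
    if BLOCK.search(source) is None:
        raise ValueError("no EVOLVE-BLOCK markers found")
    if not new_body.endswith("\n"):
        new_body += "\n"
    # A function replacement avoids backslash-escape surprises in new_body.
    return BLOCK.sub(lambda m: m.group("start") + new_body + m.group("end"),
                     source, count=1)


SEED = '''def solve():
    # EVOLVE-BLOCK-START
    PARAMS = [0.9, -0.8, 0.7]
    # EVOLVE-BLOCK-END
    return PARAMS
'''

mutated = replace_block(SEED, "    PARAMS = [0.0, 0.0, 0.0]")
print(mutated)
```

Note that the replacement body must carry its own indentation, since the block sits inside `solve()`.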

3. `evaluator.py`

The contract is one function: evaluate(program_path) -> dict returning at least combined_score (a float). Galapagos runs it in an isolated subprocess (the SubprocessEvaluator), so it must be importable and side-effect-free.

"""Deterministic evaluator. Contract: evaluate(program_path) -> dict with combined_score (maximize)."""
import importlib.util

N = 3


def _load(program_path):
    spec = importlib.util.spec_from_file_location("_cand", program_path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod


def evaluate(program_path):
    try:
        params = list(_load(program_path).solve())
    except Exception as e:  # candidate crashed
        return {"combined_score": 0.0, "validity": 0.0, "status": "exec_error",
                "artifacts": {"text_feedback": f"execution error: {e}"}}
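    # x != x only catches NaN; the range test also rejects +/-inf (NaN fails every comparison).
    if len(params) != N or any(
            not isinstance(x, (int, float)) or not (float("-inf") < x < float("inf"))
            for x in params):
        return {"combined_score": 0.0, "validity": 0.0, "status": "invalid",
                "artifacts": {"text_feedback": f"PARAMS must be {N} finite numbers"}}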
    loss = sum(float(v) ** 2 for v in params)
    score = 1.0 / (1.0 + loss)
    return {
        "combined_score": score,                 # the headline number (required)
        "loss": loss,                            # any extra numeric metrics are kept too
        "validity": 1.0,
        "artifacts": {"text_feedback": f"loss={loss:.4f}, score={score:.4f}"},
    }
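To illustrate why the evaluator must be importable and free of side effects, the sketch below runs an `evaluate()` function in a fresh interpreter with a timeout. It is an assumption-laden stand-in, not the SubprocessEvaluator itself: `evaluate_in_subprocess`, the JSON-over-stdout protocol, and the 30-second default are all invented for this example.

```python
import json
import subprocess
import sys

# Script run in the child process: import the evaluator by path, call evaluate(),
# and print the result as one JSON line.
_RUNNER = """
import importlib.util, json, sys
spec = importlib.util.spec_from_file_location("_evaluator", sys.argv[1])
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)
print(json.dumps(mod.evaluate(sys.argv[2])))
"""


def _failure(message):
    # Same shape as the exec_error result returned by the evaluator above.
    return {"combined_score": 0.0, "validity": 0.0, "status": "exec_error",
            "artifacts": {"text_feedback": message}}


def evaluate_in_subprocess(evaluator_path, program_path, timeout=30.0):
    """Score one candidate in a separate interpreter (hypothetical helper)."""
    cmd = [sys.executable, "-c", _RUNNER, str(evaluator_path), str(program_path)]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return _failure(f"evaluation timed out after {timeout}s")
    lines = proc.stdout.strip().splitlines()
    if proc.returncode != 0 or not lines:
        return _failure(proc.stderr.strip()[-500:] or "evaluator produced no output")
    try:
        # The candidate may print to stdout; only the last line is the result.
        return json.loads(lines[-1])
    except json.JSONDecodeError:
        return _failure("evaluator output was not valid JSON")
```

A crash or hang in a candidate then costs one child process rather than the whole search loop.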

The returned dict keys mean:

| Key | Effect |
| --- | --- |
| `combined_score` | The headline fitness (required). Maximized by the loop. |
| any other numeric key | Kept in `genome.scores` and shown in the prompt's metrics section. |
| `validity` / `status` | Gate admission. `validity: 0.0` or `status` in `{exec_error, invalid}` marks the candidate invalid. |
| `artifacts.text_feedback` | Surfaced back into the next prompt under "Evaluator feedback". |
| `per_instance` | Optional per-test-case success vector (for Pareto / instance-level methods). |
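Read as code, the gating and metric rules in the table might look like the sketch below. The function names are hypothetical, and two details are assumptions rather than documented behavior: a missing `validity` key is treated as valid, and a missing or non-finite `combined_score` is treated as invalid.

```python
import math

# Status values the table lists as marking a candidate invalid.
INVALID_STATUSES = {"exec_error", "invalid"}


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def is_admissible(result: dict) -> bool:
    """Apply the validity/status gate to one evaluator result (hypothetical)."""
    score = result.get("combined_score")
    if not _is_number(score) or not math.isfinite(score):
        return False  # assumption: the required headline score must be finite
    if result.get("status") in INVALID_STATUSES:
        return False
    return result.get("validity", 1.0) != 0.0  # assumption: absent means valid


def numeric_metrics(result: dict) -> dict:
    """Collect the top-level numeric keys, i.e. what would be kept as scores."""
    return {key: float(value) for key, value in result.items() if _is_number(value)}


ok = {"combined_score": 0.34, "loss": 1.94, "validity": 1.0,
      "artifacts": {"text_feedback": "loss=1.9400"}}
print(is_admissible(ok), numeric_metrics(ok))
```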

Recompute the score independently

Score the candidate from its returned output, not from any number it printed — this is the anti-reward-hacking discipline the bundled tasks follow. A candidate that returns combined_score = 999 from inside its own code gains nothing; the evaluator decides the score.
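A small demonstration of the point, using the same scoring rule as the evaluator on this page: the hypothetical candidate below announces a score of 999 in two different ways, and neither affects the result, because only the vector returned by `solve()` is scored. `score_from_output` is a stripped-down stand-in for `evaluate()` without its validation.

```python
import importlib.util
import os
import tempfile

CHEATER = '''
def solve():
    print("combined_score = 999")   # a printed claim; the scorer never reads it
    return [0.9, -0.8, 0.7]         # only this returned vector is scored

combined_score = 999                # a module-level claim; also never read
'''


def score_from_output(program_path):
    """Recompute 1 / (1 + sum of squares) from what solve() returns."""
    spec = importlib.util.spec_from_file_location("_cand", program_path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    params = list(mod.solve())
    loss = sum(float(v) ** 2 for v in params)
    return 1.0 / (1.0 + loss)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cheater.py")
    with open(path, "w") as fh:
        fh.write(CHEATER)
    cheater_score = score_from_output(path)

# The loss is 0.81 + 0.64 + 0.49 = 1.94, so the score is 1 / 2.94.
print(round(cheater_score, 4))
```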

Validate it

gx.available_tasks() and galapagos task list show only the cards bundled in the galapagos package (under src/galapagos/tasks/). A third-party task does not appear there and cannot be loaded by name; load it by the explicit path to its card:

import galapagos as gx
task = gx.GalapagosTask.from_card(path="tasks/my_task/card.yaml")
task.runnable                       # -> True  (seed + evaluator present)
task.context                        # the problem text
seed = task.initial_genome()
print(task.evaluator.evaluate(seed).combined_score)  # score the seed directly
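Before involving Galapagos at all, a task folder can be given a quick structural check with the standard library. The sketch below is a hypothetical pre-flight helper, not part of the package: it assumes the default file names rather than parsing card.yaml, and it imports the evaluator in the current process, so run it only on an evaluator you trust.

```python
import importlib.util
from pathlib import Path

REQUIRED = ("card.yaml", "initial_program.py", "evaluator.py")
MARKERS = ("# EVOLVE-BLOCK-START", "# EVOLVE-BLOCK-END")


def _load(path, name):
    spec = importlib.util.spec_from_file_location(name, str(path))
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod


def check_task_dir(task_dir):
    """Return a list of problems; an empty list means the basic checks passed."""
    root = Path(task_dir)
    problems = [f"missing {name}" for name in REQUIRED if not (root / name).is_file()]
    if problems:
        return problems
    seed_path = root / "initial_program.py"
    seed_text = seed_path.read_text()
    problems += [f"seed lacks {marker}" for marker in MARKERS if marker not in seed_text]
    evaluate = getattr(_load(root / "evaluator.py", "_evaluator"), "evaluate", None)
    if not callable(evaluate):
        return problems + ["evaluator.py does not define evaluate()"]
    # Score the seed once, as the setup step would.
    result = evaluate(str(seed_path))
    if not isinstance(result, dict) or not isinstance(result.get("combined_score"), float):
        problems.append("evaluate() must return a dict with a float combined_score")
    return problems
```

Calling `check_task_dir("tasks/my_task")` on the task from this page should return an empty list if the three files match the listings above.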

Run any scaffold against it:

scaffold = gx.OpenEvolveScaffold.from_card(model=gx.load_model("openai/gpt-4o-mini", host="openrouter"))
result = scaffold.run(task=task, max_iterations=30)
print(result.best_score)

Docker (sandbox) evaluation

By default a task is scored locally — task.evaluator is a SubprocessEvaluator that runs your evaluator.py in an isolated subprocess on the host. Set evaluation.mode: docker (alias container) and the same evaluator.py runs inside a self-contained Docker sandbox instead — Harbor-style. The scoring contract, the cascade stages, and the validity gating are identical; only the execution is containerized, so a deterministic scorer runs the same way on every machine.

evaluation:
  mode: docker             # score evaluator.py inside a Docker sandbox
  # optional, all with sensible defaults:
  # dockerfile: Dockerfile # a Dockerfile shipped in the task folder (else one is synthesized)
  # base_image: python:3.11-slim
  # requirements: [numpy==1.26.4, scipy]   # pip deps for the sandbox (defaults to the card's `library`)
  # python_bin: python3    # interpreter inside the image that runs the scorer
  # env: [HF_TOKEN]        # host env vars to forward in (a list of names), or a literal {KEY: value} map

The image is built once and a single container is reused across evaluations (each candidate is injected at a unique path, so concurrent evaluations don't collide); it is removed when the run ends. The environment is defined in one of two ways:

- Ship a Dockerfile in the task folder for full control (system packages, a compiler toolchain, a pinned base). It builds with the task folder as its context, so it may COPY task files.
- Ship nothing, and Galapagos synthesizes a minimal image from the card: FROM <base_image or python:3.11-slim> plus a pip install of evaluation.requirements (falling back to the card's library). Your evaluator.py and its data files are copied into the running container, so the Dockerfile only needs to describe the environment, not the task.
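The synthesis rule in the second option can be sketched as a function from a parsed card to Dockerfile text. This is a guess at the shape of the output based only on the description above: `synthesize_dockerfile` is a hypothetical name, and the `--no-cache-dir` flag and the shell quoting are choices made for this example.

```python
import shlex


def synthesize_dockerfile(card: dict) -> str:
    """Build a minimal Python image description from a parsed card (hypothetical)."""
    evaluation = card.get("evaluation") or {}
    base_image = evaluation.get("base_image", "python:3.11-slim")
    # evaluation.requirements wins; otherwise fall back to the card's `library`.
    requirements = evaluation.get("requirements") or card.get("library") or []
    lines = [f"FROM {base_image}"]
    if requirements:
        # Quote each spec so that forms like "numpy>=1.26" survive the shell.
        specs = " ".join(shlex.quote(spec) for spec in requirements)
        lines.append(f"RUN pip install --no-cache-dir {specs}")
    return "\n".join(lines) + "\n"


card = {"library": ["numpy"], "evaluation": {"mode": "docker"}}
print(synthesize_dockerfile(card))
```

No COPY step appears because, as stated above, the evaluator and its data files are copied into the running container rather than baked into the image.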

Any task can be forced into (or out of) a sandbox for a whole run without touching its card, via the run config general.evaluation_mode (docker / local; None honors the card):

galapagos run --scaffold openevolve --task my_task --set general.evaluation_mode=docker

task = gx.GalapagosTask.from_card(name="my_task").set_eval_mode("docker")
print(task.evaluator.evaluate(task.initial_genome()).combined_score)  # scored in the sandbox

The sandbox is sealed: unlike the local subprocess, which inherits the full host environment, a docker-mode evaluator sees only the env vars you list under evaluation.env. Docker mode needs the docker CLI on PATH; if it is missing, the evaluator raises with a clear message.

Auto-synthesis only knows how to build a Python image. A non-Python task, or one whose dependencies aren't pip-installable (a system package, a compiler, a private wheel), must ship its own Dockerfile (or set base_image); otherwise Galapagos raises rather than guessing.

For scoring that reproduces across machines, not just across the local/container split on one machine, pin your requirements versions or ship a Dockerfile with a pinned base.
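The two accepted shapes of evaluation.env map naturally onto the docker CLI's `-e` flag: `-e NAME` forwards the host's value, and `-e KEY=value` sets a literal. The sketch below shows that mapping together with a PATH check. Both helpers are hypothetical, and how Galapagos actually invokes docker is not described on this page.

```python
import shutil


def docker_env_flags(env) -> list:
    """Translate an evaluation.env value into docker -e arguments (hypothetical)."""
    flags = []
    if isinstance(env, dict):
        # Literal {KEY: value} map: set each variable explicitly.
        for key, value in env.items():
            flags += ["-e", f"{key}={value}"]
    else:
        # List of names: a bare name makes docker forward the host's value.
        for name in env or []:
            flags += ["-e", str(name)]
    return flags


def require_docker() -> str:
    """Return the docker executable's path, or raise if it is not on PATH."""
    path = shutil.which("docker")
    if path is None:
        raise RuntimeError("docker mode needs the `docker` CLI on PATH")
    return path


print(docker_env_flags(["HF_TOKEN"]))
print(docker_env_flags({"MODE": "test"}))
```

Anything not listed never reaches the container, which is the sealing behavior described above.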

To publish it, see Submit to the Hub.
