The Hub

The galapagos Hub is a Hugging-Face-Hub-style registry for LLM-driven discovery. It has four parts:

  1. A registry of scaffold cards and task cards — the same cards bundled in the library, plus community contributions (library ⊆ hub).
  2. A live leaderboard — per task, ranking verified discoveries by the task's metric.
  3. A verification system — domain-expert review of submitted discoveries (the trajectory + best solution), so leaderboard scores are checked, not self-reported.
  4. A live playground — run component compositions against bundled toy tasks on a small budget, right from the browser (or run any scaffold via the CLI), to compare methods. It calls a real model, so it needs an OpenRouter key.

Cards are the unit of exchange

Everything on the Hub is a card:

| Card | What it registers |
| --- | --- |
| ScaffoldCard | a discovery method (which class + which six components fill the slots) |
| TaskCard | an evaluation task (seed + evaluator + metric + verification semantics) |
| ModelCard | a model (path + host) |
| VerificationCard | a submitted discovery (trajectory + best solution) for expert review |

Cards are permissive (extra="allow"), so a contributor can attach method-specific fields without a core change. Run galapagos submit --card ... to validate any card before submitting it.

The verification system

A discovery is submitted as a VerificationCard carrying the discovery trajectory and the best solution, with status unverified. A domain-expert reviewer reproduces the trajectory, re-scores the solution with the task's own (independent, anti-reward-hacking) evaluator, and promotes it:

unverified → under_review → verified | rejected

Only verified discoveries reach the leaderboard. Because every submission is re-scored by the same task evaluator, leaderboard entries are directly comparable across scaffolds and agents. See Submit to the Hub for the submission flow.
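The status lifecycle above can be read as a small state machine. The following is a minimal sketch, not the Hub's implementation: the names `ALLOWED`, `advance`, and `on_leaderboard` are hypothetical, and treating `verified` and `rejected` as terminal states is an assumption the text does not confirm.

```python
# Allowed moves, read off "unverified -> under_review -> verified | rejected".
# Assumption: verified and rejected are terminal (no outgoing transitions).
ALLOWED = {
    "unverified": {"under_review"},
    "under_review": {"verified", "rejected"},
    "verified": set(),
    "rejected": set(),
}


def advance(status: str, new_status: str) -> str:
    """Return new_status if the move is allowed, otherwise raise ValueError."""
    if new_status not in ALLOWED.get(status, set()):
        raise ValueError(f"illegal transition: {status} -> {new_status}")
    return new_status


def on_leaderboard(status: str) -> bool:
    """Only verified discoveries reach the leaderboard."""
    return status == "verified"


status = "unverified"
status = advance(status, "under_review")
status = advance(status, "verified")
print(status, on_leaderboard(status))
```

Skipping review, for example `advance("unverified", "verified")`, raises in this sketch, which mirrors the rule that a reviewer must promote a submission.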

Verifying a registered scaffold or task

The verification above checks a discovery. A second, distinct question is whether a registered method or task is a faithful port of its original. When the original author registers their own scaffold, faithfulness is self-evident. When a third party ports someone else's published method, the Hub must confirm the port reproduces the original before its scores are trusted. Two procedures gate a community port:

  1. Original-author code review. The porter records the upstream author and repo on the card; the Hub routes the port to that author (or a domain expert) to confirm the implementation matches.
  2. A parity experiment. The port is run head-to-head with the original on a fixed task, model, and config, and the two must reach equivalent performance.

Two kinds of parity

Galapagos separates two parity questions, because they have different reference points and admit different levels of statistical rigour:

| Parity | Reference | Variance | Statistical leverage |
| --- | --- | --- | --- |
| Scaffold parity | the original method implementation | high: stochastic search + LLM sampling, cross-repo | low → leans on code review |
| Task parity | the original benchmark | low: the evaluator is usually deterministic | high → full equivalence testing is cheap |

A scaffold is a stochastic search procedure run with a real LLM, so a single run is one draw from a high-variance distribution and each run costs money. The original lives in a different repository with different RNG semantics, so the two sides cannot be paired by seed. Scaffold parity is therefore a coarse check, and the weight of trust sits on the code review. Task parity, by contrast, compares two evaluators on the same solution — cheap, repeatable, and pairable — so it carries the rigorous statistics.

How parity is reported

Every run-to-run number is reported as mean ± sample SEM — the sample standard error of the mean, s / √n, where s is the sample standard deviation over n runs per side (n ≥ 2, three or more preferred). This follows Harbor's adapter-parity convention: SEM measures how precisely each side estimates the true score and tightens as runs are added, whereas the sample standard deviation does not and can hide a genuinely diverging port behind wide error bars.

The raw per-run scores — original_runs and galapagos_runs — are the source of truth; the mean ± SEM strings are derived for display, and a reviewer recomputes them from the raw arrays. Both sides run in lockstep: a sanity check on a few tasks, then one full run each, then scale to three.
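A reviewer's recomputation from the raw arrays might look like the sketch below. It uses only the standard library; the helper names `mean_sem` and `display` and the scores themselves are hypothetical, chosen for illustration.

```python
import math
import statistics


def mean_sem(runs):
    """Return (mean, sample SEM) for a list of per-run scores (n >= 2)."""
    n = len(runs)
    if n < 2:
        raise ValueError("need at least two runs per side")
    s = statistics.stdev(runs)  # sample standard deviation (n - 1 denominator)
    return statistics.fmean(runs), s / math.sqrt(n)


def display(runs, digits=3):
    """Format the derived 'mean ± SEM' string from the raw scores."""
    m, sem = mean_sem(runs)
    return f"{m:.{digits}f} ± {sem:.{digits}f}"


# Hypothetical scores, for illustration only.
original_runs = [0.80, 0.82, 0.84]
galapagos_runs = [0.79, 0.83, 0.81]
print("original :", display(original_runs))
print("galapagos:", display(galapagos_runs))
```

Because the display string is always derived from the arrays, a card that stores only the string cannot be rechecked, which is why the raw runs are treated as the source of truth.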

Overlapping error bars do not prove equivalence

Two means whose mean ± SEM bars overlap are not thereby equivalent — that is "no evidence of a difference," which an underpowered experiment produces for free. Galapagos uses raw-range overlap only as a fast smoke test to reject gross divergence. A genuine equivalence claim requires the 90% confidence interval of the difference to fall inside a pre-registered margin Δ (a TOST / equivalence test). That test is only affordable and pairable for task parity, so that is where it is applied; scaffold parity at three runs is reported honestly as no_divergence | inconclusive | divergent, and an inconclusive result is never auto-promoted.
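The two checks named above can be sketched as follows: a raw-range overlap smoke test, and a paired TOST expressed as "the 90% confidence interval of the mean difference lies inside ±Δ". This is a sketch under stated assumptions, not Galapagos's code: the function names are hypothetical, the scores are made up, the pairing is assumed to be per solution (as for task parity), and the t critical values are the standard one-sided 5% table for up to 10 degrees of freedom.

```python
import math
import statistics

# One-sided 5% Student-t critical values, keyed by degrees of freedom.
T_CRIT_05 = {1: 6.314, 2: 2.920, 3: 2.353, 4: 2.132, 5: 2.015,
             6: 1.943, 7: 1.895, 8: 1.860, 9: 1.833, 10: 1.812}


def ranges_overlap(a, b):
    """Smoke test: do the raw [min, max] ranges of two run sets intersect?"""
    return min(a) <= max(b) and min(b) <= max(a)


def paired_equivalence(original, port, margin):
    """Return (low, high, equivalent) for the 90% CI of the mean paired difference.

    'equivalent' is True only if the whole interval lies inside (-margin, +margin),
    which is the TOST decision at the 5% level.
    """
    if len(original) != len(port):
        raise ValueError("paired test needs equal-length score lists")
    diffs = [p - o for o, p in zip(original, port)]
    n = len(diffs)
    if n - 1 not in T_CRIT_05:
        raise ValueError("this sketch only tabulates 2 to 11 pairs")
    half = T_CRIT_05[n - 1] * statistics.stdev(diffs) / math.sqrt(n)
    mean_diff = statistics.fmean(diffs)
    low, high = mean_diff - half, mean_diff + half
    return low, high, (-margin < low and high < margin)


# Hypothetical paired scores: the same five solutions scored by both evaluators.
original = [1.00, 2.00, 3.00, 4.00, 5.00]
port = [1.01, 1.99, 3.00, 4.01, 4.99]
print(ranges_overlap(original, port))
print(paired_equivalence(original, port, margin=0.02))
```

The margin must be fixed before the data is seen; choosing it afterwards would let any result pass.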

The two-tier check

| Tier | When | Cost | Catches |
| --- | --- | --- | --- |
| A: deterministic re-score | every submission | zero (no LLM) | task / evaluator drift |
| B: statistical smoke | on review | a few LLM runs (n = 3) | gross search divergence |

Tier A takes the original's best solution from a frozen reference fixture and re-scores it with the registered task's own evaluator. If the recomputed score matches the original's claim, the task faithfully reproduces the benchmark's scoring. (This makes concrete the evaluator re-run that these docs promise.) Tier B runs the port with a real LLM under the original's model and config and compares galapagos_runs against the fixture's original_runs in mean ± SEM.

The reference fixture is contributed once by the original author — per-seed scores plus the best solution, under a pinned (task, model, config, seeds). It doubles as the author's sign-off and the parity baseline, so the two verification procedures fold into a single artifact.
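To make the fixture and the Tier A re-score concrete, here is a minimal sketch. The fixture's field names, the task and model strings, and `toy_evaluator` are all hypothetical: the text specifies what a fixture contains, not how it is laid out, and a real check would call the registered task's own evaluator. The absolute tolerance is likewise an assumption.

```python
import math

# Hypothetical fixture layout: per-seed scores plus the best solution,
# under a pinned (task, model, config, seeds).
fixture = {
    "task": "toy_sphere",            # hypothetical task name
    "model": "example/model",        # hypothetical model id
    "config": {"iters": 20},
    "seeds": [0, 1, 2],
    "original_runs": [-0.12, -0.09, -0.15],
    "best_solution": [0.1, -0.2, 0.2],
    "best_score": -0.09,
}


def toy_evaluator(solution):
    """Stand-in deterministic evaluator: negative squared norm (higher is better)."""
    return -sum(x * x for x in solution)


def tier_a_rescore(fixture, evaluator, abs_tol=1e-9):
    """Re-score the fixture's best solution and compare it with the claimed score."""
    recomputed = evaluator(fixture["best_solution"])
    passed = math.isclose(recomputed, fixture["best_score"],
                          rel_tol=0.0, abs_tol=abs_tol)
    return {"recomputed": recomputed,
            "claimed": fixture["best_score"],
            "rescore_passed": passed}


result = tier_a_rescore(fixture, toy_evaluator)
print(result["rescore_passed"])
```

No LLM is called anywhere in this path, which is why Tier A can run on every submission at no cost.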

What the card records

A scaffold card carries explicit provenance and a verification block (cards are permissive, so these attach without a schema change):

original_author: "..."        # who published the method
upstream_repo: "..."          # and where
upstream_commit: "..."        # pinned to a commit
registered_by: "..."          # who ported it — distinct from the author

verification:
  status: unverified          # unverified → under_review → verified | rejected
  code_review: { reviewer: "...", status: approved, ref: "..." }
  parity:
    fixture: "..."            # the original author's reference fixture
    tier_a: { rescore_passed: true }
    tier_b: { decision: no_divergence, ref: "parity_experiment.json" }

A port is listed as stable and reaches the leaderboard only once verification.status is verified — wiring a runnable class is not enough on its own. Until then a community port is catalogued but flagged unverified.
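The listing rule can be expressed as a check over a card that has already been parsed into a dictionary. This is an illustrative sketch: `is_stable` and `missing_provenance` are hypothetical helpers, and treating all four provenance fields as mandatory is an assumption, since cards are permissive.

```python
def is_stable(card: dict) -> bool:
    """A port is stable only when verification.status is exactly 'verified'."""
    verification = card.get("verification") or {}
    return verification.get("status") == "verified"


def missing_provenance(card: dict) -> list:
    """List provenance fields that are absent or empty (assumed mandatory here)."""
    required = ("original_author", "upstream_repo",
                "upstream_commit", "registered_by")
    return [key for key in required if not card.get(key)]


# Hypothetical card contents, mirroring the fields shown above.
card = {
    "original_author": "A. Author",
    "upstream_repo": "https://example.org/upstream",
    "registered_by": "P. Porter",
    "verification": {"status": "under_review"},
}
print(is_stable(card))            # still catalogued, but flagged unverified
print(missing_provenance(card))   # the pinned commit is absent
```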

Endpoints

The Hub is a small FastAPI service; every route is under /api. Reads are open; writes require an Authorization: Bearer <token> header (mint one at POST /api/auth/token).

| Method & path | Auth | Purpose |
| --- | --- | --- |
| GET /api/stats | open | catalog counts + breakdowns by tier / status / type |
| GET /api/scaffolds | open | browse the scaffold catalog (filters: q, tier, status, type) |
| GET /api/scaffolds/{name} | open | fetch one scaffold card (YAML + JSON) |
| POST /api/scaffolds | bearer | register a scaffold card; body {"card_yaml": "..."} |
| GET /api/tasks | open | browse the task catalog (filters: q, domain, status) |
| GET /api/tasks/{name} | open | fetch one task card |
| POST /api/tasks | bearer | register a task card; body {"card_yaml": "..."} |
| GET /api/{scaffolds\|tasks}/{name}/tree | open | the card bundle's full file tree (recursive, like a HF repo's Files tab) |
| GET /api/{scaffolds\|tasks}/{name}/file?path= | open | one file's content (text inlined up to 200 KB) |
| GET /api/{scaffolds\|tasks}/{name}/raw?path= | open | raw file download (streams; supports Range) |
| GET /api/leaderboard | open | the leaderboard (filters: task, scaffold; sorted by score desc) |
| POST /api/leaderboard | bearer | submit a run result (lands pending until reviewed) |
| POST /api/leaderboard/{id}/verify | bearer | admin review action: flip an entry pending → accepted \| rejected |
| GET /api/verifications | open | list submitted discoveries (filters: task, status) |
| POST /api/verifications | bearer | submit a discovery (a VerificationCard) for review |
| POST /api/auth/token | gated | issue a bearer token (open in dev; gated by OG_HUB_ADMIN_TOKEN in prod) |

The full submission flow — validate a card, mint a token, POST it — is in Submit to the Hub.
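As a rough illustration of the read/write split, the sketch below builds (but does not send) one open read and one authenticated write using only the standard library. The helper names are hypothetical, the base URL is the local address listed under "Stand up the web service", and the JSON `Content-Type` header is an assumption about how the service expects the body. The request body for POST /api/auth/token is not documented here, so token minting is left out.

```python
import json
import urllib.parse
import urllib.request

BASE = "http://127.0.0.1:8000"  # local instance; adjust for a shared deployment


def browse_request(kind, **filters):
    """Build GET /api/scaffolds or /api/tasks with optional query filters."""
    query = urllib.parse.urlencode(filters)
    url = f"{BASE}/api/{kind}" + (f"?{query}" if query else "")
    return urllib.request.Request(url, method="GET")


def register_request(kind, card_yaml, token):
    """Build POST /api/scaffolds or /api/tasks; writes carry the bearer header."""
    body = json.dumps({"card_yaml": card_yaml}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE}/api/{kind}",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",  # assumed, not stated in the docs
        },
    )


read = browse_request("scaffolds", q="evolve")
write = register_request("scaffolds", "name: my_scaffold\n", token="<token>")
print(read.get_method(), read.full_url)
print(write.get_method(), write.full_url)
```

Passing either object to `urllib.request.urlopen` would send it to a running Hub; the read needs no token, while the write is rejected without a valid one.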

Run it locally

The Hub mirrors the library's bundled cards, so a local instance serves the 8 scaffold cards and 64 task cards out of the box. The catalog is discoverable straight from the package without a server:

from galapagos.cards.registry import (
    available_scaffolds, available_tasks, load_scaffold_card, load_task_card,
)
available_scaffolds()                 # the scaffold catalog the Hub mirrors
load_task_card("circle_packing")      # one task card

Or from the CLI:

galapagos scaffold list      # the catalog as the Hub would list it
galapagos task list

The playground is just a budget-capped run of a runnable scaffold (it calls a live model, so set your OpenRouter key in OPENAI_API_KEY):

galapagos run --scaffold openevolve --task playground_sphere \
    --model openai/gpt-4o-mini --host openrouter --iters 20

playground_sphere is the dedicated Playground task — instant and pure-Python — designed for the Compare demo and the fastest first runs.

Stand up the web service

The repository ships the Hub backend under hub/. A local instance creates a SQLite database and syncs every bundled card on startup (library ⊆ hub):

bash hub/run_hub.sh
# → http://127.0.0.1:8000/api          the API
#   http://127.0.0.1:8000/docs         interactive OpenAPI docs
#   http://127.0.0.1:8000/healthz      liveness

For a shared deployment, hub/docker-compose.yml runs the backend against Postgres. It fails closed in prod unless you set an explicit OG_HUB_CORS origin list and an OG_HUB_ADMIN_TOKEN (which gates token issuance). See hub/README.md for the full environment-variable reference.