The Hub

The galapagos Hub is a Hugging-Face-Hub-style registry for LLM-driven discovery. It has four parts:

A registry of scaffold cards and task cards — the same cards bundled in the library, plus community contributions (library ⊆ hub).
A live leaderboard — per task, ranking verified discoveries by the task's metric.
A verification system — domain-expert review of submitted discoveries (the trajectory + best solution), so leaderboard scores are checked, not self-reported.
A live playground — run component compositions against bundled toy tasks on a small budget, right from the browser (or run any scaffold via the CLI), to compare methods. It calls a real model, so it needs an OpenRouter key.

Cards are the unit of exchange

Everything on the Hub is a card:

Card	What it registers
`ScaffoldCard`	a discovery method (which class + which six components fill the slots)
`TaskCard`	an evaluation task (seed + evaluator + metric + verification semantics)
`ModelCard`	a model (path + host)
`VerificationCard`	a submitted discovery (trajectory + best solution) for expert review

Cards are permissive (`extra="allow"`), so a contributor can attach method-specific fields without a core change. Before submitting, validate any card with `galapagos submit --card ...`.

The verification system

A discovery is submitted as a VerificationCard carrying the discovery trajectory and the best solution, with status unverified. A domain-expert reviewer reproduces the trajectory, re-scores the solution with the task's own (independent, anti-reward-hacking) evaluator, and moves the card through the following states, ending in either verified or rejected:

unverified → under_review → verified | rejected

Only verified discoveries reach the leaderboard. Because every submission is re-scored by the same task evaluator, leaderboard entries are directly comparable across scaffolds and agents. See Submit to the Hub for the submission flow.
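
The status lifecycle can be read as a small state machine. The following is a minimal sketch, not the Hub's implementation: the names `Status`, `ALLOWED`, `advance`, and `reaches_leaderboard` are hypothetical, and treating verified and rejected as terminal states is an assumption drawn from the arrow diagram above.

```python
from enum import Enum


class Status(str, Enum):
    UNVERIFIED = "unverified"
    UNDER_REVIEW = "under_review"
    VERIFIED = "verified"
    REJECTED = "rejected"


# Allowed moves, read off the lifecycle above. Treating verified and
# rejected as terminal is an assumption of this sketch.
ALLOWED = {
    Status.UNVERIFIED: {Status.UNDER_REVIEW},
    Status.UNDER_REVIEW: {Status.VERIFIED, Status.REJECTED},
    Status.VERIFIED: set(),
    Status.REJECTED: set(),
}


def advance(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the move skips a review step."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def reaches_leaderboard(status: Status) -> bool:
    """Only verified discoveries are ranked."""
    return status is Status.VERIFIED
```

Under these assumptions a submission cannot jump from unverified straight to verified; it has to pass through under_review first.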

Verifying a registered scaffold or task

The verification above checks a discovery. A second, distinct question is whether a registered method or task is a faithful port of its original. When the original author registers their own scaffold, that is self-evident. When a third party ports someone else's published method, the Hub must confirm the port reproduces the original before its scores are trusted. Two procedures gate a community port:

Original-author code review. The porter records the upstream author and repo on the card; the Hub routes the port to that author (or a domain expert) to confirm the implementation matches.
A parity experiment. The port is run head-to-head with the original on a fixed task, model, and config, and the two must reach equivalent performance.

Two kinds of parity

Galapagos separates two parity questions, because they have different reference points and admit different levels of statistical rigour:

Parity	Reference	Variance	Statistical leverage
Scaffold parity	the original method implementation	high — stochastic search + LLM sampling, cross-repo	low → leans on code review
Task parity	the original benchmark	low — the evaluator is usually deterministic	high → full equivalence testing is cheap

A scaffold is a stochastic search procedure run with a real LLM, so a single run is one draw from a high-variance distribution and each run costs money. The original lives in a different repository with different RNG semantics, so the two sides cannot be paired by seed. Scaffold parity is therefore a coarse check, and the weight of trust sits on the code review. Task parity, by contrast, compares two evaluators on the same solution — cheap, repeatable, and pairable — so it carries the rigorous statistics.

How parity is reported

Every run-to-run number is reported as mean ± sample SEM — the sample standard error of the mean, s / √n, where s is the sample standard deviation over n runs per side (n ≥ 2, three or more preferred). This follows Harbor's adapter-parity convention: SEM measures how precisely each side estimates the true score and tightens as runs are added, whereas the sample standard deviation does not and can hide a genuinely diverging port behind wide error bars.

The raw per-run scores — original_runs and galapagos_runs — are the source of truth; the mean ± SEM strings are derived for display, and a reviewer recomputes them from the raw arrays. Both sides run in lockstep: a sanity check on a few tasks, then one full run each, then scale to three.
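
A reviewer's recomputation of mean ± SEM from the raw arrays might look like the sketch below. It uses only the standard library; the helper names `mean_sem` and `format_mean_sem` and the three-decimal display format are assumptions for illustration, and the scores in the usage note are made-up numbers.

```python
from math import sqrt
from statistics import mean, stdev


def mean_sem(runs):
    """Return (mean, sample SEM) where SEM = s / sqrt(n)."""
    n = len(runs)
    if n < 2:
        raise ValueError("sample SEM needs at least two runs per side")
    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    return mean(runs), stdev(runs) / sqrt(n)


def format_mean_sem(runs, digits=3):
    """Derive the display string from the raw per-run scores."""
    m, sem = mean_sem(runs)
    return f"{m:.{digits}f} ± {sem:.{digits}f}"
```

For example, `format_mean_sem([0.80, 0.82, 0.84])` gives `"0.820 ± 0.012"`: the sample standard deviation is 0.02, and dividing by √3 yields a SEM of about 0.0115.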

Overlapping error bars do not prove equivalence

Two means whose mean ± SEM bars overlap are not thereby equivalent — that is "no evidence of a difference," which an underpowered experiment produces for free. Galapagos uses raw-range overlap only as a fast smoke test to reject gross divergence. A genuine equivalence claim requires the 90% confidence interval of the difference to fall inside a pre-registered margin Δ (a TOST / equivalence test). That test is only affordable and pairable for task parity, so that is where it is applied; scaffold parity at three runs is reported honestly as no_divergence | inconclusive | divergent, and an inconclusive result is never auto-promoted.
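
The difference between the smoke test and a real equivalence claim can be sketched in a few lines. This is an illustration under stated assumptions rather than the Hub's code: it assumes a paired design (both evaluators score the same solutions), takes the margin Δ as an argument, and reads the one-sided 95% Student-t critical value from a small hard-coded table so that it needs nothing beyond the standard library. The function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# One-sided 95% Student-t critical values, keyed by degrees of freedom.
# Two one-sided tests at alpha = 0.05 correspond to a 90% two-sided interval.
T_CRIT_95 = {1: 6.314, 2: 2.920, 3: 2.353, 4: 2.132, 5: 2.015,
             6: 1.943, 7: 1.895, 8: 1.860, 9: 1.833, 10: 1.812}


def ranges_overlap(a, b):
    """Fast smoke test: do the raw [min, max] ranges of two sides overlap?"""
    return max(min(a), min(b)) <= min(max(a), max(b))


def tost_equivalent(original, port, margin):
    """Return (is_equivalent, (ci_low, ci_high)) for paired scores.

    Equivalence holds only if the whole 90% confidence interval of the
    mean paired difference lies strictly inside (-margin, +margin).
    """
    if len(original) != len(port):
        raise ValueError("a paired test needs one score per solution on each side")
    diffs = [p - o for o, p in zip(original, port)]
    n = len(diffs)
    if n - 1 not in T_CRIT_95:
        raise ValueError("this sketch supports 2 to 11 paired scores")
    centre = mean(diffs)
    half_width = T_CRIT_95[n - 1] * stdev(diffs) / sqrt(n)
    ci = (centre - half_width, centre + half_width)
    return (-margin < ci[0] and ci[1] < margin), ci
```

With five paired scores whose differences are 0.01, -0.01, 0.02, -0.02, and 0.0, the interval is roughly ±0.015, so the pair passes at Δ = 0.05 and fails at Δ = 0.01. The same data would pass the range-overlap smoke test either way, which is why the smoke test alone cannot support an equivalence claim.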

The two-tier check

Tier	When	Cost	Catches
A — deterministic re-score	every submission	zero (no LLM)	task / evaluator drift
B — statistical smoke	on review	a few LLM runs (`n = 3`)	gross search divergence

Tier A takes the original's best solution from a frozen reference fixture and re-scores it with the registered task's own evaluator. If the recomputed score matches the original's claim, the task faithfully reproduces the benchmark's scoring. (This makes concrete the evaluator re-run that these docs promise.) Tier B runs the port with a real LLM under the original's model and config and compares galapagos_runs against the fixture's original_runs as mean ± SEM.

The reference fixture is contributed once by the original author: per-seed scores plus the best solution, under a pinned (task, model, config, seeds) tuple. It doubles as the author's sign-off and the parity baseline, so the two verification procedures fold into a single artifact.
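
Tier A is cheap because it is a single evaluator call. A minimal sketch follows, assuming a fixture shaped as a dict with hypothetical `best_solution` and `claimed_score` keys and an evaluator that is a plain callable; the real fixture schema and evaluator interface may differ. The toy `sphere` function stands in for a task evaluator.

```python
import math


def tier_a_rescore(fixture, evaluate, rel_tol=1e-9, abs_tol=1e-12):
    """Re-score the fixture's best solution and compare with the claim.

    No LLM is involved: the only work is one call to the task evaluator.
    """
    recomputed = evaluate(fixture["best_solution"])
    claimed = fixture["claimed_score"]
    passed = math.isclose(recomputed, claimed, rel_tol=rel_tol, abs_tol=abs_tol)
    return {"rescore_passed": passed, "claimed": claimed, "recomputed": recomputed}


def sphere(solution):
    """Toy stand-in for a deterministic task evaluator (sum of squares)."""
    return sum(x * x for x in solution)


# Hypothetical fixture content, for illustration only.
fixture = {"best_solution": [0.5, -0.5, 1.0], "claimed_score": 1.5}
result = tier_a_rescore(fixture, sphere)
```

Here `result["rescore_passed"]` is `True`. Swapping in an evaluator that has drifted, for instance one that adds a constant offset, makes the check fail, which is the kind of task or evaluator drift Tier A is meant to catch.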

What the card records

A scaffold card carries explicit provenance and a verification block (cards are permissive, so these attach without a schema change):

original_author: "..."        # who published the method
upstream_repo: "..."          # and where
upstream_commit: "..."        # pinned to a commit
registered_by: "..."          # who ported it — distinct from the author

verification:
  status: unverified          # unverified → under_review → verified | rejected
  code_review: { reviewer: "...", status: approved, ref: "..." }
  parity:
    fixture: "..."            # the original author's reference fixture
    tier_a: { rescore_passed: true }
    tier_b: { decision: no_divergence, ref: "parity_experiment.json" }

A port is listed as stable and reaches the leaderboard only once verification.status is verified — wiring a runnable class is not enough on its own. Until then a community port is catalogued but flagged unverified.
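
The gating rule can be expressed as a check over an already-parsed card. This is a hedged sketch: it assumes the card has been loaded into a plain dict with the field names shown above, and the helpers `missing_provenance` and `listing_flag` are hypothetical names, not part of the library.

```python
PROVENANCE_FIELDS = ("original_author", "upstream_repo",
                     "upstream_commit", "registered_by")


def missing_provenance(card: dict) -> list:
    """List the provenance fields that are absent or empty on a card."""
    return [field for field in PROVENANCE_FIELDS if not card.get(field)]


def listing_flag(card: dict) -> str:
    """Return 'stable' only when verification.status is 'verified'."""
    status = (card.get("verification") or {}).get("status", "unverified")
    return "stable" if status == "verified" else "unverified"
```

A card with no verification block, or one still under review, is flagged unverified by this rule, matching the behaviour described above for a freshly wired community port.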

Endpoints

The Hub is a small FastAPI service; every route is under /api. Reads are open; writes require an Authorization: Bearer <token> header (mint one at POST /api/auth/token).

Method & path	Auth	Purpose
`GET /api/stats`	—	catalog counts + breakdowns by tier / status / type
`GET /api/scaffolds`	—	browse the scaffold catalog (filters: `q`, `tier`, `status`, `type`)
`GET /api/scaffolds/{name}`	—	fetch one scaffold card (YAML + JSON)
`POST /api/scaffolds`	bearer	register a scaffold card — body `{"card_yaml": "..."}`
`GET /api/tasks`	—	browse the task catalog (filters: `q`, `domain`, `status`)
`GET /api/tasks/{name}`	—	fetch one task card
`POST /api/tasks`	bearer	register a task card — body `{"card_yaml": "..."}`
`GET /api/{scaffolds|tasks}/{name}/tree`	—	the card bundle's full file tree (recursive, like a HF repo's Files tab)
`GET /api/{scaffolds|tasks}/{name}/file?path=`	—	one file's content (text inlined up to 200 KB)
`GET /api/{scaffolds|tasks}/{name}/raw?path=`	—	raw file download (streams; supports `Range`)
`GET /api/leaderboard`	—	the leaderboard (filters: `task`, `scaffold`; sorted by score desc)
`POST /api/leaderboard`	bearer	submit a run result (lands `pending` until reviewed)
`POST /api/leaderboard/{id}/verify`	bearer	admin review action: flip an entry `pending → accepted | rejected`
`GET /api/verifications`	—	list submitted discoveries (filters: `task`, `status`)
`POST /api/verifications`	bearer	submit a discovery (a `VerificationCard`) for review
`POST /api/auth/token`	gated	issue a bearer token (open in dev; gated by `OG_HUB_ADMIN_TOKEN` in prod)

The full submission flow — validate a card, mint a token, POST it — is in Submit to the Hub.
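
As a rough illustration of the read and write routes in the table, the sketch below builds the HTTP requests with the standard library and leaves sending them to the caller. The base URL assumes a local instance on port 8000, the helper names are hypothetical, and the token is passed in as a parameter because the response shape of `POST /api/auth/token` is not described here.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://127.0.0.1:8000/api"  # assumed local instance


def browse_request(collection, **filters):
    """Build an open (unauthenticated) GET for a catalog or the leaderboard."""
    query = urlencode(filters)
    return Request(f"{BASE}/{collection}" + (f"?{query}" if query else ""))


def register_card_request(collection, card_yaml, token):
    """Build an authenticated POST carrying a card as {"card_yaml": "..."}."""
    if collection not in ("scaffolds", "tasks"):
        raise ValueError("cards are registered under 'scaffolds' or 'tasks'")
    body = json.dumps({"card_yaml": card_yaml}).encode("utf-8")
    return Request(
        f"{BASE}/{collection}",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Passing either request to `urllib.request.urlopen` sends it. For instance, `browse_request("leaderboard", task="circle_packing")` targets `/api/leaderboard?task=circle_packing`.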

Run it locally

The Hub mirrors the library's bundled cards, so a local instance serves the 8 scaffold cards and 64 task cards out of the box. The catalog is discoverable straight from the package without a server:

from galapagos.cards.registry import (
    available_scaffolds, available_tasks, load_scaffold_card, load_task_card,
)
available_scaffolds()                 # the scaffold catalog the Hub mirrors
load_task_card("circle_packing")      # one task card

galapagos scaffold list      # the catalog as the Hub would list it
galapagos task list

The playground is just a budget-capped run of a runnable scaffold (it calls a live model, so set your OpenRouter key in OPENAI_API_KEY):

galapagos run --scaffold openevolve --task playground_sphere \
    --model openai/gpt-4o-mini --host openrouter --iters 20

playground_sphere is the dedicated Playground task — instant and pure-Python — designed for the Compare demo and the fastest first runs.

Stand up the web service

The repository ships the Hub backend under hub/. A local instance creates a SQLite database and syncs every bundled card on startup (library ⊆ hub):

bash hub/run_hub.sh
# → http://127.0.0.1:8000/api          the API
#   http://127.0.0.1:8000/docs         interactive OpenAPI docs
#   http://127.0.0.1:8000/healthz      liveness

For a shared deployment, hub/docker-compose.yml runs the backend against Postgres. It fails closed in prod unless you set an explicit OG_HUB_CORS origin list and an OG_HUB_ADMIN_TOKEN (which gates token issuance). See hub/README.md for the full environment-variable reference.