The Hub
The galapagos Hub is a Hugging-Face-Hub-style registry for LLM-driven discovery. It has four parts:
- A registry of scaffold cards and task cards — the same cards bundled in the library, plus community contributions (library ⊆ hub).
- A live leaderboard — per task, ranking verified discoveries by the task's metric.
- A verification system — domain-expert review of submitted discoveries (the trajectory + best solution), so leaderboard scores are checked, not self-reported.
- A live playground — run component compositions against bundled toy tasks on a small budget, right from the browser (or run any scaffold via the CLI), to compare methods. It calls a real model, so it needs an OpenRouter key.
Cards are the unit of exchange
Everything on the Hub is a card:
| Card | What it registers |
|---|---|
| ScaffoldCard | a discovery method (which class + which six components fill the slots) |
| TaskCard | an evaluation task (seed + evaluator + metric + verification semantics) |
| ModelCard | a model (path + host) |
| VerificationCard | a submitted discovery (trajectory + best solution) for expert review |
Cards are permissive (extra="allow"), so a contributor can attach method-specific fields without a
core change. Validate any card before submitting with galapagos submit --card ....
The verification system
A discovery is submitted as a VerificationCard carrying the discovery trajectory and the
best solution, with status unverified. A domain-expert reviewer reproduces the trajectory,
re-scores the solution with the task's own (independent, anti-reward-hacking) evaluator, and, if the
result holds, promotes it to verified.
Only verified discoveries reach the leaderboard. Because every submission is re-scored by the same
task evaluator, leaderboard entries are directly comparable across scaffolds and agents. See
Submit to the Hub for the submission flow.
Verifying a registered scaffold or task
The verification above checks a discovery. A second, distinct question is whether a registered method or task is a faithful port of its original. When the original author registers their own scaffold, that is self-evident. When a third party ports someone else's published method, the Hub must confirm the port reproduces the original before its scores are trusted. Two procedures gate a community port:
- Original-author code review. The porter records the upstream author and repo on the card; the Hub routes the port to that author (or a domain expert) to confirm the implementation matches.
- A parity experiment. The port is run head-to-head with the original on a fixed task, model, and config, and the two must reach equivalent performance.
Two kinds of parity
Galapagos separates two parity questions, because they have different reference points and admit different levels of statistical rigour:
| Parity | Reference | Variance | Statistical leverage |
|---|---|---|---|
| Scaffold parity | the original method implementation | high — stochastic search + LLM sampling, cross-repo | low → leans on code review |
| Task parity | the original benchmark | low — the evaluator is usually deterministic | high → full equivalence testing is cheap |
A scaffold is a stochastic search procedure run with a real LLM, so a single run is one draw from a high-variance distribution and each run costs money. The original lives in a different repository with different RNG semantics, so the two sides cannot be paired by seed. Scaffold parity is therefore a coarse check, and the weight of trust sits on the code review. Task parity, by contrast, compares two evaluators on the same solution — cheap, repeatable, and pairable — so it carries the rigorous statistics.
How parity is reported
Every run-to-run number is reported as mean ± sample SEM — the sample standard error of the mean,
s / √n, where s is the sample standard deviation over n runs per side (n ≥ 2, three or more
preferred). This follows Harbor's adapter-parity convention: SEM measures how precisely each side
estimates the true score and tightens as runs are added, whereas the sample standard deviation does
not and can hide a genuinely diverging port behind wide error bars.
The raw per-run scores — original_runs and galapagos_runs — are the source of truth; the
mean ± SEM strings are derived for display, and a reviewer recomputes them from the raw arrays. Both
sides run in lockstep: a sanity check on a few tasks, then one full run each, then scale to three.
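Recomputing the display string from the raw arrays needs only the standard library. The sketch below is a minimal illustration, not the Hub's own code: the helper names mean_sem and format_mean_sem and the sample scores are hypothetical, while original_runs and galapagos_runs are the field names used above.

```python
import math
import statistics


def mean_sem(runs):
    """Return (mean, sample SEM) for a list of per-run scores.

    SEM = s / sqrt(n), where s is the sample standard deviation
    (n - 1 in the denominator), so at least two runs are required.
    """
    n = len(runs)
    if n < 2:
        raise ValueError("need at least two runs to estimate the SEM")
    return statistics.mean(runs), statistics.stdev(runs) / math.sqrt(n)


def format_mean_sem(runs, digits=3):
    """Render the 'mean ± SEM' display string from raw per-run scores."""
    mean, sem = mean_sem(runs)
    return f"{mean:.{digits}f} ± {sem:.{digits}f}"


# Hypothetical per-run scores, three runs per side.
original_runs = [0.80, 0.82, 0.84]
galapagos_runs = [0.79, 0.83, 0.81]

print("original :", format_mean_sem(original_runs))
print("galapagos:", format_mean_sem(galapagos_runs))
```

Because the strings are derived, a reviewer who distrusts a reported value only needs the two arrays to reproduce it.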
Overlapping error bars do not prove equivalence
Two means whose mean ± SEM bars overlap are not thereby equivalent — that is "no evidence of a
difference," which an underpowered experiment produces for free. Galapagos uses raw-range overlap
only as a fast smoke test to reject gross divergence. A genuine equivalence claim requires the
90% confidence interval of the difference to fall inside a pre-registered margin Δ (a TOST /
equivalence test). That test is only affordable and pairable for task parity, so that is where
it is applied; scaffold parity at three runs is reported honestly as
no_divergence | inconclusive | divergent, and an inconclusive result is never auto-promoted.
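The difference between the smoke test and a real equivalence claim can be sketched in a few lines. This is an illustration under stated assumptions, not the Hub's implementation: the function names are hypothetical, the differences are assumed to be paired (the same solutions scored by both evaluators), and the caller supplies the one-sided 0.95 Student-t quantile for n − 1 degrees of freedom, which yields a 90% two-sided interval.

```python
import math
import statistics


def ranges_overlap(runs_a, runs_b):
    """Smoke test only: do the raw [min, max] ranges of the two sides overlap?"""
    return max(runs_a) >= min(runs_b) and max(runs_b) >= min(runs_a)


def difference_ci(diffs, t_crit):
    """Confidence interval for the mean paired difference.

    t_crit is the one-sided 0.95 t quantile with len(diffs) - 1 degrees
    of freedom (passed in by the caller; an assumption of this sketch).
    """
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    half_width = t_crit * statistics.stdev(diffs) / math.sqrt(n)
    return mean_d - half_width, mean_d + half_width


def tost_equivalent(diffs, margin, t_crit):
    """Equivalence holds only if the whole interval sits inside (-margin, +margin)."""
    low, high = difference_ci(diffs, t_crit)
    return -margin < low and high < margin


# Hypothetical paired differences (port score minus original score), n = 5.
diffs = [0.01, -0.02, 0.00, 0.02, -0.01]
T_CRIT_DF4 = 2.132  # one-sided 0.95 quantile of Student's t, 4 degrees of freedom

print(tost_equivalent(diffs, margin=0.05, t_crit=T_CRIT_DF4))
print(tost_equivalent(diffs, margin=0.01, t_crit=T_CRIT_DF4))
```

The same data can pass with a wide margin and fail with a tight one, which is why the margin has to be fixed before the runs are collected.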
The two-tier check
| Tier | When | Cost | Catches |
|---|---|---|---|
| A: deterministic re-score | every submission | zero (no LLM) | task / evaluator drift |
| B: statistical smoke | on review | a few LLM runs (n = 3) | gross search divergence |
Tier A takes the original's best solution from a frozen reference fixture and re-scores it with the
registered task's own evaluator. If the recomputed score matches the original's claim, the task
faithfully reproduces the benchmark's scoring. (This is the evaluator re-run these docs promise,
made concrete.) Tier B runs the port with a real LLM under the original's model and config and
compares galapagos_runs against the fixture's original_runs in mean ± SEM.
The reference fixture is contributed once by the original author — per-seed scores plus the best
solution, under a pinned (task, model, config, seeds). It doubles as the author's sign-off and the
parity baseline, so the two verification procedures fold into a single artifact.
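A Tier A check reduces to one evaluator call and one comparison. The sketch below assumes a hypothetical fixture layout (best_solution, claimed_score) and uses a toy sphere evaluator as a stand-in for a registered task's evaluator; the real fixture schema and tolerance are not specified here.

```python
import math


def sphere_evaluator(solution):
    """Toy stand-in for a task evaluator: negative sum of squares."""
    return -sum(x * x for x in solution)


def tier_a_rescore(fixture, evaluator, rel_tol=1e-9, abs_tol=1e-12):
    """Re-score the fixture's best solution and compare with the claimed score.

    No LLM is involved: the check is deterministic and costs one evaluator call.
    """
    recomputed = evaluator(fixture["best_solution"])
    passed = math.isclose(
        recomputed, fixture["claimed_score"], rel_tol=rel_tol, abs_tol=abs_tol
    )
    return {"recomputed": recomputed, "rescore_passed": passed}


# Hypothetical frozen reference fixture contributed by the original author.
fixture = {
    "best_solution": [0.5, -0.25],
    "claimed_score": -0.3125,
}

# A faithful evaluator reproduces the claim.
print(tier_a_rescore(fixture, sphere_evaluator))


# A drifted evaluator (here, L1 instead of squared L2) does not.
def drifted_evaluator(solution):
    return -sum(abs(x) for x in solution)


print(tier_a_rescore(fixture, drifted_evaluator))
```

The rescore_passed key mirrors the tier_a field recorded on the card, so the result can be written back without renaming.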
What the card records
A scaffold card carries explicit provenance and a verification block (cards are permissive, so these attach without a schema change):

    original_author: "..."      # who published the method
    upstream_repo: "..."        # and where
    upstream_commit: "..."      # pinned to a commit
    registered_by: "..."        # who ported it (distinct from the author)
    verification:
      status: unverified        # unverified → under_review → verified | rejected
      code_review: { reviewer: "...", status: approved, ref: "..." }
      parity:
        fixture: "..."          # the original author's reference fixture
        tier_a: { rescore_passed: true }
        tier_b: { decision: no_divergence, ref: "parity_experiment.json" }

A port is listed as stable and reaches the leaderboard only once verification.status is
verified — wiring a runnable class is not enough on its own. Until then a community port is
catalogued but flagged unverified.
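The status gate is simple enough to express directly. The following sketch works on a card already parsed into a plain dict; the function names are hypothetical, and treating verified and rejected as terminal states is an assumption read off the unverified → under_review → verified | rejected progression rather than a documented rule.

```python
# Allowed moves between verification states (assumed from the card comment).
ALLOWED_TRANSITIONS = {
    "unverified": {"under_review"},
    "under_review": {"verified", "rejected"},
    "verified": set(),   # assumed terminal
    "rejected": set(),   # assumed terminal
}


def can_transition(current, new):
    """Return True if a reviewer may move a card from `current` to `new`."""
    return new in ALLOWED_TRANSITIONS.get(current, set())


def is_leaderboard_eligible(card):
    """A port reaches the leaderboard only once verification.status is verified."""
    return card.get("verification", {}).get("status") == "verified"


# Hypothetical community port that is runnable but not yet reviewed.
card = {
    "registered_by": "...",
    "verification": {"status": "unverified"},
}

print(is_leaderboard_eligible(card))
print(can_transition("unverified", "verified"))
```

A card with no verification block at all is treated as ineligible, which matches the rule that a runnable class is not enough on its own.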
Endpoints
The Hub is a small FastAPI service; every route is under /api. Reads are open; writes require an
Authorization: Bearer <token> header (mint one at POST /api/auth/token).
| Method & path | Auth | Purpose |
|---|---|---|
| GET /api/stats | none | catalog counts + breakdowns by tier / status / type |
| GET /api/scaffolds | none | browse the scaffold catalog (filters: q, tier, status, type) |
| GET /api/scaffolds/{name} | none | fetch one scaffold card (YAML + JSON) |
| POST /api/scaffolds | bearer | register a scaffold card; body {"card_yaml": "..."} |
| GET /api/tasks | none | browse the task catalog (filters: q, domain, status) |
| GET /api/tasks/{name} | none | fetch one task card |
| POST /api/tasks | bearer | register a task card; body {"card_yaml": "..."} |
| GET /api/{scaffolds\|tasks}/{name}/tree | none | the card bundle's full file tree (recursive, like a HF repo's Files tab) |
| GET /api/{scaffolds\|tasks}/{name}/file?path= | none | one file's content (text inlined up to 200 KB) |
| GET /api/{scaffolds\|tasks}/{name}/raw?path= | none | raw file download (streams; supports Range) |
| GET /api/leaderboard | none | the leaderboard (filters: task, scaffold; sorted by score desc) |
| POST /api/leaderboard | bearer | submit a run result (lands pending until reviewed) |
| POST /api/leaderboard/{id}/verify | bearer | admin review action: flip an entry pending → accepted \| rejected |
| GET /api/verifications | none | list submitted discoveries (filters: task, status) |
| POST /api/verifications | bearer | submit a discovery (a VerificationCard) for review |
| POST /api/auth/token | gated | issue a bearer token (open in dev; gated by OG_HUB_ADMIN_TOKEN in prod) |
The full submission flow — validate a card, mint a token, POST it — is in Submit to the Hub.
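As a rough illustration of the read/write split, the sketch below builds one open read and one authenticated write with the standard library. It only constructs the requests; nothing is sent. The base URL is the local development address printed by hub/run_hub.sh, the helper names are hypothetical, and the JSON Content-Type header is an assumption about what the service expects. The shape of the token-issuance response is not documented here, so the token is passed in as an opaque string.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://127.0.0.1:8000/api"  # local dev instance


def leaderboard_request(task=None, scaffold=None):
    """Build an unauthenticated GET /api/leaderboard with optional filters."""
    filters = {"task": task, "scaffold": scaffold}
    params = {key: value for key, value in filters.items() if value is not None}
    query = urllib.parse.urlencode(params)
    url = f"{BASE_URL}/leaderboard" + (f"?{query}" if query else "")
    return urllib.request.Request(url, method="GET")


def register_scaffold_request(card_yaml, token):
    """Build an authenticated POST /api/scaffolds carrying the card as YAML text."""
    body = json.dumps({"card_yaml": card_yaml}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/scaffolds",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",  # assumed by this sketch
        },
        method="POST",
    )


read_req = leaderboard_request(task="circle_packing", scaffold="openevolve")
write_req = register_scaffold_request('name: "..."\n', token="<token>")

print(read_req.get_method(), read_req.full_url)
print(write_req.get_method(), write_req.full_url)
```

Passing either request to urllib.request.urlopen would send it to a running instance; the write is expected to fail without a valid bearer token.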
Run it locally
The Hub mirrors the library's bundled cards, so a local instance serves the 8 scaffold cards and 64 task cards out of the box. The catalog is discoverable straight from the package without a server:

    from galapagos.cards.registry import (
        available_scaffolds, available_tasks, load_scaffold_card, load_task_card,
    )

    available_scaffolds()             # the scaffold catalog the Hub mirrors
    load_task_card("circle_packing")  # one task card

The playground is just a budget-capped run of a runnable scaffold (it calls a live model, so set
your OpenRouter key in OPENAI_API_KEY):

    galapagos run --scaffold openevolve --task playground_sphere \
        --model openai/gpt-4o-mini --host openrouter --iters 20

playground_sphere is the dedicated Playground task — instant and pure-Python — designed for the
Compare demo and the fastest first runs.
Stand up the web service
The repository ships the Hub backend under hub/. A local instance creates a SQLite database and
syncs every bundled card on startup (library ⊆ hub):

    bash hub/run_hub.sh
    # → http://127.0.0.1:8000/api      the API
    #   http://127.0.0.1:8000/docs     interactive OpenAPI docs
    #   http://127.0.0.1:8000/healthz  liveness

For a shared deployment, hub/docker-compose.yml runs the backend against Postgres. It fails
closed in prod unless you set an explicit OG_HUB_CORS origin list and an OG_HUB_ADMIN_TOKEN
(which gates token issuance). See hub/README.md for the full environment-variable reference.