CLI
Workflow and conventions for the rad experiment command-line tool.
Overview
rad experiment is the terminal interface to Community Computer.
Use it to publish benchmark results, browse what other peers have tried,
and reproduce any experiment on your own hardware — all without a
web UI, an account, or a central server.
An experiment is one optimization attempt: a code change plus the benchmark numbers it produced. Everything this CLI does revolves around creating, inspecting, or reproducing them.
Every command runs inside a Radicle-tracked repo (one you
initialized with rad init or cloned with rad clone).
Experiments are stored as signed records attached to that repo — no
central server, no accounts. They replicate peer-to-peer alongside the Git
history, just like Radicle's issues and patches. See
How it Works for the full picture.
Most users never run this directly. The Claude Code skill drives it for autonomous sessions; the pi-cc extension auto-publishes from pi-autoresearch sessions. Reach for the raw CLI when you want to script your own workflow, integrate a different agent, or understand what's happening underneath.
Command reference
Ten subcommands across four roles. Each links to its full man page with
flags, examples, and edge cases. Also available locally via
rad experiment <subcommand> --help or
man rad-experiment-<subcommand>.
Produce findings
- publish: record one experiment on the network (from flags or JSON), or bulk-import a pi-autoresearch session tape.
  rad experiment publish --base $BASE --head $HEAD --metric duration_ms --baseline-median 1500 --baseline-n 5 --candidate-median 1425 --candidate-n 5
- reproduce: re-run someone else's experiment on your hardware and publish your measurements.
  rad experiment reproduce 5574144 --runs 10 --notes "M2 Pro, perf governor"
Browse
- list: list experiments in the repo, grouped by experiment branch. Add filters to narrow down.
  rad experiment list --landable
- show: print one experiment's full details (metrics, env, author, reproductions).
  rad experiment show 5574144
- labels: list every label currently in use across experiments in this repo.
  rad experiment labels
Curate
- label (delegates only): tag experiments with short identifiers (shipped, flaky, nominated).
  rad experiment label 5574144 shipped reviewed
- redact (author or delegate): mark an experiment as no longer trustworthy (buggy harness, rebased commit). The redaction is a signed record, not a delete.
  rad experiment redact 5574144 --reason "benchmark harness had a timing bug"
Stateless helpers
These commands neither touch the COB store nor need a Radicle profile; they just produce JSON that the other commands consume. Useful when wiring your own harness.
- benchmark: run the bench command on a worktree and emit per-metric JSON.
  rad experiment benchmark --worktree /tmp/head --bench-cmd 'bash ./bench/benchmark.sh' --metric 'duration_ms=ms:lower_is_better:duration\s*:\s*([0-9.]+)\s*ms' --runs 5 --label candidate
- compute-delta: compare two benchmark outputs and compute the direction-aware delta.
  rad experiment compute-delta --baseline baseline.json --candidate candidate.json --primary-metric duration_ms --criteria lower_is_better --bench-cmd 'bash ./bench/benchmark.sh' --base-commit HEAD^ --head-commit HEAD --description "Hoist allocation"
- schema: emit the full CLI tree as JSON (every subcommand, flag, default). Built for agent self-discovery; preferable to scraping --help text.
  rad experiment schema --pretty | jq '.subcommands | keys'
rad-experiment(1) covers the top-level conventions shared across all commands.
Global flags
These three flags work on every subcommand. They go between
rad experiment and the subcommand name, not after the subcommand:
# correct
rad experiment -r ~/code/my-repo list
# won't work — -r is parsed by rad-experiment, not by `list`
rad experiment list -r ~/code/my-repo
| Flag | Description |
|---|---|
| -r, --repo <PATH-OR-RID> | Target repository. Accepts a filesystem path (~/code/my-repo) or a Radicle ID (rad:z3gqcJUoA1n9...). Defaults to the current working directory, so if you're already cd'd into the repo, you don't need this. |
| -q, --quiet | Suppress non-error progress output. The command's actual result still goes to stdout; only the human-readable status lines on stderr are silenced. Useful for scripts and CI. |
| --pretty | Pretty-print JSON output. Only affects commands invoked with --json; intended for human inspection. Agents and scripts should omit it: the compact form is faster to parse and cheaper to pipe. |
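Because the global flags must come before the subcommand, a small shell wrapper can keep scripts from misplacing them. The sketch below is an illustration only: the function name `rexp` and the `REPO` variable are hypothetical and not part of the CLI.

```shell
# Hypothetical wrapper: always puts the global flags before the subcommand.
# REPO is an assumed variable name; it defaults to the current directory.
REPO=${REPO:-.}

rexp() {
  # -r and -q are parsed by `rad experiment` itself, so they go first;
  # whatever the caller passes ("$@") follows as the subcommand and its flags.
  rad experiment -r "$REPO" -q "$@"
}
```

With this in place, `rexp list --json` runs `rad experiment -r "$REPO" -q list --json`.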
Conventions
A handful of rules apply to every command. Internalize these once and the rest of the CLI reads naturally.
Values are scaled by 1000
Benchmark numbers are stored as integers scaled by 1000
(the _x1000 suffix you'll see on fields and flags). The reason:
experiments are signed records that replicate across different machines, and
floating-point math isn't bit-identical across platforms — two nodes
computing 0.1 + 0.2 could sign slightly different bytes and
disagree on the record's hash. Scaled integers sidestep that entirely.
Convert your raw measurements like this when publishing:
1.500 seconds → 1500
14.327 ms → 14327
0.042 GiB/s → 42
You only deal with the scaled form when publishing. The
show and list commands print human-readable values.
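The conversion can be scripted. A minimal sketch, assuming round-to-nearest is acceptable (this page does not say whether the CLI rounds or truncates) and using a hypothetical helper name, `to_x1000`:

```shell
# Hypothetical helper: scale a non-negative decimal measurement by 1000
# and print it as an integer. Rounding to nearest is an assumption.
to_x1000() {
  # LC_ALL=C keeps "." as the decimal separator regardless of locale.
  LC_ALL=C awk -v v="$1" 'BEGIN { printf "%.0f\n", v * 1000 }'
}

to_x1000 1.500    # 1500   (1.500 seconds)
to_x1000 14.327   # 14327  (14.327 ms)
to_x1000 0.042    # 42     (0.042 GiB/s)
```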
Delta, criteria, verdict — pick the one that answers your question
Some metrics get better as they go up (throughput, accuracy); others get
better as they go down (latency, memory). Each metric on a published
experiment carries a criteria field:
lower_is_better or higher_is_better.
Every structured output (JSON from the CLI, JSON from the HTTP API) carries the same three fields on each metric:
- deltaPctX100: the raw signed percentage change from baseline to candidate, scaled ×100 (-500 means -5.00%). Always the raw direction; no flipping applied. (compute-delta output uses the snake_case spelling delta_pct_x100; see the naming note in Scripting.)
- criteria: always present, resolved to higher_is_better or lower_is_better.
- verdict: the direct answer, one of improved, regressed, or neutral.
For humans, the UI shows the raw delta and colors it according to
criteria (green for improvement, red for regression), so
latency dropping from 1500 ms to 1425 ms still reads as
-5.00%, the way a measurement should:
- Latency drops 1500 ms → 1425 ms (lower_is_better) → -5.00%, green / verdict: improved.
- Throughput climbs 1.0 → 1.1 (higher_is_better) → +10.00%, green / verdict: improved.
- Latency climbs 1500 ms → 1575 ms (lower_is_better) → +5.00%, red / verdict: regressed.
Agents and scripts: use verdict. It's the
unambiguous answer and removes the need to combine sign and criteria
yourself. Fall back to delta_pct_x100 + criteria
only when you need the numeric magnitude (e.g. for ranking).
The compute-delta subcommand additionally exposes
improvement_delta_pct_x100 — the direction-normalized
delta (positive = improvement regardless of criteria) — for ranking
candidates by magnitude of improvement.
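To make the relationship between these fields concrete, here is a minimal sketch that derives them from a baseline and a candidate value. It is not the CLI's implementation: the helper name `delta_verdict` is hypothetical, and treating only an exactly-zero delta as neutral is an assumption (the real tool may apply a noise threshold).

```shell
# Hypothetical helper, not the CLI's code.
# Usage: delta_verdict BASELINE CANDIDATE CRITERIA
# Prints: <raw delta %, x100> <direction-normalized delta %, x100> <verdict>
delta_verdict() {
  LC_ALL=C awk -v b="$1" -v c="$2" -v crit="$3" 'BEGIN {
    # Raw signed percentage change, scaled x100 and rounded to an integer.
    raw = sprintf("%.0f", (c - b) / b * 10000) + 0
    # Flip the sign for lower_is_better so that positive always means better.
    norm = (crit == "lower_is_better") ? 0 - raw : raw
    # Assumption: neutral only when the delta is exactly zero.
    if (norm > 0) verdict = "improved"
    else if (norm < 0) verdict = "regressed"
    else verdict = "neutral"
    printf "%d %d %s\n", raw, norm, verdict
  }'
}

delta_verdict 1500 1425 lower_is_better    # -500 500 improved
delta_verdict 1.0 1.1 higher_is_better     # 1000 1000 improved
delta_verdict 1500 1575 lower_is_better    # 500 -500 regressed
```

The second column plays the role of improvement_delta_pct_x100: sorting on it ranks candidates regardless of each metric's criteria. The sketch does not guard against a baseline of zero.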
Experiment IDs & peer DIDs
- Experiment IDs are 40-char hex strings (they're Git object IDs under the hood). Every command accepts a 7-char prefix, so rad experiment show 5574144 works; no need to paste the full hash. Any unambiguous git rev-parse expression is accepted too.
- Peer DIDs identify who published an experiment. They look like did:key:z6MkfEaY... (a W3C Decentralized Identifier wrapping the Ed25519 public key from someone's Radicle identity). The --author filter accepts a prefix of either the full DID or the bare key (z6MkfEaY is enough). The --branch filter takes <peer-prefix>/<branch-name>, where the peer part is matched the same way.
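When a script takes an experiment ID from user input, a quick shape check catches typos before any command runs. A minimal sketch, assuming prefixes of 7 to 40 lowercase hex characters are acceptable (the lower bound of 7 comes from the text above; treating longer prefixes as valid is an assumption). The helper name is hypothetical, and the check says nothing about whether a prefix is unambiguous in a given repo.

```shell
# Hypothetical helper: succeed if $1 looks like an experiment ID or ID prefix.
is_id_prefix() {
  case $1 in
    # Reject the empty string and anything containing a non-hex character.
    ""|*[!0-9a-f]*) return 1 ;;
  esac
  # Assumed bounds: at least 7 characters, at most the full 40.
  [ "${#1}" -ge 7 ] && [ "${#1}" -le 40 ]
}
```

For example, `is_id_prefix 5574144 && rad experiment show 5574144` only calls the CLI when the argument has a plausible shape.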
Bring your own harness
If you're not using the Claude Code skill or pi-autoresearch, you can drive the whole loop from the shell. Each step below chains into the next.
0. Start inside a repo
Every command needs a Radicle-tracked repo. Decide on your benchmark command and one or more metrics (with a regex each) up front — you'll pass them as flags.
cd ~/code/my-repo
1. Pick a base and a candidate commit
An experiment is always a comparison between two commits: the base (unmodified code) and the candidate (your optimization). Stash both SHAs in variables so the later commands can reference them.
BASE=$(git rev-parse main)
git checkout -b experiments/inner-loop-hoist
$EDITOR src/inner_loop.rs
git commit -am "Hoist allocation out of the inner loop"
HEAD=$(git rev-parse HEAD)
2. Benchmark each side in isolation
Build and measure base and candidate in separate Git worktrees so they can't
contaminate each other (stale build artifacts, lingering environment state,
etc.). benchmark writes per-run JSON to stdout — capture
it to a file you'll feed into the next step.
git worktree add /tmp/base $BASE
git worktree add /tmp/head $HEAD
BENCH='bash ./bench/benchmark.sh'
METRIC='duration_ms=ms:lower_is_better:duration\s*:\s*([0-9.]+)\s*ms'
rad experiment benchmark --worktree /tmp/base \
--bench-cmd "$BENCH" --metric "$METRIC" --runs 5 --label baseline > /tmp/baseline.json
rad experiment benchmark --worktree /tmp/head \
--bench-cmd "$BENCH" --metric "$METRIC" --runs 5 --label candidate > /tmp/candidate.json
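Before spending minutes on benchmark runs, it can help to confirm that the metric regex actually matches your harness output. The sketch below does that outside the CLI with sed. Assumptions: the sample line `duration: 14.327 ms` is invented for illustration, and the pattern is rewritten with POSIX character classes (`[[:space:]]` instead of `\s`) because `sed -E` does not portably support `\s`. The regex dialect that rad experiment benchmark itself uses is not stated on this page.

```shell
# Hypothetical pre-flight check for the regex carried in $METRIC.
extract_duration() {
  # Prints capture group 1 for every input line that matches.
  sed -nE 's/.*duration[[:space:]]*:[[:space:]]*([0-9.]+)[[:space:]]*ms.*/\1/p'
}

# Invented sample output standing in for ./bench/benchmark.sh
printf 'warming up\nduration: 14.327 ms\n' | extract_duration   # 14.327
```

If this prints nothing for your real benchmark output, the pattern likely needs adjusting before you run the benchmark subcommand.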
3. Compute the delta
compute-delta reads both benchmark files, applies the
direction-aware rule, and prints a JSON summary. Optional — skip
straight to publish with raw medians if you prefer — but
a useful sanity check before committing anything to the network. The
--bench-cmd you pass here also flows into the output JSON so
publish --from-json can lift it onto the COB.
rad experiment compute-delta \
--baseline /tmp/baseline.json --candidate /tmp/candidate.json \
--primary-metric duration_ms --criteria lower_is_better \
--bench-cmd "$BENCH" \
--base-commit $BASE --head-commit $HEAD \
--description "Hoist allocation out of the inner loop"
4. Publish
Pass the medians and sample counts directly. Values are scaled ×1000 (see Conventions). This writes a signed record and broadcasts it to peers.
rad experiment publish \
--description "Hoist allocation out of the inner loop" \
--base $BASE --head $HEAD \
--metric duration_ms \
--baseline-median 1500 --baseline-n 5 \
--candidate-median 1425 --candidate-n 5 \
--bench-cmd "$BENCH" \
--metric-regex "duration_ms=duration\\s*:\\s*([0-9.]+)\\s*ms"
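If your harness collects raw per-run values itself instead of going through benchmark, you need a median and a sample count for the flags above. A minimal sketch with a hypothetical helper name; averaging the two middle values for an even number of runs is a common convention assumed here, not something taken from the CLI.

```shell
# Hypothetical helper: read one numeric sample per line on stdin, print the median.
median() {
  LC_ALL=C sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1                           # no samples: fail
      if (NR % 2) print v[(NR + 1) / 2]             # odd count: middle value
      else print (v[NR / 2] + v[NR / 2 + 1]) / 2    # even count: mean of the two middle values
    }'
}

# Invented sample values for five runs.
printf '%s\n' 1500 1490 1510 1505 1495 | median   # 1500
```

The sample count for --baseline-n or --candidate-n is simply the number of lines fed in.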
5. Browse
Confirm it landed. --landable filters to branches whose tip
3-way merges cleanly onto canonical main — i.e. ones you could
reuse as a base without conflicts. Already-merged branches are excluded.
rad experiment list
rad experiment list --landable
6. Have someone reproduce it
Share the experiment ID (or its 7-char prefix). Anyone tracking the repo can re-run the same benchmark on their own hardware and publish a signed reproduction that attaches to the original.
rad experiment reproduce 5574144 --runs 10 --notes "M2 Pro, perf governor"
Scripting
Every command speaks JSON with --json, so you can drive the
CLI from shell scripts, CI jobs, or your own tools without parsing
human-readable output.
Read commands that return collections (like list) emit
JSONL — one JSON object per line — so you can
stream-process results without buffering the whole list. Write commands
(like publish) return a single JSON object for the thing you
just created.
The examples below use jq,
the standard command-line JSON processor
(brew install jq or apt install jq). Anything
that can read a stream of JSON works — Python, fx, etc.
Naming gotcha. list and show
emit camelCase keys (deltaPctX100,
medianX1000, memoryBytes, agentSystem).
benchmark and compute-delta emit
snake_case (delta_pct_x100,
median_x1000). Pick the spelling based on which command
produced the JSON, not your preference.
# List every unique author that has published to this repo.
rad experiment list --json | jq -r '.author.id' | sort -u
# Print the IDs of every experiment that improved duration_ms by ≥5%.
# (duration_ms is lower_is_better; improvement shows as a negative delta.
# deltaPctX100 is the raw percentage scaled ×100: -500 = -5.00%.)
rad experiment list --json \
| jq -r 'select(.metrics[0].name == "duration_ms" and .metrics[0].deltaPctX100 <= -500) | .id'
# Sum raw deltas across a whole branch (negative means net improvement here).
# -s slurps the stream into one array so we can aggregate.
rad experiment list --branch z6MkfEaY/experiments/foo --json \
| jq -s 'map(.metrics[0].deltaPctX100) | add / 100'
# Chain commands: publish an experiment, capture its ID, then label it.
ID=$(rad experiment publish ... --json | jq -r .id)
rad experiment label "$ID" reviewed
Pair with --quiet to also suppress the human-readable progress
lines that normally go to stderr — keeps CI logs clean.
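The advice to prefer verdict over sign-plus-criteria translates directly into a jq filter. A sketch, assuming each JSONL object from list --json has an `id` and a `metrics` array whose entries carry `name` and `verdict` (consistent with the examples above, but not verified against real output). The function name is hypothetical.

```shell
# Hypothetical filter: read experiment JSONL on stdin and print the IDs of
# experiments where the named metric has verdict "improved".
improved_ids() {
  jq -r --arg metric "$1" '
    select(any(.metrics[]?; .name == $metric and .verdict == "improved"))
    | .id'
}
```

Usage: `rad experiment list --json | improved_ids duration_ms`. Unlike the deltaPctX100 examples, this works unchanged for higher_is_better metrics.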
Exit codes & side effects
Exit codes are simple: 0 on success, 1 on error.
Errors print a message to stderr prefixed with error:. Scripts
can branch on the exit code without parsing output.
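Branching on the exit code can be as simple as the following sketch. The helper name `run_step` is hypothetical; it works with any command, not just rad experiment.

```shell
# Hypothetical helper: run a command, report on stderr if it fails,
# and pass the original exit code through to the caller.
# Usage: run_step NAME COMMAND [ARGS...]
run_step() {
  step_name=$1
  shift
  if "$@"; then
    return 0
  else
    status=$?
    echo "step '$step_name' failed with exit code $status" >&2
    return "$status"
  fi
}
```

For example, `run_step list rad experiment list --json > experiments.jsonl || exit 1` stops a script at the first failing step while leaving the CLI's own error: message on stderr.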
Side effects — what actually changes outside the process — vary by command. For any mutation, they happen in this order:
- Write to local Radicle storage (publish, reproduce, label, redact). The record is signed by the Radicle identity you set up on first run and appended to your local experiment store under ~/.radicle/storage/. Nothing has left your machine yet; the record exists locally and is tamper-evident.
- Announce to peers (all mutations). Your Radicle node broadcasts the new refs so other peers tracking the repo can replicate them. This is best-effort: if the node isn't running, you'll see a hint on stderr but the command still exits 0. The record is committed locally and will sync automatically the next time your node starts or you run rad sync.
- Pending file (compute-delta, on by default, disable with --pending=false). Writes the delta JSON to /tmp/cc-experiment-pending/{head_sha}.json so an auto-publish hook can pick it up and call publish without re-deriving the numbers. This is how the Claude Code plugin closes the loop; ignore it if you're running the commands yourself.
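A custom auto-publish hook only needs to know where to look for the pending file. A sketch of the path lookup, assuming `{head_sha}` is the full commit SHA as printed by git rev-parse (this page does not say whether it is abbreviated). The helper name and the `PENDING_DIR` override are hypothetical.

```shell
# Assumed location, taken from the path documented above.
PENDING_DIR=${PENDING_DIR:-/tmp/cc-experiment-pending}

# Hypothetical helper: print the pending-file path for a given head commit SHA.
pending_path() {
  printf '%s/%s.json\n' "$PENDING_DIR" "$1"
}
```

A hook could then test for the file with `[ -f "$(pending_path "$(git rev-parse HEAD)")" ]` before deciding whether there is anything to publish.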
Commands that produce no side effects at all:
list, show, and labels make no
writes and no network calls — safe to run in tight loops or against
remotes you don't trust. benchmark and
compute-delta only touch the filesystem (the worktree and
/tmp/).