CLI

Workflow and conventions for the rad experiment command-line tool.

Contents

Overview
Command reference
Global flags
Conventions
Bring your own harness
Scripting
Exit codes & side effects

Overview

rad experiment is the terminal interface to Community Computer. Use it to publish benchmark results, browse what other peers have tried, and reproduce any experiment on your own hardware — all without a web UI, an account, or a central server.

An experiment is one optimization attempt: a code change plus the benchmark numbers it produced. Everything this CLI does revolves around creating, inspecting, or reproducing them.

Every command runs inside a Radicle-tracked repo (one you initialized with rad init or cloned with rad clone). Experiments are stored as signed records attached to that repo — no central server, no accounts. They replicate peer-to-peer alongside the Git history, just like Radicle's issues and patches. See How it Works for the full picture.

Most users never run this directly. The Claude Code skill drives it for autonomous sessions; the pi-cc extension auto-publishes from pi-autoresearch sessions. Reach for the raw CLI when you want to script your own workflow, integrate a different agent, or understand what's happening underneath.

Command reference

Ten subcommands across four roles. Each links to its full man page with flags, examples, and edge cases. Also available locally via rad experiment <subcommand> --help or man rad-experiment-<subcommand>.

Produce findings

publish: record one experiment on the network (from flags or JSON), or bulk-import a pi-autoresearch session tape.
rad experiment publish --base $BASE --head $HEAD --metric duration_ms --baseline-median 1500 --baseline-n 5 --candidate-median 1425 --candidate-n 5

reproduce: re-run someone else's experiment on your hardware and publish your measurements.
rad experiment reproduce 5574144 --runs 10 --notes "M2 Pro, perf governor"

Browse

list: list experiments in the repo, grouped by experiment branch. Add filters to narrow down.
rad experiment list --landable

show: print one experiment's full details: metrics, env, author, reproductions.
rad experiment show 5574144

labels: list every label currently in use across experiments in this repo.
rad experiment labels

Curate

label: delegates only. Tag experiments with short identifiers (shipped, flaky, nominated).
rad experiment label 5574144 shipped reviewed

redact: author or delegate. Mark an experiment as no longer trustworthy (buggy harness, rebased commit). Signed, not a delete.
rad experiment redact 5574144 --reason "benchmark harness had a timing bug"

Stateless helpers

These commands neither touch the COB (collaborative object) store nor need a Radicle profile. They just produce JSON for the other commands, or your own tooling, to consume. Useful when wiring your own harness.

benchmark: run the bench command on a worktree and emit per-metric JSON.
rad experiment benchmark --worktree /tmp/head --bench-cmd 'bash ./bench/benchmark.sh' --metric 'duration_ms=ms:lower_is_better:duration\s*:\s*([0-9.]+)\s*ms' --runs 5 --label candidate

compute-delta: compare two benchmark outputs and compute the direction-aware delta.
rad experiment compute-delta --baseline baseline.json --candidate candidate.json --primary-metric duration_ms --criteria lower_is_better --bench-cmd 'bash ./bench/benchmark.sh' --base-commit HEAD^ --head-commit HEAD --description "Hoist allocation"

schema: emit the full CLI tree as JSON (every subcommand, flag, default). Built for agent self-discovery; preferable to scraping --help text.
rad experiment schema --pretty | jq '.subcommands | keys'

rad-experiment(1) covers the top-level conventions shared across all commands.

Global flags

These three flags work on every subcommand. They go between rad experiment and the subcommand name, not after the subcommand:

# correct
rad experiment -r ~/code/my-repo list

# won't work — -r is parsed by rad-experiment, not by `list`
rad experiment list -r ~/code/my-repo

Flag	Description
-r, --repo <PATH-OR-RID>	Target repository. Accepts a filesystem path (`~/code/my-repo`) or a Radicle ID (`rad:z3gqcJUoA1n9...`). Defaults to the current working directory — so if you're already `cd`'d into the repo, you don't need this.
-q, --quiet	Suppress non-error progress output. The command's actual result still goes to stdout — only the human-readable status lines on stderr are silenced. Useful for scripts and CI.
--pretty	Pretty-print JSON output. Only affects commands invoked with `--json`; intended for human inspection. Agents and scripts should omit it — the compact form is faster to parse and cheaper to pipe.
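
One way to avoid misplacing the global flags in scripts is a small wrapper that always emits them before the subcommand. This is a minimal sketch: rex is a hypothetical helper name, not part of the CLI, and it uses only the -r and -q flags from the table above.

```shell
# Hypothetical helper (not part of rad): pins the target repo and quiet mode,
# and guarantees the global flags come before the subcommand.
rex() {
  local repo=$1
  shift
  rad experiment -r "$repo" -q "$@"
}
```

With this in place, rex ~/code/my-repo list --json expands to rad experiment -r ~/code/my-repo -q list --json, so subcommand flags can never end up in front of the global ones.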

Conventions

A handful of rules apply to every command. Internalize these once and the rest of the CLI reads naturally.

Values are scaled by 1000

Benchmark numbers are stored as integers scaled by 1000 (the _x1000 suffix you'll see on fields and flags). The reason: experiments are signed records that replicate across different machines, and floating-point arithmetic and formatting are not guaranteed to be bit-identical across platforms and toolchains. Two nodes handling the same measurement could serialize slightly different bytes and disagree on the record's hash. Scaled integers sidestep that entirely.

Convert your raw measurements like this when publishing:

1.500 seconds  → 1500
14.327 ms      → 14327
0.042 GiB/s    → 42

You only deal with the scaled form when publishing. The show and list commands print human-readable values.
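
The conversions above can be sketched as a small shell helper. to_x1000 is a hypothetical name, and rounding to the nearest integer is an assumption; the CLI does not document how sub-unit remainders should be handled, so check borderline values yourself.

```shell
# Convert a decimal measurement to its x1000 integer form.
# Assumption: round to the nearest integer. The floating-point step happens
# once, locally, before anything is signed.
to_x1000() {
  awk -v v="$1" 'BEGIN { printf "%.0f\n", v * 1000 }'
}

to_x1000 1.500    # prints 1500
to_x1000 14.327   # prints 14327
to_x1000 0.042    # prints 42
```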

Delta, criteria, verdict — pick the one that answers your question

Some metrics get better as they go up (throughput, accuracy); others get better as they go down (latency, memory). Each metric on a published experiment carries a criteria field: lower_is_better or higher_is_better.

Every structured output (JSON from the CLI, JSON from the HTTP API) carries the same three fields on each metric:

deltaPctX100 — the raw signed delta: candidate minus baseline, scaled ×100. Always the raw direction; no flipping applied. (compute-delta output uses the snake_case spelling delta_pct_x100; see the naming note in Scripting.)
criteria — always present, resolved to higher_is_better or lower_is_better.
verdict — the direct answer: improved, regressed, or neutral.

For humans, the UI shows the raw delta with a color pairing it to criteria — green for improvement, red for regression — so latency dropping from 1500 ms to 1425 ms still reads as -5.00% the way a measurement should:

Latency drops 1500 ms → 1425 ms (lower_is_better) → -5.00%, green / verdict: improved.
Throughput climbs 1.0 → 1.1 (higher_is_better) → +10.00%, green / verdict: improved.
Latency climbs 1500 ms → 1575 ms (lower_is_better) → +5.00%, red / verdict: regressed.
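
The arithmetic behind those three examples can be sketched in plain shell. Both function names are hypothetical, and two details are assumptions rather than documented behavior: integer division truncates toward zero here, and "neutral" is taken to mean a delta of exactly zero, whereas the real tool may round differently or apply a noise threshold.

```shell
# Raw signed delta, scaled x100: (candidate - baseline) / baseline * 100 * 100.
# Inputs are x1000 integers, so the x1000 scale cancels out of the ratio.
delta_pct_x100() {
  local baseline=$1 candidate=$2
  echo $(( (candidate - baseline) * 10000 / baseline ))
}

# Verdict from the raw delta plus criteria.
# Assumption: only a delta of exactly 0 counts as neutral.
verdict() {
  local delta=$1 criteria=$2
  if [ "$delta" -eq 0 ]; then
    echo neutral
  elif { [ "$criteria" = lower_is_better ] && [ "$delta" -lt 0 ]; } ||
       { [ "$criteria" = higher_is_better ] && [ "$delta" -gt 0 ]; }; then
    echo improved
  else
    echo regressed
  fi
}

delta_pct_x100 1500 1425                                # prints -500
verdict "$(delta_pct_x100 1500 1425)" lower_is_better   # prints improved
verdict "$(delta_pct_x100 1000 1100)" higher_is_better  # prints improved
verdict "$(delta_pct_x100 1500 1575)" lower_is_better   # prints regressed
```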

Agents and scripts: use verdict. It's the unambiguous answer and removes the need to combine sign and criteria yourself. Fall back to deltaPctX100 (delta_pct_x100 in compute-delta output) plus criteria only when you need the numeric magnitude (e.g. for ranking).

The compute-delta subcommand additionally exposes improvement_delta_pct_x100 — the direction-normalized delta (positive = improvement regardless of criteria) — for ranking candidates by magnitude of improvement.
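
If you only have the raw delta and criteria (for example from list output), a direction-normalized value can be derived by flipping the sign for lower_is_better metrics. This is a sketch of that idea, not the CLI's own code; the function name mirrors the compute-delta field for readability, and the candidate names in the ranking are made up.

```shell
# Direction-normalized delta: flip the sign for lower_is_better so that a
# positive number always means improvement.
improvement_delta_pct_x100() {
  local delta=$1 criteria=$2
  if [ "$criteria" = lower_is_better ]; then
    echo $(( -delta ))
  else
    echo "$delta"
  fi
}

# Rank hypothetical candidates, best first: "<normalized delta> <name>".
{
  echo "$(improvement_delta_pct_x100 -500 lower_is_better) hoist-allocation"
  echo "$(improvement_delta_pct_x100 1000 higher_is_better) batch-writes"
  echo "$(improvement_delta_pct_x100 500 lower_is_better) extra-logging"
} | sort -rn
```

The ranking prints batch-writes (1000) first, then hoist-allocation (500), then extra-logging (-500), regardless of which direction each underlying metric prefers.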

Experiment IDs & peer DIDs

Experiment IDs are 40-char hex strings (they're Git object IDs under the hood). Every command accepts a 7-char prefix, so rad experiment show 5574144 works; there is no need to paste the full hash. Any unambiguous git rev-parse expression is accepted too.
Peer DIDs identify who published an experiment. They look like did:key:z6MkfEaY... — a W3C Decentralized Identifier wrapping the Ed25519 public key from someone's Radicle identity. The --author filter accepts a prefix of either the full DID or the bare key (z6MkfEaY is enough). The --branch filter takes <peer-prefix>/<branch-name> where the peer part is matched the same way.
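
The matching rules above can be mimicked in a script, for instance to pre-validate user input before calling the CLI. This is a sketch under assumptions: both function names are hypothetical, the DID in the usage note is a made-up placeholder, and splitting the --branch value at the first slash is inferred from the z6MkfEaY/experiments/foo example rather than taken from the CLI's source.

```shell
# Succeeds if PREFIX matches DID, with or without the "did:key:" scheme on
# either argument.
did_has_prefix() {
  local did="${1#did:key:}" prefix="${2#did:key:}"
  case $did in
    "$prefix"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Split "<peer-prefix>/<branch-name>" at the first slash only, because branch
# names may contain slashes themselves. Prints the peer part, then the branch.
split_branch_filter() {
  local spec=$1
  printf '%s\n%s\n' "${spec%%/*}" "${spec#*/}"
}
```

For example, did_has_prefix did:key:z6MkExampleKeyOnly z6MkExam succeeds, and split_branch_filter z6MkfEaY/experiments/foo prints z6MkfEaY followed by experiments/foo.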

Bring your own harness

If you're not using the Claude Code skill or pi-autoresearch, you can drive the whole loop from the shell. Each step below chains into the next.

0. Start inside a repo

Every command needs a Radicle-tracked repo. Decide on your benchmark command and one or more metrics (with a regex each) up front — you'll pass them as flags.

cd ~/code/my-repo

1. Pick a base and a candidate commit

An experiment is always a comparison between two commits: the base (unmodified code) and the candidate (your optimization). Stash both SHAs in variables so the later commands can reference them.

BASE=$(git rev-parse main)
git checkout -b experiments/inner-loop-hoist
$EDITOR src/inner_loop.rs
git commit -am "Hoist allocation out of the inner loop"
HEAD=$(git rev-parse HEAD)

2. Benchmark each side in isolation

Build and measure base and candidate in separate Git worktrees so they can't contaminate each other (stale build artifacts, lingering environment state, etc.). benchmark writes per-run JSON to stdout — capture it to a file you'll feed into the next step.

git worktree add /tmp/base $BASE
git worktree add /tmp/head $HEAD

BENCH='bash ./bench/benchmark.sh'
METRIC='duration_ms=ms:lower_is_better:duration\s*:\s*([0-9.]+)\s*ms'

rad experiment benchmark --worktree /tmp/base \
  --bench-cmd "$BENCH" --metric "$METRIC" --runs 5 --label baseline > /tmp/baseline.json

rad experiment benchmark --worktree /tmp/head \
  --bench-cmd "$BENCH" --metric "$METRIC" --runs 5 --label candidate > /tmp/candidate.json
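
Before spending five runs on each side, it can help to check that your pattern actually captures a number from your benchmark's output. The sketch below imitates the duration pattern with sed; extract_duration is a hypothetical helper, the sample output line is invented, and the CLI's own regex engine may differ from sed in edge cases (POSIX classes stand in for \s here).

```shell
# Pull the first "duration: <number> ms" value out of benchmark output.
# [[:space:]] replaces \s so this works with both GNU and BSD sed -E.
extract_duration() {
  sed -nE 's/.*duration[[:space:]]*:[[:space:]]*([0-9.]+)[[:space:]]*ms.*/\1/p' \
    | head -n 1
}

printf 'warming up\nduration: 14.327 ms\n' | extract_duration   # prints 14.327
```

In practice you would pipe the real benchmark into it, for example bash ./bench/benchmark.sh | extract_duration, and expect exactly one number back.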

3. Compute the delta

compute-delta reads both benchmark files, applies the direction-aware rule, and prints a JSON summary. This step is optional: you can skip straight to publish with raw medians if you prefer, but it is a useful sanity check before committing anything to the network. The --bench-cmd you pass here also flows into the output JSON so publish --from-json can lift it onto the COB.

rad experiment compute-delta \
  --baseline /tmp/baseline.json --candidate /tmp/candidate.json \
  --primary-metric duration_ms --criteria lower_is_better \
  --bench-cmd "$BENCH" \
  --base-commit $BASE --head-commit $HEAD \
  --description "Hoist allocation out of the inner loop"

4. Publish

Pass the medians and sample counts directly. Values are scaled ×1000 (see Conventions). This writes a signed record and broadcasts it to peers.

rad experiment publish \
  --description "Hoist allocation out of the inner loop" \
  --base $BASE --head $HEAD \
  --metric duration_ms \
  --baseline-median 1500 --baseline-n 5 \
  --candidate-median 1425 --candidate-n 5 \
  --bench-cmd "$BENCH" \
  --metric-regex "duration_ms=duration\\s*:\\s*([0-9.]+)\\s*ms"
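
If your own harness produces raw per-run numbers instead of rad experiment benchmark output, the median and sample count for the flags above have to come from somewhere. A minimal sketch follows; median_x1000 is a hypothetical helper, the five sample values are invented, and averaging the two middle values for an even count is an assumption that may not match what benchmark does.

```shell
# Median of x1000 integers read one per line from stdin.
# Assumption: for an even count, use the mean of the two middle values
# (truncated to an integer). Fails with status 1 on empty input.
median_x1000() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      if (NR % 2) print v[(NR + 1) / 2]
      else printf "%d\n", (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

printf '%s\n' 1510 1495 1500 1502 1498 | median_x1000   # prints 1500
```

The matching --baseline-n or --candidate-n value is simply the number of lines you fed in.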

5. Browse

Confirm it landed. --landable filters to branches whose tip 3-way merges cleanly onto canonical main — i.e. ones you could reuse as a base without conflicts. Already-merged branches are excluded.

rad experiment list
rad experiment list --landable

6. Have someone reproduce it

Share the experiment ID (or its 7-char prefix). Anyone tracking the repo can re-run the same benchmark on their own hardware and publish a signed reproduction that attaches to the original.

rad experiment reproduce 5574144 --runs 10 --notes "M2 Pro, perf governor"

Scripting

Every command speaks JSON with --json, so you can drive the CLI from shell scripts, CI jobs, or your own tools without parsing human-readable output.

Read commands that return collections (like list) emit JSONL — one JSON object per line — so you can stream-process results without buffering the whole list. Write commands (like publish) return a single JSON object for the thing you just created.

The examples below use jq, the standard command-line JSON processor (brew install jq or apt install jq). Anything that can read a stream of JSON works — Python, fx, etc.

Naming gotcha. list and show emit camelCase keys (deltaPctX100, medianX1000, memoryBytes, agentSystem). benchmark and compute-delta emit snake_case (delta_pct_x100, median_x1000). Pick the spelling based on which command produced the JSON, not your preference.

# List every unique author that has published to this repo.
rad experiment list --json | jq -r '.author.id' | sort -u

# Print the IDs of every experiment that improved duration_ms by ≥5%.
# (duration_ms is lower_is_better; improvement shows as a negative delta.
#  deltaPctX100 is the raw percentage scaled ×100: -500 = -5.00%.)
rad experiment list --json \
  | jq -r 'select(.metrics[0].name == "duration_ms" and .metrics[0].deltaPctX100 <= -500) | .id'

# Sum raw deltas across a whole branch (negative means net improvement here).
# -s slurps the stream into one array so we can aggregate.
rad experiment list --branch z6MkfEaY/experiments/foo --json \
  | jq -s 'map(.metrics[0].deltaPctX100) | add / 100'

# Chain commands: publish an experiment, capture its ID, then label it.
ID=$(rad experiment publish ... --json | jq -r .id)
rad experiment label "$ID" reviewed

Pair with --quiet to also suppress the human-readable progress lines that normally go to stderr — keeps CI logs clean.
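
Following the earlier advice to prefer verdict in scripts, a filter on that field avoids reasoning about signs altogether. This is a sketch that assumes verdict sits on each metric object as described under Conventions, and that .metrics[0] is the metric you care about, as in the examples above; improved_ids is a hypothetical function name.

```shell
# Print the IDs of experiments whose first metric improved.
# -q is a global flag, so it goes before the subcommand.
improved_ids() {
  rad experiment -q list --json \
    | jq -r 'select(.metrics[0].verdict == "improved") | .id'
}
```

Because list emits JSONL, jq applies the filter to one experiment at a time and nothing has to be buffered.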

Exit codes & side effects

Exit codes are simple: 0 on success, 1 on error. Errors print a message to stderr prefixed with error:. Scripts can branch on the exit code without parsing output.
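
Branching on the exit code could look like the sketch below. run_step is a hypothetical helper that works for any command; it relies only on the 0-or-non-zero convention described above and leaves the command's own error: message on stderr untouched.

```shell
# Run a command, report success or failure under a label, and pass the
# command's exit status through to the caller.
run_step() {
  local label=$1 status
  shift
  if "$@"; then
    echo "ok: $label"
  else
    status=$?
    echo "failed: $label (exit $status)" >&2
    return "$status"
  fi
}
```

A CI job might call run_step list rad experiment -q list --json and stop at the first step that returns non-zero.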

Side effects — what actually changes outside the process — vary by command. For any mutation, they happen in this order:

Write to local Radicle storage (publish, reproduce, label, redact). The signed record is appended to your local experiment store under ~/.radicle/storage/, signed by the Radicle identity you set up on first run. Nothing has left your machine yet: the record exists only locally, and its signature makes any later tampering detectable.
Announce to peers (all mutations). Your Radicle node broadcasts the new refs so other peers tracking the repo can replicate them. This is best-effort: if the node isn't running, you'll see a hint on stderr but the command still exits 0 — the record is committed locally and will sync automatically the next time your node starts or you run rad sync.
Pending file (compute-delta, on by default, disable with --pending=false). Writes the delta JSON to /tmp/cc-experiment-pending/{head_sha}.json so an auto-publish hook can pick it up and call publish without re-deriving the numbers. This is how the Claude Code plugin closes the loop; ignore it if you're running the commands yourself.
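
If you write your own hook around the pending file, locating it is the first step. The sketch below only builds and checks the documented path. PENDING_DIR is a hypothetical override added for testing, both function names are invented, and whether {head_sha} is the full or an abbreviated SHA is an assumption (the full SHA is used here).

```shell
# Directory where compute-delta is documented to leave its pending JSON.
# PENDING_DIR is a hypothetical override, not something the CLI reads.
PENDING_DIR=${PENDING_DIR:-/tmp/cc-experiment-pending}

# Build the pending-file path for a given head commit SHA.
pending_path() {
  printf '%s/%s.json\n' "$PENDING_DIR" "$1"
}

# Print the path and succeed only if a pending delta exists for that commit.
find_pending() {
  local path
  path=$(pending_path "$1")
  [ -f "$path" ] && printf '%s\n' "$path"
}
```

A hook could then run find_pending "$(git rev-parse HEAD)" and do nothing when it fails.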

Read-only commands: list, show, and labels make no writes and no network calls, so they are safe to run in tight loops or against remotes you don't trust. benchmark and compute-delta never touch the experiment store or the network, but they do write to the filesystem (the worktree and /tmp/).