Weco Observe
Track external optimization experiments with Weco's observability platform. Get tree visualization, code diffs, and metrics for any LLM-driven optimization loop.
Weco Observe lets you track experiments from any optimization loop - whether it's an LLM agent running autoresearch, a custom training script, or a manual experimentation workflow. Your experiments appear in the same Weco dashboard as regular runs, with tree visualization, code diffs, and metric tracking.
How is this different from Weights & Biases?
Weights & Biases tracks the model-weight optimization process given fixed training code; Weco Observe tracks the code optimization process itself. W&B logs how your weights change under fixed code, while Weco tracks how your code changes to produce better results. Weco operates one level up, and the two are complementary.
When to use Observe
Use weco observe when you're running your own optimization loop and want to track results in the Weco dashboard. This is different from weco run, which drives the optimization itself.
| | weco run | weco observe |
|---|---|---|
| Who optimizes? | Weco | You (or your agent) |
| Who evaluates? | Weco (runs your eval command) | You (evaluation happens externally) |
| What Weco provides | End-to-end optimization | Dashboard, visualization, tracking |
| CLI interaction | Long-running process | Fire-and-forget commands |
Quick start
First, authenticate if you haven't already:
```bash
weco login
```

If you have the Weco skill installed, just include an instruction to monitor experiments when you start an optimization:
```
Use the weco skill to monitor the experiments.
```

The skill handles initializing runs, logging steps, branching, and tracking metrics automatically - no manual CLI commands needed.
Don't have the skill installed? Run weco setup claude-code or weco setup cursor first. See the Skills guide for details.
1. Initialize a run
Create a run and capture the run ID. This also records your baseline code as step 0:
```bash
WECO_RUN_ID=$(weco observe init \
  --name "my-experiment" \
  --metric val_bpb \
  --goal min \
  --source train.py)
```

2. Log experiments
After each experiment, log the result. Use step 0 for the baseline, then 1, 2, 3, ... for experiments:
```bash
# Log baseline result (step 0)
weco observe log \
  --run-id "$WECO_RUN_ID" \
  --step 0 \
  --status completed \
  --description "baseline" \
  --metrics '{"val_bpb": 2.366, "memory_gb": 0.0}' \
  --source train.py

# Log an experiment
weco observe log \
  --run-id "$WECO_RUN_ID" \
  --step 1 \
  --status completed \
  --description "increase batch size to 32K" \
  --metrics '{"val_bpb": 2.261, "memory_gb": 0.0}' \
  --source train.py

# Log a failed experiment
weco observe log \
  --run-id "$WECO_RUN_ID" \
  --step 2 \
  --status failed \
  --description "double model depth (OOM)" \
  --metrics '{"val_bpb": 0.0, "memory_gb": 0.0}' \
  --source train.py
```

3. View in dashboard
Open the Weco dashboard to see your experiments with tree visualization, code diffs, and metrics.
CLI reference
weco observe init
Create a new external run for tracking.
| Argument | Description | Required |
|---|---|---|
| -s, --source | Single source code file to track | Yes (or --sources) |
| --sources | Multiple source code files to track | Yes (or --source) |
| --metric | Primary metric name (e.g. val_bpb) | Yes |
| -g, --goal | maximize/max or minimize/min | No (default: minimize) |
| --name | Run name | No |
| -i, --additional-instructions | Instructions for the run | No |
Prints the run ID to stdout so it can be captured with $().
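If you are driving the CLI from a Python script rather than a shell, the same stdout capture can be done with `subprocess`. A minimal sketch - the helper names here are ours, not part of Weco:

```python
import subprocess

def build_init_cmd(name, metric, goal, source):
    """Assemble the `weco observe init` argument list."""
    return [
        "weco", "observe", "init",
        "--name", name,
        "--metric", metric,
        "--goal", goal,
        "--source", source,
    ]

def weco_init(name, metric, goal="min", source="train.py"):
    """Shell out to the CLI and return the run ID it prints to stdout."""
    result = subprocess.run(
        build_init_cmd(name, metric, goal, source),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

This mirrors the `$(...)` pattern from the quick start: `weco_init("my-experiment", "val_bpb")` would return the run ID to pass to later `log` calls.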
weco observe log
Log a step (experiment) to an existing run.
| Argument | Description | Required |
|---|---|---|
| --run-id | Run ID from weco observe init | Yes |
| --step | Step number (0 = baseline, then 1, 2, 3, ...) | Yes |
| --status | completed or failed | No (default: completed) |
| --description | What was tried in this experiment | No |
| --metrics | JSON object of metrics (e.g. '{"val_bpb": 1.03}') | No |
| -s, --source | Source file to snapshot | No (or --sources) |
| --sources | Multiple source files to snapshot | No (or --source) |
| --parent-step | Parent step number for tree branching | No (auto-chains to last successful step) |
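Hand-quoting the `--metrics` JSON inside a shell command is error-prone; serializing a dict with `json.dumps` avoids that. A hedged sketch of a command builder (the helper is ours; note it only emits `--parent-step` when branching explicitly, since omitting the flag lets Weco chain to the last successful step automatically):

```python
import json

def build_log_cmd(run_id, step, metrics, status="completed",
                  description=None, parent_step=None):
    """Assemble a `weco observe log` argument list.

    json.dumps produces the quoting --metrics expects. --parent-step is
    only added when branching explicitly; leaving it out keeps the
    default auto-chaining behavior.
    """
    cmd = [
        "weco", "observe", "log",
        "--run-id", run_id,
        "--step", str(step),
        "--status", status,
        "--metrics", json.dumps(metrics),
    ]
    if description is not None:
        cmd += ["--description", description]
    if parent_step is not None:
        cmd += ["--parent-step", str(parent_step)]
    return cmd
```

For example, `build_log_cmd(run_id, 4, {"val_bpb": 2.1}, parent_step=2)` produces the same invocation as the branching example in the next section.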
Branching and tree structure
By default, each step automatically chains to the last successful (non-failed) step, forming a linear sequence. When you discard a failed experiment and try a different approach, use --parent-step to branch correctly:
```bash
# Step 2 is the current best
# Step 3 failed - we revert to step 2 and try something different
weco observe log --run-id "$WECO_RUN_ID" --step 3 --status failed \
  --description "tried X, didn't work" --metrics '{"val_bpb": 2.5}'

# Step 4 branches from step 2, not the failed step 3
weco observe log --run-id "$WECO_RUN_ID" --step 4 --parent-step 2 \
  --status completed --description "tried Y instead" --metrics '{"val_bpb": 2.1}'
```

This produces a tree in the dashboard:
```
[Baseline (step 0)]
+-- Step 1 (kept)
    +-- Step 2 (kept)
        +-- Step 3 (failed)
        +-- Step 4 (kept, branched from 2)
```

Python SDK
For scripts with a Python loop, use the SDK directly instead of shell commands:
```python
from weco.observe import WecoObserver

obs = WecoObserver()
run = obs.create_run(
    name="sweep v3",
    source_code={"train.py": open("train.py").read()},
    primary_metric="val_bpb",
    maximize=False,
)

for i, result in enumerate(experiments):
    run.log_step(
        step=i,
        status="completed" if result.kept else "failed",
        description=result.description,
        metrics={"val_bpb": result.val_bpb, "memory_gb": result.memory_gb},
        code={"train.py": open("train.py").read()},
    )
```

Run lifecycle
External runs are managed by the dashboard, not the CLI:
- Runs stay active as long as steps are being logged
- The dashboard shows "Latest step X ago" for running external runs
- Mark as Complete from the dashboard actions menu to close a run
- Logging a new step to a closed run silently reopens it
- Runs with no activity for 24 hours are automatically archived
Design notes
- Step 0 is the baseline - created automatically by init with your source code. Log step 0 after your baseline run to attach metrics.
- Idempotent - re-posting the same step number updates the existing step instead of creating a duplicate.
- Multi-metric - the --metrics JSON can contain any number of metrics. The primary metric (specified in init) is used for the chart and "best metric" calculations.
- Code diffs - pass --source on each step to see code diffs between experiments in the dashboard.
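The idempotency rule can be pictured as a step table keyed by step number, where re-posting replaces the earlier entry. A toy model of the semantics - illustration only, not Weco's implementation:

```python
# Steps keyed by step number: re-posting a step number overwrites
# the earlier entry rather than appending a duplicate.
steps = {}

def log_step(step, metrics, status="completed"):
    steps[step] = {"status": status, "metrics": metrics}

log_step(1, {"val_bpb": 2.30})
log_step(1, {"val_bpb": 2.26})  # corrected re-post of the same step

assert len(steps) == 1                          # still one step, not two
assert steps[1]["metrics"]["val_bpb"] == 2.26   # latest post wins
```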