
Weco Observe

Track external optimization experiments with Weco's observability platform. Get tree visualization, code diffs, and metrics for any LLM-driven optimization loop.

Weco Observe lets you track experiments from any optimization loop - whether it's an LLM agent running autoresearch, a custom training script, or a manual experimentation workflow. Your experiments appear in the same Weco dashboard as regular runs, with tree visualization, code diffs, and multi-metric tracking.

When to use Observe

Use weco observe when you're running your own optimization loop and want to track results in the Weco dashboard. This is different from weco run, which drives the optimization itself.

| | weco run | weco observe |
|---|---|---|
| Who optimizes? | Weco | You (or your agent) |
| Who evaluates? | Weco (runs your eval command) | You (evaluation happens externally) |
| What Weco provides | End-to-end optimization | Dashboard, visualization, tracking |
| CLI interaction | Long-running process | Fire-and-forget commands |

Quick start

First, authenticate if you haven't already:

weco login

Add the following block to your agent's instructions (e.g. program.md, system prompt, or task description). The agent handles everything from there:

## Weco Observe

Track experiments with Weco Observe. Authenticate using `weco login` if needed.

At setup, create a run (this captures your baseline code as step 0):

    WECO_RUN_ID=$(weco observe init --name "<run-name>" --metric <metric> --goal min --source <file>)

After each experiment, log the result. Use step 0 for the baseline, then 1, 2, 3, ... for experiments:

    weco observe log --run-id "$WECO_RUN_ID" --step 0 --status completed --description "baseline" --metrics '{"<metric>": <val>}' --source <file>
    weco observe log --run-id "$WECO_RUN_ID" --step 1 --status completed --description "<what you tried>" --metrics '{"<metric>": <val>}' --source <file>
    weco observe log --run-id "$WECO_RUN_ID" --step 2 --status failed --description "<what you tried>" --metrics '{"<metric>": <val>}' --source <file>

Use `--status failed` for discarded or crashed experiments. If you omit `--parent-step`, each step automatically chains to the last successful one. To branch from a particular step instead, pass `--parent-step` explicitly:

    # Step 3 failed and was reverted - step 4 branches from step 2 (not 3)
    weco observe log --run-id "$WECO_RUN_ID" --step 4 --parent-step 2 --status completed --description "<what you tried>" --metrics '{"<metric>": <val>}' --source <file>

All observe commands are fire-and-forget - they print warnings to stderr on failure but always exit 0, so they never crash your agent's loop.

1. Initialize a run

Create a run and capture the run ID. This also records your baseline code as step 0:

WECO_RUN_ID=$(weco observe init \
  --name "my-experiment" \
  --metric val_bpb \
  --goal min \
  --source train.py)

2. Log experiments

After each experiment, log the result. Use step 0 for the baseline, then 1, 2, 3, ... for experiments:

# Log baseline result (step 0)
weco observe log \
  --run-id "$WECO_RUN_ID" \
  --step 0 \
  --status completed \
  --description "baseline" \
  --metrics '{"val_bpb": 2.366, "memory_gb": 0.0}' \
  --source train.py

# Log an experiment
weco observe log \
  --run-id "$WECO_RUN_ID" \
  --step 1 \
  --status completed \
  --description "increase batch size to 32K" \
  --metrics '{"val_bpb": 2.261, "memory_gb": 0.0}' \
  --source train.py

# Log a failed experiment
weco observe log \
  --run-id "$WECO_RUN_ID" \
  --step 2 \
  --status failed \
  --description "double model depth (OOM)" \
  --metrics '{"val_bpb": 0.0, "memory_gb": 0.0}' \
  --source train.py

3. View in dashboard

Open the Weco dashboard to see your experiments with tree visualization, code diffs, and metrics.

CLI reference

weco observe init

Create a new external run for tracking.

| Argument | Description | Required |
|---|---|---|
| `-s, --source` | Single source code file to track | Yes (or `--sources`) |
| `--sources` | Multiple source code files to track | Yes (or `--source`) |
| `--metric` | Primary metric name (e.g. `val_bpb`) | Yes |
| `-g, --goal` | `maximize`/`max` or `minimize`/`min` | No (default: `minimize`) |
| `--name` | Run name | No |
| `-i, --additional-instructions` | Instructions for the run | No |

Prints the run ID to stdout so it can be captured with $().
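If you're driving the CLI from Python rather than shell, the run ID can be captured the same way. A minimal sketch (assumes `weco` is on your PATH and you've already run `weco login`):

```python
import subprocess

def capture_stdout(cmd):
    """Run a command and return its stripped stdout -- the Python
    equivalent of WECO_RUN_ID=$(weco observe init ...)."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

# run_id = capture_stdout([
#     "weco", "observe", "init",
#     "--name", "my-experiment",
#     "--metric", "val_bpb",
#     "--goal", "min",
#     "--source", "train.py",
# ])
```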

weco observe log

Log a step (experiment) to an existing run.

| Argument | Description | Required |
|---|---|---|
| `--run-id` | Run ID from `weco observe init` | Yes |
| `--step` | Step number (0 = baseline, then 1, 2, 3, ...) | Yes |
| `--status` | `completed` or `failed` | No (default: `completed`) |
| `--description` | What was tried in this experiment | No |
| `--metrics` | JSON object of metrics (e.g. `'{"val_bpb": 1.03}'`) | No |
| `-s, --source` | Source file to snapshot | No (or `--sources`) |
| `--sources` | Multiple source files to snapshot | No (or `--source`) |
| `--parent-step` | Parent step number for tree branching | No (auto-chains to last successful step) |

Branching and tree structure

By default, each step automatically chains to the last successful (non-failed) step, forming a linear sequence. When you discard a failed experiment and try a different approach, use --parent-step to branch correctly:

# Step 2 is the current best
# Step 3 failed - we revert to step 2 and try something different
weco observe log --run-id "$WECO_RUN_ID" --step 3 --status failed \
  --description "tried X, didn't work" --metrics '{"val_bpb": 2.5}'

# Step 4 branches from step 2, not the failed step 3
weco observe log --run-id "$WECO_RUN_ID" --step 4 --parent-step 2 \
  --status completed --description "tried Y instead" --metrics '{"val_bpb": 2.1}'

This produces a tree in the dashboard:

[Baseline (step 0)]
  +-- Step 1 (kept)
        +-- Step 2 (kept)
              +-- Step 3 (failed)
              +-- Step 4 (kept, branched from 2)

Python SDK

For scripts with a Python loop, use the SDK directly instead of shell commands:

from weco.observe import WecoObserver

obs = WecoObserver()
run = obs.create_run(
    name="sweep v3",
    source_code={"train.py": open("train.py").read()},
    primary_metric="val_bpb",
    maximize=False,
)

for i, result in enumerate(experiments):
    run.log_step(
        step=i,
        status="completed" if result.kept else "failed",
        description=result.description,
        metrics={"val_bpb": result.val_bpb, "memory_gb": result.memory_gb},
        code={"train.py": open("train.py").read()},
    )

Run lifecycle

External runs are managed by the dashboard, not the CLI:

  • Runs stay active as long as steps are being logged
  • The dashboard shows "Latest step X ago" for running external runs
  • Mark as Complete from the dashboard actions menu to close a run
  • Logging a new step to a closed run silently reopens it
  • Runs with no activity for 24 hours are automatically archived

Design notes

  • Step 0 is the baseline - created automatically by init with your source code. Log step 0 after your baseline run to attach metrics.
  • Idempotent - re-posting the same step number updates the existing step instead of creating a duplicate.
  • Multi-metric - the --metrics JSON can contain any number of metrics. The primary metric (specified in init) is used for the chart and "best metric" calculations.
  • Code diffs - pass --source on each step to see code diffs between experiments in the dashboard.
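When a script generates the `--metrics` payload, building it with `json.dumps` avoids shell-quoting mistakes. A small sketch (the metric names here are just placeholders):

```python
import json

# Build the metrics payload programmatically rather than hand-writing JSON.
metrics = {"val_bpb": 2.261, "memory_gb": 0.0}
payload = json.dumps(metrics)

# Pass it as a single argument, e.g. via subprocess:
# subprocess.run(["weco", "observe", "log", "--run-id", run_id,
#                 "--step", "1", "--metrics", payload, "--source", "train.py"])
```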
