Weco Observe
Track external optimization experiments with Weco's observability platform. Get tree visualization, code diffs, and metrics for any LLM-driven optimization loop.
Weco Observe lets you track experiments from any optimization loop - whether it's an LLM agent running autoresearch, a custom training script, or a manual experimentation workflow. Your experiments appear in the same Weco dashboard as regular runs, with tree visualization, code diffs, and metric tracking.
How is this different from Weights & Biases?
Weights & Biases tracks the model-weight optimization process given fixed training code; Weco Observe tracks the code optimization process itself. W&B logs how your weights change under fixed code, while Weco tracks how your code changes to produce better results. Weco operates one level up, and the two are complementary.
When to use Observe
Use weco observe when you're running your own optimization loop and want to track results in the Weco dashboard. This is different from weco run, which drives the optimization itself.
| | weco run | weco observe |
|---|---|---|
| Who optimizes? | Weco | You (or your agent) |
| Who evaluates? | Weco (runs your eval command) | You (evaluation happens externally) |
| What Weco provides | End-to-end optimization | Dashboard, visualization, tracking |
| CLI interaction | Long-running process | Fire-and-forget commands |
Quick start
First, authenticate if you haven't already:
```bash
weco login
```

If you have the Weco skill installed, just include an instruction to monitor experiments when you start an optimization:
```
Use the weco skill to monitor the experiments.
```

The skill handles initializing runs, logging steps, branching, and tracking metrics automatically - no manual CLI commands needed.
Don't have the skill installed? Run weco setup claude-code or weco setup cursor first. See the Skills guide for details.
1. Initialize a run
Create a run and capture the run ID. This also records your baseline code as step 0:
```bash
WECO_RUN_ID=$(weco observe init \
  --name "my-experiment" \
  --metric val_bpb \
  --goal min \
  --source train.py)
```

2. Log experiments
After each experiment, log the result. Use step 0 for the baseline, then 1, 2, 3, ... for experiments:
```bash
# Log baseline result (step 0)
weco observe log \
  --run-id "$WECO_RUN_ID" \
  --step 0 \
  --status completed \
  --description "baseline" \
  --metrics '{"val_bpb": 2.366, "memory_gb": 0.0}' \
  --source train.py

# Log an experiment
weco observe log \
  --run-id "$WECO_RUN_ID" \
  --step 1 \
  --status completed \
  --description "increase batch size to 32K" \
  --metrics '{"val_bpb": 2.261, "memory_gb": 0.0}' \
  --source train.py

# Log a failed experiment
weco observe log \
  --run-id "$WECO_RUN_ID" \
  --step 2 \
  --status failed \
  --description "double model depth (OOM)" \
  --metrics '{"val_bpb": 0.0, "memory_gb": 0.0}' \
  --source train.py
```

3. View in dashboard
Open the Weco dashboard to see your experiments with tree visualization, code diffs, and metrics.
CLI reference
weco observe init
Create a new external run for tracking.
| Argument | Description | Required |
|---|---|---|
| -s, --source | Single source code file to track | Yes (or --sources) |
| --sources | Multiple source code files to track | Yes (or --source) |
| --metric | Primary metric name (e.g. val_bpb) | Yes |
| -g, --goal | maximize/max or minimize/min | No (default: minimize) |
| --name | Run name | No |
| -i, --additional-instructions | Instructions for the run | No |
Prints the run ID to stdout so it can be captured with $().
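If you are driving the CLI from a Python script rather than a shell, the same stdout capture can be done with `subprocess`. A minimal sketch - the helper names here are ours, not part of Weco:

```python
import subprocess

def build_init_cmd(name, metric, goal, source):
    """Assemble the `weco observe init` argument list."""
    return [
        "weco", "observe", "init",
        "--name", name,
        "--metric", metric,
        "--goal", goal,
        "--source", source,
    ]

def weco_init(name, metric, goal="min", source="train.py"):
    """Shell out to the CLI and return the run ID it prints to stdout."""
    result = subprocess.run(
        build_init_cmd(name, metric, goal, source),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

This mirrors the `$(...)` pattern from the quick start: `weco_init("my-experiment", "val_bpb")` would return the run ID to pass to later `log` calls.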
weco observe log
Log a step (experiment) to an existing run.
| Argument | Description | Required |
|---|---|---|
| --run-id | Run ID from weco observe init | Yes |
| --step | Step number (0 = baseline, then 1, 2, 3, ...) | Yes |
| --status | completed or failed | No (default: completed) |
| --description | What was tried in this experiment | No |
| --metrics | JSON object of metrics (e.g. '{"val_bpb": 1.03}') | No |
| -s, --source | Source file to snapshot | No (or --sources) |
| --sources | Multiple source files to snapshot | No (or --source) |
| --parent-step | Parent step number for tree branching | No (auto-chains to last successful step) |
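Hand-quoting the `--metrics` JSON inside a shell command is error-prone; serializing a dict with `json.dumps` avoids that. A hedged sketch of a command builder (the helper is ours; note it only emits `--parent-step` when branching explicitly, since omitting the flag lets Weco chain to the last successful step automatically):

```python
import json

def build_log_cmd(run_id, step, metrics, status="completed",
                  description=None, parent_step=None):
    """Assemble a `weco observe log` argument list.

    json.dumps produces the quoting --metrics expects. --parent-step is
    only added when branching explicitly; leaving it out keeps the
    default auto-chaining behavior.
    """
    cmd = [
        "weco", "observe", "log",
        "--run-id", run_id,
        "--step", str(step),
        "--status", status,
        "--metrics", json.dumps(metrics),
    ]
    if description is not None:
        cmd += ["--description", description]
    if parent_step is not None:
        cmd += ["--parent-step", str(parent_step)]
    return cmd
```

For example, `build_log_cmd(run_id, 4, {"val_bpb": 2.1}, parent_step=2)` produces the same invocation as the branching example in the next section.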
Branching and tree structure
By default, each step automatically chains to the last successful (non-failed) step, forming a linear sequence. When you discard a failed experiment and try a different approach, use --parent-step to branch correctly:
```bash
# Step 2 is the current best
# Step 3 failed - we revert to step 2 and try something different
weco observe log --run-id "$WECO_RUN_ID" --step 3 --status failed \
  --description "tried X, didn't work" --metrics '{"val_bpb": 2.5}'

# Step 4 branches from step 2, not the failed step 3
weco observe log --run-id "$WECO_RUN_ID" --step 4 --parent-step 2 \
  --status completed --description "tried Y instead" --metrics '{"val_bpb": 2.1}'
```

This produces a tree in the dashboard:
```
[Baseline (step 0)]
+-- Step 1 (kept)
    +-- Step 2 (kept)
        +-- Step 3 (failed)
        +-- Step 4 (kept, branched from 2)
```

Python SDK
For scripts with a Python loop, use the SDK directly instead of shell commands:
```python
from weco.observe import WecoObserver

obs = WecoObserver()
run = obs.create_run(
    name="sweep v3",
    source_code={"train.py": open("train.py").read()},
    primary_metric="val_bpb",
    maximize=False,
)

for i, result in enumerate(experiments):
    run.log_step(
        step=i,
        status="completed" if result.kept else "failed",
        description=result.description,
        metrics={"val_bpb": result.val_bpb, "memory_gb": result.memory_gb},
        code={"train.py": open("train.py").read()},
    )
```

Run lifecycle
External runs are managed by the dashboard, not the CLI:
- Runs stay active as long as steps are being logged
- The dashboard shows "Latest step X ago" for running external runs
- Mark as Complete from the dashboard actions menu to close a run
- Logging a new step to a closed run silently reopens it
- Runs with no activity for 24 hours are automatically archived
Design notes
- Step 0 is the baseline - created automatically by init with your source code. Log step 0 after your baseline run to attach metrics.
- Idempotent - re-posting the same step number updates the existing step instead of creating a duplicate.
- Multi-metric - the --metrics JSON can contain any number of metrics. The primary metric (specified in init) is used for the chart and "best metric" calculations.
- Code diffs - pass --source on each step to see code diffs between experiments in the dashboard.
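The idempotency rule can be pictured as a step table keyed by step number, where re-posting replaces the earlier entry. A toy model of the semantics - illustration only, not Weco's implementation:

```python
# Steps keyed by step number: re-posting a step number overwrites
# the earlier entry rather than appending a duplicate.
steps = {}

def log_step(step, metrics, status="completed"):
    steps[step] = {"status": status, "metrics": metrics}

log_step(1, {"val_bpb": 2.30})
log_step(1, {"val_bpb": 2.26})  # corrected re-post of the same step

assert len(steps) == 1                          # still one step, not two
assert steps[1]["metrics"]["val_bpb"] == 2.26   # latest post wins
```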