Weco Observe
Track external optimization experiments with Weco's observability platform. Get tree visualization, code diffs, and metrics for any LLM-driven optimization loop.
Weco Observe lets you track experiments from any optimization loop - whether it's an LLM agent running autoresearch, a custom training script, or a manual experimentation workflow. Your experiments appear in the same Weco dashboard as regular runs, with tree visualization, code diffs, and multi-metric tracking.
When to use Observe
Use `weco observe` when you're running your own optimization loop and want to track results in the Weco dashboard. This is different from `weco run`, which drives the optimization itself.
| | weco run | weco observe |
|---|---|---|
| Who optimizes? | Weco | You (or your agent) |
| Who evaluates? | Weco (runs your eval command) | You (evaluation happens externally) |
| What Weco provides | End-to-end optimization | Dashboard, visualization, tracking |
| CLI interaction | Long-running process | Fire-and-forget commands |
Quick start
First, authenticate if you haven't already:
weco login

Add the following block to your agent's instructions (e.g. program.md, system prompt, or task description). The agent handles everything from there:
## Weco Observe
Track experiments with Weco Observe. Authenticate using `weco login` if needed.
At setup, create a run (this captures your baseline code as step 0):
WECO_RUN_ID=$(weco observe init --name "<run-name>" --metric <metric> --goal min --source <file>)
After each experiment, log the result. Use step 0 for the baseline, then 1, 2, 3, ... for experiments:
weco observe log --run-id "$WECO_RUN_ID" --step 0 --status completed --description "baseline" --metrics '{"<metric>": <val>}' --source <file>
weco observe log --run-id "$WECO_RUN_ID" --step 1 --status completed --description "<what you tried>" --metrics '{"<metric>": <val>}' --source <file>
weco observe log --run-id "$WECO_RUN_ID" --step 2 --status failed --description "<what you tried>" --metrics '{"<metric>": <val>}' --source <file>
Use `--status failed` for discarded or crashed experiments. If you omit `--parent-step`, each step automatically chains to the last successful one. To branch from a particular step instead, pass `--parent-step` explicitly:
# Step 3 failed and was reverted - step 4 branches from step 2 (not 3)
weco observe log --run-id "$WECO_RUN_ID" --step 4 --parent-step 2 --status completed --description "<what you tried>" --metrics '{"<metric>": <val>}' --source <file>

All observe commands are fire-and-forget - they print warnings to stderr on failure but always exit 0, so they never crash your agent's loop.
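The default chaining rule can be sketched in Python. This is an illustrative model of the behavior described above (chain to the last non-failed step), not Weco's actual implementation:

```python
def default_parent(history):
    """Return the step a new log call chains to when --parent-step is omitted:
    the most recent step whose status is not "failed".
    Illustrative model only - the real chaining happens inside Weco."""
    successful = [step for step, status in history if status != "failed"]
    return successful[-1] if successful else None

# Steps 0-2 completed, step 3 failed: a new step auto-chains to step 2.
history = [(0, "completed"), (1, "completed"), (2, "completed"), (3, "failed")]
print(default_parent(history))  # 2
```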
1. Initialize a run
Create a run and capture the run ID. This also records your baseline code as step 0:
WECO_RUN_ID=$(weco observe init \
--name "my-experiment" \
--metric val_bpb \
--goal min \
--source train.py)

2. Log experiments
After each experiment, log the result. Use step 0 for the baseline, then 1, 2, 3, ... for experiments:
# Log baseline result (step 0)
weco observe log \
--run-id "$WECO_RUN_ID" \
--step 0 \
--status completed \
--description "baseline" \
--metrics '{"val_bpb": 2.366, "memory_gb": 0.0}' \
--source train.py
# Log an experiment
weco observe log \
--run-id "$WECO_RUN_ID" \
--step 1 \
--status completed \
--description "increase batch size to 32K" \
--metrics '{"val_bpb": 2.261, "memory_gb": 0.0}' \
--source train.py
# Log a failed experiment
weco observe log \
--run-id "$WECO_RUN_ID" \
--step 2 \
--status failed \
--description "double model depth (OOM)" \
--metrics '{"val_bpb": 0.0, "memory_gb": 0.0}' \
--source train.py

3. View in dashboard
Open the Weco dashboard to see your experiments with tree visualization, code diffs, and metrics.
CLI reference
weco observe init
Create a new external run for tracking.
| Argument | Description | Required |
|---|---|---|
-s, --source | Single source code file to track | Yes (or --sources) |
--sources | Multiple source code files to track | Yes (or --source) |
--metric | Primary metric name (e.g. val_bpb) | Yes |
-g, --goal | maximize/max or minimize/min | No (default: minimize) |
--name | Run name | No |
-i, --additional-instructions | Instructions for the run | No |
Prints the run ID to stdout so it can be captured with $().
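If you drive the CLI from Python rather than a shell script, you can assemble the `init` invocation and capture the run ID from stdout via `subprocess`. A minimal sketch (flag names come from the table above; the actual call is shown commented out so the snippet stands alone):

```python
import subprocess

def build_init_cmd(name, metric, goal="minimize", source="train.py"):
    # Flags as documented for `weco observe init`.
    return ["weco", "observe", "init",
            "--name", name, "--metric", metric,
            "--goal", goal, "--source", source]

cmd = build_init_cmd("my-experiment", "val_bpb", goal="min")
# run_id = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout.strip()
```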
weco observe log
Log a step (experiment) to an existing run.
| Argument | Description | Required |
|---|---|---|
--run-id | Run ID from weco observe init | Yes |
--step | Step number (0 = baseline, then 1, 2, 3, ...) | Yes |
--status | completed or failed | No (default: completed) |
--description | What was tried in this experiment | No |
--metrics | JSON object of metrics (e.g. '{"val_bpb": 1.03}') | No |
-s, --source | Source file to snapshot | No (or --sources) |
--sources | Multiple source files to snapshot | No (or --source) |
--parent-step | Parent step number for tree branching | No (auto-chains to last successful step) |
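The `--metrics` value is a JSON object. When generating it programmatically, serializing a dict avoids shell-quoting mistakes; a small sketch:

```python
import json

# Build the value for --metrics from a plain dict.
metrics = {"val_bpb": 2.261, "memory_gb": 0.0}
metrics_json = json.dumps(metrics)
print(metrics_json)  # {"val_bpb": 2.261, "memory_gb": 0.0}
```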
Branching and tree structure
By default, each step automatically chains to the last successful (non-failed) step, forming a linear sequence. When you discard a failed experiment and try a different approach, use --parent-step to branch correctly:
# Step 2 is the current best
# Step 3 failed - we revert to step 2 and try something different
weco observe log --run-id "$WECO_RUN_ID" --step 3 --status failed \
--description "tried X, didn't work" --metrics '{"val_bpb": 2.5}'
# Step 4 branches from step 2, not the failed step 3
weco observe log --run-id "$WECO_RUN_ID" --step 4 --parent-step 2 \
--status completed --description "tried Y instead" --metrics '{"val_bpb": 2.1}'

This produces a tree in the dashboard:
[Baseline (step 0)]
+-- Step 1 (kept)
    +-- Step 2 (kept)
        +-- Step 3 (failed)
        +-- Step 4 (kept, branched from 2)

Python SDK
For scripts with a Python loop, use the SDK directly instead of shell commands:
from weco.observe import WecoObserver
obs = WecoObserver()
run = obs.create_run(
name="sweep v3",
source_code={"train.py": open("train.py").read()},
primary_metric="val_bpb",
maximize=False,
)
for i, result in enumerate(experiments):
run.log_step(
step=i,
status="completed" if result.kept else "failed",
description=result.description,
metrics={"val_bpb": result.val_bpb, "memory_gb": result.memory_gb},
code={"train.py": open("train.py").read()},
)

Run lifecycle
External runs are managed by the dashboard, not the CLI:
- Runs stay active as long as steps are being logged
- The dashboard shows "Latest step X ago" for running external runs
- Mark as Complete from the dashboard actions menu to close a run
- Logging a new step to a closed run silently reopens it
- Runs with no activity for 24 hours are automatically archived
Design notes
- Step 0 is the baseline - created automatically by `init` with your source code. Log step 0 after your baseline run to attach metrics.
- Idempotent - re-posting the same step number updates the existing step instead of creating a duplicate.
- Multi-metric - the `--metrics` JSON can contain any number of metrics. The primary metric (specified in `init`) is used for the chart and "best metric" calculations.
- Code diffs - pass `--source` on each step to see code diffs between experiments in the dashboard.
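The idempotent-update behavior can be modeled as an upsert keyed by step number. A toy client-side illustration only; the real logic is server-side:

```python
# Toy model: steps are keyed by step number, so re-posting a step
# merges new fields into the existing record instead of duplicating it.
steps = {}

def post_step(step, **fields):
    steps[step] = {**steps.get(step, {}), **fields}

post_step(1, status="completed", metrics={"val_bpb": 2.30})
post_step(1, metrics={"val_bpb": 2.25})   # same step number: update, not duplicate
print(len(steps), steps[1]["metrics"])    # 1 {'val_bpb': 2.25}
```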