Optimize a tabular fraud-detection pipeline on real Vesta payment transactions, with a leakage-safe fit/transform API

This example uses Weco to optimize a tabular fraud-detection pipeline on the IEEE-CIS Fraud Detection Kaggle dataset (real Vesta payment transactions). Weco rewrites two files — features.py (feature engineering) and model.py (the model) — to maximize AUC-ROC on a held-out, time-based validation split.

The interface is designed so that train/val leakage is impossible by construction: the validation split is never visible to feature-fitting code, and the target column is stripped before any feature sees it. See Why the fit/transform API below.

You can follow along here or check out the complete files here.

Baseline AUC is 0.9091 (deterministic: run python prepare_data.py, then python evaluate.py). With the bundled instructions.md and 200 steps of gemini-3.1-pro-preview, expect an AUC in the 0.928–0.933 range. These numbers reproduce Weco's published fraud-detection case study.

Setup

If you haven't already, follow the Installation guide. Otherwise, install the CLI:

pipx install weco

Other installation methods

curl -fsSL https://weco.ai/install.sh | sh

irm https://weco.ai/install.ps1 | iex

pip install weco

We recommend using a virtual environment when installing with pip.

git clone https://github.com/wecoai/weco-cli.git
cd weco-cli
pip install -e .

Use this if you want to contribute or modify Weco.

Clone the repo and install the example's dependencies:

git clone https://github.com/WecoAI/weco-cli.git
cd weco-cli/examples/fraud-detection
pip install --upgrade -r requirements.txt

Kaggle prerequisites. prepare_data.py downloads the competition data, which requires:

A Kaggle API token at ~/.kaggle/kaggle.json (chmod 600 ~/.kaggle/kaggle.json). See the Kaggle API docs.
Joining the competition at kaggle.com/c/ieee-fraud-detection (click Late Submission / Join Competition to accept the rules). Without this, the download returns a 403.
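The token prerequisite above can be checked before running prepare_data.py. The following is a minimal sketch, not part of the example repo: check_kaggle_token is a hypothetical helper, and the permission test assumes a POSIX system where chmod 600 is meaningful.

```python
import stat
from pathlib import Path


def check_kaggle_token(path: Path) -> list:
    """Return a list of human-readable problems with the token file (hypothetical helper)."""
    problems = []
    if not path.is_file():
        problems.append(f"missing token file: {path}")
        return problems
    # Keep only the permission bits, then look for any group/other access.
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append(f"{path} is accessible by group/others (mode {oct(mode)}); run chmod 600")
    return problems


# Check the default token location mentioned above and report anything suspicious.
for problem in check_kaggle_token(Path.home() / ".kaggle" / "kaggle.json"):
    print(problem)
```

This only covers the first prerequisite. Whether you have joined the competition can't be verified locally, so a 403 from the download remains the signal for that one.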

Prepare the data

This one-off script downloads the IEEE-CIS data and builds a fixed, leakage-safe train/val split (100K train / 25K validation, split by time so the validation period is strictly later than training):

python prepare_data.py

It writes data/base_train_small.parquet and data/base_val_small.parquet (SHA-256 identical to the published case study). A quick sanity check should print the deterministic baseline:

python evaluate.py
# auc_roc: 0.909132
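The time-based split that prepare_data.py builds can be pictured with a small sketch. This is an illustration under assumptions, not the script's actual code: time_based_split is a hypothetical helper, and it assumes the split is keyed on the dataset's TransactionDT column.

```python
import pandas as pd


def time_based_split(df: pd.DataFrame, time_col: str, n_train: int, n_val: int):
    """Return (train, val) where every val row is strictly later than every train row."""
    ordered = df.sort_values(time_col, kind="stable").reset_index(drop=True)
    train = ordered.iloc[:n_train]
    val = ordered.iloc[n_train:n_train + n_val]
    # Drop validation rows that tie with the last training timestamp, so the
    # validation period starts strictly after training ends.
    cutoff = train[time_col].max()
    val = val[val[time_col] > cutoff]
    return train, val


# Toy frame standing in for the real transactions (values are made up).
toy = pd.DataFrame({
    "TransactionDT": [50, 10, 40, 20, 30, 60, 30],
    "TransactionAmt": [5.0, 1.0, 4.0, 2.0, 3.0, 6.0, 3.5],
})
train, val = time_based_split(toy, "TransactionDT", n_train=4, n_val=3)
print(train["TransactionDT"].max(), val["TransactionDT"].min())
```

Splitting on time rather than at random mirrors deployment, where the model only ever scores transactions that occur after the ones it was trained on.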

The optimization target

Weco edits two files. Each owns a separate scope, and the interface keeps them honest.

features.py — a FeatureBuilder with fit(X_train, y_train) and transform(X). fit() sees only the training split; transform() has no y argument, so it can never branch on labels:

# features.py — Weco optimizes this for the Features scope
import numpy as np
import pandas as pd


class FeatureBuilder:
    def fit(self, X_train: pd.DataFrame, y_train: pd.Series) -> "FeatureBuilder":
        # Fit frequency / group / target encoders on (X_train, y_train) only.
        # Stash state in self.* so transform() can apply it deterministically.
        ...
        return self

    def transform(self, X: pd.DataFrame) -> np.ndarray:
        # Apply self.* state to X. Called once each on X_train and X_val.
        # No `y` here — val labels can't leak into val features.
        ...

model.py — a single train_and_evaluate function that receives pre-built feature arrays and returns validation AUC:

# model.py — Weco optimizes this for the Model scope
from sklearn.metrics import roc_auc_score


def train_and_evaluate(X_train, y_train, X_val, y_val) -> float:
    """Train on (X_train, y_train); return AUC-ROC on (X_val, y_val).
    Tune LightGBM, switch model class, build ensembles — features arrive
    pre-built as ndarrays, so this scope can't re-engineer features."""
    ...
    return float(roc_auc_score(y_val, y_pred))

instructions.md (passed via --additional-instructions) carries the EDA + techniques domain knowledge from the case study, plus a guardrail against silent target leakage.

The evaluator (frozen)

evaluate.py is the API enforcement boundary — Weco never edits it. It loads the data, strips isFraud and TransactionID before any feature code runs, calls fit/transform/train_and_evaluate, and prints the metric line Weco parses:

# evaluate.py (frozen)
y_train = train_df["isFraud"].astype("int32")
y_val   = val_df["isFraud"].astype("int32")                     # used only for scoring
X_train = train_df.drop(columns=["isFraud", "TransactionID"])   # target stripped
X_val   = val_df.drop(columns=["isFraud", "TransactionID"])

fb = FeatureBuilder().fit(X_train, y_train)                     # val never seen by fit()
X_train_t, X_val_t = fb.transform(X_train), fb.transform(X_val)

auc = train_and_evaluate(X_train_t, y_train, X_val_t, y_val)
print(f"auc_roc: {auc:.6f}")
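The printed metric line is the only channel between the evaluator and the optimizer. The sketch below shows how such a line could be extracted from captured output. It is not Weco's actual parser: the regular expression and the parse_auc helper are assumptions for illustration.

```python
import re
from typing import Optional

# Matches a line such as "auc_roc: 0.909132" anywhere in multi-line output.
METRIC_LINE = re.compile(r"^auc_roc:\s*([0-9]*\.?[0-9]+)\s*$", re.MULTILINE)


def parse_auc(stdout: str) -> Optional[float]:
    """Return the last auc_roc value found in stdout, or None if there is none."""
    matches = METRIC_LINE.findall(stdout)
    return float(matches[-1]) if matches else None


sample = "loading data...\nauc_roc: 0.909132\n"
print(parse_auc(sample))
```

Keeping the metric on its own line in a fixed name: value format makes the output easy to parse regardless of what else the evaluation prints.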

Run Weco

Pick a scope. The full pipeline (recommended) lets Weco edit both files:

weco run --sources features.py model.py \
     --eval-command "python evaluate.py" \
     --metric auc_roc \
     --goal maximize \
     --steps 200 \
     --model gemini-3.1-pro-preview \
     --additional-instructions instructions.md \
     --eval-timeout 900 \
     --log-dir .runs/full

On Windows (Command Prompt):

weco run --sources features.py model.py ^
     --eval-command "python evaluate.py" ^
     --metric auc_roc ^
     --goal maximize ^
     --steps 200 ^
     --model gemini-3.1-pro-preview ^
     --additional-instructions instructions.md ^
     --eval-timeout 900 ^
     --log-dir .runs/full

Or in PowerShell:

weco run --sources features.py model.py `
     --eval-command "python evaluate.py" `
     --metric auc_roc `
     --goal maximize `
     --steps 200 `
     --model gemini-3.1-pro-preview `
     --additional-instructions instructions.md `
     --eval-timeout 900 `
     --log-dir .runs/full

To isolate where the gains come from, run a single scope instead:

Features only — --sources features.py (model stays at the baseline LightGBM).
Model only — --sources model.py (features frozen at the baseline FeatureBuilder). Headroom is small (~+0.008 AUC) — model tuning isn't where the wins live for tabular fraud.

Explanation

--sources features.py model.py: the two files Weco may edit (one at a time or together).
--eval-command "python evaluate.py": the frozen evaluator; Weco parses auc_roc: 0.xxxxxx from its output.
--metric auc_roc --goal maximize: maximize validation AUC-ROC.
--steps 200: optimization iterations.
--model gemini-3.1-pro-preview: the LLM driving the optimization.
--additional-instructions instructions.md: the EDA + techniques domain prompt.
--eval-timeout 900: per-evaluation timeout (seconds); a full train+score pass on 100K rows can take a few minutes.

Why the fit/transform API

The original case study used a single build_features(train_df, val_df) function, so the agent could call pd.concat([train_df, val_df]) and silently introduce time leakage. We measured 0.001–0.005 AUC of inflation, and even explicit "fit on train only" warnings didn't reliably stop it. This interface kills both leakage flavors, target and time, at the boundary:

Leakage path	Killed by
`isFraud` in cross-column aggregations	`evaluate.py` strips `isFraud` before `X` reaches `FeatureBuilder`
`pd.concat([train_df, val_df])` for groupby / frequency	`val_df` is never passed to `fit()`
Val labels at predict time	`transform(X)` has no `y` argument

Weco can't write the leaky pattern because the leaky symbols literally aren't in scope.
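The concat row in the table above is the subtle one, so here is a toy illustration of why it matters. It uses only the standard library. The card IDs and the frequency_encoder helper are hypothetical and are not taken from the case study.

```python
from collections import Counter


def frequency_encoder(values):
    """Map each distinct value to its relative frequency in `values`."""
    counts = Counter(values)
    total = len(values)
    return {key: n / total for key, n in counts.items()}


train_cards = ["A", "A", "B", "C"]   # earlier period
val_cards = ["C", "C", "C", "D"]     # later period

safe = frequency_encoder(train_cards)               # what fit(X_train, ...) can compute
leaky = frequency_encoder(train_cards + val_cards)  # what concat-then-encode computes

# Card "C" becomes common only in the later period; the leaky encoder already knows that.
print("C:", safe["C"], "vs", leaky["C"])
# Card "D" does not exist yet at training time; the leaky encoder has seen it anyway.
print("D:", safe.get("D", 0.0), "vs", leaky.get("D", 0.0))
```

The leaky encoder bakes future frequencies into the training features. That is the kind of time leakage the strict API rules out by never passing validation rows to fit().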

An earlier single-file variant (build_features(train_df, val_df) in one train.py) is kept for comparison at examples/fraud-detection-loose. It's not recommended for new work because it admits the time-leakage this strict API prevents.

What's Next?

Another tabular task: try Model Development on Kaggle's Spaceship Titanic.
Better evaluation scripts: learn Writing Good Evaluation Scripts.
All command options: check the CLI Reference.
More examples: browse all Examples.
