Fraud Detection (IEEE-CIS)
Optimize a tabular fraud-detection pipeline on real Vesta payment transactions, with a leakage-safe fit/transform API
This example uses Weco to optimize a tabular fraud-detection pipeline on the IEEE-CIS Fraud Detection Kaggle dataset (real Vesta payment transactions). Weco rewrites two files — features.py (feature engineering) and model.py (the model) — to maximize AUC-ROC on a held-out, time-based validation split.
The interface is designed so that train/val leakage is impossible by construction: the validation split is never visible to feature-fitting code, and the target column is stripped before any feature sees it. See Why the fit/transform API below.
You can follow along here or check out the complete files here.
Baseline AUC is 0.9091 (deterministic: run python prepare_data.py, then python evaluate.py). With the bundled instructions.md and 200 steps of gemini-3.1-pro-preview, expect AUC in the 0.928–0.933 range. These numbers reproduce Weco's published fraud-detection case study.
Setup
If you haven't installed Weco yet, follow the Installation guide, or install the CLI directly:
pipx install weco
Other installation methods
Shell script (macOS/Linux):
curl -fsSL https://weco.ai/install.sh | sh
PowerShell (Windows):
irm https://weco.ai/install.ps1 | iex
pip:
pip install weco
From source:
git clone https://github.com/wecoai/weco-cli.git
cd weco-cli
pip install -e .
Clone the repo and install the example's dependencies:
git clone https://github.com/WecoAI/weco-cli.git
cd weco-cli/examples/fraud-detection
pip install --upgrade -r requirements.txt
Kaggle prerequisites. prepare_data.py downloads the competition data, which requires:
- A Kaggle API token at ~/.kaggle/kaggle.json (chmod 600 ~/.kaggle/kaggle.json). See the Kaggle API docs.
- Joining the competition at kaggle.com/c/ieee-fraud-detection (click Late Submission / Join Competition to accept the rules). Without this, the download returns a 403.
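The token prerequisite can be checked locally before running the download. The following is a minimal sketch, not part of the example repo: check_kaggle_token is a hypothetical helper, and it assumes the token file is JSON with "username" and "key" fields. It cannot detect the 403 case, which depends on having joined the competition.

```python
import json
import stat
from pathlib import Path


def check_kaggle_token(path: Path) -> list[str]:
    """Return a list of problems with a Kaggle token file (empty if none found)."""
    if not path.is_file():
        return [f"{path} does not exist"]
    problems = []
    # chmod 600 means no permission bits are set for group or others.
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append(f"{path} has mode {oct(mode)}; run chmod 600 on it")
    try:
        token = json.loads(path.read_text())
    except json.JSONDecodeError:
        problems.append(f"{path} is not valid JSON")
        return problems
    # Assumption: the token is a JSON object with "username" and "key".
    if not isinstance(token, dict):
        problems.append(f"{path} does not contain a JSON object")
        return problems
    for field in ("username", "key"):
        if field not in token:
            problems.append(f"{path} is missing the {field!r} field")
    return problems
```

Calling check_kaggle_token(Path.home() / ".kaggle" / "kaggle.json") before python prepare_data.py surfaces a missing or over-permissive token early.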
Prepare the data
This one-off script downloads the IEEE-CIS data and builds a fixed, leakage-safe train/val split (100K train / 25K validation, split by time so the validation period is strictly later than training):
python prepare_data.py
It writes data/base_train_small.parquet and data/base_val_small.parquet (SHA-256 identical to the published case study). A quick sanity check should print the deterministic baseline:
python evaluate.py
# auc_roc: 0.909132
The optimization target
Weco edits two files. Each owns a separate scope, and the interface keeps them honest.
features.py: a FeatureBuilder with fit(X_train, y_train) and transform(X). fit() sees only the training split; transform() has no y argument, so it can never branch on labels:
# features.py: Weco optimizes this for the Features scope
import numpy as np
import pandas as pd
class FeatureBuilder:
    def fit(self, X_train: pd.DataFrame, y_train: pd.Series) -> "FeatureBuilder":
        # Fit frequency / group / target encoders on (X_train, y_train) only.
        # Stash state in self.* so transform() can apply it deterministically.
        ...
        return self
    def transform(self, X: pd.DataFrame) -> np.ndarray:
        # Apply self.* state to X. Called once each on X_train and X_val.
        # No `y` here, so val labels can't leak into val features.
        ...
model.py: a single train_and_evaluate function that receives pre-built feature arrays and returns validation AUC:
# model.py: Weco optimizes this for the Model scope
from sklearn.metrics import roc_auc_score
def train_and_evaluate(X_train, y_train, X_val, y_val) -> float:
    """Train on (X_train, y_train); return AUC-ROC on (X_val, y_val).
    Tune LightGBM, switch model class, build ensembles. Features arrive
    pre-built as ndarrays, so this scope can't re-engineer features."""
    ...
    # y_pred holds the predicted fraud scores for X_val.
    return float(roc_auc_score(y_val, y_pred))
instructions.md (passed via --additional-instructions) carries the EDA findings and technique notes from the case study, plus a guardrail against silent target leakage.
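To make the FeatureBuilder contract in this section concrete, here is a minimal sketch of one possible implementation. It is not the baseline shipped in the repo: it frequency-encodes non-numeric columns and passes numeric columns through. The toy columns amt and card and the "__missing__" sentinel are hypothetical.

```python
import numpy as np
import pandas as pd


class FeatureBuilder:
    """Illustrative only: frequency-encode categoricals, pass numerics through."""

    def fit(self, X_train: pd.DataFrame, y_train: pd.Series) -> "FeatureBuilder":
        # y_train is available here (e.g. for target encoding); unused in this sketch.
        self.num_cols_ = list(X_train.select_dtypes(include="number").columns)
        self.cat_cols_ = [c for c in X_train.columns if c not in self.num_cols_]
        # Relative frequency of each category, measured on the training split only.
        self.freq_ = {
            c: self._clean(X_train[c]).value_counts(normalize=True)
            for c in self.cat_cols_
        }
        return self

    def transform(self, X: pd.DataFrame) -> np.ndarray:
        out = X[self.num_cols_].astype("float64").copy()
        for c in self.cat_cols_:
            # Categories never seen during fit() map to frequency 0.
            mapped = self._clean(X[c]).map(self.freq_[c])
            out[f"{c}_freq"] = mapped.fillna(0.0).astype("float64")
        return out.to_numpy()

    @staticmethod
    def _clean(col: pd.Series) -> pd.Series:
        # Treat missing values as their own category (hypothetical sentinel).
        return col.astype("object").fillna("__missing__")


# Toy data with hypothetical column names.
X_train = pd.DataFrame({"amt": [10.0, 20.0, 30.0, 40.0], "card": ["a", "a", "b", "a"]})
y_train = pd.Series([0, 0, 1, 0])
X_val = pd.DataFrame({"amt": [50.0], "card": ["c"]})  # "c" never appears in training

fb = FeatureBuilder().fit(X_train, y_train)
print(fb.transform(X_train))  # card "a" -> 0.75, card "b" -> 0.25
print(fb.transform(X_val))    # unseen card "c" -> 0.0
```

Because all state is learned in fit() and only applied in transform(), the validation rows are encoded with training-period statistics, which is the property the interface is built to guarantee.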
The evaluator (frozen)
evaluate.py is the API enforcement boundary — Weco never edits it. It loads the data, strips isFraud and TransactionID before any feature code runs, calls fit/transform/train_and_evaluate, and prints the metric line Weco parses:
# evaluate.py (frozen)
y_train = train_df["isFraud"].astype("int32")
y_val = val_df["isFraud"].astype("int32")  # assumed to mirror y_train
X_train = train_df.drop(columns=["isFraud", "TransactionID"])  # target stripped
X_val = val_df.drop(columns=["isFraud", "TransactionID"])
fb = FeatureBuilder().fit(X_train, y_train)  # val never seen by fit()
X_train_t, X_val_t = fb.transform(X_train), fb.transform(X_val)
auc = train_and_evaluate(X_train_t, y_train, X_val_t, y_val)
print(f"auc_roc: {auc:.6f}")
Run Weco
Pick a scope. The full pipeline (recommended) lets Weco edit both files:
weco run --sources features.py model.py \
--eval-command "python evaluate.py" \
--metric auc_roc \
--goal maximize \
--steps 200 \
--model gemini-3.1-pro-preview \
--additional-instructions instructions.md \
--eval-timeout 900 \
--log-dir .runs/full
On Windows (Command Prompt):
weco run --sources features.py model.py ^
--eval-command "python evaluate.py" ^
--metric auc_roc ^
--goal maximize ^
--steps 200 ^
--model gemini-3.1-pro-preview ^
--additional-instructions instructions.md ^
--eval-timeout 900 ^
--log-dir .runs/full
Or in PowerShell:
weco run --sources features.py model.py `
--eval-command "python evaluate.py" `
--metric auc_roc `
--goal maximize `
--steps 200 `
--model gemini-3.1-pro-preview `
--additional-instructions instructions.md `
--eval-timeout 900 `
--log-dir .runs/full
To isolate where the gains come from, run a single scope instead:
- Features only: --sources features.py (model stays at the baseline LightGBM).
- Model only: --sources model.py (features frozen at the baseline FeatureBuilder). Headroom is small (about +0.008 AUC); model tuning isn't where the wins live for tabular fraud.
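The three scopes differ only in what is passed to --sources, so the commands can be generated from one template. The sketch below assumes the remaining flags stay the same as in the full-pipeline command, which the text does not state explicitly. The per-scope log directories (.runs/features, .runs/model) and the helper weco_command are hypothetical.

```python
import shlex

# Flags taken from the full-pipeline command; assumed unchanged for single scopes.
COMMON_FLAGS = [
    "--eval-command", "python evaluate.py",
    "--metric", "auc_roc",
    "--goal", "maximize",
    "--steps", "200",
    "--model", "gemini-3.1-pro-preview",
    "--additional-instructions", "instructions.md",
    "--eval-timeout", "900",
]

SCOPES = {
    "full": ["features.py", "model.py"],
    "features": ["features.py"],
    "model": ["model.py"],
}


def weco_command(scope: str) -> list[str]:
    """Build the argv list for one scope (suitable for subprocess.run)."""
    sources = SCOPES[scope]
    # Assumption: one log directory per scope, following the .runs/full pattern.
    return ["weco", "run", "--sources", *sources, *COMMON_FLAGS,
            "--log-dir", f".runs/{scope}"]


for scope in SCOPES:
    # shlex.join quotes arguments that contain spaces for a POSIX shell.
    print(shlex.join(weco_command(scope)))
```

Keeping each scope in its own log directory makes it easier to compare the features-only and model-only runs against the full pipeline afterwards.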
Explanation
- --sources features.py model.py: the two files Weco may edit (one at a time or together).
- --eval-command "python evaluate.py": the frozen evaluator; Weco parses auc_roc: 0.xxxxxx from its output.
- --metric auc_roc --goal maximize: maximize validation AUC-ROC.
- --steps 200: optimization iterations.
- --model gemini-3.1-pro-preview: the LLM driving the optimization.
- --additional-instructions instructions.md: the EDA + techniques domain prompt.
- --eval-timeout 900: per-evaluation timeout (seconds); a full train+score pass on 100K rows can take a few minutes.
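Since the run depends on the evaluator printing a metric line in the form auc_roc: 0.xxxxxx, it can help to confirm that the output is machine-readable before starting 200 steps. The following is a minimal sketch of such a check. It is not Weco's actual parser, and parse_auc and its regular expression are hypothetical.

```python
import re

# Matches a line such as "auc_roc: 0.909132" on its own line.
METRIC_LINE = re.compile(r"^auc_roc:\s*([0-9]*\.?[0-9]+)\s*$", re.MULTILINE)


def parse_auc(stdout: str) -> float:
    """Return the last auc_roc value printed, or raise if no such line exists."""
    matches = METRIC_LINE.findall(stdout)
    if not matches:
        raise ValueError("no 'auc_roc: <value>' line found in evaluator output")
    return float(matches[-1])


sample = "loading data...\nauc_roc: 0.909132\n"
print(parse_auc(sample))  # 0.909132
```

Feeding it the captured stdout of python evaluate.py gives a quick signal that edits to features.py or model.py have not swallowed or reformatted the metric line.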
Why the fit/transform API
The original case study used a single build_features(train_df, val_df) function, so the agent could call pd.concat([train_df, val_df]) and silently introduce time leakage. We measured 0.001–0.005 AUC of inflation, and even explicit "fit on train only" warnings didn't reliably stop it. This interface closes both leakage flavors, target and time, at the boundary:
| Leakage path | Killed by |
|---|---|
| isFraud in cross-column aggregations | evaluate.py strips isFraud before X reaches FeatureBuilder |
| pd.concat([train_df, val_df]) for groupby / frequency | val_df is never passed to fit() |
| Val labels at predict time | transform(X) has no y argument |
Weco can't write the leaky pattern because the leaky symbols literally aren't in scope.
An earlier single-file variant (build_features(train_df, val_df) in one train.py) is kept for comparison at examples/fraud-detection-loose. It's not recommended for new work because it admits the time-leakage this strict API prevents.
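A toy illustration of the pd.concat path may help show why it counts as leakage. The sketch below uses plain Python and hypothetical card identifiers, not the real dataset. It compares a frequency table fit on the training period only with one fit on training and validation combined.

```python
from collections import Counter


def frequency_table(values):
    """Map each value to its relative frequency in `values`."""
    counts = Counter(values)
    total = len(values)
    return {value: n / total for value, n in counts.items()}


# Hypothetical card identifiers; the validation period is later in time.
train_cards = ["a", "a", "a", "b"]
val_cards = ["b", "b", "c", "c"]

# Strict API: statistics come from the training split only.
strict = frequency_table(train_cards)
# Loose API: concatenating the splits lets validation-period counts shape the feature.
leaky = frequency_table(train_cards + val_cards)

for card in ["a", "b", "c"]:
    print(card, strict.get(card, 0.0), leaky.get(card, 0.0))
# a 0.75 0.375
# b 0.25 0.375
# c 0.0 0.25
```

In the combined table, card "c" receives a nonzero frequency even though it only appears in the later period, so the feature encodes information that would not exist at training time. With fit() restricted to X_train, that table cannot be built.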
What's Next?
- Another tabular task: try Model Development on Kaggle's Spaceship Titanic.
- Better evaluation scripts: learn Writing Good Evaluation Scripts.
- All command options: check the CLI Reference.
- More examples: browse all Examples.