Optimize a VLM function that extracts tabular data from chart images
This example shows how Weco can optimize an AI workflow that extracts tabular data from chart images using a Vision Language Model (VLM).
You can follow along here or check out the files directly from here.
Prerequisites
If you haven't already, follow the Installation guide to install the Weco CLI, or install it directly with pip:
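```bash
pip install weco
```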
You'll also need:
- Python 3.9+
- `uv` installed (see https://docs.astral.sh/uv/)
- An OpenAI API key in your environment:
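```bash
export OPENAI_API_KEY="your-api-key-here"
```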
Prepare the Data
The example uses a subset of line charts from the ChartQA dataset. First, prepare the data:
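```bash
# The script name here is illustrative; run the data-preparation script shipped with the example
uv run prepare_data.py
```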
This script:
- Downloads the ChartQA dataset snapshot
- Produces a 100-sample subset of line charts in `subset_line_100/` with:
  - `index.csv`: mapping of example IDs to image and ground-truth table paths
  - `images/`: chart images (PNG/JPEG)
  - `tables/`: ground-truth CSV tables
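The prepared folder then looks like this:

```
subset_line_100/
├── index.csv   # example IDs → image and table paths
├── images/     # chart images (PNG/JPEG)
└── tables/     # ground-truth CSV tables
```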
Specify the file to be optimized
Point Weco to `optimize.py`, which contains the baseline VLM function that Weco will optimize. This file includes:
- `VLMExtractor.image_to_csv()`: Main function that takes an image path and returns CSV text
- `build_prompt()`: Prompt template that instructs the VLM to extract data
- `clean_to_csv()`: Post-processing function to clean the output
Weco will edit this file during optimization, focusing on improving the prompt and extraction logic.
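As a rough sketch, the file's structure looks something like this (a hypothetical skeleton; the actual file in the example is more complete):

```python
# Hypothetical skeleton of optimize.py, matching the pieces described above.
class VLMExtractor:
    def image_to_csv(self, image_path: str) -> str:
        """Send a chart image to the VLM and return the extracted table as CSV text."""
        prompt = build_prompt()
        raw_output = call_vlm(image_path, prompt)
        return clean_to_csv(raw_output)


def build_prompt() -> str:
    """Prompt template instructing the VLM to extract the chart's underlying data."""
    ...


def call_vlm(image_path: str, prompt: str) -> str:
    """Stand-in for the actual VLM request (e.g., an OpenAI vision call)."""
    ...


def clean_to_csv(raw_output: str) -> str:
    """Post-process raw model output into clean CSV text."""
    ...
```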
Create the Evaluation Script
The `eval.py` script evaluates the VLM extraction performance:
- Loads the prepared dataset from `subset_line_100/`
- Calls `VLMExtractor.image_to_csv()` for each chart image
- Compares predicted CSV tables to ground truth using a similarity metric
- Writes predictions to the `predictions/` directory
- Prints progress and a final `accuracy:` line that Weco reads
The evaluation metric combines:
- Header match (20% weight): Exact match of column headers
- Content similarity (80% weight): Jaccard-based similarity of data rows using SMAPE (Symmetric Mean Absolute Percentage Error) for numeric values
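To make the weighting concrete, here is a minimal sketch of how such a score could be combined (illustrative, not the exact implementation in `eval.py`):

```python
def smape_similarity(pred: float, true: float) -> float:
    """Convert SMAPE into a similarity in [0, 1] for one pair of numeric values."""
    denom = abs(pred) + abs(true)
    if denom == 0:
        return 1.0  # both values are zero: treat as a perfect match
    smape = 2 * abs(pred - true) / denom  # SMAPE lies in [0, 2]
    return 1.0 - smape / 2


def combined_score(header_match: float, content_similarity: float) -> float:
    """Weighted combination described above: 20% headers, 80% content."""
    return 0.2 * header_match + 0.8 * content_similarity
```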
Key configuration options:
- `--max-samples`: Number of samples to evaluate (default: 100)
- `--num-workers`: Parallel workers for concurrent VLM calls (default: 4)
- `--visualize-dir`: Optional directory to save comparison plots
Run a baseline evaluation:
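```bash
# Flags shown with their documented defaults
python eval.py --max-samples 100 --num-workers 4
```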
This writes predicted CSVs to `predictions/` and prints a final line like `accuracy: 0.32`.
Run Weco
Now run Weco to iteratively improve the extraction function:
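```bash
# Sketch of the invocation; the eval command string is illustrative
weco run \
  --source optimize.py \
  --eval-command "python eval.py --max-samples 100" \
  --metric accuracy \
  --goal maximize \
  --steps 20 \
  --model gpt-5
```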
Arguments:
- `--source optimize.py`: File that Weco will edit to improve results
- `--eval-command '…'`: Command Weco executes to measure the metric
- `--metric accuracy`: Weco parses `accuracy: <value>` from `eval.py` output
- `--goal maximize`: Higher accuracy is better
- `--steps 20`: Number of optimization iterations
- `--model gpt-5`: Model used by Weco to propose edits
During each evaluation round, you will see log lines similar to:
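```
# Illustrative output; the exact format is defined by eval.py
[5/100] avg score: 0.310 | elapsed: 12.4s
[10/100] avg score: 0.325 | elapsed: 25.1s
...
accuracy: 0.32
```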
Weco then mutates the prompt and extraction logic in `optimize.py`, tries again, and gradually pushes the accuracy higher.
How it works
- The evaluation script loads images and ground truth tables from the prepared dataset
- It sends VLM calls in parallel via `ThreadPoolExecutor`, hiding network latency (see the sketch after this list)
- Every 5 completed items, the script logs progress with the current average score and elapsed time
- The final `accuracy: <value>` line is parsed by Weco for guidance
- The metric includes a cost cap: if the average cost per query exceeds $0.02, accuracy is set to 0.0
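A condensed sketch of that evaluation loop, assuming a hypothetical `score_one()` helper that returns a `(score, cost)` pair for each example:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time


def score_one(example):
    """Placeholder: run VLMExtractor.image_to_csv() on the example's image and
    score the prediction against ground truth. Returns (score, cost)."""
    return 0.0, 0.0  # stub values; the real script computes table similarity and API cost


def evaluate(examples, num_workers=4):
    """Hypothetical condensed version of the loop in eval.py."""
    start = time.time()
    scores, total_cost = [], 0.0
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(score_one, ex) for ex in examples]
        for done, future in enumerate(as_completed(futures), start=1):
            score, cost = future.result()
            scores.append(score)
            total_cost += cost
            if done % 5 == 0:  # progress line every 5 completed items
                avg = sum(scores) / len(scores)
                print(f"[{done}/{len(examples)}] avg score: {avg:.3f} | elapsed: {time.time() - start:.1f}s")
    accuracy = sum(scores) / len(scores)
    if total_cost / len(examples) > 0.02:  # cost cap on average cost per query
        accuracy = 0.0
    print(f"accuracy: {accuracy:.4f}")  # final line parsed by Weco
```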
Tips
- Adjust `--num-workers` to balance throughput and rate limits (50 workers works well for larger datasets)
- You can tweak baseline behavior in `optimize.py` (prompt, temperature, model); Weco will explore modifications automatically
- Use `--visualize-dir` to generate comparison plots showing ground truth vs. predictions
- For faster iteration during development, reduce `--max-samples` to 10-20 samples
What's Next?
- Different optimization types: Try Model Development for ML workflows or GPU optimization with CUDA and Triton
- Better evaluation scripts: Learn Writing Good Evaluation Scripts
- All command options: Check the CLI Reference
- More examples: Browse all Examples