Optimize a VLM function that extracts tabular data from chart images
This example shows how Weco can optimize an AI workflow that extracts tabular data from chart images using a Vision Language Model (VLM).
You can follow along here or check out the files directly from here.
Prerequisites
If you haven't already, follow the Installation guide to install the Weco CLI, or install it directly with pip:
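```bash
pip install weco
```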
You'll also need:
- Python 3.9+
- `uv` installed (see https://docs.astral.sh/uv/)
- An OpenAI API key in your environment:
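```bash
export OPENAI_API_KEY="your-api-key-here"
```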
Prepare the Data
The example uses a subset of line charts from the ChartQA dataset. First, prepare the data:
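```bash
# The script name here is illustrative; run the data-preparation script shipped with the example
uv run prepare_data.py
```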
This script:
- Downloads the ChartQA dataset snapshot
- Produces a 100-sample subset of line charts in `subset_line_100/` with:
  - `index.csv`: mapping of example IDs to image and ground-truth table paths
  - `images/`: chart images (PNG/JPEG)
  - `tables/`: ground-truth CSV tables
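The prepared folder then looks like this:

```
subset_line_100/
├── index.csv   # example IDs → image and table paths
├── images/     # chart images (PNG/JPEG)
└── tables/     # ground-truth CSV tables
```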
Specify the file to be optimized
Point Weco to `optimize.py`, which contains the baseline VLM function that Weco will optimize. This file includes:
- `VLMExtractor.image_to_csv()`: Main function that takes an image path and returns CSV text
- `build_prompt()`: Prompt template that instructs the VLM to extract data
- `clean_to_csv()`: Post-processing function to clean the output
Weco will edit this file during optimization, focusing on improving the prompt and extraction logic.
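As a rough sketch, the file's structure looks something like this (a hypothetical skeleton; the actual file in the example is more complete):

```python
# Hypothetical skeleton of optimize.py, matching the pieces described above.
class VLMExtractor:
    def image_to_csv(self, image_path: str) -> str:
        """Send a chart image to the VLM and return the extracted table as CSV text."""
        prompt = build_prompt()
        raw_output = call_vlm(image_path, prompt)
        return clean_to_csv(raw_output)


def build_prompt() -> str:
    """Prompt template instructing the VLM to extract the chart's underlying data."""
    ...


def call_vlm(image_path: str, prompt: str) -> str:
    """Stand-in for the actual VLM request (e.g., an OpenAI vision call)."""
    ...


def clean_to_csv(raw_output: str) -> str:
    """Post-process raw model output into clean CSV text."""
    ...
```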
Create the Evaluation Script
The `eval.py` script evaluates the VLM extraction performance:
- Loads the prepared dataset from `subset_line_100/`
- Calls `VLMExtractor.image_to_csv()` for each chart image
- Compares predicted CSV tables to ground truth using a similarity metric
- Writes predictions to the `predictions/` directory
- Prints progress and a final `accuracy:` line that Weco reads
The evaluation metric combines:
- Header match (20% weight): Exact match of column headers
- Content similarity (80% weight): Jaccard-based similarity of data rows using SMAPE (Symmetric Mean Absolute Percentage Error) for numeric values
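To make the weighting concrete, here is a minimal sketch of how such a score could be combined (illustrative, not the exact implementation in `eval.py`):

```python
def smape_similarity(pred: float, true: float) -> float:
    """Convert SMAPE into a similarity in [0, 1] for one pair of numeric values."""
    denom = abs(pred) + abs(true)
    if denom == 0:
        return 1.0  # both values are zero: treat as a perfect match
    smape = 2 * abs(pred - true) / denom  # SMAPE lies in [0, 2]
    return 1.0 - smape / 2


def combined_score(header_match: float, content_similarity: float) -> float:
    """Weighted combination described above: 20% headers, 80% content."""
    return 0.2 * header_match + 0.8 * content_similarity
```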
Key configuration options:
- `--max-samples`: Number of samples to evaluate (default: 100)
- `--num-workers`: Parallel workers for concurrent VLM calls (default: 4)
- `--visualize-dir`: Optional directory to save comparison plots
Run a baseline evaluation:
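```bash
# Flags shown with their documented defaults
python eval.py --max-samples 100 --num-workers 4
```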
This writes predicted CSVs to `predictions/` and prints a final line like `accuracy: 0.32`.
Run Weco
Now run Weco to iteratively improve the extraction function:
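```bash
# Sketch of the invocation; the eval command string is illustrative
weco run \
  --source optimize.py \
  --eval-command "python eval.py --max-samples 100" \
  --metric accuracy \
  --goal maximize \
  --steps 20 \
  --model gpt-5
```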
Arguments:
- `--source optimize.py`: File that Weco will edit to improve results
- `--eval-command '…'`: Command Weco executes to measure the metric
- `--metric accuracy`: Weco parses `accuracy: <value>` from `eval.py` output
- `--goal maximize`: Higher accuracy is better
- `--steps 20`: Number of optimization iterations
- `--model gpt-5`: Model used by Weco to propose edits
During each evaluation round, you will see log lines similar to:
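```
# Illustrative output; the exact format is defined by eval.py
[5/100] avg score: 0.310 | elapsed: 12.4s
[10/100] avg score: 0.325 | elapsed: 25.1s
...
accuracy: 0.32
```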
Weco then mutates the prompt and extraction logic in `optimize.py`, tries again, and gradually pushes the accuracy higher.
How it works
- The evaluation script loads images and ground truth tables from the prepared dataset
- It sends VLM calls in parallel via `ThreadPoolExecutor`, hiding network latency (see the sketch after this list)
- Every 5 completed items, the script logs progress with the current average score and elapsed time
- The final `accuracy: <value>` line is parsed by Weco for guidance
- The metric includes a cost cap: if the average cost per query exceeds $0.02, accuracy is set to 0.0
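A condensed sketch of that evaluation loop, assuming a hypothetical `score_one()` helper that returns a `(score, cost)` pair for each example:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time


def score_one(example):
    """Placeholder: run VLMExtractor.image_to_csv() on the example's image and
    score the prediction against ground truth. Returns (score, cost)."""
    return 0.0, 0.0  # stub values; the real script computes table similarity and API cost


def evaluate(examples, num_workers=4):
    """Hypothetical condensed version of the loop in eval.py."""
    start = time.time()
    scores, total_cost = [], 0.0
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(score_one, ex) for ex in examples]
        for done, future in enumerate(as_completed(futures), start=1):
            score, cost = future.result()
            scores.append(score)
            total_cost += cost
            if done % 5 == 0:  # progress line every 5 completed items
                avg = sum(scores) / len(scores)
                print(f"[{done}/{len(examples)}] avg score: {avg:.3f} | elapsed: {time.time() - start:.1f}s")
    accuracy = sum(scores) / len(scores)
    if total_cost / len(examples) > 0.02:  # cost cap on average cost per query
        accuracy = 0.0
    print(f"accuracy: {accuracy:.4f}")  # final line parsed by Weco
```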
Tips
- Adjust `--num-workers` to balance throughput and rate limits (50 workers works well for larger datasets)
- You can tweak baseline behavior in `optimize.py` (prompt, temperature, model); Weco will explore modifications automatically
- Use `--visualize-dir` to generate comparison plots showing ground truth vs. predictions
- For faster iteration during development, reduce `--max-samples` to 10-20 samples
What's Next?
- Different optimization types: Try Model Development for ML workflows or GPU optimization with CUDA and Triton
- Better evaluation scripts: Learn Writing Good Evaluation Scripts
- All command options: Check the CLI Reference
- More examples: Browse all Examples