Prompt Engineering
Iteratively improve a prompt for solving AIME problems
This example shows how Weco can iteratively improve a prompt for solving American Invitational Mathematics Examination (AIME) problems. The experiment runs locally, requires only two short Python files and a prompt guide, and aims to improve the accuracy metric.
You can follow along with this guide or check out the example files directly from the repository.
Setup
If you haven't already, follow the Installation guide to install the Weco CLI. Otherwise, install the CLI using pip:
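```bash
pip install weco
```

The package name `weco` is assumed here; the Installation guide has the authoritative command.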
This example uses `gpt-4.1-mini` via the OpenAI API by default. Ensure your `OPENAI_API_KEY` environment variable is set. You can create a key here.
Install the dependencies of the scripts shown in subsequent sections.
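The scripts below rely on the `openai` and `datasets` packages; assuming those are the only third-party imports you use, something like the following covers it:

```bash
pip install openai datasets
```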
Create the Optimization Target
Create a file named `optimize.py`. This file holds the prompt template (instructing the LLM to reason step-by-step and use `\boxed{}` for the final answer) and the mutable `EXTRA_INSTRUCTIONS` string. Weco edits only this file during the search.
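A minimal sketch of what `optimize.py` could look like; the prompt wording and the `solve()` signature are assumptions for illustration, not the canonical example file:

```python
# optimize.py -- a minimal sketch; the prompt wording and the solve()
# signature are assumptions, not the canonical example file.

# Weco is only allowed to edit this string during the search.
EXTRA_INSTRUCTIONS = ""

# Fixed prompt template: ask for step-by-step reasoning and a \boxed{} answer.
PROMPT_TEMPLATE = (
    "You are an expert competition mathematician.\n"
    "Solve the following AIME problem. Reason step by step and put the final\n"
    "integer answer inside \\boxed{{}}.\n"
    "{extra_instructions}\n"
    "\n"
    "Problem:\n"
    "{problem}"
)


def solve(problem: str, client, model: str) -> str:
    """Ask the LLM to solve one AIME problem and return its raw text output."""
    prompt = PROMPT_TEMPLATE.format(
        extra_instructions=EXTRA_INSTRUCTIONS, problem=problem
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```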
Create the Evaluation Script
Create a file named `eval.py`. This script downloads a small slice of the 2024 AIME dataset, calls `optimize.solve` in parallel, parses the LLM output (looking for `\boxed{}`), compares it to the ground truth, prints progress logs, and finally prints an `accuracy:` line that Weco reads. It also defines the LLM model to use (`MODEL_TO_USE`).
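A sketch of `eval.py` under the same assumptions; the dataset field names (`Problem`, `Answer`), the worker count, and the answer-parsing regex are illustrative:

```python
# eval.py -- a sketch assuming the Maxwell-Jia/AIME_2024 dataset exposes
# "Problem" and "Answer" fields; adjust the field names if they differ.
import re
import time
from concurrent.futures import ThreadPoolExecutor

from datasets import load_dataset
from openai import OpenAI

import optimize

MODEL_TO_USE = "gpt-4.1-mini"  # LLM used to answer the problems
client = OpenAI()


def extract_answer(text: str):
    """Pull the integer out of the last \\boxed{...} in the model output."""
    matches = re.findall(r"\\boxed\{(\d+)\}", text or "")
    return int(matches[-1]) if matches else None


def main():
    # Twenty problems keep each evaluation round fast; change the slice here.
    dataset = load_dataset("Maxwell-Jia/AIME_2024", split="train[:20]")
    problems = [row["Problem"] for row in dataset]
    answers = [int(row["Answer"]) for row in dataset]

    start, correct = time.time(), 0
    with ThreadPoolExecutor(max_workers=8) as pool:
        outputs = pool.map(
            lambda p: optimize.solve(p, client, MODEL_TO_USE), problems
        )
        for i, (output, truth) in enumerate(zip(outputs, answers), start=1):
            if extract_answer(output) == truth:
                correct += 1
            if i % 5 == 0:
                print(f"{i} problems done, {time.time() - start:.1f}s elapsed")

    # Weco reads this final line to guide the search.
    print(f"accuracy: {correct / len(problems):.3f}")


if __name__ == "__main__":
    main()
```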
(Optional) Create Additional Instructions Guide
You can optionally create a `prompt_guide.md` file to provide more context or specific instructions to Weco during the optimization process.
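The guide is free-form Markdown; a short illustrative version (wording assumed) might be:

```markdown
# Prompt optimization guide

- Keep the instruction to put the final answer in \boxed{} intact.
- Only edit the EXTRA_INSTRUCTIONS string; leave the rest of optimize.py unchanged.
- Favor concise additions that encourage careful, step-by-step arithmetic checks.
```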
Run Weco
Now run Weco to optimize your prompt:
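```bash
weco run --source optimize.py \
     --eval-command "python eval.py" \
     --metric accuracy \
     --goal maximize \
     --steps 20 \
     --model gemini-2.5-pro-exp-03-25 \
     --additional-instructions prompt_guide.md
```

The flag names above follow the Weco examples; the step count and file names are placeholders for this walkthrough, so check the Weco CLI documentation for the authoritative options.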
Note: You can replace `--model gemini-2.5-pro-exp-03-25` with another powerful model such as `o3`, provided you have the corresponding API key set.
During each evaluation round, you will see log lines similar to the following:
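```text
5 problems done, 42.3s elapsed
10 problems done, 81.7s elapsed
15 problems done, 120.4s elapsed
20 problems done, 158.9s elapsed
accuracy: 0.450
```

The values shown are illustrative; the exact numbers depend on the model and the run.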
Weco then mutates the prompt instructions in `optimize.py`, tries again, and gradually pushes the accuracy higher.
How it works
- `eval.py` slices the Maxwell-Jia/AIME_2024 dataset to twenty problems for fast feedback. You can change the slice in one line within the script (see the snippet after this list).
- The script sends model calls in parallel via `ThreadPoolExecutor`, so network latency is hidden.
- Every five completed items, the script logs progress and elapsed time.
- The final line `accuracy: value` is the only part Weco needs for guidance.
For more examples, visit the Examples section.