Prompt Engineering

Iteratively improve a prompt for solving AIME problems

This example shows how Weco can iteratively improve a prompt for solving American Invitational Mathematics Examination (AIME) problems. The experiment runs locally, requires only two short Python files and an optional prompt guide, and aims to improve the accuracy metric.

You can follow along here or check out the files directly from here.

Setup

If you haven't already, follow the Installation guide to install the Weco CLI. Alternatively, install the CLI with pip:

pip install weco

This example uses gpt-4.1-mini via the OpenAI API by default. Ensure your OPENAI_API_KEY environment variable is set. You can create a key here.

export OPENAI_API_KEY="your_openai_api_key_here"

Install the dependencies for the scripts shown in the following sections.

pip install openai datasets
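
To confirm the packages and API key are wired up before writing any code, you can run a quick, optional check (a one-off sanity command, not part of the example files):

python -c "import os, openai, datasets; assert os.getenv('OPENAI_API_KEY'), 'OPENAI_API_KEY is not set'; print('environment OK')"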

Create the Optimization Target

Create a file named optimize.py. This file holds the prompt template (instructing the LLM to reason step-by-step and wrap the final answer in \\boxed{}) and the solve function that calls the OpenAI API. Weco edits only this file during the search.

from openai import OpenAI
 
client = OpenAI()  # API key must be in OPENAI_API_KEY
 
PROMPT_TEMPLATE = """You are an expert competition mathematician tasked with solving an AIME problem.
The final answer must be a three-digit integer between 000 and 999, inclusive.
Please reason step-by-step towards the solution. Keep your reasoning concise.
Conclude your response with the final answer enclosed in \\boxed{{}}. For example: The final answer is \\boxed{{042}}.
 
Problem:
{problem}
 
Solution:
"""
 
 
def solve(problem: str, model_name: str) -> str:
    """Return the model's raw text answer for one problem using the specified model."""
    prompt = PROMPT_TEMPLATE.format(problem=problem)
 
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
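
Before building the evaluation script, you can sanity-check optimize.solve with a throwaway call such as the one below. The problem string is a toy placeholder rather than a real AIME item, and the call assumes OPENAI_API_KEY is set and that your key has access to gpt-4.1-mini:

# Quick manual check of optimize.solve (not one of the example files)
import optimize

toy_problem = "Compute the remainder when 7^2024 is divided by 1000."
print(optimize.solve(toy_problem, model_name="gpt-4.1-mini"))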

Create the Evaluation Script

Create a file named eval.py. This script downloads a small slice of the AIME 2024 dataset, calls optimize.solve in parallel, parses the LLM output (looking for \\boxed{}), compares it to the ground truth, prints progress logs, and finally prints the accuracy: line that Weco reads. It also defines which LLM to use (MODEL_TO_USE).

import re
import sys
import time
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor, as_completed
 
from datasets import load_dataset
import optimize  # the file Weco mutates
 
# ---------------------------------------------------------------------
# Configuration
TOTAL_SAMPLES = 30  # how many problems to load
NUM_WORKERS = 30  # concurrent LLM calls
LOG_EVERY = 5  # print progress after this many completions
MODEL_TO_USE = "gpt-4.1-mini"  # LLM used by optimize.solve
TASK_TIMEOUT = 300  # overall time budget (seconds) for collecting results
# ---------------------------------------------------------------------
 
print(f"[setup] loading {TOTAL_SAMPLES} problems from AIME 2024 …", flush=True)
DATA = load_dataset("Maxwell-Jia/AIME_2024", split=f"train[:{TOTAL_SAMPLES}]", cache_dir=".cache")
 
 
def extract_final_answer(text: str) -> str:
    """
    Extracts the final AIME answer (000-999) from the LLM response.
    Prioritizes answers within \boxed{}, then looks for patterns,
    and falls back to finding the last 3-digit number.
    """
    # 1. Check for \boxed{...}
    boxed_match = re.search(r"\\boxed\{(\d{1,3})\}", text)
    if boxed_match:
        return boxed_match.group(1).zfill(3)  # Pad with leading zeros if needed
 
    # 2. Check for "final answer is ..." patterns (case-insensitive)
    # Make sure pattern captures potential variations like "is: 123", "is 123."
    answer_pattern = r"(?:final|answer is|result is)[:\s]*(\d{1,3})\b"
    answer_match = re.search(answer_pattern, text, re.IGNORECASE)
    if answer_match:
        return answer_match.group(1).zfill(3)
 
    # 3. Fallback: take the last isolated 1-3 digit number in the text.
    #    Less reliable than the patterns above, but better than returning nothing.
    fallback_matches = re.findall(r"\b(\d{1,3})\b", text)
    if fallback_matches:
        # Return the last found number, assuming it's the most likely answer candidate
        return fallback_matches[-1].zfill(3)
 
    return ""  # Return empty if no answer found
 
 
def grade_answer(llm_output: str, ground_truth_answer: str) -> bool:
    """Compares the extracted LLM answer to the ground truth."""
    extracted_guess = extract_final_answer(llm_output)
    # AIME answers are integers 0-999; the dataset stores them as (possibly
    # unpadded) integer strings, so compare both sides as integers.
    try:
        # Check if both can be converted to integers for comparison
        return int(extracted_guess) == int(ground_truth_answer)
    except ValueError:
        # If conversion fails (e.g., empty string), they don't match
        return False
 
 
def run_evaluation() -> float:
    """Runs the evaluation on the dataset and returns the accuracy."""
    correct = 0
    start = time.time()
    results = []  # Store results for potential later analysis if needed
 
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        # Submit one task per problem, mapping each future to its ground-truth answer
        futures = {
            pool.submit(optimize.solve, row["Problem"], MODEL_TO_USE): row["Answer"] for row in DATA
        }
 
        try:
            # Process tasks as they complete; as_completed raises a TimeoutError
            # if results are still outstanding TASK_TIMEOUT seconds after it was called
            for idx, future in enumerate(as_completed(futures, timeout=TASK_TIMEOUT), 1):
                problem_answer = futures[future]  # ground-truth answer for this problem
                try:
                    llm_raw_output = future.result()
                    is_correct = grade_answer(llm_raw_output, str(problem_answer))
                    if is_correct:
                        correct += 1
                    results.append({"raw_output": llm_raw_output, "correct_answer": problem_answer, "is_correct": is_correct})
 
                except Exception as exc:
                    print(f"[error] Generated an exception: {exc}")
                    results.append({"raw_output": f"Error: {exc}", "correct_answer": problem_answer, "is_correct": False})
 
                if idx % LOG_EVERY == 0 or idx == TOTAL_SAMPLES:
                    elapsed = time.time() - start
                    current_accuracy = correct / idx if idx > 0 else 0
                    print(
                        f"[progress] {idx}/{TOTAL_SAMPLES} completed, accuracy: {current_accuracy:.4f}, elapsed {elapsed:.1f} s",
                        flush=True,
                    )
        except concurrent.futures.TimeoutError:
            # Results still outstanding after the time budget; abort and exit
            print(f"[error] evaluation timed out after {TASK_TIMEOUT}s", flush=True)
            # Cancel all pending futures and exit
            for f in futures:
                f.cancel()
            print("Exiting due to timeout", file=sys.stderr)
            sys.exit(1)
        except KeyboardInterrupt:
            print("\nEvaluation interrupted by user", file=sys.stderr)
            sys.exit(1)
 
    # Final accuracy calculation
    total_evaluated = len(results)
    final_accuracy = correct / total_evaluated if total_evaluated > 0 else 0
    return final_accuracy
 
 
if __name__ == "__main__":
    acc = run_evaluation()
    # Weco parses this exact line format
    print(f"accuracy: {acc:.4f}")

(Optional) Create Additional Instructions Guide

You can optionally create a prompt_guide.md file to provide more context or specific instructions to Weco during the optimization process.

# Weco Prompt Optimization Guidelines for AIME (Targeting GPT-4.1-mini)
 
## 1. Goal
 
Your objective is to modify the `optimize.py` file to improve the `accuracy` metric when solving AIME math problems. The modifications should leverage the capabilities of the target model, **GPT-4.1-mini**.
 
## 2. Files and Workflow
 
*   **Target File for Modification:** `optimize.py`.
*   **Evaluation Script:** `eval.py`. This script:
    *   Defines the actual LLM used for solving (`MODEL_TO_USE`, which is set to `gpt-4.1-mini` in this context).
    *   Calls `optimize.solve(problem, model_name="gpt-4.1-mini")`.
    *   Parses the output from `optimize.solve`. **Crucially, it expects the final 3-digit answer (000-999) to be enclosed in `\boxed{XXX}`.** For example: `\boxed{042}`. Your prompt modifications *must* ensure the model consistently produces this format for the final answer.
    *   Compares the extracted answer to the ground truth and prints the `accuracy:` metric, which Weco uses for guidance.
 
## 3. Target Model: GPT-4.1-mini
 
You are optimizing the prompt for `gpt-4.1-mini`. Based on its characteristics, consider the following:
 
*   **Strengths:**
    *   **Significantly Improved Instruction Following:** GPT-4.1-mini is better at adhering to complex instructions, formats, and constraints compared to previous models. This is key for AIME where precision is vital. It excels on hard instruction-following tasks.
    *   **Stronger Coding & Reasoning:** Its improved coding performance (e.g., SWE-bench) suggests enhanced logical reasoning capabilities applicable to mathematical problem-solving.
    *   **Refreshed Knowledge:** Knowledge cutoff is June 2024.
*   **Considerations:**
    *   **Literal Interpretation:** GPT-4.1-mini can be more literal. Prompts should be explicit and specific about the desired reasoning process and output format. Avoid ambiguity.
 
## 4. Optimization Strategies (Focus on `PROMPT_TEMPLATE` in `optimize.py`)
 
The primary goal is to enhance the model's reasoning process for these challenging math problems. Focus on Chain-of-Thought (CoT) designs within the `PROMPT_TEMPLATE`.
 
**Ideas to Explore:**
You don't have to implement all of them, but the following ideas might be helpful:
*   **Workflow Patterns:** Try some of the following patterns:
    *  **Linear**: A linear workflow with standard CoT, e.g. the following thinking steps (you don't have to include all of them): "1. Understand the problem constraints. 2. Identify relevant theorems/formulas. 3. Formulate a plan. 4. Execute calculations step-by-step. 5. Verify intermediate results. 6. State the final answer in the required format."
    *  **List Candidates**: Ask the model to propose a few candidate solutions at a particular step and then pick the best one. You can also set the selection criteria in the prompt.
    *  **Code**: Use pseudo-code to define more complex workflows with loops, conditionals, or goto statements.
*   **Other CoT Techniques:**
    *   Self-Correction/Reflection
    *   Plan Generation
    *   Debate, simulating multiple characters
    *   Tree of thought
*   **Few-Shot Examples:** You *could* experiment with adding 1-2 high-quality AIME problem/solution examples directly into the `PROMPT_TEMPLATE` string (similar to how Weco attempted in one of the runs). Ensure the examples clearly show the desired reasoning style and the final `\boxed{XXX}` format.
*   **Play with Format:** Experiment with how you format the prompt: Markdown, XML, JSON, code, or natural language. You can also try different formats for the reasoning/thinking tokens themselves.
 
## 5. Constraints
*   **Ensure the final output reliably contains `\boxed{XXX}` as the evaluation script depends on it.**

Run Weco

Now run Weco to optimize your prompt:

weco run --source optimize.py \
         --eval-command "python eval.py" \
         --metric accuracy \
         --maximize true \
         --steps 40 \
         --model gemini-2.5-pro-exp-03-25 \
         --additional-instructions prompt_guide.md # Optional: remove if not using guide

Note: You can replace --model gemini-2.5-pro-exp-03-25 with another powerful model like o3 or others, provided you have the respective API keys set.
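
For example, to drive the search with o3 instead (assuming your API key has access to it), only the --model flag changes:

weco run --source optimize.py \
         --eval-command "python eval.py" \
         --metric accuracy \
         --maximize true \
         --steps 40 \
         --model o3 \
         --additional-instructions prompt_guide.md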

During each evaluation round, you will see log lines similar to the following (the sample below comes from a run using a 20-problem slice):

[setup] loading 20 problems from AIME 2024 …
[progress] 5/20 completed, accuracy: 0.0000, elapsed 7.3 s
[progress] 10/20 completed, accuracy: 0.1000, elapsed 14.6 s
[progress] 15/20 completed, accuracy: 0.0667, elapsed 21.8 s
[progress] 20/20 completed, accuracy: 0.0500, elapsed 28.9 s
accuracy: 0.0500

Weco then mutates the prompt instructions in optimize.py, tries again, and gradually pushes the accuracy higher.
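
For intuition, a mutated prompt might look something like the sketch below. It is purely illustrative (the actual edits are generated by the search and differ from run to run), but it shows the kind of structured chain-of-thought the guide encourages:

PROMPT_TEMPLATE = """You are an expert competition mathematician solving an AIME problem.
The final answer must be a three-digit integer between 000 and 999, inclusive.

Work through the problem in stages:
1. Restate the constraints in your own words.
2. Identify relevant theorems or formulas.
3. Outline a plan before computing anything.
4. Execute the calculations step by step, checking intermediate results.
5. Verify that the final value satisfies the original constraints.

Conclude with the final answer enclosed in \\boxed{{}}. For example: The final answer is \\boxed{{042}}.

Problem:
{problem}

Solution:
"""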

How it works

  • eval.py slices the Maxwell-Jia/AIME_2024 dataset to a small number of problems (TOTAL_SAMPLES, 30 by default) for fast feedback. You can change the slice by editing one line at the top of the script (see the snippet after this list).
  • The script sends model calls in parallel via ThreadPoolExecutor, so network latency is hidden.
  • Every five completed items, the script logs progress and elapsed time.
  • The final accuracy: line (metric name plus value) is the only part of the output Weco needs for guidance.
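
For instance, to iterate even faster you could shrink the slice by editing the configuration block at the top of eval.py (the values below are just an example):

TOTAL_SAMPLES = 10  # evaluate on the first 10 problems only
NUM_WORKERS = 10  # one concurrent LLM call per problem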

For more examples, visit the Examples section.
