Writing Good Evaluation Scripts
Guide for writing good evaluation scripts for Weco
Can you measure your task? If so, print it.
Examples: accuracy, speedup, latency-ms, tokens-per-second, F1-score.
If you've got a metric in mind, all you have to do is print the metric name and the number.
This is the same metric you pass to the `--metric` flag when you run the agent using `weco run`.
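A minimal sketch of such a script, using toy data and an illustrative metric name, might look like this — the only hard requirement is that the metric name and value reach stdout:

```python
# Minimal eval script sketch: compute a metric and print it as
# "<name>: <value>" so the agent can parse it. Data here is illustrative.
predictions = [1, 0, 1, 1]
labels = [1, 0, 0, 1]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(f"accuracy: {accuracy}")  # the agent reads this line
```

Here `accuracy` matches what you would pass as `--metric accuracy`.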
If you don't have one in mind, this might help.
Is a metric all you need? Technically yes, but let's talk feedback.
Yes: all you have to do to get the Weco agent working is print the metric you use to evaluate the code you want optimized to the terminal. However, you might want to provide more nuanced feedback or handle cases where a solution is clearly incorrect.
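One common pattern is to catch failures and report a worst-case score alongside an informative message, so the agent gets parseable feedback even from a broken solution. A sketch, where `run_solution` is a hypothetical stand-in for importing and running the candidate code:

```python
def run_solution():
    # Hypothetical stand-in for importing and running the candidate code.
    return [1, 0, 1, 1]

labels = [1, 0, 0, 1]
try:
    preds = run_solution()
    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
except Exception as e:
    # An informative error message plus a worst-case score lets the agent
    # both debug the failure and still rank this attempt.
    print(f"Evaluation failed: {e!r}")
    accuracy = 0.0
print(f"accuracy: {accuracy}")
```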
Putting it all together
- Ensure the baseline code you want to optimize is somewhat modular, so that the agent can modify it while you can still import it for evaluation.
- If the metric you're measuring requires running the baseline code, make sure that code is available (perhaps as a copy), since the agent will modify the baseline source code it's trying to optimize.
- Write some code to compute the metric you want to optimize for and print it out.
- Tell the agent how to run the evaluation script via the `--eval-command` flag when you run the agent using `weco run`.
From Good to Great: Supercharge Your Evaluation Script
Elevate a working eval script into a robust, efficient, and reliable driver for optimizing your code with the Weco agent using these tips:
- Keep it fast: target eval script runtimes in the order of minutes; a fast eval script allows the agent to run more iterations in the same amount of time.
- Include informative feedback: catch errors and return informative messages as if they were the SEV-2 message your manager would see at 2AM; this enables the agent to debug faster, saving you time and money.
- Make it robust: pin random seeds (e.g., `torch.manual_seed(0)`), verify correctness (e.g., numeric diff ≤ 1e-6), and average over multiple trials; this ensures the agent can trust the results of the eval script.
- Clean up: free GPU cache and temporary files between runs to avoid state leakage.
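These robustness practices can be sketched in pure Python (a PyTorch script would use `torch.manual_seed` instead of `random.seed`; `candidate` is a hypothetical function under test):

```python
import random
import statistics
import time

random.seed(0)  # pin the seed so every run sees the same inputs

def candidate(xs):
    # Hypothetical function being optimized.
    return sorted(xs)

data = [random.random() for _ in range(10_000)]

# Verify correctness against a trusted reference within a tolerance.
expected = sorted(data)
assert all(abs(a - b) <= 1e-6 for a, b in zip(candidate(data), expected)), \
    "candidate output diverges from reference"

# Average over multiple trials to smooth out timing noise.
trials = []
for _ in range(5):
    t0 = time.perf_counter()
    candidate(data)
    trials.append(time.perf_counter() - t0)
print(f"latency-ms: {statistics.mean(trials) * 1000:.3f}")
```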
Jump Right In
Now you're ready to start optimizing your code! Check out the CLI Reference for more information about the commands and options available and the Examples to see Weco in action.