Writing Good Evaluation Scripts

Guide for writing good evaluation scripts for Weco

Can you measure your task? If so, print it.

Examples: accuracy, speedup, latency-ms, tokens-per-second, F1-score.

If you've got a metric in mind, all you have to do is print the metric name and its value:

print(f"{metric_name}: {metric_value}") # It doesn't have to be in this format, you can print it any way you like.

This is the same metric name you pass to the --metric flag when you run the agent using weco run.
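
For example, a hypothetical eval script that measures latency could end like this (the solution module and its run function are placeholder names, not part of Weco):

    # eval.py (sketch, hypothetical names)
    import time
    from solution import run  # the code being optimized

    start = time.perf_counter()
    run()
    latency_ms = (time.perf_counter() - start) * 1000
    print(f"latency_ms: {latency_ms:.2f}")

Here, latency_ms is the metric name you would pass to --metric.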

Is a metric all you need? Technically yes, but let's talk feedback.

Yes. To get the Weco agent working, all you have to do is print to the terminal the metric you use to evaluate the code you want optimized. However, you might want to provide more nuanced feedback or handle cases where a solution is clearly incorrect.
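
For example, here is a minimal sketch of that idea (the sorting task and all names are placeholders, not part of Weco's API): it prints an explicit error and exits non-zero when the candidate crashes or produces a wrong answer, and only prints the metric otherwise.

    # sketch: gate the metric behind a correctness check (hypothetical names)
    import sys
    import traceback

    def check_and_score(candidate_sort):
        try:
            result = candidate_sort([3, 1, 2])
        except Exception:
            # Tell the agent exactly what went wrong, not just that it failed
            print(f"Error: the candidate crashed while running.\n{traceback.format_exc()}")
            sys.exit(1)

        expected = [1, 2, 3]
        if result != expected:
            # A clearly incorrect solution gets explicit feedback instead of a score
            print(f"Error: incorrect output. Expected {expected}, got {result}")
            sys.exit(1)

        print("accuracy: 1.0")

    if __name__ == "__main__":
        check_and_score(sorted)  # stand-in for the function imported from the file being optimized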

Putting it all together

  1. Ensure the baseline code you want to optimize is reasonably modular, so that the agent can modify it while your evaluation script can still import it.

  2. If your metric requires running the baseline code (for example, to measure a speedup), keep a separate copy of it available, since the agent will modify the source file it's trying to optimize (see the speedup sketch after this list).

  3. Write some code that computes the metric you want to optimize and prints it out. For example:

    # evaluate.py
    import sys
    import traceback
    from my_dataset import load_data
     
    try:
        # Try to import the generated code; a broken candidate can raise
        # ImportError, SyntaxError, or any other error at import time
        from file_to_optimize import Model
    except Exception:
        # If it doesn't work, surface the full traceback so the agent can debug
        print(f"Error: The generated code cannot be evaluated. Here is the error message: {traceback.format_exc()}")
        sys.exit(1)
     
    def evaluate():
        X_test, y_test = load_data()
        acc = (Model(X_test) == y_test).float().mean().item()
        print(f"accuracy: {acc:.6f}")
     
    if __name__ == "__main__":
        evaluate()
  4. Tell the agent how to run the evaluation script. You can do this with the --eval-command flag when you run the agent using weco run.

    weco run \
        --source file_to_optimize.py \
        --eval-command "python evaluate.py" \
        --metric accuracy \
        --goal maximize
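
As a concrete sketch for step 2: if your metric is a speedup over the baseline, the eval script can import a frozen copy of the original code alongside the file the agent edits. The file and function names below (baseline.py, run) are assumptions for illustration:

    # evaluate_speedup.py (sketch, hypothetical file and function names)
    import time

    from baseline import run as run_baseline            # frozen copy, never edited by the agent
    from file_to_optimize import run as run_candidate   # the file the agent modifies

    def avg_runtime(fn, trials=5):
        # Average over several trials to reduce timing noise
        start = time.perf_counter()
        for _ in range(trials):
            fn()
        return (time.perf_counter() - start) / trials

    if __name__ == "__main__":
        speedup = avg_runtime(run_baseline) / avg_runtime(run_candidate)
        print(f"speedup: {speedup:.3f}")

With a script like this, you would pass --metric speedup and --goal maximize to weco run.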

From Good to Great: Supercharge Your Evaluation Script

These tips help you elevate a working eval script into a robust, efficient, and reliable driver for optimizing your code with the Weco agent:

  • Keep it fast: target eval script runtimes on the order of minutes; a fast eval script allows the agent to run more iterations in the same amount of time.
  • Include informative feedback: catch errors and return informative messages as if they were the SEV-2 message your manager would see at 2AM; this enables the agent to debug faster, saving you time and money.
  • Make it robust: pin random seeds (e.g., torch.manual_seed(0)), verify correctness (e.g., numeric diff ≤ 1e-6), average over multiple trials; this ensures the agent can trust the results of the eval script.
  • Clean up: free GPU cache and temporary files between runs to avoid state leakage (see the sketch below).
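
A minimal sketch of the robustness and cleanup points in a PyTorch-style eval script; the model, data, and the 0.95 accuracy threshold are placeholder assumptions, not Weco requirements:

    # sketch: seeded, multi-trial, self-cleaning evaluation (hypothetical names)
    import sys
    import time

    import torch

    def evaluate(model, X_test, y_test, trials=3):
        torch.manual_seed(0)  # pin randomness so repeated runs are comparable
        latencies = []
        with torch.no_grad():
            for _ in range(trials):
                start = time.perf_counter()
                preds = model(X_test)
                latencies.append(time.perf_counter() - start)

        # Correctness guard: refuse to report a score for a broken candidate
        acc = (preds == y_test).float().mean().item()
        if acc < 0.95:
            print(f"Error: accuracy dropped to {acc:.4f}; rejecting this candidate")
            sys.exit(1)

        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # free GPU memory so the next run starts clean

        # Average over trials to reduce noise in the reported metric
        print(f"latency_ms: {1000 * sum(latencies) / len(latencies):.2f}")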

Jump Right In

Now you're ready to start optimizing your code! Check out the CLI Reference for more information about the commands and options available and the Examples to see Weco in action.
