Writing Good Evaluation Scripts
Guide for writing good evaluation scripts for Weco
Can you measure your task? If so, print it.
Examples: accuracy, speedup, latency-ms, tokens-per-second, F1-score.
If you've got a metric in mind, all you have to do is print the metric name and the number.
This is the same metric you pass to the `--metric` flag when you run the agent using `weco run`.
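A minimal sketch of such a script, using toy data and an illustrative metric name, might look like this — the only hard requirement is that the metric name and value reach stdout:

```python
# Minimal eval script sketch: compute a metric and print it as
# "<name>: <value>" so the agent can parse it. Data here is illustrative.
predictions = [1, 0, 1, 1]
labels = [1, 0, 0, 1]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(f"accuracy: {accuracy}")  # the agent reads this line
```

Here `accuracy` matches what you would pass as `--metric accuracy`.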
If you don't have one in mind, this might help.
Is a metric all you need? Technically yes, but let's talk feedback.
Yes: all you have to do to get the Weco agent working is print the metric you use to evaluate the code you want optimized to the terminal. However, you might want to provide more nuanced feedback or handle cases where a solution is clearly incorrect.
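One common pattern is to catch failures and report a worst-case score alongside an informative message, so the agent gets parseable feedback even from a broken solution. A sketch, where `run_solution` is a hypothetical stand-in for importing and running the candidate code:

```python
def run_solution():
    # Hypothetical stand-in for importing and running the candidate code.
    return [1, 0, 1, 1]

labels = [1, 0, 0, 1]
try:
    preds = run_solution()
    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
except Exception as e:
    # An informative error message plus a worst-case score lets the agent
    # both debug the failure and still rank this attempt.
    print(f"Evaluation failed: {e!r}")
    accuracy = 0.0
print(f"accuracy: {accuracy}")
```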
Putting it all together
- Ensure the baseline code you want to optimize is somewhat modular, so that the agent can modify it while you can still import it for evaluation.
- If the metric you're measuring requires running the baseline code, make sure that code is available (perhaps as a copy), since the agent will modify the baseline source code it's trying to optimize.
- Write some code to compute the metric you want to optimize for and print it out.
- Tell the agent how to run the evaluation script via the `--eval-command` flag when you run the agent using `weco run`.
From Good to Great: Supercharge Your Evaluation Script
Elevate a working eval script into a robust, efficient, and reliable driver for optimizing your code with the Weco agent using these tips:
- Keep it fast: target eval script runtimes in the order of minutes; a fast eval script allows the agent to run more iterations in the same amount of time.
- Include informative feedback: catch errors and return informative messages as if they were the SEV-2 message your manager would see at 2AM; this enables the agent to debug faster, saving you time and money.
- Make it robust: pin random seeds (e.g., `torch.manual_seed(0)`), verify correctness (e.g., numeric diff ≤ 1e-6), and average over multiple trials; this ensures the agent can trust the results of the eval script.
- Clean up: free GPU cache and temporary files between runs to avoid state leakage.
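These robustness practices can be sketched in pure Python (a PyTorch script would use `torch.manual_seed` instead of `random.seed`; `candidate` is a hypothetical function under test):

```python
import random
import statistics
import time

random.seed(0)  # pin the seed so every run sees the same inputs

def candidate(xs):
    # Hypothetical function being optimized.
    return sorted(xs)

data = [random.random() for _ in range(10_000)]

# Verify correctness against a trusted reference within a tolerance.
expected = sorted(data)
assert all(abs(a - b) <= 1e-6 for a, b in zip(candidate(data), expected)), \
    "candidate output diverges from reference"

# Average over multiple trials to smooth out timing noise.
trials = []
for _ in range(5):
    t0 = time.perf_counter()
    candidate(data)
    trials.append(time.perf_counter() - t0)
print(f"latency-ms: {statistics.mean(trials) * 1000:.3f}")
```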
Jump Right In
Now you're ready to start optimizing your code! Check out the CLI Reference for more information about the commands and options available and the Examples to see Weco in action.