CUDA Optimization
Optimize a PyTorch self-attention module using custom CUDA kernels
This example showcases using Weco to optimize a PyTorch causal multi-head self-attention implementation by generating custom CUDA kernels. Writing CUDA directly allows lower-level control than standard PyTorch or even Triton, potentially unlocking higher performance on NVIDIA GPUs.
You can find the complete files for this example here.
Setup
If you haven't already, follow the Installation guide to install the Weco CLI, or install it directly with pip:
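```bash
# Installs the Weco CLI (published on PyPI as "weco")
pip install weco
```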
Install the required dependencies:
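```bash
# The example's own requirements file is authoritative; at minimum you need
# PyTorch plus Ninja, which torch.utils.cpp_extension uses to build CUDA code.
pip install torch ninja
```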
Tips:
- This example requires a compatible NVIDIA GPU and the CUDA Toolkit installed on your system for compiling and running the generated CUDA code (a quick sanity check follows this list).
- If compatible, install Flash Attention (`pip install flash-attn --no-build-isolation`).
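Before starting a run, it's worth confirming that the CUDA compiler and a GPU are actually visible to your environment:

```bash
nvcc --version                                              # CUDA Toolkit present?
python -c "import torch; print(torch.cuda.is_available())"  # GPU visible to PyTorch?
```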
Run Weco
Now run Weco to optimize your code:
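A command matching the flags explained below, using the CLI's `weco run` entry point (the `--additional-instructions` string is left as a placeholder; fill it with your own guidance on the optimization approach):

```bash
weco run --source module.py \
  --eval-command "python evaluate.py --path module.py" \
  --metric speedup \
  --goal maximize \
  --steps 50 \
  --model gpt-5 \
  --additional-instructions "..." \
  --eval-timeout 600
```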
Explanation
- `--source module.py`: The initial PyTorch self-attention code to be optimized with CUDA.
- `--eval-command "python evaluate.py --path module.py"`: Runs the evaluation script, which compiles (if necessary) and benchmarks the CUDA-enhanced code in `module.py` against a baseline, printing the `speedup` (see the harness sketch after this list).
- `--metric speedup`: The optimization target metric.
- `--goal maximize`: Weco aims to increase the speedup.
- `--steps 50`: The number of optimization iterations.
- `--model gpt-5`: The LLM used for code generation.
- `--additional-instructions "..."`: Provides guidance to the LLM on the optimization approach.
- `--eval-timeout 600`: Stops the evaluation run if it does not complete within 600 seconds.
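The only contract the evaluation script must honor is printing the metric so Weco can parse it from stdout. A minimal sketch of what `evaluate.py` might look like — the `Model` class name, tensor shapes, and dimensions are illustrative assumptions, not the example's actual harness:

```python
# Hypothetical harness for illustration; the example ships its own evaluate.py.
# The part that matters: benchmark the candidate against a baseline and print
# the metric ("speedup: <value>") so Weco can parse it from stdout.
import argparse
import importlib.util
import time

import torch
import torch.nn.functional as F


class BaselineAttention(torch.nn.Module):
    """Reference causal multi-head self-attention (dims are assumptions)."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.heads = heads
        self.qkv = torch.nn.Linear(dim, 3 * dim)
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.heads, C // self.heads).transpose(1, 2)
                   for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).reshape(B, T, C))


def load_candidate(path):
    # Import the file Weco is editing; "Model" is an assumed class name.
    spec = importlib.util.spec_from_file_location("candidate", path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod.Model()


def bench(fn, x, iters=50):
    for _ in range(5):          # warm-up (also triggers any JIT compilation)
        fn(x)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--path", required=True)
    args = parser.parse_args()

    # Shapes are illustrative: batch 8, sequence 1024, embed dim 768.
    x = torch.randn(8, 1024, 768, device="cuda", dtype=torch.float16)
    baseline = BaselineAttention().cuda().half().eval()
    candidate = load_candidate(args.path).cuda().half().eval()

    # A real harness should also verify numerical correctness before timing.
    with torch.no_grad():
        speedup = bench(baseline, x) / bench(candidate, x)
    print(f"speedup: {speedup:.3f}")  # Weco parses this line
```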
Weco will iteratively modify `module.py`, generating and integrating CUDA code, guided by the evaluation results and the additional instructions provided.
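For intuition about what "integrating CUDA code" looks like in practice, generated solutions often use PyTorch's inline extension loader, which compiles a CUDA source string on the fly. A toy sketch with a trivial elementwise kernel (not an attention kernel, and not the example's actual output):

```python
# Sketch of the inline-compilation pattern the generated code may rely on.
import torch
from torch.utils.cpp_extension import load_inline

cuda_src = r"""
#include <torch/extension.h>

__global__ void scale_kernel(const float* x, float* y, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] * s;
}

torch::Tensor scale(torch::Tensor x, double s) {
    auto y = torch::empty_like(x);
    int n = x.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(
        x.data_ptr<float>(), y.data_ptr<float>(), (float)s, n);
    return y;
}
"""

cpp_src = "torch::Tensor scale(torch::Tensor x, double s);"

# nvcc is invoked under the hood, hence the CUDA Toolkit requirement above.
ext = load_inline(name="toy_scale", cpp_sources=cpp_src,
                  cuda_sources=cuda_src, functions=["scale"])

x = torch.randn(1024, device="cuda")
print(torch.allclose(ext.scale(x, 2.0), x * 2))  # sanity check: True
```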
What's Next?
- Higher-level GPU programming: Try Triton Optimization for easier kernel development
- Different optimization types: Explore Model Development or Prompt Engineering
- Simpler GPU optimization: Start with PyTorch Optimization
- Better evaluation scripts: Learn Writing Good Evaluation Scripts
- All command options: Check the CLI Reference