CUDA Optimization
Optimize a PyTorch self-attention module using custom CUDA kernels
This example showcases using Weco to optimize a PyTorch causal multi-head self-attention implementation by generating custom CUDA kernels.
This approach goes below standard PyTorch, and even Triton, to low-level kernel optimizations that can yield higher performance on NVIDIA GPUs.
This example uses a separate Markdown file (`guide.md`) to provide detailed instructions and context to the LLM.
You can find the complete files for this example here.
Setup
If you haven't already, follow the Installation guide to install the Weco CLI. Otherwise, install the CLI using `pip`:
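The CLI is published on PyPI, so a standard pip installation should work:

```shell
pip install weco
```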
Google AI Studio has a free API usage quota. Create a key here to use `weco` for free.
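The CLI reads the key from the environment. Assuming the standard Gemini variable name (check the CLI docs if your version expects a different one):

```shell
# export your Google AI Studio key; the variable name is the usual Gemini
# convention and may differ across CLI versions
export GEMINI_API_KEY=your-key-here
```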
Install the required dependency:
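The example itself depends on PyTorch (the CUDA Toolkit and a matching driver must already be present on the system, per the note below):

```shell
pip install torch
```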
Note: This example requires a compatible NVIDIA GPU and the CUDA Toolkit installed on your system for compiling and running the generated CUDA code.
Create the Baseline to Optimize
Create a file called `optimize.py` with the following content:
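The exact baseline shipped with the example may differ; the sketch below is a representative pure-PyTorch causal multi-head self-attention module (the `Model` class name and `get_inputs` helper are assumptions about the evaluation harness):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    """Baseline causal multi-head self-attention in plain PyTorch."""

    def __init__(self, n_embd=128, n_head=8, max_seq_len=256):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # fused QKV projection
        self.c_proj = nn.Linear(n_embd, n_embd)      # output projection
        # causal mask: each token may only attend to itself and earlier tokens
        mask = torch.tril(torch.ones(max_seq_len, max_seq_len))
        self.register_buffer("mask", mask.view(1, 1, max_seq_len, max_seq_len))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

def get_inputs():
    # a single random batch for benchmarking; shapes are illustrative
    return [torch.randn(4, 64, 128)]
```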
Create the Evaluation Script
Create a file called `evaluate.py` with the following content:
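Again, the real script in the repository may differ. A plausible sketch: load the candidate module from `--solution-path`, check its output against a reference attention implementation, time both, and print a `speedup` line for Weco to parse (the exact output format Weco expects, and the shared `c_attn`/`c_proj` parameter names, are assumptions here):

```python
import argparse
import importlib.util
import time
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefAttention(nn.Module):
    """Reference causal self-attention used as the correctness/speed baseline.
    Parameter names (c_attn, c_proj) are assumed to match the candidate's."""

    def __init__(self, n_embd=128, n_head=8):
        super().__init__()
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)
        self.c_proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        shape = (B, T, self.n_head, C // self.n_head)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.c_proj(y.transpose(1, 2).reshape(B, T, C))

def load_module(path):
    """Dynamically import the candidate solution file."""
    spec = importlib.util.spec_from_file_location("solution", path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod

def bench(fn, warmup=3, iters=10):
    """Return mean wall-clock seconds per call."""
    for _ in range(warmup):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--solution-path", required=True)
    args = parser.parse_args()
    device = "cuda" if torch.cuda.is_available() else "cpu"

    candidate = load_module(args.solution_path).Model().to(device).eval()
    reference = RefAttention().to(device).eval()
    # copy the candidate's weights so outputs are directly comparable
    reference.load_state_dict(candidate.state_dict(), strict=False)

    x = torch.randn(4, 64, 128, device=device)
    with torch.no_grad():
        if not torch.allclose(candidate(x), reference(x), atol=1e-3):
            raise SystemExit("incorrect output")
        t_ref = bench(lambda: reference(x))
        t_cand = bench(lambda: candidate(x))
    print(f"speedup: {t_ref / t_cand:.3f}")

if __name__ == "__main__":
    main()
```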
Create the Guidance File
Create a file called `guide.md` with the following content:
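The actual guide shipped with the example is more detailed; below is an illustrative sketch of the kind of guidance such a file contains:

```markdown
# CUDA Optimization Guide

Optimize the forward pass of the attention module in `optimize.py` by writing
custom CUDA kernels.

- Define and compile kernels inline with `torch.utils.cpp_extension.load_inline`;
  do not add a separate build step.
- Look for fusion opportunities (e.g. element-wise operations, softmax scaling)
  to reduce kernel launches and memory traffic.
- The optimized output must stay numerically close to the baseline; the
  evaluation script checks this.
- Keep the module's public interface unchanged so `evaluate.py` can load it.
```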
Optimized Code
The optimized version employs a custom CUDA kernel for fused element-wise addition. The kernel is defined and compiled inline using PyTorch's `load_inline`.
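As a sketch of that pattern: the kernel below is a minimal fused element-wise add compiled via `load_inline`. The kernel Weco actually generates will differ, and compiling this one requires `nvcc` and an NVIDIA GPU.

```python
import torch
from torch.utils.cpp_extension import load_inline

# CUDA source: a kernel plus a C++ wrapper that launches it over the tensor.
cuda_source = r"""
#include <torch/extension.h>

__global__ void add_kernel(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

torch::Tensor fused_add(torch::Tensor a, torch::Tensor b) {
    auto out = torch::empty_like(a);
    int n = a.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add_kernel<<<blocks, threads>>>(
        a.data_ptr<float>(), b.data_ptr<float>(), out.data_ptr<float>(), n);
    return out;
}
"""

# C++ declaration exposed to Python via the generated binding.
cpp_source = "torch::Tensor fused_add(torch::Tensor a, torch::Tensor b);"

def build_fused_add():
    """Compile the extension on first use (requires nvcc and a CUDA GPU)."""
    return load_inline(
        name="fused_add_ext",
        cpp_sources=cpp_source,
        cuda_sources=cuda_source,
        functions=["fused_add"],
    )
```

Calling `build_fused_add().fused_add(a, b)` on two CUDA float tensors would then run the custom kernel in place of PyTorch's built-in addition.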
Run Weco
Now run Weco to optimize your code:
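With the files above in place, the invocation looks like this (the `weco run` subcommand reflects recent CLI versions; check `weco --help` if yours differs):

```shell
weco run --source optimize.py \
  --eval-command "python evaluate.py --solution-path optimize.py" \
  --metric speedup \
  --maximize true \
  --steps 30 \
  --model gemini-2.5-pro-exp-03-25 \
  --additional-instructions guide.md
```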
Explanation
- `--source optimize.py`: The initial PyTorch self-attention code to be optimized with CUDA.
- `--eval-command "python evaluate.py --solution-path optimize.py"`: Runs the evaluation script, which compiles (if necessary) and benchmarks the CUDA-enhanced code in `optimize.py` against a baseline, printing the `speedup`.
- `--metric speedup`: The optimization target metric.
- `--maximize true`: Weco aims to increase the speedup.
- `--steps 30`: The number of optimization iterations.
- `--model gemini-2.5-pro-exp-03-25`: The LLM used for code generation.
- `--additional-instructions guide.md`: Points Weco to the guidance file created above.
Weco will iteratively modify `optimize.py`, potentially generating and integrating CUDA C++ code, guided by the evaluation results and the instructions in `guide.md`.
For more examples, visit the Examples section.