CUDA Optimization
Optimize a PyTorch self-attention module using custom CUDA kernels
This example showcases using Weco to optimize a PyTorch causal multi-head self-attention implementation by generating custom CUDA kernels.
This approach applies low-level optimizations beyond what standard PyTorch, or even Triton, exposes, potentially yielding higher performance on NVIDIA GPUs.
This example uses a separate Markdown file (`guide.md`) to provide detailed instructions and context to the LLM.
You can find the complete files for this example here.
Setup
If you haven't already, follow the Installation guide to install the Weco CLI. Otherwise, install the CLI using `pip`:
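```bash
# Assumes the CLI is published on PyPI under the name `weco`.
pip install weco
```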
Choose your LLM provider:
Create your OpenAI API key here.
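Then make the key available to the CLI; a typical shell setup, assuming Weco reads the standard `OPENAI_API_KEY` environment variable:

```bash
export OPENAI_API_KEY="your_api_key_here"  # placeholder; substitute your real key
```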
Install the required dependencies:
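The exact requirements list isn't reproduced here; at minimum the example needs PyTorch, and `ninja` is required by `load_inline` to compile the generated CUDA code:

```bash
pip install torch ninja
```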
Note: This example requires a compatible NVIDIA GPU and the CUDA Toolkit installed on your system for compiling and running the generated CUDA code.
Create the Guidance File
Create a file called `guide.md` with the following content:
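The guide's exact content isn't reproduced on this page; the sketch below illustrates what such a guidance file might contain. Its structure and wording are assumptions, not the actual file from the example repository:

```markdown
# CUDA Optimization Guide

## Goal
Speed up the causal multi-head self-attention module in `optimize.py` by
replacing hot PyTorch operations with custom CUDA kernels compiled via
`torch.utils.cpp_extension.load_inline`.

## Constraints
- Keep the module's public interface (constructor arguments, forward signature) unchanged.
- Outputs must match the PyTorch baseline within a small numerical tolerance.
- The file must remain runnable by `python evaluate.py --solution-path optimize.py`.

## Hints
- Consider fusing element-wise operations (bias add, masking, scaling) into a single kernel.
- Benchmark after every change; only keep kernels that improve the measured speedup.
```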
Optimized Code
The optimized version employs a custom CUDA kernel for fused element-wise addition. The kernel is defined and compiled inline using PyTorch's `load_inline`.
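For reference, here is a minimal sketch of that pattern: an illustrative fused-add kernel, not the exact code Weco generates for this example.

```python
import torch
from torch.utils.cpp_extension import load_inline

cuda_source = r"""
#include <torch/extension.h>

__global__ void add_kernel(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = a[i] + b[i];
    }
}

torch::Tensor fused_add(torch::Tensor a, torch::Tensor b) {
    // Assumes contiguous float32 tensors of the same shape.
    auto out = torch::empty_like(a);
    int n = a.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add_kernel<<<blocks, threads>>>(
        a.data_ptr<float>(), b.data_ptr<float>(), out.data_ptr<float>(), n);
    return out;
}
"""

cpp_source = "torch::Tensor fused_add(torch::Tensor a, torch::Tensor b);"

# Compiles the extension on first use; requires nvcc and ninja on the system.
fused = load_inline(
    name="fused_add_ext",
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,
    functions=["fused_add"],
)

a = torch.randn(1024, device="cuda")
b = torch.randn(1024, device="cuda")
assert torch.allclose(fused.fused_add(a, b), a + b)
```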
Run Weco
Now run Weco to optimize your code:
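The command below is assembled from the flags documented in the Explanation section. Note that `--additional-instructions` takes a free-form string here; if your CLI version supports file paths for this flag, you can pass the path to `guide.md` instead.

```bash
weco run --source optimize.py \
  --eval-command "python evaluate.py --solution-path optimize.py" \
  --metric speedup \
  --goal maximize \
  --steps 15 \
  --model o4-mini \
  --additional-instructions "Write a CUDA kernel to optimize the matrix multiplication and element-wise operations. Ensure numerical correctness and maintain the same interface."
```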
Explanation
- `--source optimize.py`: The initial PyTorch self-attention code to be optimized with CUDA.
- `--eval-command "python evaluate.py --solution-path optimize.py"`: Runs the evaluation script, which compiles (if necessary) and benchmarks the CUDA-enhanced code in `optimize.py` against a baseline, printing the `speedup` (a hedged sketch of such a script appears after this list).
- `--metric speedup`: The optimization target metric.
- `--goal maximize`: Weco aims to increase the speedup.
- `--steps 15`: The number of optimization iterations.
- `--model o4-mini`: The LLM used for code generation.
- `--additional-instructions "Write a CUDA kernel to optimize the matrix multiplication and element-wise operations. Ensure numerical correctness and maintain the same interface."`: Provides guidance to the LLM on the optimization approach.
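For concreteness, here is a hypothetical minimal version of such an evaluation script. The `baseline.py` module and the `Model` class are illustrative assumptions, not the actual files from the example:

```python
import argparse
import importlib.util
import time

import torch


def load(path, name):
    # Dynamically import a module from a file path.
    spec = importlib.util.spec_from_file_location(name, path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod


def bench(fn, warmup=5, iters=50):
    # Average wall-clock time per call, with GPU synchronization.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters


parser = argparse.ArgumentParser()
parser.add_argument("--solution-path", required=True)
args = parser.parse_args()

baseline = load("baseline.py", "baseline").Model().cuda().eval()
solution = load(args.solution_path, "solution").Model().cuda().eval()

x = torch.randn(8, 512, 768, device="cuda")  # (batch, seq_len, embed_dim)
with torch.no_grad():
    # Fail fast if the optimized code is numerically wrong.
    assert torch.allclose(baseline(x), solution(x), atol=1e-3), "outputs diverge"
    # Weco reads the metric from stdout.
    print(f"speedup: {bench(lambda: baseline(x)) / bench(lambda: solution(x)):.3f}")
```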
Weco will iteratively modify `optimize.py`, potentially generating and integrating CUDA C++ code, guided by the evaluation results and the instructions in `guide.md`.
Next Steps
After mastering CUDA kernel optimization, explore Triton Optimization for a higher-level GPU programming approach that's often easier to work with. If you're interested in different types of optimization, check out Model Development for end-to-end machine learning workflows or Prompt Engineering for optimizing LLM interactions.
For more advanced usage and configuration options, visit the CLI Reference or learn about Writing Good Evaluation Scripts to improve your optimization results.
For more examples, visit the Examples section.