# CuteDSL - Accelerated ML Inference with Custom Triton and CUDA Kernels
CuteDSL converts popular ML models into optimized versions with fused operations, custom attention kernels, and reduced memory allocations – all while maintaining output equivalence.
The goal is to build the fastest possible frontier model implementations and catalog them, much like the transformers/diffusers ecosystem. Much of this work is done by autoresearch-style bots that speed models up while holding evals steady – fusing kernels with CuteDSL has been working well.
## CuteChronos2 – 24x Faster Time Series Forecasting
The first target is Amazon Chronos-2, a state-of-the-art time series forecasting model. CuteChronos2 is a from-scratch reimplementation with custom Triton kernels for every major operation:
- Unscaled tiled attention (FlashAttention-style, avoids materializing the S*S attention matrix)
- Fused RoPE (inv_freq + cos/sin + Q/K rotation in one kernel)
- Fused RMS LayerNorm + Linear (eliminates normalized intermediate tensors)
- Fused MLP (two-layer MLP without materializing the 3072-wide hidden)
- Fused preprocessing (NaN-aware normalize + arcsinh + patch + time encoding)
- C++/CUDA preprocessing kernels for NaN-aware normalization and patching
- torch.compile support with `reduce-overhead` mode
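The idea behind a fusion like the RMS LayerNorm + Linear one can be sketched in NumPy: folding the norm's scale into the matmul produces the same result without ever writing the normalized intermediate out. This is a reference sketch of the math only, not the Triton kernel; shapes, `eps`, and weight names are illustrative.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # T5-style RMS norm: FP32 variance, no mean subtraction
    var = (x.astype(np.float32) ** 2).mean(axis=-1, keepdims=True)
    return x * weight / np.sqrt(var + eps)

def rms_norm_linear_unfused(x, weight, w_proj):
    # Two steps: materializes the normalized intermediate tensor
    return rms_norm(x, weight) @ w_proj

def rms_norm_linear_fused(x, weight, w_proj, eps=1e-6):
    # One pass: fold the per-feature norm weight into the projection,
    # so the normalized tensor never exists in memory
    var = (x.astype(np.float32) ** 2).mean(axis=-1, keepdims=True)
    return (x / np.sqrt(var + eps)) @ (weight[:, None] * w_proj)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 768)).astype(np.float32)
w = rng.standard_normal(768).astype(np.float32)
p = rng.standard_normal((768, 768)).astype(np.float32)
out_fused = rms_norm_linear_fused(x, w, p)
out_unfused = rms_norm_linear_unfused(x, w, p)
```

The two paths differ only in floating-point summation order, so they agree to within normal FP32 tolerance.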
### Benchmark Results
On an RTX 5090 with Chronos-2 base (768 d_model, 12 layers), batch=1, length=512:
| Implementation | Latency (ms) | Speedup |
|---|---|---|
| Original Chronos2Pipeline | 30.9 | baseline |
| CuteChronos2 (eager) | 24.0 | 1.3x |
| CuteChronos2 (torch.compile) | 1.3 | 24.4x |
The compiled mode uses `torch.compile(mode="reduce-overhead")`, which captures CUDA graphs for near-zero kernel launch overhead.
### Quick Start
```bash
pip install uv
```

```python
import torch
```
## CuteZImage – Accelerated Text-to-Image
The second model is Z-Image Turbo, a fast text-to-image diffusion model. CuteZImage reimplements the transformer backbone with:
- Fused SiLU-gated FFN – eliminates the 10240-wide intermediate allocation
- Fused AdaLN + RMS Norm – timestep conditioning fused with normalization
- Complex-valued RoPE kernel – fused reshape + complex multiply + flatten
- from_diffusers() weight loading – load from any HuggingFace Z-Image checkpoint
Architecture: 30 main layers + 2 refiner layers, dim=3840, 30 heads, SiLU-gated FFN (hidden=10240).
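The SiLU-gated FFN follows the standard SwiGLU-style pattern; a NumPy reference of the semantics is below. This is an assumed layout (separate gate, up, and down projections) shown at toy sizes – at full model size the elementwise gate sits on the 10240-wide hidden that the fused kernel avoids writing to memory.

```python
import numpy as np

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def silu_gated_ffn(x, w_gate, w_up, w_down):
    # gate and up projections feed an elementwise SiLU gate, then a
    # down projection; the fused kernel does all of this in one launch
    # without materializing the wide hidden activation
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 32))       # toy dims; real model: dim=3840
w_gate = rng.standard_normal((32, 96)) # real hidden: 10240
w_up = rng.standard_normal((32, 96))
w_down = rng.standard_normal((96, 32))
y = silu_gated_ffn(x, w_gate, w_up, w_down)
```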
## The Triton Kernel Approach
Each kernel fuses multiple PyTorch operations into a single GPU kernel launch, eliminating intermediate tensor allocations and memory bandwidth bottlenecks:
| Kernel | What it fuses |
|---|---|
| `unscaled_attention` | QK^T + mask + softmax + V multiply |
| `rms_layernorm` | T5-style RMS norm (FP32 variance) |
| `rope` | inv_freq + cos/sin + Q/K rotation |
| `fused_rms_norm_linear` | RMS LayerNorm + linear projection |
| `fused_mlp_relu` | Two-layer MLP (linear + relu + linear) |
| `fused_preprocess` | NaN-aware normalize + arcsinh + patch + time_enc |
| `fused_silu_gate_ffn` | SiLU + gating + FFN |
| `fused_adaln_norm` | AdaLN + RMS norm |
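To make the `rope` row concrete, here is a NumPy reference of the three steps that kernel fuses: building `inv_freq`, taking cos/sin of the position angles, and rotating Q (or K). The interleaved even/odd pair layout is an assumption – some implementations rotate half-dim blocks instead.

```python
import numpy as np

def rope_rotate(q, positions, base=10000.0):
    # Steps fused by the kernel: inv_freq + cos/sin + rotation.
    dim = q.shape[-1]
    inv_freq = 1.0 / base ** (np.arange(0, dim, 2) / dim)  # (dim/2,)
    angles = positions[:, None] * inv_freq[None, :]        # (seq, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    q1, q2 = q[..., 0::2], q[..., 1::2]                    # interleaved pairs
    out = np.empty_like(q)
    out[..., 0::2] = q1 * cos - q2 * sin                   # 2D rotation per pair
    out[..., 1::2] = q1 * sin + q2 * cos
    return out

q = np.random.default_rng(2).standard_normal((16, 64))
rotated = rope_rotate(q, np.arange(16, dtype=np.float64))
```

Because each pair undergoes a pure rotation, the per-row norm of Q is preserved, and position 0 is the identity – both useful sanity checks when validating a fused kernel.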
## Adding New Models
The pattern for accelerating any model:
1. Profile the original model to identify bottleneck operations
2. Write Triton kernels that fuse multiple operations
3. Create a model class that loads original weights and uses fused kernels
4. Validate output equivalence within tight tolerance (max abs error < 1e-4)
5. Benchmark to confirm speedup
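The validation step can be as simple as a max-abs-error check between the original and fused outputs. A minimal sketch (the helper name is hypothetical, not part of the CuteDSL API):

```python
import numpy as np

def check_equivalence(reference, fused, tol=1e-4):
    # Validation step: fused output must match the original model's
    # output within a tight absolute tolerance.
    err = float(np.max(np.abs(np.asarray(reference) - np.asarray(fused))))
    if err >= tol:
        raise AssertionError(f"max abs error {err:.2e} >= {tol:.0e}")
    return err

ref = np.array([1.0, 2.0, 3.0])
fused = ref + 5e-5  # small drift from reordered floating-point math
err = check_equivalence(ref, fused)
```

Small drift is expected because fused kernels reorder floating-point operations; the tolerance just needs to stay well below anything that would move evals.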
Check out the project on GitHub: github.com/lee101/cutedsl