CuteDSL - Accelerated ML Inference with Custom Triton and CUDA Kernels

CuteDSL converts popular ML models into optimized versions with fused operations, custom attention kernels, and reduced memory allocations – all while maintaining output equivalence.

The goal is to build the fastest possible frontier model implementations and catalog them, much like the transformers/diffusers ecosystem. Much of this work is produced by autoresearch-style bots that speed up models while preserving eval scores – fusing kernels with CuteDSL has been working well.

CuteChronos2 – 24x Faster Time Series Forecasting

The first target is Amazon Chronos-2, a state-of-the-art time series forecasting model. CuteChronos2 is a from-scratch reimplementation with custom Triton kernels for every major operation:

  • Unscaled tiled attention (FlashAttention-style, avoids materializing the S*S attention matrix)
  • Fused RoPE (inv_freq + cos/sin + Q/K rotation in one kernel)
  • Fused RMS LayerNorm + Linear (eliminates normalized intermediate tensors)
  • Fused MLP (two-layer MLP without materializing the 3072-wide hidden)
  • Fused preprocessing (NaN-aware normalize + arcsinh + patch + time encoding)
  • C++/CUDA preprocessing kernels for NaN-aware normalization and patching
  • torch.compile support with reduce-overhead mode
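
As a reference for what the fused RMS LayerNorm + Linear kernel must reproduce, here is a minimal PyTorch sketch of the unfused semantics (function and argument names are illustrative, not the project's actual API):

```python
import torch

def rms_norm_linear_reference(x, norm_weight, lin_weight, eps=1e-6):
    """Reference semantics for a fused RMS-norm + linear kernel.

    T5-style RMS norm: variance computed in FP32, no mean subtraction,
    then a learned scale and a linear projection. A fused kernel computes
    the same values without writing the normalized intermediate tensor
    back to global memory.
    """
    var = x.float().pow(2).mean(dim=-1, keepdim=True)
    normed = (x.float() * torch.rsqrt(var + eps)).to(x.dtype) * norm_weight
    return normed @ lin_weight.t()
```

A Triton implementation would compute the same values tile by tile, keeping the normalized row in registers between the norm and the matmul.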

Benchmark Results

On an RTX 5090 with Chronos-2 base (768 d_model, 12 layers), batch=1, length=512:

Implementation                 Latency (ms)   Speedup
Original Chronos2Pipeline      30.9           baseline
CuteChronos2 (eager)           24.0           1.3x
CuteChronos2 (torch.compile)   1.3            24.4x

The compiled mode uses torch.compile(mode="reduce-overhead") which captures CUDA graphs for near-zero kernel launch overhead.

Quick Start

pip install uv
uv venv && source .venv/bin/activate
uv pip install -e .

# Convert and benchmark
python -m cutechronos.convert --benchmark --benchmark-compiled
import torch
from cutechronos.model import CuteChronos2Model

model = CuteChronos2Model.from_pretrained_compiled(
    "amazon/chronos-bolt-base",
    compile_mode="reduce-overhead",
)
model = model.to("cuda", torch.bfloat16)

context = torch.randn(1, 512, device="cuda")
with torch.inference_mode():
    quantile_preds = model(context)  # (batch, 21_quantiles, prediction_length)

CuteZImage – Accelerated Text-to-Image

The second model is Z-Image Turbo, a fast text-to-image diffusion model. CuteZImage reimplements the transformer backbone with:

  • Fused SiLU-gated FFN – eliminates the 10240-wide intermediate allocation
  • Fused AdaLN + RMS Norm – timestep conditioning fused with normalization
  • Complex-valued RoPE kernel – fused reshape + complex multiply + flatten
  • from_diffusers() weight loading – load from any HuggingFace Z-Image checkpoint

Architecture: 30 main layers + 2 refiner layers, dim=3840, 30 heads, SiLU-gated FFN (hidden=10240).
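
For reference, the SiLU-gated FFN that the fused kernel replaces looks like this in plain PyTorch (weight names are illustrative). The unfused form materializes two hidden-width intermediates, which is exactly what the fused version avoids:

```python
import torch
import torch.nn.functional as F

def silu_gated_ffn_reference(x, w_gate, w_up, w_down):
    # Unfused form: `gate` and `up` are both (..., hidden)-sized
    # intermediates (hidden = 10240 in Z-Image) that a fused kernel
    # never writes to global memory.
    gate = F.silu(x @ w_gate.t())    # (..., hidden)
    up = x @ w_up.t()                # (..., hidden)
    return (gate * up) @ w_down.t()  # (..., dim)
```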

The Triton Kernel Approach

Each kernel fuses multiple PyTorch operations into a single GPU kernel launch, eliminating intermediate tensor allocations and memory bandwidth bottlenecks:

Kernel                  What it fuses
unscaled_attention      QK^T + mask + softmax + V multiply
rms_layernorm           T5-style RMS norm (FP32 variance)
rope                    inv_freq + cos/sin + Q/K rotation
fused_rms_norm_linear   RMS LayerNorm + linear projection
fused_mlp_relu          Two-layer MLP (linear + relu + linear)
fused_preprocess        NaN-aware normalize + arcsinh + patch + time_enc
fused_silu_gate_ffn     SiLU + gating + FFN
fused_adaln_norm        AdaLN + RMS norm
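
The tiling strategy behind unscaled_attention can be sketched as an online-softmax loop. Here is a plain-PyTorch reference (single head, no mask, and no 1/sqrt(d) scale, matching the kernel's "unscaled" name; the scale is assumed to be applied elsewhere):

```python
import torch

def tiled_attention_reference(q, k, v, block=64):
    """Online-softmax attention over K/V tiles (FlashAttention-style).

    Processes keys/values in blocks so only an (S, block) score tile
    exists at any time, never the full (S, S) attention matrix.
    """
    S, d = q.shape
    out = torch.zeros_like(q)
    m = torch.full((S, 1), float("-inf"))  # running row max
    l = torch.zeros(S, 1)                  # running softmax denominator
    for start in range(0, S, block):
        kb, vb = k[start:start + block], v[start:start + block]
        scores = q @ kb.t()                # (S, block) tile only
        m_new = torch.maximum(m, scores.max(dim=-1, keepdim=True).values)
        p = torch.exp(scores - m_new)
        alpha = torch.exp(m - m_new)       # rescale previously accumulated partials
        l = alpha * l + p.sum(dim=-1, keepdim=True)
        out = alpha * out + p @ vb
        m = m_new
    return out / l
```

The result matches materializing the full score matrix and applying softmax, but peak memory scales with the tile size rather than the sequence length squared.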

Adding New Models

The pattern for accelerating any model:

  1. Profile the original model to identify bottleneck operations
  2. Write Triton kernels that fuse multiple operations
  3. Create a model class that loads original weights and uses fused kernels
  4. Validate output equivalence within tight tolerance (max abs error < 1e-4)
  5. Benchmark to confirm speedup
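
Step 4 can be as simple as comparing the fused model's output against the original on identical inputs. A minimal sketch (the helper name is illustrative):

```python
import torch

def assert_equivalent(reference, fused, tol=1e-4):
    """Check that the fused output matches the reference within tolerance."""
    err = (reference.float() - fused.float()).abs().max().item()
    assert err < tol, f"max abs error {err:.2e} >= {tol:.0e}"
    return err
```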

Check out the project on GitHub: github.com/lee101/cutedsl


Audio Processing with Librosa and the Espeak Phonemizer

In this tutorial, we’ll explore how to use two powerful Python libraries: Librosa for extracting audio features and the Espeak Phonemizer for converting text into phonemes. Together, these tools can significantly simplify your work with audio and speech data.

What We’ll Cover

  • Loading an audio file and extracting features using Librosa.
  • Working with several spectral features such as MFCCs, spectral centroids, and more.
  • Converting text to a phonemic representation using the Espeak Phonemizer.

1. Getting Started with Librosa

Librosa offers a rich set of functions for audio analysis. For instance, you can compute a mel-scaled spectrogram, extract mel-frequency cepstral coefficients (MFCCs), and calculate spectral properties like the centroid. Here’s a friendly example to get you started:

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load an audio file (replace with your own file path)
audio_path = 'path/to/your/audio.wav'
y, sr = librosa.load(audio_path)

# Compute a Mel-scaled spectrogram
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr)
# Convert the mel spectrogram to log scale (dB)
log_mel_spec = librosa.power_to_db(mel_spec, ref=np.max)

# Compute MFCCs (Mel-Frequency Cepstral Coefficients)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Plot the MFCCs
plt.figure(figsize=(10, 4))
librosa.display.specshow(mfccs, sr=sr, x_axis='time')
plt.colorbar(format='%+2.0f dB')
plt.title('MFCC')
plt.tight_layout()
plt.show()

In addition to these, Librosa provides functions for:

  • Chroma features: use chroma_stft, chroma_cqt, or chroma_cens to capture pitch class profiles.
  • RMS energy: compute it with rms for volume estimations.
  • Spectral properties: such as spectral_centroid, spectral_bandwidth, spectral_contrast, and spectral_rolloff.
  • Rhythm and tempo detection: try tempo and tempogram for beat tracking.

Feel free to experiment with these functions to effectively capture various characteristics of your audio signals!

2. Converting Text to Phonemes with the Espeak Phonemizer

For many speech processing applications, it’s useful to convert text into its phonemic transcription. This can be particularly handy for aligning audio with text or for linguistic analysis. The phonemizer package makes this straightforward using the Espeak backend:

from phonemizer import phonemize

# Define the text you want to convert to phonemes
text = "Hello, welcome to this audio processing tutorial!"

# Convert the text to phonemes using the Espeak backend
phonemes = phonemize(text, backend='espeak', language='en-us', strip=True)

print("Phonemic Representation:")
print(phonemes)

The above code will output the phonemic representation of the given text, which you can integrate into further audio or speech processing tasks.

Final Thoughts

By combining the power of Librosa’s audio feature extraction with the simplicity of the Espeak Phonemizer for phonemic conversion, you can build robust audio processing applications with ease. Experiment with different parameters and functions in both libraries to tailor the workflow to your specific needs.

Happy coding and enjoy exploring the fascinating world of audio processing!



Lee Penkman

Nerd/Geek, Crypto/Software/Games/VFX/ML, Multiple hat wearer


Image Processing/ML Engineer @ Canva


Sydney