CuteDSL - Accelerated ML Inference with Custom Triton and CUDA Kernels

CuteDSL converts popular ML models into optimized versions with fused operations, custom attention kernels, and reduced memory allocations – all while maintaining output equivalence.

The goal is to build the fastest possible frontier model implementations and catalog them, much like the transformers/diffusers ecosystem. Much of the work is done by autoresearch-style bots that speed up models while holding eval scores steady – fusing kernels with CuteDSL has proven an effective target for them.

CuteChronos2 – 24x Faster Time Series Forecasting

The first target is Amazon Chronos-2, a state-of-the-art time series forecasting model. CuteChronos2 is a from-scratch reimplementation with custom Triton kernels for every major operation:

  • Unscaled tiled attention (FlashAttention-style, avoids materializing the S*S attention matrix)
  • Fused RoPE (inv_freq + cos/sin + Q/K rotation in one kernel)
  • Fused RMS LayerNorm + Linear (eliminates normalized intermediate tensors)
  • Fused MLP (two-layer MLP without materializing the 3072-wide hidden)
  • Fused preprocessing (NaN-aware normalize + arcsinh + patch + time encoding)
  • C++/CUDA preprocessing kernels for NaN-aware normalization and patching
  • torch.compile support with reduce-overhead mode
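To make the "fused preprocessing" bullet concrete, here is a rough NumPy reference of what that kernel computes in a single pass – the function name and patch size are illustrative, not CuteDSL's actual API:

```python
import numpy as np

def preprocess_reference(context, patch_size=16):
    """Reference (unfused) version of the preprocessing the kernel fuses:
    NaN-aware normalization, arcsinh compression, then patching."""
    valid = ~np.isnan(context)
    mean = np.nanmean(context)
    std = np.nanstd(context) + 1e-8
    x = np.where(valid, (context - mean) / std, 0.0)  # NaN positions become 0 after normalizing
    x = np.arcsinh(x)                                 # compress heavy-tailed values
    n_patches = x.shape[-1] // patch_size
    return x[..., : n_patches * patch_size].reshape(
        *x.shape[:-1], n_patches, patch_size
    )
```

The fused Triton/CUDA version performs these steps without writing any of the intermediates back to global memory.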

Benchmark Results

On an RTX 5090 with Chronos-2 base (768 d_model, 12 layers), batch=1, length=512:

Implementation                 Latency (ms)   Speedup
Original Chronos2Pipeline      30.9           baseline
CuteChronos2 (eager)           24.0           1.3x
CuteChronos2 (torch.compile)   1.3            24.4x

The compiled mode uses torch.compile(mode="reduce-overhead"), which captures CUDA graphs for near-zero kernel launch overhead.

Quick Start

pip install uv
uv venv && source .venv/bin/activate
uv pip install -e .

# Convert and benchmark
python -m cutechronos.convert --benchmark --benchmark-compiled
import torch
from cutechronos.model import CuteChronos2Model

model = CuteChronos2Model.from_pretrained_compiled(
    "amazon/chronos-bolt-base",
    compile_mode="reduce-overhead",
)
model = model.to("cuda", torch.bfloat16)

context = torch.randn(1, 512, device="cuda")
with torch.inference_mode():
    quantile_preds = model(context)  # (batch, 21 quantiles, prediction_length)

CuteZImage – Accelerated Text-to-Image

The second model is Z-Image Turbo, a fast text-to-image diffusion model. CuteZImage reimplements the transformer backbone with:

  • Fused SiLU-gated FFN – eliminates the 10240-wide intermediate allocation
  • Fused AdaLN + RMS Norm – timestep conditioning fused with normalization
  • Complex-valued RoPE kernel – fused reshape + complex multiply + flatten
  • from_diffusers() weight loading – load from any HuggingFace Z-Image checkpoint

Architecture: 30 main layers + 2 refiner layers, dim=3840, 30 heads, SiLU-gated FFN (hidden=10240).

The Triton Kernel Approach

Each kernel fuses multiple PyTorch operations into a single GPU kernel launch, eliminating intermediate tensor allocations and memory bandwidth bottlenecks:

Kernel                  What it fuses
unscaled_attention      QK^T + mask + softmax + V multiply
rms_layernorm           T5-style RMS norm (FP32 variance)
rope                    inv_freq + cos/sin + Q/K rotation
fused_rms_norm_linear   RMS LayerNorm + linear projection
fused_mlp_relu          Two-layer MLP (linear + relu + linear)
fused_preprocess        NaN-aware normalize + arcsinh + patch + time_enc
fused_silu_gate_ffn     SiLU + gating + FFN
fused_adaln_norm        AdaLN + RMS norm

Adding New Models

The pattern for accelerating any model:

  1. Profile the original model to identify bottleneck operations
  2. Write Triton kernels that fuse multiple operations
  3. Create a model class that loads original weights and uses fused kernels
  4. Validate output equivalence within tight tolerance (max abs error < 1e-4)
  5. Benchmark to confirm speedup
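Step 4 can be as simple as an elementwise comparison against the original model's output. A generic sketch (not CuteDSL's actual test harness):

```python
import numpy as np

def assert_equivalent(ref, fused, tol=1e-4):
    """Compare reference and fused outputs; fail if the max abs error
    reaches the tolerance used throughout this project (1e-4)."""
    err = float(np.max(np.abs(np.asarray(ref) - np.asarray(fused))))
    if err >= tol:
        raise AssertionError(f"max abs error {err:.2e} exceeds {tol}")
    return err
```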

Check out the project on GitHub: github.com/lee101/cutedsl


CuteDSL - Accelerating ML Inference with Fused Triton and CUDA Kernels

Production ML inference is bottlenecked by memory bandwidth, not compute. Every PyTorch operation launches a separate GPU kernel, allocates intermediate tensors, and round-trips through global memory. CuteDSL fixes this by fusing multiple operations into single Triton/CUDA kernels – delivering up to 24x speedups while maintaining output equivalence.

The Problem

A standard transformer forward pass is death by a thousand cuts. RMS norm writes a normalized tensor to global memory, then the next linear layer reads it back. Multiply that by 12 layers, add attention, MLP, preprocessing, and postprocessing, and you get dozens of unnecessary memory round-trips per inference call.
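A quick back-of-envelope calculation shows the scale of the problem for the Chronos-2 base configuration benchmarked below, counting only the RMS-norm intermediates:

```python
# Avoidable global-memory traffic from RMS-norm intermediates alone,
# for Chronos-2 base (768 d_model, 12 layers) at batch=1, length=512 in bf16.
seq_len, d_model, bytes_per_el = 512, 768, 2
intermediate = seq_len * d_model * bytes_per_el   # one normalized tensor
round_trip = 2 * intermediate                     # written once, read back once
layers = 12
total = layers * round_trip
print(f"{total / 1e6:.1f} MB of avoidable traffic per forward pass")  # ~18.9 MB
```

That traffic buys nothing – the values could have stayed in registers the whole time.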

Kernel Fusion

The core idea: combine multiple sequential operations into a single GPU kernel. Instead of launching separate kernels for RMS norm, then Q/K/V projection, CuteDSL fuses them so the normalized values never leave registers.
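For reference, the pair of operations being fused looks like this in plain NumPy – an illustrative sketch of the math, while the real kernels operate on GPU tiles:

```python
import numpy as np

def rms_norm_then_linear(x, g, W, eps=1e-6):
    """Unfused reference: T5-style RMS norm (FP32 variance) followed by a
    linear projection. The fused kernel computes the same values but never
    writes the normalized intermediate to global memory."""
    var = np.mean(x.astype(np.float32) ** 2, axis=-1, keepdims=True)
    normed = x * g / np.sqrt(var + eps)  # this intermediate is what fusion eliminates
    return normed @ W
```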

Kernel                  Fused Operations
unscaled_attention      QK^T + mask + softmax + V multiply
rms_layernorm           T5-style RMS norm (FP32 variance)
rope                    inv_freq + cos/sin + Q/K rotation
fused_rms_norm_linear   RMS LayerNorm + linear projection
fused_rms_norm_qkv      RMS LayerNorm + Q/K/V projections
fused_mlp_relu          Two-layer MLP (linear + relu + linear)
fused_preprocess        NaN-aware normalize + arcsinh + patch + time encoding
fused_output_transform  Rearrange + sinh + unscale

Each kernel eliminates intermediate tensor allocations and reduces global memory traffic.

CuteChronos2: 24x Faster Time Series Forecasting

The first model is Amazon Chronos-2, a state-of-the-art time series forecasting model. CuteChronos2 is a from-scratch reimplementation with custom Triton kernels for every major operation, plus C++/CUDA preprocessing kernels for NaN-aware normalization and patching.

Benchmarks

On an RTX 5090 with Chronos-2 base (768 d_model, 12 layers), batch=1, length=512:

Implementation                 Latency (ms)   Speedup
Original Chronos2Pipeline      30.9           baseline
CuteChronos2 (eager)           24.0           1.3x
CuteChronos2 (torch.compile)   1.3            24.4x

Output equivalence maintained throughout: max absolute error < 1e-4.

The compiled mode uses torch.compile(mode="reduce-overhead"), which captures CUDA graphs for near-zero kernel launch overhead.

Usage

Drop-in replacement API matching HuggingFace upstream:

import torch
from cutechronos.model import CuteChronos2Model

model = CuteChronos2Model.from_pretrained("amazon/chronos-bolt-base")
model = model.to("cuda", torch.bfloat16)

context = torch.randn(1, 512, device="cuda")
with torch.inference_mode():
    quantile_preds = model(context)

For maximum performance with torch.compile:

model = CuteChronos2Model.from_pretrained_compiled(
    "amazon/chronos-bolt-base",
    compile_mode="reduce-overhead",
)

There’s also a pipeline API that handles variable-length batching:

import torch
from cutechronos.pipeline import CuteChronos2Pipeline

pipe = CuteChronos2Pipeline.from_pretrained(
    "amazon/chronos-bolt-base",
    device="cuda",
    dtype=torch.bfloat16,
)
predictions = pipe.predict(torch.randn(512), prediction_length=30)

CuteZImage: Accelerated Text-to-Image

The second model is Z-Image Turbo, a fast text-to-image diffusion model. CuteZImage reimplements the transformer backbone (30 main layers + 2 refiner layers, dim=3840, 30 heads) with:

  • Fused SiLU-gated FFN – eliminates the 10240-wide intermediate allocation
  • Fused AdaLN + RMS Norm – timestep conditioning fused with normalization
  • Complex-valued RoPE kernel – fused reshape + complex multiply + flatten for multi-axis rotations
  • from_diffusers() weight loading – load directly from any HuggingFace Z-Image checkpoint
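The SiLU-gated FFN being fused is the standard SwiGLU-style block. A small NumPy reference, with tiny illustrative dimensions rather than the real 3840/10240 sizes:

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def silu_gated_ffn(x, W_gate, W_up, W_down):
    """Unfused SiLU-gated FFN: the wide gate/up intermediates (10240-dim in
    Z-Image) are what the fused kernel avoids materializing."""
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down
```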

Design Philosophy

CuteDSL provides pure PyTorch fallbacks for every fused kernel, so models run on CPU without Triton. On GPU, kernels are swapped in transparently. No vendor lock-in – if a kernel doesn’t load, the fallback activates silently.
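A hypothetical sketch of that dispatch pattern – the availability probe and kernel call are illustrative, and the fallback here uses NumPy so the sketch runs anywhere:

```python
import numpy as np

# Probe for Triton once at import time; fall back silently if it's absent.
try:
    import triton  # noqa: F401 -- only probing availability
    HAS_TRITON = True
except ImportError:
    HAS_TRITON = False

def rms_layernorm(x, weight, eps=1e-6):
    """Use the fused kernel on GPU tensors when Triton loaded; otherwise run
    a pure fallback with identical semantics."""
    if HAS_TRITON and getattr(x, "is_cuda", False):
        raise NotImplementedError("fused Triton path (omitted in this sketch)")
    var = np.mean(np.asarray(x, dtype=np.float32) ** 2, axis=-1, keepdims=True)
    return x * weight / np.sqrt(var + eps)
```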

Autoresearch Bots

Much of this work is driven by automated research bots. They profile models, identify fusion opportunities, generate candidate Triton kernels, and validate output equivalence – all while maintaining eval metrics. The bots search the space of possible kernel fusions and keep what works. CuteDSL’s kernel fusion approach has been particularly effective as a target for this automated optimization pipeline.


Supervisor vs systemd + Monitoring and AI Agent Auto-Fix Playbooks

If you run Linux services long enough, you eventually pick a side: Supervisor or systemd. The right answer is usually: systemd for the host, Supervisor when you need a lightweight, app-level process manager inside a container or a legacy stack.

This post is a practical comparison and then a walk-through of monitoring strategies, including how we wire AI agents into monitoring to auto-fix common outages (like what we keep in ../netwrck/monitoring).

Continue reading

Audio Processing with Librosa and the Espeak Phonemizer

In this tutorial, we’ll explore how to use two powerful Python libraries: Librosa for extracting audio features and the Espeak Phonemizer for converting text into phonemes. Together, these tools can significantly simplify your work with audio and speech data.

What We’ll Cover

  • Loading an audio file and extracting features using Librosa.
  • Working with several spectral features such as MFCCs, spectral centroids, and more.
  • Converting text to a phonemic representation using the Espeak Phonemizer.

1. Getting Started with Librosa

Librosa offers a rich set of functions for audio analysis. For instance, you can compute a mel-scaled spectrogram, extract mel-frequency cepstral coefficients (MFCCs), and calculate spectral properties like the centroid. Here’s a friendly example to get you started:

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load an audio file (replace with your own file path)
audio_path = 'path/to/your/audio.wav'
y, sr = librosa.load(audio_path)

# Compute a Mel-scaled spectrogram
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr)
# Convert the mel spectrogram to log scale (dB)
log_mel_spec = librosa.power_to_db(mel_spec, ref=np.max)

# Compute MFCCs (Mel-Frequency Cepstral Coefficients)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Plot the MFCCs
plt.figure(figsize=(10, 4))
librosa.display.specshow(mfccs, sr=sr, x_axis='time')
plt.colorbar(format='%+2.0f dB')
plt.title('MFCC')
plt.tight_layout()
plt.show()

In addition to these, Librosa provides functions for:

  • Chroma features: use chroma_stft, chroma_cqt, or chroma_cens to capture pitch class profiles.
  • RMS energy: compute it with rms for volume estimations.
  • Spectral properties: such as spectral_centroid, spectral_bandwidth, spectral_contrast, and spectral_rolloff.
  • Rhythm and tempo detection: try tempo and tempogram for beat tracking.

Feel free to experiment with these functions to effectively capture various characteristics of your audio signals!
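As a peek under the hood, the spectral centroid is just the magnitude-weighted mean frequency of each frame. A rough NumPy reimplementation – librosa's actual version differs in windowing and padding details:

```python
import numpy as np

def spectral_centroid(y, sr, n_fft=2048, hop=512):
    """Roughly what librosa.feature.spectral_centroid computes: the
    magnitude-weighted mean frequency of each STFT frame."""
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop : i * hop + n_fft] for i in range(n_frames)])
    S = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1))  # magnitude spectra
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    return (S * freqs).sum(axis=1) / (S.sum(axis=1) + 1e-10)
```

For a pure 1 kHz tone, each frame's centroid lands at roughly 1000 Hz, which is a handy sanity check.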

2. Converting Text to Phonemes with the Espeak Phonemizer

For many speech processing applications, it’s useful to convert text into its phonemic transcription. This can be particularly handy for aligning audio with text or for linguistic analysis. The phonemizer package makes this straightforward using the Espeak backend:

from phonemizer import phonemize

# Define the text you want to convert to phonemes
text = "Hello, welcome to this audio processing tutorial!"

# Convert the text to phonemes using the Espeak backend
phonemes = phonemize(text, backend='espeak', language='en-us', strip=True)

print("Phonemic Representation:")
print(phonemes)

The above code will output the phonemic representation of the given text, which you can integrate into further audio or speech processing tasks.

Final Thoughts

By combining the power of Librosa’s audio feature extraction with the simplicity of the Espeak Phonemizer for phonemic conversion, you can build robust audio processing applications with ease. Experiment with different parameters and functions in both libraries to tailor the workflow to your specific needs.

Happy coding and enjoy exploring the fascinating world of audio processing!


There’s a great video on YouTube about teaching kids Lua programming through Minecraft:
https://www.youtube.com/watch?v=gyyuOyC7hzQ

Using Minecraft’s computer blocks to teach programming is an excellent approach - it makes coding fun and interactive! Let’s explore the basics of Lua programming that you’ll need to get started.

Lua Programming Basics

1. Variables and Data Types

In Lua, you can store different types of data in variables:

-- Numbers
local age = 10
local height = 1.75

-- Strings (text)
local name = "Steve"
local message = 'Hello Minecraft!'

-- Booleans
local isPlaying = true
local isSleeping = false

-- Tables (arrays/lists)
local inventory = {"sword", "pickaxe", "torch"}

2. Basic Operations

-- Math operations
local blocks = 5 + 3 -- Addition
local diamonds = 10 - 2 -- Subtraction
local torches = 4 * 3 -- Multiplication
local shares = 15 / 3 -- Division

-- String concatenation (joining text)
local firstName = "Steve"
local lastName = "Minecraft"
local fullName = firstName .. " " .. lastName

3. Control Flow

-- If statements
local diamonds = 5

if diamonds > 10 then
    print("You have lots of diamonds!")
elseif diamonds > 0 then
    print("You have some diamonds")
else
    print("No diamonds yet!")
end

-- While loops
local trees = 3
while trees > 0 do
    print("Chopping tree...")
    trees = trees - 1
end

-- For loops
for i = 1, 5 do
    print("Mining block " .. i)
end

4. Functions

-- Basic function
function sayHello(playerName)
    return "Hello, " .. playerName .. "!"
end

-- Using the function
local greeting = sayHello("Alex")
print(greeting)

-- Function with multiple returns
function getPlayerStats()
    return "Steve", 100, 20 -- name, health, armor
end

local name, health, armor = getPlayerStats()

5. Tables (Arrays and Dictionaries)

-- Creating a table as an array
local blocks = {"dirt", "stone", "wood"}
print(blocks[1]) -- prints "dirt"

-- Table as a dictionary
local player = {
    name = "Steve",
    health = 20,
    inventory = {
        diamonds = 5,
        wood = 64
    }
}

print(player.name) -- prints "Steve"
print(player.inventory.diamonds) -- prints 5

Minecraft ComputerCraft Commands

Here are some essential turtle commands you’ll use in ComputerCraft:

-- Movement
turtle.forward() -- Move forward
turtle.back() -- Move backward
turtle.up() -- Move up
turtle.down() -- Move down
turtle.turnLeft() -- Turn left
turtle.turnRight()-- Turn right

-- Actions
turtle.dig() -- Mine block in front
turtle.digUp() -- Mine block above
turtle.digDown() -- Mine block below
turtle.place() -- Place block from selected slot
turtle.select(1) -- Select inventory slot 1

Fun Projects for Learning

1. Simple Tree Chopper

function chopTree()
    -- Check if there's a tree in front
    while turtle.detect() do
        turtle.dig()
        print("Chopping block...")
        turtle.up()
    end

    -- Return to ground
    while not turtle.detectDown() do
        turtle.down()
    end

    print("Tree chopped!")
end

-- Run the program
chopTree()

2. Automatic Farm Builder

function buildFarm(width, length)
    -- Place dirt blocks in a rectangle
    for w = 1, width do
        for l = 1, length do
            turtle.placeDown()
            turtle.forward()
        end

        -- Turn around at the end of each row
        if w < width then
            if w % 2 == 1 then
                turtle.turnRight()
                turtle.forward()
                turtle.turnRight()
            else
                turtle.turnLeft()
                turtle.forward()
                turtle.turnLeft()
            end
        end
    end
end

-- Build a 3x4 farm
buildFarm(3, 4)

3. House Builder

function buildWall(length)
    for i = 1, length do
        turtle.place()
        turtle.forward()
    end
end

function buildHouse(size)
    -- Build four walls
    for i = 1, 4 do
        buildWall(size)
        turtle.turnRight()
    end

    -- Build roof
    turtle.up()
    for i = 1, size do
        for j = 1, size do
            turtle.placeDown()
            if j < size then
                turtle.forward()
            end
        end
        if i < size then
            turtle.turnRight()
            turtle.forward()
            turtle.turnLeft()
            -- turtle.back() takes no arguments, so step back one block at a time
            for j = 1, size - 1 do
                turtle.back()
            end
        end
    end
end

-- Build a 5x5 house
buildHouse(5)

Practice Exercises

  1. Beginner: Make a turtle that digs a 3x3 hole
  2. Intermediate: Create a program that plants saplings in a checkerboard pattern
  3. Advanced: Build a multi-story building with windows and doors

Tips for Teaching Kids

  1. Start Small: Begin with simple programs that show immediate results
  2. Visual Feedback: Use turtle commands that provide visual feedback
  3. Encourage Experimentation: Let kids modify the code and see what happens
  4. Debug Together: When something goes wrong, use it as a learning opportunity
  5. Project-Based Learning: Create goals like “build a house” or “create a farm”

Common Mistakes to Watch For

  • Forgetting to fuel the turtle
  • Not checking if the turtle has enough blocks
  • Infinite loops (always have a way to stop the program)
  • Not handling errors when the turtle is blocked

Next Steps

Once comfortable with these basics, you can explore:

  • Reading and writing files
  • Using redstone integration
  • Creating graphical interfaces with monitors
  • Building complex automation systems

Remember, the key to teaching kids programming is making it fun and relevant to their interests. Minecraft provides an excellent platform for this, as they can immediately see the results of their code in a familiar and engaging environment.


Happy coding in Minecraft! 🎮👾


DeepSeek-R1: Advancing Reasoning Capabilities Through Pure Reinforcement Learning

DeepSeek recently released their DeepSeek-R1 model, achieving reasoning capabilities on par with OpenAI’s o1 models through pure reinforcement learning. Let’s explore how they did it and what Hugging Face is doing with Open-R1.

What is DeepSeek-R1?

If you’ve ever struggled with a tough math problem, you know how useful it is to think longer and work through it carefully. OpenAI’s o1 model showed that when LLMs are trained to do the same—by using more compute during inference—they get significantly better at solving reasoning tasks like mathematics, coding, and logic.

However, the recipe behind OpenAI’s reasoning models has been a well-kept secret. That is, until last week, when DeepSeek released their DeepSeek-R1 model and promptly broke the internet (and the stock market!).

Besides performing as well or better than o1, the DeepSeek-R1 release was accompanied by a detailed tech report outlining their training recipe. This recipe involved several innovations, most notably the application of pure reinforcement learning to teach a base language model how to reason without any human supervision.

The Training Process

DeepSeek-R1 is built on the foundation of DeepSeek-V3, a 671B-parameter Mixture of Experts (MoE) model that performs on par with models like Claude 3.5 Sonnet and GPT-4o. What’s especially impressive is how cost-efficient it was to train – just $5.5M – thanks to architectural optimizations.

The training process involved two key models:

  1. DeepSeek-R1-Zero: This model skipped supervised fine-tuning entirely and relied on pure reinforcement learning using Group Relative Policy Optimization (GRPO). A simple reward system guided the model based on answer accuracy and structure. While it developed strong reasoning skills, its outputs often lacked clarity.

  2. DeepSeek-R1: This model started with a “cold start” phase using carefully crafted examples to improve clarity. It then went through multiple rounds of RL and refinement, including rejecting low-quality outputs using both human preference and verifiable rewards.
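The core of GRPO's reward signal is easy to sketch: each sampled completion in a group is scored relative to the group's own mean and standard deviation, which removes the need for a learned value function. A minimal illustration, not DeepSeek's actual code:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO: normalize each sampled
    completion's reward by the mean and std of its own group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)
```

Completions above their group's average get positive advantages and are reinforced; those below get negative ones, all without a critic network.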

The Open-R1 Project

While DeepSeek released their model weights, the datasets and training code remain closed. This prompted Hugging Face to launch the Open-R1 project, which aims to:

  1. Replicate R1-Distill models by distilling reasoning datasets from DeepSeek-R1
  2. Recreate the pure RL pipeline used for R1-Zero
  3. Demonstrate the complete training pipeline from base model → SFT → RL

The project will focus on:

  • Creating synthetic datasets for fine-tuning LLMs into reasoning models
  • Developing training recipes for building similar models from scratch
  • Exploring applications beyond math into areas like code and medicine

Key Innovations and Results

Some notable achievements of DeepSeek-R1 include:

  • 79.8% Pass@1 on AIME 2024, surpassing OpenAI-o1-1217
  • 97.3% score on MATH-500
  • 2,029 Elo rating on Codeforces (outperforming 96.3% of human participants)
  • Strong performance on knowledge benchmarks like MMLU (90.8%) and MMLU-Pro (84.0%)

Looking Forward

The release of DeepSeek-R1 represents a significant step forward in open-source AI development. By demonstrating that pure reinforcement learning can create powerful reasoning models, it opens new possibilities for advancing AI capabilities without relying on extensive human supervision.

The Open-R1 project aims to make these advances even more accessible to the research community, potentially accelerating progress in areas like mathematical reasoning, coding, and scientific problem-solving.

Try DeepSeek on Netwrck


Lee Penkman

Nerd/Geek, Crypto/Software/Games/VFX/ML, Multiple hat wearer


Image Processing/ML Engineer @ Canva


Sydney