If you read my earlier post on teaching kids programming with Minecraft and Lua, you know I think game-based coding is one of the best ways to get kids hooked on programming. But ComputerCraft isn’t the only option. Let’s compare the four major platforms parents and educators actually use.
Continue reading

CuteDSL - Accelerated ML Inference with Custom Triton and CUDA Kernels
CuteDSL converts popular ML models into optimized versions with fused operations, custom attention kernels, and reduced memory allocations – all while maintaining output equivalence.
The goal is to build the fastest possible frontier model implementations and catalog them, much like the transformers/diffusers ecosystem. A lot of this is created by autoresearch-style bots that try to maintain evals while speeding up models – fusing kernels with CuteDSL has been working well.
CuteChronos2 – 24x Faster Time Series Forecasting
The first target is Amazon Chronos-2, a state-of-the-art time series forecasting model. CuteChronos2 is a from-scratch reimplementation with custom Triton kernels for every major operation:
- Unscaled tiled attention (FlashAttention-style, avoids materializing the S*S attention matrix)
- Fused RoPE (inv_freq + cos/sin + Q/K rotation in one kernel)
- Fused RMS LayerNorm + Linear (eliminates normalized intermediate tensors)
- Fused MLP (two-layer MLP without materializing the 3072-wide hidden)
- Fused preprocessing (NaN-aware normalize + arcsinh + patch + time encoding)
- C++/CUDA preprocessing kernels for NaN-aware normalization and patching
- torch.compile support with `reduce-overhead` mode
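The fused preprocessing is easier to see in scalar form. Here is a minimal stdlib sketch of the NaN-aware normalize + arcsinh part only (illustrative, not the actual kernel, which runs on GPU tensors and also handles patching and time encoding):

```python
import math

def preprocess(series):
    """Reference semantics for NaN-aware normalize + arcsinh.

    Mirrors what a fused kernel does in one pass: compute mean/std over
    the non-NaN entries, normalize, then apply arcsinh compression.
    NaNs become 0.0 after normalization (an illustrative choice).
    """
    valid = [x for x in series if not math.isnan(x)]
    mean = sum(valid) / len(valid)
    var = sum((x - mean) ** 2 for x in valid) / len(valid)
    std = math.sqrt(var) or 1.0  # guard against constant series
    out = []
    for x in series:
        if math.isnan(x):
            out.append(0.0)
        else:
            out.append(math.asinh((x - mean) / std))
    return out
```

In the fused version, none of the intermediates (the mask, the normalized values) are ever written to global memory.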
Benchmark Results
On an RTX 5090 with Chronos-2 base (768 d_model, 12 layers), batch=1, length=512:
| Implementation | Latency (ms) | Speedup |
|---|---|---|
| Original Chronos2Pipeline | 30.9 | baseline |
| CuteChronos2 (eager) | 24.0 | 1.3x |
| CuteChronos2 (torch.compile) | 1.3 | 24.4x |
The compiled mode uses torch.compile(mode="reduce-overhead") which captures CUDA graphs for near-zero kernel launch overhead.
Quick Start
```
pip install uv
```

```
import torch
```
CuteZImage – Accelerated Text-to-Image
The second model is Z-Image Turbo, a fast text-to-image diffusion model. CuteZImage reimplements the transformer backbone with:
- Fused SiLU-gated FFN – eliminates the 10240-wide intermediate allocation
- Fused AdaLN + RMS Norm – timestep conditioning fused with normalization
- Complex-valued RoPE kernel – fused reshape + complex multiply + flatten
- from_diffusers() weight loading – load from any HuggingFace Z-Image checkpoint
Architecture: 30 main layers + 2 refiner layers, dim=3840, 30 heads, SiLU-gated FFN (hidden=10240).
The Triton Kernel Approach
Each kernel fuses multiple PyTorch operations into a single GPU kernel launch, eliminating intermediate tensor allocations and memory bandwidth bottlenecks:
| Kernel | What it fuses |
|---|---|
| `unscaled_attention` | QK^T + mask + softmax + V multiply |
| `rms_layernorm` | T5-style RMS norm (FP32 variance) |
| `rope` | inv_freq + cos/sin + Q/K rotation |
| `fused_rms_norm_linear` | RMS LayerNorm + linear projection |
| `fused_mlp_relu` | Two-layer MLP (linear + relu + linear) |
| `fused_preprocess` | NaN-aware normalize + arcsinh + patch + time_enc |
| `fused_silu_gate_ffn` | SiLU + gating + FFN |
| `fused_adaln_norm` | AdaLN + RMS norm |
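To make one row concrete, the SiLU-gated FFN has simple per-vector semantics. A stdlib sketch of the math (weights are plain lists here; the fused kernel computes the same thing without materializing `gate` and `up` in global memory):

```python
import math

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def silu_gated_ffn(x, w_gate, w_up, w_down):
    """Reference semantics of a SiLU-gated FFN on one vector:
    out = ((silu(W_gate @ x)) * (W_up @ x)) @ W_down.
    """
    def matvec(w, v):
        return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)
```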
Adding New Models
The pattern for accelerating any model:
- Profile the original model to identify bottleneck operations
- Write Triton kernels that fuse multiple operations
- Create a model class that loads original weights and uses fused kernels
- Validate output equivalence within tight tolerance (max abs error < 1e-4)
- Benchmark to confirm speedup
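The equivalence step is simple to state precisely. A sketch of the check, with flat float lists standing in for tensors (the real comparison runs over model outputs):

```python
def max_abs_error(a, b):
    """Maximum absolute elementwise difference between two outputs."""
    return max(abs(x - y) for x, y in zip(a, b))

def validate(reference, fused, tol=1e-4):
    """Accept the fused implementation only if every element of its
    output is within tol of the reference implementation's output."""
    err = max_abs_error(reference, fused)
    return err < tol, err
```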
Check out the project on GitHub: github.com/lee101/cutedsl
CuteDSL - Accelerating ML Inference with Fused Triton and CUDA Kernels
Production ML inference is bottlenecked by memory bandwidth, not compute. Every PyTorch operation launches a separate GPU kernel, allocates intermediate tensors, and round-trips through global memory. CuteDSL fixes this by fusing multiple operations into single Triton/CUDA kernels – delivering up to 24x speedups while maintaining output equivalence.
The Problem
A standard transformer forward pass is death by a thousand cuts. RMS norm writes a normalized tensor to global memory, then the next linear layer reads it back. Multiply that by 12 layers, add attention, MLP, preprocessing, and postprocessing – you get dozens of unnecessary memory round-trips per inference call.
Kernel Fusion
The core idea: combine multiple sequential operations into a single GPU kernel. Instead of launching separate kernels for RMS norm, then Q/K/V projection, CuteDSL fuses them so the normalized values never leave registers.
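For reference, the RMS norm half of that fusion is just a few operations; a stdlib sketch of the math (the real kernel works on GPU tiles and keeps these values in registers):

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """T5-style RMS norm on one vector: no mean subtraction, no bias.

    variance = mean(x^2), accumulated in full precision (the kernel
    does this in FP32 even for FP16 inputs);
    y = x * rsqrt(variance + eps) * weight.
    """
    variance = sum(v * v for v in x) / len(x)
    inv = 1.0 / math.sqrt(variance + eps)
    return [v * inv * w for v, w in zip(x, weight)]
```

In the unfused path, the list returned here is a whole tensor written to and re-read from global memory before the next projection touches it.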
| Kernel | Fused Operations |
|---|---|
| `unscaled_attention` | QK^T + mask + softmax + V multiply |
| `rms_layernorm` | T5-style RMS norm (FP32 variance) |
| `rope` | inv_freq + cos/sin + Q/K rotation |
| `fused_rms_norm_linear` | RMS LayerNorm + linear projection |
| `fused_rms_norm_qkv` | RMS LayerNorm + Q/K/V projections |
| `fused_mlp_relu` | Two-layer MLP (linear + relu + linear) |
| `fused_preprocess` | NaN-aware normalize + arcsinh + patch + time encoding |
| `fused_output_transform` | Rearrange + sinh + unscale |
Each kernel eliminates intermediate tensor allocations and reduces global memory traffic.
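As an illustration of what the `unscaled_attention` row computes, here is a small pure-Python reference ("unscaled" presumably meaning no 1/sqrt(d) factor, matching T5-style attention). Note this reference materializes the full score matrix, which is exactly the memory cost the tiled kernel avoids:

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(q, k, v, mask=None):
    """Reference for QK^T + (mask) + softmax + V, single head,
    with q/k/v as lists of row vectors."""
    scores = [[sum(qi * ki for qi, ki in zip(qr, kr)) for kr in k] for qr in q]
    if mask is not None:
        scores = [[s + m for s, m in zip(sr, mr)] for sr, mr in zip(scores, mask)]
    weights = [softmax(sr) for sr in scores]
    return [[sum(w * vr[j] for w, vr in zip(wr, v)) for j in range(len(v[0]))]
            for wr in weights]
```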
CuteChronos2: 24x Faster Time Series Forecasting
The first model is Amazon Chronos-2, a state-of-the-art time series forecasting model. CuteChronos2 is a from-scratch reimplementation with custom Triton kernels for every major operation, plus C++/CUDA preprocessing kernels for NaN-aware normalization and patching.
Benchmarks
On an RTX 5090 with Chronos-2 base (768 d_model, 12 layers), batch=1, length=512:
| Implementation | Latency (ms) | Speedup |
|---|---|---|
| Original Chronos2Pipeline | 30.9 | baseline |
| CuteChronos2 (eager) | 24.0 | 1.3x |
| CuteChronos2 (torch.compile) | 1.3 | 24.4x |
Output equivalence maintained throughout: max absolute error < 1e-4.
The compiled mode uses torch.compile(mode="reduce-overhead") which captures CUDA graphs for near-zero kernel launch overhead.
Usage
Drop-in replacement API matching HuggingFace upstream:
```
import torch
```
For maximum performance with torch.compile:
```
model = CuteChronos2Model.from_pretrained_compiled(
```
There’s also a pipeline API that handles variable-length batching:
```
from cutechronos.pipeline import CuteChronos2Pipeline
```
CuteZImage: Accelerated Text-to-Image
The second model is Z-Image Turbo, a fast text-to-image diffusion model. CuteZImage reimplements the transformer backbone (30 main layers + 2 refiner layers, dim=3840, 30 heads) with:
- Fused SiLU-gated FFN – eliminates the 10240-wide intermediate allocation
- Fused AdaLN + RMS Norm – timestep conditioning fused with normalization
- Complex-valued RoPE kernel – fused reshape + complex multiply + flatten for multi-axis rotations
- from_diffusers() weight loading – load directly from any HuggingFace Z-Image checkpoint
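The complex-multiply view of RoPE is compact enough to show directly. A stdlib sketch of the math for one position (the kernel fuses this rotation with the reshape and flatten around it; pairing and frequencies here are illustrative):

```python
import cmath

def rope_rotate(pairs, position, inv_freqs):
    """Rotary embedding as complex multiplication.

    Each (even, odd) feature pair is treated as a complex number and
    rotated by angle = position * inv_freq. Rotation preserves the
    pair's magnitude; only its phase encodes position.
    """
    out = []
    for (a, b), inv_freq in zip(pairs, inv_freqs):
        z = complex(a, b) * cmath.exp(1j * position * inv_freq)
        out.append((z.real, z.imag))
    return out
```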
Design Philosophy
CuteDSL provides pure PyTorch fallbacks for every fused kernel, so models run on CPU without Triton. On GPU, kernels are swapped in transparently. No vendor lock-in – if a kernel doesn’t load, the fallback activates silently.
Autoresearch Bots
Much of this work is driven by automated research bots. They profile models, identify fusion opportunities, generate candidate Triton kernels, and validate output equivalence – all while maintaining eval metrics. The bots search the space of possible kernel fusions and keep what works. CuteDSL’s kernel fusion approach has been particularly effective as a target for this automated optimization pipeline.
Links
- GitHub: github.com/lee101/cutedsl
Supervisor vs systemd + Monitoring and AI Agent Auto-Fix Playbooks
If you run Linux services long enough, you eventually pick a side: Supervisor or systemd. The right answer is usually: systemd for the host, Supervisor when you need a lightweight, app-level process manager inside a container or a legacy stack.
This post is a practical comparison and then a walk-through of monitoring strategies, including how we wire AI agents into monitoring to auto-fix common outages (like what we keep in ../netwrck/monitoring).
Audio Processing with Librosa and the Espeak Phonemizer
In this tutorial, we’ll explore how to use two powerful Python libraries: Librosa for extracting audio features and the Espeak Phonemizer for converting text into phonemes. Together, these tools can significantly simplify your work with audio and speech data.
What We’ll Cover
- Loading an audio file and extracting features using Librosa.
- Working with several spectral features such as MFCCs, spectral centroids, and more.
- Converting text to a phonemic representation using the Espeak Phonemizer.
1. Getting Started with Librosa
Librosa offers a rich set of functions for audio analysis. For instance, you can compute a mel-scaled spectrogram, extract mel-frequency cepstral coefficients (MFCCs), and calculate spectral properties like the centroid. Here’s a friendly example to get you started:
```
import numpy as np
```
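To see what one of these features actually measures: the spectral centroid is just the magnitude-weighted mean frequency of a frame. A dependency-free sketch of the formula that `librosa.feature.spectral_centroid` implements per frame (frequencies and magnitudes here are made-up inputs, not a real STFT):

```python
def spectral_centroid(magnitudes, freqs):
    """Magnitude-weighted mean frequency of one spectral frame:
    sum(m * f) / sum(m). Higher values = 'brighter' sound."""
    total = sum(magnitudes)
    return sum(m * f for m, f in zip(magnitudes, freqs)) / total
```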
In addition to these, Librosa provides functions for:
- Chroma features: use `chroma_stft`, `chroma_cqt`, or `chroma_cens` to capture pitch class profiles.
- RMS energy: compute it with `rms` for volume estimations.
- Spectral properties: such as `spectral_centroid`, `spectral_bandwidth`, `spectral_contrast`, and `spectral_rolloff`.
- Rhythm and tempo detection: try `tempo` and `tempogram` for beat tracking.
Feel free to experiment with these functions to effectively capture various characteristics of your audio signals!
2. Converting Text to Phonemes with the Espeak Phonemizer
For many speech processing applications, it’s useful to convert text into its phonemic transcription. This can be particularly handy for aligning audio with text or for linguistic analysis. The phonemizer package makes this straightforward using the Espeak backend:
```
from phonemizer import phonemize
```
The above code will output the phonemic representation of the given text, which you can integrate into further audio or speech processing tasks.
Final Thoughts
By combining the power of Librosa’s audio feature extraction with the simplicity of the Espeak Phonemizer for phonemic conversion, you can build robust audio processing applications with ease. Experiment with different parameters and functions in both libraries to tailor the workflow to your specific needs.
Happy coding and enjoy exploring the fascinating world of audio processing!
There’s a great video on YouTube about teaching kids Lua programming through Minecraft:
https://www.youtube.com/watch?v=gyyuOyC7hzQ
Using Minecraft’s computer blocks to teach programming is an excellent approach - it makes coding fun and interactive! Let’s explore the basics of Lua programming that you’ll need to get started.
Lua Programming Basics
1. Variables and Data Types
In Lua, you can store different types of data in variables:
```lua
-- Numbers
```
2. Basic Operations
```lua
-- Math operations
```
3. Control Flow
```lua
-- If statements
```
4. Functions
```lua
-- Basic function
```
5. Tables (Arrays and Dictionaries)
```lua
-- Creating a table as an array
```
Minecraft ComputerCraft Commands
Here are some essential turtle commands you’ll use in ComputerCraft:
```lua
-- Movement
```
Fun Projects for Learning
1. Simple Tree Chopper
```lua
function chopTree()
```
2. Automatic Farm Builder
```lua
function buildFarm(width, length)
```
3. House Builder
```lua
function buildWall(length)
```
Practice Exercises
- Beginner: Make a turtle that digs a 3x3 hole
- Intermediate: Create a program that plants saplings in a checkerboard pattern
- Advanced: Build a multi-story building with windows and doors
Tips for Teaching Kids
- Start Small: Begin with simple programs that show immediate results
- Visual Feedback: Use turtle commands that provide visual feedback
- Encourage Experimentation: Let kids modify the code and see what happens
- Debug Together: When something goes wrong, use it as a learning opportunity
- Project-Based Learning: Create goals like “build a house” or “create a farm”
Common Mistakes to Watch For
- Forgetting to fuel the turtle
- Not checking if the turtle has enough blocks
- Infinite loops (always have a way to stop the program)
- Not handling errors when the turtle is blocked
Next Steps
Once comfortable with these basics, you can explore:
- Reading and writing files
- Using redstone integration
- Creating graphical interfaces with monitors
- Building complex automation systems
Remember, the key to teaching kids programming is making it fun and relevant to their interests. Minecraft provides an excellent platform for this, as they can immediately see the results of their code in a familiar and engaging environment.
Resources
Happy coding in Minecraft! 🎮👾
DeepSeek-R1: Advancing Reasoning Capabilities Through Pure Reinforcement Learning
DeepSeek recently released their DeepSeek-R1 model, achieving reasoning capabilities on par with OpenAI’s o1 models through pure reinforcement learning. Let’s explore how they did it and what Hugging Face is doing with Open-R1.
What is DeepSeek-R1?
If you’ve ever struggled with a tough math problem, you know how useful it is to think longer and work through it carefully. OpenAI’s o1 model showed that when LLMs are trained to do the same—by using more compute during inference—they get significantly better at solving reasoning tasks like mathematics, coding, and logic.
However, the recipe behind OpenAI’s reasoning models has been a well-kept secret. That is, until last week, when DeepSeek released their DeepSeek-R1 model and promptly broke the internet (and the stock market!).
Besides performing as well or better than o1, the DeepSeek-R1 release was accompanied by a detailed tech report outlining their training recipe. This recipe involved several innovations, most notably the application of pure reinforcement learning to teach a base language model how to reason without any human supervision.
The Training Process
DeepSeek-R1 is built on the foundation of DeepSeek-V3, a 671B parameter Mixture of Experts (MoE) model that performs on par with models like Sonnet 3.5 and GPT-4o. What’s especially impressive is how cost-efficient it was to train—just $5.5M—thanks to architectural optimizations.
The training process involved two key models:
DeepSeek-R1-Zero: This model skipped supervised fine-tuning entirely and relied on pure reinforcement learning using Group Relative Policy Optimization (GRPO). A simple reward system guided the model based on answer accuracy and structure. While it developed strong reasoning skills, its outputs often lacked clarity.
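The “group relative” part of GRPO can be sketched in a few lines: sample several completions for the same prompt, score them, and use each completion’s reward relative to its group as the advantage, so no learned value network is needed. A simplified sketch of just that advantage computation (the full objective also has PPO-style clipping and a KL penalty):

```python
import math

def grpo_advantages(rewards):
    """Group-relative advantages: for a group of completions to the
    same prompt, advantage_i = (r_i - mean(group)) / std(group)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    if std == 0:
        # All completions scored the same: no learning signal.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]
```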
DeepSeek-R1: This model started with a “cold start” phase using carefully crafted examples to improve clarity. It then went through multiple rounds of RL and refinement, including rejecting low-quality outputs using both human preference and verifiable rewards.
The Open-R1 Project
While DeepSeek released their model weights, the datasets and training code remain closed. This prompted Hugging Face to launch the Open-R1 project, which aims to:
- Replicate R1-Distill models by distilling reasoning datasets from DeepSeek-R1
- Recreate the pure RL pipeline used for R1-Zero
- Demonstrate the complete training pipeline from base model → SFT → RL
The project will focus on:
- Creating synthetic datasets for fine-tuning LLMs into reasoning models
- Developing training recipes for building similar models from scratch
- Exploring applications beyond math into areas like code and medicine
Key Innovations and Results
Some notable achievements of DeepSeek-R1 include:
- 79.8% Pass@1 on AIME 2024, surpassing OpenAI-o1-1217
- 97.3% score on MATH-500
- 2,029 Elo rating on Codeforces (outperforming 96.3% of human participants)
- Strong performance on knowledge benchmarks like MMLU (90.8%) and MMLU-Pro (84.0%)
Looking Forward
The release of DeepSeek-R1 represents a significant step forward in open-source AI development. By demonstrating that pure reinforcement learning can create powerful reasoning models, it opens new possibilities for advancing AI capabilities without relying on extensive human supervision.
The Open-R1 project aims to make these advances even more accessible to the research community, potentially accelerating progress in areas like mathematical reasoning, coding, and scientific problem-solving.
Try Deepseek on Netwrck
Big Multiplayer Chess is a multiplayer free-for-all chess variant where many players share a large board: pawns can move in any direction, and castles, bishops, and queens can slide up to 8 squares. Some metric or heuristic must be used to score how good a board configuration is for a player.
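The simplest such heuristic is material count weighted by piece values. A hypothetical sketch (the board encoding and the piece values are illustrative, not the game’s actual representation):

```python
# Conventional piece values; illustrative only.
PIECE_VALUES = {"pawn": 1, "knight": 3, "bishop": 3, "castle": 5, "queen": 9}

def material_score(board, player):
    """Score a configuration for one player: own material minus the
    combined material of all other players (free-for-all, so every
    other player counts as an opponent).

    board: dict mapping (x, y) -> (owner, piece_name).
    """
    score = 0
    for owner, piece in board.values():
        value = PIECE_VALUES.get(piece, 0)
        score += value if owner == player else -value
    return score
```

A real evaluator would add positional terms (mobility, king safety), but material difference is the standard starting point.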
Continue reading
I created webfiddle.net which lets you easily add your own CSS and JavaScript to the web and share the results.
Part of the product is a proxy server which injects your code. webfiddle.net is currently going fairly viral (2M requests in the last 5 days) and costing too much!
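The core of such an injecting proxy is a tiny HTML rewrite step. A toy sketch (a real proxy must also rewrite links, handle content encodings, and escape user input; function and names are illustrative, not webfiddle’s actual code):

```python
def inject_snippet(html, css, js):
    """Insert user CSS/JS just before </body> of a proxied page,
    falling back to appending if no </body> tag exists."""
    snippet = f"<style>{css}</style><script>{js}</script>"
    if "</body>" in html:
        return html.replace("</body>", snippet + "</body>", 1)
    return html + snippet
```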
Continue reading