[DEPRECATED] Moved to ROCm/rocm-systems repo
Updated Mar 24, 2026 - Python
Online CUDA Occupancy Calculator
(Spring 2017) Assignment 2: GPU Executor
Runs a single CUDA/OpenCL kernel, taking its source from a file and arguments from the command-line
GPU Drano: static analysis for GPU programs.
Prototype for a SPIR-V assembler and disassembler. It provides a composable Java interface for generating SPIR-V code at runtime.
A self-hosted low-level functional-style programming language 🌀
High-performance GPU-accelerated C# scripting for Rhino Grasshopper, powered by ILGPU
Medical AI diagnostics system implementing real compiled Mojo GPU kernels with MAX Graph integration
🍭 Sweet GPU compute kernels in CUDA, wrapped via CuPy
Runtime correctness checker for custom CUDA kernels. Attach a single decorator to periodically verify outputs against a reference implementation, with outlier-biased sampling and zero training graph impact.
A lightweight utility for monitoring and analyzing Triton kernel compilation cache behavior.
16-step CUDA optimization of FlashAttention-2 achieving 99.2% of official performance on A100 — Ampere architecture
Benchmarking hand-written CUDA C, Numba, and Triton self-attention kernels against PyTorch's SDPA - how fast can you go depending on the tool?
LLM primitives rebuilt in Triton — FlashAttention 2.52×, fused AdamW 3.45×, Bias+GELU 14.65× faster than PyTorch
Triton optimizations run on AMD GPUs