Helion competion - Kernel optimization by alivenkickin · Pull Request #134 · gpu-mode/reference-kernels

alivenkickin · 2026-03-15T00:44:31Z

Submission for the Helion Hackathon: optimizes the code for several kernels.

Roughly 1.5x to 2.2x improvement across the three shapes (the author's summary: ~2x).

Optimized:
Benchmark 0: 0.0047 ms (min=0.0045, max=0.0051) {'num_tokens': 256, 'hidden_dim': 4096, 'group_size': 128, 'seed': 2146}
Benchmark 1: 0.0069 ms (min=0.0068, max=0.0075) {'num_tokens': 256, 'hidden_dim': 8192, 'group_size': 128, 'seed': 3129}
Benchmark 2: 0.0719 ms (min=0.0718, max=0.0725) {'num_tokens': 4096, 'hidden_dim': 7168, 'group_size': 128, 'seed': 54352}

Baseline:
Benchmark 0: 0.0070 ms (min=0.0068, max=0.0075) {'num_tokens': 256, 'hidden_dim': 4096, 'group_size': 128, 'seed': 2146}
Benchmark 1: 0.0120 ms (min=0.0118, max=0.0125) {'num_tokens': 256, 'hidden_dim': 8192, 'group_size': 128, 'seed': 3129}
Benchmark 2: 0.1563 ms (min=0.1562, max=0.1570) {'num_tokens': 4096, 'hidden_dim': 7168, 'group_size': 128, 'seed': 54352}
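As a quick check on the headline figure, the per-shape speedup can be computed directly from the mean latencies in the benchmark output above. This is only a small arithmetic sketch; the helper name `speedups` is made up for illustration and is not part of the PR.

```python
# Mean latencies in milliseconds, copied from the benchmark output above.
baseline_ms = [0.0070, 0.0120, 0.1563]
optimized_ms = [0.0047, 0.0069, 0.0719]


def speedups(baseline, optimized):
    """Return the baseline/optimized latency ratio for each benchmark."""
    return [b / o for b, o in zip(baseline, optimized)]


ratios = speedups(baseline_ms, optimized_ms)
for i, ratio in enumerate(ratios):
    print(f"Benchmark {i}: {ratio:.2f}x")
```

The smallest shape gains the least and the largest shape gains the most, so a single "~2x" figure is best read as describing the upper end of the range.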

Results:
Benchmark 0: 0.0231 ms (min=0.0230, max=0.0236) {'B': 1, 'D': 1536, 'S': 2048, 'W': 4, 'seed': 2146}
Benchmark 1: 0.0352 ms (min=0.0349, max=0.0361) {'B': 1, 'D': 2560, 'S': 2048, 'W': 4, 'seed': 3129}
Benchmark 2: 0.0651 ms (min=0.0650, max=0.0659) {'B': 1, 'D': 2560, 'S': 4096, 'W': 4, 'seed': 54352}

Baseline:
Benchmark 0: 0.2165 ms (min=0.2163, max=0.2176) {'B': 1, 'D': 1536, 'S': 2048, 'W': 4, 'seed': 2146}
Benchmark 1: 0.3577 ms (min=0.3574, max=0.3587) {'B': 1, 'D': 2560, 'S': 2048, 'W': 4, 'seed': 3129}
Benchmark 2: 0.7097 ms (min=0.7094, max=0.7108) {'B': 1, 'D': 2560, 'S': 4096, 'W': 4, 'seed': 54352}

Optimizations:
1. Eliminated tripled accumulator: replaced acc1/acc2/acc3 averaged with a single acc, removing 3x redundant compute and memory reads
2. Larger block sizes: [1, 64] or [1, 128] instead of [1, 8], increasing parallelism
3. More warps: num_warps=2-4 instead of 1, better SM utilization
4. More pipeline stages: num_stages=2-4 for large shapes to hide HBM latency via software pipelining
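The first optimization can be illustrated outside Helion. The sketch below is plain Python, not the PR's kernel: it assumes the computation is a causal weighted sum over a window of W elements (a guess based on the 'S' and 'W' parameters), and both function names are hypothetical. It shows that averaging three identical accumulators gives the same result as keeping one, while doing three times the loads and multiply-adds.

```python
import math


def window_sum_tripled(x, w):
    """Assumed baseline shape: three identical accumulators, then averaged."""
    width = len(w)
    out = []
    for t in range(len(x)):
        acc1 = acc2 = acc3 = 0.0
        for j in range(width):
            idx = t - (width - 1) + j  # causal window ending at position t
            if idx >= 0:
                # The same product is recomputed three times.
                acc1 += x[idx] * w[j]
                acc2 += x[idx] * w[j]
                acc3 += x[idx] * w[j]
        out.append((acc1 + acc2 + acc3) / 3.0)
    return out


def window_sum_single(x, w):
    """Single accumulator: one load and one multiply-add per window element."""
    width = len(w)
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j in range(width):
            idx = t - (width - 1) + j
            if idx >= 0:
                acc += x[idx] * w[j]
        out.append(acc)
    return out


x = [0.5, -1.25, 2.0, 3.5, -0.75, 1.0]
w = [0.1, 0.2, 0.3, 0.4]  # window of W = 4, matching the benchmark shapes
tripled = window_sum_tripled(x, w)
single = window_sum_single(x, w)
print(all(math.isclose(a, b) for a, b in zip(tripled, single)))
```

The comparison uses `math.isclose` rather than `==` because `(a + a + a) / 3` is not guaranteed to round to exactly `a` in floating point.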

Results:
Benchmark 0: 0.0041 ms (min=0.0040, max=0.0042) {'B': 1, 'T': 64, 'H': 1, 'K': 64, 'V': 64, 'seed': 31232}
Benchmark 1: 0.1887 ms (min=0.1886, max=0.1888) {'B': 2, 'T': 512, 'H': 3, 'K': 64, 'V': 64, 'seed': 4052}
Benchmark 2: 0.3787 ms (min=0.3785, max=0.3788) {'B': 2, 'T': 1024, 'H': 3, 'K': 64, 'V': 64, 'seed': 2146}

Baseline:
Benchmark 0: 0.0042 ms (min=0.0042, max=0.0045) {'B': 1, 'T': 64, 'H': 1, 'K': 64, 'V': 64, 'seed': 31232}
Benchmark 1: 0.5036 ms (min=0.5034, max=0.5039) {'B': 2, 'T': 512, 'H': 3, 'K': 64, 'V': 64, 'seed': 4052}
Benchmark 2: 1.0005 ms (min=1.0000, max=1.0011) {'B': 2, 'T': 1024, 'H': 3, 'K': 64, 'V': 64, 'seed': 2146}

Results:
Benchmark 0: 0.0059 ms (min=0.0058, max=0.0060) {'B': 1, 'T': 64, 'H': 1, 'K': 64, 'V': 64, 'seed': 31232}
Benchmark 1: 0.0067 ms (min=0.0067, max=0.0069) {'B': 2, 'T': 512, 'H': 3, 'K': 64, 'V': 64, 'seed': 4052}
Benchmark 2: 0.0092 ms (min=0.0091, max=0.0094) {'B': 2, 'T': 1024, 'H': 3, 'K': 64, 'V': 64, 'seed': 2146}

Baseline:
Benchmark 0: 0.1635 ms (min=0.1634, max=0.1638) {'B': 1, 'T': 64, 'H': 1, 'K': 64, 'V': 64, 'seed': 31232}
Benchmark 1: 0.1448 ms (min=0.1446, max=0.1450) {'B': 2, 'T': 512, 'H': 3, 'K': 64, 'V': 64, 'seed': 4052}
Benchmark 2: 0.1453 ms (min=0.1452, max=0.1454) {'B': 2, 'T': 1024, 'H': 3, 'K': 64, 'V': 64, 'seed': 2146}

Key details:
- Used separate [B, H, T, K] and [B, H, T, V] tile loops (not flattened BH) to avoid Helion's launcher-side variable scoping issues
- Two clean hl.dot matmul calls replacing the baseline's double scalar accumulation loop
- hl.register_block_size(K) and hl.register_block_size(V) for the two block sizes per config
- All variable names are distinct between the two loops to avoid loop dependency errors
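The move from a scalar accumulation loop to tile-level dot products can be sketched in plain Python. This is not Helion code and not the PR's kernel: `dot_tile` is a hypothetical stand-in for a tile-level matmul such as hl.dot, and the matrix shapes are arbitrary. The point is only that both forms compute the same product, while the tiled form exposes whole-tile operations that a compiler can map to matrix hardware.

```python
def matmul_scalar(a, b):
    """Scalar form: one multiply-add at a time into one output element."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                out[i][j] += a[i][p] * b[p][j]
    return out


def dot_tile(a_tile, b_tile):
    """Hypothetical stand-in for a tile-level dot: [rows, k] x [k, cols]."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b_tile)]
        for row in a_tile
    ]


def matmul_tiled(a, b, block_rows, block_cols):
    """Tile form: each step produces a whole output tile with one dot call."""
    n, m = len(a), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block_rows):
        a_tile = a[i0:i0 + block_rows]
        for j0 in range(0, m, block_cols):
            b_tile = [row[j0:j0 + block_cols] for row in b]
            tile = dot_tile(a_tile, b_tile)
            for di, row in enumerate(tile):
                out[i0 + di][j0:j0 + len(row)] = row
    return out


# Small integer-valued inputs so both paths agree exactly.
a = [[float(i + p) for p in range(4)] for i in range(5)]
b = [[float(p * j - 1) for j in range(6)] for p in range(4)]
print(matmul_tiled(a, b, 2, 3) == matmul_scalar(a, b))
```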

…rence

Results:
Benchmark 0: 0.2179 ms (min=0.2178, max=0.2182) {'B': 1, 'T': 64, 'H': 1, 'K': 64, 'V': 64, 'seed': 31232}
Benchmark 1: 0.2227 ms (min=0.2226, max=0.2230) {'B': 2, 'T': 512, 'H': 3, 'K': 64, 'V': 64, 'seed': 4052}
Benchmark 2: 0.2307 ms (min=0.2306, max=0.2311) {'B': 2, 'T': 1024, 'H': 3, 'K': 64, 'V': 64, 'seed': 2146}

Baseline:
Benchmark 0: 0.0197 ms (min=0.0196, max=0.0198) {'B': 1, 'T': 64, 'H': 1, 'K': 64, 'V': 64, 'seed': 31232}
Benchmark 1: 0.0210 ms (min=0.0209, max=0.0213) {'B': 2, 'T': 512, 'H': 3, 'K': 64, 'V': 64, 'seed': 4052}
Benchmark 2: 0.0214 ms (min=0.0213, max=0.0217) {'B': 2, 'T': 1024, 'H': 3, 'K': 64, 'V': 64, 'seed': 2146}

Details:
1. Added block_v = hl.register_block_size(V) to register a tunable block size over the V dimension
2. Changed the tile loop from hl.tile([BH, T], ...) with flat B*H indexing to hl.tile([B, H, T, V], block_size=[1, 1, C, block_v]) with explicit dimensions
3. Sliced v_tile and h over tile_v so both dot products produce [C, block_v] outputs
4. Updated SHAPE_CONFIGS with block_sizes=[V] matching each shape's V dimension and num_warps=4, num_stages=2
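The change in item 2 amounts to a different enumeration of the same work. The plain-Python sketch below is not Helion code: it lists the tile origins visited by a flat B*H loop and by an explicit [B, H, T, V] loop with block sizes [1, 1, C, block_v]. The chunk size C = 64 is an assumption, since the PR does not state its value, and both function names are hypothetical.

```python
def flat_tiles(B, H, T, C):
    """Flat indexing: one index bh over B*H, decoded back to (b, h)."""
    tiles = []
    for bh in range(B * H):
        b, h = divmod(bh, H)
        for t0 in range(0, T, C):
            tiles.append((b, h, t0))
    return tiles


def explicit_tiles(B, H, T, V, C, block_v):
    """Explicit dimensions: block sizes [1, 1, C, block_v] over [B, H, T, V]."""
    tiles = []
    for b in range(B):
        for h in range(H):
            for t0 in range(0, T, C):
                for v0 in range(0, V, block_v):
                    tiles.append((b, h, t0, v0))
    return tiles


# Shape from Benchmark 1; C = 64 is an assumed chunk size.
B, H, T, V, C = 2, 3, 512, 64, 64
flat = flat_tiles(B, H, T, C)
full_v = explicit_tiles(B, H, T, V, C, block_v=64)
half_v = explicit_tiles(B, H, T, V, C, block_v=32)
print(len(flat), len(full_v), len(half_v))
```

With block_v equal to V, as in the block_sizes=[V] configs, the explicit loop visits exactly the same (b, h, t0) origins as the flat one; a smaller block_v splits each of them into several tiles along V.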

rdspring1 added 8 commits March 14, 2026 15:30

Add Meta Helion submodule

520b8c5

Create requirements

b8ba7d2

TODO: Optimization Idea for fp8 - fuse reshape into kernel

451993d

alivenkickin wants to merge 8 commits into gpu-mode:main from rdspring1:helion-comp

2 participants
