Helion com o by alivenkickin · Pull Request #133 · gpu-mode/reference-kernels

alivenkickin · 2026-03-14T21:53:45Z

PR for o kernel

~2x improvement (roughly 1.5x to 2.2x by mean latency, depending on shape)

Optimized:
Benchmark 0: 0.0047 ms (min=0.0045, max=0.0051) {'num_tokens': 256, 'hidden_dim': 4096, 'group_size': 128, 'seed': 2146}
Benchmark 1: 0.0069 ms (min=0.0068, max=0.0075) {'num_tokens': 256, 'hidden_dim': 8192, 'group_size': 128, 'seed': 3129}
Benchmark 2: 0.0719 ms (min=0.0718, max=0.0725) {'num_tokens': 4096, 'hidden_dim': 7168, 'group_size': 128, 'seed': 54352}

Baseline:
Benchmark 0: 0.0070 ms (min=0.0068, max=0.0075) {'num_tokens': 256, 'hidden_dim': 4096, 'group_size': 128, 'seed': 2146}
Benchmark 1: 0.0120 ms (min=0.0118, max=0.0125) {'num_tokens': 256, 'hidden_dim': 8192, 'group_size': 128, 'seed': 3129}
Benchmark 2: 0.1563 ms (min=0.1562, max=0.1570) {'num_tokens': 4096, 'hidden_dim': 7168, 'group_size': 128, 'seed': 54352}
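As a quick check on the "~2x" figure, the per-shape speedup can be computed from the mean latencies reported above. This is only a sketch of the arithmetic; the variable and function names are illustrative and not part of the PR.

```python
# Mean latencies in milliseconds, copied from the benchmark output above.
optimized_ms = [0.0047, 0.0069, 0.0719]
baseline_ms = [0.0070, 0.0120, 0.1563]


def speedups(baseline, optimized):
    """Return the baseline/optimized latency ratio for each benchmark."""
    return [b / o for b, o in zip(baseline, optimized)]


ratios = speedups(baseline_ms, optimized_ms)
for i, ratio in enumerate(ratios):
    print(f"Benchmark {i}: {ratio:.2f}x")
```

The largest shape (num_tokens=4096) gains the most, while the two 256-token shapes stay below 2x.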

Results:
Benchmark 0: 0.0231 ms (min=0.0230, max=0.0236) {'B': 1, 'D': 1536, 'S': 2048, 'W': 4, 'seed': 2146}
Benchmark 1: 0.0352 ms (min=0.0349, max=0.0361) {'B': 1, 'D': 2560, 'S': 2048, 'W': 4, 'seed': 3129}
Benchmark 2: 0.0651 ms (min=0.0650, max=0.0659) {'B': 1, 'D': 2560, 'S': 4096, 'W': 4, 'seed': 54352}

Baseline:
Benchmark 0: 0.2165 ms (min=0.2163, max=0.2176) {'B': 1, 'D': 1536, 'S': 2048, 'W': 4, 'seed': 2146}
Benchmark 1: 0.3577 ms (min=0.3574, max=0.3587) {'B': 1, 'D': 2560, 'S': 2048, 'W': 4, 'seed': 3129}
Benchmark 2: 0.7097 ms (min=0.7094, max=0.7108) {'B': 1, 'D': 2560, 'S': 4096, 'W': 4, 'seed': 54352}

Optimizations:
1. Eliminated tripled accumulator: replaced acc1/acc2/acc3 averaged with a single acc, removing 3x redundant compute and memory reads
2. Larger block sizes: [1, 64] or [1, 128] instead of [1, 8], increasing parallelism
3. More warps: num_warps=2-4 instead of 1, better SM utilization
4. More pipeline stages: num_stages=2-4 for large shapes to hide HBM latency via software pipelining
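The first optimization can be illustrated in plain Python. This is a minimal sketch, not the kernel from the PR: it assumes the three accumulators computed the same weighted sum and were then averaged, which is what "redundant" suggests. The function names and the sample inputs are hypothetical.

```python
import math


def tripled_accumulate(x, w):
    """Assumed baseline pattern: three identical accumulators, averaged at the end."""
    acc1 = acc2 = acc3 = 0.0
    for xi, wi in zip(x, w):
        # Each input pair is multiplied and accumulated three times.
        acc1 += xi * wi
        acc2 += xi * wi
        acc3 += xi * wi
    return (acc1 + acc2 + acc3) / 3.0


def single_accumulate(x, w):
    """Optimized pattern: one accumulator, one multiply-add per input pair."""
    acc = 0.0
    for xi, wi in zip(x, w):
        acc += xi * wi
    return acc


# Hypothetical inputs, chosen to be exactly representable in binary floating point.
x = [0.5, -1.25, 2.0, 3.5]
w = [1.0, 0.5, -0.75, 0.25]

tripled = tripled_accumulate(x, w)
single = single_accumulate(x, w)
print(tripled, single)
```

Under that assumption the two functions agree up to floating-point rounding, so the single accumulator does one third of the multiply-adds for the same result.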

rdspring1 and others added 6 commits March 14, 2026 12:02

Add Meta Helion submodule

7b860ba

Create requirements

7344864

TODO: Optimization Idea for fp8 - fuse reshape into kernel

06ff037

changes by raka

e56d9fb


alivenkickin wants to merge 6 commits into gpu-mode:main from rdspring1:helion-com-o
