Helion com o#133

Open
alivenkickin wants to merge 6 commits into gpu-mode:main from rdspring1:helion-com-o

@alivenkickin

PR for o kernel

rdspring1 and others added 6 commits March 14, 2026 12:02
~2x improvement
Optimized:
  Benchmark 0: 0.0047 ms (min=0.0045, max=0.0051)  {'num_tokens': 256, 'hidden_dim': 4096, 'group_size': 128, 'seed': 2146}
  Benchmark 1: 0.0069 ms (min=0.0068, max=0.0075)  {'num_tokens': 256, 'hidden_dim': 8192, 'group_size': 128, 'seed': 3129}
  Benchmark 2: 0.0719 ms (min=0.0718, max=0.0725)  {'num_tokens': 4096, 'hidden_dim': 7168, 'group_size': 128, 'seed': 54352}

Baseline:
  Benchmark 0: 0.0070 ms (min=0.0068, max=0.0075)  {'num_tokens': 256, 'hidden_dim': 4096, 'group_size': 128, 'seed': 2146}
  Benchmark 1: 0.0120 ms (min=0.0118, max=0.0125)  {'num_tokens': 256, 'hidden_dim': 8192, 'group_size': 128, 'seed': 3129}
  Benchmark 2: 0.1563 ms (min=0.1562, max=0.1570)  {'num_tokens': 4096, 'hidden_dim': 7168, 'group_size': 128, 'seed': 54352}

Results:
 Benchmark 0: 0.0231 ms (min=0.0230, max=0.0236)  {'B': 1, 'D': 1536, 'S': 2048, 'W': 4, 'seed': 2146}
 Benchmark 1: 0.0352 ms (min=0.0349, max=0.0361)  {'B': 1, 'D': 2560, 'S': 2048, 'W': 4, 'seed': 3129}
 Benchmark 2: 0.0651 ms (min=0.0650, max=0.0659)  {'B': 1, 'D': 2560, 'S': 4096, 'W': 4, 'seed': 54352}

Baseline:
 Benchmark 0: 0.2165 ms (min=0.2163, max=0.2176)  {'B': 1, 'D': 1536, 'S': 2048, 'W': 4, 'seed': 2146}
 Benchmark 1: 0.3577 ms (min=0.3574, max=0.3587)  {'B': 1, 'D': 2560, 'S': 2048, 'W': 4, 'seed': 3129}
 Benchmark 2: 0.7097 ms (min=0.7094, max=0.7108)  {'B': 1, 'D': 2560, 'S': 4096, 'W': 4, 'seed': 54352}
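
For reference, the speedups implied by the two benchmark sets above can be worked out directly from the reported mean latencies. The snippet below is only a small arithmetic sketch: the names `first_set`, `second_set`, and `speedups` are made up for illustration and are not part of the PR, and the numbers are copied from the logs above.

```python
def speedups(pairs):
    """Return baseline/optimized ratios for (optimized_ms, baseline_ms) pairs."""
    return [baseline / optimized for optimized, baseline in pairs]


# (optimized mean ms, baseline mean ms), in benchmark order 0, 1, 2.
# First set: num_tokens / hidden_dim / group_size shapes.
first_set = [(0.0047, 0.0070), (0.0069, 0.0120), (0.0719, 0.1563)]
# Second set: B / D / S / W shapes.
second_set = [(0.0231, 0.2165), (0.0352, 0.3577), (0.0651, 0.7097)]

for name, pairs in (("first set", first_set), ("second set", second_set)):
    ratios = ", ".join(f"{s:.2f}x" for s in speedups(pairs))
    print(f"{name}: {ratios}")
```

On these numbers the first set improves by roughly 1.5x to 2.2x (the "~2x" figure matches the largest shape), and the second set by roughly 9.4x to 10.9x.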

Optimizations:
  1. Eliminated the tripled accumulator: replaced the averaged acc1/acc2/acc3 with a single acc, removing the 3x redundant compute and memory reads
  2. Larger block sizes: [1, 64] or [1, 128] instead of [1, 8], increasing parallelism
  3. More warps: num_warps=2-4 instead of 1, for better SM utilization
  4. More pipeline stages: num_stages=2-4 for large shapes, to hide HBM latency via software pipelining
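
To illustrate item 1, here is a plain-Python sketch of why averaging three identical accumulators can be replaced by one. It is not the kernel code from this PR: the weighted-sum body and the function names `tripled` and `single` are assumptions chosen only to show the pattern, on the premise that acc1, acc2, and acc3 each accumulated the same values.

```python
import math


def tripled(values, weights):
    # Assumed baseline pattern: three accumulators compute the same
    # weighted sum, and the result is their average.
    acc1 = acc2 = acc3 = 0.0
    for v, w in zip(values, weights):
        acc1 += v * w
        acc2 += v * w
        acc3 += v * w
    return (acc1 + acc2 + acc3) / 3.0


def single(values, weights):
    # Same result with one accumulator: one multiply-add per element
    # instead of three.
    acc = 0.0
    for v, w in zip(values, weights):
        acc += v * w
    return acc


values = [0.5, -1.25, 2.0, 3.5]
weights = [1.0, 0.5, -0.25, 2.0]
# Compare with a tolerance, since the averaged form may round differently.
print(math.isclose(tripled(values, weights), single(values, weights)))
```

Because the three accumulators hold the same value, their average equals any one of them up to floating-point rounding, so the extra two thirds of the work can be dropped without changing the output.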

2 participants