
[Training Camp] FlashAttention Integration @simon_chou #128

Open

Simon-CHOU wants to merge 33 commits into InfiniTensor:master from Simon-CHOU:simon_chou-20260316

Conversation

@Simon-CHOU

[Feat] FlashAttention Integration (GPT-2/LLaMA-3)

Summary

Integrates FlashAttention kernels into InfiniTrain to reduce peak memory usage and support long-sequence training. Aligns with project requirements for the 2025 Winter Training Camp.

Key Changes

  • Kernels: Added FP32-based FlashAttentionForwardKernel and FlashAttentionBackwardKernel with causal masking and scaling support (see the sketch after this list).
  • Ops: Implemented ScaledDotProductAttention autograd function, mirroring PyTorch's interface.
  • Models: Enabled --flash flag for GPT-2 and LLaMA-3 to toggle between baseline and FlashAttention.
  • Build: Updated CMake to support CUDA kernels and enforce Release optimization.

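The core of the kernel change is the online-softmax recurrence, which lets attention be computed tile by tile without ever materializing the full score matrix. Below is a minimal, register-only CUDA sketch of that recurrence under assumed names (`flash_fwd_naive`, `HEAD_DIM`, `TILE_K`) and layouts; the PR's actual FlashAttentionForwardKernel stages K/V tiles through shared memory and uses block-level GEMMs, so treat this only as an illustration of the algorithm, not the implementation.

```cuda
// Simplified sketch of a FlashAttention-style forward pass with online softmax.
// One thread owns one query row; keys/values are streamed in tiles.
// Names and layouts (flash_fwd_naive, HEAD_DIM, TILE_K) are illustrative, not the PR's.
#include <cuda_runtime.h>
#include <math.h>

constexpr int HEAD_DIM = 64;   // head dimension d
constexpr int TILE_K   = 64;   // keys/values processed per tile

// Q, K, V, O are [seq_len, HEAD_DIM] for a single (batch, head) slice.
// Launch example: flash_fwd_naive<<<(seq_len + 127) / 128, 128>>>(...)
__global__ void flash_fwd_naive(const float* Q, const float* K, const float* V,
                                float* O, int seq_len, float scale, bool causal) {
    int q = blockIdx.x * blockDim.x + threadIdx.x;   // query row handled by this thread
    if (q >= seq_len) return;

    float acc[HEAD_DIM] = {0.f};   // un-normalized output accumulator
    float m = -INFINITY;           // running row max (numerical stability)
    float l = 0.f;                 // running softmax denominator

    for (int k0 = 0; k0 < seq_len; k0 += TILE_K) {
        int kend = min(k0 + TILE_K, seq_len);
        for (int k = k0; k < kend; ++k) {
            if (causal && k > q) break;            // causal mask: skip future keys
            float s = 0.f;                         // scaled dot product q . k
            for (int d = 0; d < HEAD_DIM; ++d)
                s += Q[q * HEAD_DIM + d] * K[k * HEAD_DIM + d];
            s *= scale;

            float m_new = fmaxf(m, s);             // online softmax update
            float corr  = expf(m - m_new);         // rescale previous partial sums
            float p     = expf(s - m_new);
            l = l * corr + p;
            for (int d = 0; d < HEAD_DIM; ++d)
                acc[d] = acc[d] * corr + p * V[k * HEAD_DIM + d];
            m = m_new;
        }
    }
    for (int d = 0; d < HEAD_DIM; ++d)
        O[q * HEAD_DIM + d] = (l > 0.f) ? acc[d] / l : 0.f;
}
```

Keeping only the running max `m`, the denominator `l`, and a rescaled accumulator per query row is what removes the O(N²) score matrix from memory and enables the peak-memory reduction reported below.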
Verification

  • Precision (GPT-2): Loss alignment within 0.2% of baseline (FP32), verifying numerical correctness (see the relative-error check sketched after this list).
  • Stability: Fixed NaN issues by correcting shared memory initialization and gradient accumulation.
  • Functionality: LLaMA-3 (1B) training runs successfully without OOM.

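For reference, the 0.2% figure above is the kind of bound a per-step relative-error comparison enforces. The helper below is a hedged sketch; its name, tolerance, and logging are assumptions and not code from test_flash_layout.

```cpp
#include <cmath>
#include <cstdio>

// Compare a FlashAttention loss curve against the baseline run.
// Passes when every step differs by at most `rel_tol` (e.g. 0.2%).
bool losses_match(const float* baseline, const float* flash, int steps,
                  float rel_tol = 2e-3f) {
    for (int i = 0; i < steps; ++i) {
        float denom = std::fabs(baseline[i]) + 1e-8f;   // avoid divide-by-zero
        float rel   = std::fabs(flash[i] - baseline[i]) / denom;
        if (std::isnan(flash[i]) || rel > rel_tol) {
            std::printf("step %d: baseline=%f flash=%f rel=%f\n",
                        i, baseline[i], flash[i], rel);
            return false;
        }
    }
    return true;
}
```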
Performance

  • Memory: ~15.7% reduction in peak memory usage on GPT-2 (SeqLen=1024).
  • Throughput: Comparable to baseline (FP32 SIMT kernel used for precision).

Notes

Simon Chou added 30 commits March 10, 2026 17:47
- Initialize shared memory padding to 0 to prevent NaN propagation in WMMA
- Fix s_Qi buffer reuse bug by correctly casting to float*
- Vectorize output store to float4
- Reduce input range to [-1, 1] to prevent expf overflow
- Enable Causal Mask in test
- Add multi-tile test cases (T=16, 32, 1024)
- Log NaN fix and Performance Regression (Story 7)
- Update Tasking Plan status
- Update performance report with latest benchmark results
- Implement Tiled Backward Kernel (Block-level accumulation for dK/dV)
- Use Shared Memory for dK/dV accumulation (reduce atomics by 32x)
- Add Backward Gradient Check to test_flash_layout
- Implement Tiled Backward Kernel to remove global atomicAdd bottleneck
- Fix stride issues in GEMM helpers (Bc_pad vs Bc_bw)
- Enable Dynamic Shared Memory (>48KB) for Backward Kernel
- Verify correctness with Gradient Check (test_flash_layout)
- Benchmark: 7814 TPS (0.64x Baseline), 9x improvement over Story 7
- Switch to FP32 kernel and double accumulators to fix precision issues. Correct mask value and epsilon logic.
- Record root cause for LLaMA-3 performance bottleneck (atomicAdd) and precision alignment fixes.
- Add evaluation conclusion for precision alignment and LLaMA-3 performance analysis.
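Two of the optimizations logged above, block-level shared-memory accumulation of dK/dV and opting into more than 48 KB of dynamic shared memory, can be summarized in a short CUDA sketch. Everything here (kernel name, layouts, the materialized dS) is assumed for illustration and is not the PR's actual backward kernel, which recomputes scores tile by tile.

```cuda
#include <cuda_runtime.h>

constexpr int HEAD_DIM = 64;

// Each block owns one K tile of Bc rows: partial dK is accumulated in shared
// memory by the block's threads and flushed to global memory once per tile,
// cutting global atomics by roughly the tile size.
__global__ void flash_bwd_tile(const float* dS, const float* Q, float* dK,
                               int seq_len, int Bc) {
    extern __shared__ float s_dK[];                  // [Bc * HEAD_DIM], dynamic
    int tile_start = blockIdx.x * Bc;

    // Zero the shared accumulator (also avoids NaN from uninitialized padding).
    for (int i = threadIdx.x; i < Bc * HEAD_DIM; i += blockDim.x)
        s_dK[i] = 0.f;
    __syncthreads();

    // Accumulate dK[k] += dS[q][k] * Q[q] over all queries q for keys in this tile.
    for (int q = 0; q < seq_len; ++q) {
        for (int i = threadIdx.x; i < Bc * HEAD_DIM; i += blockDim.x) {
            int k = tile_start + i / HEAD_DIM;
            int d = i % HEAD_DIM;
            if (k < seq_len)
                s_dK[i] += dS[q * seq_len + k] * Q[q * HEAD_DIM + d];
        }
    }
    __syncthreads();

    // Flush once per tile; atomicAdd kept in case several blocks touch the same rows.
    for (int i = threadIdx.x; i < Bc * HEAD_DIM; i += blockDim.x) {
        int k = tile_start + i / HEAD_DIM;
        if (k < seq_len)
            atomicAdd(&dK[k * HEAD_DIM + (i % HEAD_DIM)], s_dK[i]);
    }
}

// Host-side opt-in for dynamic shared memory beyond the default 48 KB limit.
void launch_bwd(const float* dS, const float* Q, float* dK, int seq_len, int Bc) {
    size_t smem = size_t(Bc) * HEAD_DIM * sizeof(float);
    cudaFuncSetAttribute(flash_bwd_tile,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, (int)smem);
    int blocks = (seq_len + Bc - 1) / Bc;
    flash_bwd_tile<<<blocks, 256, smem>>>(dS, Q, dK, seq_len, Bc);
}
```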
Simon Chou added 3 commits March 16, 2026 19:59
@kilinchange
Collaborator

  1. Please remove the unnecessary commits from the PR; the PR should only contain the code changes. Please send the project-report material as an email attachment instead;
  2. Please resolve the current conflicts between this PR and the master branch.

@kilinchange kilinchange self-requested a review March 17, 2026 06:17
@kilinchange kilinchange self-assigned this Mar 17, 2026