1. Problem Description

In the current InfiniTrain framework, elementwise operators for the BF16 (nv_bfloat16) type (such as UnaryForwardKernel) default to the most basic scalar memory-access pattern. Because a single BF16 element occupies only 2 bytes, GPU threads that read one element at a time cannot effectively exploit the 128-bit (16-byte) memory transaction width of modern GPUs, wasting a large fraction of the available bandwidth. This memory-access bottleneck significantly slows the underlying operators in large-scale tensor computations.
2. Baseline (Pre-optimization) Performance Records and Analysis
3. Optimization Approach and Development Log

3.1 Optimization Journey and Pitfalls

During the profiling stage, I went through a winding but highly instructive debugging process:
- While running nsys and ncu, the cloud container's shared storage filled up (/tmp reported "No space left on device"); redirecting TMPDIR successfully worked around the tooling's I/O limitation.
- Using ncu -k to target the kernel precisely, I captured the low-level hardware counter data for the target operator.

3.2 Core Optimization: 128-bit Vectorized Memory Access
To eliminate the bandwidth waste, I restructured the core logic in infini_train/src/kernels/cuda/elementwise.cu:

- Introduced AlignedVector<T, kVecSize>, using alignas(sizeof(T) * VecSize) to force 16-byte alignment. For BF16, kVecSize is set to 8, so one thread moves 8 elements (16 bytes) per access.
- Added an IsAligned runtime pointer check, so the vectorized path is enabled only when the input/output pointers are strictly aligned; otherwise the code safely falls back to scalar processing.
- Applied #pragma unroll to encourage the compiler to emit the wide LDG.E.128 and STG.E.128 instructions.
- Computed a tail_start so that the remaining elements not divisible by 8 are handled by the safe scalar path, guaranteeing exact results at full precision.
4. Final Post-optimization Performance Records and Analysis

5. NCU Usage and Analysis
To profile the bottleneck in depth, I used ncu -o bf16_unary_report -f -k "UnaryForwardKernel" -c 2 to capture a profiling report for this operator. The report shows:

- The two kernels (MulScalarForward / PowForward) have extremely short execution times (Duration of 3.46 μs and 4.26 μs, respectively).
- Both Memory Throughput and SM Throughput are below 1%.
- Launch Statistics show the kernel launches with a Grid Size of only 1 (Threads = 256).

Deep analysis:
On a high-end GPU with 108 SMs, processing very small tensors (≤256 elements) triggers a severe "tail effect". The NCU warning "only 0.0 full waves across all SMs" confirms that some elementwise operators in the current framework are kernel-launch-overhead bound rather than compute or memory bound, which points the way for the next round of optimization.
6. Future Improvements

Based on the NCU diagnosis, future optimization should focus on:

- Kernel fusion: a FusedAddSigmoidKernel has already been hard-coded at the bottom of elementwise.cu as a proof of concept (PoC).
- Generalizing this fusion mechanism into BinaryForwardKernel and the backward-pass logic.