CUDA-based real-time image bilateral filtering: optimization and multi-platform porting #48
Open
Snowkyo16 wants to merge 9 commits into InfiniTensor:2025-winter-project from
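- Implement the CPU bilateral filter (bilateral_cpu.cpp)
- Image reading and writing via stb_image
- Parameter configuration via params.txt
- OpenCV comparison and verification script
- Makefile build system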
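For orientation, a CPU baseline of this kind can be sketched as a brute-force loop over every pixel and its neighborhood. This is a minimal illustration, not the contents of bilateral_cpu.cpp: the function name `bilateralGray`, the single-channel layout, the square window, and the clamp-to-edge border handling are all assumptions made to keep the example short.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the PR's code): brute-force bilateral filter on a
// single-channel 8-bit image stored row-major.
// Assumptions: square (2*radius+1)^2 window, clamp-to-edge borders.
std::vector<uint8_t> bilateralGray(const std::vector<uint8_t>& src,
                                   int width, int height, int radius,
                                   float sigmaSpace, float sigmaColor) {
    std::vector<uint8_t> dst(src.size());
    const float spaceCoeff = -0.5f / (sigmaSpace * sigmaSpace);
    const float colorCoeff = -0.5f / (sigmaColor * sigmaColor);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const float center = src[y * width + x];
            float sum = 0.0f;   // weighted sum of neighbor intensities
            float norm = 0.0f;  // sum of weights, used for normalization
            for (int dy = -radius; dy <= radius; ++dy) {
                for (int dx = -radius; dx <= radius; ++dx) {
                    // Clamp neighbor coordinates to the image bounds.
                    const int ny = std::min(std::max(y + dy, 0), height - 1);
                    const int nx = std::min(std::max(x + dx, 0), width - 1);
                    const float value = src[ny * width + nx];
                    const float diff = value - center;
                    // Spatial Gaussian times range (intensity) Gaussian.
                    const float w = std::exp((dx * dx + dy * dy) * spaceCoeff +
                                             diff * diff * colorCoeff);
                    sum += w * value;
                    norm += w;
                }
            }
            dst[y * width + x] = static_cast<uint8_t>(std::lround(sum / norm));
        }
    }
    return dst;
}
```

The center pixel always contributes a weight of 1, so `norm` is never zero. With a small `sigmaColor`, pixels across a strong edge get a weight close to zero, which is what preserves the edge.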
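- Add src/kernels.cu: a naive kernel in which each thread processes one pixel
- Add include/benchmark.h + src/benchmark.cpp: a general-purpose timing and comparison framework
- Refactor main.cu into a version dispatcher that supports a MODE parameter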
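A timing and comparison helper along these lines could look like the sketch below. It is not the API of benchmark.h: `averageMs` and `speedup` are hypothetical names, and the warm-up-then-average scheme is an assumption chosen for illustration.

```cpp
#include <chrono>
#include <functional>

// Hypothetical timing helper (not the PR's benchmark.h API): runs `fn`
// `warmup` times untimed, then `rounds` times timed, and returns the mean
// wall-clock time per timed run in milliseconds.
double averageMs(const std::function<void()>& fn, int warmup, int rounds) {
    for (int i = 0; i < warmup; ++i) {
        fn();  // warm-up runs are executed but not measured
    }
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < rounds; ++i) {
        fn();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / rounds;
}

// Speedup of a candidate version relative to a baseline version.
double speedup(double baselineMs, double candidateMs) {
    return baselineMs / candidateMs;
}
```

When `fn` launches a CUDA kernel, it has to synchronize before returning (for example with `cudaDeviceSynchronize()`), because kernel launches are asynchronous and a host-side clock would otherwise measure only the launch overhead.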
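- Add V2 kernel: shared memory tiling with cooperative halo loading
- Fix the color weight: use the L1 norm, consistent with OpenCV
- Fix the window shape: use a circular window, consistent with OpenCV
- Standardize output on the lossless PNG format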
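The two OpenCV-compatibility fixes can be illustrated with three small helpers. The names `colorDistL1`, `insideCircle`, and `rangeWeight` are hypothetical and the interleaved pixel layout is an assumption; the sketch only shows the idea of summing absolute per-channel differences and skipping neighbors outside the circle of the given radius.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdlib>

// L1 color distance between two interleaved pixels with `channels` components:
// the sum of absolute per-channel differences.
int colorDistL1(const uint8_t* a, const uint8_t* b, int channels) {
    int dist = 0;
    for (int c = 0; c < channels; ++c) {
        dist += std::abs(static_cast<int>(a[c]) - static_cast<int>(b[c]));
    }
    return dist;
}

// Circular window test: a neighbor at offset (dx, dy) contributes only if it
// lies within `radius` of the center (boundary included).
bool insideCircle(int dx, int dy, int radius) {
    return dx * dx + dy * dy <= radius * radius;
}

// Gaussian range weight evaluated on the L1 distance.
float rangeWeight(int distL1, float sigmaColor) {
    return std::exp(-0.5f * distL1 * distL1 / (sigmaColor * sigmaColor));
}
```

Using a square window or a Euclidean color distance instead would still give a valid bilateral filter, but the output would no longer match an OpenCV reference pixel for pixel.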
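Add V3 kernel: constant-memory LUT, __expf, and #pragma unroll over the channel loop.
Improve the benchmark framework: 1 warm-up run plus N timed runs averaged; GPU versions default to 10 rounds.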
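As a rough illustration of the lookup-table idea, the host code below precomputes one Gaussian weight per possible color distance. It assumes the table holds range (color) weights indexed by an L1 distance; the commit message does not say what the LUT contains, and `buildColorLut` is a hypothetical name.

```cpp
#include <cmath>
#include <vector>

// Hypothetical host-side table builder. Entry d holds
// exp(-d^2 / (2 * sigmaColor^2)) for every possible L1 color distance d
// in [0, 255 * channels].
std::vector<float> buildColorLut(int channels, float sigmaColor) {
    const int size = 255 * channels + 1;
    std::vector<float> lut(size);
    const float coeff = -0.5f / (sigmaColor * sigmaColor);
    for (int d = 0; d < size; ++d) {
        lut[d] = std::exp(coeff * d * d);
    }
    return lut;
}
```

For three channels the table has 766 floats (about 3 KB), which fits comfortably in the 64 KB of CUDA constant memory. On the device side such a table would typically be copied into a `__constant__` array with `cudaMemcpyToSymbol`, so the kernel replaces an exponential per neighbor with a table read.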
- Preallocate device buffers to eliminate per-frame cudaMalloc/cudaFree
- Allocate pinned memory with cudaHostAlloc to support cudaMemcpyAsync
- 4-way CUDA stream pipeline that overlaps kernel execution with D2H copies
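This project implements a high-performance bilateral filter in CUDA and optimizes it progressively across five versions, from a CPU baseline to a stream pipeline: V0 (CPU) → V1 (naive GPU, 726.59x) → V2 (shared memory, 643.52x) → V3 (constant-memory LUT, 819.98x) → V4 (stream pipeline, 987.14x). The final version reaches a throughput of 652.04 MP/s on an NVIDIA A100, equivalent to 78.61 fps at 4K, which meets the 4K@60fps real-time processing target.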
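The project was successfully ported from NVIDIA CUDA to three Chinese domestic GPU platforms (Iluvatar CoreX BI-V100, MetaX C500, and Moore Threads S5000), demonstrating the portability of the CUDA programming model within the domestic ecosystem. Compared against OpenCV, all versions give MAE = 0.0489 (< 1) and PSNR = 61.16 dB (> 40 dB), which satisfies the correctness requirements.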
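The two correctness metrics can be computed as in the following sketch. It assumes 8-bit images of equal, non-zero size compared element by element with a peak value of 255; the function names are hypothetical and the PR's own verification script may compute them differently.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <vector>

// Mean absolute error between two 8-bit images of equal, non-zero size.
double meanAbsError(const std::vector<uint8_t>& a,
                    const std::vector<uint8_t>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += std::abs(static_cast<int>(a[i]) - static_cast<int>(b[i]));
    }
    return sum / static_cast<double>(a.size());
}

// Peak signal-to-noise ratio in dB for 8-bit data (peak value 255).
// Returns +infinity when the two images are identical.
double psnrDb(const std::vector<uint8_t>& a, const std::vector<uint8_t>& b) {
    double squaredSum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        squaredSum += d * d;
    }
    if (squaredSum == 0.0) {
        return std::numeric_limits<double>::infinity();
    }
    const double mse = squaredSum / static_cast<double>(a.size());
    return 10.0 * std::log10(255.0 * 255.0 / mse);
}
```

An MAE below 1 means the outputs differ from the reference by less than one gray level on average, which is consistent with rounding and fast-math differences rather than an algorithmic mismatch.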