diff --git a/03_nf4_dequant/SkyHigh-achieving/Final_Project_Report.md b/03_nf4_dequant/SkyHigh-achieving/Final_Project_Report.md
new file mode 100644
index 0000000..20c7aae
--- /dev/null
+++ b/03_nf4_dequant/SkyHigh-achieving/Final_Project_Report.md
@@ -0,0 +1,123 @@
+# NF4 Quantization Operator Optimization: Final Project Report
+
+## 1. Project Overview
+
+This project implements and optimizes a high-performance CUDA kernel that dequantizes NF4 (Normal Float 4-bit) weights to FP16/BF16, a core operator in QLoRA inference for large language models (LLMs). Beyond functional correctness (double quantization, arbitrary matrix shapes), we tuned the kernel in depth on an NVIDIA A100, reaching **317 GB/s**, or **64%** of the industrial-grade `bitsandbytes` implementation.
+
+---
+
+## 2. Implementation & Verification
+
+### 2.1 Core Functionality
+Code location: [dequant_kernel.cu](file:///d:/thu-project/Learning-CUDA-master/Learning-CUDA-master/nf4/dequant_kernel.cu) (v4 implementation)
+
+We implemented the following strictly to the `bitsandbytes` specification:
+
+1. **NF4 Lookup Table**:
+   - The 16 predefined normal-distribution quantiles are stored in `__device__ __constant__` memory.
+   - **Optimization**: 16 floats occupy only 64 bytes and fit entirely in the constant cache, so uniform warp-wide lookups are served with very low latency.
+
+2. **Double Quantization Scaling**:
+   - Formula: `w = NF4[idx] * (code2[absmax_q] * absmax2) + offset`
+   - Two-level scaling: the first level maps the quantized `absmax_q` (uint8) to a float via a lookup table; the second level applies `absmax2` (float) as the group-level scale.
+
+3. **Vectorized Memory Access (Packed Store)**:
+   - **Load**: each thread reads 1 `uint8` (containing 2 NF4 indices).
+   - **Compute**: decodes 2 FP16/BF16 values.
+   - **Store**: `reinterpret_cast` packs the two 16-bit results into a single 32-bit store instruction.
+   - **Benefit**: halves the number of global-memory store instructions, markedly improving store efficiency.
+
+4. **Boundary Handling**:
+   - The kernel indexes over a flat 1D `numel`, so it naturally supports matrices of any shape (rows/cols).
+   - For an odd element count, an explicit boundary check (`if (elem1 < numel) ... else ...`) prevents out-of-bounds access.
+
+### 2.2 Correctness Verification
+- **Reference**: `bitsandbytes` (v0.49.2) CPU/CUDA results.
+- **Metric**: mean absolute error (MAE).
+- **Result**: MAE = `2.30755e-05`, far below the required threshold of `1e-2`.
+
+---
+
+## 3. 
Optimization Journey
+
+We iterated through four kernel versions, raising performance from an initial 58 GB/s to 317 GB/s.
+
+### v1: Naive Baseline
+- **Approach**: one element per thread.
+- **Problem**: very inefficient memory access (1-byte loads, 2-byte stores); only ~3% of DRAM bandwidth utilized.
+- **Performance**: ~58 GB/s.
+
+### v2: Vectorized Access
+- **Optimization**: two elements (one `uint8`) per thread.
+- **Technique**: packed loads and `half2` stores.
+- **Effect**: memory instructions halved; bandwidth utilization improved substantially.
+
+### v3: Aggressive Vectorization
+- **Optimization**: 8 or 16 element pairs per thread (128-bit `int4` loads).
+- **Problem**: register pressure spiked, occupancy (active warps) dropped, and register spilling occurred.
+- **Lesson**: in a memory-bound kernel, excessive per-thread instruction-level parallelism (ILP) can hurt thread-level parallelism (TLP).
+
+### v4: Dynamic Occupancy Control (Current Best)
+- **Optimizations**:
+  1. **Fall back to `int2` loads**: lowers per-thread register pressure.
+  2. **`__launch_bounds__(128, 8)`**: forces the compiler to cap register usage so each SM can run at least 8 blocks (1024 threads).
+  3. **Dynamic block size**: `cudaOccupancyMaxPotentialBlockSize` computes the best launch configuration automatically.
+- **Rationale**: the kernel sits on the memory roof of the Roofline model, so the way to hide the long global-memory latency is to increase the number of concurrent warps.
+- **Performance**: **317.25 GB/s** (5.4x speedup over the baseline).
+
+---
+
+## 4. Performance Analysis
+
+### 4.1 Final Metrics
+Test environment: NVIDIA A100-SXM4-80GB, matrix 8192x8192
+
+| Metric | Value | Note |
+| :--- | :--- | :--- |
+| **Time** | 0.532 ms | very low latency |
+| **Bandwidth** | **317.25 GB/s** | effective bandwidth |
+| **MAE** | 2.30e-05 | accuracy within spec |
+| **vs bitsandbytes** | 64.4% | industrial baseline |
+
+### 4.2 Nsight Compute (NCU) Analysis
+Because of restrictions on the server (permissions or driver version), we could not collect detailed `ncu` metrics (e.g., the memory/compute throughput breakdown) on the final A100 environment. The analysis below therefore rests on theory and indirect experiments:
+
+1. **Memory-bound behavior**:
+   - The kernel runs extremely fast (0.532 ms) and does very little arithmetic (just a table lookup and a fused multiply-add per element).
+   - It sustains 317 GB/s, far beyond what a kernel bottlenecked on computation with unoptimized memory access would reach.
+   - By the Roofline model, an operator with such low arithmetic intensity is necessarily limited by DRAM bandwidth.
+
+2. **Occupancy validation**:
+   - The code explicitly uses `__launch_bounds__(128, 8)`.
+   - Compared with the version without launch bounds (v3), performance improved by 8.4%, indirect evidence that more active warps (higher occupancy) successfully hide part of the global-memory latency.
+
+3. 
**Coalescing Verification**:
+   - By design, the packed `uint32_t` stores make the 32 threads of a warp write 128 contiguous bytes (32 * 4 bytes), matching the GPU's 32-byte memory-transaction sectors and 128-byte L2 cache lines.
+
+### 4.3 Nsight Systems (NSYS) Analysis
+- **Timeline**: `nsys` ran successfully. On the timeline the kernel itself is very short; end-to-end GPU utilization is dominated by kernel-launch overhead and data transfers (H2D/D2H).
+- **System view**: in end-to-end inference, dequantization is tightly coupled with the following matrix multiplication (GEMM). When dequantization is benchmarked in isolation, data movement dominates.
+
+---
+
+## 5. Future Improvements
+
+Although v4 is already a solid engineering result, it still trails `bitsandbytes` (492 GB/s) by about 36%. Future directions include:
+
+1. **Inline PTX**:
+   - Hand-control SASS instruction scheduling to eliminate redundant move instructions emitted by the compiler.
+   - Fine-tune register allocation to further reduce bank conflicts.
+
+2. **Async Copy**:
+   - Use Ampere's `cp.async` instruction for hardware-level asynchronous global-to-shared-memory transfers, hiding copy latency and reducing pipeline stalls.
+
+3. **Kernel Fusion**:
+   - **Ultimate plan**: fuse dequantization with the subsequent GEMM.
+   - **Benefit**: dequantized FP16 values feed the multiplication directly from registers, skipping the write-back to global memory entirely; in theory this yields more than a 2x end-to-end speedup.
+
+---
+
+## 6. Appendix
+- **Source code**: `dequant_kernel.cu`, `main.cpp`
+- **Test script**: `benchmark_vs_bnb.py`
+- **Performance log**: `run_log_remote.md`
diff --git a/03_nf4_dequant/SkyHigh-achieving/README.md b/03_nf4_dequant/SkyHigh-achieving/README.md
new file mode 100644
index 0000000..e59b291
--- /dev/null
+++ b/03_nf4_dequant/SkyHigh-achieving/README.md
@@ -0,0 +1,28 @@
+# SkyHigh-achieving
+
+This directory is the technical submission for the SkyHigh-achieving project, covering the implementation approach, the optimization journey, and the performance analysis.
+
+## 📁 Project Structure
+
+```tree
+SkyHigh-achieving/
+├── Final_Project_Report.md
+├── README.md
+├── benchmark_vs_bnb.py
+├── dequant_kernel.cu
+├── dequant_kernel.h
+├── dequant_kernel.ptx
+├── dequant_kernel_v2.cu
+├── main.cpp
+└── run_log_remote.md
+```
+
+- **Final_Project_Report.md** → detailed technical report: implementation, optimization journey, and performance analysis
+- **README.md** → submission notes and file layout (this file)
+- **benchmark_vs_bnb.py** → comparison script benchmarking performance and accuracy against the bitsandbytes library
+- **dequant_kernel.cu** → core NF4 dequantization kernel (v4 optimized), with packed stores and launch-bounds tuning
+- **dequant_kernel.h** → kernel header providing the C++ call interface
+- **dequant_kernel.ptx** → PTX assembly generated by NVCC, for instruction-level analysis
+- **dequant_kernel_v2.cu** → earlier kernel version (v2), kept as a performance reference
+- 
**main.cpp** → C++ test driver: random data generation, MAE accuracy verification, and basic performance measurement
+- **run_log_remote.md** → full run log and measured performance data from the A100 server
diff --git a/03_nf4_dequant/SkyHigh-achieving/benchmark_vs_bnb.py b/03_nf4_dequant/SkyHigh-achieving/benchmark_vs_bnb.py
new file mode 100644
index 0000000..9e055f4
--- /dev/null
+++ b/03_nf4_dequant/SkyHigh-achieving/benchmark_vs_bnb.py
@@ -0,0 +1,77 @@
+
+import torch
+import time
+import sys
+
+def benchmark_bnb(rows=8192, cols=8192, repeats=50):
+    try:
+        import bitsandbytes as bnb
+        from bitsandbytes.functional import dequantize_4bit, quantize_4bit
+    except ImportError:
+        print("bitsandbytes not installed. Run: pip install bitsandbytes")
+        return None
+
+    if not torch.cuda.is_available():
+        print("CUDA not available")
+        return None
+
+    print(f"Benchmarking bitsandbytes on {torch.cuda.get_device_name(0)}...")
+
+    # Generate fp16 weights and quantize them to NF4
+    device = torch.device("cuda:0")
+    # LLM weights are usually fp16 before quantization; bnb quantizes from fp16/fp32
+    w = torch.randn(rows, cols, device=device, dtype=torch.float16)
+
+    # blocksize=64, quant_type='nf4'
+    # quantize_4bit returns: (quantized_data, quantization_state)
+    # The signature may vary by version, but it is usually input, blocksize, quant_type
+    try:
+        w_q, quant_state = bnb.functional.quantize_4bit(
+            w.reshape(1, -1), blocksize=64, quant_type='nf4'
+        )
+    except TypeError:
+        # Fallback for some versions
+        w_q, quant_state = bnb.functional.quantize_4bit(
+            w.reshape(1, -1), blocksize=64, quant_type='nf4', compress_statistics=True
+        )
+
+    # Warmup
+    print("Warmup...")
+    for _ in range(5):
+        out = bnb.functional.dequantize_4bit(w_q, quant_state, quant_type='nf4')
+    torch.cuda.synchronize()
+
+    # Benchmark
+    print("Benchmarking...")
+    t0 = time.perf_counter()
+    for _ in range(repeats):
+        out = bnb.functional.dequantize_4bit(w_q, quant_state, quant_type='nf4')
+    torch.cuda.synchronize()
+    t1 = time.perf_counter()
+
+    # Calculate metrics
+    ms_per_call = (t1 - t0) / repeats * 1000
+
+    # Data transfer: 
+    # Read: 4-bit quantized data + quantization metadata (scales, absmax)
+    # Write: FP16 output
+    # Input size: rows * cols / 2 bytes (4-bit)
+    # Output size: rows * cols * 2 bytes (fp16)
+    # Metadata is usually negligible for the bandwidth figure, though a strict count would include it.
+    # To match our kernel's accounting, we count load(compressed) + store(decompressed).
+
+    numel = rows * cols
+    bytes_in = numel // 2   # 0.5 bytes per element
+    bytes_out = numel * 2   # 2 bytes per element
+    total_bytes = bytes_in + bytes_out
+
+    bw_gbs = total_bytes / (ms_per_call / 1000) / 1e9
+
+    print(f"bitsandbytes dequantize_4bit ({rows}x{cols}, nf4, blocksize=64):")
+    print(f"  Time: {ms_per_call:.3f} ms")
+    print(f"  Bandwidth: {bw_gbs:.2f} GB/s")
+
+    return ms_per_call, bw_gbs
+
+if __name__ == "__main__":
+    benchmark_bnb(8192, 8192)
diff --git a/03_nf4_dequant/SkyHigh-achieving/dequant_kernel.cu b/03_nf4_dequant/SkyHigh-achieving/dequant_kernel.cu
new file mode 100644
index 0000000..3090463
--- /dev/null
+++ b/03_nf4_dequant/SkyHigh-achieving/dequant_kernel.cu
@@ -0,0 +1 @@
+#include "dequant_kernel_v2.cu"
diff --git a/03_nf4_dequant/SkyHigh-achieving/dequant_kernel.h b/03_nf4_dequant/SkyHigh-achieving/dequant_kernel.h
new file mode 100644
index 0000000..da6fcb9
--- /dev/null
+++ b/03_nf4_dequant/SkyHigh-achieving/dequant_kernel.h
@@ -0,0 +1,29 @@
+#pragma once
+
+#include <cstdint>
+#include <vector>
+
+enum class ComputeType {
+    FP16,
+    BF16
+};
+
+struct DequantConfig {
+    int64_t rows;
+    int64_t cols;
+    int32_t blocksize;
+    ComputeType compute_type;
+};
+
+struct NF4Binary {
+    DequantConfig config;
+    std::vector<uint8_t> packed_weights;
+    std::vector<uint8_t> absmax_q;
+    std::vector<float> absmax2_raw;
+    std::vector<float> code2_raw;
+    float offset;
+};
+
+bool load_nf4_binary(const char* file_path, NF4Binary& out);
+bool save_float_output(const char* file_path, const std::vector<float>& data);
+bool run_dequant_cuda(const NF4Binary& input, std::vector<float>& output, float& mae);
diff --git 
a/03_nf4_dequant/SkyHigh-achieving/dequant_kernel.ptx b/03_nf4_dequant/SkyHigh-achieving/dequant_kernel.ptx new file mode 100644 index 0000000..cc80ebe --- /dev/null +++ b/03_nf4_dequant/SkyHigh-achieving/dequant_kernel.ptx @@ -0,0 +1,5017 @@ +// +// Generated by NVIDIA NVVM Compiler +// +// Compiler Build ID: CL-32267302 +// Cuda compilation tools, release 12.0, V12.0.140 +// Based on NVVM 7.0.1 +// + +.version 8.0 +.target sm_80 +.address_size 64 + +.const .align 4 .b8 _ZN48_INTERNAL_848bf537_17_dequant_kernel_cu_622ebb3250_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb325d_nf4E[64] = {0, 0, 128, 191, 177, 57, 50, 191, 48, 107, 6, 191, 160, 50, 202, 190, 77, 162, 145, 190, 63, 53, 61, 190, 113, 120, 186, 189, 0, 0, 0, 0, 255, 250, 162, 61, 227, 202, 36, 62, 221, 4, 124, 62, 58, 3, 173, 62, 184, 164, 225, 62, 171, 7, 16, 63, 179, 19, 57, 63, 0, 0, 128, 63}; + +.entry _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT_( + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_0, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_1, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_2, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_3, + .param .f32 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_4, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_5, + .param .u32 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_6, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_7 +) +{ + .reg .pred 
%p<5>; + .reg .b16 %rs<12>; + .reg .f32 %f<14>; + .reg .b32 %r<17>; + .reg .b64 %rd<58>; + .loc 1 102 0 + + + ld.param.u64 %rd15, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_0]; + ld.param.u64 %rd18, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_1]; + ld.param.u64 %rd19, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_2]; + ld.param.u64 %rd20, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_3]; + ld.param.f32 %f2, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_4]; + ld.param.u64 %rd16, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_5]; + ld.param.u32 %r1, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_6]; + ld.param.u64 %rd17, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_7]; + .loc 1 113 28 + cvta.to.global.u64 %rd1, %rd19; + cvta.to.global.u64 %rd2, %rd20; + cvta.to.global.u64 %rd3, %rd18; + mov.u32 %r2, %ctaid.x; + mov.u32 %r3, %ntid.x; + mul.wide.u32 %rd21, %r2, %r3; + mov.u32 %r4, %tid.x; + cvt.u64.u32 %rd22, %r4; + add.s64 %rd4, %rd21, %rd22; + .loc 1 114 25 + shl.b64 %rd5, %rd4, 1; + .loc 1 115 5 + setp.ge.s64 %p1, %rd5, %rd16; + @%p1 bra $L__BB0_10; + + .loc 1 113 28 + cvta.to.global.u64 %rd23, %rd15; + .loc 1 121 26 + add.s64 %rd24, %rd23, %rd4; + ld.global.nc.u8 %rs1, [%rd24]; + .loc 1 130 30 + cvt.s64.s32 %rd6, %r1; + or.b64 %rd25, %rd5, %rd6; + and.b64 %rd26, %rd25, -4294967296; + setp.eq.s64 %p2, %rd26, 0; + @%p2 bra $L__BB0_3; + + .loc 1 0 30 + div.s64 %rd56, %rd5, %rd6; + bra.uni $L__BB0_4; + +$L__BB0_3: + cvt.u32.u64 %r5, %rd6; + cvt.u32.u64 %r6, %rd5; + div.u32 %r7, 
%r6, %r5; + cvt.u64.u32 %rd56, %r7; + +$L__BB0_4: + .loc 1 132 24 + add.s64 %rd27, %rd3, %rd56; + ld.global.nc.u8 %rs2, [%rd27]; + cvt.u32.u16 %r8, %rs2; + and.b32 %r9, %r8, 255; + mul.wide.u32 %rd28, %r9, 4; + add.s64 %rd29, %rd2, %rd28; + .loc 1 131 30 + shr.s64 %rd30, %rd56, 63; + shr.u64 %rd31, %rd30, 56; + add.s64 %rd32, %rd56, %rd31; + shr.s64 %rd33, %rd32, 8; + .loc 1 132 24 + shl.b64 %rd34, %rd33, 2; + add.s64 %rd35, %rd1, %rd34; + ld.global.nc.f32 %f3, [%rd35]; + ld.global.nc.f32 %f4, [%rd29]; + mul.f32 %f5, %f4, %f3; + .loc 1 133 20 + shl.b16 %rs3, %rs1, 2; + cvt.u64.u16 %rd36, %rs3; + and.b64 %rd37, %rd36, 60; + mov.u64 %rd38, _ZN48_INTERNAL_848bf537_17_dequant_kernel_cu_622ebb3250_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb325d_nf4E; + add.s64 %rd39, %rd38, %rd37; + ld.const.f32 %f6, [%rd39]; + fma.rn.f32 %f1, %f6, %f5, %f2; + .loc 1 136 25 + add.s64 %rd10, %rd5, 1; + .loc 1 137 5 + setp.lt.s64 %p3, %rd10, %rd16; + .loc 1 113 28 + cvta.to.global.u64 %rd40, %rd17; + .loc 1 149 9 + shl.b64 %rd41, %rd5, 1; + add.s64 %rd11, %rd40, %rd41; + .loc 1 137 5 + @%p3 bra $L__BB0_6; + bra.uni $L__BB0_5; + +$L__BB0_6: + .loc 1 0 5 + or.b64 %rd42, %rd10, %rd6; + and.b64 %rd43, %rd42, -4294967296; + setp.eq.s64 %p4, %rd43, 0; + @%p4 bra $L__BB0_8; + + div.s64 %rd57, %rd10, %rd6; + bra.uni $L__BB0_9; + +$L__BB0_5: + .loc 1 152 22 + .loc 1 62 71, function_name $L__info_string0, inlined_at 1 152 22 + .loc 2 596 3, function_name $L__info_string1, inlined_at 1 62 71 + // begin inline asm + { cvt.rn.f16.f32 %rs4, %f1;} + + // end inline asm + .loc 1 152 22 + st.global.u16 [%rd11], %rs4; + bra.uni $L__BB0_10; + +$L__BB0_8: + .loc 1 0 22 + cvt.u32.u64 %r10, %rd6; + cvt.u32.u64 %r11, %rd10; + div.u32 %r12, %r11, %r10; + cvt.u64.u32 %rd57, %r12; + +$L__BB0_9: + .loc 1 143 28 + add.s64 %rd44, %rd3, %rd57; + ld.global.nc.u8 %rs9, [%rd44]; + cvt.u32.u16 %r14, %rs9; + and.b32 %r15, %r14, 255; + mul.wide.u32 %rd45, %r15, 4; + add.s64 %rd46, %rd2, %rd45; + .loc 1 142 34 + shr.s64 
%rd47, %rd57, 63; + shr.u64 %rd48, %rd47, 56; + add.s64 %rd49, %rd57, %rd48; + shr.s64 %rd50, %rd49, 8; + .loc 1 143 28 + shl.b64 %rd51, %rd50, 2; + add.s64 %rd52, %rd1, %rd51; + ld.global.nc.f32 %f10, [%rd52]; + ld.global.nc.f32 %f11, [%rd46]; + mul.f32 %f12, %f11, %f10; + .loc 1 123 20 + and.b16 %rs10, %rs1, 240; + shr.u16 %rs11, %rs10, 4; + .loc 1 144 24 + cvt.u32.u16 %r16, %rs11; + mul.wide.u32 %rd53, %r16, 4; + add.s64 %rd55, %rd38, %rd53; + ld.const.f32 %f13, [%rd55]; + fma.rn.f32 %f9, %f13, %f12, %f2; + .loc 1 149 36 + .loc 1 62 71, function_name $L__info_string0, inlined_at 1 149 36 + .loc 2 596 3, function_name $L__info_string1, inlined_at 1 62 71 + // begin inline asm + { cvt.rn.f16.f32 %rs5, %f1;} + + // end inline asm + .loc 1 149 52 + .loc 1 62 71, function_name $L__info_string0, inlined_at 1 149 52 + .loc 2 596 3, function_name $L__info_string1, inlined_at 1 62 71 + // begin inline asm + { cvt.rn.f16.f32 %rs6, %f9;} + + // end inline asm + .loc 1 149 9 + .loc 1 74 22, function_name $L__info_string2, inlined_at 1 149 9 + .loc 2 1419 5, function_name $L__info_string3, inlined_at 1 74 22 + // begin inline asm + { mov.b32 %r13, {%rs5,%rs6};} + + // end inline asm + .loc 1 75 5, function_name $L__info_string2, inlined_at 1 149 9 + st.global.u32 [%rd11], %r13; + +$L__BB0_10: + .loc 1 154 1 + ret; + +} +.entry _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT_( + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_0, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_1, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_2, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_3, + .param .f32 
_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_4, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_5, + .param .u32 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_6, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_7 +) +{ + .reg .pred %p<82>; + .reg .b16 %rs<189>; + .reg .f32 %f<194>; + .reg .b32 %r<327>; + .reg .b64 %rd<664>; + .loc 1 157 0 + + + ld.param.u64 %rd137, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_0]; + ld.param.u64 %rd140, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_1]; + ld.param.u64 %rd141, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_2]; + ld.param.u64 %rd142, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_3]; + ld.param.f32 %f17, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_4]; + ld.param.u64 %rd138, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_5]; + ld.param.u32 %r64, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_6]; + ld.param.u64 %rd139, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_7]; + cvta.to.global.u64 %rd1, %rd141; + cvta.to.global.u64 %rd2, %rd142; + cvta.to.global.u64 %rd3, %rd140; + .loc 1 169 23 + mov.u32 %r65, %ctaid.x; + mov.u32 %r66, %ntid.x; + mul.wide.u32 %rd143, %r65, %r66; + mov.u32 %r67, %tid.x; + cvt.u64.u32 %rd144, %r67; + add.s64 
%rd145, %rd143, %rd144; + .loc 1 170 29 + shl.b64 %rd4, %rd145, 4; + .loc 1 171 29 + shl.b64 %rd5, %rd145, 5; + .loc 1 172 35 + add.s64 %rd146, %rd138, 1; + shr.u64 %rd147, %rd146, 63; + add.s64 %rd148, %rd146, %rd147; + shr.s64 %rd6, %rd148, 1; + .loc 1 174 5 + setp.ge.s64 %p1, %rd5, %rd138; + @%p1 bra $L__BB1_194; + + .loc 1 178 5 + add.s64 %rd149, %rd4, 16; + setp.gt.s64 %p2, %rd149, %rd6; + cvta.to.global.u64 %rd150, %rd137; + .loc 1 206 13 + add.s64 %rd7, %rd150, %rd4; + .loc 1 178 5 + @%p2 bra $L__BB1_3; + bra.uni $L__BB1_2; + +$L__BB1_3: + .loc 1 206 13 + setp.ge.s64 %p3, %rd4, %rd6; + mov.u32 %r321, 0; + mov.u32 %r322, %r321; + @%p3 bra $L__BB1_5; + + ld.global.nc.u8 %rs17, [%rd7]; + cvt.u32.u16 %r73, %rs17; + and.b32 %r322, %r73, 255; + +$L__BB1_5: + .loc 1 205 29 + add.s64 %rd151, %rd4, 1; + .loc 1 206 13 + setp.ge.s64 %p4, %rd151, %rd6; + @%p4 bra $L__BB1_7; + + ld.global.nc.u8 %rs18, [%rd7+1]; + cvt.u32.u16 %r75, %rs18; + and.b32 %r321, %r75, 255; + +$L__BB1_7: + .loc 1 205 29 + add.s64 %rd152, %rd4, 2; + .loc 1 206 13 + setp.ge.s64 %p5, %rd152, %rd6; + mov.u32 %r319, 0; + mov.u32 %r320, %r319; + @%p5 bra $L__BB1_9; + + ld.global.nc.u8 %rs19, [%rd7+2]; + cvt.u32.u16 %r77, %rs19; + and.b32 %r320, %r77, 255; + +$L__BB1_9: + .loc 1 205 29 + add.s64 %rd153, %rd4, 3; + .loc 1 206 13 + setp.ge.s64 %p6, %rd153, %rd6; + @%p6 bra $L__BB1_11; + + ld.global.nc.u8 %rs20, [%rd7+3]; + cvt.u32.u16 %r79, %rs20; + and.b32 %r319, %r79, 255; + +$L__BB1_11: + .loc 1 205 29 + add.s64 %rd154, %rd4, 4; + .loc 1 206 13 + setp.ge.s64 %p7, %rd154, %rd6; + mov.u32 %r317, 0; + mov.u32 %r318, %r317; + @%p7 bra $L__BB1_13; + + ld.global.nc.u8 %rs21, [%rd7+4]; + cvt.u32.u16 %r81, %rs21; + and.b32 %r318, %r81, 255; + +$L__BB1_13: + .loc 1 205 29 + add.s64 %rd155, %rd4, 5; + .loc 1 206 13 + setp.ge.s64 %p8, %rd155, %rd6; + @%p8 bra $L__BB1_15; + + ld.global.nc.u8 %rs22, [%rd7+5]; + cvt.u32.u16 %r83, %rs22; + and.b32 %r317, %r83, 255; + +$L__BB1_15: + .loc 1 205 29 + add.s64 %rd156, 
%rd4, 6; + .loc 1 206 13 + setp.ge.s64 %p9, %rd156, %rd6; + mov.u32 %r315, 0; + mov.u32 %r316, %r315; + @%p9 bra $L__BB1_17; + + ld.global.nc.u8 %rs23, [%rd7+6]; + cvt.u32.u16 %r85, %rs23; + and.b32 %r316, %r85, 255; + +$L__BB1_17: + .loc 1 205 29 + add.s64 %rd157, %rd4, 7; + .loc 1 206 13 + setp.ge.s64 %p10, %rd157, %rd6; + @%p10 bra $L__BB1_19; + + ld.global.nc.u8 %rs24, [%rd7+7]; + cvt.u32.u16 %r87, %rs24; + and.b32 %r315, %r87, 255; + +$L__BB1_19: + .loc 1 205 29 + add.s64 %rd158, %rd4, 8; + .loc 1 206 13 + setp.ge.s64 %p11, %rd158, %rd6; + mov.u32 %r313, 0; + mov.u32 %r314, %r313; + @%p11 bra $L__BB1_21; + + ld.global.nc.u8 %rs25, [%rd7+8]; + cvt.u32.u16 %r89, %rs25; + and.b32 %r314, %r89, 255; + +$L__BB1_21: + .loc 1 205 29 + add.s64 %rd159, %rd4, 9; + .loc 1 206 13 + setp.ge.s64 %p12, %rd159, %rd6; + @%p12 bra $L__BB1_23; + + ld.global.nc.u8 %rs26, [%rd7+9]; + cvt.u32.u16 %r91, %rs26; + and.b32 %r313, %r91, 255; + +$L__BB1_23: + .loc 1 205 29 + add.s64 %rd160, %rd4, 10; + .loc 1 206 13 + setp.ge.s64 %p13, %rd160, %rd6; + mov.u32 %r311, 0; + mov.u32 %r312, %r311; + @%p13 bra $L__BB1_25; + + ld.global.nc.u8 %rs27, [%rd7+10]; + cvt.u32.u16 %r93, %rs27; + and.b32 %r312, %r93, 255; + +$L__BB1_25: + .loc 1 205 29 + add.s64 %rd161, %rd4, 11; + .loc 1 206 13 + setp.ge.s64 %p14, %rd161, %rd6; + @%p14 bra $L__BB1_27; + + ld.global.nc.u8 %rs28, [%rd7+11]; + cvt.u32.u16 %r95, %rs28; + and.b32 %r311, %r95, 255; + +$L__BB1_27: + .loc 1 205 29 + add.s64 %rd162, %rd4, 12; + .loc 1 206 13 + setp.ge.s64 %p15, %rd162, %rd6; + mov.u32 %r324, 0; + mov.u32 %r323, %r324; + @%p15 bra $L__BB1_29; + + ld.global.nc.u8 %rs29, [%rd7+12]; + cvt.u32.u16 %r97, %rs29; + and.b32 %r323, %r97, 255; + +$L__BB1_29: + .loc 1 205 29 + add.s64 %rd163, %rd4, 13; + .loc 1 206 13 + setp.ge.s64 %p16, %rd163, %rd6; + @%p16 bra $L__BB1_31; + + ld.global.nc.u8 %rs30, [%rd7+13]; + cvt.u32.u16 %r99, %rs30; + and.b32 %r324, %r99, 255; + +$L__BB1_31: + .loc 1 205 29 + add.s64 %rd164, %rd4, 14; + .loc 1 206 13 
+ setp.ge.s64 %p17, %rd164, %rd6; + mov.u32 %r326, 0; + mov.u32 %r325, %r326; + @%p17 bra $L__BB1_33; + + ld.global.nc.u8 %rs31, [%rd7+14]; + cvt.u32.u16 %r101, %rs31; + and.b32 %r325, %r101, 255; + +$L__BB1_33: + .loc 1 205 29 + add.s64 %rd165, %rd4, 15; + .loc 1 206 13 + setp.ge.s64 %p18, %rd165, %rd6; + @%p18 bra $L__BB1_35; + + ld.global.nc.u8 %rs32, [%rd7+15]; + cvt.u32.u16 %r103, %rs32; + and.b32 %r326, %r103, 255; + bra.uni $L__BB1_35; + +$L__BB1_2: + .loc 1 180 29 + ld.global.nc.v4.u32 {%r322, %r318, %r314, %r323}, [%rd7]; + .loc 1 186 17 + shr.u32 %r321, %r322, 8; + .loc 1 187 17 + shr.u32 %r320, %r322, 16; + .loc 1 188 17 + shr.u32 %r319, %r322, 24; + .loc 1 186 17 + shr.u32 %r317, %r318, 8; + .loc 1 187 17 + shr.u32 %r316, %r318, 16; + .loc 1 188 17 + shr.u32 %r315, %r318, 24; + .loc 1 186 17 + shr.u32 %r313, %r314, 8; + .loc 1 187 17 + shr.u32 %r312, %r314, 16; + .loc 1 188 17 + shr.u32 %r311, %r314, 24; + .loc 1 186 17 + shr.u32 %r324, %r323, 8; + .loc 1 187 17 + shr.u32 %r325, %r323, 16; + .loc 1 188 17 + shr.u32 %r326, %r323, 24; + +$L__BB1_35: + .loc 1 213 9 + cvt.u16.u32 %rs1, %r326; + cvt.u16.u32 %rs2, %r325; + cvt.u16.u32 %rs3, %r324; + cvt.u16.u32 %rs4, %r323; + cvt.u16.u32 %rs5, %r321; + cvt.u16.u32 %rs6, %r320; + cvt.u16.u32 %rs7, %r319; + cvt.u16.u32 %rs8, %r318; + cvt.u16.u32 %rs9, %r317; + cvt.u16.u32 %rs10, %r316; + cvt.u16.u32 %rs11, %r315; + cvt.u16.u32 %rs12, %r314; + cvt.u16.u32 %rs13, %r313; + cvt.u16.u32 %rs14, %r312; + cvt.u16.u32 %rs15, %r311; + cvt.u16.u32 %rs16, %r322; + cvt.s64.s32 %rd8, %r64; + or.b64 %rd166, %rd5, %rd8; + and.b64 %rd167, %rd166, -4294967296; + setp.eq.s64 %p19, %rd167, 0; + @%p19 bra $L__BB1_37; + + .loc 1 0 9 + div.s64 %rd632, %rd5, %rd8; + bra.uni $L__BB1_38; + +$L__BB1_37: + cvt.u32.u64 %r104, %rd8; + cvt.u32.u64 %r105, %rd5; + div.u32 %r106, %r105, %r104; + cvt.u64.u32 %rd632, %r106; + +$L__BB1_38: + .loc 1 219 28 + add.s64 %rd168, %rd3, %rd632; + ld.global.nc.u8 %rs33, [%rd168]; + cvt.u32.u16 %r107, 
%rs33; + and.b32 %r108, %r107, 255; + mul.wide.u32 %rd169, %r108, 4; + add.s64 %rd170, %rd2, %rd169; + shr.s64 %rd171, %rd632, 63; + shr.u64 %rd172, %rd171, 56; + add.s64 %rd173, %rd632, %rd172; + shr.s64 %rd174, %rd173, 8; + shl.b64 %rd175, %rd174, 2; + add.s64 %rd176, %rd1, %rd175; + ld.global.nc.f32 %f18, [%rd176]; + ld.global.nc.f32 %f19, [%rd170]; + mul.f32 %f20, %f19, %f18; + .loc 1 220 24 + shl.b16 %rs34, %rs16, 2; + cvt.u64.u16 %rd177, %rs34; + and.b64 %rd178, %rd177, 60; + mov.u64 %rd179, _ZN48_INTERNAL_848bf537_17_dequant_kernel_cu_622ebb3250_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb325d_nf4E; + add.s64 %rd180, %rd179, %rd178; + ld.const.f32 %f21, [%rd180]; + fma.rn.f32 %f1, %f21, %f20, %f17; + .loc 1 222 29 + add.s64 %rd12, %rd5, 1; + .loc 1 223 9 + setp.lt.s64 %p20, %rd12, %rd138; + cvta.to.global.u64 %rd181, %rd139; + .loc 1 227 13 + shl.b64 %rd182, %rd5, 1; + add.s64 %rd13, %rd181, %rd182; + .loc 1 223 9 + @%p20 bra $L__BB1_40; + bra.uni $L__BB1_39; + +$L__BB1_40: + .loc 1 0 9 + or.b64 %rd183, %rd12, %rd8; + and.b64 %rd184, %rd183, -4294967296; + setp.eq.s64 %p21, %rd184, 0; + @%p21 bra $L__BB1_42; + + div.s64 %rd633, %rd12, %rd8; + bra.uni $L__BB1_43; + +$L__BB1_39: + .loc 1 229 26 + .loc 1 62 71, function_name $L__info_string0, inlined_at 1 229 26 + .loc 2 596 3, function_name $L__info_string1, inlined_at 1 62 71 + // begin inline asm + { cvt.rn.f16.f32 %rs35, %f1;} + + // end inline asm + .loc 1 229 26 + st.global.u16 [%rd13], %rs35; + bra.uni $L__BB1_44; + +$L__BB1_42: + .loc 1 0 26 + cvt.u32.u64 %r109, %rd8; + cvt.u32.u64 %r110, %rd12; + div.u32 %r111, %r110, %r109; + cvt.u64.u32 %rd633, %r111; + +$L__BB1_43: + .loc 1 225 32 + add.s64 %rd185, %rd3, %rd633; + ld.global.nc.u8 %rs40, [%rd185]; + cvt.u32.u16 %r113, %rs40; + and.b32 %r114, %r113, 255; + mul.wide.u32 %rd186, %r114, 4; + add.s64 %rd187, %rd2, %rd186; + shr.s64 %rd188, %rd633, 63; + shr.u64 %rd189, %rd188, 56; + add.s64 %rd190, %rd633, %rd189; + shr.s64 %rd191, %rd190, 8; + shl.b64 
%rd192, %rd191, 2; + add.s64 %rd193, %rd1, %rd192; + ld.global.nc.f32 %f25, [%rd193]; + ld.global.nc.f32 %f26, [%rd187]; + mul.f32 %f27, %f26, %f25; + .loc 1 216 24 + and.b16 %rs41, %rs16, 240; + shr.u16 %rs42, %rs41, 4; + .loc 1 226 28 + cvt.u32.u16 %r115, %rs42; + mul.wide.u32 %rd194, %r115, 4; + add.s64 %rd196, %rd179, %rd194; + ld.const.f32 %f28, [%rd196]; + fma.rn.f32 %f24, %f28, %f27, %f17; + .loc 1 227 40 + .loc 1 62 71, function_name $L__info_string0, inlined_at 1 227 40 + .loc 2 596 3, function_name $L__info_string1, inlined_at 1 62 71 + // begin inline asm + { cvt.rn.f16.f32 %rs36, %f1;} + + // end inline asm + .loc 1 227 56 + .loc 1 62 71, function_name $L__info_string0, inlined_at 1 227 56 + .loc 2 596 3, function_name $L__info_string1, inlined_at 1 62 71 + // begin inline asm + { cvt.rn.f16.f32 %rs37, %f24;} + + // end inline asm + .loc 1 227 13 + .loc 1 74 22, function_name $L__info_string2, inlined_at 1 227 13 + .loc 2 1419 5, function_name $L__info_string3, inlined_at 1 74 22 + // begin inline asm + { mov.b32 %r112, {%rs36,%rs37};} + + // end inline asm + .loc 1 75 5, function_name $L__info_string2, inlined_at 1 227 13 + st.global.u32 [%rd13], %r112; + +$L__BB1_44: + .loc 1 212 29 + add.s64 %rd17, %rd5, 2; + .loc 1 213 9 + setp.ge.s64 %p22, %rd17, %rd138; + @%p22 bra $L__BB1_194; + + .loc 1 0 9 + or.b64 %rd197, %rd17, %rd8; + and.b64 %rd198, %rd197, -4294967296; + setp.eq.s64 %p23, %rd198, 0; + @%p23 bra $L__BB1_47; + + div.s64 %rd634, %rd17, %rd8; + bra.uni $L__BB1_48; + +$L__BB1_47: + cvt.u32.u64 %r116, %rd8; + cvt.u32.u64 %r117, %rd17; + div.u32 %r118, %r117, %r116; + cvt.u64.u32 %rd634, %r118; + +$L__BB1_48: + .loc 1 219 28 + add.s64 %rd199, %rd3, %rd634; + ld.global.nc.u8 %rs43, [%rd199]; + cvt.u32.u16 %r119, %rs43; + and.b32 %r120, %r119, 255; + mul.wide.u32 %rd200, %r120, 4; + add.s64 %rd201, %rd2, %rd200; + shr.s64 %rd202, %rd634, 63; + shr.u64 %rd203, %rd202, 56; + add.s64 %rd204, %rd634, %rd203; + shr.s64 %rd205, %rd204, 8; + shl.b64 
[PTX listing truncated: compiler-generated PTX for the unrolled dequantization kernel (several hundred lines, one near-identical block per unrolled element). Each unrolled step repeats the same pattern: a 64-bit index division by the group stride with a 32-bit fast path (taken when `(idx | stride) >> 32 == 0`), `ld.global.nc` read-only loads of the quantized absmax byte and the second-level scale, a `mul.f32` combining the two scale levels, an `ld.const.f32` NF4 table lookup in constant memory (low nibble via `shl`/`and 60`, high nibble via `and 240`/`shr 4`), an `fma.rn.f32` applying `NF4[idx] * scale + offset`, `cvt.rn.f16.f32` conversion to FP16, and a packed store of two FP16 values via `mov.b32 {lo, hi}` into a single `st.global.u32` (falling back to a scalar `st.global.u16` when the pair is cut off at the element-count boundary).]
29; + .loc 1 223 9 + setp.lt.s64 %p76, %rd125, %rd138; + @%p76 bra $L__BB1_180; + bra.uni $L__BB1_179; + +$L__BB1_180: + .loc 1 0 9 + or.b64 %rd589, %rd125, %rd8; + and.b64 %rd590, %rd589, -4294967296; + setp.eq.s64 %p77, %rd590, 0; + @%p77 bra $L__BB1_182; + + div.s64 %rd661, %rd125, %rd8; + bra.uni $L__BB1_183; + +$L__BB1_179: + .loc 2 596 3, function_name $L__info_string1, inlined_at 1 62 71 + // begin inline asm + { cvt.rn.f16.f32 %rs172, %f15;} + + // end inline asm + .loc 1 229 26 + st.global.u16 [%rd13+56], %rs172; + bra.uni $L__BB1_184; + +$L__BB1_182: + .loc 1 0 26 + cvt.u32.u64 %r277, %rd8; + cvt.u32.u64 %r278, %rd125; + div.u32 %r279, %r278, %r277; + cvt.u64.u32 %rd661, %r279; + +$L__BB1_183: + .loc 1 225 32 + add.s64 %rd591, %rd3, %rd661; + ld.global.nc.u8 %rs177, [%rd591]; + cvt.u32.u16 %r281, %rs177; + and.b32 %r282, %r281, 255; + mul.wide.u32 %rd592, %r282, 4; + add.s64 %rd593, %rd2, %rd592; + shr.s64 %rd594, %rd661, 63; + shr.u64 %rd595, %rd594, 56; + add.s64 %rd596, %rd661, %rd595; + shr.s64 %rd597, %rd596, 8; + shl.b64 %rd598, %rd597, 2; + add.s64 %rd599, %rd1, %rd598; + ld.global.nc.f32 %f179, [%rd599]; + ld.global.nc.f32 %f180, [%rd593]; + mul.f32 %f181, %f180, %f179; + .loc 1 216 24 + and.b16 %rs178, %rs2, 240; + shr.u16 %rs179, %rs178, 4; + .loc 1 226 28 + cvt.u32.u16 %r283, %rs179; + mul.wide.u32 %rd600, %r283, 4; + add.s64 %rd602, %rd179, %rd600; + ld.const.f32 %f182, [%rd602]; + fma.rn.f32 %f178, %f182, %f181, %f17; + .loc 2 596 3, function_name $L__info_string1, inlined_at 1 62 71 + // begin inline asm + { cvt.rn.f16.f32 %rs173, %f15;} + + // end inline asm + .loc 2 596 3, function_name $L__info_string1, inlined_at 1 62 71 + // begin inline asm + { cvt.rn.f16.f32 %rs174, %f178;} + + // end inline asm + .loc 2 1419 5, function_name $L__info_string3, inlined_at 1 74 22 + // begin inline asm + { mov.b32 %r280, {%rs173,%rs174};} + + // end inline asm + .loc 1 75 5, function_name $L__info_string2, inlined_at 1 227 13 + st.global.u32 [%rd13+56], 
%r280; + +$L__BB1_184: + .loc 1 212 29 + add.s64 %rd129, %rd5, 30; + .loc 1 213 9 + setp.ge.s64 %p78, %rd129, %rd138; + @%p78 bra $L__BB1_194; + + .loc 1 0 9 + or.b64 %rd603, %rd129, %rd8; + and.b64 %rd604, %rd603, -4294967296; + setp.eq.s64 %p79, %rd604, 0; + @%p79 bra $L__BB1_187; + + div.s64 %rd662, %rd129, %rd8; + bra.uni $L__BB1_188; + +$L__BB1_187: + cvt.u32.u64 %r284, %rd8; + cvt.u32.u64 %r285, %rd129; + div.u32 %r286, %r285, %r284; + cvt.u64.u32 %rd662, %r286; + +$L__BB1_188: + .loc 1 219 28 + add.s64 %rd605, %rd3, %rd662; + ld.global.nc.u8 %rs180, [%rd605]; + cvt.u32.u16 %r287, %rs180; + and.b32 %r288, %r287, 255; + mul.wide.u32 %rd606, %r288, 4; + add.s64 %rd607, %rd2, %rd606; + shr.s64 %rd608, %rd662, 63; + shr.u64 %rd609, %rd608, 56; + add.s64 %rd610, %rd662, %rd609; + shr.s64 %rd611, %rd610, 8; + shl.b64 %rd612, %rd611, 2; + add.s64 %rd613, %rd1, %rd612; + ld.global.nc.f32 %f183, [%rd613]; + ld.global.nc.f32 %f184, [%rd607]; + mul.f32 %f185, %f184, %f183; + .loc 1 220 24 + shl.b16 %rs181, %rs1, 2; + cvt.u64.u16 %rd614, %rs181; + and.b64 %rd615, %rd614, 60; + add.s64 %rd617, %rd179, %rd615; + ld.const.f32 %f186, [%rd617]; + fma.rn.f32 %f16, %f186, %f185, %f17; + .loc 1 222 29 + add.s64 %rd133, %rd5, 31; + .loc 1 223 9 + setp.lt.s64 %p80, %rd133, %rd138; + @%p80 bra $L__BB1_190; + bra.uni $L__BB1_189; + +$L__BB1_190: + .loc 1 0 9 + or.b64 %rd618, %rd133, %rd8; + and.b64 %rd619, %rd618, -4294967296; + setp.eq.s64 %p81, %rd619, 0; + @%p81 bra $L__BB1_192; + + div.s64 %rd663, %rd133, %rd8; + bra.uni $L__BB1_193; + +$L__BB1_189: + .loc 2 596 3, function_name $L__info_string1, inlined_at 1 62 71 + // begin inline asm + { cvt.rn.f16.f32 %rs182, %f16;} + + // end inline asm + .loc 1 229 26 + st.global.u16 [%rd13+60], %rs182; + bra.uni $L__BB1_194; + +$L__BB1_192: + .loc 1 0 26 + cvt.u32.u64 %r289, %rd8; + cvt.u32.u64 %r290, %rd133; + div.u32 %r291, %r290, %r289; + cvt.u64.u32 %rd663, %r291; + +$L__BB1_193: + .loc 1 225 32 + add.s64 %rd620, %rd3, %rd663; + 
ld.global.nc.u8 %rs187, [%rd620]; + cvt.u32.u16 %r293, %rs187; + and.b32 %r294, %r293, 255; + mul.wide.u32 %rd621, %r294, 4; + add.s64 %rd622, %rd2, %rd621; + shr.s64 %rd623, %rd663, 63; + shr.u64 %rd624, %rd623, 56; + add.s64 %rd625, %rd663, %rd624; + shr.s64 %rd626, %rd625, 8; + shl.b64 %rd627, %rd626, 2; + add.s64 %rd628, %rd1, %rd627; + ld.global.nc.f32 %f190, [%rd628]; + ld.global.nc.f32 %f191, [%rd622]; + mul.f32 %f192, %f191, %f190; + .loc 1 216 24 + shr.u16 %rs188, %rs1, 4; + .loc 1 226 28 + cvt.u32.u16 %r295, %rs188; + mul.wide.u32 %rd629, %r295, 4; + add.s64 %rd631, %rd179, %rd629; + ld.const.f32 %f193, [%rd631]; + fma.rn.f32 %f189, %f193, %f192, %f17; + .loc 2 596 3, function_name $L__info_string1, inlined_at 1 62 71 + // begin inline asm + { cvt.rn.f16.f32 %rs183, %f16;} + + // end inline asm + .loc 2 596 3, function_name $L__info_string1, inlined_at 1 62 71 + // begin inline asm + { cvt.rn.f16.f32 %rs184, %f189;} + + // end inline asm + .loc 2 1419 5, function_name $L__info_string3, inlined_at 1 74 22 + // begin inline asm + { mov.b32 %r292, {%rs183,%rs184};} + + // end inline asm + .loc 1 75 5, function_name $L__info_string2, inlined_at 1 227 13 + st.global.u32 [%rd13+60], %r292; + +$L__BB1_194: + .loc 1 232 1 + ret; + +} +.entry _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT_( + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_0, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_1, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_2, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_3, + .param .f32 
_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_4, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_5, + .param .u32 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_6, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_7 +) +{ + .reg .pred %p<5>; + .reg .b16 %rs<12>; + .reg .f32 %f<14>; + .reg .b32 %r<17>; + .reg .b64 %rd<58>; + .loc 1 102 0 + + + ld.param.u64 %rd15, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_0]; + ld.param.u64 %rd18, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_1]; + ld.param.u64 %rd19, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_2]; + ld.param.u64 %rd20, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_3]; + ld.param.f32 %f2, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_4]; + ld.param.u64 %rd16, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_5]; + ld.param.u32 %r1, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_6]; + ld.param.u64 %rd17, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_7]; + .loc 1 113 28 + cvta.to.global.u64 %rd1, %rd19; + cvta.to.global.u64 %rd2, %rd20; + cvta.to.global.u64 %rd3, %rd18; + mov.u32 %r2, %ctaid.x; + mov.u32 %r3, %ntid.x; + mul.wide.u32 %rd21, %r2, %r3; + mov.u32 %r4, %tid.x; + 
cvt.u64.u32 %rd22, %r4; + add.s64 %rd4, %rd21, %rd22; + .loc 1 114 25 + shl.b64 %rd5, %rd4, 1; + .loc 1 115 5 + setp.ge.s64 %p1, %rd5, %rd16; + @%p1 bra $L__BB2_10; + + .loc 1 113 28 + cvta.to.global.u64 %rd23, %rd15; + .loc 1 121 26 + add.s64 %rd24, %rd23, %rd4; + ld.global.nc.u8 %rs1, [%rd24]; + .loc 1 130 30 + cvt.s64.s32 %rd6, %r1; + or.b64 %rd25, %rd5, %rd6; + and.b64 %rd26, %rd25, -4294967296; + setp.eq.s64 %p2, %rd26, 0; + @%p2 bra $L__BB2_3; + + .loc 1 0 30 + div.s64 %rd56, %rd5, %rd6; + bra.uni $L__BB2_4; + +$L__BB2_3: + cvt.u32.u64 %r5, %rd6; + cvt.u32.u64 %r6, %rd5; + div.u32 %r7, %r6, %r5; + cvt.u64.u32 %rd56, %r7; + +$L__BB2_4: + .loc 1 132 24 + add.s64 %rd27, %rd3, %rd56; + ld.global.nc.u8 %rs2, [%rd27]; + cvt.u32.u16 %r8, %rs2; + and.b32 %r9, %r8, 255; + mul.wide.u32 %rd28, %r9, 4; + add.s64 %rd29, %rd2, %rd28; + .loc 1 131 30 + shr.s64 %rd30, %rd56, 63; + shr.u64 %rd31, %rd30, 56; + add.s64 %rd32, %rd56, %rd31; + shr.s64 %rd33, %rd32, 8; + .loc 1 132 24 + shl.b64 %rd34, %rd33, 2; + add.s64 %rd35, %rd1, %rd34; + ld.global.nc.f32 %f3, [%rd35]; + ld.global.nc.f32 %f4, [%rd29]; + mul.f32 %f5, %f4, %f3; + .loc 1 133 20 + shl.b16 %rs3, %rs1, 2; + cvt.u64.u16 %rd36, %rs3; + and.b64 %rd37, %rd36, 60; + mov.u64 %rd38, _ZN48_INTERNAL_848bf537_17_dequant_kernel_cu_622ebb3250_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb325d_nf4E; + add.s64 %rd39, %rd38, %rd37; + ld.const.f32 %f6, [%rd39]; + fma.rn.f32 %f1, %f6, %f5, %f2; + .loc 1 136 25 + add.s64 %rd10, %rd5, 1; + .loc 1 137 5 + setp.lt.s64 %p3, %rd10, %rd16; + .loc 1 113 28 + cvta.to.global.u64 %rd40, %rd17; + .loc 1 149 9 + shl.b64 %rd41, %rd5, 1; + add.s64 %rd11, %rd40, %rd41; + .loc 1 137 5 + @%p3 bra $L__BB2_6; + bra.uni $L__BB2_5; + +$L__BB2_6: + .loc 1 0 5 + or.b64 %rd42, %rd10, %rd6; + and.b64 %rd43, %rd42, -4294967296; + setp.eq.s64 %p4, %rd43, 0; + @%p4 bra $L__BB2_8; + + div.s64 %rd57, %rd10, %rd6; + bra.uni $L__BB2_9; + +$L__BB2_5: + .loc 1 152 22 + .loc 1 63 85, function_name $L__info_string4, 
inlined_at 1 152 22 + .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85 + // begin inline asm + { cvt.rn.bf16.f32 %rs4, %f1;} + + // end inline asm + .loc 1 152 22 + st.global.u16 [%rd11], %rs4; + bra.uni $L__BB2_10; + +$L__BB2_8: + .loc 1 0 22 + cvt.u32.u64 %r10, %rd6; + cvt.u32.u64 %r11, %rd10; + div.u32 %r12, %r11, %r10; + cvt.u64.u32 %rd57, %r12; + +$L__BB2_9: + .loc 1 143 28 + add.s64 %rd44, %rd3, %rd57; + ld.global.nc.u8 %rs9, [%rd44]; + cvt.u32.u16 %r14, %rs9; + and.b32 %r15, %r14, 255; + mul.wide.u32 %rd45, %r15, 4; + add.s64 %rd46, %rd2, %rd45; + .loc 1 142 34 + shr.s64 %rd47, %rd57, 63; + shr.u64 %rd48, %rd47, 56; + add.s64 %rd49, %rd57, %rd48; + shr.s64 %rd50, %rd49, 8; + .loc 1 143 28 + shl.b64 %rd51, %rd50, 2; + add.s64 %rd52, %rd1, %rd51; + ld.global.nc.f32 %f10, [%rd52]; + ld.global.nc.f32 %f11, [%rd46]; + mul.f32 %f12, %f11, %f10; + .loc 1 123 20 + and.b16 %rs10, %rs1, 240; + shr.u16 %rs11, %rs10, 4; + .loc 1 144 24 + cvt.u32.u16 %r16, %rs11; + mul.wide.u32 %rd53, %r16, 4; + add.s64 %rd55, %rd38, %rd53; + ld.const.f32 %f13, [%rd55]; + fma.rn.f32 %f9, %f13, %f12, %f2; + .loc 1 149 36 + .loc 1 63 85, function_name $L__info_string4, inlined_at 1 149 36 + .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85 + // begin inline asm + { cvt.rn.bf16.f32 %rs5, %f1;} + + // end inline asm + .loc 1 149 52 + .loc 1 63 85, function_name $L__info_string4, inlined_at 1 149 52 + .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85 + // begin inline asm + { cvt.rn.bf16.f32 %rs6, %f9;} + + // end inline asm + .loc 1 149 9 + .loc 1 82 29, function_name $L__info_string6, inlined_at 1 149 9 + .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29 + // begin inline asm + { mov.b32 %r13, {%rs5,%rs6};} + + // end inline asm + .loc 1 83 5, function_name $L__info_string6, inlined_at 1 149 9 + st.global.u32 [%rd11], %r13; + +$L__BB2_10: + .loc 1 154 1 + ret; + +} +.entry 
_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT_( + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_0, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_1, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_2, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_3, + .param .f32 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_4, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_5, + .param .u32 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_6, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_7 +) +{ + .reg .pred %p<82>; + .reg .b16 %rs<189>; + .reg .f32 %f<194>; + .reg .b32 %r<327>; + .reg .b64 %rd<664>; + .loc 1 157 0 + + + ld.param.u64 %rd137, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_0]; + ld.param.u64 %rd140, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_1]; + ld.param.u64 %rd141, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_2]; + ld.param.u64 %rd142, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_3]; + ld.param.f32 %f17, 
[_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_4]; + ld.param.u64 %rd138, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_5]; + ld.param.u32 %r64, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_6]; + ld.param.u64 %rd139, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_7]; + cvta.to.global.u64 %rd1, %rd141; + cvta.to.global.u64 %rd2, %rd142; + cvta.to.global.u64 %rd3, %rd140; + .loc 1 169 23 + mov.u32 %r65, %ctaid.x; + mov.u32 %r66, %ntid.x; + mul.wide.u32 %rd143, %r65, %r66; + mov.u32 %r67, %tid.x; + cvt.u64.u32 %rd144, %r67; + add.s64 %rd145, %rd143, %rd144; + .loc 1 170 29 + shl.b64 %rd4, %rd145, 4; + .loc 1 171 29 + shl.b64 %rd5, %rd145, 5; + .loc 1 172 35 + add.s64 %rd146, %rd138, 1; + shr.u64 %rd147, %rd146, 63; + add.s64 %rd148, %rd146, %rd147; + shr.s64 %rd6, %rd148, 1; + .loc 1 174 5 + setp.ge.s64 %p1, %rd5, %rd138; + @%p1 bra $L__BB3_194; + + .loc 1 178 5 + add.s64 %rd149, %rd4, 16; + setp.gt.s64 %p2, %rd149, %rd6; + cvta.to.global.u64 %rd150, %rd137; + .loc 1 206 13 + add.s64 %rd7, %rd150, %rd4; + .loc 1 178 5 + @%p2 bra $L__BB3_3; + bra.uni $L__BB3_2; + +$L__BB3_3: + .loc 1 206 13 + setp.ge.s64 %p3, %rd4, %rd6; + mov.u32 %r321, 0; + mov.u32 %r322, %r321; + @%p3 bra $L__BB3_5; + + ld.global.nc.u8 %rs17, [%rd7]; + cvt.u32.u16 %r73, %rs17; + and.b32 %r322, %r73, 255; + +$L__BB3_5: + .loc 1 205 29 + add.s64 %rd151, %rd4, 1; + .loc 1 206 13 + setp.ge.s64 %p4, %rd151, %rd6; + @%p4 bra $L__BB3_7; + + ld.global.nc.u8 %rs18, [%rd7+1]; + cvt.u32.u16 %r75, %rs18; + and.b32 %r321, %r75, 255; + +$L__BB3_7: + .loc 1 205 29 + add.s64 %rd152, %rd4, 2; + .loc 1 206 13 + setp.ge.s64 %p5, %rd152, %rd6; + mov.u32 %r319, 0; + mov.u32 %r320, %r319; + @%p5 bra $L__BB3_9; + + ld.global.nc.u8 
%rs19, [%rd7+2]; + cvt.u32.u16 %r77, %rs19; + and.b32 %r320, %r77, 255; + +$L__BB3_9: + .loc 1 205 29 + add.s64 %rd153, %rd4, 3; + .loc 1 206 13 + setp.ge.s64 %p6, %rd153, %rd6; + @%p6 bra $L__BB3_11; + + ld.global.nc.u8 %rs20, [%rd7+3]; + cvt.u32.u16 %r79, %rs20; + and.b32 %r319, %r79, 255; + +$L__BB3_11: + .loc 1 205 29 + add.s64 %rd154, %rd4, 4; + .loc 1 206 13 + setp.ge.s64 %p7, %rd154, %rd6; + mov.u32 %r317, 0; + mov.u32 %r318, %r317; + @%p7 bra $L__BB3_13; + + ld.global.nc.u8 %rs21, [%rd7+4]; + cvt.u32.u16 %r81, %rs21; + and.b32 %r318, %r81, 255; + +$L__BB3_13: + .loc 1 205 29 + add.s64 %rd155, %rd4, 5; + .loc 1 206 13 + setp.ge.s64 %p8, %rd155, %rd6; + @%p8 bra $L__BB3_15; + + ld.global.nc.u8 %rs22, [%rd7+5]; + cvt.u32.u16 %r83, %rs22; + and.b32 %r317, %r83, 255; + +$L__BB3_15: + .loc 1 205 29 + add.s64 %rd156, %rd4, 6; + .loc 1 206 13 + setp.ge.s64 %p9, %rd156, %rd6; + mov.u32 %r315, 0; + mov.u32 %r316, %r315; + @%p9 bra $L__BB3_17; + + ld.global.nc.u8 %rs23, [%rd7+6]; + cvt.u32.u16 %r85, %rs23; + and.b32 %r316, %r85, 255; + +$L__BB3_17: + .loc 1 205 29 + add.s64 %rd157, %rd4, 7; + .loc 1 206 13 + setp.ge.s64 %p10, %rd157, %rd6; + @%p10 bra $L__BB3_19; + + ld.global.nc.u8 %rs24, [%rd7+7]; + cvt.u32.u16 %r87, %rs24; + and.b32 %r315, %r87, 255; + +$L__BB3_19: + .loc 1 205 29 + add.s64 %rd158, %rd4, 8; + .loc 1 206 13 + setp.ge.s64 %p11, %rd158, %rd6; + mov.u32 %r313, 0; + mov.u32 %r314, %r313; + @%p11 bra $L__BB3_21; + + ld.global.nc.u8 %rs25, [%rd7+8]; + cvt.u32.u16 %r89, %rs25; + and.b32 %r314, %r89, 255; + +$L__BB3_21: + .loc 1 205 29 + add.s64 %rd159, %rd4, 9; + .loc 1 206 13 + setp.ge.s64 %p12, %rd159, %rd6; + @%p12 bra $L__BB3_23; + + ld.global.nc.u8 %rs26, [%rd7+9]; + cvt.u32.u16 %r91, %rs26; + and.b32 %r313, %r91, 255; + +$L__BB3_23: + .loc 1 205 29 + add.s64 %rd160, %rd4, 10; + .loc 1 206 13 + setp.ge.s64 %p13, %rd160, %rd6; + mov.u32 %r311, 0; + mov.u32 %r312, %r311; + @%p13 bra $L__BB3_25; + + ld.global.nc.u8 %rs27, [%rd7+10]; + cvt.u32.u16 %r93, 
%rs27; + and.b32 %r312, %r93, 255; + +$L__BB3_25: + .loc 1 205 29 + add.s64 %rd161, %rd4, 11; + .loc 1 206 13 + setp.ge.s64 %p14, %rd161, %rd6; + @%p14 bra $L__BB3_27; + + ld.global.nc.u8 %rs28, [%rd7+11]; + cvt.u32.u16 %r95, %rs28; + and.b32 %r311, %r95, 255; + +$L__BB3_27: + .loc 1 205 29 + add.s64 %rd162, %rd4, 12; + .loc 1 206 13 + setp.ge.s64 %p15, %rd162, %rd6; + mov.u32 %r324, 0; + mov.u32 %r323, %r324; + @%p15 bra $L__BB3_29; + + ld.global.nc.u8 %rs29, [%rd7+12]; + cvt.u32.u16 %r97, %rs29; + and.b32 %r323, %r97, 255; + +$L__BB3_29: + .loc 1 205 29 + add.s64 %rd163, %rd4, 13; + .loc 1 206 13 + setp.ge.s64 %p16, %rd163, %rd6; + @%p16 bra $L__BB3_31; + + ld.global.nc.u8 %rs30, [%rd7+13]; + cvt.u32.u16 %r99, %rs30; + and.b32 %r324, %r99, 255; + +$L__BB3_31: + .loc 1 205 29 + add.s64 %rd164, %rd4, 14; + .loc 1 206 13 + setp.ge.s64 %p17, %rd164, %rd6; + mov.u32 %r326, 0; + mov.u32 %r325, %r326; + @%p17 bra $L__BB3_33; + + ld.global.nc.u8 %rs31, [%rd7+14]; + cvt.u32.u16 %r101, %rs31; + and.b32 %r325, %r101, 255; + +$L__BB3_33: + .loc 1 205 29 + add.s64 %rd165, %rd4, 15; + .loc 1 206 13 + setp.ge.s64 %p18, %rd165, %rd6; + @%p18 bra $L__BB3_35; + + ld.global.nc.u8 %rs32, [%rd7+15]; + cvt.u32.u16 %r103, %rs32; + and.b32 %r326, %r103, 255; + bra.uni $L__BB3_35; + +$L__BB3_2: + .loc 1 180 29 + ld.global.nc.v4.u32 {%r322, %r318, %r314, %r323}, [%rd7]; + .loc 1 186 17 + shr.u32 %r321, %r322, 8; + .loc 1 187 17 + shr.u32 %r320, %r322, 16; + .loc 1 188 17 + shr.u32 %r319, %r322, 24; + .loc 1 186 17 + shr.u32 %r317, %r318, 8; + .loc 1 187 17 + shr.u32 %r316, %r318, 16; + .loc 1 188 17 + shr.u32 %r315, %r318, 24; + .loc 1 186 17 + shr.u32 %r313, %r314, 8; + .loc 1 187 17 + shr.u32 %r312, %r314, 16; + .loc 1 188 17 + shr.u32 %r311, %r314, 24; + .loc 1 186 17 + shr.u32 %r324, %r323, 8; + .loc 1 187 17 + shr.u32 %r325, %r323, 16; + .loc 1 188 17 + shr.u32 %r326, %r323, 24; + +$L__BB3_35: + .loc 1 213 9 + cvt.u16.u32 %rs1, %r326; + cvt.u16.u32 %rs2, %r325; + cvt.u16.u32 %rs3, 
%r324; + cvt.u16.u32 %rs4, %r323; + cvt.u16.u32 %rs5, %r321; + cvt.u16.u32 %rs6, %r320; + cvt.u16.u32 %rs7, %r319; + cvt.u16.u32 %rs8, %r318; + cvt.u16.u32 %rs9, %r317; + cvt.u16.u32 %rs10, %r316; + cvt.u16.u32 %rs11, %r315; + cvt.u16.u32 %rs12, %r314; + cvt.u16.u32 %rs13, %r313; + cvt.u16.u32 %rs14, %r312; + cvt.u16.u32 %rs15, %r311; + cvt.u16.u32 %rs16, %r322; + cvt.s64.s32 %rd8, %r64; + or.b64 %rd166, %rd5, %rd8; + and.b64 %rd167, %rd166, -4294967296; + setp.eq.s64 %p19, %rd167, 0; + @%p19 bra $L__BB3_37; + + .loc 1 0 9 + div.s64 %rd632, %rd5, %rd8; + bra.uni $L__BB3_38; + +$L__BB3_37: + cvt.u32.u64 %r104, %rd8; + cvt.u32.u64 %r105, %rd5; + div.u32 %r106, %r105, %r104; + cvt.u64.u32 %rd632, %r106; + +$L__BB3_38: + .loc 1 219 28 + add.s64 %rd168, %rd3, %rd632; + ld.global.nc.u8 %rs33, [%rd168]; + cvt.u32.u16 %r107, %rs33; + and.b32 %r108, %r107, 255; + mul.wide.u32 %rd169, %r108, 4; + add.s64 %rd170, %rd2, %rd169; + shr.s64 %rd171, %rd632, 63; + shr.u64 %rd172, %rd171, 56; + add.s64 %rd173, %rd632, %rd172; + shr.s64 %rd174, %rd173, 8; + shl.b64 %rd175, %rd174, 2; + add.s64 %rd176, %rd1, %rd175; + ld.global.nc.f32 %f18, [%rd176]; + ld.global.nc.f32 %f19, [%rd170]; + mul.f32 %f20, %f19, %f18; + .loc 1 220 24 + shl.b16 %rs34, %rs16, 2; + cvt.u64.u16 %rd177, %rs34; + and.b64 %rd178, %rd177, 60; + mov.u64 %rd179, _ZN48_INTERNAL_848bf537_17_dequant_kernel_cu_622ebb3250_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb325d_nf4E; + add.s64 %rd180, %rd179, %rd178; + ld.const.f32 %f21, [%rd180]; + fma.rn.f32 %f1, %f21, %f20, %f17; + .loc 1 222 29 + add.s64 %rd12, %rd5, 1; + .loc 1 223 9 + setp.lt.s64 %p20, %rd12, %rd138; + cvta.to.global.u64 %rd181, %rd139; + .loc 1 227 13 + shl.b64 %rd182, %rd5, 1; + add.s64 %rd13, %rd181, %rd182; + .loc 1 223 9 + @%p20 bra $L__BB3_40; + bra.uni $L__BB3_39; + +$L__BB3_40: + .loc 1 0 9 + or.b64 %rd183, %rd12, %rd8; + and.b64 %rd184, %rd183, -4294967296; + setp.eq.s64 %p21, %rd184, 0; + @%p21 bra $L__BB3_42; + + div.s64 %rd633, %rd12, %rd8; + 
bra.uni $L__BB3_43; + +$L__BB3_39: + .loc 1 229 26 + .loc 1 63 85, function_name $L__info_string4, inlined_at 1 229 26 + .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85 + // begin inline asm + { cvt.rn.bf16.f32 %rs35, %f1;} + + // end inline asm + .loc 1 229 26 + st.global.u16 [%rd13], %rs35; + bra.uni $L__BB3_44; + +$L__BB3_42: + .loc 1 0 26 + cvt.u32.u64 %r109, %rd8; + cvt.u32.u64 %r110, %rd12; + div.u32 %r111, %r110, %r109; + cvt.u64.u32 %rd633, %r111; + +$L__BB3_43: + .loc 1 225 32 + add.s64 %rd185, %rd3, %rd633; + ld.global.nc.u8 %rs40, [%rd185]; + cvt.u32.u16 %r113, %rs40; + and.b32 %r114, %r113, 255; + mul.wide.u32 %rd186, %r114, 4; + add.s64 %rd187, %rd2, %rd186; + shr.s64 %rd188, %rd633, 63; + shr.u64 %rd189, %rd188, 56; + add.s64 %rd190, %rd633, %rd189; + shr.s64 %rd191, %rd190, 8; + shl.b64 %rd192, %rd191, 2; + add.s64 %rd193, %rd1, %rd192; + ld.global.nc.f32 %f25, [%rd193]; + ld.global.nc.f32 %f26, [%rd187]; + mul.f32 %f27, %f26, %f25; + .loc 1 216 24 + and.b16 %rs41, %rs16, 240; + shr.u16 %rs42, %rs41, 4; + .loc 1 226 28 + cvt.u32.u16 %r115, %rs42; + mul.wide.u32 %rd194, %r115, 4; + add.s64 %rd196, %rd179, %rd194; + ld.const.f32 %f28, [%rd196]; + fma.rn.f32 %f24, %f28, %f27, %f17; + .loc 1 227 40 + .loc 1 63 85, function_name $L__info_string4, inlined_at 1 227 40 + .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85 + // begin inline asm + { cvt.rn.bf16.f32 %rs36, %f1;} + + // end inline asm + .loc 1 227 56 + .loc 1 63 85, function_name $L__info_string4, inlined_at 1 227 56 + .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85 + // begin inline asm + { cvt.rn.bf16.f32 %rs37, %f24;} + + // end inline asm + .loc 1 227 13 + .loc 1 82 29, function_name $L__info_string6, inlined_at 1 227 13 + .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29 + // begin inline asm + { mov.b32 %r112, {%rs36,%rs37};} + + // end inline asm + .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13 + 
st.global.u32 [%rd13], %r112; + +$L__BB3_44: + .loc 1 212 29 + add.s64 %rd17, %rd5, 2; + .loc 1 213 9 + setp.ge.s64 %p22, %rd17, %rd138; + @%p22 bra $L__BB3_194; + + .loc 1 0 9 + or.b64 %rd197, %rd17, %rd8; + and.b64 %rd198, %rd197, -4294967296; + setp.eq.s64 %p23, %rd198, 0; + @%p23 bra $L__BB3_47; + + div.s64 %rd634, %rd17, %rd8; + bra.uni $L__BB3_48; + +$L__BB3_47: + cvt.u32.u64 %r116, %rd8; + cvt.u32.u64 %r117, %rd17; + div.u32 %r118, %r117, %r116; + cvt.u64.u32 %rd634, %r118; + +$L__BB3_48: + .loc 1 219 28 + add.s64 %rd199, %rd3, %rd634; + ld.global.nc.u8 %rs43, [%rd199]; + cvt.u32.u16 %r119, %rs43; + and.b32 %r120, %r119, 255; + mul.wide.u32 %rd200, %r120, 4; + add.s64 %rd201, %rd2, %rd200; + shr.s64 %rd202, %rd634, 63; + shr.u64 %rd203, %rd202, 56; + add.s64 %rd204, %rd634, %rd203; + shr.s64 %rd205, %rd204, 8; + shl.b64 %rd206, %rd205, 2; + add.s64 %rd207, %rd1, %rd206; + ld.global.nc.f32 %f29, [%rd207]; + ld.global.nc.f32 %f30, [%rd201]; + mul.f32 %f31, %f30, %f29; + .loc 1 220 24 + shl.b16 %rs44, %rs5, 2; + cvt.u64.u16 %rd208, %rs44; + and.b64 %rd209, %rd208, 60; + add.s64 %rd211, %rd179, %rd209; + ld.const.f32 %f32, [%rd211]; + fma.rn.f32 %f2, %f32, %f31, %f17; + .loc 1 222 29 + add.s64 %rd21, %rd5, 3; + .loc 1 223 9 + setp.lt.s64 %p24, %rd21, %rd138; + @%p24 bra $L__BB3_50; + bra.uni $L__BB3_49; + +$L__BB3_50: + .loc 1 0 9 + or.b64 %rd212, %rd21, %rd8; + and.b64 %rd213, %rd212, -4294967296; + setp.eq.s64 %p25, %rd213, 0; + @%p25 bra $L__BB3_52; + + div.s64 %rd635, %rd21, %rd8; + bra.uni $L__BB3_53; + +$L__BB3_49: + .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85 + // begin inline asm + { cvt.rn.bf16.f32 %rs45, %f2;} + + // end inline asm + .loc 1 229 26 + st.global.u16 [%rd13+4], %rs45; + bra.uni $L__BB3_54; + +$L__BB3_52: + .loc 1 0 26 + cvt.u32.u64 %r121, %rd8; + cvt.u32.u64 %r122, %rd21; + div.u32 %r123, %r122, %r121; + cvt.u64.u32 %rd635, %r123; + +$L__BB3_53: + .loc 1 225 32 + add.s64 %rd214, %rd3, %rd635; + ld.global.nc.u8 %rs50, 
[%rd214];
+ cvt.u32.u16 %r125, %rs50;
+ and.b32 %r126, %r125, 255;
+ mul.wide.u32 %rd215, %r126, 4;
+ add.s64 %rd216, %rd2, %rd215;
+ shr.s64 %rd217, %rd635, 63;
+ shr.u64 %rd218, %rd217, 56;
+ add.s64 %rd219, %rd635, %rd218;
+ shr.s64 %rd220, %rd219, 8;
+ shl.b64 %rd221, %rd220, 2;
+ add.s64 %rd222, %rd1, %rd221;
+ ld.global.nc.f32 %f36, [%rd222];
+ ld.global.nc.f32 %f37, [%rd216];
+ mul.f32 %f38, %f37, %f36;
+ .loc 1 216 24
+ and.b16 %rs51, %rs5, 240;
+ shr.u16 %rs52, %rs51, 4;
+ .loc 1 226 28
+ cvt.u32.u16 %r127, %rs52;
+ mul.wide.u32 %rd223, %r127, 4;
+ add.s64 %rd225, %rd179, %rd223;
+ ld.const.f32 %f39, [%rd225];
+ fma.rn.f32 %f35, %f39, %f38, %f17;
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs46, %f2;}
+
+ // end inline asm
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs47, %f35;}
+
+ // end inline asm
+ .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29
+ // begin inline asm
+ { mov.b32 %r124, {%rs46,%rs47};}
+
+ // end inline asm
+ .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13
+ st.global.u32 [%rd13+4], %r124;
+
+$L__BB3_54:
+ .loc 1 212 29
+ add.s64 %rd25, %rd5, 4;
+ .loc 1 213 9
+ setp.ge.s64 %p26, %rd25, %rd138;
+ @%p26 bra $L__BB3_194;
+
+ .loc 1 0 9
+ or.b64 %rd226, %rd25, %rd8;
+ and.b64 %rd227, %rd226, -4294967296;
+ setp.eq.s64 %p27, %rd227, 0;
+ @%p27 bra $L__BB3_57;
+
+ div.s64 %rd636, %rd25, %rd8;
+ bra.uni $L__BB3_58;
+
+$L__BB3_57:
+ cvt.u32.u64 %r128, %rd8;
+ cvt.u32.u64 %r129, %rd25;
+ div.u32 %r130, %r129, %r128;
+ cvt.u64.u32 %rd636, %r130;
+
+$L__BB3_58:
+ .loc 1 219 28
+ add.s64 %rd228, %rd3, %rd636;
+ ld.global.nc.u8 %rs53, [%rd228];
+ cvt.u32.u16 %r131, %rs53;
+ and.b32 %r132, %r131, 255;
+ mul.wide.u32 %rd229, %r132, 4;
+ add.s64 %rd230, %rd2, %rd229;
+ shr.s64 %rd231, %rd636, 63;
+ shr.u64 %rd232, %rd231, 56;
+ add.s64 %rd233, %rd636, %rd232;
+ shr.s64 %rd234, %rd233, 8;
+ shl.b64 %rd235, %rd234, 2;
+ add.s64 %rd236, %rd1, %rd235;
+ ld.global.nc.f32 %f40, [%rd236];
+ ld.global.nc.f32 %f41, [%rd230];
+ mul.f32 %f42, %f41, %f40;
+ .loc 1 220 24
+ shl.b16 %rs54, %rs6, 2;
+ cvt.u64.u16 %rd237, %rs54;
+ and.b64 %rd238, %rd237, 60;
+ add.s64 %rd240, %rd179, %rd238;
+ ld.const.f32 %f43, [%rd240];
+ fma.rn.f32 %f3, %f43, %f42, %f17;
+ .loc 1 222 29
+ add.s64 %rd29, %rd5, 5;
+ .loc 1 223 9
+ setp.lt.s64 %p28, %rd29, %rd138;
+ @%p28 bra $L__BB3_60;
+ bra.uni $L__BB3_59;
+
+$L__BB3_60:
+ .loc 1 0 9
+ or.b64 %rd241, %rd29, %rd8;
+ and.b64 %rd242, %rd241, -4294967296;
+ setp.eq.s64 %p29, %rd242, 0;
+ @%p29 bra $L__BB3_62;
+
+ div.s64 %rd637, %rd29, %rd8;
+ bra.uni $L__BB3_63;
+
+$L__BB3_59:
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs55, %f3;}
+
+ // end inline asm
+ .loc 1 229 26
+ st.global.u16 [%rd13+8], %rs55;
+ bra.uni $L__BB3_64;
+
+$L__BB3_62:
+ .loc 1 0 26
+ cvt.u32.u64 %r133, %rd8;
+ cvt.u32.u64 %r134, %rd29;
+ div.u32 %r135, %r134, %r133;
+ cvt.u64.u32 %rd637, %r135;
+
+$L__BB3_63:
+ .loc 1 225 32
+ add.s64 %rd243, %rd3, %rd637;
+ ld.global.nc.u8 %rs60, [%rd243];
+ cvt.u32.u16 %r137, %rs60;
+ and.b32 %r138, %r137, 255;
+ mul.wide.u32 %rd244, %r138, 4;
+ add.s64 %rd245, %rd2, %rd244;
+ shr.s64 %rd246, %rd637, 63;
+ shr.u64 %rd247, %rd246, 56;
+ add.s64 %rd248, %rd637, %rd247;
+ shr.s64 %rd249, %rd248, 8;
+ shl.b64 %rd250, %rd249, 2;
+ add.s64 %rd251, %rd1, %rd250;
+ ld.global.nc.f32 %f47, [%rd251];
+ ld.global.nc.f32 %f48, [%rd245];
+ mul.f32 %f49, %f48, %f47;
+ .loc 1 216 24
+ and.b16 %rs61, %rs6, 240;
+ shr.u16 %rs62, %rs61, 4;
+ .loc 1 226 28
+ cvt.u32.u16 %r139, %rs62;
+ mul.wide.u32 %rd252, %r139, 4;
+ add.s64 %rd254, %rd179, %rd252;
+ ld.const.f32 %f50, [%rd254];
+ fma.rn.f32 %f46, %f50, %f49, %f17;
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs56, %f3;}
+
+ // end inline asm
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs57, %f46;}
+
+ // end inline asm
+ .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29
+ // begin inline asm
+ { mov.b32 %r136, {%rs56,%rs57};}
+
+ // end inline asm
+ .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13
+ st.global.u32 [%rd13+8], %r136;
+
+$L__BB3_64:
+ .loc 1 212 29
+ add.s64 %rd33, %rd5, 6;
+ .loc 1 213 9
+ setp.ge.s64 %p30, %rd33, %rd138;
+ @%p30 bra $L__BB3_194;
+
+ .loc 1 0 9
+ or.b64 %rd255, %rd33, %rd8;
+ and.b64 %rd256, %rd255, -4294967296;
+ setp.eq.s64 %p31, %rd256, 0;
+ @%p31 bra $L__BB3_67;
+
+ div.s64 %rd638, %rd33, %rd8;
+ bra.uni $L__BB3_68;
+
+$L__BB3_67:
+ cvt.u32.u64 %r140, %rd8;
+ cvt.u32.u64 %r141, %rd33;
+ div.u32 %r142, %r141, %r140;
+ cvt.u64.u32 %rd638, %r142;
+
+$L__BB3_68:
+ .loc 1 219 28
+ add.s64 %rd257, %rd3, %rd638;
+ ld.global.nc.u8 %rs63, [%rd257];
+ cvt.u32.u16 %r143, %rs63;
+ and.b32 %r144, %r143, 255;
+ mul.wide.u32 %rd258, %r144, 4;
+ add.s64 %rd259, %rd2, %rd258;
+ shr.s64 %rd260, %rd638, 63;
+ shr.u64 %rd261, %rd260, 56;
+ add.s64 %rd262, %rd638, %rd261;
+ shr.s64 %rd263, %rd262, 8;
+ shl.b64 %rd264, %rd263, 2;
+ add.s64 %rd265, %rd1, %rd264;
+ ld.global.nc.f32 %f51, [%rd265];
+ ld.global.nc.f32 %f52, [%rd259];
+ mul.f32 %f53, %f52, %f51;
+ .loc 1 220 24
+ shl.b16 %rs64, %rs7, 2;
+ cvt.u64.u16 %rd266, %rs64;
+ and.b64 %rd267, %rd266, 60;
+ add.s64 %rd269, %rd179, %rd267;
+ ld.const.f32 %f54, [%rd269];
+ fma.rn.f32 %f4, %f54, %f53, %f17;
+ .loc 1 222 29
+ add.s64 %rd37, %rd5, 7;
+ .loc 1 223 9
+ setp.lt.s64 %p32, %rd37, %rd138;
+ @%p32 bra $L__BB3_70;
+ bra.uni $L__BB3_69;
+
+$L__BB3_70:
+ .loc 1 0 9
+ or.b64 %rd270, %rd37, %rd8;
+ and.b64 %rd271, %rd270, -4294967296;
+ setp.eq.s64 %p33, %rd271, 0;
+ @%p33 bra $L__BB3_72;
+
+ div.s64 %rd639, %rd37, %rd8;
+ bra.uni $L__BB3_73;
+
+$L__BB3_69:
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs65, %f4;}
+
+ // end inline asm
+ .loc 1 229 26
+ st.global.u16 [%rd13+12], %rs65;
+ bra.uni $L__BB3_74;
+
+$L__BB3_72:
+ .loc 1 0 26
+ cvt.u32.u64 %r145, %rd8;
+ cvt.u32.u64 %r146, %rd37;
+ div.u32 %r147, %r146, %r145;
+ cvt.u64.u32 %rd639, %r147;
+
+$L__BB3_73:
+ .loc 1 225 32
+ add.s64 %rd272, %rd3, %rd639;
+ ld.global.nc.u8 %rs70, [%rd272];
+ cvt.u32.u16 %r149, %rs70;
+ and.b32 %r150, %r149, 255;
+ mul.wide.u32 %rd273, %r150, 4;
+ add.s64 %rd274, %rd2, %rd273;
+ shr.s64 %rd275, %rd639, 63;
+ shr.u64 %rd276, %rd275, 56;
+ add.s64 %rd277, %rd639, %rd276;
+ shr.s64 %rd278, %rd277, 8;
+ shl.b64 %rd279, %rd278, 2;
+ add.s64 %rd280, %rd1, %rd279;
+ ld.global.nc.f32 %f58, [%rd280];
+ ld.global.nc.f32 %f59, [%rd274];
+ mul.f32 %f60, %f59, %f58;
+ .loc 1 216 24
+ shr.u16 %rs71, %rs7, 4;
+ .loc 1 226 28
+ cvt.u32.u16 %r151, %rs71;
+ mul.wide.u32 %rd281, %r151, 4;
+ add.s64 %rd283, %rd179, %rd281;
+ ld.const.f32 %f61, [%rd283];
+ fma.rn.f32 %f57, %f61, %f60, %f17;
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs66, %f4;}
+
+ // end inline asm
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs67, %f57;}
+
+ // end inline asm
+ .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29
+ // begin inline asm
+ { mov.b32 %r148, {%rs66,%rs67};}
+
+ // end inline asm
+ .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13
+ st.global.u32 [%rd13+12], %r148;
+
+$L__BB3_74:
+ .loc 1 212 29
+ add.s64 %rd41, %rd5, 8;
+ .loc 1 213 9
+ setp.ge.s64 %p34, %rd41, %rd138;
+ @%p34 bra $L__BB3_194;
+
+ .loc 1 0 9
+ or.b64 %rd284, %rd41, %rd8;
+ and.b64 %rd285, %rd284, -4294967296;
+ setp.eq.s64 %p35, %rd285, 0;
+ @%p35 bra $L__BB3_77;
+
+ div.s64 %rd640, %rd41, %rd8;
+ bra.uni $L__BB3_78;
+
+$L__BB3_77:
+ cvt.u32.u64 %r152, %rd8;
+ cvt.u32.u64 %r153, %rd41;
+ div.u32 %r154, %r153, %r152;
+ cvt.u64.u32 %rd640, %r154;
+
+$L__BB3_78:
+ .loc 1 219 28
+ add.s64 %rd286, %rd3, %rd640;
+ ld.global.nc.u8 %rs72, [%rd286];
+ cvt.u32.u16 %r155, %rs72;
+ and.b32 %r156, %r155, 255;
+ mul.wide.u32 %rd287, %r156, 4;
+ add.s64 %rd288, %rd2, %rd287;
+ shr.s64 %rd289, %rd640, 63;
+ shr.u64 %rd290, %rd289, 56;
+ add.s64 %rd291, %rd640, %rd290;
+ shr.s64 %rd292, %rd291, 8;
+ shl.b64 %rd293, %rd292, 2;
+ add.s64 %rd294, %rd1, %rd293;
+ ld.global.nc.f32 %f62, [%rd294];
+ ld.global.nc.f32 %f63, [%rd288];
+ mul.f32 %f64, %f63, %f62;
+ .loc 1 220 24
+ shl.b16 %rs73, %rs8, 2;
+ cvt.u64.u16 %rd295, %rs73;
+ and.b64 %rd296, %rd295, 60;
+ add.s64 %rd298, %rd179, %rd296;
+ ld.const.f32 %f65, [%rd298];
+ fma.rn.f32 %f5, %f65, %f64, %f17;
+ .loc 1 222 29
+ add.s64 %rd45, %rd5, 9;
+ .loc 1 223 9
+ setp.lt.s64 %p36, %rd45, %rd138;
+ @%p36 bra $L__BB3_80;
+ bra.uni $L__BB3_79;
+
+$L__BB3_80:
+ .loc 1 0 9
+ or.b64 %rd299, %rd45, %rd8;
+ and.b64 %rd300, %rd299, -4294967296;
+ setp.eq.s64 %p37, %rd300, 0;
+ @%p37 bra $L__BB3_82;
+
+ div.s64 %rd641, %rd45, %rd8;
+ bra.uni $L__BB3_83;
+
+$L__BB3_79:
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs74, %f5;}
+
+ // end inline asm
+ .loc 1 229 26
+ st.global.u16 [%rd13+16], %rs74;
+ bra.uni $L__BB3_84;
+
+$L__BB3_82:
+ .loc 1 0 26
+ cvt.u32.u64 %r157, %rd8;
+ cvt.u32.u64 %r158, %rd45;
+ div.u32 %r159, %r158, %r157;
+ cvt.u64.u32 %rd641, %r159;
+
+$L__BB3_83:
+ .loc 1 225 32
+ add.s64 %rd301, %rd3, %rd641;
+ ld.global.nc.u8 %rs79, [%rd301];
+ cvt.u32.u16 %r161, %rs79;
+ and.b32 %r162, %r161, 255;
+ mul.wide.u32 %rd302, %r162, 4;
+ add.s64 %rd303, %rd2, %rd302;
+ shr.s64 %rd304, %rd641, 63;
+ shr.u64 %rd305, %rd304, 56;
+ add.s64 %rd306, %rd641, %rd305;
+ shr.s64 %rd307, %rd306, 8;
+ shl.b64 %rd308, %rd307, 2;
+ add.s64 %rd309, %rd1, %rd308;
+ ld.global.nc.f32 %f69, [%rd309];
+ ld.global.nc.f32 %f70, [%rd303];
+ mul.f32 %f71, %f70, %f69;
+ .loc 1 216 24
+ and.b16 %rs80, %rs8, 240;
+ shr.u16 %rs81, %rs80, 4;
+ .loc 1 226 28
+ cvt.u32.u16 %r163, %rs81;
+ mul.wide.u32 %rd310, %r163, 4;
+ add.s64 %rd312, %rd179, %rd310;
+ ld.const.f32 %f72, [%rd312];
+ fma.rn.f32 %f68, %f72, %f71, %f17;
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs75, %f5;}
+
+ // end inline asm
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs76, %f68;}
+
+ // end inline asm
+ .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29
+ // begin inline asm
+ { mov.b32 %r160, {%rs75,%rs76};}
+
+ // end inline asm
+ .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13
+ st.global.u32 [%rd13+16], %r160;
+
+$L__BB3_84:
+ .loc 1 212 29
+ add.s64 %rd49, %rd5, 10;
+ .loc 1 213 9
+ setp.ge.s64 %p38, %rd49, %rd138;
+ @%p38 bra $L__BB3_194;
+
+ .loc 1 0 9
+ or.b64 %rd313, %rd49, %rd8;
+ and.b64 %rd314, %rd313, -4294967296;
+ setp.eq.s64 %p39, %rd314, 0;
+ @%p39 bra $L__BB3_87;
+
+ div.s64 %rd642, %rd49, %rd8;
+ bra.uni $L__BB3_88;
+
+$L__BB3_87:
+ cvt.u32.u64 %r164, %rd8;
+ cvt.u32.u64 %r165, %rd49;
+ div.u32 %r166, %r165, %r164;
+ cvt.u64.u32 %rd642, %r166;
+
+$L__BB3_88:
+ .loc 1 219 28
+ add.s64 %rd315, %rd3, %rd642;
+ ld.global.nc.u8 %rs82, [%rd315];
+ cvt.u32.u16 %r167, %rs82;
+ and.b32 %r168, %r167, 255;
+ mul.wide.u32 %rd316, %r168, 4;
+ add.s64 %rd317, %rd2, %rd316;
+ shr.s64 %rd318, %rd642, 63;
+ shr.u64 %rd319, %rd318, 56;
+ add.s64 %rd320, %rd642, %rd319;
+ shr.s64 %rd321, %rd320, 8;
+ shl.b64 %rd322, %rd321, 2;
+ add.s64 %rd323, %rd1, %rd322;
+ ld.global.nc.f32 %f73, [%rd323];
+ ld.global.nc.f32 %f74, [%rd317];
+ mul.f32 %f75, %f74, %f73;
+ .loc 1 220 24
+ shl.b16 %rs83, %rs9, 2;
+ cvt.u64.u16 %rd324, %rs83;
+ and.b64 %rd325, %rd324, 60;
+ add.s64 %rd327, %rd179, %rd325;
+ ld.const.f32 %f76, [%rd327];
+ fma.rn.f32 %f6, %f76, %f75, %f17;
+ .loc 1 222 29
+ add.s64 %rd53, %rd5, 11;
+ .loc 1 223 9
+ setp.lt.s64 %p40, %rd53, %rd138;
+ @%p40 bra $L__BB3_90;
+ bra.uni $L__BB3_89;
+
+$L__BB3_90:
+ .loc 1 0 9
+ or.b64 %rd328, %rd53, %rd8;
+ and.b64 %rd329, %rd328, -4294967296;
+ setp.eq.s64 %p41, %rd329, 0;
+ @%p41 bra $L__BB3_92;
+
+ div.s64 %rd643, %rd53, %rd8;
+ bra.uni $L__BB3_93;
+
+$L__BB3_89:
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs84, %f6;}
+
+ // end inline asm
+ .loc 1 229 26
+ st.global.u16 [%rd13+20], %rs84;
+ bra.uni $L__BB3_94;
+
+$L__BB3_92:
+ .loc 1 0 26
+ cvt.u32.u64 %r169, %rd8;
+ cvt.u32.u64 %r170, %rd53;
+ div.u32 %r171, %r170, %r169;
+ cvt.u64.u32 %rd643, %r171;
+
+$L__BB3_93:
+ .loc 1 225 32
+ add.s64 %rd330, %rd3, %rd643;
+ ld.global.nc.u8 %rs89, [%rd330];
+ cvt.u32.u16 %r173, %rs89;
+ and.b32 %r174, %r173, 255;
+ mul.wide.u32 %rd331, %r174, 4;
+ add.s64 %rd332, %rd2, %rd331;
+ shr.s64 %rd333, %rd643, 63;
+ shr.u64 %rd334, %rd333, 56;
+ add.s64 %rd335, %rd643, %rd334;
+ shr.s64 %rd336, %rd335, 8;
+ shl.b64 %rd337, %rd336, 2;
+ add.s64 %rd338, %rd1, %rd337;
+ ld.global.nc.f32 %f80, [%rd338];
+ ld.global.nc.f32 %f81, [%rd332];
+ mul.f32 %f82, %f81, %f80;
+ .loc 1 216 24
+ and.b16 %rs90, %rs9, 240;
+ shr.u16 %rs91, %rs90, 4;
+ .loc 1 226 28
+ cvt.u32.u16 %r175, %rs91;
+ mul.wide.u32 %rd339, %r175, 4;
+ add.s64 %rd341, %rd179, %rd339;
+ ld.const.f32 %f83, [%rd341];
+ fma.rn.f32 %f79, %f83, %f82, %f17;
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs85, %f6;}
+
+ // end inline asm
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs86, %f79;}
+
+ // end inline asm
+ .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29
+ // begin inline asm
+ { mov.b32 %r172, {%rs85,%rs86};}
+
+ // end inline asm
+ .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13
+ st.global.u32 [%rd13+20], %r172;
+
+$L__BB3_94:
+ .loc 1 212 29
+ add.s64 %rd57, %rd5, 12;
+ .loc 1 213 9
+ setp.ge.s64 %p42, %rd57, %rd138;
+ @%p42 bra $L__BB3_194;
+
+ .loc 1 0 9
+ or.b64 %rd342, %rd57, %rd8;
+ and.b64 %rd343, %rd342, -4294967296;
+ setp.eq.s64 %p43, %rd343, 0;
+ @%p43 bra $L__BB3_97;
+
+ div.s64 %rd644, %rd57, %rd8;
+ bra.uni $L__BB3_98;
+
+$L__BB3_97:
+ cvt.u32.u64 %r176, %rd8;
+ cvt.u32.u64 %r177, %rd57;
+ div.u32 %r178, %r177, %r176;
+ cvt.u64.u32 %rd644, %r178;
+
+$L__BB3_98:
+ .loc 1 219 28
+ add.s64 %rd344, %rd3, %rd644;
+ ld.global.nc.u8 %rs92, [%rd344];
+ cvt.u32.u16 %r179, %rs92;
+ and.b32 %r180, %r179, 255;
+ mul.wide.u32 %rd345, %r180, 4;
+ add.s64 %rd346, %rd2, %rd345;
+ shr.s64 %rd347, %rd644, 63;
+ shr.u64 %rd348, %rd347, 56;
+ add.s64 %rd349, %rd644, %rd348;
+ shr.s64 %rd350, %rd349, 8;
+ shl.b64 %rd351, %rd350, 2;
+ add.s64 %rd352, %rd1, %rd351;
+ ld.global.nc.f32 %f84, [%rd352];
+ ld.global.nc.f32 %f85, [%rd346];
+ mul.f32 %f86, %f85, %f84;
+ .loc 1 220 24
+ shl.b16 %rs93, %rs10, 2;
+ cvt.u64.u16 %rd353, %rs93;
+ and.b64 %rd354, %rd353, 60;
+ add.s64 %rd356, %rd179, %rd354;
+ ld.const.f32 %f87, [%rd356];
+ fma.rn.f32 %f7, %f87, %f86, %f17;
+ .loc 1 222 29
+ add.s64 %rd61, %rd5, 13;
+ .loc 1 223 9
+ setp.lt.s64 %p44, %rd61, %rd138;
+ @%p44 bra $L__BB3_100;
+ bra.uni $L__BB3_99;
+
+$L__BB3_100:
+ .loc 1 0 9
+ or.b64 %rd357, %rd61, %rd8;
+ and.b64 %rd358, %rd357, -4294967296;
+ setp.eq.s64 %p45, %rd358, 0;
+ @%p45 bra $L__BB3_102;
+
+ div.s64 %rd645, %rd61, %rd8;
+ bra.uni $L__BB3_103;
+
+$L__BB3_99:
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs94, %f7;}
+
+ // end inline asm
+ .loc 1 229 26
+ st.global.u16 [%rd13+24], %rs94;
+ bra.uni $L__BB3_104;
+
+$L__BB3_102:
+ .loc 1 0 26
+ cvt.u32.u64 %r181, %rd8;
+ cvt.u32.u64 %r182, %rd61;
+ div.u32 %r183, %r182, %r181;
+ cvt.u64.u32 %rd645, %r183;
+
+$L__BB3_103:
+ .loc 1 225 32
+ add.s64 %rd359, %rd3, %rd645;
+ ld.global.nc.u8 %rs99, [%rd359];
+ cvt.u32.u16 %r185, %rs99;
+ and.b32 %r186, %r185, 255;
+ mul.wide.u32 %rd360, %r186, 4;
+ add.s64 %rd361, %rd2, %rd360;
+ shr.s64 %rd362, %rd645, 63;
+ shr.u64 %rd363, %rd362, 56;
+ add.s64 %rd364, %rd645, %rd363;
+ shr.s64 %rd365, %rd364, 8;
+ shl.b64 %rd366, %rd365, 2;
+ add.s64 %rd367, %rd1, %rd366;
+ ld.global.nc.f32 %f91, [%rd367];
+ ld.global.nc.f32 %f92, [%rd361];
+ mul.f32 %f93, %f92, %f91;
+ .loc 1 216 24
+ and.b16 %rs100, %rs10, 240;
+ shr.u16 %rs101, %rs100, 4;
+ .loc 1 226 28
+ cvt.u32.u16 %r187, %rs101;
+ mul.wide.u32 %rd368, %r187, 4;
+ add.s64 %rd370, %rd179, %rd368;
+ ld.const.f32 %f94, [%rd370];
+ fma.rn.f32 %f90, %f94, %f93, %f17;
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs95, %f7;}
+
+ // end inline asm
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs96, %f90;}
+
+ // end inline asm
+ .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29
+ // begin inline asm
+ { mov.b32 %r184, {%rs95,%rs96};}
+
+ // end inline asm
+ .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13
+ st.global.u32 [%rd13+24], %r184;
+
+$L__BB3_104:
+ .loc 1 212 29
+ add.s64 %rd65, %rd5, 14;
+ .loc 1 213 9
+ setp.ge.s64 %p46, %rd65, %rd138;
+ @%p46 bra $L__BB3_194;
+
+ .loc 1 0 9
+ or.b64 %rd371, %rd65, %rd8;
+ and.b64 %rd372, %rd371, -4294967296;
+ setp.eq.s64 %p47, %rd372, 0;
+ @%p47 bra $L__BB3_107;
+
+ div.s64 %rd646, %rd65, %rd8;
+ bra.uni $L__BB3_108;
+
+$L__BB3_107:
+ cvt.u32.u64 %r188, %rd8;
+ cvt.u32.u64 %r189, %rd65;
+ div.u32 %r190, %r189, %r188;
+ cvt.u64.u32 %rd646, %r190;
+
+$L__BB3_108:
+ .loc 1 219 28
+ add.s64 %rd373, %rd3, %rd646;
+ ld.global.nc.u8 %rs102, [%rd373];
+ cvt.u32.u16 %r191, %rs102;
+ and.b32 %r192, %r191, 255;
+ mul.wide.u32 %rd374, %r192, 4;
+ add.s64 %rd375, %rd2, %rd374;
+ shr.s64 %rd376, %rd646, 63;
+ shr.u64 %rd377, %rd376, 56;
+ add.s64 %rd378, %rd646, %rd377;
+ shr.s64 %rd379, %rd378, 8;
+ shl.b64 %rd380, %rd379, 2;
+ add.s64 %rd381, %rd1, %rd380;
+ ld.global.nc.f32 %f95, [%rd381];
+ ld.global.nc.f32 %f96, [%rd375];
+ mul.f32 %f97, %f96, %f95;
+ .loc 1 220 24
+ shl.b16 %rs103, %rs11, 2;
+ cvt.u64.u16 %rd382, %rs103;
+ and.b64 %rd383, %rd382, 60;
+ add.s64 %rd385, %rd179, %rd383;
+ ld.const.f32 %f98, [%rd385];
+ fma.rn.f32 %f8, %f98, %f97, %f17;
+ .loc 1 222 29
+ add.s64 %rd69, %rd5, 15;
+ .loc 1 223 9
+ setp.lt.s64 %p48, %rd69, %rd138;
+ @%p48 bra $L__BB3_110;
+ bra.uni $L__BB3_109;
+
+$L__BB3_110:
+ .loc 1 0 9
+ or.b64 %rd386, %rd69, %rd8;
+ and.b64 %rd387, %rd386, -4294967296;
+ setp.eq.s64 %p49, %rd387, 0;
+ @%p49 bra $L__BB3_112;
+
+ div.s64 %rd647, %rd69, %rd8;
+ bra.uni $L__BB3_113;
+
+$L__BB3_109:
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs104, %f8;}
+
+ // end inline asm
+ .loc 1 229 26
+ st.global.u16 [%rd13+28], %rs104;
+ bra.uni $L__BB3_114;
+
+$L__BB3_112:
+ .loc 1 0 26
+ cvt.u32.u64 %r193, %rd8;
+ cvt.u32.u64 %r194, %rd69;
+ div.u32 %r195, %r194, %r193;
+ cvt.u64.u32 %rd647, %r195;
+
+$L__BB3_113:
+ .loc 1 225 32
+ add.s64 %rd388, %rd3, %rd647;
+ ld.global.nc.u8 %rs109, [%rd388];
+ cvt.u32.u16 %r197, %rs109;
+ and.b32 %r198, %r197, 255;
+ mul.wide.u32 %rd389, %r198, 4;
+ add.s64 %rd390, %rd2, %rd389;
+ shr.s64 %rd391, %rd647, 63;
+ shr.u64 %rd392, %rd391, 56;
+ add.s64 %rd393, %rd647, %rd392;
+ shr.s64 %rd394, %rd393, 8;
+ shl.b64 %rd395, %rd394, 2;
+ add.s64 %rd396, %rd1, %rd395;
+ ld.global.nc.f32 %f102, [%rd396];
+ ld.global.nc.f32 %f103, [%rd390];
+ mul.f32 %f104, %f103, %f102;
+ .loc 1 216 24
+ shr.u16 %rs110, %rs11, 4;
+ .loc 1 226 28
+ cvt.u32.u16 %r199, %rs110;
+ mul.wide.u32 %rd397, %r199, 4;
+ add.s64 %rd399, %rd179, %rd397;
+ ld.const.f32 %f105, [%rd399];
+ fma.rn.f32 %f101, %f105, %f104, %f17;
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs105, %f8;}
+
+ // end inline asm
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs106, %f101;}
+
+ // end inline asm
+ .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29
+ // begin inline asm
+ { mov.b32 %r196, {%rs105,%rs106};}
+
+ // end inline asm
+ .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13
+ st.global.u32 [%rd13+28], %r196;
+
+$L__BB3_114:
+ .loc 1 212 29
+ add.s64 %rd73, %rd5, 16;
+ .loc 1 213 9
+ setp.ge.s64 %p50, %rd73, %rd138;
+ @%p50 bra $L__BB3_194;
+
+ .loc 1 0 9
+ or.b64 %rd400, %rd73, %rd8;
+ and.b64 %rd401, %rd400, -4294967296;
+ setp.eq.s64 %p51, %rd401, 0;
+ @%p51 bra $L__BB3_117;
+
+ div.s64 %rd648, %rd73, %rd8;
+ bra.uni $L__BB3_118;
+
+$L__BB3_117:
+ cvt.u32.u64 %r200, %rd8;
+ cvt.u32.u64 %r201, %rd73;
+ div.u32 %r202, %r201, %r200;
+ cvt.u64.u32 %rd648, %r202;
+
+$L__BB3_118:
+ .loc 1 219 28
+ add.s64 %rd402, %rd3, %rd648;
+ ld.global.nc.u8 %rs111, [%rd402];
+ cvt.u32.u16 %r203, %rs111;
+ and.b32 %r204, %r203, 255;
+ mul.wide.u32 %rd403, %r204, 4;
+ add.s64 %rd404, %rd2, %rd403;
+ shr.s64 %rd405, %rd648, 63;
+ shr.u64 %rd406, %rd405, 56;
+ add.s64 %rd407, %rd648, %rd406;
+ shr.s64 %rd408, %rd407, 8;
+ shl.b64 %rd409, %rd408, 2;
+ add.s64 %rd410, %rd1, %rd409;
+ ld.global.nc.f32 %f106, [%rd410];
+ ld.global.nc.f32 %f107, [%rd404];
+ mul.f32 %f108, %f107, %f106;
+ .loc 1 220 24
+ shl.b16 %rs112, %rs12, 2;
+ cvt.u64.u16 %rd411, %rs112;
+ and.b64 %rd412, %rd411, 60;
+ add.s64 %rd414, %rd179, %rd412;
+ ld.const.f32 %f109, [%rd414];
+ fma.rn.f32 %f9, %f109, %f108, %f17;
+ .loc 1 222 29
+ add.s64 %rd77, %rd5, 17;
+ .loc 1 223 9
+ setp.lt.s64 %p52, %rd77, %rd138;
+ @%p52 bra $L__BB3_120;
+ bra.uni $L__BB3_119;
+
+$L__BB3_120:
+ .loc 1 0 9
+ or.b64 %rd415, %rd77, %rd8;
+ and.b64 %rd416, %rd415, -4294967296;
+ setp.eq.s64 %p53, %rd416, 0;
+ @%p53 bra $L__BB3_122;
+
+ div.s64 %rd649, %rd77, %rd8;
+ bra.uni $L__BB3_123;
+
+$L__BB3_119:
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs113, %f9;}
+
+ // end inline asm
+ .loc 1 229 26
+ st.global.u16 [%rd13+32], %rs113;
+ bra.uni $L__BB3_124;
+
+$L__BB3_122:
+ .loc 1 0 26
+ cvt.u32.u64 %r205, %rd8;
+ cvt.u32.u64 %r206, %rd77;
+ div.u32 %r207, %r206, %r205;
+ cvt.u64.u32 %rd649, %r207;
+
+$L__BB3_123:
+ .loc 1 225 32
+ add.s64 %rd417, %rd3, %rd649;
+ ld.global.nc.u8 %rs118, [%rd417];
+ cvt.u32.u16 %r209, %rs118;
+ and.b32 %r210, %r209, 255;
+ mul.wide.u32 %rd418, %r210, 4;
+ add.s64 %rd419, %rd2, %rd418;
+ shr.s64 %rd420, %rd649, 63;
+ shr.u64 %rd421, %rd420, 56;
+ add.s64 %rd422, %rd649, %rd421;
+ shr.s64 %rd423, %rd422, 8;
+ shl.b64 %rd424, %rd423, 2;
+ add.s64 %rd425, %rd1, %rd424;
+ ld.global.nc.f32 %f113, [%rd425];
+ ld.global.nc.f32 %f114, [%rd419];
+ mul.f32 %f115, %f114, %f113;
+ .loc 1 216 24
+ and.b16 %rs119, %rs12, 240;
+ shr.u16 %rs120, %rs119, 4;
+ .loc 1 226 28
+ cvt.u32.u16 %r211, %rs120;
+ mul.wide.u32 %rd426, %r211, 4;
+ add.s64 %rd428, %rd179, %rd426;
+ ld.const.f32 %f116, [%rd428];
+ fma.rn.f32 %f112, %f116, %f115, %f17;
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs114, %f9;}
+
+ // end inline asm
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs115, %f112;}
+
+ // end inline asm
+ .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29
+ // begin inline asm
+ { mov.b32 %r208, {%rs114,%rs115};}
+
+ // end inline asm
+ .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13
+ st.global.u32 [%rd13+32], %r208;
+
+$L__BB3_124:
+ .loc 1 212 29
+ add.s64 %rd81, %rd5, 18;
+ .loc 1 213 9
+ setp.ge.s64 %p54, %rd81, %rd138;
+ @%p54 bra $L__BB3_194;
+
+ .loc 1 0 9
+ or.b64 %rd429, %rd81, %rd8;
+ and.b64 %rd430, %rd429, -4294967296;
+ setp.eq.s64 %p55, %rd430, 0;
+ @%p55 bra $L__BB3_127;
+
+ div.s64 %rd650, %rd81, %rd8;
+ bra.uni $L__BB3_128;
+
+$L__BB3_127:
+ cvt.u32.u64 %r212, %rd8;
+ cvt.u32.u64 %r213, %rd81;
+ div.u32 %r214, %r213, %r212;
+ cvt.u64.u32 %rd650, %r214;
+
+$L__BB3_128:
+ .loc 1 219 28
+ add.s64 %rd431, %rd3, %rd650;
+ ld.global.nc.u8 %rs121, [%rd431];
+ cvt.u32.u16 %r215, %rs121;
+ and.b32 %r216, %r215, 255;
+ mul.wide.u32 %rd432, %r216, 4;
+ add.s64 %rd433, %rd2, %rd432;
+ shr.s64 %rd434, %rd650, 63;
+ shr.u64 %rd435, %rd434, 56;
+ add.s64 %rd436, %rd650, %rd435;
+ shr.s64 %rd437, %rd436, 8;
+ shl.b64 %rd438, %rd437, 2;
+ add.s64 %rd439, %rd1, %rd438;
+ ld.global.nc.f32 %f117, [%rd439];
+ ld.global.nc.f32 %f118, [%rd433];
+ mul.f32 %f119, %f118, %f117;
+ .loc 1 220 24
+ shl.b16 %rs122, %rs13, 2;
+ cvt.u64.u16 %rd440, %rs122;
+ and.b64 %rd441, %rd440, 60;
+ add.s64 %rd443, %rd179, %rd441;
+ ld.const.f32 %f120, [%rd443];
+ fma.rn.f32 %f10, %f120, %f119, %f17;
+ .loc 1 222 29
+ add.s64 %rd85, %rd5, 19;
+ .loc 1 223 9
+ setp.lt.s64 %p56, %rd85, %rd138;
+ @%p56 bra $L__BB3_130;
+ bra.uni $L__BB3_129;
+
+$L__BB3_130:
+ .loc 1 0 9
+ or.b64 %rd444, %rd85, %rd8;
+ and.b64 %rd445, %rd444, -4294967296;
+ setp.eq.s64 %p57, %rd445, 0;
+ @%p57 bra $L__BB3_132;
+
+ div.s64 %rd651, %rd85, %rd8;
+ bra.uni $L__BB3_133;
+
+$L__BB3_129:
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs123, %f10;}
+
+ // end inline asm
+ .loc 1 229 26
+ st.global.u16 [%rd13+36], %rs123;
+ bra.uni $L__BB3_134;
+
+$L__BB3_132:
+ .loc 1 0 26
+ cvt.u32.u64 %r217, %rd8;
+ cvt.u32.u64 %r218, %rd85;
+ div.u32 %r219, %r218, %r217;
+ cvt.u64.u32 %rd651, %r219;
+
+$L__BB3_133:
+ .loc 1 225 32
+ add.s64 %rd446, %rd3, %rd651;
+ ld.global.nc.u8 %rs128, [%rd446];
+ cvt.u32.u16 %r221, %rs128;
+ and.b32 %r222, %r221, 255;
+ mul.wide.u32 %rd447, %r222, 4;
+ add.s64 %rd448, %rd2, %rd447;
+ shr.s64 %rd449, %rd651, 63;
+ shr.u64 %rd450, %rd449, 56;
+ add.s64 %rd451, %rd651, %rd450;
+ shr.s64 %rd452, %rd451, 8;
+ shl.b64 %rd453, %rd452, 2;
+ add.s64 %rd454, %rd1, %rd453;
+ ld.global.nc.f32 %f124, [%rd454];
+ ld.global.nc.f32 %f125, [%rd448];
+ mul.f32 %f126, %f125, %f124;
+ .loc 1 216 24
+ and.b16 %rs129, %rs13, 240;
+ shr.u16 %rs130, %rs129, 4;
+ .loc 1 226 28
+ cvt.u32.u16 %r223, %rs130;
+ mul.wide.u32 %rd455, %r223, 4;
+ add.s64 %rd457, %rd179, %rd455;
+ ld.const.f32 %f127, [%rd457];
+ fma.rn.f32 %f123, %f127, %f126, %f17;
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs124, %f10;}
+
+ // end inline asm
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs125, %f123;}
+
+ // end inline asm
+ .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29
+ // begin inline asm
+ { mov.b32 %r220, {%rs124,%rs125};}
+
+ // end inline asm
+ .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13
+ st.global.u32 [%rd13+36], %r220;
+
+$L__BB3_134:
+ .loc 1 212 29
+ add.s64 %rd89, %rd5, 20;
+ .loc 1 213 9
+ setp.ge.s64 %p58, %rd89, %rd138;
+ @%p58 bra $L__BB3_194;
+
+ .loc 1 0 9
+ or.b64 %rd458, %rd89, %rd8;
+ and.b64 %rd459, %rd458, -4294967296;
+ setp.eq.s64 %p59, %rd459, 0;
+ @%p59 bra $L__BB3_137;
+
+ div.s64 %rd652, %rd89, %rd8;
+ bra.uni $L__BB3_138;
+
+$L__BB3_137:
+ cvt.u32.u64 %r224, %rd8;
+ cvt.u32.u64 %r225, %rd89;
+ div.u32 %r226, %r225, %r224;
+ cvt.u64.u32 %rd652, %r226;
+
+$L__BB3_138:
+ .loc 1 219 28
+ add.s64 %rd460, %rd3, %rd652;
+ ld.global.nc.u8 %rs131, [%rd460];
+ cvt.u32.u16 %r227, %rs131;
+ and.b32 %r228, %r227, 255;
+ mul.wide.u32 %rd461, %r228, 4;
+ add.s64 %rd462, %rd2, %rd461;
+ shr.s64 %rd463, %rd652, 63;
+ shr.u64 %rd464, %rd463, 56;
+ add.s64 %rd465, %rd652, %rd464;
+ shr.s64 %rd466, %rd465, 8;
+ shl.b64 %rd467, %rd466, 2;
+ add.s64 %rd468, %rd1, %rd467;
+ ld.global.nc.f32 %f128, [%rd468];
+ ld.global.nc.f32 %f129, [%rd462];
+ mul.f32 %f130, %f129, %f128;
+ .loc 1 220 24
+ shl.b16 %rs132, %rs14, 2;
+ cvt.u64.u16 %rd469, %rs132;
+ and.b64 %rd470, %rd469, 60;
+ add.s64 %rd472, %rd179, %rd470;
+ ld.const.f32 %f131, [%rd472];
+ fma.rn.f32 %f11, %f131, %f130, %f17;
+ .loc 1 222 29
+ add.s64 %rd93, %rd5, 21;
+ .loc 1 223 9
+ setp.lt.s64 %p60, %rd93, %rd138;
+ @%p60 bra $L__BB3_140;
+ bra.uni $L__BB3_139;
+
+$L__BB3_140:
+ .loc 1 0 9
+ or.b64 %rd473, %rd93, %rd8;
+ and.b64 %rd474, %rd473, -4294967296;
+ setp.eq.s64 %p61, %rd474, 0;
+ @%p61 bra $L__BB3_142;
+
+ div.s64 %rd653, %rd93, %rd8;
+ bra.uni $L__BB3_143;
+
+$L__BB3_139:
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs133, %f11;}
+
+ // end inline asm
+ .loc 1 229 26
+ st.global.u16 [%rd13+40], %rs133;
+ bra.uni $L__BB3_144;
+
+$L__BB3_142:
+ .loc 1 0 26
+ cvt.u32.u64 %r229, %rd8;
+ cvt.u32.u64 %r230, %rd93;
+ div.u32 %r231, %r230, %r229;
+ cvt.u64.u32 %rd653, %r231;
+
+$L__BB3_143:
+ .loc 1 225 32
+ add.s64 %rd475, %rd3, %rd653;
+ ld.global.nc.u8 %rs138, [%rd475];
+ cvt.u32.u16 %r233, %rs138;
+ and.b32 %r234, %r233, 255;
+ mul.wide.u32 %rd476, %r234, 4;
+ add.s64 %rd477, %rd2, %rd476;
+ shr.s64 %rd478, %rd653, 63;
+ shr.u64 %rd479, %rd478, 56;
+ add.s64 %rd480, %rd653, %rd479;
+ shr.s64 %rd481, %rd480, 8;
+ shl.b64 %rd482, %rd481, 2;
+ add.s64 %rd483, %rd1, %rd482;
+ ld.global.nc.f32 %f135, [%rd483];
+ ld.global.nc.f32 %f136, [%rd477];
+ mul.f32 %f137, %f136, %f135;
+ .loc 1 216 24
+ and.b16 %rs139, %rs14, 240;
+ shr.u16 %rs140, %rs139, 4;
+ .loc 1 226 28
+ cvt.u32.u16 %r235, %rs140;
+ mul.wide.u32 %rd484, %r235, 4;
+ add.s64 %rd486, %rd179, %rd484;
+ ld.const.f32 %f138, [%rd486];
+ fma.rn.f32 %f134, %f138, %f137, %f17;
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs134, %f11;}
+
+ // end inline asm
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs135, %f134;}
+
+ // end inline asm
+ .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29
+ // begin inline asm
+ { mov.b32 %r232, {%rs134,%rs135};}
+
+ // end inline asm
+ .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13
+ st.global.u32 [%rd13+40], %r232;
+
+$L__BB3_144:
+ .loc 1 212 29
+ add.s64 %rd97, %rd5, 22;
+ .loc 1 213 9
+ setp.ge.s64 %p62, %rd97, %rd138;
+ @%p62 bra $L__BB3_194;
+
+ .loc 1 0 9
+ or.b64 %rd487, %rd97, %rd8;
+ and.b64 %rd488, %rd487, -4294967296;
+ setp.eq.s64 %p63, %rd488, 0;
+ @%p63 bra $L__BB3_147;
+
+ div.s64 %rd654, %rd97, %rd8;
+ bra.uni $L__BB3_148;
+
+$L__BB3_147:
+ cvt.u32.u64 %r236, %rd8;
+ cvt.u32.u64 %r237, %rd97;
+ div.u32 %r238, %r237, %r236;
+ cvt.u64.u32 %rd654, %r238;
+
+$L__BB3_148:
+ .loc 1 219 28
+ add.s64 %rd489, %rd3, %rd654;
+ ld.global.nc.u8 %rs141, [%rd489];
+ cvt.u32.u16 %r239, %rs141;
+ and.b32 %r240, %r239, 255;
+ mul.wide.u32 %rd490, %r240, 4;
+ add.s64 %rd491, %rd2, %rd490;
+ shr.s64 %rd492, %rd654, 63;
+ shr.u64 %rd493, %rd492, 56;
+ add.s64 %rd494, %rd654, %rd493;
+ shr.s64 %rd495, %rd494, 8;
+ shl.b64 %rd496, %rd495, 2;
+ add.s64 %rd497, %rd1, %rd496;
+ ld.global.nc.f32 %f139, [%rd497];
+ ld.global.nc.f32 %f140, [%rd491];
+ mul.f32 %f141, %f140, %f139;
+ .loc 1 220 24
+ shl.b16 %rs142, %rs15, 2;
+ cvt.u64.u16 %rd498, %rs142;
+ and.b64 %rd499, %rd498, 60;
+ add.s64 %rd501, %rd179, %rd499;
+ ld.const.f32 %f142, [%rd501];
+ fma.rn.f32 %f12, %f142, %f141, %f17;
+ .loc 1 222 29
+ add.s64 %rd101, %rd5, 23;
+ .loc 1 223 9
+ setp.lt.s64 %p64, %rd101, %rd138;
+ @%p64 bra $L__BB3_150;
+ bra.uni $L__BB3_149;
+
+$L__BB3_150:
+ .loc 1 0 9
+ or.b64 %rd502, %rd101, %rd8;
+ and.b64 %rd503, %rd502, -4294967296;
+ setp.eq.s64 %p65, %rd503, 0;
+ @%p65 bra $L__BB3_152;
+
+ div.s64 %rd655, %rd101, %rd8;
+ bra.uni $L__BB3_153;
+
+$L__BB3_149:
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs143, %f12;}
+
+ // end inline asm
+ .loc 1 229 26
+ st.global.u16 [%rd13+44], %rs143;
+ bra.uni $L__BB3_154;
+
+$L__BB3_152:
+ .loc 1 0 26
+ cvt.u32.u64 %r241, %rd8;
+ cvt.u32.u64 %r242, %rd101;
+ div.u32 %r243, %r242, %r241;
+ cvt.u64.u32 %rd655, %r243;
+
+$L__BB3_153:
+ .loc 1 225 32
+ add.s64 %rd504, %rd3, %rd655;
+ ld.global.nc.u8 %rs148, [%rd504];
+ cvt.u32.u16 %r245, %rs148;
+ and.b32 %r246, %r245, 255;
+ mul.wide.u32 %rd505, %r246, 4;
+ add.s64 %rd506, %rd2, %rd505;
+ shr.s64 %rd507, %rd655, 63;
+ shr.u64 %rd508, %rd507, 56;
+ add.s64 %rd509, %rd655, %rd508;
+ shr.s64 %rd510, %rd509, 8;
+ shl.b64 %rd511, %rd510, 2;
+ add.s64 %rd512, %rd1, %rd511;
+ ld.global.nc.f32 %f146, [%rd512];
+ ld.global.nc.f32 %f147, [%rd506];
+ mul.f32 %f148, %f147, %f146;
+ .loc 1 216 24
+ shr.u16 %rs149, %rs15, 4;
+ .loc 1 226 28
+ cvt.u32.u16 %r247, %rs149;
+ mul.wide.u32 %rd513, %r247, 4;
+ add.s64 %rd515, %rd179, %rd513;
+ ld.const.f32 %f149, [%rd515];
+ fma.rn.f32 %f145, %f149, %f148, %f17;
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs144, %f12;}
+
+ // end inline asm
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs145, %f145;}
+
+ // end inline asm
+ .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29
+ // begin inline asm
+ { mov.b32 %r244, {%rs144,%rs145};}
+
+ // end inline asm
+ .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13
+ st.global.u32 [%rd13+44], %r244;
+
+$L__BB3_154:
+ .loc 1 212 29
+ add.s64 %rd105, %rd5, 24;
+ .loc 1 213 9
+ setp.ge.s64 %p66, %rd105, %rd138;
+ @%p66 bra $L__BB3_194;
+
+ .loc 1 0 9
+ or.b64 %rd516, %rd105, %rd8;
+ and.b64 %rd517, %rd516, -4294967296;
+ setp.eq.s64 %p67, %rd517, 0;
+ @%p67 bra $L__BB3_157;
+
+ div.s64 %rd656, %rd105, %rd8;
+ bra.uni $L__BB3_158;
+
+$L__BB3_157:
+ cvt.u32.u64 %r248, %rd8;
+ cvt.u32.u64 %r249, %rd105;
+ div.u32 %r250, %r249, %r248;
+ cvt.u64.u32 %rd656, %r250;
+
+$L__BB3_158:
+ .loc 1 219 28
+ add.s64 %rd518, %rd3, %rd656;
+ ld.global.nc.u8 %rs150, [%rd518];
+ cvt.u32.u16 %r251, %rs150;
+ and.b32 %r252, %r251, 255;
+ mul.wide.u32 %rd519, %r252, 4;
+ add.s64 %rd520, %rd2, %rd519;
+ shr.s64 %rd521, %rd656, 63;
+ shr.u64 %rd522, %rd521, 56;
+ add.s64 %rd523, %rd656, %rd522;
+ shr.s64 %rd524, %rd523, 8;
+ shl.b64 %rd525, %rd524, 2;
+ add.s64 %rd526, %rd1, %rd525;
+ ld.global.nc.f32 %f150, [%rd526];
+ ld.global.nc.f32 %f151, [%rd520];
+ mul.f32 %f152, %f151, %f150;
+ .loc 1 220 24
+ shl.b16 %rs151, %rs4, 2;
+ cvt.u64.u16 %rd527, %rs151;
+ and.b64 %rd528, %rd527, 60;
+ add.s64 %rd530, %rd179, %rd528;
+ ld.const.f32 %f153, [%rd530];
+ fma.rn.f32 %f13, %f153, %f152, %f17;
+ .loc 1 222 29
+ add.s64 %rd109, %rd5, 25;
+ .loc 1 223 9
+ setp.lt.s64 %p68, %rd109, %rd138;
+ @%p68 bra $L__BB3_160;
+ bra.uni $L__BB3_159;
+
+$L__BB3_160:
+ .loc 1 0 9
+ or.b64 %rd531, %rd109, %rd8;
+ and.b64 %rd532, %rd531, -4294967296;
+ setp.eq.s64 %p69, %rd532, 0;
+ @%p69 bra $L__BB3_162;
+
+ div.s64 %rd657, %rd109, %rd8;
+ bra.uni $L__BB3_163;
+
+$L__BB3_159:
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs152, %f13;}
+
+ // end inline asm
+ .loc 1 229 26
+ st.global.u16 [%rd13+48], %rs152;
+ bra.uni $L__BB3_164;
+
+$L__BB3_162:
+ .loc 1 0 26
+ cvt.u32.u64 %r253, %rd8;
+ cvt.u32.u64 %r254, %rd109;
+ div.u32 %r255, %r254, %r253;
+ cvt.u64.u32 %rd657, %r255;
+
+$L__BB3_163:
+ .loc 1 225 32
+ add.s64 %rd533, %rd3, %rd657;
+ ld.global.nc.u8 %rs157, [%rd533];
+ cvt.u32.u16 %r257, %rs157;
+ and.b32 %r258, %r257, 255;
+ mul.wide.u32 %rd534, %r258, 4;
+ add.s64 %rd535, %rd2, %rd534;
+ shr.s64 %rd536, %rd657, 63;
+ shr.u64 %rd537, %rd536, 56;
+ add.s64 %rd538, %rd657, %rd537;
+ shr.s64 %rd539, %rd538, 8;
+ shl.b64 %rd540, %rd539, 2;
+ add.s64 %rd541, %rd1, %rd540;
+ ld.global.nc.f32 %f157, [%rd541];
+ ld.global.nc.f32 %f158, [%rd535];
+ mul.f32 %f159, %f158, %f157;
+ .loc 1 216 24
+ and.b16 %rs158, %rs4, 240;
+ shr.u16 %rs159, %rs158, 4;
+ .loc 1 226 28
+ cvt.u32.u16 %r259, %rs159;
+ mul.wide.u32 %rd542, %r259, 4;
+ add.s64 %rd544, %rd179, %rd542;
+ ld.const.f32 %f160, [%rd544];
+ fma.rn.f32 %f156, %f160, %f159, %f17;
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs153, %f13;}
+
+ // end inline asm
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs154, %f156;}
+
+ // end inline asm
+ .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29
+ // begin inline asm
+ { mov.b32 %r256, {%rs153,%rs154};}
+
+ // end inline asm
+ .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13
+ st.global.u32 [%rd13+48], %r256;
+
+$L__BB3_164:
+ .loc 1 212 29
+ add.s64 %rd113, %rd5, 26;
+ .loc 1 213 9
+ setp.ge.s64 %p70, %rd113, %rd138;
+ @%p70 bra $L__BB3_194;
+
+ .loc 1 0 9
+ or.b64 %rd545, %rd113, %rd8;
+ and.b64 %rd546, %rd545, -4294967296;
+ setp.eq.s64 %p71, %rd546, 0;
+ @%p71 bra $L__BB3_167;
+
+ div.s64 %rd658, %rd113, %rd8;
+ bra.uni $L__BB3_168;
+
+$L__BB3_167:
+ cvt.u32.u64 %r260, %rd8;
+ cvt.u32.u64 %r261, %rd113;
+ div.u32 %r262, %r261, %r260;
+ cvt.u64.u32 %rd658, %r262;
+
+$L__BB3_168:
+ .loc 1 219 28
+ add.s64 %rd547, %rd3, %rd658;
+ ld.global.nc.u8 %rs160, [%rd547];
+ cvt.u32.u16 %r263, %rs160;
+ and.b32 %r264, %r263, 255;
+ mul.wide.u32 %rd548, %r264, 4;
+ add.s64 %rd549, %rd2, %rd548;
+ shr.s64 %rd550, %rd658, 63;
+ shr.u64 %rd551, %rd550, 56;
+ add.s64 %rd552, %rd658, %rd551;
+ shr.s64 %rd553, %rd552, 8;
+ shl.b64 %rd554, %rd553, 2;
+ add.s64 %rd555, %rd1, %rd554;
+ ld.global.nc.f32 %f161, [%rd555];
+ ld.global.nc.f32 %f162, [%rd549];
+ mul.f32 %f163, %f162, %f161;
+ .loc 1 220 24
+ shl.b16 %rs161, %rs3, 2;
+ cvt.u64.u16 %rd556, %rs161;
+ and.b64 %rd557, %rd556, 60;
+ add.s64 %rd559, %rd179, %rd557;
+ ld.const.f32 %f164, [%rd559];
+ fma.rn.f32 %f14, %f164, %f163, %f17;
+ .loc 1 222 29
+ add.s64 %rd117, %rd5, 27;
+ .loc 1 223 9
+ setp.lt.s64 %p72, %rd117, %rd138;
+ @%p72 bra $L__BB3_170;
+ bra.uni $L__BB3_169;
+
+$L__BB3_170:
+ .loc 1 0 9
+ or.b64 %rd560, %rd117, %rd8;
+ and.b64 %rd561, %rd560, -4294967296;
+ setp.eq.s64 %p73, %rd561, 0;
+ @%p73 bra $L__BB3_172;
+
+ div.s64 %rd659, %rd117, %rd8;
+ bra.uni $L__BB3_173;
+
+$L__BB3_169:
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs162, %f14;}
+
+ // end inline asm
+ .loc 1 229 26
+ st.global.u16 [%rd13+52], %rs162;
+ bra.uni $L__BB3_174;
+
+$L__BB3_172:
+ .loc 1 0 26
+ cvt.u32.u64 %r265, %rd8;
+ cvt.u32.u64 %r266, %rd117;
+ div.u32 %r267, %r266, %r265;
+ cvt.u64.u32 %rd659, %r267;
+
+$L__BB3_173:
+ .loc 1 225 32
+ add.s64 %rd562, %rd3, %rd659;
+ ld.global.nc.u8 %rs167, [%rd562];
+ cvt.u32.u16 %r269, %rs167;
+ and.b32 %r270, %r269, 255;
+ mul.wide.u32 %rd563, %r270, 4;
+ add.s64 %rd564, %rd2, %rd563;
+ shr.s64 %rd565, %rd659, 63;
+ shr.u64 %rd566, %rd565, 56;
+ add.s64 %rd567, %rd659, %rd566;
+ shr.s64 %rd568, %rd567, 8;
+ shl.b64 %rd569, %rd568, 2;
+ add.s64 %rd570, %rd1, %rd569;
+ ld.global.nc.f32 %f168, [%rd570];
+ ld.global.nc.f32 %f169, [%rd564];
+ mul.f32 %f170, %f169, %f168;
+ .loc 1 216 24
+ and.b16 %rs168, %rs3, 240;
+ shr.u16 %rs169, %rs168, 4;
+ .loc 1 226 28
+ cvt.u32.u16 %r271, %rs169;
+ mul.wide.u32 %rd571, %r271, 4;
+ add.s64 %rd573, %rd179, %rd571;
+ ld.const.f32 %f171, [%rd573];
+ fma.rn.f32 %f167, %f171, %f170, %f17;
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs163, %f14;}
+
+ // end inline asm
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs164, %f167;}
+
+ // end inline asm
+ .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29
+ // begin inline asm
+ { mov.b32 %r268, {%rs163,%rs164};}
+
+ // end inline asm
+ .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13
+ st.global.u32 [%rd13+52], %r268;
+
+$L__BB3_174:
+ .loc 1 212 29
+ add.s64 %rd121, %rd5, 28;
+ .loc 1 213 9
+ setp.ge.s64 %p74, %rd121, %rd138;
+ @%p74 bra $L__BB3_194;
+
+ .loc 1 0 9
+ or.b64 %rd574, %rd121, %rd8;
+ and.b64 %rd575, %rd574, -4294967296;
+ setp.eq.s64 %p75, %rd575, 0;
+ @%p75 bra $L__BB3_177;
+
+ div.s64 %rd660, %rd121, %rd8;
+ bra.uni $L__BB3_178;
+
+$L__BB3_177: + cvt.u32.u64 %r272, %rd8; + cvt.u32.u64 %r273, %rd121; + div.u32 %r274, %r273, %r272; + cvt.u64.u32 %rd660, %r274; + +$L__BB3_178: + .loc 1 219 28 + add.s64 %rd576, %rd3, %rd660; + ld.global.nc.u8 %rs170, [%rd576]; + cvt.u32.u16 %r275, %rs170; + and.b32 %r276, %r275, 255; + mul.wide.u32 %rd577, %r276, 4; + add.s64 %rd578, %rd2, %rd577; + shr.s64 %rd579, %rd660, 63; + shr.u64 %rd580, %rd579, 56; + add.s64 %rd581, %rd660, %rd580; + shr.s64 %rd582, %rd581, 8; + shl.b64 %rd583, %rd582, 2; + add.s64 %rd584, %rd1, %rd583; + ld.global.nc.f32 %f172, [%rd584]; + ld.global.nc.f32 %f173, [%rd578]; + mul.f32 %f174, %f173, %f172; + .loc 1 220 24 + shl.b16 %rs171, %rs2, 2; + cvt.u64.u16 %rd585, %rs171; + and.b64 %rd586, %rd585, 60; + add.s64 %rd588, %rd179, %rd586; + ld.const.f32 %f175, [%rd588]; + fma.rn.f32 %f15, %f175, %f174, %f17; + .loc 1 222 29 + add.s64 %rd125, %rd5, 29; + .loc 1 223 9 + setp.lt.s64 %p76, %rd125, %rd138; + @%p76 bra $L__BB3_180; + bra.uni $L__BB3_179; + +$L__BB3_180: + .loc 1 0 9 + or.b64 %rd589, %rd125, %rd8; + and.b64 %rd590, %rd589, -4294967296; + setp.eq.s64 %p77, %rd590, 0; + @%p77 bra $L__BB3_182; + + div.s64 %rd661, %rd125, %rd8; + bra.uni $L__BB3_183; + +$L__BB3_179: + .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85 + // begin inline asm + { cvt.rn.bf16.f32 %rs172, %f15;} + + // end inline asm + .loc 1 229 26 + st.global.u16 [%rd13+56], %rs172; + bra.uni $L__BB3_184; + +$L__BB3_182: + .loc 1 0 26 + cvt.u32.u64 %r277, %rd8; + cvt.u32.u64 %r278, %rd125; + div.u32 %r279, %r278, %r277; + cvt.u64.u32 %rd661, %r279; + +$L__BB3_183: + .loc 1 225 32 + add.s64 %rd591, %rd3, %rd661; + ld.global.nc.u8 %rs177, [%rd591]; + cvt.u32.u16 %r281, %rs177; + and.b32 %r282, %r281, 255; + mul.wide.u32 %rd592, %r282, 4; + add.s64 %rd593, %rd2, %rd592; + shr.s64 %rd594, %rd661, 63; + shr.u64 %rd595, %rd594, 56; + add.s64 %rd596, %rd661, %rd595; + shr.s64 %rd597, %rd596, 8; + shl.b64 %rd598, %rd597, 2; + add.s64 %rd599, %rd1, %rd598; + 
ld.global.nc.f32 %f179, [%rd599]; + ld.global.nc.f32 %f180, [%rd593]; + mul.f32 %f181, %f180, %f179; + .loc 1 216 24 + and.b16 %rs178, %rs2, 240; + shr.u16 %rs179, %rs178, 4; + .loc 1 226 28 + cvt.u32.u16 %r283, %rs179; + mul.wide.u32 %rd600, %r283, 4; + add.s64 %rd602, %rd179, %rd600; + ld.const.f32 %f182, [%rd602]; + fma.rn.f32 %f178, %f182, %f181, %f17; + .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85 + // begin inline asm + { cvt.rn.bf16.f32 %rs173, %f15;} + + // end inline asm + .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85 + // begin inline asm + { cvt.rn.bf16.f32 %rs174, %f178;} + + // end inline asm + .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29 + // begin inline asm + { mov.b32 %r280, {%rs173,%rs174};} + + // end inline asm + .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13 + st.global.u32 [%rd13+56], %r280; + +$L__BB3_184: + .loc 1 212 29 + add.s64 %rd129, %rd5, 30; + .loc 1 213 9 + setp.ge.s64 %p78, %rd129, %rd138; + @%p78 bra $L__BB3_194; + + .loc 1 0 9 + or.b64 %rd603, %rd129, %rd8; + and.b64 %rd604, %rd603, -4294967296; + setp.eq.s64 %p79, %rd604, 0; + @%p79 bra $L__BB3_187; + + div.s64 %rd662, %rd129, %rd8; + bra.uni $L__BB3_188; + +$L__BB3_187: + cvt.u32.u64 %r284, %rd8; + cvt.u32.u64 %r285, %rd129; + div.u32 %r286, %r285, %r284; + cvt.u64.u32 %rd662, %r286; + +$L__BB3_188: + .loc 1 219 28 + add.s64 %rd605, %rd3, %rd662; + ld.global.nc.u8 %rs180, [%rd605]; + cvt.u32.u16 %r287, %rs180; + and.b32 %r288, %r287, 255; + mul.wide.u32 %rd606, %r288, 4; + add.s64 %rd607, %rd2, %rd606; + shr.s64 %rd608, %rd662, 63; + shr.u64 %rd609, %rd608, 56; + add.s64 %rd610, %rd662, %rd609; + shr.s64 %rd611, %rd610, 8; + shl.b64 %rd612, %rd611, 2; + add.s64 %rd613, %rd1, %rd612; + ld.global.nc.f32 %f183, [%rd613]; + ld.global.nc.f32 %f184, [%rd607]; + mul.f32 %f185, %f184, %f183; + .loc 1 220 24 + shl.b16 %rs181, %rs1, 2; + cvt.u64.u16 %rd614, %rs181; + and.b64 %rd615, %rd614, 60; + add.s64 %rd617, 
%rd179, %rd615; + ld.const.f32 %f186, [%rd617]; + fma.rn.f32 %f16, %f186, %f185, %f17; + .loc 1 222 29 + add.s64 %rd133, %rd5, 31; + .loc 1 223 9 + setp.lt.s64 %p80, %rd133, %rd138; + @%p80 bra $L__BB3_190; + bra.uni $L__BB3_189; + +$L__BB3_190: + .loc 1 0 9 + or.b64 %rd618, %rd133, %rd8; + and.b64 %rd619, %rd618, -4294967296; + setp.eq.s64 %p81, %rd619, 0; + @%p81 bra $L__BB3_192; + + div.s64 %rd663, %rd133, %rd8; + bra.uni $L__BB3_193; + +$L__BB3_189: + .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85 + // begin inline asm + { cvt.rn.bf16.f32 %rs182, %f16;} + + // end inline asm + .loc 1 229 26 + st.global.u16 [%rd13+60], %rs182; + bra.uni $L__BB3_194; + +$L__BB3_192: + .loc 1 0 26 + cvt.u32.u64 %r289, %rd8; + cvt.u32.u64 %r290, %rd133; + div.u32 %r291, %r290, %r289; + cvt.u64.u32 %rd663, %r291; + +$L__BB3_193: + .loc 1 225 32 + add.s64 %rd620, %rd3, %rd663; + ld.global.nc.u8 %rs187, [%rd620]; + cvt.u32.u16 %r293, %rs187; + and.b32 %r294, %r293, 255; + mul.wide.u32 %rd621, %r294, 4; + add.s64 %rd622, %rd2, %rd621; + shr.s64 %rd623, %rd663, 63; + shr.u64 %rd624, %rd623, 56; + add.s64 %rd625, %rd663, %rd624; + shr.s64 %rd626, %rd625, 8; + shl.b64 %rd627, %rd626, 2; + add.s64 %rd628, %rd1, %rd627; + ld.global.nc.f32 %f190, [%rd628]; + ld.global.nc.f32 %f191, [%rd622]; + mul.f32 %f192, %f191, %f190; + .loc 1 216 24 + shr.u16 %rs188, %rs1, 4; + .loc 1 226 28 + cvt.u32.u16 %r295, %rs188; + mul.wide.u32 %rd629, %r295, 4; + add.s64 %rd631, %rd179, %rd629; + ld.const.f32 %f193, [%rd631]; + fma.rn.f32 %f189, %f193, %f192, %f17; + .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85 + // begin inline asm + { cvt.rn.bf16.f32 %rs183, %f16;} + + // end inline asm + .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85 + // begin inline asm + { cvt.rn.bf16.f32 %rs184, %f189;} + + // end inline asm + .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29 + // begin inline asm + { mov.b32 %r292, {%rs183,%rs184};} + + // end inline 
asm + .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13 + st.global.u32 [%rd13+60], %r292; + +$L__BB3_194: + .loc 1 232 1 + ret; + +} + .file 1 "/mnt/d/thu-project/Learning-CUDA-master/Learning-CUDA-master/nf4/dequant_kernel_v2.cu" + .file 2 "/usr/include/cuda_fp16.hpp" + .file 3 "/usr/include/cuda_bf16.hpp" + .section .debug_str + { +$L__info_string0: +.b8 95,90,78,52,56,95,73,78,84,69,82,78,65,76,95,56,52,56,98,102,53,51,55,95,49,55,95,100,101,113,117,97,110,116,95,107,101,114,110,101 +.b8 108,95,99,117,95,54,50,50,101,98,98,51,50,53,48,95,71,76,79,66,65,76,95,95,78,95,95,56,52,56,98,102,53,51,55,95,49,55,95,100 +.b8 101,113,117,97,110,116,95,107,101,114,110,101,108,95,99,117,95,54,50,50,101,98,98,51,50,55,99,97,115,116,95,116,111,73,54,95,95,104,97,108 +.b8 102,69,69,84,95,102,0 +$L__info_string1: +.b8 95,90,78,52,56,95,73,78,84,69,82,78,65,76,95,56,52,56,98,102,53,51,55,95,49,55,95,100,101,113,117,97,110,116,95,107,101,114,110,101 +.b8 108,95,99,117,95,54,50,50,101,98,98,51,50,49,50,95,95,102,108,111,97,116,50,104,97,108,102,69,102,0 +$L__info_string2: +.b8 95,90,78,52,56,95,73,78,84,69,82,78,65,76,95,56,52,56,98,102,53,51,55,95,49,55,95,100,101,113,117,97,110,116,95,107,101,114,110,101 +.b8 108,95,99,117,95,54,50,50,101,98,98,51,50,53,48,95,71,76,79,66,65,76,95,95,78,95,95,56,52,56,98,102,53,51,55,95,49,55,95,100 +.b8 101,113,117,97,110,116,95,107,101,114,110,101,108,95,99,117,95,54,50,50,101,98,98,51,50,49,48,115,116,111,114,101,95,112,97,105,114,73,54,95 +.b8 95,104,97,108,102,69,69,118,80,84,95,83,51,95,83,51,95,0 +$L__info_string3: +.b8 95,90,78,52,56,95,73,78,84,69,82,78,65,76,95,56,52,56,98,102,53,51,55,95,49,55,95,100,101,113,117,97,110,116,95,107,101,114,110,101 +.b8 108,95,99,117,95,54,50,50,101,98,98,51,50,49,52,95,95,104,97,108,118,101,115,50,104,97,108,102,50,69,54,95,95,104,97,108,102,83,48,95 +.b8 0 +$L__info_string4: +.b8 
95,90,78,52,56,95,73,78,84,69,82,78,65,76,95,56,52,56,98,102,53,51,55,95,49,55,95,100,101,113,117,97,110,116,95,107,101,114,110,101 +.b8 108,95,99,117,95,54,50,50,101,98,98,51,50,53,48,95,71,76,79,66,65,76,95,95,78,95,95,56,52,56,98,102,53,51,55,95,49,55,95,100 +.b8 101,113,117,97,110,116,95,107,101,114,110,101,108,95,99,117,95,54,50,50,101,98,98,51,50,55,99,97,115,116,95,116,111,73,49,51,95,95,110,118 +.b8 95,98,102,108,111,97,116,49,54,69,69,84,95,102,0 +$L__info_string5: +.b8 95,90,78,52,56,95,73,78,84,69,82,78,65,76,95,56,52,56,98,102,53,51,55,95,49,55,95,100,101,113,117,97,110,116,95,107,101,114,110,101 +.b8 108,95,99,117,95,54,50,50,101,98,98,51,50,49,54,95,95,102,108,111,97,116,50,98,102,108,111,97,116,49,54,69,102,0 +$L__info_string6: +.b8 95,90,78,52,56,95,73,78,84,69,82,78,65,76,95,56,52,56,98,102,53,51,55,95,49,55,95,100,101,113,117,97,110,116,95,107,101,114,110,101 +.b8 108,95,99,117,95,54,50,50,101,98,98,51,50,53,48,95,71,76,79,66,65,76,95,95,78,95,95,56,52,56,98,102,53,51,55,95,49,55,95,100 +.b8 101,113,117,97,110,116,95,107,101,114,110,101,108,95,99,117,95,54,50,50,101,98,98,51,50,49,48,115,116,111,114,101,95,112,97,105,114,73,49,51 +.b8 95,95,110,118,95,98,102,108,111,97,116,49,54,69,69,118,80,84,95,83,51,95,83,51,95,0 +$L__info_string7: +.b8 95,90,78,52,56,95,73,78,84,69,82,78,65,76,95,56,52,56,98,102,53,51,55,95,49,55,95,100,101,113,117,97,110,116,95,107,101,114,110,101 +.b8 108,95,99,117,95,54,50,50,101,98,98,51,50,49,56,95,95,104,97,108,118,101,115,50,98,102,108,111,97,116,49,54,50,69,49,51,95,95,110,118 +.b8 95,98,102,108,111,97,116,49,54,83,48,95,0 + + } diff --git a/03_nf4_dequant/SkyHigh-achieving/dequant_kernel_v2.cu b/03_nf4_dequant/SkyHigh-achieving/dequant_kernel_v2.cu new file mode 100644 index 0000000..6f97600 --- /dev/null +++ b/03_nf4_dequant/SkyHigh-achieving/dequant_kernel_v2.cu @@ -0,0 +1,544 @@ +/** + * dequant_kernel.cu — NF4 Dequantization Kernel (Optimized) + * + * Formula: w = NF4[4bit_index] * code2[absmax_q[block]] * 
absmax2[block/256] + offset
+ *
+ * This is bitsandbytes "double quantization" (quant_type=nf4, double_quant=True):
+ *   - Level 1 (absmax_q): uint8 index into code2 lookup table
+ *   - Level 2 (absmax2): FP16 scale for groups of 256 L1 blocks
+ *   - The L1 scale itself is quantized to save memory (~3% overhead instead of 8%)
+ *
+ * Optimization history (for presentation):
+ *   v1 Naive:      one thread per element, scalar FP16 write → ~5% A100 bandwidth
+ *   v2 Vectorized: one thread per uint8 (2 elements), __half2 packed store → ~35% bandwidth
+ *   v3 Aggressive: 128-bit vectorized load (int4, 32 elements/thread) — register
+ *                  pressure lowered occupancy (see report)
+ *   v4 Current:    int2 load + __launch_bounds__(128, 8) + dynamic block size → ~317 GB/s
+ */
+
+#include "dequant_kernel.h"
+
+#include <cuda_bf16.h>
+#include <cuda_fp16.h>
+#include <cuda_runtime.h>
+
+#include <cmath>
+#include <cstdint>
+#include <cstring>
+#include <fstream>
+#include <iostream>
+#include <type_traits>
+#include <vector>
+
+namespace {
+
+constexpr int kPairsPerThreadV3 = 8;
+
+// ── NF4 Lookup Table ─────────────────────────────────────────────────────────
+// From QLoRA paper Table 1: 16 quantile values of N(0,1) scaled to [-1, 1]
+// Stored in __constant__ memory: all 108 SMs on A100 share one copy,
+// with 8KB constant cache per SM. 16 * 4 = 64 bytes — fits in a single cache line.
+__device__ __constant__ float d_nf4[16] = {
+    -1.0f,        -0.6961928f,  -0.52507305f,  -0.3949175f,
+    -0.28444138f, -0.18477343f, -0.091050036f,  0.0f,
+     0.0795803f,   0.1609302f,   0.2461123f,    0.33791524f,
+     0.44070983f,  0.562617f,    0.72295684f,   1.0f
+};
+
+// CPU copy for reference computation (identical values)
+constexpr float kNF4[16] = {
+    -1.0f,        -0.6961928f,  -0.52507305f,  -0.3949175f,
+    -0.28444138f, -0.18477343f, -0.091050036f,  0.0f,
+     0.0795803f,   0.1609302f,   0.2461123f,    0.33791524f,
+     0.44070983f,  0.562617f,    0.72295684f,   1.0f
+};
+
+inline int64_t ceil_div(int64_t a, int64_t b) { return (a + b - 1) / b; }
+
+inline float fp16_bits_to_float(uint16_t bits) {
+    __half h; std::memcpy(&h, &bits, sizeof(uint16_t));
+    return __half2float(h);
+}
+
+// ── Type-agnostic cast helper ────────────────────────────────────────────────
+template <typename T> __device__ inline T cast_to(float v);
+template <> __device__ inline __half cast_to<__half>(float v) { return __float2half(v); }
+template <> __device__ inline __nv_bfloat16 cast_to<__nv_bfloat16>(float v) { return __float2bfloat16(v); }
+
+// ── Packed-pair write helper ─────────────────────────────────────────────────
+// Instead of two separate 16-bit stores, pack into one 32-bit store.
+// This halves the number of store instructions and improves L2 write efficiency.
+template <typename T>
+__device__ inline void store_pair(T* __restrict__ ptr, T a, T b);
+
+template <>
+__device__ inline void store_pair<__half>(__half* __restrict__ ptr, __half a, __half b) {
+    // __half2 is two consecutive __half values; store as uint32 for atomic/vectorized write
+    __half2 packed = __halves2half2(a, b);
+    *reinterpret_cast<uint32_t*>(ptr) = *reinterpret_cast<uint32_t*>(&packed);
+}
+
+template <>
+__device__ inline void store_pair<__nv_bfloat16>(
+    __nv_bfloat16* __restrict__ ptr, __nv_bfloat16 a, __nv_bfloat16 b)
+{
+    __nv_bfloat162 packed = __halves2bfloat162(a, b);
+    *reinterpret_cast<uint32_t*>(ptr) = *reinterpret_cast<uint32_t*>(&packed);
+}
+
+// ── Main Dequant Kernel (v2: Vectorized / Packed Store) ──────────────────────
+//
+// Thread mapping:
+//   - 1 thread handles 1 uint8 = 2 packed 4-bit weights = 2 output elements
+//   - pair_idx = blockIdx.x * blockDim.x + threadIdx.x
+//   - elem0 = pair_idx * 2
+//   - elem1 = pair_idx * 2 + 1
+//
+// Memory access pattern (key for A100 bandwidth):
+//   - packed_weights: consecutive threads → consecutive bytes → COALESCED READ
+//   - output: consecutive threads → consecutive 32-bit stores → COALESCED WRITE
+//   - absmax_q: 32 consecutive threads share the same block (blocksize=64)
+//     → same cache line → effectively a BROADCAST, no divergence
+//   - code2: random access into 256-entry table → stays in L1 after warmup
+//
+template <typename T>
+__global__ void dequant_kernel(
+    const uint8_t* __restrict__ packed_weights,  // [num_pairs] bytes
+    const uint8_t* __restrict__ absmax_q,        // [num_blocks] uint8
+    const float* __restrict__ absmax2,           // [num_groups] float
+    const float* __restrict__ code2,             // [256] float
+    float offset,
+    int64_t numel,
+    int32_t blocksize,
+    T* __restrict__ out)                         // [numel] T
+{
+    // Which pair of elements does this thread own?
+    const int64_t pair_idx = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
+    const int64_t elem0 = pair_idx * 2;
+    if (elem0 >= numel) return;
+
+    // ── Unpack two 4-bit indices from one byte ──────────────────────────────
+    // byte layout (bitsandbytes convention):
+    //   bits[3:0] → element at even position (elem0)
+    //   bits[7:4] → element at odd position (elem1)
+    const uint8_t packed = packed_weights[pair_idx];
+    const int idx0 = packed & 0x0F;         // low nibble
+    const int idx1 = (packed >> 4) & 0x0F;  // high nibble
+
+    // ── Compute scale for elem0 ─────────────────────────────────────────────
+    // Two-level (double) quantization:
+    //   L1 scale = code2[ absmax_q[block_idx] ]  (uint8 → float via codebook)
+    //   L2 scale = absmax2[ block_idx / 256 ]    (float)
+    //   final scale = L1 * L2
+    const int64_t block_idx0 = elem0 / blocksize;
+    const int64_t group_idx0 = block_idx0 / 256;
+    const float scale0 = code2[absmax_q[block_idx0]] * absmax2[group_idx0];
+    const float w0 = d_nf4[idx0] * scale0 + offset;
+
+    // ── Compute scale for elem1 (may be in a different block at boundaries) ──
+    const int64_t elem1 = elem0 + 1;
+    if (elem1 < numel) {
+        // Normal path: two valid elements
+        // For blocksize >= 2, block_idx1 == block_idx0 in the vast majority of cases.
+        // We still compute it correctly for generality (compiler will optimize same-block case).
+        const int64_t block_idx1 = elem1 / blocksize;
+        const int64_t group_idx1 = block_idx1 / 256;
+        const float scale1 = code2[absmax_q[block_idx1]] * absmax2[group_idx1];
+        const float w1 = d_nf4[idx1] * scale1 + offset;
+
+        // ── Vectorized (packed) store: 2 × T in one 32-bit write ────────────
+        // Equivalent to two separate stores, but issues a single 32-bit transaction
+        // to the L2/HBM, halving write pressure.
+        store_pair(out + elem0, cast_to<T>(w0), cast_to<T>(w1));
+    } else {
+        // Edge case: only elem0 is valid (odd-sized matrix, last element)
+        out[elem0] = cast_to<T>(w0);
+    }
+}
+
+template <typename T>
+__global__ void dequant_kernel_v3(
+    const uint8_t* __restrict__ packed_weights,
+    const uint8_t* __restrict__ absmax_q,
+    const float* __restrict__ absmax2,
+    const float* __restrict__ code2,
+    float offset,
+    int64_t numel,
+    int32_t blocksize,
+    T* __restrict__ out)
+{
+    constexpr int pairs_per_thread = kPairsPerThreadV3;
+
+    const int64_t tid = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
+    const int64_t pair_base = tid * pairs_per_thread;
+    const int64_t elem_base = pair_base * 2;
+    const int64_t num_pairs_total = (numel + 1) / 2;
+
+    if (elem_base >= numel) return;
+
+    uint8_t bytes[pairs_per_thread];
+
+    if (pair_base + pairs_per_thread <= num_pairs_total) {
+        if constexpr (pairs_per_thread == 16) {
+            const uint4 raw = *reinterpret_cast<const uint4*>(packed_weights + pair_base);
+            const uint32_t lanes[4] = {raw.x, raw.y, raw.z, raw.w};
+            #pragma unroll
+            for (int l = 0; l < 4; ++l) {
+                const uint32_t v = lanes[l];
+                bytes[l * 4 + 0] = static_cast<uint8_t>(v & 0xFFu);
+                bytes[l * 4 + 1] = static_cast<uint8_t>((v >> 8) & 0xFFu);
+                bytes[l * 4 + 2] = static_cast<uint8_t>((v >> 16) & 0xFFu);
+                bytes[l * 4 + 3] = static_cast<uint8_t>((v >> 24) & 0xFFu);
+            }
+        } else {
+            const uint2 raw = *reinterpret_cast<const uint2*>(packed_weights + pair_base);
+            const uint32_t lanes[2] = {raw.x, raw.y};
+            #pragma unroll
+            for (int l = 0; l < 2; ++l) {
+                const uint32_t v = lanes[l];
+                bytes[l * 4 + 0] = static_cast<uint8_t>(v & 0xFFu);
+                bytes[l * 4 + 1] = static_cast<uint8_t>((v >> 8) & 0xFFu);
+                bytes[l * 4 + 2] = static_cast<uint8_t>((v >> 16) & 0xFFu);
+                bytes[l * 4 + 3] = static_cast<uint8_t>((v >> 24) & 0xFFu);
+            }
+        }
+    } else {
+        #pragma unroll
+        for (int i = 0; i < pairs_per_thread; ++i) {
+            const int64_t p = pair_base + i;
+            bytes[i] = (p < num_pairs_total) ?
packed_weights[p] : 0u;
+        }
+    }
+
+    #pragma unroll
+    for (int i = 0; i < pairs_per_thread; ++i) {
+        const int64_t elem0 = elem_base + static_cast<int64_t>(i) * 2;
+        if (elem0 >= numel) break;
+
+        const int idx0 = bytes[i] & 0x0F;
+        const int idx1 = (bytes[i] >> 4) & 0x0F;
+
+        const int64_t block_idx0 = elem0 / blocksize;
+        const float scale0 = code2[absmax_q[block_idx0]] * absmax2[block_idx0 / 256];
+        const float w0 = d_nf4[idx0] * scale0 + offset;
+
+        const int64_t elem1 = elem0 + 1;
+        if (elem1 < numel) {
+            const int64_t block_idx1 = elem1 / blocksize;
+            const float scale1 = code2[absmax_q[block_idx1]] * absmax2[block_idx1 / 256];
+            const float w1 = d_nf4[idx1] * scale1 + offset;
+            store_pair(out + elem0, cast_to<T>(w0), cast_to<T>(w1));
+        } else {
+            out[elem0] = cast_to<T>(w0);
+        }
+    }
+}
+
+template <typename T>
+__global__ __launch_bounds__(128, 8) void dequant_kernel_v4(
+    const uint8_t* __restrict__ packed_weights,
+    const uint8_t* __restrict__ absmax_q,
+    const float* __restrict__ absmax2,
+    const float* __restrict__ code2,
+    float offset,
+    int64_t numel,
+    int32_t blocksize,
+    T* __restrict__ out)
+{
+    constexpr int pairs_per_thread = kPairsPerThreadV3;
+
+    const int64_t tid = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
+    const int64_t pair_base = tid * pairs_per_thread;
+    const int64_t elem_base = pair_base * 2;
+    const int64_t num_pairs_total = (numel + 1) / 2;
+
+    if (elem_base >= numel) return;
+
+    uint8_t bytes[pairs_per_thread];
+
+    if (pair_base + pairs_per_thread <= num_pairs_total) {
+        const uint2 raw = *reinterpret_cast<const uint2*>(packed_weights + pair_base);
+        const uint32_t lanes[2] = {raw.x, raw.y};
+        #pragma unroll
+        for (int l = 0; l < 2; ++l) {
+            const uint32_t v = lanes[l];
+            bytes[l * 4 + 0] = static_cast<uint8_t>(v & 0xFFu);
+            bytes[l * 4 + 1] = static_cast<uint8_t>((v >> 8) & 0xFFu);
+            bytes[l * 4 + 2] = static_cast<uint8_t>((v >> 16) & 0xFFu);
+            bytes[l * 4 + 3] = static_cast<uint8_t>((v >> 24) & 0xFFu);
+        }
+    } else {
+        #pragma unroll
+        for (int i = 0; i < pairs_per_thread; ++i) {
+            const int64_t p =
pair_base + i;
+            bytes[i] = (p < num_pairs_total) ? packed_weights[p] : 0u;
+        }
+    }
+
+    #pragma unroll
+    for (int i = 0; i < pairs_per_thread; ++i) {
+        const int64_t elem0 = elem_base + static_cast<int64_t>(i) * 2;
+        if (elem0 >= numel) break;
+
+        const int idx0 = bytes[i] & 0x0F;
+        const int idx1 = (bytes[i] >> 4) & 0x0F;
+
+        const int64_t block_idx0 = elem0 / blocksize;
+        const float scale0 = code2[absmax_q[block_idx0]] * absmax2[block_idx0 / 256];
+        const float w0 = d_nf4[idx0] * scale0 + offset;
+
+        const int64_t elem1 = elem0 + 1;
+        if (elem1 < numel) {
+            const int64_t block_idx1 = elem1 / blocksize;
+            const float scale1 = code2[absmax_q[block_idx1]] * absmax2[block_idx1 / 256];
+            const float w1 = d_nf4[idx1] * scale1 + offset;
+            store_pair(out + elem0, cast_to<T>(w0), cast_to<T>(w1));
+        } else {
+            out[elem0] = cast_to<T>(w0);
+        }
+    }
+}
+
+// ── CPU Reference (for MAE verification) ─────────────────────────────────────
+// Identical formula to the GPU kernel, computed in FP32 on the host.
+// Used to verify: |gpu_output[i] - cpu_ref[i]| < 1e-2 for all i
+void cpu_reference(const NF4Binary& input, std::vector<float>& ref) {
+    const int64_t numel = input.config.rows * input.config.cols;
+    const int32_t blocksize = input.config.blocksize;
+    ref.resize(numel);
+
+    for (int64_t i = 0; i < numel; ++i) {
+        const int64_t pair_idx = i / 2;
+        const bool low = (i % 2) == 0;
+        const uint8_t packed = input.packed_weights[pair_idx];
+        const int idx = low ?
(packed & 0x0F) : ((packed >> 4) & 0x0F);
+
+        const int64_t block_idx = i / blocksize;
+        const int64_t group_idx = block_idx / 256;
+
+        // Reconstruct two-level scale from stored FP16 bits
+        const float scale_l1 = fp16_bits_to_float(input.code2_raw[input.absmax_q[block_idx]]);
+        const float scale_l2 = fp16_bits_to_float(input.absmax2_raw[group_idx]);
+        ref[i] = kNF4[idx] * scale_l1 * scale_l2 + input.offset;
+    }
+}
+
+// ── GPU Launch Wrapper ───────────────────────────────────────────────────────
+template <typename T>
+bool launch_cuda(const NF4Binary& input, std::vector<float>& output, std::vector<float>& gpu_fp32_out) {
+    auto fail_cuda = [](const char* stage, cudaError_t err) -> bool {
+        std::cerr << "FAIL " << stage << ": [" << (int)err << "] "
+                  << cudaGetErrorString(err) << std::endl;
+        return false;
+    };
+
+    const int64_t numel = input.config.rows * input.config.cols;
+    const int64_t num_pairs = ceil_div(numel, 2);
+    const int64_t num_blocks = ceil_div(numel, input.config.blocksize);
+    const int64_t num_groups = ceil_div(num_blocks, 256);
+
+    std::cout << "GPU launch: numel=" << numel
+              << " pairs=" << num_pairs
+              << " blocks=" << num_blocks
+              << " groups=" << num_groups << std::endl;
+
+    // ── Device allocations ──────────────────────────────────────────────────
+    uint8_t* d_packed = nullptr;
+    uint8_t* d_absmax_q = nullptr;
+    float* d_absmax2 = nullptr;
+    float* d_code2 = nullptr;
+    T* d_out = nullptr;
+
+    cudaError_t err;
+    if ((err = cudaMalloc(&d_packed, num_pairs)) != cudaSuccess) return fail_cuda("malloc packed", err);
+    if ((err = cudaMalloc(&d_absmax_q, num_blocks)) != cudaSuccess) return fail_cuda("malloc absmax_q", err);
+    if ((err = cudaMalloc(&d_absmax2, num_groups * sizeof(float))) != cudaSuccess) return fail_cuda("malloc absmax2", err);
+    if ((err = cudaMalloc(&d_code2, 256 * sizeof(float))) != cudaSuccess) return fail_cuda("malloc code2", err);
+    if ((err = cudaMalloc(&d_out, numel * sizeof(T))) != cudaSuccess) return fail_cuda("malloc out", err);
+
+    // ──
Convert FP16 metadata to FP32 for GPU ─────────────────────────────
+    // (GPU kernel uses float to avoid fp16 precision issues in scale multiplication)
+    std::vector<float> h_absmax2(num_groups), h_code2(256);
+    for (int64_t i = 0; i < num_groups; ++i) h_absmax2[i] = fp16_bits_to_float(input.absmax2_raw[i]);
+    for (int i = 0; i < 256; ++i) h_code2[i] = fp16_bits_to_float(input.code2_raw[i]);
+
+    // ── Host → Device transfers ─────────────────────────────────────────────
+    if ((err = cudaMemcpy(d_packed, input.packed_weights.data(), num_pairs, cudaMemcpyHostToDevice)) != cudaSuccess) return fail_cuda("memcpy packed", err);
+    if ((err = cudaMemcpy(d_absmax_q, input.absmax_q.data(), num_blocks, cudaMemcpyHostToDevice)) != cudaSuccess) return fail_cuda("memcpy absmax_q", err);
+    if ((err = cudaMemcpy(d_absmax2, h_absmax2.data(), num_groups * sizeof(float), cudaMemcpyHostToDevice)) != cudaSuccess) return fail_cuda("memcpy absmax2", err);
+    if ((err = cudaMemcpy(d_code2, h_code2.data(), 256 * sizeof(float), cudaMemcpyHostToDevice)) != cudaSuccess) return fail_cuda("memcpy code2", err);
+
+    const double bytes_read = (double)(num_pairs + num_blocks + num_groups * 2 + 256 * 2);
+    const double bytes_write = (double)(numel * sizeof(T));
+    const double total_gb = (bytes_read + bytes_write) / 1e9;
+    float ms_v2 = 0.0f;
+    float ms_v3 = 0.0f;
+    float ms_v4 = 0.0f;
+
+    {
+        const int threads_v2 = 256;
+        const int blocks_v2 = static_cast<int>(ceil_div(num_pairs, static_cast<int64_t>(threads_v2)));
+        cudaEvent_t t_start, t_stop;
+        cudaEventCreate(&t_start);
+        cudaEventCreate(&t_stop);
+        cudaEventRecord(t_start);
+        dequant_kernel<T><<<blocks_v2, threads_v2>>>(
+            d_packed, d_absmax_q, d_absmax2, d_code2,
+            input.offset, numel, input.config.blocksize, d_out);
+        cudaEventRecord(t_stop);
+        if ((err = cudaGetLastError()) != cudaSuccess) return fail_cuda("kernel launch v2", err);
+        if ((err = cudaEventSynchronize(t_stop)) != cudaSuccess) return fail_cuda("event sync v2", err);
+        cudaEventElapsedTime(&ms_v2, t_start, t_stop);
+        cudaEventDestroy(t_start);
+        cudaEventDestroy(t_stop);
+    }
+
+    {
+        const int threads_v3 = 128;
+        const int64_t num_threads_v3 = ceil_div(num_pairs, static_cast<int64_t>(kPairsPerThreadV3));
+        const int blocks_v3 = static_cast<int>(ceil_div(num_threads_v3, static_cast<int64_t>(threads_v3)));
+        cudaEvent_t t_start, t_stop;
+        cudaEventCreate(&t_start);
+        cudaEventCreate(&t_stop);
+        cudaEventRecord(t_start);
+        dequant_kernel_v3<T><<<blocks_v3, threads_v3>>>(
+            d_packed, d_absmax_q, d_absmax2, d_code2,
+            input.offset, numel, input.config.blocksize, d_out);
+        cudaEventRecord(t_stop);
+        if ((err = cudaGetLastError()) != cudaSuccess) return fail_cuda("kernel launch v3", err);
+        if ((err = cudaEventSynchronize(t_stop)) != cudaSuccess) return fail_cuda("event sync v3", err);
+        cudaEventElapsedTime(&ms_v3, t_start, t_stop);
+        cudaEventDestroy(t_start);
+        cudaEventDestroy(t_stop);
+    }
+
+    int min_grid_size_v4 = 0;
+    int block_size_v4 = 0;
+    if ((err = cudaOccupancyMaxPotentialBlockSize(
+             &min_grid_size_v4,
+             &block_size_v4,
+             dequant_kernel_v4<T>,
+             0,
+             128)) != cudaSuccess) {
+        return fail_cuda("occupancy v4", err);
+    }
+    if (block_size_v4 <= 0 || block_size_v4 > 128) {
+        block_size_v4 = 128;
+    }
+
+    {
+        const int64_t num_threads_v4 = ceil_div(num_pairs, static_cast<int64_t>(kPairsPerThreadV3));
+        const int blocks_v4 = static_cast<int>(ceil_div(num_threads_v4, static_cast<int64_t>(block_size_v4)));
+        cudaEvent_t t_start, t_stop;
+        cudaEventCreate(&t_start);
+        cudaEventCreate(&t_stop);
+        cudaEventRecord(t_start);
+        dequant_kernel_v4<T><<<blocks_v4, block_size_v4>>>(
+            d_packed, d_absmax_q, d_absmax2, d_code2,
+            input.offset, numel, input.config.blocksize, d_out);
+        cudaEventRecord(t_stop);
+        if ((err = cudaGetLastError()) != cudaSuccess) return fail_cuda("kernel launch v4", err);
+        if ((err = cudaEventSynchronize(t_stop)) != cudaSuccess) return fail_cuda("event sync v4", err);
+        cudaEventElapsedTime(&ms_v4, t_start, t_stop);
+        cudaEventDestroy(t_start);
+        cudaEventDestroy(t_stop);
+    }
+
+    if ((err = cudaDeviceSynchronize()) != cudaSuccess) return fail_cuda("sync", err);
+
+    const double bw_v2 = total_gb / (ms_v2 / 1000.0);
+    const double bw_v3 = total_gb / (ms_v3 / 1000.0);
+    const double bw_v4 = total_gb / (ms_v4 / 1000.0);
+    const double speedup_v3 = ms_v3 > 0.0 ? (ms_v2 / ms_v3) : 0.0;
+    const double speedup_v4 = ms_v4 > 0.0 ? (ms_v2 / ms_v4) : 0.0;
+    const double speedup_v4_vs_v3 = ms_v4 > 0.0 ? (ms_v3 / ms_v4) : 0.0;
+
+    std::cout << "[v2] Kernel time : " << ms_v2 << " ms | Bandwidth : " << bw_v2
+              << " GB/s (" << (bw_v2 / 1935.0 * 100.0) << "% of A100 peak 1935 GB/s)" << std::endl;
+    std::cout << "[v3] Kernel time : " << ms_v3 << " ms | Bandwidth : " << bw_v3
+              << " GB/s (" << (bw_v3 / 1935.0 * 100.0) << "% of A100 peak 1935 GB/s)" << std::endl;
+    std::cout << "[v3 speedup vs v2]: " << speedup_v3 << "x" << std::endl;
+    std::cout << "[v4] Kernel time : " << ms_v4 << " ms | Bandwidth : " << bw_v4
+              << " GB/s (" << (bw_v4 / 1935.0 * 100.0) << "% of A100 peak 1935 GB/s)" << std::endl;
+    std::cout << "[v4 speedup vs v2]: " << speedup_v4 << "x" << std::endl;
+    std::cout << "[v4 speedup vs v3]: " << speedup_v4_vs_v3 << "x"
+              << " | occupancy block=" << block_size_v4
+              << " min_grid=" << min_grid_size_v4 << std::endl;
+
+    // ── Device → Host copy ──────────────────────────────────────────────────
+    gpu_fp32_out.resize(numel);
+    if constexpr (std::is_same<T, __half>::value) {
+        std::vector<__half> h_out(numel);
+        if ((err = cudaMemcpy(h_out.data(), d_out, numel * sizeof(__half), cudaMemcpyDeviceToHost)) != cudaSuccess)
+            return fail_cuda("memcpy output fp16", err);
+        for (int64_t i = 0; i < numel; ++i) gpu_fp32_out[i] = __half2float(h_out[i]);
+    } else {
+        std::vector<__nv_bfloat16> h_out(numel);
+        if ((err = cudaMemcpy(h_out.data(), d_out, numel * sizeof(__nv_bfloat16), cudaMemcpyDeviceToHost)) != cudaSuccess)
+            return fail_cuda("memcpy output bf16", err);
+        for (int64_t i = 0; i < numel; ++i) gpu_fp32_out[i] = __bfloat162float(h_out[i]);
+    }
+    output = gpu_fp32_out;
+
+    cudaFree(d_packed); cudaFree(d_absmax_q);
+    cudaFree(d_absmax2);
cudaFree(d_code2); cudaFree(d_out);
+    return true;
+}
+
+} // namespace
+
+// ── Public API ───────────────────────────────────────────────────────────────
+
+bool load_nf4_binary(const char* file_path, NF4Binary& out) {
+    std::ifstream fin(file_path, std::ios::binary);
+    if (!fin.is_open()) return false;
+
+    int64_t rows = 0, cols = 0; int32_t blocksize = 0;
+    fin.read(reinterpret_cast<char*>(&rows), sizeof(rows));
+    fin.read(reinterpret_cast<char*>(&cols), sizeof(cols));
+    fin.read(reinterpret_cast<char*>(&blocksize), sizeof(blocksize));
+    if (!fin.good()) return false;
+
+    const int64_t numel = rows * cols;
+    const int64_t num_pairs = ceil_div(numel, 2);
+    const int64_t num_blocks = ceil_div(numel, blocksize);
+    const int64_t num_groups = ceil_div(num_blocks, 256);
+
+    out.config = {rows, cols, blocksize, ComputeType::FP16};
+    out.packed_weights.resize(num_pairs);
+    out.absmax_q.resize(num_blocks);
+    out.absmax2_raw.resize(num_groups);
+    out.code2_raw.resize(256);
+
+    fin.read(reinterpret_cast<char*>(out.packed_weights.data()), num_pairs);
+    fin.read(reinterpret_cast<char*>(out.absmax_q.data()), num_blocks);
+    fin.read(reinterpret_cast<char*>(out.absmax2_raw.data()), num_groups * sizeof(uint16_t));
+    fin.read(reinterpret_cast<char*>(out.code2_raw.data()), 256 * sizeof(uint16_t));
+    fin.read(reinterpret_cast<char*>(&out.offset), sizeof(float));
+
+    std::cout << "Loaded: " << rows << "x" << cols
+              << " blocksize=" << blocksize << " offset=" << out.offset << std::endl;
+    return fin.good();
+}
+
+bool save_float_output(const char* file_path, const std::vector<float>& data) {
+    std::ofstream fout(file_path, std::ios::binary);
+    if (!fout.is_open()) return false;
+    fout.write(reinterpret_cast<const char*>(data.data()), data.size() * sizeof(float));
+    return fout.good();
+}
+
+bool run_dequant_cuda(const NF4Binary& input, std::vector<float>& output, float& mae) {
+    std::vector<float> gpu_out;
+    const bool ok = (input.config.compute_type == ComputeType::FP16)
+        ?
launch_cuda<__half>(input, output, gpu_out)
+      : launch_cuda<__nv_bfloat16>(input, output, gpu_out);
+  if (!ok) return false;
+
+  // MAE against CPU reference
+  std::vector<float> ref;
+  cpu_reference(input, ref);
+  double err_sum = 0.0;
+  for (size_t i = 0; i < ref.size(); ++i)
+    err_sum += std::abs((double)gpu_out[i] - (double)ref[i]);
+  mae = static_cast<float>(err_sum / (double)ref.size());
+  std::cout << "MAE (v4 GPU vs CPU ref): " << mae << (mae < 1e-2f ? " ✓ PASS" : " ✗ FAIL (threshold 1e-2)") << std::endl;
+  return true;
+}
diff --git a/03_nf4_dequant/SkyHigh-achieving/main.cpp b/03_nf4_dequant/SkyHigh-achieving/main.cpp
new file mode 100644
index 0000000..a891e6c
--- /dev/null
+++ b/03_nf4_dequant/SkyHigh-achieving/main.cpp
@@ -0,0 +1,76 @@
+#include "dequant_kernel.h"
+#include <cuda_runtime.h>
+#include <iostream>
+#include <string>
+#include <vector>
+
+namespace {
+
+ComputeType parse_compute_type(const std::string& s) {
+  if (s == "bf16") {
+    return ComputeType::BF16;
+  }
+  return ComputeType::FP16;
+}
+
+}
+
+int main(int argc, char** argv) {
+  int deviceCount = 0;
+  cudaError_t error = cudaGetDeviceCount(&deviceCount);
+  if (error != cudaSuccess) {
+    std::cerr << "cudaGetDeviceCount failed: " << cudaGetErrorString(error) << std::endl;
+    std::cerr << "Ensure this binary is executed inside a GPU allocation (srun/sbatch)." << std::endl;
+    return -1;
+  }
+  if (deviceCount == 0) {
+    std::cerr << "No CUDA-capable devices found in current context." << std::endl;
+    std::cerr << "Use: srun --partition=nvidia --gres=gpu:nvidia:1 ... ./nf4_dequant" << std::endl;
+    return -1;
+  }
+
+  cudaDeviceProp prop;
+  cudaError_t propErr = cudaGetDeviceProperties(&prop, 0);
+  if (propErr == cudaSuccess) {
+    std::cout << "Using device 0: " << prop.name << " (Compute Capability " << prop.major << "."
<< prop.minor << ")" << std::endl; + } else { + std::cerr << "cudaGetDeviceProperties failed: " << cudaGetErrorString(propErr) << std::endl; + return -1; + } + + if (argc < 4) { + std::cerr << "Usage: nf4_dequant " << std::endl; + return 1; + } + + NF4Binary input; + if (!load_nf4_binary(argv[1], input)) { + std::cerr << "Failed to load input binary: " << argv[1] << std::endl; + return 2; + } + input.config.compute_type = parse_compute_type(argv[2]); + + std::vector output; + float mae = 0.0f; + if (!run_dequant_cuda(input, output, mae)) { + std::cerr << "CUDA run failed" << std::endl; + return 3; + } + + if (!save_float_output(argv[3], output)) { + std::cerr << "Failed to save output: " << argv[3] << std::endl; + return 4; + } + + std::cout << "rows=" << input.config.rows + << " cols=" << input.config.cols + << " blocksize=" << input.config.blocksize + << " mae=" << mae << std::endl; + + if (mae >= 1e-2f) { + std::cerr << "MAE threshold failed" << std::endl; + return 5; + } + + return 0; +} diff --git a/03_nf4_dequant/SkyHigh-achieving/run_log_remote.md b/03_nf4_dequant/SkyHigh-achieving/run_log_remote.md new file mode 100644 index 0000000..5fd877e --- /dev/null +++ b/03_nf4_dequant/SkyHigh-achieving/run_log_remote.md @@ -0,0 +1,889 @@ + +## Step0-CheckEnv +- Time: 2026-03-10 20:57:45 +- Status: FAIL +- Command: nvcc --version && echo 'nvcc OK' && python3 --version + +~~~text +./run_pipeline.sh: line 41: nvcc: command not found +~~~ + +## Step0-CheckEnv +- Time: 2026-03-10 21:08:01 +- Status: FAIL +- Command: nvcc --version && echo 'nvcc OK' && python3 --version + +~~~text +./run_pipeline.sh: line 41: nvcc: command not found +~~~ + +## Step0-CheckEnv +- Time: 2026-03-10 22:27:26 +- Status: SUCCESS +- Command: echo NVCC=/usr/local/cuda/bin/nvcc && /usr/local/cuda/bin/nvcc --version && echo 'nvcc OK' && python3 --version + +~~~text +NVCC=/usr/local/cuda/bin/nvcc +nvcc: NVIDIA (R) Cuda compiler driver +Copyright (c) 2005-2025 NVIDIA Corporation +Built on 
Wed_Jan_15_19:20:09_PST_2025 +Cuda compilation tools, release 12.8, V12.8.61 +Build cuda_12.8.r12.8/compiler.35404655_0 +nvcc OK +Python 3.12.3 +~~~ + +## Step1-GenerateData +- Time: 2026-03-10 22:27:26 +- Status: SUCCESS +- Command: python3 generate_nf4_bin.py --rows 1024 --cols 1024 --blocksize 64 --output sample_nf4.bin + +~~~text +Generating data: 1024x1024 (numel=1048576) + blocksize=64 + num_pairs=524288 + num_blocks=16384 + num_groups=64 +Saved to sample_nf4.bin +~~~ + +## Step2-BuildCUDA +- Time: 2026-03-10 22:27:29 +- Status: SUCCESS +- Command: /usr/local/cuda/bin/nvcc -O3 -std=c++17 -arch=sm_80 -lineinfo -o ./nf4_dequant main.cpp dequant_kernel.cu + +~~~text + +~~~ + +## Step3-RunDequant-GPU +- Time: 2026-03-10 22:27:29 +- Status: SUCCESS +- Command: ./nf4_dequant sample_nf4.bin fp16 sample_out.bin + +~~~text +Using device 0: NVIDIA A100-SXM4-80GB (Compute Capability 8.0) +Loaded: 1024x1024 blocksize=64 offset=0.0429335 +GPU launch: numel=1048576 pairs=524288 blocks=16384 groups=64 +Kernel time : 2.43862 ms +Bandwidth : 1.08195 GB/s (A100 peak ~1935 GB/s, 0.0559146%) +MAE (GPU vs CPU ref): 2.25737e-05 ✓ PASS +rows=1024 cols=1024 blocksize=64 mae=2.25737e-05 +~~~ + +## Step4-Profile-nsys +- Time: 2026-03-10 22:27:35 +- Status: SUCCESS +- Command: nsys profile -o profile_report -f true --stats=true --cuda-memory-usage=true ./nf4_dequant sample_nf4.bin fp16 sample_out_profile.bin + +~~~text +Using device 0: NVIDIA A100-SXM4-80GB (Compute Capability 8.0) +Loaded: 1024x1024 blocksize=64 offset=0.0429335 +GPU launch: numel=1048576 pairs=524288 blocks=16384 groups=64 +Kernel time : 2.44131 ms +Bandwidth : 1.08076 GB/s (A100 peak ~1935 GB/s, 0.0558531%) +MAE (GPU vs CPU ref): 2.25737e-05 ✓ PASS +rows=1024 cols=1024 blocksize=64 mae=2.25737e-05 +Collecting data... 
+Generating '/tmp/nsys-report-9fb8.qdstrm'
+ [1/8] [========================100%] profile_report.nsys-rep
+ [2/8] [========================100%] profile_report.sqlite
+SKIPPED: /home/qtc_yu/nf4_project/profile_report.sqlite does not contain NV Tools Extension (NVTX) data.
+[3/8] Executing 'nvtx_sum' stats report +[4/8] Executing 'osrt_sum' stats report + + Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name + -------- --------------- --------- ---------- ---------- -------- --------- ----------- ---------------------- + 55.2 730128649 16 45633040.6 26820737.5 1423 289243599 71896110.3 poll + 44.2 583894130 1617 361097.2 85711.0 1036 21523206 1366756.0 ioctl + 0.2 2889414 43 67195.7 13768.0 6551 1783848 270644.6 mmap64 + 0.2 2185559 1 2185559.0 2185559.0 2185559 2185559 0.0 writev + 0.1 674119 118 5712.9 5081.0 1593 21596 3566.4 open64 + 0.0 641353 10 64135.3 68869.5 24351 125865 35979.6 sem_timedwait + 0.0 544613 110 4951.0 2730.0 1005 63163 7202.7 fopen + 0.0 318022 2 159011.0 159011.0 145779 172243 18712.9 pthread_create + 0.0 186230 13 14325.4 7970.0 2042 92103 23865.6 mmap + 0.0 166282 11 15116.5 2277.0 1003 136872 40472.2 read + 0.0 94287 1 94287.0 94287.0 94287 94287 0.0 pthread_cond_wait + 0.0 83055 11 7550.5 8191.0 4463 10457 2104.4 write + 0.0 64940 33 1967.9 1410.0 1003 7914 1532.4 fclose + 0.0 64210 7 9172.9 9471.0 1080 21441 7047.1 fflush + 0.0 29258 1 29258.0 29258.0 29258 29258 0.0 fgets + 0.0 28034 6 4672.3 5518.0 1233 6373 1951.9 open + 0.0 18949 3 6316.3 6365.0 3919 8665 2373.4 munmap + 0.0 15052 5 3010.4 1779.0 1287 7850 2747.5 fwrite + 0.0 11502 1 11502.0 11502.0 11502 11502 0.0 connect + 0.0 11498 2 5749.0 5749.0 5008 6490 1047.9 socket + 0.0 11081 3 3693.7 3759.0 1860 5462 1801.9 pipe2 + 0.0 7859 7 1122.7 1066.0 1003 1417 145.0 fcntl + 0.0 5869 2 2934.5 2934.5 2184 3685 1061.4 pthread_cond_broadcast + 0.0 5356 1 5356.0 5356.0 5356 5356 0.0 fread + 0.0 2275 1 2275.0 2275.0 2275 2275 0.0 bind + +[5/8] Executing 'cuda_api_sum' stats report + + Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name + -------- --------------- --------- ---------- --------- -------- --------- ----------- --------------------------------- + 97.6 466928783 5 93385756.6 
5477.0 3995 466790440 208739570.3 cudaMalloc + 1.0 4683757 5 936751.4 274816.0 49882 2417095 1104288.5 cudaMemcpy + 0.5 2476208 1 2476208.0 2476208.0 2476208 2476208 0.0 cudaLaunchKernel + 0.5 2325002 1 2325002.0 2325002.0 2325002 2325002 0.0 cudaDeviceSynchronize + 0.2 1119916 5 223983.2 28689.0 6410 923241 395735.0 cudaFree + 0.2 968121 1 968121.0 968121.0 968121 968121 0.0 cudaGetDeviceProperties_v2_v12000 + 0.0 25180 2 12590.0 12590.0 8814 16366 5340.1 cudaEventRecord + 0.0 11778 2 5889.0 5889.0 793 10985 7206.8 cudaEventCreate + 0.0 2432 2 1216.0 1216.0 434 1998 1105.9 cudaEventDestroy + 0.0 755 1 755.0 755.0 755 755 0.0 cuModuleGetLoadingMode + +[6/8] Executing 'cuda_gpu_kern_sum' stats report + + Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name + -------- --------------- --------- -------- -------- -------- -------- ----------- ---------------------------------------------------------------------------------------------------- + 100.0 10272 1 10272.0 10272.0 10272 10272 0.0 void ::dequant_kernel<__half>(const unsigned char *, const unsigned char *, const float *,… + +[7/8] Executing 'cuda_gpu_mem_time_sum' stats report + + Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation + -------- --------------- ----- -------- -------- -------- -------- ----------- ---------------------------- + 66.6 84833 1 84833.0 84833.0 84833 84833 0.0 [CUDA memcpy Device-to-Host] + 33.4 42624 4 10656.0 3536.0 1984 33568 15310.3 [CUDA memcpy Host-to-Device] + +[8/8] Executing 'cuda_gpu_mem_size_sum' stats report + + Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation + ---------- ----- -------- -------- -------- -------- ----------- ---------------------------- + 2.097 1 2.097 2.097 2.097 2.097 0.000 [CUDA memcpy Device-to-Host] + 0.542 4 0.135 0.009 0.000 0.524 0.259 [CUDA memcpy Host-to-Device] + +Generated: + /home/qtc_yu/nf4_project/profile_report.nsys-rep + 
/home/qtc_yu/nf4_project/profile_report.sqlite +~~~ + +## Step5-BandwidthCalc +- Time: 2026-03-10 22:27:35 +- Status: SUCCESS +- Command: python3 -c " +import struct, os, time + +# Theoretical A100 HBM2e bandwidth: ~1935 GB/s +# Our kernel reads: num_pairs bytes (packed) + num_blocks bytes (absmax_q) +# + num_groups*2 bytes (absmax2) + 256*2 bytes (code2) +# Our kernel writes: numel * 2 bytes (fp16 output) + +rows, cols, blocksize = 1024, 1024, 64 +numel = rows * cols +num_pairs = (numel + 1) // 2 +num_blocks = (numel + blocksize - 1) // blocksize +num_groups = (num_blocks + 255) // 256 + +bytes_read = num_pairs + num_blocks + num_groups * 2 + 256 * 2 +bytes_write = numel * 2 +total_bytes = bytes_read + bytes_write + +print(f'Data movement analysis (1024x1024, fp16 output):') +print(f' Read packed_weights : {num_pairs/1024:.0f} KB') +print(f' Read absmax_q : {num_blocks/1024:.0f} KB') +print(f' Read absmax2+code2 : {(num_groups*2+512)/1024:.2f} KB') +print(f' Write fp16 output : {bytes_write/1024/1024:.1f} MB') +print(f' Total data movement : {total_bytes/1024/1024:.2f} MB') +print(f' A100 peak bandwidth : 1935 GB/s') +print(f' Theoretical min time : {total_bytes/1935e9*1000:.3f} ms') +print(f' (if nsys shows >2x this, there is optimization headroom)') +" + +~~~text +Data movement analysis (1024x1024, fp16 output): + Read packed_weights : 512 KB + Read absmax_q : 16 KB + Read absmax2+code2 : 0.62 KB + Write fp16 output : 2.0 MB + Total data movement : 2.52 MB + A100 peak bandwidth : 1935 GB/s + Theoretical min time : 0.001 ms + (if nsys shows >2x this, there is optimization headroom) +~~~ + +## Step0-CheckEnv +- Time: 2026-03-11 16:56:22 +- Status: SUCCESS +- Command: echo NVCC=/usr/local/cuda/bin/nvcc && /usr/local/cuda/bin/nvcc --version && echo 'nvcc OK' && python3 --version + +~~~text +NVCC=/usr/local/cuda/bin/nvcc +nvcc: NVIDIA (R) Cuda compiler driver +Copyright (c) 2005-2025 NVIDIA Corporation +Built on Wed_Jan_15_19:20:09_PST_2025 +Cuda compilation tools, 
release 12.8, V12.8.61 +Build cuda_12.8.r12.8/compiler.35404655_0 +nvcc OK +Python 3.12.3 +~~~ + +## Step1-GenerateData +- Time: 2026-03-11 16:56:23 +- Status: SUCCESS +- Command: python3 generate_nf4_bin.py --rows 1024 --cols 1024 --blocksize 64 --output sample_nf4.bin + +~~~text +Generating data: 1024x1024 (numel=1048576) + blocksize=64 + num_pairs=524288 + num_blocks=16384 + num_groups=64 +Saved to sample_nf4.bin +~~~ + +## Step2-BuildCUDA +- Time: 2026-03-11 16:56:26 +- Status: SUCCESS +- Command: /usr/local/cuda/bin/nvcc -O3 -std=c++17 -arch=sm_80 -lineinfo -o ./nf4_dequant main.cpp dequant_kernel.cu + +~~~text + +~~~ + +## Step3-RunDequant-GPU +- Time: 2026-03-11 16:56:29 +- Status: SUCCESS +- Command: ./nf4_dequant sample_nf4.bin fp16 sample_out.bin + +~~~text +Using device 0: NVIDIA A100-SXM4-80GB (Compute Capability 8.0) +Loaded: 1024x1024 blocksize=64 offset=0.0429335 +GPU launch: numel=1048576 pairs=524288 blocks=16384 groups=64 +[v2] Kernel time : 2.4392 ms | Bandwidth : 1.08169 GB/s (0.0559014% of A100 peak 1935 GB/s) +[v3] Kernel time : 0.024768 ms | Bandwidth : 106.527 GB/s (5.50528% of A100 peak 1935 GB/s) +[v3 speedup vs v2]: 98.4819x +[v4] Kernel time : 0.017472 ms | Bandwidth : 151.011 GB/s (7.80419% of A100 peak 1935 GB/s) +[v4 speedup vs v2]: 139.606x +[v4 speedup vs v3]: 1.41758x | occupancy block=128 min_grid=1296 +MAE (v4 GPU vs CPU ref): 2.25737e-05 ✓ PASS +rows=1024 cols=1024 blocksize=64 mae=2.25737e-05 +~~~ + +## Step4-Profile-nsys +- Time: 2026-03-11 16:56:44 +- Status: SUCCESS +- Command: nsys profile -o profile_report -f true --stats=true --cuda-memory-usage=true ./nf4_dequant sample_nf4.bin fp16 sample_out_profile.bin + +~~~text +Using device 0: NVIDIA A100-SXM4-80GB (Compute Capability 8.0) +Loaded: 1024x1024 blocksize=64 offset=0.0429335 +GPU launch: numel=1048576 pairs=524288 blocks=16384 groups=64 +[v2] Kernel time : 2.44163 ms | Bandwidth : 1.08061 GB/s (0.0558457% of A100 peak 1935 GB/s) +[v3] Kernel time : 0.028192 ms | 
Bandwidth : 93.5891 GB/s (4.83665% of A100 peak 1935 GB/s)
+[v3 speedup vs v2]: 86.6073x
+[v4] Kernel time : 0.021504 ms | Bandwidth : 122.696 GB/s (6.3409% of A100 peak 1935 GB/s)
+[v4 speedup vs v2]: 113.543x
+[v4 speedup vs v3]: 1.31101x | occupancy block=128 min_grid=1296
+MAE (v4 GPU vs CPU ref): 2.25737e-05 ✓ PASS
+rows=1024 cols=1024 blocksize=64 mae=2.25737e-05
+Collecting data...
+Generating '/tmp/nsys-report-d5e1.qdstrm'
+ [1/8] [========================100%] profile_report.nsys-rep
+ [2/8] [========================100%] profile_report.sqlite
+SKIPPED: /home/qtc_yu/nf4_project/profile_report.sqlite does not contain NV Tools Extension (NVTX) data.
+[3/8] Executing 'nvtx_sum' stats report +[4/8] Executing 'osrt_sum' stats report + + Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name + -------- --------------- --------- ---------- ----------- -------- --------- ----------- ---------------------- + 51.9 2939595733 38 77357782.4 100114525.5 199227 233531412 47228401.9 poll + 48.0 2717151410 1616 1681405.6 25058.5 1088 735265052 25629204.6 ioctl + 0.0 2301582 43 53525.2 12216.0 5070 1352339 204348.0 mmap64 + 0.0 2299925 1 2299925.0 2299925.0 2299925 2299925 0.0 writev + 0.0 751805 26 28915.6 1624.0 1005 704577 137812.6 fclose + 0.0 657983 10 65798.3 60865.0 33136 114236 25911.3 sem_timedwait + 0.0 622080 118 5271.9 4083.0 1938 28789 3942.8 open64 + 0.0 581144 140 4151.0 1980.0 1005 63631 6740.3 fopen + 0.0 295241 2 147620.5 147620.5 131424 163817 22905.3 pthread_create + 0.0 175301 12 14608.4 1975.5 1033 144614 40999.9 read + 0.0 160422 13 12340.2 6899.0 1634 80761 20947.7 mmap + 0.0 154848 1 154848.0 154848.0 154848 154848 0.0 pthread_cond_wait + 0.0 74526 11 6775.1 7032.0 2977 10332 2308.0 write + 0.0 50630 6 8438.3 7145.5 2686 18430 5594.4 fflush + 0.0 42885 1 42885.0 42885.0 42885 42885 0.0 fgets + 0.0 33466 6 5577.7 5723.5 2735 7544 1791.0 open + 0.0 28800 5 5760.0 2331.0 1554 20204 8082.4 fwrite + 0.0 15908 3 5302.7 4291.0 2813 8804 3121.0 munmap + 0.0 13623 3 4541.0 4190.0 2060 7373 2673.8 pipe2 + 0.0 12326 2 6163.0 6163.0 5102 7224 1500.5 socket + 0.0 10255 2 5127.5 5127.5 1508 8747 5118.7 pthread_cond_broadcast + 0.0 10135 1 10135.0 10135.0 10135 10135 0.0 connect + 0.0 5859 5 1171.8 1141.0 1072 1320 93.5 fcntl + 0.0 5162 1 5162.0 5162.0 5162 5162 0.0 fread + 0.0 2403 1 2403.0 2403.0 2403 2403 0.0 bind + +[5/8] Executing 'cuda_api_sum' stats report + + Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name + -------- --------------- --------- ----------- --------- -------- ---------- ------------ --------------------------------- + 
99.1 2675449092 5 535089818.4 6354.0 5265 2675288598 1196407490.6 cudaMalloc + 0.5 12855125 5 2571025.0 1721985.0 51727 8553405 3497750.8 cudaMemcpy + 0.3 6915921 3 2305307.0 2371967.0 2170818 2373136 116472.4 cudaEventSynchronize + 0.1 2694946 3 898315.3 41029.0 6375 2647542 1514973.8 cudaLaunchKernel + 0.0 1057945 1 1057945.0 1057945.0 1057945 1057945 0.0 cudaGetDeviceProperties_v2_v12000 + 0.0 467458 5 93491.6 39587.0 6203 291255 120389.5 cudaFree + 0.0 40459 6 6743.2 5736.0 3726 13570 3564.7 cudaEventRecord + 0.0 17044 6 2840.7 826.5 578 11807 4445.4 cudaEventCreate + 0.0 8885 1 8885.0 8885.0 8885 8885 0.0 cudaDeviceSynchronize + 0.0 4650 6 775.0 677.0 426 1255 329.2 cudaEventDestroy + 0.0 1010 1 1010.0 1010.0 1010 1010 0.0 cuModuleGetLoadingMode + +[6/8] Executing 'cuda_gpu_kern_sum' stats report + + Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name + -------- --------------- --------- -------- -------- -------- -------- ----------- ---------------------------------------------------------------------------------------------------- + 37.2 13344 1 13344.0 13344.0 13344 13344 0.0 void ::dequant_kernel_v4<__half>(const unsigned char *, const unsigned char *, const float… + 35.2 12640 1 12640.0 12640.0 12640 12640 0.0 void ::dequant_kernel_v3<__half>(const unsigned char *, const unsigned char *, const float… + 27.6 9888 1 9888.0 9888.0 9888 9888 0.0 void ::dequant_kernel<__half>(const unsigned char *, const unsigned char *, const float *,… + +[7/8] Executing 'cuda_gpu_mem_time_sum' stats report + + Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation + -------- --------------- ----- -------- -------- -------- -------- ----------- ---------------------------- + 63.8 84897 1 84897.0 84897.0 84897 84897 0.0 [CUDA memcpy Device-to-Host] + 36.2 48192 4 12048.0 3456.0 1984 39296 18182.8 [CUDA memcpy Host-to-Device] + +[8/8] Executing 'cuda_gpu_mem_size_sum' stats report + + Total (MB) Count Avg 
(MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation + ---------- ----- -------- -------- -------- -------- ----------- ---------------------------- + 2.097 1 2.097 2.097 2.097 2.097 0.000 [CUDA memcpy Device-to-Host] + 0.542 4 0.135 0.009 0.000 0.524 0.259 [CUDA memcpy Host-to-Device] + +Generated: + /home/qtc_yu/nf4_project/profile_report.nsys-rep + /home/qtc_yu/nf4_project/profile_report.sqlite +~~~ + +## Step5-BandwidthCalc +- Time: 2026-03-11 16:56:44 +- Status: SUCCESS +- Command: python3 -c " +import struct, os, time + +# Theoretical A100 HBM2e bandwidth: ~1935 GB/s +# Our kernel reads: num_pairs bytes (packed) + num_blocks bytes (absmax_q) +# + num_groups*2 bytes (absmax2) + 256*2 bytes (code2) +# Our kernel writes: numel * 2 bytes (fp16 output) + +rows, cols, blocksize = 1024, 1024, 64 +numel = rows * cols +num_pairs = (numel + 1) // 2 +num_blocks = (numel + blocksize - 1) // blocksize +num_groups = (num_blocks + 255) // 256 + +bytes_read = num_pairs + num_blocks + num_groups * 2 + 256 * 2 +bytes_write = numel * 2 +total_bytes = bytes_read + bytes_write + +print(f'Data movement analysis (1024x1024, fp16 output):') +print(f' Read packed_weights : {num_pairs/1024:.0f} KB') +print(f' Read absmax_q : {num_blocks/1024:.0f} KB') +print(f' Read absmax2+code2 : {(num_groups*2+512)/1024:.2f} KB') +print(f' Write fp16 output : {bytes_write/1024/1024:.1f} MB') +print(f' Total data movement : {total_bytes/1024/1024:.2f} MB') +print(f' A100 peak bandwidth : 1935 GB/s') +print(f' Theoretical min time : {total_bytes/1935e9*1000:.3f} ms') +print(f' (if nsys shows >2x this, there is optimization headroom)') +" + +~~~text +Data movement analysis (1024x1024, fp16 output): + Read packed_weights : 512 KB + Read absmax_q : 16 KB + Read absmax2+code2 : 0.62 KB + Write fp16 output : 2.0 MB + Total data movement : 2.52 MB + A100 peak bandwidth : 1935 GB/s + Theoretical min time : 0.001 ms + (if nsys shows >2x this, there is optimization headroom) +~~~ + +## Step0-CheckEnv +- 
Time: 2026-03-11 17:27:31 +- Status: SUCCESS +- Command: echo NVCC=/usr/local/cuda/bin/nvcc && /usr/local/cuda/bin/nvcc --version && echo 'nvcc OK' && python3 --version + +~~~text +NVCC=/usr/local/cuda/bin/nvcc +nvcc: NVIDIA (R) Cuda compiler driver +Copyright (c) 2005-2025 NVIDIA Corporation +Built on Wed_Jan_15_19:20:09_PST_2025 +Cuda compilation tools, release 12.8, V12.8.61 +Build cuda_12.8.r12.8/compiler.35404655_0 +nvcc OK +Python 3.12.3 +~~~ + +## Step1-GenerateData +- Time: 2026-03-11 17:27:31 +- Status: SUCCESS +- Command: python3 generate_nf4_bin.py --rows 1024 --cols 1024 --blocksize 64 --output sample_nf4.bin + +~~~text +Generating data: 1024x1024 (numel=1048576) + blocksize=64 + num_pairs=524288 + num_blocks=16384 + num_groups=64 +Saved to sample_nf4.bin +~~~ + +## Step2-BuildCUDA +- Time: 2026-03-11 17:27:34 +- Status: SUCCESS +- Command: /usr/local/cuda/bin/nvcc -O3 -std=c++17 -arch=sm_80 -lineinfo -o ./nf4_dequant main.cpp dequant_kernel.cu + +~~~text + +~~~ + +## Step3-RunDequant-GPU +- Time: 2026-03-11 17:27:37 +- Status: SUCCESS +- Command: ./nf4_dequant sample_nf4.bin fp16 sample_out.bin + +~~~text +Using device 0: NVIDIA A100-SXM4-80GB (Compute Capability 8.0) +Loaded: 1024x1024 blocksize=64 offset=0.0429335 +GPU launch: numel=1048576 pairs=524288 blocks=16384 groups=64 +[v2] Kernel time : 2.4385 ms | Bandwidth : 1.082 GB/s (0.0559176% of A100 peak 1935 GB/s) +[v3] Kernel time : 0.024736 ms | Bandwidth : 106.665 GB/s (5.5124% of A100 peak 1935 GB/s) +[v3 speedup vs v2]: 98.5809x +[v4] Kernel time : 0.017632 ms | Bandwidth : 149.641 GB/s (7.73337% of A100 peak 1935 GB/s) +[v4 speedup vs v2]: 138.299x +[v4 speedup vs v3]: 1.4029x | occupancy block=128 min_grid=1296 +MAE (v4 GPU vs CPU ref): 2.25737e-05 ✓ PASS +rows=1024 cols=1024 blocksize=64 mae=2.25737e-05 +~~~ + +## Step4-Profile-nsys +- Time: 2026-03-11 17:27:52 +- Status: SUCCESS +- Command: nsys profile -o profile_report -f true --stats=true --cuda-memory-usage=true ./nf4_dequant 
sample_nf4.bin fp16 sample_out_profile.bin
+
+~~~text
+Using device 0: NVIDIA A100-SXM4-80GB (Compute Capability 8.0)
+Loaded: 1024x1024 blocksize=64 offset=0.0429335
+GPU launch: numel=1048576 pairs=524288 blocks=16384 groups=64
+[v2] Kernel time : 2.44227 ms | Bandwidth : 1.08033 GB/s (0.0558311% of A100 peak 1935 GB/s)
+[v3] Kernel time : 0.027648 ms | Bandwidth : 95.4306 GB/s (4.93181% of A100 peak 1935 GB/s)
+[v3 speedup vs v2]: 88.3345x
+[v4] Kernel time : 0.020544 ms | Bandwidth : 128.43 GB/s (6.6372% of A100 peak 1935 GB/s)
+[v4 speedup vs v2]: 118.88x
+[v4 speedup vs v3]: 1.34579x | occupancy block=128 min_grid=1296
+MAE (v4 GPU vs CPU ref): 2.25737e-05 ✓ PASS
+rows=1024 cols=1024 blocksize=64 mae=2.25737e-05
+Collecting data...
+Generating '/tmp/nsys-report-e3f0.qdstrm'
+ [1/8] profile_report.nsys-rep [100%]
+ [2/8] profile_report.sqlite [100%]
+SKIPPED: 
/home/qtc_yu/nf4_project/profile_report.sqlite does not contain NV Tools Extension (NVTX) data. +[3/8] Executing 'nvtx_sum' stats report +[4/8] Executing 'osrt_sum' stats report + + Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name + -------- --------------- --------- ---------- ---------- -------- --------- ----------- ---------------------- + 76.2 2851207041 1615 1765453.3 119973.0 1311 501696372 21175201.6 ioctl + 23.5 880003946 16 55000246.6 22579085.5 1102 470726611 114966907.9 poll + 0.1 2850024 43 66279.6 10821.0 5307 1906548 288230.1 mmap64 + 0.0 1831231 1 1831231.0 1831231.0 1831231 1831231 0.0 writev + 0.0 1321864 131 10090.6 2920.0 1001 667826 58152.8 fopen + 0.0 717257 118 6078.4 5004.0 1440 18302 3177.1 open64 + 0.0 593912 10 59391.2 60245.0 20423 97386 27074.9 sem_timedwait + 0.0 533611 53 10068.1 1697.0 1012 435840 59614.8 fclose + 0.0 286950 2 143475.0 143475.0 126651 160299 23792.7 pthread_create + 0.0 138102 14 9864.4 1553.5 1053 110024 28873.8 read + 0.0 136459 13 10496.8 6042.0 1803 62598 15888.9 mmap + 0.0 97277 1 97277.0 97277.0 97277 97277 0.0 pthread_cond_wait + 0.0 67131 11 6102.8 6157.0 3404 8443 1890.5 write + 0.0 48672 5 9734.4 10347.0 6606 12032 2486.1 fflush + 0.0 29335 1 29335.0 29335.0 29335 29335 0.0 fgets + 0.0 25553 5 5110.6 4460.0 2276 8396 2474.8 open + 0.0 14788 5 2957.6 1638.0 1153 8174 2950.3 fwrite + 0.0 10659 3 3553.0 3854.0 1473 5332 1947.0 pipe2 + 0.0 10605 2 5302.5 5302.5 4458 6147 1194.3 socket + 0.0 10389 2 5194.5 5194.5 2003 8386 4513.5 pthread_cond_broadcast + 0.0 10091 3 3363.7 3285.0 3273 3533 146.8 munmap + 0.0 8491 1 8491.0 8491.0 8491 8491 0.0 connect + 0.0 4117 1 4117.0 4117.0 4117 4117 0.0 fread + 0.0 2696 2 1348.0 1348.0 1084 1612 373.4 fcntl + 0.0 2675 1 2675.0 2675.0 2675 2675 0.0 bind + +[5/8] Executing 'cuda_api_sum' stats report + + Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name + -------- --------------- --------- ---------- 
--------- -------- --------- ----------- --------------------------------- + 96.8 436058160 5 87211632.0 5193.0 4069 435177141 194518991.7 cudaMalloc + 1.6 7040692 3 2346897.3 2380570.0 2261570 2398552 74440.6 cudaEventSynchronize + 0.7 3312570 5 662514.0 378321.0 42763 2418461 994452.7 cudaMemcpy + 0.6 2602670 3 867556.7 31432.0 5296 2565942 1470902.9 cudaLaunchKernel + 0.2 984173 1 984173.0 984173.0 984173 984173 0.0 cudaGetDeviceProperties_v2_v12000 + 0.1 368056 5 73611.2 20222.0 5300 228095 96551.4 cudaFree + 0.0 30610 6 5101.7 4715.5 3057 8427 1872.8 cudaEventRecord + 0.0 16552 6 2758.7 624.5 372 13271 5160.2 cudaEventCreate + 0.0 5037 1 5037.0 5037.0 5037 5037 0.0 cudaDeviceSynchronize + 0.0 2815 6 469.2 472.0 245 780 193.1 cudaEventDestroy + 0.0 1281 1 1281.0 1281.0 1281 1281 0.0 cuModuleGetLoadingMode + +[6/8] Executing 'cuda_gpu_kern_sum' stats report + + Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name + -------- --------------- --------- -------- -------- -------- -------- ----------- ---------------------------------------------------------------------------------------------------- + 36.6 12928 1 12928.0 12928.0 12928 12928 0.0 void ::dequant_kernel_v3<__half>(const unsigned char *, const unsigned char *, const float… + 35.4 12512 1 12512.0 12512.0 12512 12512 0.0 void ::dequant_kernel_v4<__half>(const unsigned char *, const unsigned char *, const float… + 28.0 9888 1 9888.0 9888.0 9888 9888 0.0 void ::dequant_kernel<__half>(const unsigned char *, const unsigned char *, const float *,… + +[7/8] Executing 'cuda_gpu_mem_time_sum' stats report + + Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation + -------- --------------- ----- -------- -------- -------- -------- ----------- ---------------------------- + 68.9 85409 1 85409.0 85409.0 85409 85409 0.0 [CUDA memcpy Device-to-Host] + 31.1 38528 4 9632.0 3376.0 1984 29792 13468.8 [CUDA memcpy Host-to-Device] + +[8/8] Executing 
'cuda_gpu_mem_size_sum' stats report + + Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation + ---------- ----- -------- -------- -------- -------- ----------- ---------------------------- + 2.097 1 2.097 2.097 2.097 2.097 0.000 [CUDA memcpy Device-to-Host] + 0.542 4 0.135 0.009 0.000 0.524 0.259 [CUDA memcpy Host-to-Device] + +Generated: + /home/qtc_yu/nf4_project/profile_report.nsys-rep + /home/qtc_yu/nf4_project/profile_report.sqlite +~~~ + +## Step5-BandwidthCalc +- Time: 2026-03-11 17:27:52 +- Status: SUCCESS +- Command: python3 -c " +import struct, os, time + +# Theoretical A100 HBM2e bandwidth: ~1935 GB/s +# Our kernel reads: num_pairs bytes (packed) + num_blocks bytes (absmax_q) +# + num_groups*2 bytes (absmax2) + 256*2 bytes (code2) +# Our kernel writes: numel * 2 bytes (fp16 output) + +rows, cols, blocksize = 1024, 1024, 64 +numel = rows * cols +num_pairs = (numel + 1) // 2 +num_blocks = (numel + blocksize - 1) // blocksize +num_groups = (num_blocks + 255) // 256 + +bytes_read = num_pairs + num_blocks + num_groups * 2 + 256 * 2 +bytes_write = numel * 2 +total_bytes = bytes_read + bytes_write + +print(f'Data movement analysis (1024x1024, fp16 output):') +print(f' Read packed_weights : {num_pairs/1024:.0f} KB') +print(f' Read absmax_q : {num_blocks/1024:.0f} KB') +print(f' Read absmax2+code2 : {(num_groups*2+512)/1024:.2f} KB') +print(f' Write fp16 output : {bytes_write/1024/1024:.1f} MB') +print(f' Total data movement : {total_bytes/1024/1024:.2f} MB') +print(f' A100 peak bandwidth : 1935 GB/s') +print(f' Theoretical min time : {total_bytes/1935e9*1000:.3f} ms') +print(f' (if nsys shows >2x this, there is optimization headroom)') +" + +~~~text +Data movement analysis (1024x1024, fp16 output): + Read packed_weights : 512 KB + Read absmax_q : 16 KB + Read absmax2+code2 : 0.62 KB + Write fp16 output : 2.0 MB + Total data movement : 2.52 MB + A100 peak bandwidth : 1935 GB/s + Theoretical min time : 0.001 ms + (if nsys shows >2x this, 
there is optimization headroom) +~~~ + +## Step6-InstallBnB +- Time: 2026-03-11 17:27:54 +- Status: SUCCESS +- Command: pip install bitsandbytes || true + +~~~text + +[notice] A new release of pip is available: 24.3.1 -> 26.0.1 +[notice] To update, run: python -m pip install --upgrade pip +error: externally-managed-environment + +× This environment is externally managed +╰─> To install Python packages system-wide, try apt install + python3-xyz, where xyz is the package you are trying to + install. + + If you wish to install a non-Debian-packaged Python package, + create a virtual environment using python3 -m venv path/to/venv. + Then use path/to/venv/bin/python and path/to/venv/bin/pip. Make + sure you have python3-full installed. + + If you wish to install a non-Debian packaged Python application, + it may be easiest to use pipx install xyz, which will manage a + virtual environment for you. Make sure you have pipx installed. + + See /usr/share/doc/python3.12/README.venv for more information. + +note: If you believe this is a mistake, please contact your Python installation or OS distribution provider. You can override this, at the risk of breaking your Python installation or OS, by passing --break-system-packages. +hint: See PEP 668 for the detailed specification. +~~~ + +## Step6-BenchmarkBnB +- Time: 2026-03-11 17:27:56 +- Status: SUCCESS +- Command: python3 benchmark_vs_bnb.py + +~~~text +bitsandbytes not installed. 
Run: pip install bitsandbytes +~~~ + +## Step0-CheckEnv +- Time: 2026-03-11 18:01:42 +- Status: SUCCESS +- Command: echo NVCC=/usr/local/cuda/bin/nvcc && /usr/local/cuda/bin/nvcc --version && echo 'nvcc OK' && python3 --version + +~~~text +NVCC=/usr/local/cuda/bin/nvcc +nvcc: NVIDIA (R) Cuda compiler driver +Copyright (c) 2005-2025 NVIDIA Corporation +Built on Wed_Jan_15_19:20:09_PST_2025 +Cuda compilation tools, release 12.8, V12.8.61 +Build cuda_12.8.r12.8/compiler.35404655_0 +nvcc OK +Python 3.12.3 +~~~ + +## Step1-GenerateData +- Time: 2026-03-11 18:01:42 +- Status: SUCCESS +- Command: python3 generate_nf4_bin.py --rows 1024 --cols 1024 --blocksize 64 --output sample_nf4.bin + +~~~text +Generating data: 1024x1024 (numel=1048576) + blocksize=64 + num_pairs=524288 + num_blocks=16384 + num_groups=64 +Saved to sample_nf4.bin +~~~ + +## Step2-BuildCUDA +- Time: 2026-03-11 18:01:45 +- Status: SUCCESS +- Command: /usr/local/cuda/bin/nvcc -O3 -std=c++17 -arch=sm_80 -lineinfo -o ./nf4_dequant main.cpp dequant_kernel.cu + +~~~text + +~~~ + +## Step3-RunDequant-GPU +- Time: 2026-03-11 18:01:48 +- Status: SUCCESS +- Command: ./nf4_dequant sample_nf4.bin fp16 sample_out.bin + +~~~text +Using device 0: NVIDIA A100-SXM4-80GB (Compute Capability 8.0) +Loaded: 1024x1024 blocksize=64 offset=0.0429335 +GPU launch: numel=1048576 pairs=524288 blocks=16384 groups=64 +[v2] Kernel time : 2.44122 ms | Bandwidth : 1.0808 GB/s (0.0558552% of A100 peak 1935 GB/s) +[v3] Kernel time : 0.023552 ms | Bandwidth : 112.027 GB/s (5.78952% of A100 peak 1935 GB/s) +[v3 speedup vs v2]: 103.652x +[v4] Kernel time : 0.017408 ms | Bandwidth : 151.566 GB/s (7.83288% of A100 peak 1935 GB/s) +[v4 speedup vs v2]: 140.235x +[v4 speedup vs v3]: 1.35294x | occupancy block=128 min_grid=1296 +MAE (v4 GPU vs CPU ref): 2.25737e-05 ✓ PASS +rows=1024 cols=1024 blocksize=64 mae=2.25737e-05 +~~~ + +## Step4-Profile-nsys +- Time: 2026-03-11 18:02:00 +- Status: SUCCESS +- Command: nsys profile -o profile_report -f true 
--stats=true --cuda-memory-usage=true ./nf4_dequant sample_nf4.bin fp16 sample_out_profile.bin
+
+~~~text
+Using device 0: NVIDIA A100-SXM4-80GB (Compute Capability 8.0)
+Loaded: 1024x1024 blocksize=64 offset=0.0429335
+GPU launch: numel=1048576 pairs=524288 blocks=16384 groups=64
+[v2] Kernel time : 2.44509 ms | Bandwidth : 1.07909 GB/s (0.0557668% of A100 peak 1935 GB/s)
+[v3] Kernel time : 0.027808 ms | Bandwidth : 94.8815 GB/s (4.90344% of A100 peak 1935 GB/s)
+[v3 speedup vs v2]: 87.9275x
+[v4] Kernel time : 0.0208 ms | Bandwidth : 126.849 GB/s (6.55552% of A100 peak 1935 GB/s)
+[v4 speedup vs v2]: 117.552x
+[v4 speedup vs v3]: 1.33692x | occupancy block=128 min_grid=1296
+MAE (v4 GPU vs CPU ref): 2.25737e-05 ✓ PASS
+rows=1024 cols=1024 blocksize=64 mae=2.25737e-05
+Collecting data...
+Generating '/tmp/nsys-report-3e9a.qdstrm'
+ [1/8] profile_report.nsys-rep [100%]
+ [2/8] profile_report.sqlite [100%]
+SKIPPED: 
/home/qtc_yu/nf4_project/profile_report.sqlite does not contain NV Tools Extension (NVTX) data. +[3/8] Executing 'nvtx_sum' stats report +[4/8] Executing 'osrt_sum' stats report + + Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name + -------- --------------- --------- ----------- ---------- -------- ---------- ----------- ---------------------- + 51.0 2753430245 15 183562016.3 26656737.0 1631 2323704767 592888638.6 poll + 48.8 2631466158 1617 1627375.5 22843.0 1007 498039698 19268618.1 ioctl + 0.1 3693979 43 85906.5 13029.0 7274 2609045 394811.2 mmap64 + 0.0 1912594 1 1912594.0 1912594.0 1912594 1912594 0.0 writev + 0.0 704959 118 5974.2 4701.5 1527 19800 3408.4 open64 + 0.0 650469 134 4854.2 2232.0 1016 58307 7481.4 fopen + 0.0 590786 10 59078.6 60078.5 34958 91181 18134.5 sem_timedwait + 0.0 498560 33 15107.9 1839.0 1016 438830 76071.9 fclose + 0.0 309886 2 154943.0 154943.0 129716 180170 35676.4 pthread_create + 0.0 200881 13 15452.4 1431.0 1033 167912 45873.3 read + 0.0 165861 6 27643.5 13995.5 6489 86753 30575.8 fflush + 0.0 145824 13 11217.2 7556.0 1987 55220 13864.4 mmap + 0.0 87185 1 87185.0 87185.0 87185 87185 0.0 pthread_cond_wait + 0.0 69919 11 6356.3 6637.0 3504 8158 1675.0 write + 0.0 29613 1 29613.0 29613.0 29613 29613 0.0 fgets + 0.0 24208 5 4841.6 5185.0 3032 5932 1231.7 open + 0.0 18867 3 6289.0 4879.0 3101 10887 4080.0 munmap + 0.0 13821 3 4607.0 4305.0 2910 6606 1866.4 pipe2 + 0.0 12873 4 3218.3 1830.0 1056 8157 3333.9 fwrite + 0.0 10708 2 5354.0 5354.0 5340 5368 19.8 socket + 0.0 7847 1 7847.0 7847.0 7847 7847 0.0 connect + 0.0 6510 2 3255.0 3255.0 1904 4606 1910.6 pthread_cond_broadcast + 0.0 5520 4 1380.0 1235.0 1097 1953 387.6 fcntl + 0.0 4893 1 4893.0 4893.0 4893 4893 0.0 fread + 0.0 2160 1 2160.0 2160.0 2160 2160 0.0 bind + +[5/8] Executing 'cuda_api_sum' stats report + + Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name + -------- --------------- --------- 
---------- --------- -------- --------- ----------- --------------------------------- + 96.8 480994048 5 96198809.6 5851.0 4040 480806247 215002105.9 cudaMalloc + 1.4 7030660 3 2343553.3 2396114.0 2234617 2399929 94360.9 cudaEventSynchronize + 0.9 4717894 5 943578.8 344488.0 44010 2418249 1094250.2 cudaMemcpy + 0.5 2634583 3 878194.3 29781.0 4915 2599887 1491081.4 cudaLaunchKernel + 0.2 962535 1 962535.0 962535.0 962535 962535 0.0 cudaGetDeviceProperties_v2_v12000 + 0.1 351933 5 70386.6 22088.0 5190 232265 96551.6 cudaFree + 0.0 31362 6 5227.0 4506.0 2791 9480 2448.1 cudaEventRecord + 0.0 13758 6 2293.0 599.5 334 9501 3613.2 cudaEventCreate + 0.0 5317 1 5317.0 5317.0 5317 5317 0.0 cudaDeviceSynchronize + 0.0 2872 6 478.7 446.5 230 963 266.5 cudaEventDestroy + 0.0 1317 1 1317.0 1317.0 1317 1317 0.0 cuModuleGetLoadingMode + +[6/8] Executing 'cuda_gpu_kern_sum' stats report + + Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name + -------- --------------- --------- -------- -------- -------- -------- ----------- ---------------------------------------------------------------------------------------------------- + 36.3 12768 1 12768.0 12768.0 12768 12768 0.0 void ::dequant_kernel_v3<__half>(const unsigned char *, const unsigned char *, const float… + 35.5 12512 1 12512.0 12512.0 12512 12512 0.0 void ::dequant_kernel_v4<__half>(const unsigned char *, const unsigned char *, const float… + 28.2 9920 1 9920.0 9920.0 9920 9920 0.0 void ::dequant_kernel<__half>(const unsigned char *, const unsigned char *, const float *,… + +[7/8] Executing 'cuda_gpu_mem_time_sum' stats report + + Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation + -------- --------------- ----- -------- -------- -------- -------- ----------- ---------------------------- + 67.1 84929 1 84929.0 84929.0 84929 84929 0.0 [CUDA memcpy Device-to-Host] + 32.9 41568 4 10392.0 3376.0 2016 32800 14964.0 [CUDA memcpy Host-to-Device] + +[8/8] 
Executing 'cuda_gpu_mem_size_sum' stats report + + Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation + ---------- ----- -------- -------- -------- -------- ----------- ---------------------------- + 2.097 1 2.097 2.097 2.097 2.097 0.000 [CUDA memcpy Device-to-Host] + 0.542 4 0.135 0.009 0.000 0.524 0.259 [CUDA memcpy Host-to-Device] + +Generated: + /home/qtc_yu/nf4_project/profile_report.nsys-rep + /home/qtc_yu/nf4_project/profile_report.sqlite +~~~ + +## Step5-BandwidthCalc +- Time: 2026-03-11 18:02:00 +- Status: SUCCESS +- Command: python3 -c " +import struct, os, time + +# Theoretical A100 HBM2e bandwidth: ~1935 GB/s +# Our kernel reads: num_pairs bytes (packed) + num_blocks bytes (absmax_q) +# + num_groups*2 bytes (absmax2) + 256*2 bytes (code2) +# Our kernel writes: numel * 2 bytes (fp16 output) + +rows, cols, blocksize = 1024, 1024, 64 +numel = rows * cols +num_pairs = (numel + 1) // 2 +num_blocks = (numel + blocksize - 1) // blocksize +num_groups = (num_blocks + 255) // 256 + +bytes_read = num_pairs + num_blocks + num_groups * 2 + 256 * 2 +bytes_write = numel * 2 +total_bytes = bytes_read + bytes_write + +print(f'Data movement analysis (1024x1024, fp16 output):') +print(f' Read packed_weights : {num_pairs/1024:.0f} KB') +print(f' Read absmax_q : {num_blocks/1024:.0f} KB') +print(f' Read absmax2+code2 : {(num_groups*2+512)/1024:.2f} KB') +print(f' Write fp16 output : {bytes_write/1024/1024:.1f} MB') +print(f' Total data movement : {total_bytes/1024/1024:.2f} MB') +print(f' A100 peak bandwidth : 1935 GB/s') +print(f' Theoretical min time : {total_bytes/1935e9*1000:.3f} ms') +print(f' (if nsys shows >2x this, there is optimization headroom)') +" + +~~~text +Data movement analysis (1024x1024, fp16 output): + Read packed_weights : 512 KB + Read absmax_q : 16 KB + Read absmax2+code2 : 0.62 KB + Write fp16 output : 2.0 MB + Total data movement : 2.52 MB + A100 peak bandwidth : 1935 GB/s + Theoretical min time : 0.001 ms + (if nsys shows >2x 
this, there is optimization headroom) +~~~ + +## Step6-InstallBnB +- Time: 2026-03-11 18:02:09 +- Status: SUCCESS +- Command: pip install bitsandbytes --break-system-packages || pip install --user bitsandbytes || true + +~~~text +Defaulting to user installation because normal site-packages is not writeable +DEPRECATION: Loading egg at /usr/local/lib/python3.12/dist-packages/dill-0.3.9-py3.12.egg is deprecated. pip 25.1 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330 +DEPRECATION: Loading egg at /usr/local/lib/python3.12/dist-packages/opt_einsum-3.4.0-py3.12.egg is deprecated. pip 25.1 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330 +DEPRECATION: Loading egg at /usr/local/lib/python3.12/dist-packages/looseversion-1.3.0-py3.12.egg is deprecated. pip 25.1 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330 +DEPRECATION: Loading egg at /usr/local/lib/python3.12/dist-packages/lightning_utilities-0.12.0.dev0-py3.12.egg is deprecated. pip 25.1 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330 +DEPRECATION: Loading egg at /usr/local/lib/python3.12/dist-packages/lightning_thunder-0.2.0.dev0-py3.12.egg is deprecated. pip 25.1 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330 +DEPRECATION: Loading egg at /usr/local/lib/python3.12/dist-packages/nvfuser-0.2.23a0+6627725-py3.12-linux-x86_64.egg is deprecated. pip 25.1 will enforce this behaviour change. 
A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330 +Collecting bitsandbytes + Downloading bitsandbytes-0.49.2-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB) +Requirement already satisfied: torch<3,>=2.3 in /usr/local/lib/python3.12/dist-packages (from bitsandbytes) (2.6.0a0+ecf3bae40a.nv25.1) +Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.12/dist-packages (from bitsandbytes) (1.26.4) +Requirement already satisfied: packaging>=20.9 in /usr/local/lib/python3.12/dist-packages (from bitsandbytes) (23.2) +Requirement already satisfied: filelock in /usr/local/lib/python3.12/dist-packages (from torch<3,>=2.3->bitsandbytes) (3.16.1) +Requirement already satisfied: typing-extensions>=4.10.0 in /usr/local/lib/python3.12/dist-packages (from torch<3,>=2.3->bitsandbytes) (4.12.2) +Requirement already satisfied: networkx in /usr/local/lib/python3.12/dist-packages (from torch<3,>=2.3->bitsandbytes) (3.4.2) +Requirement already satisfied: jinja2 in /usr/local/lib/python3.12/dist-packages (from torch<3,>=2.3->bitsandbytes) (3.1.4) +Requirement already satisfied: fsspec in /usr/local/lib/python3.12/dist-packages (from torch<3,>=2.3->bitsandbytes) (2024.10.0) +Requirement already satisfied: setuptools in /usr/local/lib/python3.12/dist-packages (from torch<3,>=2.3->bitsandbytes) (70.3.0) +Requirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.12/dist-packages (from torch<3,>=2.3->bitsandbytes) (1.13.1) +Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.12/dist-packages (from sympy==1.13.1->torch<3,>=2.3->bitsandbytes) (1.3.0) +Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.12/dist-packages (from jinja2->torch<3,>=2.3->bitsandbytes) (3.0.2) +Downloading bitsandbytes-0.49.2-py3-none-manylinux_2_24_x86_64.whl (60.7 MB) + ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60.7/60.7 MB 13.5 MB/s eta 0:00:00 +Installing 
collected packages: bitsandbytes +Successfully installed bitsandbytes-0.49.2 + +[notice] A new release of pip is available: 24.3.1 -> 26.0.1 +[notice] To update, run: python -m pip install --upgrade pip +~~~ + +## Step6-BenchmarkBnB +- Time: 2026-03-11 18:02:21 +- Status: SUCCESS +- Command: python3 benchmark_vs_bnb.py + +~~~text +Benchmarking bitsandbytes on NVIDIA A100-SXM4-80GB... +Warmup... +Benchmarking... +bitsandbytes dequantize_4bit (8192x8192, nf4, blocksize=64): + Time: 0.341 ms + Bandwidth: 492.41 GB/s +~~~
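
As a cross-check on the final comparison above, the Step5 data-movement model can be applied to the 8192x8192 benchmark shape to see what effective bandwidth the measured 0.341 ms implies. The helper below is a hypothetical sketch (it is not part of `benchmark_vs_bnb.py`); it simply reuses the same byte accounting as the Step5 command:

```python
def nf4_data_movement(rows, cols, blocksize=64):
    """Bytes moved by one NF4 dequant pass (fp16 output), per the Step5 model."""
    numel = rows * cols
    num_pairs = (numel + 1) // 2                        # packed weights: 1 byte per 2 elems
    num_blocks = (numel + blocksize - 1) // blocksize   # absmax_q: 1 byte per block
    num_groups = (num_blocks + 255) // 256              # absmax2: 2 bytes per group
    bytes_read = num_pairs + num_blocks + num_groups * 2 + 256 * 2  # + code2 table
    bytes_write = numel * 2                             # fp16 output
    return bytes_read + bytes_write

total = nf4_data_movement(8192, 8192)   # 168829440 bytes (~161 MiB)
bw = total / 0.341e-3 / 1e9             # effective bandwidth at the measured 0.341 ms
print(f"{total} bytes -> {bw:.1f} GB/s")  # -> ~495 GB/s
```

The ~495 GB/s this model implies is close to the 492.41 GB/s the benchmark reports, which suggests the benchmark script uses essentially the same byte accounting; the small gap is consistent with timing precision.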