diff --git a/03_nf4_dequant/SkyHigh-achieving/Final_Project_Report.md b/03_nf4_dequant/SkyHigh-achieving/Final_Project_Report.md
new file mode 100644
index 0000000..20c7aae
--- /dev/null
+++ b/03_nf4_dequant/SkyHigh-achieving/Final_Project_Report.md
@@ -0,0 +1,123 @@
+# NF4 Quantization Operator Optimization: Final Project Report
+
+## 1. Project Overview
+
+This project implements and optimizes a high-performance CUDA kernel that dequantizes NF4 (Normal Float 4-bit) weights to FP16/BF16, a core operator in QLoRA inference for large language models (LLMs). Beyond functional correctness (double quantization, arbitrary matrix shapes), we tuned the kernel in depth on an NVIDIA A100, reaching **317 GB/s**, or **64%** of the industrial-grade `bitsandbytes` implementation.
+
+---
+
+## 2. Implementation & Verification
+
+### 2.1 Core Functionality
+Code location: [dequant_kernel.cu](file:///d:/thu-project/Learning-CUDA-master/Learning-CUDA-master/nf4/dequant_kernel.cu) (v4 implementation)
+
+We implemented the following strictly to the `bitsandbytes` specification:
+
+1. **NF4 Lookup Table**:
+   - The 16 predefined normal-distribution quantiles are stored in `__device__ __constant__` memory.
+   - **Optimization**: 16 floats occupy only 64 bytes and fit entirely in the constant cache, so uniform warp-wide lookups are served with very low latency.
+
+2. **Double Quantization Scaling**:
+   - Formula: `w = NF4[idx] * (code2[absmax_q] * absmax2) + offset`
+   - Two-level scaling: the first level maps the quantized `absmax_q` (uint8) to a float via a lookup table; the second level applies `absmax2` (float) as the group-level scale.
+
+3. **Vectorized Memory Access (Packed Store)**:
+   - **Load**: each thread reads 1 `uint8` (containing 2 NF4 indices).
+   - **Compute**: decodes 2 FP16/BF16 values.
+   - **Store**: `reinterpret_cast` packs the two 16-bit results into a single 32-bit store instruction.
+   - **Benefit**: halves the number of global-memory store instructions, markedly improving store efficiency.
+
+4. **Boundary Handling**:
+   - The kernel indexes over a flat 1D `numel`, so it naturally supports matrices of any shape (rows/cols).
+   - For an odd element count, an explicit boundary check (`if (elem1 < numel) ... else ...`) prevents out-of-bounds access.
+
+### 2.2 Correctness Verification
+- **Reference**: `bitsandbytes` (v0.49.2) CPU/CUDA results.
+- **Metric**: mean absolute error (MAE).
+- **Result**: MAE = `2.30755e-05`, far below the required threshold of `1e-2`.
+
+---
+
+## 3. 
Optimization Journey
+
+We iterated through four kernel versions, raising performance from an initial 58 GB/s to 317 GB/s.
+
+### v1: Naive Baseline
+- **Approach**: one element per thread.
+- **Problem**: very inefficient memory access (1-byte loads, 2-byte stores); only ~3% of DRAM bandwidth utilized.
+- **Performance**: ~58 GB/s.
+
+### v2: Vectorized Access
+- **Optimization**: two elements (one `uint8`) per thread.
+- **Technique**: packed loads and `half2` stores.
+- **Effect**: memory instructions halved; bandwidth utilization improved substantially.
+
+### v3: Aggressive Vectorization
+- **Optimization**: 8 or 16 element pairs per thread (128-bit `int4` loads).
+- **Problem**: register pressure spiked, occupancy (active warps) dropped, and register spilling occurred.
+- **Lesson**: in a memory-bound kernel, excessive per-thread instruction-level parallelism (ILP) can hurt thread-level parallelism (TLP).
+
+### v4: Dynamic Occupancy Control (Current Best)
+- **Optimizations**:
+  1. **Fall back to `int2` loads**: lowers per-thread register pressure.
+  2. **`__launch_bounds__(128, 8)`**: forces the compiler to cap register usage so each SM can run at least 8 blocks (1024 threads).
+  3. **Dynamic block size**: `cudaOccupancyMaxPotentialBlockSize` computes the best launch configuration automatically.
+- **Rationale**: the kernel sits on the memory roof of the Roofline model, so the way to hide the long global-memory latency is to increase the number of concurrent warps.
+- **Performance**: **317.25 GB/s** (5.4x speedup over the baseline).
+
+---
+
+## 4. Performance Analysis
+
+### 4.1 Final Metrics
+Test environment: NVIDIA A100-SXM4-80GB, matrix 8192x8192
+
+| Metric | Value | Note |
+| :--- | :--- | :--- |
+| **Time** | 0.532 ms | very low latency |
+| **Bandwidth** | **317.25 GB/s** | effective bandwidth |
+| **MAE** | 2.30e-05 | accuracy within spec |
+| **vs bitsandbytes** | 64.4% | industrial baseline |
+
+### 4.2 Nsight Compute (NCU) Analysis
+Because of restrictions on the server (permissions or driver version), we could not collect detailed `ncu` metrics (e.g., the memory/compute throughput breakdown) on the final A100 environment. The analysis below therefore rests on theory and indirect experiments:
+
+1. **Memory-bound behavior**:
+   - The kernel runs extremely fast (0.532 ms) and does very little arithmetic (just a table lookup and a fused multiply-add per element).
+   - It sustains 317 GB/s, far beyond what a kernel bottlenecked on computation with unoptimized memory access would reach.
+   - By the Roofline model, an operator with such low arithmetic intensity is necessarily limited by DRAM bandwidth.
+
+2. **Occupancy validation**:
+   - The code explicitly uses `__launch_bounds__(128, 8)`.
+   - Compared with the version without launch bounds (v3), performance improved by 8.4%, indirect evidence that more active warps (higher occupancy) successfully hide part of the global-memory latency.
+
+3. 
**Coalescing Verification**:
+   - By design, the packed `uint32_t` stores make the 32 threads of a warp write 128 contiguous bytes (32 * 4 bytes), matching the GPU's 32-byte memory-transaction sectors and 128-byte L2 cache lines.
+
+### 4.3 Nsight Systems (NSYS) Analysis
+- **Timeline**: `nsys` ran successfully. On the timeline the kernel itself is very short; end-to-end GPU utilization is dominated by kernel-launch overhead and data transfers (H2D/D2H).
+- **System view**: in end-to-end inference, dequantization is tightly coupled with the following matrix multiplication (GEMM). When dequantization is benchmarked in isolation, data movement dominates.
+
+---
+
+## 5. Future Improvements
+
+Although v4 is already a solid engineering result, it still trails `bitsandbytes` (492 GB/s) by about 36%. Future directions include:
+
+1. **Inline PTX**:
+   - Hand-control SASS instruction scheduling to eliminate redundant move instructions emitted by the compiler.
+   - Fine-tune register allocation to further reduce bank conflicts.
+
+2. **Async Copy**:
+   - Use Ampere's `cp.async` instruction for hardware-level asynchronous global-to-shared-memory transfers, hiding copy latency and reducing pipeline stalls.
+
+3. **Kernel Fusion**:
+   - **Ultimate plan**: fuse dequantization with the subsequent GEMM.
+   - **Benefit**: dequantized FP16 values feed the multiplication directly from registers, skipping the write-back to global memory entirely; in theory this yields more than a 2x end-to-end speedup.
+
+---
+
+## 6. Appendix
+- **Source code**: `dequant_kernel.cu`, `main.cpp`
+- **Test script**: `benchmark_vs_bnb.py`
+- **Performance log**: `run_log_remote.md`
diff --git a/03_nf4_dequant/SkyHigh-achieving/README.md b/03_nf4_dequant/SkyHigh-achieving/README.md
new file mode 100644
index 0000000..e59b291
--- /dev/null
+++ b/03_nf4_dequant/SkyHigh-achieving/README.md
@@ -0,0 +1,28 @@
+# SkyHigh-achieving
+
+This directory is the technical submission for the SkyHigh-achieving project, covering the implementation approach, the optimization journey, and the performance analysis.
+
+## 📁 Project Structure
+
+```tree
+SkyHigh-achieving/
+├── Final_Project_Report.md
+├── README.md
+├── benchmark_vs_bnb.py
+├── dequant_kernel.cu
+├── dequant_kernel.h
+├── dequant_kernel.ptx
+├── dequant_kernel_v2.cu
+├── main.cpp
+└── run_log_remote.md
+```
+
+- **Final_Project_Report.md** → detailed technical report: implementation, optimization journey, and performance analysis
+- **README.md** → submission notes and file layout (this file)
+- **benchmark_vs_bnb.py** → comparison script benchmarking performance and accuracy against the bitsandbytes library
+- **dequant_kernel.cu** → core NF4 dequantization kernel (v4 optimized), with packed stores and launch-bounds tuning
+- **dequant_kernel.h** → kernel header providing the C++ call interface
+- **dequant_kernel.ptx** → PTX assembly generated by NVCC, for instruction-level analysis
+- **dequant_kernel_v2.cu** → earlier kernel version (v2), kept as a performance reference
+- 
**main.cpp** → C++ test driver: random data generation, MAE accuracy verification, and basic performance measurement
+- **run_log_remote.md** → full run log and measured performance data from the A100 server
diff --git a/03_nf4_dequant/SkyHigh-achieving/benchmark_vs_bnb.py b/03_nf4_dequant/SkyHigh-achieving/benchmark_vs_bnb.py
new file mode 100644
index 0000000..9e055f4
--- /dev/null
+++ b/03_nf4_dequant/SkyHigh-achieving/benchmark_vs_bnb.py
@@ -0,0 +1,77 @@
+
+import torch
+import time
+import sys
+
+def benchmark_bnb(rows=8192, cols=8192, repeats=50):
+    try:
+        import bitsandbytes as bnb
+        from bitsandbytes.functional import dequantize_4bit, quantize_4bit
+    except ImportError:
+        print("bitsandbytes not installed. Run: pip install bitsandbytes")
+        return None
+
+    if not torch.cuda.is_available():
+        print("CUDA not available")
+        return None
+
+    print(f"Benchmarking bitsandbytes on {torch.cuda.get_device_name(0)}...")
+
+    # Generate fp16 weights and quantize them to NF4
+    device = torch.device("cuda:0")
+    # LLM weights are usually fp16 before quantization; bnb quantizes from fp16/fp32
+    w = torch.randn(rows, cols, device=device, dtype=torch.float16)
+
+    # blocksize=64, quant_type='nf4'
+    # quantize_4bit returns: (quantized_data, quantization_state)
+    # The signature may vary by version, but it is usually input, blocksize, quant_type
+    try:
+        w_q, quant_state = bnb.functional.quantize_4bit(
+            w.reshape(1, -1), blocksize=64, quant_type='nf4'
+        )
+    except TypeError:
+        # Fallback for some versions
+        w_q, quant_state = bnb.functional.quantize_4bit(
+            w.reshape(1, -1), blocksize=64, quant_type='nf4', compress_statistics=True
+        )
+
+    # Warmup
+    print("Warmup...")
+    for _ in range(5):
+        out = bnb.functional.dequantize_4bit(w_q, quant_state, quant_type='nf4')
+    torch.cuda.synchronize()
+
+    # Benchmark
+    print("Benchmarking...")
+    t0 = time.perf_counter()
+    for _ in range(repeats):
+        out = bnb.functional.dequantize_4bit(w_q, quant_state, quant_type='nf4')
+    torch.cuda.synchronize()
+    t1 = time.perf_counter()
+
+    # Calculate metrics
+    ms_per_call = (t1 - t0) / repeats * 1000
+
+    # Data transfer: 
+    # Read: 4-bit quantized data + quantization metadata (scales, absmax)
+    # Write: FP16 output
+    # Input size: rows * cols / 2 bytes (4-bit)
+    # Output size: rows * cols * 2 bytes (fp16)
+    # Metadata is usually negligible for the bandwidth figure, though a strict count would include it.
+    # To match our kernel's accounting, we count load(compressed) + store(decompressed).
+
+    numel = rows * cols
+    bytes_in = numel // 2   # 0.5 bytes per element
+    bytes_out = numel * 2   # 2 bytes per element
+    total_bytes = bytes_in + bytes_out
+
+    bw_gbs = total_bytes / (ms_per_call / 1000) / 1e9
+
+    print(f"bitsandbytes dequantize_4bit ({rows}x{cols}, nf4, blocksize=64):")
+    print(f"  Time: {ms_per_call:.3f} ms")
+    print(f"  Bandwidth: {bw_gbs:.2f} GB/s")
+
+    return ms_per_call, bw_gbs
+
+if __name__ == "__main__":
+    benchmark_bnb(8192, 8192)
diff --git a/03_nf4_dequant/SkyHigh-achieving/dequant_kernel.cu b/03_nf4_dequant/SkyHigh-achieving/dequant_kernel.cu
new file mode 100644
index 0000000..3090463
--- /dev/null
+++ b/03_nf4_dequant/SkyHigh-achieving/dequant_kernel.cu
@@ -0,0 +1 @@
+#include "dequant_kernel_v2.cu"
diff --git a/03_nf4_dequant/SkyHigh-achieving/dequant_kernel.h b/03_nf4_dequant/SkyHigh-achieving/dequant_kernel.h
new file mode 100644
index 0000000..da6fcb9
--- /dev/null
+++ b/03_nf4_dequant/SkyHigh-achieving/dequant_kernel.h
@@ -0,0 +1,29 @@
+#pragma once
+
+#include <cstdint>
+#include <vector>
+
+enum class ComputeType {
+    FP16,
+    BF16
+};
+
+struct DequantConfig {
+    int64_t rows;
+    int64_t cols;
+    int32_t blocksize;
+    ComputeType compute_type;
+};
+
+struct NF4Binary {
+    DequantConfig config;
+    std::vector<uint8_t> packed_weights;
+    std::vector<uint8_t> absmax_q;
+    std::vector<float> absmax2_raw;
+    std::vector<float> code2_raw;
+    float offset;
+};
+
+bool load_nf4_binary(const char* file_path, NF4Binary& out);
+bool save_float_output(const char* file_path, const std::vector<float>& data);
+bool run_dequant_cuda(const NF4Binary& input, std::vector<float>& output, float& mae);
diff --git 
a/03_nf4_dequant/SkyHigh-achieving/dequant_kernel.ptx b/03_nf4_dequant/SkyHigh-achieving/dequant_kernel.ptx new file mode 100644 index 0000000..cc80ebe --- /dev/null +++ b/03_nf4_dequant/SkyHigh-achieving/dequant_kernel.ptx @@ -0,0 +1,5017 @@ +// +// Generated by NVIDIA NVVM Compiler +// +// Compiler Build ID: CL-32267302 +// Cuda compilation tools, release 12.0, V12.0.140 +// Based on NVVM 7.0.1 +// + +.version 8.0 +.target sm_80 +.address_size 64 + +.const .align 4 .b8 _ZN48_INTERNAL_848bf537_17_dequant_kernel_cu_622ebb3250_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb325d_nf4E[64] = {0, 0, 128, 191, 177, 57, 50, 191, 48, 107, 6, 191, 160, 50, 202, 190, 77, 162, 145, 190, 63, 53, 61, 190, 113, 120, 186, 189, 0, 0, 0, 0, 255, 250, 162, 61, 227, 202, 36, 62, 221, 4, 124, 62, 58, 3, 173, 62, 184, 164, 225, 62, 171, 7, 16, 63, 179, 19, 57, 63, 0, 0, 128, 63}; + +.entry _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT_( + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_0, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_1, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_2, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_3, + .param .f32 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_4, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_5, + .param .u32 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_6, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_7 +) +{ + .reg .pred 
%p<5>; + .reg .b16 %rs<12>; + .reg .f32 %f<14>; + .reg .b32 %r<17>; + .reg .b64 %rd<58>; + .loc 1 102 0 + + + ld.param.u64 %rd15, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_0]; + ld.param.u64 %rd18, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_1]; + ld.param.u64 %rd19, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_2]; + ld.param.u64 %rd20, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_3]; + ld.param.f32 %f2, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_4]; + ld.param.u64 %rd16, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_5]; + ld.param.u32 %r1, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_6]; + ld.param.u64 %rd17, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI6__halfEEvPKhS3_PKfS5_fliPT__param_7]; + .loc 1 113 28 + cvta.to.global.u64 %rd1, %rd19; + cvta.to.global.u64 %rd2, %rd20; + cvta.to.global.u64 %rd3, %rd18; + mov.u32 %r2, %ctaid.x; + mov.u32 %r3, %ntid.x; + mul.wide.u32 %rd21, %r2, %r3; + mov.u32 %r4, %tid.x; + cvt.u64.u32 %rd22, %r4; + add.s64 %rd4, %rd21, %rd22; + .loc 1 114 25 + shl.b64 %rd5, %rd4, 1; + .loc 1 115 5 + setp.ge.s64 %p1, %rd5, %rd16; + @%p1 bra $L__BB0_10; + + .loc 1 113 28 + cvta.to.global.u64 %rd23, %rd15; + .loc 1 121 26 + add.s64 %rd24, %rd23, %rd4; + ld.global.nc.u8 %rs1, [%rd24]; + .loc 1 130 30 + cvt.s64.s32 %rd6, %r1; + or.b64 %rd25, %rd5, %rd6; + and.b64 %rd26, %rd25, -4294967296; + setp.eq.s64 %p2, %rd26, 0; + @%p2 bra $L__BB0_3; + + .loc 1 0 30 + div.s64 %rd56, %rd5, %rd6; + bra.uni $L__BB0_4; + +$L__BB0_3: + cvt.u32.u64 %r5, %rd6; + cvt.u32.u64 %r6, %rd5; + div.u32 %r7, 
%r6, %r5; + cvt.u64.u32 %rd56, %r7; + +$L__BB0_4: + .loc 1 132 24 + add.s64 %rd27, %rd3, %rd56; + ld.global.nc.u8 %rs2, [%rd27]; + cvt.u32.u16 %r8, %rs2; + and.b32 %r9, %r8, 255; + mul.wide.u32 %rd28, %r9, 4; + add.s64 %rd29, %rd2, %rd28; + .loc 1 131 30 + shr.s64 %rd30, %rd56, 63; + shr.u64 %rd31, %rd30, 56; + add.s64 %rd32, %rd56, %rd31; + shr.s64 %rd33, %rd32, 8; + .loc 1 132 24 + shl.b64 %rd34, %rd33, 2; + add.s64 %rd35, %rd1, %rd34; + ld.global.nc.f32 %f3, [%rd35]; + ld.global.nc.f32 %f4, [%rd29]; + mul.f32 %f5, %f4, %f3; + .loc 1 133 20 + shl.b16 %rs3, %rs1, 2; + cvt.u64.u16 %rd36, %rs3; + and.b64 %rd37, %rd36, 60; + mov.u64 %rd38, _ZN48_INTERNAL_848bf537_17_dequant_kernel_cu_622ebb3250_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb325d_nf4E; + add.s64 %rd39, %rd38, %rd37; + ld.const.f32 %f6, [%rd39]; + fma.rn.f32 %f1, %f6, %f5, %f2; + .loc 1 136 25 + add.s64 %rd10, %rd5, 1; + .loc 1 137 5 + setp.lt.s64 %p3, %rd10, %rd16; + .loc 1 113 28 + cvta.to.global.u64 %rd40, %rd17; + .loc 1 149 9 + shl.b64 %rd41, %rd5, 1; + add.s64 %rd11, %rd40, %rd41; + .loc 1 137 5 + @%p3 bra $L__BB0_6; + bra.uni $L__BB0_5; + +$L__BB0_6: + .loc 1 0 5 + or.b64 %rd42, %rd10, %rd6; + and.b64 %rd43, %rd42, -4294967296; + setp.eq.s64 %p4, %rd43, 0; + @%p4 bra $L__BB0_8; + + div.s64 %rd57, %rd10, %rd6; + bra.uni $L__BB0_9; + +$L__BB0_5: + .loc 1 152 22 + .loc 1 62 71, function_name $L__info_string0, inlined_at 1 152 22 + .loc 2 596 3, function_name $L__info_string1, inlined_at 1 62 71 + // begin inline asm + { cvt.rn.f16.f32 %rs4, %f1;} + + // end inline asm + .loc 1 152 22 + st.global.u16 [%rd11], %rs4; + bra.uni $L__BB0_10; + +$L__BB0_8: + .loc 1 0 22 + cvt.u32.u64 %r10, %rd6; + cvt.u32.u64 %r11, %rd10; + div.u32 %r12, %r11, %r10; + cvt.u64.u32 %rd57, %r12; + +$L__BB0_9: + .loc 1 143 28 + add.s64 %rd44, %rd3, %rd57; + ld.global.nc.u8 %rs9, [%rd44]; + cvt.u32.u16 %r14, %rs9; + and.b32 %r15, %r14, 255; + mul.wide.u32 %rd45, %r15, 4; + add.s64 %rd46, %rd2, %rd45; + .loc 1 142 34 + shr.s64 
%rd47, %rd57, 63; + shr.u64 %rd48, %rd47, 56; + add.s64 %rd49, %rd57, %rd48; + shr.s64 %rd50, %rd49, 8; + .loc 1 143 28 + shl.b64 %rd51, %rd50, 2; + add.s64 %rd52, %rd1, %rd51; + ld.global.nc.f32 %f10, [%rd52]; + ld.global.nc.f32 %f11, [%rd46]; + mul.f32 %f12, %f11, %f10; + .loc 1 123 20 + and.b16 %rs10, %rs1, 240; + shr.u16 %rs11, %rs10, 4; + .loc 1 144 24 + cvt.u32.u16 %r16, %rs11; + mul.wide.u32 %rd53, %r16, 4; + add.s64 %rd55, %rd38, %rd53; + ld.const.f32 %f13, [%rd55]; + fma.rn.f32 %f9, %f13, %f12, %f2; + .loc 1 149 36 + .loc 1 62 71, function_name $L__info_string0, inlined_at 1 149 36 + .loc 2 596 3, function_name $L__info_string1, inlined_at 1 62 71 + // begin inline asm + { cvt.rn.f16.f32 %rs5, %f1;} + + // end inline asm + .loc 1 149 52 + .loc 1 62 71, function_name $L__info_string0, inlined_at 1 149 52 + .loc 2 596 3, function_name $L__info_string1, inlined_at 1 62 71 + // begin inline asm + { cvt.rn.f16.f32 %rs6, %f9;} + + // end inline asm + .loc 1 149 9 + .loc 1 74 22, function_name $L__info_string2, inlined_at 1 149 9 + .loc 2 1419 5, function_name $L__info_string3, inlined_at 1 74 22 + // begin inline asm + { mov.b32 %r13, {%rs5,%rs6};} + + // end inline asm + .loc 1 75 5, function_name $L__info_string2, inlined_at 1 149 9 + st.global.u32 [%rd11], %r13; + +$L__BB0_10: + .loc 1 154 1 + ret; + +} +.entry _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT_( + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_0, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_1, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_2, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_3, + .param .f32 
_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_4, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_5, + .param .u32 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_6, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_7 +) +{ + .reg .pred %p<82>; + .reg .b16 %rs<189>; + .reg .f32 %f<194>; + .reg .b32 %r<327>; + .reg .b64 %rd<664>; + .loc 1 157 0 + + + ld.param.u64 %rd137, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_0]; + ld.param.u64 %rd140, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_1]; + ld.param.u64 %rd141, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_2]; + ld.param.u64 %rd142, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_3]; + ld.param.f32 %f17, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_4]; + ld.param.u64 %rd138, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_5]; + ld.param.u32 %r64, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_6]; + ld.param.u64 %rd139, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I6__halfEEvPKhS3_PKfS5_fliPT__param_7]; + cvta.to.global.u64 %rd1, %rd141; + cvta.to.global.u64 %rd2, %rd142; + cvta.to.global.u64 %rd3, %rd140; + .loc 1 169 23 + mov.u32 %r65, %ctaid.x; + mov.u32 %r66, %ntid.x; + mul.wide.u32 %rd143, %r65, %r66; + mov.u32 %r67, %tid.x; + cvt.u64.u32 %rd144, %r67; + add.s64 
%rd145, %rd143, %rd144; + .loc 1 170 29 + shl.b64 %rd4, %rd145, 4; + .loc 1 171 29 + shl.b64 %rd5, %rd145, 5; + .loc 1 172 35 + add.s64 %rd146, %rd138, 1; + shr.u64 %rd147, %rd146, 63; + add.s64 %rd148, %rd146, %rd147; + shr.s64 %rd6, %rd148, 1; + .loc 1 174 5 + setp.ge.s64 %p1, %rd5, %rd138; + @%p1 bra $L__BB1_194; + + .loc 1 178 5 + add.s64 %rd149, %rd4, 16; + setp.gt.s64 %p2, %rd149, %rd6; + cvta.to.global.u64 %rd150, %rd137; + .loc 1 206 13 + add.s64 %rd7, %rd150, %rd4; + .loc 1 178 5 + @%p2 bra $L__BB1_3; + bra.uni $L__BB1_2; + +$L__BB1_3: + .loc 1 206 13 + setp.ge.s64 %p3, %rd4, %rd6; + mov.u32 %r321, 0; + mov.u32 %r322, %r321; + @%p3 bra $L__BB1_5; + + ld.global.nc.u8 %rs17, [%rd7]; + cvt.u32.u16 %r73, %rs17; + and.b32 %r322, %r73, 255; + +$L__BB1_5: + .loc 1 205 29 + add.s64 %rd151, %rd4, 1; + .loc 1 206 13 + setp.ge.s64 %p4, %rd151, %rd6; + @%p4 bra $L__BB1_7; + + ld.global.nc.u8 %rs18, [%rd7+1]; + cvt.u32.u16 %r75, %rs18; + and.b32 %r321, %r75, 255; + +$L__BB1_7: + .loc 1 205 29 + add.s64 %rd152, %rd4, 2; + .loc 1 206 13 + setp.ge.s64 %p5, %rd152, %rd6; + mov.u32 %r319, 0; + mov.u32 %r320, %r319; + @%p5 bra $L__BB1_9; + + ld.global.nc.u8 %rs19, [%rd7+2]; + cvt.u32.u16 %r77, %rs19; + and.b32 %r320, %r77, 255; + +$L__BB1_9: + .loc 1 205 29 + add.s64 %rd153, %rd4, 3; + .loc 1 206 13 + setp.ge.s64 %p6, %rd153, %rd6; + @%p6 bra $L__BB1_11; + + ld.global.nc.u8 %rs20, [%rd7+3]; + cvt.u32.u16 %r79, %rs20; + and.b32 %r319, %r79, 255; + +$L__BB1_11: + .loc 1 205 29 + add.s64 %rd154, %rd4, 4; + .loc 1 206 13 + setp.ge.s64 %p7, %rd154, %rd6; + mov.u32 %r317, 0; + mov.u32 %r318, %r317; + @%p7 bra $L__BB1_13; + + ld.global.nc.u8 %rs21, [%rd7+4]; + cvt.u32.u16 %r81, %rs21; + and.b32 %r318, %r81, 255; + +$L__BB1_13: + .loc 1 205 29 + add.s64 %rd155, %rd4, 5; + .loc 1 206 13 + setp.ge.s64 %p8, %rd155, %rd6; + @%p8 bra $L__BB1_15; + + ld.global.nc.u8 %rs22, [%rd7+5]; + cvt.u32.u16 %r83, %rs22; + and.b32 %r317, %r83, 255; + +$L__BB1_15: + .loc 1 205 29 + add.s64 %rd156, 
%rd4, 6; + .loc 1 206 13 + setp.ge.s64 %p9, %rd156, %rd6; + mov.u32 %r315, 0; + mov.u32 %r316, %r315; + @%p9 bra $L__BB1_17; + + ld.global.nc.u8 %rs23, [%rd7+6]; + cvt.u32.u16 %r85, %rs23; + and.b32 %r316, %r85, 255; + +$L__BB1_17: + .loc 1 205 29 + add.s64 %rd157, %rd4, 7; + .loc 1 206 13 + setp.ge.s64 %p10, %rd157, %rd6; + @%p10 bra $L__BB1_19; + + ld.global.nc.u8 %rs24, [%rd7+7]; + cvt.u32.u16 %r87, %rs24; + and.b32 %r315, %r87, 255; + +$L__BB1_19: + .loc 1 205 29 + add.s64 %rd158, %rd4, 8; + .loc 1 206 13 + setp.ge.s64 %p11, %rd158, %rd6; + mov.u32 %r313, 0; + mov.u32 %r314, %r313; + @%p11 bra $L__BB1_21; + + ld.global.nc.u8 %rs25, [%rd7+8]; + cvt.u32.u16 %r89, %rs25; + and.b32 %r314, %r89, 255; + +$L__BB1_21: + .loc 1 205 29 + add.s64 %rd159, %rd4, 9; + .loc 1 206 13 + setp.ge.s64 %p12, %rd159, %rd6; + @%p12 bra $L__BB1_23; + + ld.global.nc.u8 %rs26, [%rd7+9]; + cvt.u32.u16 %r91, %rs26; + and.b32 %r313, %r91, 255; + +$L__BB1_23: + .loc 1 205 29 + add.s64 %rd160, %rd4, 10; + .loc 1 206 13 + setp.ge.s64 %p13, %rd160, %rd6; + mov.u32 %r311, 0; + mov.u32 %r312, %r311; + @%p13 bra $L__BB1_25; + + ld.global.nc.u8 %rs27, [%rd7+10]; + cvt.u32.u16 %r93, %rs27; + and.b32 %r312, %r93, 255; + +$L__BB1_25: + .loc 1 205 29 + add.s64 %rd161, %rd4, 11; + .loc 1 206 13 + setp.ge.s64 %p14, %rd161, %rd6; + @%p14 bra $L__BB1_27; + + ld.global.nc.u8 %rs28, [%rd7+11]; + cvt.u32.u16 %r95, %rs28; + and.b32 %r311, %r95, 255; + +$L__BB1_27: + .loc 1 205 29 + add.s64 %rd162, %rd4, 12; + .loc 1 206 13 + setp.ge.s64 %p15, %rd162, %rd6; + mov.u32 %r324, 0; + mov.u32 %r323, %r324; + @%p15 bra $L__BB1_29; + + ld.global.nc.u8 %rs29, [%rd7+12]; + cvt.u32.u16 %r97, %rs29; + and.b32 %r323, %r97, 255; + +$L__BB1_29: + .loc 1 205 29 + add.s64 %rd163, %rd4, 13; + .loc 1 206 13 + setp.ge.s64 %p16, %rd163, %rd6; + @%p16 bra $L__BB1_31; + + ld.global.nc.u8 %rs30, [%rd7+13]; + cvt.u32.u16 %r99, %rs30; + and.b32 %r324, %r99, 255; + +$L__BB1_31: + .loc 1 205 29 + add.s64 %rd164, %rd4, 14; + .loc 1 206 13 
+ setp.ge.s64 %p17, %rd164, %rd6; + mov.u32 %r326, 0; + mov.u32 %r325, %r326; + @%p17 bra $L__BB1_33; + + ld.global.nc.u8 %rs31, [%rd7+14]; + cvt.u32.u16 %r101, %rs31; + and.b32 %r325, %r101, 255; + +$L__BB1_33: + .loc 1 205 29 + add.s64 %rd165, %rd4, 15; + .loc 1 206 13 + setp.ge.s64 %p18, %rd165, %rd6; + @%p18 bra $L__BB1_35; + + ld.global.nc.u8 %rs32, [%rd7+15]; + cvt.u32.u16 %r103, %rs32; + and.b32 %r326, %r103, 255; + bra.uni $L__BB1_35; + +$L__BB1_2: + .loc 1 180 29 + ld.global.nc.v4.u32 {%r322, %r318, %r314, %r323}, [%rd7]; + .loc 1 186 17 + shr.u32 %r321, %r322, 8; + .loc 1 187 17 + shr.u32 %r320, %r322, 16; + .loc 1 188 17 + shr.u32 %r319, %r322, 24; + .loc 1 186 17 + shr.u32 %r317, %r318, 8; + .loc 1 187 17 + shr.u32 %r316, %r318, 16; + .loc 1 188 17 + shr.u32 %r315, %r318, 24; + .loc 1 186 17 + shr.u32 %r313, %r314, 8; + .loc 1 187 17 + shr.u32 %r312, %r314, 16; + .loc 1 188 17 + shr.u32 %r311, %r314, 24; + .loc 1 186 17 + shr.u32 %r324, %r323, 8; + .loc 1 187 17 + shr.u32 %r325, %r323, 16; + .loc 1 188 17 + shr.u32 %r326, %r323, 24; + +$L__BB1_35: + .loc 1 213 9 + cvt.u16.u32 %rs1, %r326; + cvt.u16.u32 %rs2, %r325; + cvt.u16.u32 %rs3, %r324; + cvt.u16.u32 %rs4, %r323; + cvt.u16.u32 %rs5, %r321; + cvt.u16.u32 %rs6, %r320; + cvt.u16.u32 %rs7, %r319; + cvt.u16.u32 %rs8, %r318; + cvt.u16.u32 %rs9, %r317; + cvt.u16.u32 %rs10, %r316; + cvt.u16.u32 %rs11, %r315; + cvt.u16.u32 %rs12, %r314; + cvt.u16.u32 %rs13, %r313; + cvt.u16.u32 %rs14, %r312; + cvt.u16.u32 %rs15, %r311; + cvt.u16.u32 %rs16, %r322; + cvt.s64.s32 %rd8, %r64; + or.b64 %rd166, %rd5, %rd8; + and.b64 %rd167, %rd166, -4294967296; + setp.eq.s64 %p19, %rd167, 0; + @%p19 bra $L__BB1_37; + + .loc 1 0 9 + div.s64 %rd632, %rd5, %rd8; + bra.uni $L__BB1_38; + +$L__BB1_37: + cvt.u32.u64 %r104, %rd8; + cvt.u32.u64 %r105, %rd5; + div.u32 %r106, %r105, %r104; + cvt.u64.u32 %rd632, %r106; + +$L__BB1_38: + .loc 1 219 28 + add.s64 %rd168, %rd3, %rd632; + ld.global.nc.u8 %rs33, [%rd168]; + cvt.u32.u16 %r107, 
%rs33; + and.b32 %r108, %r107, 255; + mul.wide.u32 %rd169, %r108, 4; + add.s64 %rd170, %rd2, %rd169; + shr.s64 %rd171, %rd632, 63; + shr.u64 %rd172, %rd171, 56; + add.s64 %rd173, %rd632, %rd172; + shr.s64 %rd174, %rd173, 8; + shl.b64 %rd175, %rd174, 2; + add.s64 %rd176, %rd1, %rd175; + ld.global.nc.f32 %f18, [%rd176]; + ld.global.nc.f32 %f19, [%rd170]; + mul.f32 %f20, %f19, %f18; + .loc 1 220 24 + shl.b16 %rs34, %rs16, 2; + cvt.u64.u16 %rd177, %rs34; + and.b64 %rd178, %rd177, 60; + mov.u64 %rd179, _ZN48_INTERNAL_848bf537_17_dequant_kernel_cu_622ebb3250_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb325d_nf4E; + add.s64 %rd180, %rd179, %rd178; + ld.const.f32 %f21, [%rd180]; + fma.rn.f32 %f1, %f21, %f20, %f17; + .loc 1 222 29 + add.s64 %rd12, %rd5, 1; + .loc 1 223 9 + setp.lt.s64 %p20, %rd12, %rd138; + cvta.to.global.u64 %rd181, %rd139; + .loc 1 227 13 + shl.b64 %rd182, %rd5, 1; + add.s64 %rd13, %rd181, %rd182; + .loc 1 223 9 + @%p20 bra $L__BB1_40; + bra.uni $L__BB1_39; + +$L__BB1_40: + .loc 1 0 9 + or.b64 %rd183, %rd12, %rd8; + and.b64 %rd184, %rd183, -4294967296; + setp.eq.s64 %p21, %rd184, 0; + @%p21 bra $L__BB1_42; + + div.s64 %rd633, %rd12, %rd8; + bra.uni $L__BB1_43; + +$L__BB1_39: + .loc 1 229 26 + .loc 1 62 71, function_name $L__info_string0, inlined_at 1 229 26 + .loc 2 596 3, function_name $L__info_string1, inlined_at 1 62 71 + // begin inline asm + { cvt.rn.f16.f32 %rs35, %f1;} + + // end inline asm + .loc 1 229 26 + st.global.u16 [%rd13], %rs35; + bra.uni $L__BB1_44; + +$L__BB1_42: + .loc 1 0 26 + cvt.u32.u64 %r109, %rd8; + cvt.u32.u64 %r110, %rd12; + div.u32 %r111, %r110, %r109; + cvt.u64.u32 %rd633, %r111; + +$L__BB1_43: + .loc 1 225 32 + add.s64 %rd185, %rd3, %rd633; + ld.global.nc.u8 %rs40, [%rd185]; + cvt.u32.u16 %r113, %rs40; + and.b32 %r114, %r113, 255; + mul.wide.u32 %rd186, %r114, 4; + add.s64 %rd187, %rd2, %rd186; + shr.s64 %rd188, %rd633, 63; + shr.u64 %rd189, %rd188, 56; + add.s64 %rd190, %rd633, %rd189; + shr.s64 %rd191, %rd190, 8; + shl.b64 
%rd192, %rd191, 2; + add.s64 %rd193, %rd1, %rd192; + ld.global.nc.f32 %f25, [%rd193]; + ld.global.nc.f32 %f26, [%rd187]; + mul.f32 %f27, %f26, %f25; + .loc 1 216 24 + and.b16 %rs41, %rs16, 240; + shr.u16 %rs42, %rs41, 4; + .loc 1 226 28 + cvt.u32.u16 %r115, %rs42; + mul.wide.u32 %rd194, %r115, 4; + add.s64 %rd196, %rd179, %rd194; + ld.const.f32 %f28, [%rd196]; + fma.rn.f32 %f24, %f28, %f27, %f17; + .loc 1 227 40 + .loc 1 62 71, function_name $L__info_string0, inlined_at 1 227 40 + .loc 2 596 3, function_name $L__info_string1, inlined_at 1 62 71 + // begin inline asm + { cvt.rn.f16.f32 %rs36, %f1;} + + // end inline asm + .loc 1 227 56 + .loc 1 62 71, function_name $L__info_string0, inlined_at 1 227 56 + .loc 2 596 3, function_name $L__info_string1, inlined_at 1 62 71 + // begin inline asm + { cvt.rn.f16.f32 %rs37, %f24;} + + // end inline asm + .loc 1 227 13 + .loc 1 74 22, function_name $L__info_string2, inlined_at 1 227 13 + .loc 2 1419 5, function_name $L__info_string3, inlined_at 1 74 22 + // begin inline asm + { mov.b32 %r112, {%rs36,%rs37};} + + // end inline asm + .loc 1 75 5, function_name $L__info_string2, inlined_at 1 227 13 + st.global.u32 [%rd13], %r112; + +$L__BB1_44: + .loc 1 212 29 + add.s64 %rd17, %rd5, 2; + .loc 1 213 9 + setp.ge.s64 %p22, %rd17, %rd138; + @%p22 bra $L__BB1_194; + + .loc 1 0 9 + or.b64 %rd197, %rd17, %rd8; + and.b64 %rd198, %rd197, -4294967296; + setp.eq.s64 %p23, %rd198, 0; + @%p23 bra $L__BB1_47; + + div.s64 %rd634, %rd17, %rd8; + bra.uni $L__BB1_48; + +$L__BB1_47: + cvt.u32.u64 %r116, %rd8; + cvt.u32.u64 %r117, %rd17; + div.u32 %r118, %r117, %r116; + cvt.u64.u32 %rd634, %r118; + +$L__BB1_48: + .loc 1 219 28 + add.s64 %rd199, %rd3, %rd634; + ld.global.nc.u8 %rs43, [%rd199]; + cvt.u32.u16 %r119, %rs43; + and.b32 %r120, %r119, 255; + mul.wide.u32 %rd200, %r120, 4; + add.s64 %rd201, %rd2, %rd200; + shr.s64 %rd202, %rd634, 63; + shr.u64 %rd203, %rd202, 56; + add.s64 %rd204, %rd634, %rd203; + shr.s64 %rd205, %rd204, 8; + shl.b64 
[PTX listing truncated: compiler-generated PTX for the unrolled dequantization kernel (several hundred lines, one near-identical block per unrolled element). Each unrolled step repeats the same pattern: a 64-bit index division by the group stride with a 32-bit fast path (taken when `(idx | stride) >> 32 == 0`), `ld.global.nc` read-only loads of the quantized absmax byte and the second-level scale, a `mul.f32` combining the two scale levels, an `ld.const.f32` NF4 table lookup in constant memory (low nibble via `shl`/`and 60`, high nibble via `and 240`/`shr 4`), an `fma.rn.f32` applying `NF4[idx] * scale + offset`, `cvt.rn.f16.f32` conversion to FP16, and a packed store of two FP16 values via `mov.b32 {lo, hi}` into a single `st.global.u32` (falling back to a scalar `st.global.u16` when the pair is cut off at the element-count boundary).]
29; + .loc 1 223 9 + setp.lt.s64 %p76, %rd125, %rd138; + @%p76 bra $L__BB1_180; + bra.uni $L__BB1_179; + +$L__BB1_180: + .loc 1 0 9 + or.b64 %rd589, %rd125, %rd8; + and.b64 %rd590, %rd589, -4294967296; + setp.eq.s64 %p77, %rd590, 0; + @%p77 bra $L__BB1_182; + + div.s64 %rd661, %rd125, %rd8; + bra.uni $L__BB1_183; + +$L__BB1_179: + .loc 2 596 3, function_name $L__info_string1, inlined_at 1 62 71 + // begin inline asm + { cvt.rn.f16.f32 %rs172, %f15;} + + // end inline asm + .loc 1 229 26 + st.global.u16 [%rd13+56], %rs172; + bra.uni $L__BB1_184; + +$L__BB1_182: + .loc 1 0 26 + cvt.u32.u64 %r277, %rd8; + cvt.u32.u64 %r278, %rd125; + div.u32 %r279, %r278, %r277; + cvt.u64.u32 %rd661, %r279; + +$L__BB1_183: + .loc 1 225 32 + add.s64 %rd591, %rd3, %rd661; + ld.global.nc.u8 %rs177, [%rd591]; + cvt.u32.u16 %r281, %rs177; + and.b32 %r282, %r281, 255; + mul.wide.u32 %rd592, %r282, 4; + add.s64 %rd593, %rd2, %rd592; + shr.s64 %rd594, %rd661, 63; + shr.u64 %rd595, %rd594, 56; + add.s64 %rd596, %rd661, %rd595; + shr.s64 %rd597, %rd596, 8; + shl.b64 %rd598, %rd597, 2; + add.s64 %rd599, %rd1, %rd598; + ld.global.nc.f32 %f179, [%rd599]; + ld.global.nc.f32 %f180, [%rd593]; + mul.f32 %f181, %f180, %f179; + .loc 1 216 24 + and.b16 %rs178, %rs2, 240; + shr.u16 %rs179, %rs178, 4; + .loc 1 226 28 + cvt.u32.u16 %r283, %rs179; + mul.wide.u32 %rd600, %r283, 4; + add.s64 %rd602, %rd179, %rd600; + ld.const.f32 %f182, [%rd602]; + fma.rn.f32 %f178, %f182, %f181, %f17; + .loc 2 596 3, function_name $L__info_string1, inlined_at 1 62 71 + // begin inline asm + { cvt.rn.f16.f32 %rs173, %f15;} + + // end inline asm + .loc 2 596 3, function_name $L__info_string1, inlined_at 1 62 71 + // begin inline asm + { cvt.rn.f16.f32 %rs174, %f178;} + + // end inline asm + .loc 2 1419 5, function_name $L__info_string3, inlined_at 1 74 22 + // begin inline asm + { mov.b32 %r280, {%rs173,%rs174};} + + // end inline asm + .loc 1 75 5, function_name $L__info_string2, inlined_at 1 227 13 + st.global.u32 [%rd13+56], 
%r280; + +$L__BB1_184: + .loc 1 212 29 + add.s64 %rd129, %rd5, 30; + .loc 1 213 9 + setp.ge.s64 %p78, %rd129, %rd138; + @%p78 bra $L__BB1_194; + + .loc 1 0 9 + or.b64 %rd603, %rd129, %rd8; + and.b64 %rd604, %rd603, -4294967296; + setp.eq.s64 %p79, %rd604, 0; + @%p79 bra $L__BB1_187; + + div.s64 %rd662, %rd129, %rd8; + bra.uni $L__BB1_188; + +$L__BB1_187: + cvt.u32.u64 %r284, %rd8; + cvt.u32.u64 %r285, %rd129; + div.u32 %r286, %r285, %r284; + cvt.u64.u32 %rd662, %r286; + +$L__BB1_188: + .loc 1 219 28 + add.s64 %rd605, %rd3, %rd662; + ld.global.nc.u8 %rs180, [%rd605]; + cvt.u32.u16 %r287, %rs180; + and.b32 %r288, %r287, 255; + mul.wide.u32 %rd606, %r288, 4; + add.s64 %rd607, %rd2, %rd606; + shr.s64 %rd608, %rd662, 63; + shr.u64 %rd609, %rd608, 56; + add.s64 %rd610, %rd662, %rd609; + shr.s64 %rd611, %rd610, 8; + shl.b64 %rd612, %rd611, 2; + add.s64 %rd613, %rd1, %rd612; + ld.global.nc.f32 %f183, [%rd613]; + ld.global.nc.f32 %f184, [%rd607]; + mul.f32 %f185, %f184, %f183; + .loc 1 220 24 + shl.b16 %rs181, %rs1, 2; + cvt.u64.u16 %rd614, %rs181; + and.b64 %rd615, %rd614, 60; + add.s64 %rd617, %rd179, %rd615; + ld.const.f32 %f186, [%rd617]; + fma.rn.f32 %f16, %f186, %f185, %f17; + .loc 1 222 29 + add.s64 %rd133, %rd5, 31; + .loc 1 223 9 + setp.lt.s64 %p80, %rd133, %rd138; + @%p80 bra $L__BB1_190; + bra.uni $L__BB1_189; + +$L__BB1_190: + .loc 1 0 9 + or.b64 %rd618, %rd133, %rd8; + and.b64 %rd619, %rd618, -4294967296; + setp.eq.s64 %p81, %rd619, 0; + @%p81 bra $L__BB1_192; + + div.s64 %rd663, %rd133, %rd8; + bra.uni $L__BB1_193; + +$L__BB1_189: + .loc 2 596 3, function_name $L__info_string1, inlined_at 1 62 71 + // begin inline asm + { cvt.rn.f16.f32 %rs182, %f16;} + + // end inline asm + .loc 1 229 26 + st.global.u16 [%rd13+60], %rs182; + bra.uni $L__BB1_194; + +$L__BB1_192: + .loc 1 0 26 + cvt.u32.u64 %r289, %rd8; + cvt.u32.u64 %r290, %rd133; + div.u32 %r291, %r290, %r289; + cvt.u64.u32 %rd663, %r291; + +$L__BB1_193: + .loc 1 225 32 + add.s64 %rd620, %rd3, %rd663; + 
ld.global.nc.u8 %rs187, [%rd620]; + cvt.u32.u16 %r293, %rs187; + and.b32 %r294, %r293, 255; + mul.wide.u32 %rd621, %r294, 4; + add.s64 %rd622, %rd2, %rd621; + shr.s64 %rd623, %rd663, 63; + shr.u64 %rd624, %rd623, 56; + add.s64 %rd625, %rd663, %rd624; + shr.s64 %rd626, %rd625, 8; + shl.b64 %rd627, %rd626, 2; + add.s64 %rd628, %rd1, %rd627; + ld.global.nc.f32 %f190, [%rd628]; + ld.global.nc.f32 %f191, [%rd622]; + mul.f32 %f192, %f191, %f190; + .loc 1 216 24 + shr.u16 %rs188, %rs1, 4; + .loc 1 226 28 + cvt.u32.u16 %r295, %rs188; + mul.wide.u32 %rd629, %r295, 4; + add.s64 %rd631, %rd179, %rd629; + ld.const.f32 %f193, [%rd631]; + fma.rn.f32 %f189, %f193, %f192, %f17; + .loc 2 596 3, function_name $L__info_string1, inlined_at 1 62 71 + // begin inline asm + { cvt.rn.f16.f32 %rs183, %f16;} + + // end inline asm + .loc 2 596 3, function_name $L__info_string1, inlined_at 1 62 71 + // begin inline asm + { cvt.rn.f16.f32 %rs184, %f189;} + + // end inline asm + .loc 2 1419 5, function_name $L__info_string3, inlined_at 1 74 22 + // begin inline asm + { mov.b32 %r292, {%rs183,%rs184};} + + // end inline asm + .loc 1 75 5, function_name $L__info_string2, inlined_at 1 227 13 + st.global.u32 [%rd13+60], %r292; + +$L__BB1_194: + .loc 1 232 1 + ret; + +} +.entry _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT_( + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_0, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_1, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_2, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_3, + .param .f32 
_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_4, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_5, + .param .u32 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_6, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_7 +) +{ + .reg .pred %p<5>; + .reg .b16 %rs<12>; + .reg .f32 %f<14>; + .reg .b32 %r<17>; + .reg .b64 %rd<58>; + .loc 1 102 0 + + + ld.param.u64 %rd15, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_0]; + ld.param.u64 %rd18, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_1]; + ld.param.u64 %rd19, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_2]; + ld.param.u64 %rd20, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_3]; + ld.param.f32 %f2, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_4]; + ld.param.u64 %rd16, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_5]; + ld.param.u32 %r1, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_6]; + ld.param.u64 %rd17, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3214dequant_kernelI13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_7]; + .loc 1 113 28 + cvta.to.global.u64 %rd1, %rd19; + cvta.to.global.u64 %rd2, %rd20; + cvta.to.global.u64 %rd3, %rd18; + mov.u32 %r2, %ctaid.x; + mov.u32 %r3, %ntid.x; + mul.wide.u32 %rd21, %r2, %r3; + mov.u32 %r4, %tid.x; + 
cvt.u64.u32 %rd22, %r4; + add.s64 %rd4, %rd21, %rd22; + .loc 1 114 25 + shl.b64 %rd5, %rd4, 1; + .loc 1 115 5 + setp.ge.s64 %p1, %rd5, %rd16; + @%p1 bra $L__BB2_10; + + .loc 1 113 28 + cvta.to.global.u64 %rd23, %rd15; + .loc 1 121 26 + add.s64 %rd24, %rd23, %rd4; + ld.global.nc.u8 %rs1, [%rd24]; + .loc 1 130 30 + cvt.s64.s32 %rd6, %r1; + or.b64 %rd25, %rd5, %rd6; + and.b64 %rd26, %rd25, -4294967296; + setp.eq.s64 %p2, %rd26, 0; + @%p2 bra $L__BB2_3; + + .loc 1 0 30 + div.s64 %rd56, %rd5, %rd6; + bra.uni $L__BB2_4; + +$L__BB2_3: + cvt.u32.u64 %r5, %rd6; + cvt.u32.u64 %r6, %rd5; + div.u32 %r7, %r6, %r5; + cvt.u64.u32 %rd56, %r7; + +$L__BB2_4: + .loc 1 132 24 + add.s64 %rd27, %rd3, %rd56; + ld.global.nc.u8 %rs2, [%rd27]; + cvt.u32.u16 %r8, %rs2; + and.b32 %r9, %r8, 255; + mul.wide.u32 %rd28, %r9, 4; + add.s64 %rd29, %rd2, %rd28; + .loc 1 131 30 + shr.s64 %rd30, %rd56, 63; + shr.u64 %rd31, %rd30, 56; + add.s64 %rd32, %rd56, %rd31; + shr.s64 %rd33, %rd32, 8; + .loc 1 132 24 + shl.b64 %rd34, %rd33, 2; + add.s64 %rd35, %rd1, %rd34; + ld.global.nc.f32 %f3, [%rd35]; + ld.global.nc.f32 %f4, [%rd29]; + mul.f32 %f5, %f4, %f3; + .loc 1 133 20 + shl.b16 %rs3, %rs1, 2; + cvt.u64.u16 %rd36, %rs3; + and.b64 %rd37, %rd36, 60; + mov.u64 %rd38, _ZN48_INTERNAL_848bf537_17_dequant_kernel_cu_622ebb3250_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb325d_nf4E; + add.s64 %rd39, %rd38, %rd37; + ld.const.f32 %f6, [%rd39]; + fma.rn.f32 %f1, %f6, %f5, %f2; + .loc 1 136 25 + add.s64 %rd10, %rd5, 1; + .loc 1 137 5 + setp.lt.s64 %p3, %rd10, %rd16; + .loc 1 113 28 + cvta.to.global.u64 %rd40, %rd17; + .loc 1 149 9 + shl.b64 %rd41, %rd5, 1; + add.s64 %rd11, %rd40, %rd41; + .loc 1 137 5 + @%p3 bra $L__BB2_6; + bra.uni $L__BB2_5; + +$L__BB2_6: + .loc 1 0 5 + or.b64 %rd42, %rd10, %rd6; + and.b64 %rd43, %rd42, -4294967296; + setp.eq.s64 %p4, %rd43, 0; + @%p4 bra $L__BB2_8; + + div.s64 %rd57, %rd10, %rd6; + bra.uni $L__BB2_9; + +$L__BB2_5: + .loc 1 152 22 + .loc 1 63 85, function_name $L__info_string4, 
inlined_at 1 152 22 + .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85 + // begin inline asm + { cvt.rn.bf16.f32 %rs4, %f1;} + + // end inline asm + .loc 1 152 22 + st.global.u16 [%rd11], %rs4; + bra.uni $L__BB2_10; + +$L__BB2_8: + .loc 1 0 22 + cvt.u32.u64 %r10, %rd6; + cvt.u32.u64 %r11, %rd10; + div.u32 %r12, %r11, %r10; + cvt.u64.u32 %rd57, %r12; + +$L__BB2_9: + .loc 1 143 28 + add.s64 %rd44, %rd3, %rd57; + ld.global.nc.u8 %rs9, [%rd44]; + cvt.u32.u16 %r14, %rs9; + and.b32 %r15, %r14, 255; + mul.wide.u32 %rd45, %r15, 4; + add.s64 %rd46, %rd2, %rd45; + .loc 1 142 34 + shr.s64 %rd47, %rd57, 63; + shr.u64 %rd48, %rd47, 56; + add.s64 %rd49, %rd57, %rd48; + shr.s64 %rd50, %rd49, 8; + .loc 1 143 28 + shl.b64 %rd51, %rd50, 2; + add.s64 %rd52, %rd1, %rd51; + ld.global.nc.f32 %f10, [%rd52]; + ld.global.nc.f32 %f11, [%rd46]; + mul.f32 %f12, %f11, %f10; + .loc 1 123 20 + and.b16 %rs10, %rs1, 240; + shr.u16 %rs11, %rs10, 4; + .loc 1 144 24 + cvt.u32.u16 %r16, %rs11; + mul.wide.u32 %rd53, %r16, 4; + add.s64 %rd55, %rd38, %rd53; + ld.const.f32 %f13, [%rd55]; + fma.rn.f32 %f9, %f13, %f12, %f2; + .loc 1 149 36 + .loc 1 63 85, function_name $L__info_string4, inlined_at 1 149 36 + .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85 + // begin inline asm + { cvt.rn.bf16.f32 %rs5, %f1;} + + // end inline asm + .loc 1 149 52 + .loc 1 63 85, function_name $L__info_string4, inlined_at 1 149 52 + .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85 + // begin inline asm + { cvt.rn.bf16.f32 %rs6, %f9;} + + // end inline asm + .loc 1 149 9 + .loc 1 82 29, function_name $L__info_string6, inlined_at 1 149 9 + .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29 + // begin inline asm + { mov.b32 %r13, {%rs5,%rs6};} + + // end inline asm + .loc 1 83 5, function_name $L__info_string6, inlined_at 1 149 9 + st.global.u32 [%rd11], %r13; + +$L__BB2_10: + .loc 1 154 1 + ret; + +} +.entry 
_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT_( + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_0, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_1, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_2, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_3, + .param .f32 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_4, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_5, + .param .u32 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_6, + .param .u64 _ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_7 +) +{ + .reg .pred %p<82>; + .reg .b16 %rs<189>; + .reg .f32 %f<194>; + .reg .b32 %r<327>; + .reg .b64 %rd<664>; + .loc 1 157 0 + + + ld.param.u64 %rd137, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_0]; + ld.param.u64 %rd140, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_1]; + ld.param.u64 %rd141, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_2]; + ld.param.u64 %rd142, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_3]; + ld.param.f32 %f17, 
[_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_4]; + ld.param.u64 %rd138, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_5]; + ld.param.u32 %r64, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_6]; + ld.param.u64 %rd139, [_ZN50_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb3217dequant_kernel_v3I13__nv_bfloat16EEvPKhS3_PKfS5_fliPT__param_7]; + cvta.to.global.u64 %rd1, %rd141; + cvta.to.global.u64 %rd2, %rd142; + cvta.to.global.u64 %rd3, %rd140; + .loc 1 169 23 + mov.u32 %r65, %ctaid.x; + mov.u32 %r66, %ntid.x; + mul.wide.u32 %rd143, %r65, %r66; + mov.u32 %r67, %tid.x; + cvt.u64.u32 %rd144, %r67; + add.s64 %rd145, %rd143, %rd144; + .loc 1 170 29 + shl.b64 %rd4, %rd145, 4; + .loc 1 171 29 + shl.b64 %rd5, %rd145, 5; + .loc 1 172 35 + add.s64 %rd146, %rd138, 1; + shr.u64 %rd147, %rd146, 63; + add.s64 %rd148, %rd146, %rd147; + shr.s64 %rd6, %rd148, 1; + .loc 1 174 5 + setp.ge.s64 %p1, %rd5, %rd138; + @%p1 bra $L__BB3_194; + + .loc 1 178 5 + add.s64 %rd149, %rd4, 16; + setp.gt.s64 %p2, %rd149, %rd6; + cvta.to.global.u64 %rd150, %rd137; + .loc 1 206 13 + add.s64 %rd7, %rd150, %rd4; + .loc 1 178 5 + @%p2 bra $L__BB3_3; + bra.uni $L__BB3_2; + +$L__BB3_3: + .loc 1 206 13 + setp.ge.s64 %p3, %rd4, %rd6; + mov.u32 %r321, 0; + mov.u32 %r322, %r321; + @%p3 bra $L__BB3_5; + + ld.global.nc.u8 %rs17, [%rd7]; + cvt.u32.u16 %r73, %rs17; + and.b32 %r322, %r73, 255; + +$L__BB3_5: + .loc 1 205 29 + add.s64 %rd151, %rd4, 1; + .loc 1 206 13 + setp.ge.s64 %p4, %rd151, %rd6; + @%p4 bra $L__BB3_7; + + ld.global.nc.u8 %rs18, [%rd7+1]; + cvt.u32.u16 %r75, %rs18; + and.b32 %r321, %r75, 255; + +$L__BB3_7: + .loc 1 205 29 + add.s64 %rd152, %rd4, 2; + .loc 1 206 13 + setp.ge.s64 %p5, %rd152, %rd6; + mov.u32 %r319, 0; + mov.u32 %r320, %r319; + @%p5 bra $L__BB3_9; + + ld.global.nc.u8 
%rs19, [%rd7+2]; + cvt.u32.u16 %r77, %rs19; + and.b32 %r320, %r77, 255; + +$L__BB3_9: + .loc 1 205 29 + add.s64 %rd153, %rd4, 3; + .loc 1 206 13 + setp.ge.s64 %p6, %rd153, %rd6; + @%p6 bra $L__BB3_11; + + ld.global.nc.u8 %rs20, [%rd7+3]; + cvt.u32.u16 %r79, %rs20; + and.b32 %r319, %r79, 255; + +$L__BB3_11: + .loc 1 205 29 + add.s64 %rd154, %rd4, 4; + .loc 1 206 13 + setp.ge.s64 %p7, %rd154, %rd6; + mov.u32 %r317, 0; + mov.u32 %r318, %r317; + @%p7 bra $L__BB3_13; + + ld.global.nc.u8 %rs21, [%rd7+4]; + cvt.u32.u16 %r81, %rs21; + and.b32 %r318, %r81, 255; + +$L__BB3_13: + .loc 1 205 29 + add.s64 %rd155, %rd4, 5; + .loc 1 206 13 + setp.ge.s64 %p8, %rd155, %rd6; + @%p8 bra $L__BB3_15; + + ld.global.nc.u8 %rs22, [%rd7+5]; + cvt.u32.u16 %r83, %rs22; + and.b32 %r317, %r83, 255; + +$L__BB3_15: + .loc 1 205 29 + add.s64 %rd156, %rd4, 6; + .loc 1 206 13 + setp.ge.s64 %p9, %rd156, %rd6; + mov.u32 %r315, 0; + mov.u32 %r316, %r315; + @%p9 bra $L__BB3_17; + + ld.global.nc.u8 %rs23, [%rd7+6]; + cvt.u32.u16 %r85, %rs23; + and.b32 %r316, %r85, 255; + +$L__BB3_17: + .loc 1 205 29 + add.s64 %rd157, %rd4, 7; + .loc 1 206 13 + setp.ge.s64 %p10, %rd157, %rd6; + @%p10 bra $L__BB3_19; + + ld.global.nc.u8 %rs24, [%rd7+7]; + cvt.u32.u16 %r87, %rs24; + and.b32 %r315, %r87, 255; + +$L__BB3_19: + .loc 1 205 29 + add.s64 %rd158, %rd4, 8; + .loc 1 206 13 + setp.ge.s64 %p11, %rd158, %rd6; + mov.u32 %r313, 0; + mov.u32 %r314, %r313; + @%p11 bra $L__BB3_21; + + ld.global.nc.u8 %rs25, [%rd7+8]; + cvt.u32.u16 %r89, %rs25; + and.b32 %r314, %r89, 255; + +$L__BB3_21: + .loc 1 205 29 + add.s64 %rd159, %rd4, 9; + .loc 1 206 13 + setp.ge.s64 %p12, %rd159, %rd6; + @%p12 bra $L__BB3_23; + + ld.global.nc.u8 %rs26, [%rd7+9]; + cvt.u32.u16 %r91, %rs26; + and.b32 %r313, %r91, 255; + +$L__BB3_23: + .loc 1 205 29 + add.s64 %rd160, %rd4, 10; + .loc 1 206 13 + setp.ge.s64 %p13, %rd160, %rd6; + mov.u32 %r311, 0; + mov.u32 %r312, %r311; + @%p13 bra $L__BB3_25; + + ld.global.nc.u8 %rs27, [%rd7+10]; + cvt.u32.u16 %r93, 
%rs27; + and.b32 %r312, %r93, 255; + +$L__BB3_25: + .loc 1 205 29 + add.s64 %rd161, %rd4, 11; + .loc 1 206 13 + setp.ge.s64 %p14, %rd161, %rd6; + @%p14 bra $L__BB3_27; + + ld.global.nc.u8 %rs28, [%rd7+11]; + cvt.u32.u16 %r95, %rs28; + and.b32 %r311, %r95, 255; + +$L__BB3_27: + .loc 1 205 29 + add.s64 %rd162, %rd4, 12; + .loc 1 206 13 + setp.ge.s64 %p15, %rd162, %rd6; + mov.u32 %r324, 0; + mov.u32 %r323, %r324; + @%p15 bra $L__BB3_29; + + ld.global.nc.u8 %rs29, [%rd7+12]; + cvt.u32.u16 %r97, %rs29; + and.b32 %r323, %r97, 255; + +$L__BB3_29: + .loc 1 205 29 + add.s64 %rd163, %rd4, 13; + .loc 1 206 13 + setp.ge.s64 %p16, %rd163, %rd6; + @%p16 bra $L__BB3_31; + + ld.global.nc.u8 %rs30, [%rd7+13]; + cvt.u32.u16 %r99, %rs30; + and.b32 %r324, %r99, 255; + +$L__BB3_31: + .loc 1 205 29 + add.s64 %rd164, %rd4, 14; + .loc 1 206 13 + setp.ge.s64 %p17, %rd164, %rd6; + mov.u32 %r326, 0; + mov.u32 %r325, %r326; + @%p17 bra $L__BB3_33; + + ld.global.nc.u8 %rs31, [%rd7+14]; + cvt.u32.u16 %r101, %rs31; + and.b32 %r325, %r101, 255; + +$L__BB3_33: + .loc 1 205 29 + add.s64 %rd165, %rd4, 15; + .loc 1 206 13 + setp.ge.s64 %p18, %rd165, %rd6; + @%p18 bra $L__BB3_35; + + ld.global.nc.u8 %rs32, [%rd7+15]; + cvt.u32.u16 %r103, %rs32; + and.b32 %r326, %r103, 255; + bra.uni $L__BB3_35; + +$L__BB3_2: + .loc 1 180 29 + ld.global.nc.v4.u32 {%r322, %r318, %r314, %r323}, [%rd7]; + .loc 1 186 17 + shr.u32 %r321, %r322, 8; + .loc 1 187 17 + shr.u32 %r320, %r322, 16; + .loc 1 188 17 + shr.u32 %r319, %r322, 24; + .loc 1 186 17 + shr.u32 %r317, %r318, 8; + .loc 1 187 17 + shr.u32 %r316, %r318, 16; + .loc 1 188 17 + shr.u32 %r315, %r318, 24; + .loc 1 186 17 + shr.u32 %r313, %r314, 8; + .loc 1 187 17 + shr.u32 %r312, %r314, 16; + .loc 1 188 17 + shr.u32 %r311, %r314, 24; + .loc 1 186 17 + shr.u32 %r324, %r323, 8; + .loc 1 187 17 + shr.u32 %r325, %r323, 16; + .loc 1 188 17 + shr.u32 %r326, %r323, 24; + +$L__BB3_35: + .loc 1 213 9 + cvt.u16.u32 %rs1, %r326; + cvt.u16.u32 %rs2, %r325; + cvt.u16.u32 %rs3, 
%r324; + cvt.u16.u32 %rs4, %r323; + cvt.u16.u32 %rs5, %r321; + cvt.u16.u32 %rs6, %r320; + cvt.u16.u32 %rs7, %r319; + cvt.u16.u32 %rs8, %r318; + cvt.u16.u32 %rs9, %r317; + cvt.u16.u32 %rs10, %r316; + cvt.u16.u32 %rs11, %r315; + cvt.u16.u32 %rs12, %r314; + cvt.u16.u32 %rs13, %r313; + cvt.u16.u32 %rs14, %r312; + cvt.u16.u32 %rs15, %r311; + cvt.u16.u32 %rs16, %r322; + cvt.s64.s32 %rd8, %r64; + or.b64 %rd166, %rd5, %rd8; + and.b64 %rd167, %rd166, -4294967296; + setp.eq.s64 %p19, %rd167, 0; + @%p19 bra $L__BB3_37; + + .loc 1 0 9 + div.s64 %rd632, %rd5, %rd8; + bra.uni $L__BB3_38; + +$L__BB3_37: + cvt.u32.u64 %r104, %rd8; + cvt.u32.u64 %r105, %rd5; + div.u32 %r106, %r105, %r104; + cvt.u64.u32 %rd632, %r106; + +$L__BB3_38: + .loc 1 219 28 + add.s64 %rd168, %rd3, %rd632; + ld.global.nc.u8 %rs33, [%rd168]; + cvt.u32.u16 %r107, %rs33; + and.b32 %r108, %r107, 255; + mul.wide.u32 %rd169, %r108, 4; + add.s64 %rd170, %rd2, %rd169; + shr.s64 %rd171, %rd632, 63; + shr.u64 %rd172, %rd171, 56; + add.s64 %rd173, %rd632, %rd172; + shr.s64 %rd174, %rd173, 8; + shl.b64 %rd175, %rd174, 2; + add.s64 %rd176, %rd1, %rd175; + ld.global.nc.f32 %f18, [%rd176]; + ld.global.nc.f32 %f19, [%rd170]; + mul.f32 %f20, %f19, %f18; + .loc 1 220 24 + shl.b16 %rs34, %rs16, 2; + cvt.u64.u16 %rd177, %rs34; + and.b64 %rd178, %rd177, 60; + mov.u64 %rd179, _ZN48_INTERNAL_848bf537_17_dequant_kernel_cu_622ebb3250_GLOBAL__N__848bf537_17_dequant_kernel_cu_622ebb325d_nf4E; + add.s64 %rd180, %rd179, %rd178; + ld.const.f32 %f21, [%rd180]; + fma.rn.f32 %f1, %f21, %f20, %f17; + .loc 1 222 29 + add.s64 %rd12, %rd5, 1; + .loc 1 223 9 + setp.lt.s64 %p20, %rd12, %rd138; + cvta.to.global.u64 %rd181, %rd139; + .loc 1 227 13 + shl.b64 %rd182, %rd5, 1; + add.s64 %rd13, %rd181, %rd182; + .loc 1 223 9 + @%p20 bra $L__BB3_40; + bra.uni $L__BB3_39; + +$L__BB3_40: + .loc 1 0 9 + or.b64 %rd183, %rd12, %rd8; + and.b64 %rd184, %rd183, -4294967296; + setp.eq.s64 %p21, %rd184, 0; + @%p21 bra $L__BB3_42; + + div.s64 %rd633, %rd12, %rd8; + 
bra.uni $L__BB3_43; + +$L__BB3_39: + .loc 1 229 26 + .loc 1 63 85, function_name $L__info_string4, inlined_at 1 229 26 + .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85 + // begin inline asm + { cvt.rn.bf16.f32 %rs35, %f1;} + + // end inline asm + .loc 1 229 26 + st.global.u16 [%rd13], %rs35; + bra.uni $L__BB3_44; + +$L__BB3_42: + .loc 1 0 26 + cvt.u32.u64 %r109, %rd8; + cvt.u32.u64 %r110, %rd12; + div.u32 %r111, %r110, %r109; + cvt.u64.u32 %rd633, %r111; + +$L__BB3_43: + .loc 1 225 32 + add.s64 %rd185, %rd3, %rd633; + ld.global.nc.u8 %rs40, [%rd185]; + cvt.u32.u16 %r113, %rs40; + and.b32 %r114, %r113, 255; + mul.wide.u32 %rd186, %r114, 4; + add.s64 %rd187, %rd2, %rd186; + shr.s64 %rd188, %rd633, 63; + shr.u64 %rd189, %rd188, 56; + add.s64 %rd190, %rd633, %rd189; + shr.s64 %rd191, %rd190, 8; + shl.b64 %rd192, %rd191, 2; + add.s64 %rd193, %rd1, %rd192; + ld.global.nc.f32 %f25, [%rd193]; + ld.global.nc.f32 %f26, [%rd187]; + mul.f32 %f27, %f26, %f25; + .loc 1 216 24 + and.b16 %rs41, %rs16, 240; + shr.u16 %rs42, %rs41, 4; + .loc 1 226 28 + cvt.u32.u16 %r115, %rs42; + mul.wide.u32 %rd194, %r115, 4; + add.s64 %rd196, %rd179, %rd194; + ld.const.f32 %f28, [%rd196]; + fma.rn.f32 %f24, %f28, %f27, %f17; + .loc 1 227 40 + .loc 1 63 85, function_name $L__info_string4, inlined_at 1 227 40 + .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85 + // begin inline asm + { cvt.rn.bf16.f32 %rs36, %f1;} + + // end inline asm + .loc 1 227 56 + .loc 1 63 85, function_name $L__info_string4, inlined_at 1 227 56 + .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85 + // begin inline asm + { cvt.rn.bf16.f32 %rs37, %f24;} + + // end inline asm + .loc 1 227 13 + .loc 1 82 29, function_name $L__info_string6, inlined_at 1 227 13 + .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29 + // begin inline asm + { mov.b32 %r112, {%rs36,%rs37};} + + // end inline asm + .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13 + 
st.global.u32 [%rd13], %r112; + +$L__BB3_44: + .loc 1 212 29 + add.s64 %rd17, %rd5, 2; + .loc 1 213 9 + setp.ge.s64 %p22, %rd17, %rd138; + @%p22 bra $L__BB3_194; + + .loc 1 0 9 + or.b64 %rd197, %rd17, %rd8; + and.b64 %rd198, %rd197, -4294967296; + setp.eq.s64 %p23, %rd198, 0; + @%p23 bra $L__BB3_47; + + div.s64 %rd634, %rd17, %rd8; + bra.uni $L__BB3_48; + +$L__BB3_47: + cvt.u32.u64 %r116, %rd8; + cvt.u32.u64 %r117, %rd17; + div.u32 %r118, %r117, %r116; + cvt.u64.u32 %rd634, %r118; + +$L__BB3_48: + .loc 1 219 28 + add.s64 %rd199, %rd3, %rd634; + ld.global.nc.u8 %rs43, [%rd199]; + cvt.u32.u16 %r119, %rs43; + and.b32 %r120, %r119, 255; + mul.wide.u32 %rd200, %r120, 4; + add.s64 %rd201, %rd2, %rd200; + shr.s64 %rd202, %rd634, 63; + shr.u64 %rd203, %rd202, 56; + add.s64 %rd204, %rd634, %rd203; + shr.s64 %rd205, %rd204, 8; + shl.b64 %rd206, %rd205, 2; + add.s64 %rd207, %rd1, %rd206; + ld.global.nc.f32 %f29, [%rd207]; + ld.global.nc.f32 %f30, [%rd201]; + mul.f32 %f31, %f30, %f29; + .loc 1 220 24 + shl.b16 %rs44, %rs5, 2; + cvt.u64.u16 %rd208, %rs44; + and.b64 %rd209, %rd208, 60; + add.s64 %rd211, %rd179, %rd209; + ld.const.f32 %f32, [%rd211]; + fma.rn.f32 %f2, %f32, %f31, %f17; + .loc 1 222 29 + add.s64 %rd21, %rd5, 3; + .loc 1 223 9 + setp.lt.s64 %p24, %rd21, %rd138; + @%p24 bra $L__BB3_50; + bra.uni $L__BB3_49; + +$L__BB3_50: + .loc 1 0 9 + or.b64 %rd212, %rd21, %rd8; + and.b64 %rd213, %rd212, -4294967296; + setp.eq.s64 %p25, %rd213, 0; + @%p25 bra $L__BB3_52; + + div.s64 %rd635, %rd21, %rd8; + bra.uni $L__BB3_53; + +$L__BB3_49: + .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85 + // begin inline asm + { cvt.rn.bf16.f32 %rs45, %f2;} + + // end inline asm + .loc 1 229 26 + st.global.u16 [%rd13+4], %rs45; + bra.uni $L__BB3_54; + +$L__BB3_52: + .loc 1 0 26 + cvt.u32.u64 %r121, %rd8; + cvt.u32.u64 %r122, %rd21; + div.u32 %r123, %r122, %r121; + cvt.u64.u32 %rd635, %r123; + +$L__BB3_53: + .loc 1 225 32 + add.s64 %rd214, %rd3, %rd635; + ld.global.nc.u8 %rs50, 
[%rd214];
+ cvt.u32.u16 %r125, %rs50;
+ and.b32 %r126, %r125, 255;
+ mul.wide.u32 %rd215, %r126, 4;
+ add.s64 %rd216, %rd2, %rd215;
+ shr.s64 %rd217, %rd635, 63;
+ shr.u64 %rd218, %rd217, 56;
+ add.s64 %rd219, %rd635, %rd218;
+ shr.s64 %rd220, %rd219, 8;
+ shl.b64 %rd221, %rd220, 2;
+ add.s64 %rd222, %rd1, %rd221;
+ ld.global.nc.f32 %f36, [%rd222];
+ ld.global.nc.f32 %f37, [%rd216];
+ mul.f32 %f38, %f37, %f36;
+ .loc 1 216 24
+ and.b16 %rs51, %rs5, 240;
+ shr.u16 %rs52, %rs51, 4;
+ .loc 1 226 28
+ cvt.u32.u16 %r127, %rs52;
+ mul.wide.u32 %rd223, %r127, 4;
+ add.s64 %rd225, %rd179, %rd223;
+ ld.const.f32 %f39, [%rd225];
+ fma.rn.f32 %f35, %f39, %f38, %f17;
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs46, %f2;}
+
+ // end inline asm
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs47, %f35;}
+
+ // end inline asm
+ .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29
+ // begin inline asm
+ { mov.b32 %r124, {%rs46,%rs47};}
+
+ // end inline asm
+ .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13
+ st.global.u32 [%rd13+4], %r124;
+
+$L__BB3_54:
+ .loc 1 212 29
+ add.s64 %rd25, %rd5, 4;
+ .loc 1 213 9
+ setp.ge.s64 %p26, %rd25, %rd138;
+ @%p26 bra $L__BB3_194;
+
+ .loc 1 0 9
+ or.b64 %rd226, %rd25, %rd8;
+ and.b64 %rd227, %rd226, -4294967296;
+ setp.eq.s64 %p27, %rd227, 0;
+ @%p27 bra $L__BB3_57;
+
+ div.s64 %rd636, %rd25, %rd8;
+ bra.uni $L__BB3_58;
+
+$L__BB3_57:
+ cvt.u32.u64 %r128, %rd8;
+ cvt.u32.u64 %r129, %rd25;
+ div.u32 %r130, %r129, %r128;
+ cvt.u64.u32 %rd636, %r130;
+
+$L__BB3_58:
+ .loc 1 219 28
+ add.s64 %rd228, %rd3, %rd636;
+ ld.global.nc.u8 %rs53, [%rd228];
+ cvt.u32.u16 %r131, %rs53;
+ and.b32 %r132, %r131, 255;
+ mul.wide.u32 %rd229, %r132, 4;
+ add.s64 %rd230, %rd2, %rd229;
+ shr.s64 %rd231, %rd636, 63;
+ shr.u64 %rd232, %rd231, 56;
+ add.s64 %rd233, %rd636, %rd232;
+ shr.s64 %rd234, %rd233, 8;
+ shl.b64 %rd235, %rd234, 2;
+ add.s64 %rd236, %rd1, %rd235;
+ ld.global.nc.f32 %f40, [%rd236];
+ ld.global.nc.f32 %f41, [%rd230];
+ mul.f32 %f42, %f41, %f40;
+ .loc 1 220 24
+ shl.b16 %rs54, %rs6, 2;
+ cvt.u64.u16 %rd237, %rs54;
+ and.b64 %rd238, %rd237, 60;
+ add.s64 %rd240, %rd179, %rd238;
+ ld.const.f32 %f43, [%rd240];
+ fma.rn.f32 %f3, %f43, %f42, %f17;
+ .loc 1 222 29
+ add.s64 %rd29, %rd5, 5;
+ .loc 1 223 9
+ setp.lt.s64 %p28, %rd29, %rd138;
+ @%p28 bra $L__BB3_60;
+ bra.uni $L__BB3_59;
+
+$L__BB3_60:
+ .loc 1 0 9
+ or.b64 %rd241, %rd29, %rd8;
+ and.b64 %rd242, %rd241, -4294967296;
+ setp.eq.s64 %p29, %rd242, 0;
+ @%p29 bra $L__BB3_62;
+
+ div.s64 %rd637, %rd29, %rd8;
+ bra.uni $L__BB3_63;
+
+$L__BB3_59:
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs55, %f3;}
+
+ // end inline asm
+ .loc 1 229 26
+ st.global.u16 [%rd13+8], %rs55;
+ bra.uni $L__BB3_64;
+
+$L__BB3_62:
+ .loc 1 0 26
+ cvt.u32.u64 %r133, %rd8;
+ cvt.u32.u64 %r134, %rd29;
+ div.u32 %r135, %r134, %r133;
+ cvt.u64.u32 %rd637, %r135;
+
+$L__BB3_63:
+ .loc 1 225 32
+ add.s64 %rd243, %rd3, %rd637;
+ ld.global.nc.u8 %rs60, [%rd243];
+ cvt.u32.u16 %r137, %rs60;
+ and.b32 %r138, %r137, 255;
+ mul.wide.u32 %rd244, %r138, 4;
+ add.s64 %rd245, %rd2, %rd244;
+ shr.s64 %rd246, %rd637, 63;
+ shr.u64 %rd247, %rd246, 56;
+ add.s64 %rd248, %rd637, %rd247;
+ shr.s64 %rd249, %rd248, 8;
+ shl.b64 %rd250, %rd249, 2;
+ add.s64 %rd251, %rd1, %rd250;
+ ld.global.nc.f32 %f47, [%rd251];
+ ld.global.nc.f32 %f48, [%rd245];
+ mul.f32 %f49, %f48, %f47;
+ .loc 1 216 24
+ and.b16 %rs61, %rs6, 240;
+ shr.u16 %rs62, %rs61, 4;
+ .loc 1 226 28
+ cvt.u32.u16 %r139, %rs62;
+ mul.wide.u32 %rd252, %r139, 4;
+ add.s64 %rd254, %rd179, %rd252;
+ ld.const.f32 %f50, [%rd254];
+ fma.rn.f32 %f46, %f50, %f49, %f17;
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs56, %f3;}
+
+ // end inline asm
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs57, %f46;}
+
+ // end inline asm
+ .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29
+ // begin inline asm
+ { mov.b32 %r136, {%rs56,%rs57};}
+
+ // end inline asm
+ .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13
+ st.global.u32 [%rd13+8], %r136;
+
+$L__BB3_64:
+ .loc 1 212 29
+ add.s64 %rd33, %rd5, 6;
+ .loc 1 213 9
+ setp.ge.s64 %p30, %rd33, %rd138;
+ @%p30 bra $L__BB3_194;
+
+ .loc 1 0 9
+ or.b64 %rd255, %rd33, %rd8;
+ and.b64 %rd256, %rd255, -4294967296;
+ setp.eq.s64 %p31, %rd256, 0;
+ @%p31 bra $L__BB3_67;
+
+ div.s64 %rd638, %rd33, %rd8;
+ bra.uni $L__BB3_68;
+
+$L__BB3_67:
+ cvt.u32.u64 %r140, %rd8;
+ cvt.u32.u64 %r141, %rd33;
+ div.u32 %r142, %r141, %r140;
+ cvt.u64.u32 %rd638, %r142;
+
+$L__BB3_68:
+ .loc 1 219 28
+ add.s64 %rd257, %rd3, %rd638;
+ ld.global.nc.u8 %rs63, [%rd257];
+ cvt.u32.u16 %r143, %rs63;
+ and.b32 %r144, %r143, 255;
+ mul.wide.u32 %rd258, %r144, 4;
+ add.s64 %rd259, %rd2, %rd258;
+ shr.s64 %rd260, %rd638, 63;
+ shr.u64 %rd261, %rd260, 56;
+ add.s64 %rd262, %rd638, %rd261;
+ shr.s64 %rd263, %rd262, 8;
+ shl.b64 %rd264, %rd263, 2;
+ add.s64 %rd265, %rd1, %rd264;
+ ld.global.nc.f32 %f51, [%rd265];
+ ld.global.nc.f32 %f52, [%rd259];
+ mul.f32 %f53, %f52, %f51;
+ .loc 1 220 24
+ shl.b16 %rs64, %rs7, 2;
+ cvt.u64.u16 %rd266, %rs64;
+ and.b64 %rd267, %rd266, 60;
+ add.s64 %rd269, %rd179, %rd267;
+ ld.const.f32 %f54, [%rd269];
+ fma.rn.f32 %f4, %f54, %f53, %f17;
+ .loc 1 222 29
+ add.s64 %rd37, %rd5, 7;
+ .loc 1 223 9
+ setp.lt.s64 %p32, %rd37, %rd138;
+ @%p32 bra $L__BB3_70;
+ bra.uni $L__BB3_69;
+
+$L__BB3_70:
+ .loc 1 0 9
+ or.b64 %rd270, %rd37, %rd8;
+ and.b64 %rd271, %rd270, -4294967296;
+ setp.eq.s64 %p33, %rd271, 0;
+ @%p33 bra $L__BB3_72;
+
+ div.s64 %rd639, %rd37, %rd8;
+ bra.uni $L__BB3_73;
+
+$L__BB3_69:
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs65, %f4;}
+
+ // end inline asm
+ .loc 1 229 26
+ st.global.u16 [%rd13+12], %rs65;
+ bra.uni $L__BB3_74;
+
+$L__BB3_72:
+ .loc 1 0 26
+ cvt.u32.u64 %r145, %rd8;
+ cvt.u32.u64 %r146, %rd37;
+ div.u32 %r147, %r146, %r145;
+ cvt.u64.u32 %rd639, %r147;
+
+$L__BB3_73:
+ .loc 1 225 32
+ add.s64 %rd272, %rd3, %rd639;
+ ld.global.nc.u8 %rs70, [%rd272];
+ cvt.u32.u16 %r149, %rs70;
+ and.b32 %r150, %r149, 255;
+ mul.wide.u32 %rd273, %r150, 4;
+ add.s64 %rd274, %rd2, %rd273;
+ shr.s64 %rd275, %rd639, 63;
+ shr.u64 %rd276, %rd275, 56;
+ add.s64 %rd277, %rd639, %rd276;
+ shr.s64 %rd278, %rd277, 8;
+ shl.b64 %rd279, %rd278, 2;
+ add.s64 %rd280, %rd1, %rd279;
+ ld.global.nc.f32 %f58, [%rd280];
+ ld.global.nc.f32 %f59, [%rd274];
+ mul.f32 %f60, %f59, %f58;
+ .loc 1 216 24
+ shr.u16 %rs71, %rs7, 4;
+ .loc 1 226 28
+ cvt.u32.u16 %r151, %rs71;
+ mul.wide.u32 %rd281, %r151, 4;
+ add.s64 %rd283, %rd179, %rd281;
+ ld.const.f32 %f61, [%rd283];
+ fma.rn.f32 %f57, %f61, %f60, %f17;
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs66, %f4;}
+
+ // end inline asm
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs67, %f57;}
+
+ // end inline asm
+ .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29
+ // begin inline asm
+ { mov.b32 %r148, {%rs66,%rs67};}
+
+ // end inline asm
+ .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13
+ st.global.u32 [%rd13+12], %r148;
+
+$L__BB3_74:
+ .loc 1 212 29
+ add.s64 %rd41, %rd5, 8;
+ .loc 1 213 9
+ setp.ge.s64 %p34, %rd41, %rd138;
+ @%p34 bra $L__BB3_194;
+
+ .loc 1 0 9
+ or.b64 %rd284, %rd41, %rd8;
+ and.b64 %rd285, %rd284, -4294967296;
+ setp.eq.s64 %p35, %rd285, 0;
+ @%p35 bra $L__BB3_77;
+
+ div.s64 %rd640, %rd41, %rd8;
+ bra.uni $L__BB3_78;
+
+$L__BB3_77:
+ cvt.u32.u64 %r152, %rd8;
+ cvt.u32.u64 %r153, %rd41;
+ div.u32 %r154, %r153, %r152;
+ cvt.u64.u32 %rd640, %r154;
+
+$L__BB3_78:
+ .loc 1 219 28
+ add.s64 %rd286, %rd3, %rd640;
+ ld.global.nc.u8 %rs72, [%rd286];
+ cvt.u32.u16 %r155, %rs72;
+ and.b32 %r156, %r155, 255;
+ mul.wide.u32 %rd287, %r156, 4;
+ add.s64 %rd288, %rd2, %rd287;
+ shr.s64 %rd289, %rd640, 63;
+ shr.u64 %rd290, %rd289, 56;
+ add.s64 %rd291, %rd640, %rd290;
+ shr.s64 %rd292, %rd291, 8;
+ shl.b64 %rd293, %rd292, 2;
+ add.s64 %rd294, %rd1, %rd293;
+ ld.global.nc.f32 %f62, [%rd294];
+ ld.global.nc.f32 %f63, [%rd288];
+ mul.f32 %f64, %f63, %f62;
+ .loc 1 220 24
+ shl.b16 %rs73, %rs8, 2;
+ cvt.u64.u16 %rd295, %rs73;
+ and.b64 %rd296, %rd295, 60;
+ add.s64 %rd298, %rd179, %rd296;
+ ld.const.f32 %f65, [%rd298];
+ fma.rn.f32 %f5, %f65, %f64, %f17;
+ .loc 1 222 29
+ add.s64 %rd45, %rd5, 9;
+ .loc 1 223 9
+ setp.lt.s64 %p36, %rd45, %rd138;
+ @%p36 bra $L__BB3_80;
+ bra.uni $L__BB3_79;
+
+$L__BB3_80:
+ .loc 1 0 9
+ or.b64 %rd299, %rd45, %rd8;
+ and.b64 %rd300, %rd299, -4294967296;
+ setp.eq.s64 %p37, %rd300, 0;
+ @%p37 bra $L__BB3_82;
+
+ div.s64 %rd641, %rd45, %rd8;
+ bra.uni $L__BB3_83;
+
+$L__BB3_79:
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs74, %f5;}
+
+ // end inline asm
+ .loc 1 229 26
+ st.global.u16 [%rd13+16], %rs74;
+ bra.uni $L__BB3_84;
+
+$L__BB3_82:
+ .loc 1 0 26
+ cvt.u32.u64 %r157, %rd8;
+ cvt.u32.u64 %r158, %rd45;
+ div.u32 %r159, %r158, %r157;
+ cvt.u64.u32 %rd641, %r159;
+
+$L__BB3_83:
+ .loc 1 225 32
+ add.s64 %rd301, %rd3, %rd641;
+ ld.global.nc.u8 %rs79, [%rd301];
+ cvt.u32.u16 %r161, %rs79;
+ and.b32 %r162, %r161, 255;
+ mul.wide.u32 %rd302, %r162, 4;
+ add.s64 %rd303, %rd2, %rd302;
+ shr.s64 %rd304, %rd641, 63;
+ shr.u64 %rd305, %rd304, 56;
+ add.s64 %rd306, %rd641, %rd305;
+ shr.s64 %rd307, %rd306, 8;
+ shl.b64 %rd308, %rd307, 2;
+ add.s64 %rd309, %rd1, %rd308;
+ ld.global.nc.f32 %f69, [%rd309];
+ ld.global.nc.f32 %f70, [%rd303];
+ mul.f32 %f71, %f70, %f69;
+ .loc 1 216 24
+ and.b16 %rs80, %rs8, 240;
+ shr.u16 %rs81, %rs80, 4;
+ .loc 1 226 28
+ cvt.u32.u16 %r163, %rs81;
+ mul.wide.u32 %rd310, %r163, 4;
+ add.s64 %rd312, %rd179, %rd310;
+ ld.const.f32 %f72, [%rd312];
+ fma.rn.f32 %f68, %f72, %f71, %f17;
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs75, %f5;}
+
+ // end inline asm
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs76, %f68;}
+
+ // end inline asm
+ .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29
+ // begin inline asm
+ { mov.b32 %r160, {%rs75,%rs76};}
+
+ // end inline asm
+ .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13
+ st.global.u32 [%rd13+16], %r160;
+
+$L__BB3_84:
+ .loc 1 212 29
+ add.s64 %rd49, %rd5, 10;
+ .loc 1 213 9
+ setp.ge.s64 %p38, %rd49, %rd138;
+ @%p38 bra $L__BB3_194;
+
+ .loc 1 0 9
+ or.b64 %rd313, %rd49, %rd8;
+ and.b64 %rd314, %rd313, -4294967296;
+ setp.eq.s64 %p39, %rd314, 0;
+ @%p39 bra $L__BB3_87;
+
+ div.s64 %rd642, %rd49, %rd8;
+ bra.uni $L__BB3_88;
+
+$L__BB3_87:
+ cvt.u32.u64 %r164, %rd8;
+ cvt.u32.u64 %r165, %rd49;
+ div.u32 %r166, %r165, %r164;
+ cvt.u64.u32 %rd642, %r166;
+
+$L__BB3_88:
+ .loc 1 219 28
+ add.s64 %rd315, %rd3, %rd642;
+ ld.global.nc.u8 %rs82, [%rd315];
+ cvt.u32.u16 %r167, %rs82;
+ and.b32 %r168, %r167, 255;
+ mul.wide.u32 %rd316, %r168, 4;
+ add.s64 %rd317, %rd2, %rd316;
+ shr.s64 %rd318, %rd642, 63;
+ shr.u64 %rd319, %rd318, 56;
+ add.s64 %rd320, %rd642, %rd319;
+ shr.s64 %rd321, %rd320, 8;
+ shl.b64 %rd322, %rd321, 2;
+ add.s64 %rd323, %rd1, %rd322;
+ ld.global.nc.f32 %f73, [%rd323];
+ ld.global.nc.f32 %f74, [%rd317];
+ mul.f32 %f75, %f74, %f73;
+ .loc 1 220 24
+ shl.b16 %rs83, %rs9, 2;
+ cvt.u64.u16 %rd324, %rs83;
+ and.b64 %rd325, %rd324, 60;
+ add.s64 %rd327, %rd179, %rd325;
+ ld.const.f32 %f76, [%rd327];
+ fma.rn.f32 %f6, %f76, %f75, %f17;
+ .loc 1 222 29
+ add.s64 %rd53, %rd5, 11;
+ .loc 1 223 9
+ setp.lt.s64 %p40, %rd53, %rd138;
+ @%p40 bra $L__BB3_90;
+ bra.uni $L__BB3_89;
+
+$L__BB3_90:
+ .loc 1 0 9
+ or.b64 %rd328, %rd53, %rd8;
+ and.b64 %rd329, %rd328, -4294967296;
+ setp.eq.s64 %p41, %rd329, 0;
+ @%p41 bra $L__BB3_92;
+
+ div.s64 %rd643, %rd53, %rd8;
+ bra.uni $L__BB3_93;
+
+$L__BB3_89:
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs84, %f6;}
+
+ // end inline asm
+ .loc 1 229 26
+ st.global.u16 [%rd13+20], %rs84;
+ bra.uni $L__BB3_94;
+
+$L__BB3_92:
+ .loc 1 0 26
+ cvt.u32.u64 %r169, %rd8;
+ cvt.u32.u64 %r170, %rd53;
+ div.u32 %r171, %r170, %r169;
+ cvt.u64.u32 %rd643, %r171;
+
+$L__BB3_93:
+ .loc 1 225 32
+ add.s64 %rd330, %rd3, %rd643;
+ ld.global.nc.u8 %rs89, [%rd330];
+ cvt.u32.u16 %r173, %rs89;
+ and.b32 %r174, %r173, 255;
+ mul.wide.u32 %rd331, %r174, 4;
+ add.s64 %rd332, %rd2, %rd331;
+ shr.s64 %rd333, %rd643, 63;
+ shr.u64 %rd334, %rd333, 56;
+ add.s64 %rd335, %rd643, %rd334;
+ shr.s64 %rd336, %rd335, 8;
+ shl.b64 %rd337, %rd336, 2;
+ add.s64 %rd338, %rd1, %rd337;
+ ld.global.nc.f32 %f80, [%rd338];
+ ld.global.nc.f32 %f81, [%rd332];
+ mul.f32 %f82, %f81, %f80;
+ .loc 1 216 24
+ and.b16 %rs90, %rs9, 240;
+ shr.u16 %rs91, %rs90, 4;
+ .loc 1 226 28
+ cvt.u32.u16 %r175, %rs91;
+ mul.wide.u32 %rd339, %r175, 4;
+ add.s64 %rd341, %rd179, %rd339;
+ ld.const.f32 %f83, [%rd341];
+ fma.rn.f32 %f79, %f83, %f82, %f17;
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs85, %f6;}
+
+ // end inline asm
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs86, %f79;}
+
+ // end inline asm
+ .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29
+ // begin inline asm
+ { mov.b32 %r172, {%rs85,%rs86};}
+
+ // end inline asm
+ .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13
+ st.global.u32 [%rd13+20], %r172;
+
+$L__BB3_94:
+ .loc 1 212 29
+ add.s64 %rd57, %rd5, 12;
+ .loc 1 213 9
+ setp.ge.s64 %p42, %rd57, %rd138;
+ @%p42 bra $L__BB3_194;
+
+ .loc 1 0 9
+ or.b64 %rd342, %rd57, %rd8;
+ and.b64 %rd343, %rd342, -4294967296;
+ setp.eq.s64 %p43, %rd343, 0;
+ @%p43 bra $L__BB3_97;
+
+ div.s64 %rd644, %rd57, %rd8;
+ bra.uni $L__BB3_98;
+
+$L__BB3_97:
+ cvt.u32.u64 %r176, %rd8;
+ cvt.u32.u64 %r177, %rd57;
+ div.u32 %r178, %r177, %r176;
+ cvt.u64.u32 %rd644, %r178;
+
+$L__BB3_98:
+ .loc 1 219 28
+ add.s64 %rd344, %rd3, %rd644;
+ ld.global.nc.u8 %rs92, [%rd344];
+ cvt.u32.u16 %r179, %rs92;
+ and.b32 %r180, %r179, 255;
+ mul.wide.u32 %rd345, %r180, 4;
+ add.s64 %rd346, %rd2, %rd345;
+ shr.s64 %rd347, %rd644, 63;
+ shr.u64 %rd348, %rd347, 56;
+ add.s64 %rd349, %rd644, %rd348;
+ shr.s64 %rd350, %rd349, 8;
+ shl.b64 %rd351, %rd350, 2;
+ add.s64 %rd352, %rd1, %rd351;
+ ld.global.nc.f32 %f84, [%rd352];
+ ld.global.nc.f32 %f85, [%rd346];
+ mul.f32 %f86, %f85, %f84;
+ .loc 1 220 24
+ shl.b16 %rs93, %rs10, 2;
+ cvt.u64.u16 %rd353, %rs93;
+ and.b64 %rd354, %rd353, 60;
+ add.s64 %rd356, %rd179, %rd354;
+ ld.const.f32 %f87, [%rd356];
+ fma.rn.f32 %f7, %f87, %f86, %f17;
+ .loc 1 222 29
+ add.s64 %rd61, %rd5, 13;
+ .loc 1 223 9
+ setp.lt.s64 %p44, %rd61, %rd138;
+ @%p44 bra $L__BB3_100;
+ bra.uni $L__BB3_99;
+
+$L__BB3_100:
+ .loc 1 0 9
+ or.b64 %rd357, %rd61, %rd8;
+ and.b64 %rd358, %rd357, -4294967296;
+ setp.eq.s64 %p45, %rd358, 0;
+ @%p45 bra $L__BB3_102;
+
+ div.s64 %rd645, %rd61, %rd8;
+ bra.uni $L__BB3_103;
+
+$L__BB3_99:
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs94, %f7;}
+
+ // end inline asm
+ .loc 1 229 26
+ st.global.u16 [%rd13+24], %rs94;
+ bra.uni $L__BB3_104;
+
+$L__BB3_102:
+ .loc 1 0 26
+ cvt.u32.u64 %r181, %rd8;
+ cvt.u32.u64 %r182, %rd61;
+ div.u32 %r183, %r182, %r181;
+ cvt.u64.u32 %rd645, %r183;
+
+$L__BB3_103:
+ .loc 1 225 32
+ add.s64 %rd359, %rd3, %rd645;
+ ld.global.nc.u8 %rs99, [%rd359];
+ cvt.u32.u16 %r185, %rs99;
+ and.b32 %r186, %r185, 255;
+ mul.wide.u32 %rd360, %r186, 4;
+ add.s64 %rd361, %rd2, %rd360;
+ shr.s64 %rd362, %rd645, 63;
+ shr.u64 %rd363, %rd362, 56;
+ add.s64 %rd364, %rd645, %rd363;
+ shr.s64 %rd365, %rd364, 8;
+ shl.b64 %rd366, %rd365, 2;
+ add.s64 %rd367, %rd1, %rd366;
+ ld.global.nc.f32 %f91, [%rd367];
+ ld.global.nc.f32 %f92, [%rd361];
+ mul.f32 %f93, %f92, %f91;
+ .loc 1 216 24
+ and.b16 %rs100, %rs10, 240;
+ shr.u16 %rs101, %rs100, 4;
+ .loc 1 226 28
+ cvt.u32.u16 %r187, %rs101;
+ mul.wide.u32 %rd368, %r187, 4;
+ add.s64 %rd370, %rd179, %rd368;
+ ld.const.f32 %f94, [%rd370];
+ fma.rn.f32 %f90, %f94, %f93, %f17;
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs95, %f7;}
+
+ // end inline asm
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs96, %f90;}
+
+ // end inline asm
+ .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29
+ // begin inline asm
+ { mov.b32 %r184, {%rs95,%rs96};}
+
+ // end inline asm
+ .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13
+ st.global.u32 [%rd13+24], %r184;
+
+$L__BB3_104:
+ .loc 1 212 29
+ add.s64 %rd65, %rd5, 14;
+ .loc 1 213 9
+ setp.ge.s64 %p46, %rd65, %rd138;
+ @%p46 bra $L__BB3_194;
+
+ .loc 1 0 9
+ or.b64 %rd371, %rd65, %rd8;
+ and.b64 %rd372, %rd371, -4294967296;
+ setp.eq.s64 %p47, %rd372, 0;
+ @%p47 bra $L__BB3_107;
+
+ div.s64 %rd646, %rd65, %rd8;
+ bra.uni $L__BB3_108;
+
+$L__BB3_107:
+ cvt.u32.u64 %r188, %rd8;
+ cvt.u32.u64 %r189, %rd65;
+ div.u32 %r190, %r189, %r188;
+ cvt.u64.u32 %rd646, %r190;
+
+$L__BB3_108:
+ .loc 1 219 28
+ add.s64 %rd373, %rd3, %rd646;
+ ld.global.nc.u8 %rs102, [%rd373];
+ cvt.u32.u16 %r191, %rs102;
+ and.b32 %r192, %r191, 255;
+ mul.wide.u32 %rd374, %r192, 4;
+ add.s64 %rd375, %rd2, %rd374;
+ shr.s64 %rd376, %rd646, 63;
+ shr.u64 %rd377, %rd376, 56;
+ add.s64 %rd378, %rd646, %rd377;
+ shr.s64 %rd379, %rd378, 8;
+ shl.b64 %rd380, %rd379, 2;
+ add.s64 %rd381, %rd1, %rd380;
+ ld.global.nc.f32 %f95, [%rd381];
+ ld.global.nc.f32 %f96, [%rd375];
+ mul.f32 %f97, %f96, %f95;
+ .loc 1 220 24
+ shl.b16 %rs103, %rs11, 2;
+ cvt.u64.u16 %rd382, %rs103;
+ and.b64 %rd383, %rd382, 60;
+ add.s64 %rd385, %rd179, %rd383;
+ ld.const.f32 %f98, [%rd385];
+ fma.rn.f32 %f8, %f98, %f97, %f17;
+ .loc 1 222 29
+ add.s64 %rd69, %rd5, 15;
+ .loc 1 223 9
+ setp.lt.s64 %p48, %rd69, %rd138;
+ @%p48 bra $L__BB3_110;
+ bra.uni $L__BB3_109;
+
+$L__BB3_110:
+ .loc 1 0 9
+ or.b64 %rd386, %rd69, %rd8;
+ and.b64 %rd387, %rd386, -4294967296;
+ setp.eq.s64 %p49, %rd387, 0;
+ @%p49 bra $L__BB3_112;
+
+ div.s64 %rd647, %rd69, %rd8;
+ bra.uni $L__BB3_113;
+
+$L__BB3_109:
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs104, %f8;}
+
+ // end inline asm
+ .loc 1 229 26
+ st.global.u16 [%rd13+28], %rs104;
+ bra.uni $L__BB3_114;
+
+$L__BB3_112:
+ .loc 1 0 26
+ cvt.u32.u64 %r193, %rd8;
+ cvt.u32.u64 %r194, %rd69;
+ div.u32 %r195, %r194, %r193;
+ cvt.u64.u32 %rd647, %r195;
+
+$L__BB3_113:
+ .loc 1 225 32
+ add.s64 %rd388, %rd3, %rd647;
+ ld.global.nc.u8 %rs109, [%rd388];
+ cvt.u32.u16 %r197, %rs109;
+ and.b32 %r198, %r197, 255;
+ mul.wide.u32 %rd389, %r198, 4;
+ add.s64 %rd390, %rd2, %rd389;
+ shr.s64 %rd391, %rd647, 63;
+ shr.u64 %rd392, %rd391, 56;
+ add.s64 %rd393, %rd647, %rd392;
+ shr.s64 %rd394, %rd393, 8;
+ shl.b64 %rd395, %rd394, 2;
+ add.s64 %rd396, %rd1, %rd395;
+ ld.global.nc.f32 %f102, [%rd396];
+ ld.global.nc.f32 %f103, [%rd390];
+ mul.f32 %f104, %f103, %f102;
+ .loc 1 216 24
+ shr.u16 %rs110, %rs11, 4;
+ .loc 1 226 28
+ cvt.u32.u16 %r199, %rs110;
+ mul.wide.u32 %rd397, %r199, 4;
+ add.s64 %rd399, %rd179, %rd397;
+ ld.const.f32 %f105, [%rd399];
+ fma.rn.f32 %f101, %f105, %f104, %f17;
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs105, %f8;}
+
+ // end inline asm
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs106, %f101;}
+
+ // end inline asm
+ .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29
+ // begin inline asm
+ { mov.b32 %r196, {%rs105,%rs106};}
+
+ // end inline asm
+ .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13
+ st.global.u32 [%rd13+28], %r196;
+
+$L__BB3_114:
+ .loc 1 212 29
+ add.s64 %rd73, %rd5, 16;
+ .loc 1 213 9
+ setp.ge.s64 %p50, %rd73, %rd138;
+ @%p50 bra $L__BB3_194;
+
+ .loc 1 0 9
+ or.b64 %rd400, %rd73, %rd8;
+ and.b64 %rd401, %rd400, -4294967296;
+ setp.eq.s64 %p51, %rd401, 0;
+ @%p51 bra $L__BB3_117;
+
+ div.s64 %rd648, %rd73, %rd8;
+ bra.uni $L__BB3_118;
+
+$L__BB3_117:
+ cvt.u32.u64 %r200, %rd8;
+ cvt.u32.u64 %r201, %rd73;
+ div.u32 %r202, %r201, %r200;
+ cvt.u64.u32 %rd648, %r202;
+
+$L__BB3_118:
+ .loc 1 219 28
+ add.s64 %rd402, %rd3, %rd648;
+ ld.global.nc.u8 %rs111, [%rd402];
+ cvt.u32.u16 %r203, %rs111;
+ and.b32 %r204, %r203, 255;
+ mul.wide.u32 %rd403, %r204, 4;
+ add.s64 %rd404, %rd2, %rd403;
+ shr.s64 %rd405, %rd648, 63;
+ shr.u64 %rd406, %rd405, 56;
+ add.s64 %rd407, %rd648, %rd406;
+ shr.s64 %rd408, %rd407, 8;
+ shl.b64 %rd409, %rd408, 2;
+ add.s64 %rd410, %rd1, %rd409;
+ ld.global.nc.f32 %f106, [%rd410];
+ ld.global.nc.f32 %f107, [%rd404];
+ mul.f32 %f108, %f107, %f106;
+ .loc 1 220 24
+ shl.b16 %rs112, %rs12, 2;
+ cvt.u64.u16 %rd411, %rs112;
+ and.b64 %rd412, %rd411, 60;
+ add.s64 %rd414, %rd179, %rd412;
+ ld.const.f32 %f109, [%rd414];
+ fma.rn.f32 %f9, %f109, %f108, %f17;
+ .loc 1 222 29
+ add.s64 %rd77, %rd5, 17;
+ .loc 1 223 9
+ setp.lt.s64 %p52, %rd77, %rd138;
+ @%p52 bra $L__BB3_120;
+ bra.uni $L__BB3_119;
+
+$L__BB3_120:
+ .loc 1 0 9
+ or.b64 %rd415, %rd77, %rd8;
+ and.b64 %rd416, %rd415, -4294967296;
+ setp.eq.s64 %p53, %rd416, 0;
+ @%p53 bra $L__BB3_122;
+
+ div.s64 %rd649, %rd77, %rd8;
+ bra.uni $L__BB3_123;
+
+$L__BB3_119:
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs113, %f9;}
+
+ // end inline asm
+ .loc 1 229 26
+ st.global.u16 [%rd13+32], %rs113;
+ bra.uni $L__BB3_124;
+
+$L__BB3_122:
+ .loc 1 0 26
+ cvt.u32.u64 %r205, %rd8;
+ cvt.u32.u64 %r206, %rd77;
+ div.u32 %r207, %r206, %r205;
+ cvt.u64.u32 %rd649, %r207;
+
+$L__BB3_123:
+ .loc 1 225 32
+ add.s64 %rd417, %rd3, %rd649;
+ ld.global.nc.u8 %rs118, [%rd417];
+ cvt.u32.u16 %r209, %rs118;
+ and.b32 %r210, %r209, 255;
+ mul.wide.u32 %rd418, %r210, 4;
+ add.s64 %rd419, %rd2, %rd418;
+ shr.s64 %rd420, %rd649, 63;
+ shr.u64 %rd421, %rd420, 56;
+ add.s64 %rd422, %rd649, %rd421;
+ shr.s64 %rd423, %rd422, 8;
+ shl.b64 %rd424, %rd423, 2;
+ add.s64 %rd425, %rd1, %rd424;
+ ld.global.nc.f32 %f113, [%rd425];
+ ld.global.nc.f32 %f114, [%rd419];
+ mul.f32 %f115, %f114, %f113;
+ .loc 1 216 24
+ and.b16 %rs119, %rs12, 240;
+ shr.u16 %rs120, %rs119, 4;
+ .loc 1 226 28
+ cvt.u32.u16 %r211, %rs120;
+ mul.wide.u32 %rd426, %r211, 4;
+ add.s64 %rd428, %rd179, %rd426;
+ ld.const.f32 %f116, [%rd428];
+ fma.rn.f32 %f112, %f116, %f115, %f17;
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs114, %f9;}
+
+ // end inline asm
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs115, %f112;}
+
+ // end inline asm
+ .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29
+ // begin inline asm
+ { mov.b32 %r208, {%rs114,%rs115};}
+
+ // end inline asm
+ .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13
+ st.global.u32 [%rd13+32], %r208;
+
+$L__BB3_124:
+ .loc 1 212 29
+ add.s64 %rd81, %rd5, 18;
+ .loc 1 213 9
+ setp.ge.s64 %p54, %rd81, %rd138;
+ @%p54 bra $L__BB3_194;
+
+ .loc 1 0 9
+ or.b64 %rd429, %rd81, %rd8;
+ and.b64 %rd430, %rd429, -4294967296;
+ setp.eq.s64 %p55, %rd430, 0;
+ @%p55 bra $L__BB3_127;
+
+ div.s64 %rd650, %rd81, %rd8;
+ bra.uni $L__BB3_128;
+
+$L__BB3_127:
+ cvt.u32.u64 %r212, %rd8;
+ cvt.u32.u64 %r213, %rd81;
+ div.u32 %r214, %r213, %r212;
+ cvt.u64.u32 %rd650, %r214;
+
+$L__BB3_128:
+ .loc 1 219 28
+ add.s64 %rd431, %rd3, %rd650;
+ ld.global.nc.u8 %rs121, [%rd431];
+ cvt.u32.u16 %r215, %rs121;
+ and.b32 %r216, %r215, 255;
+ mul.wide.u32 %rd432, %r216, 4;
+ add.s64 %rd433, %rd2, %rd432;
+ shr.s64 %rd434, %rd650, 63;
+ shr.u64 %rd435, %rd434, 56;
+ add.s64 %rd436, %rd650, %rd435;
+ shr.s64 %rd437, %rd436, 8;
+ shl.b64 %rd438, %rd437, 2;
+ add.s64 %rd439, %rd1, %rd438;
+ ld.global.nc.f32 %f117, [%rd439];
+ ld.global.nc.f32 %f118, [%rd433];
+ mul.f32 %f119, %f118, %f117;
+ .loc 1 220 24
+ shl.b16 %rs122, %rs13, 2;
+ cvt.u64.u16 %rd440, %rs122;
+ and.b64 %rd441, %rd440, 60;
+ add.s64 %rd443, %rd179, %rd441;
+ ld.const.f32 %f120, [%rd443];
+ fma.rn.f32 %f10, %f120, %f119, %f17;
+ .loc 1 222 29
+ add.s64 %rd85, %rd5, 19;
+ .loc 1 223 9
+ setp.lt.s64 %p56, %rd85, %rd138;
+ @%p56 bra $L__BB3_130;
+ bra.uni $L__BB3_129;
+
+$L__BB3_130:
+ .loc 1 0 9
+ or.b64 %rd444, %rd85, %rd8;
+ and.b64 %rd445, %rd444, -4294967296;
+ setp.eq.s64 %p57, %rd445, 0;
+ @%p57 bra $L__BB3_132;
+
+ div.s64 %rd651, %rd85, %rd8;
+ bra.uni $L__BB3_133;
+
+$L__BB3_129:
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs123, %f10;}
+
+ // end inline asm
+ .loc 1 229 26
+ st.global.u16 [%rd13+36], %rs123;
+ bra.uni $L__BB3_134;
+
+$L__BB3_132:
+ .loc 1 0 26
+ cvt.u32.u64 %r217, %rd8;
+ cvt.u32.u64 %r218, %rd85;
+ div.u32 %r219, %r218, %r217;
+ cvt.u64.u32 %rd651, %r219;
+
+$L__BB3_133:
+ .loc 1 225 32
+ add.s64 %rd446, %rd3, %rd651;
+ ld.global.nc.u8 %rs128, [%rd446];
+ cvt.u32.u16 %r221, %rs128;
+ and.b32 %r222, %r221, 255;
+ mul.wide.u32 %rd447, %r222, 4;
+ add.s64 %rd448, %rd2, %rd447;
+ shr.s64 %rd449, %rd651, 63;
+ shr.u64 %rd450, %rd449, 56;
+ add.s64 %rd451, %rd651, %rd450;
+ shr.s64 %rd452, %rd451, 8;
+ shl.b64 %rd453, %rd452, 2;
+ add.s64 %rd454, %rd1, %rd453;
+ ld.global.nc.f32 %f124, [%rd454];
+ ld.global.nc.f32 %f125, [%rd448];
+ mul.f32 %f126, %f125, %f124;
+ .loc 1 216 24
+ and.b16 %rs129, %rs13, 240;
+ shr.u16 %rs130, %rs129, 4;
+ .loc 1 226 28
+ cvt.u32.u16 %r223, %rs130;
+ mul.wide.u32 %rd455, %r223, 4;
+ add.s64 %rd457, %rd179, %rd455;
+ ld.const.f32 %f127, [%rd457];
+ fma.rn.f32 %f123, %f127, %f126, %f17;
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs124, %f10;}
+
+ // end inline asm
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs125, %f123;}
+
+ // end inline asm
+ .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29
+ // begin inline asm
+ { mov.b32 %r220, {%rs124,%rs125};}
+
+ // end inline asm
+ .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13
+ st.global.u32 [%rd13+36], %r220;
+
+$L__BB3_134:
+ .loc 1 212 29
+ add.s64 %rd89, %rd5, 20;
+ .loc 1 213 9
+ setp.ge.s64 %p58, %rd89, %rd138;
+ @%p58 bra $L__BB3_194;
+
+ .loc 1 0 9
+ or.b64 %rd458, %rd89, %rd8;
+ and.b64 %rd459, %rd458, -4294967296;
+ setp.eq.s64 %p59, %rd459, 0;
+ @%p59 bra $L__BB3_137;
+
+ div.s64 %rd652, %rd89, %rd8;
+ bra.uni $L__BB3_138;
+
+$L__BB3_137:
+ cvt.u32.u64 %r224, %rd8;
+ cvt.u32.u64 %r225, %rd89;
+ div.u32 %r226, %r225, %r224;
+ cvt.u64.u32 %rd652, %r226;
+
+$L__BB3_138:
+ .loc 1 219 28
+ add.s64 %rd460, %rd3, %rd652;
+ ld.global.nc.u8 %rs131, [%rd460];
+ cvt.u32.u16 %r227, %rs131;
+ and.b32 %r228, %r227, 255;
+ mul.wide.u32 %rd461, %r228, 4;
+ add.s64 %rd462, %rd2, %rd461;
+ shr.s64 %rd463, %rd652, 63;
+ shr.u64 %rd464, %rd463, 56;
+ add.s64 %rd465, %rd652, %rd464;
+ shr.s64 %rd466, %rd465, 8;
+ shl.b64 %rd467, %rd466, 2;
+ add.s64 %rd468, %rd1, %rd467;
+ ld.global.nc.f32 %f128, [%rd468];
+ ld.global.nc.f32 %f129, [%rd462];
+ mul.f32 %f130, %f129, %f128;
+ .loc 1 220 24
+ shl.b16 %rs132, %rs14, 2;
+ cvt.u64.u16 %rd469, %rs132;
+ and.b64 %rd470, %rd469, 60;
+ add.s64 %rd472, %rd179, %rd470;
+ ld.const.f32 %f131, [%rd472];
+ fma.rn.f32 %f11, %f131, %f130, %f17;
+ .loc 1 222 29
+ add.s64 %rd93, %rd5, 21;
+ .loc 1 223 9
+ setp.lt.s64 %p60, %rd93, %rd138;
+ @%p60 bra $L__BB3_140;
+ bra.uni $L__BB3_139;
+
+$L__BB3_140:
+ .loc 1 0 9
+ or.b64 %rd473, %rd93, %rd8;
+ and.b64 %rd474, %rd473, -4294967296;
+ setp.eq.s64 %p61, %rd474, 0;
+ @%p61 bra $L__BB3_142;
+
+ div.s64 %rd653, %rd93, %rd8;
+ bra.uni $L__BB3_143;
+
+$L__BB3_139:
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs133, %f11;}
+
+ // end inline asm
+ .loc 1 229 26
+ st.global.u16 [%rd13+40], %rs133;
+ bra.uni $L__BB3_144;
+
+$L__BB3_142:
+ .loc 1 0 26
+ cvt.u32.u64 %r229, %rd8;
+ cvt.u32.u64 %r230, %rd93;
+ div.u32 %r231, %r230, %r229;
+ cvt.u64.u32 %rd653, %r231;
+
+$L__BB3_143:
+ .loc 1 225 32
+ add.s64 %rd475, %rd3, %rd653;
+ ld.global.nc.u8 %rs138, [%rd475];
+ cvt.u32.u16 %r233, %rs138;
+ and.b32 %r234, %r233, 255;
+ mul.wide.u32 %rd476, %r234, 4;
+ add.s64 %rd477, %rd2, %rd476;
+ shr.s64 %rd478, %rd653, 63;
+ shr.u64 %rd479, %rd478, 56;
+ add.s64 %rd480, %rd653, %rd479;
+ shr.s64 %rd481, %rd480, 8;
+ shl.b64 %rd482, %rd481, 2;
+ add.s64 %rd483, %rd1, %rd482;
+ ld.global.nc.f32 %f135, [%rd483];
+ ld.global.nc.f32 %f136, [%rd477];
+ mul.f32 %f137, %f136, %f135;
+ .loc 1 216 24
+ and.b16 %rs139, %rs14, 240;
+ shr.u16 %rs140, %rs139, 4;
+ .loc 1 226 28
+ cvt.u32.u16 %r235, %rs140;
+ mul.wide.u32 %rd484, %r235, 4;
+ add.s64 %rd486, %rd179, %rd484;
+ ld.const.f32 %f138, [%rd486];
+ fma.rn.f32 %f134, %f138, %f137, %f17;
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs134, %f11;}
+
+ // end inline asm
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs135, %f134;}
+
+ // end inline asm
+ .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29
+ // begin inline asm
+ { mov.b32 %r232, {%rs134,%rs135};}
+
+ // end inline asm
+ .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13
+ st.global.u32 [%rd13+40], %r232;
+
+$L__BB3_144:
+ .loc 1 212 29
+ add.s64 %rd97, %rd5, 22;
+ .loc 1 213 9
+ setp.ge.s64 %p62, %rd97, %rd138;
+ @%p62 bra $L__BB3_194;
+
+ .loc 1 0 9
+ or.b64 %rd487, %rd97, %rd8;
+ and.b64 %rd488, %rd487, -4294967296;
+ setp.eq.s64 %p63, %rd488, 0;
+ @%p63 bra $L__BB3_147;
+
+ div.s64 %rd654, %rd97, %rd8;
+ bra.uni $L__BB3_148;
+
+$L__BB3_147:
+ cvt.u32.u64 %r236, %rd8;
+ cvt.u32.u64 %r237, %rd97;
+ div.u32 %r238, %r237, %r236;
+ cvt.u64.u32 %rd654, %r238;
+
+$L__BB3_148:
+ .loc 1 219 28
+ add.s64 %rd489, %rd3, %rd654;
+ ld.global.nc.u8 %rs141, [%rd489];
+ cvt.u32.u16 %r239, %rs141;
+ and.b32 %r240, %r239, 255;
+ mul.wide.u32 %rd490, %r240, 4;
+ add.s64 %rd491, %rd2, %rd490;
+ shr.s64 %rd492, %rd654, 63;
+ shr.u64 %rd493, %rd492, 56;
+ add.s64 %rd494, %rd654, %rd493;
+ shr.s64 %rd495, %rd494, 8;
+ shl.b64 %rd496, %rd495, 2;
+ add.s64 %rd497, %rd1, %rd496;
+ ld.global.nc.f32 %f139, [%rd497];
+ ld.global.nc.f32 %f140, [%rd491];
+ mul.f32 %f141, %f140, %f139;
+ .loc 1 220 24
+ shl.b16 %rs142, %rs15, 2;
+ cvt.u64.u16 %rd498, %rs142;
+ and.b64 %rd499, %rd498, 60;
+ add.s64 %rd501, %rd179, %rd499;
+ ld.const.f32 %f142, [%rd501];
+ fma.rn.f32 %f12, %f142, %f141, %f17;
+ .loc 1 222 29
+ add.s64 %rd101, %rd5, 23;
+ .loc 1 223 9
+ setp.lt.s64 %p64, %rd101, %rd138;
+ @%p64 bra $L__BB3_150;
+ bra.uni $L__BB3_149;
+
+$L__BB3_150:
+ .loc 1 0 9
+ or.b64 %rd502, %rd101, %rd8;
+ and.b64 %rd503, %rd502, -4294967296;
+ setp.eq.s64 %p65, %rd503, 0;
+ @%p65 bra $L__BB3_152;
+
+ div.s64 %rd655, %rd101, %rd8;
+ bra.uni $L__BB3_153;
+
+$L__BB3_149:
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs143, %f12;}
+
+ // end inline asm
+ .loc 1 229 26
+ st.global.u16 [%rd13+44], %rs143;
+ bra.uni $L__BB3_154;
+
+$L__BB3_152:
+ .loc 1 0 26
+ cvt.u32.u64 %r241, %rd8;
+ cvt.u32.u64 %r242, %rd101;
+ div.u32 %r243, %r242, %r241;
+ cvt.u64.u32 %rd655, %r243;
+
+$L__BB3_153:
+ .loc 1 225 32
+ add.s64 %rd504, %rd3, %rd655;
+ ld.global.nc.u8 %rs148, [%rd504];
+ cvt.u32.u16 %r245, %rs148;
+ and.b32 %r246, %r245, 255;
+ mul.wide.u32 %rd505, %r246, 4;
+ add.s64 %rd506, %rd2, %rd505;
+ shr.s64 %rd507, %rd655, 63;
+ shr.u64 %rd508, %rd507, 56;
+ add.s64 %rd509, %rd655, %rd508;
+ shr.s64 %rd510, %rd509, 8;
+ shl.b64 %rd511, %rd510, 2;
+ add.s64 %rd512, %rd1, %rd511;
+ ld.global.nc.f32 %f146, [%rd512];
+ ld.global.nc.f32 %f147, [%rd506];
+ mul.f32 %f148, %f147, %f146;
+ .loc 1 216 24
+ shr.u16 %rs149, %rs15, 4;
+ .loc 1 226 28
+ cvt.u32.u16 %r247, %rs149;
+ mul.wide.u32 %rd513, %r247, 4;
+ add.s64 %rd515, %rd179, %rd513;
+ ld.const.f32 %f149, [%rd515];
+ fma.rn.f32 %f145, %f149, %f148, %f17;
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs144, %f12;}
+
+ // end inline asm
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs145, %f145;}
+
+ // end inline asm
+ .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29
+ // begin inline asm
+ { mov.b32 %r244, {%rs144,%rs145};}
+
+ // end inline asm
+ .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13
+ st.global.u32 [%rd13+44], %r244;
+
+$L__BB3_154:
+ .loc 1 212 29
+ add.s64 %rd105, %rd5, 24;
+ .loc 1 213 9
+ setp.ge.s64 %p66, %rd105, %rd138;
+ @%p66 bra $L__BB3_194;
+
+ .loc 1 0 9
+ or.b64 %rd516, %rd105, %rd8;
+ and.b64 %rd517, %rd516, -4294967296;
+ setp.eq.s64 %p67, %rd517, 0;
+ @%p67 bra $L__BB3_157;
+
+ div.s64 %rd656, %rd105, %rd8;
+ bra.uni $L__BB3_158;
+
+$L__BB3_157:
+ cvt.u32.u64 %r248, %rd8;
+ cvt.u32.u64 %r249, %rd105;
+ div.u32 %r250, %r249, %r248;
+ cvt.u64.u32 %rd656, %r250;
+
+$L__BB3_158:
+ .loc 1 219 28
+ add.s64 %rd518, %rd3, %rd656;
+ ld.global.nc.u8 %rs150, [%rd518];
+ cvt.u32.u16 %r251, %rs150;
+ and.b32 %r252, %r251, 255;
+ mul.wide.u32 %rd519, %r252, 4;
+ add.s64 %rd520, %rd2, %rd519;
+ shr.s64 %rd521, %rd656, 63;
+ shr.u64 %rd522, %rd521, 56;
+ add.s64 %rd523, %rd656, %rd522;
+ shr.s64 %rd524, %rd523, 8;
+ shl.b64 %rd525, %rd524, 2;
+ add.s64 %rd526, %rd1, %rd525;
+ ld.global.nc.f32 %f150, [%rd526];
+ ld.global.nc.f32 %f151, [%rd520];
+ mul.f32 %f152, %f151, %f150;
+ .loc 1 220 24
+ shl.b16 %rs151, %rs4, 2;
+ cvt.u64.u16 %rd527, %rs151;
+ and.b64 %rd528, %rd527, 60;
+ add.s64 %rd530, %rd179, %rd528;
+ ld.const.f32 %f153, [%rd530];
+ fma.rn.f32 %f13, %f153, %f152, %f17;
+ .loc 1 222 29
+ add.s64 %rd109, %rd5, 25;
+ .loc 1 223 9
+ setp.lt.s64 %p68, %rd109, %rd138;
+ @%p68 bra $L__BB3_160;
+ bra.uni $L__BB3_159;
+
+$L__BB3_160:
+ .loc 1 0 9
+ or.b64 %rd531, %rd109, %rd8;
+ and.b64 %rd532, %rd531, -4294967296;
+ setp.eq.s64 %p69, %rd532, 0;
+ @%p69 bra $L__BB3_162;
+
+ div.s64 %rd657, %rd109, %rd8;
+ bra.uni $L__BB3_163;
+
+$L__BB3_159:
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs152, %f13;}
+
+ // end inline asm
+ .loc 1 229 26
+ st.global.u16 [%rd13+48], %rs152;
+ bra.uni $L__BB3_164;
+
+$L__BB3_162:
+ .loc 1 0 26
+ cvt.u32.u64 %r253, %rd8;
+ cvt.u32.u64 %r254, %rd109;
+ div.u32 %r255, %r254, %r253;
+ cvt.u64.u32 %rd657, %r255;
+
+$L__BB3_163:
+ .loc 1 225 32
+ add.s64 %rd533, %rd3, %rd657;
+ ld.global.nc.u8 %rs157, [%rd533];
+ cvt.u32.u16 %r257, %rs157;
+ and.b32 %r258, %r257, 255;
+ mul.wide.u32 %rd534, %r258, 4;
+ add.s64 %rd535, %rd2, %rd534;
+ shr.s64 %rd536, %rd657, 63;
+ shr.u64 %rd537, %rd536, 56;
+ add.s64 %rd538, %rd657, %rd537;
+ shr.s64 %rd539, %rd538, 8;
+ shl.b64 %rd540, %rd539, 2;
+ add.s64 %rd541, %rd1, %rd540;
+ ld.global.nc.f32 %f157, [%rd541];
+ ld.global.nc.f32 %f158, [%rd535];
+ mul.f32 %f159, %f158, %f157;
+ .loc 1 216 24
+ and.b16 %rs158, %rs4, 240;
+ shr.u16 %rs159, %rs158, 4;
+ .loc 1 226 28
+ cvt.u32.u16 %r259, %rs159;
+ mul.wide.u32 %rd542, %r259, 4;
+ add.s64 %rd544, %rd179, %rd542;
+ ld.const.f32 %f160, [%rd544];
+ fma.rn.f32 %f156, %f160, %f159, %f17;
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs153, %f13;}
+
+ // end inline asm
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs154, %f156;}
+
+ // end inline asm
+ .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29
+ // begin inline asm
+ { mov.b32 %r256, {%rs153,%rs154};}
+
+ // end inline asm
+ .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13
+ st.global.u32 [%rd13+48], %r256;
+
+$L__BB3_164:
+ .loc 1 212 29
+ add.s64 %rd113, %rd5, 26;
+ .loc 1 213 9
+ setp.ge.s64 %p70, %rd113, %rd138;
+ @%p70 bra $L__BB3_194;
+
+ .loc 1 0 9
+ or.b64 %rd545, %rd113, %rd8;
+ and.b64 %rd546, %rd545, -4294967296;
+ setp.eq.s64 %p71, %rd546, 0;
+ @%p71 bra $L__BB3_167;
+
+ div.s64 %rd658, %rd113, %rd8;
+ bra.uni $L__BB3_168;
+
+$L__BB3_167:
+ cvt.u32.u64 %r260, %rd8;
+ cvt.u32.u64 %r261, %rd113;
+ div.u32 %r262, %r261, %r260;
+ cvt.u64.u32 %rd658, %r262;
+
+$L__BB3_168:
+ .loc 1 219 28
+ add.s64 %rd547, %rd3, %rd658;
+ ld.global.nc.u8 %rs160, [%rd547];
+ cvt.u32.u16 %r263, %rs160;
+ and.b32 %r264, %r263, 255;
+ mul.wide.u32 %rd548, %r264, 4;
+ add.s64 %rd549, %rd2, %rd548;
+ shr.s64 %rd550, %rd658, 63;
+ shr.u64 %rd551, %rd550, 56;
+ add.s64 %rd552, %rd658, %rd551;
+ shr.s64 %rd553, %rd552, 8;
+ shl.b64 %rd554, %rd553, 2;
+ add.s64 %rd555, %rd1, %rd554;
+ ld.global.nc.f32 %f161, [%rd555];
+ ld.global.nc.f32 %f162, [%rd549];
+ mul.f32 %f163, %f162, %f161;
+ .loc 1 220 24
+ shl.b16 %rs161, %rs3, 2;
+ cvt.u64.u16 %rd556, %rs161;
+ and.b64 %rd557, %rd556, 60;
+ add.s64 %rd559, %rd179, %rd557;
+ ld.const.f32 %f164, [%rd559];
+ fma.rn.f32 %f14, %f164, %f163, %f17;
+ .loc 1 222 29
+ add.s64 %rd117, %rd5, 27;
+ .loc 1 223 9
+ setp.lt.s64 %p72, %rd117, %rd138;
+ @%p72 bra $L__BB3_170;
+ bra.uni $L__BB3_169;
+
+$L__BB3_170:
+ .loc 1 0 9
+ or.b64 %rd560, %rd117, %rd8;
+ and.b64 %rd561, %rd560, -4294967296;
+ setp.eq.s64 %p73, %rd561, 0;
+ @%p73 bra $L__BB3_172;
+
+ div.s64 %rd659, %rd117, %rd8;
+ bra.uni $L__BB3_173;
+
+$L__BB3_169:
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs162, %f14;}
+
+ // end inline asm
+ .loc 1 229 26
+ st.global.u16 [%rd13+52], %rs162;
+ bra.uni $L__BB3_174;
+
+$L__BB3_172:
+ .loc 1 0 26
+ cvt.u32.u64 %r265, %rd8;
+ cvt.u32.u64 %r266, %rd117;
+ div.u32 %r267, %r266, %r265;
+ cvt.u64.u32 %rd659, %r267;
+
+$L__BB3_173:
+ .loc 1 225 32
+ add.s64 %rd562, %rd3, %rd659;
+ ld.global.nc.u8 %rs167, [%rd562];
+ cvt.u32.u16 %r269, %rs167;
+ and.b32 %r270, %r269, 255;
+ mul.wide.u32 %rd563, %r270, 4;
+ add.s64 %rd564, %rd2, %rd563;
+ shr.s64 %rd565, %rd659, 63;
+ shr.u64 %rd566, %rd565, 56;
+ add.s64 %rd567, %rd659, %rd566;
+ shr.s64 %rd568, %rd567, 8;
+ shl.b64 %rd569, %rd568, 2;
+ add.s64 %rd570, %rd1, %rd569;
+ ld.global.nc.f32 %f168, [%rd570];
+ ld.global.nc.f32 %f169, [%rd564];
+ mul.f32 %f170, %f169, %f168;
+ .loc 1 216 24
+ and.b16 %rs168, %rs3, 240;
+ shr.u16 %rs169, %rs168, 4;
+ .loc 1 226 28
+ cvt.u32.u16 %r271, %rs169;
+ mul.wide.u32 %rd571, %r271, 4;
+ add.s64 %rd573, %rd179, %rd571;
+ ld.const.f32 %f171, [%rd573];
+ fma.rn.f32 %f167, %f171, %f170, %f17;
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs163, %f14;}
+
+ // end inline asm
+ .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85
+ // begin inline asm
+ { cvt.rn.bf16.f32 %rs164, %f167;}
+
+ // end inline asm
+ .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29
+ // begin inline asm
+ { mov.b32 %r268, {%rs163,%rs164};}
+
+ // end inline asm
+ .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13
+ st.global.u32 [%rd13+52], %r268;
+
+$L__BB3_174:
+ .loc 1 212 29
+ add.s64 %rd121, %rd5, 28;
+ .loc 1 213 9
+ setp.ge.s64 %p74, %rd121, %rd138;
+ @%p74 bra $L__BB3_194;
+
+ .loc 1 0 9
+ or.b64 %rd574, %rd121, %rd8;
+ and.b64 %rd575, %rd574, -4294967296;
+ setp.eq.s64 %p75, %rd575, 0;
+ @%p75 bra $L__BB3_177;
+
+ div.s64 %rd660, %rd121, %rd8;
+ bra.uni $L__BB3_178;
+
+$L__BB3_177: + cvt.u32.u64 %r272, %rd8; + cvt.u32.u64 %r273, %rd121; + div.u32 %r274, %r273, %r272; + cvt.u64.u32 %rd660, %r274; + +$L__BB3_178: + .loc 1 219 28 + add.s64 %rd576, %rd3, %rd660; + ld.global.nc.u8 %rs170, [%rd576]; + cvt.u32.u16 %r275, %rs170; + and.b32 %r276, %r275, 255; + mul.wide.u32 %rd577, %r276, 4; + add.s64 %rd578, %rd2, %rd577; + shr.s64 %rd579, %rd660, 63; + shr.u64 %rd580, %rd579, 56; + add.s64 %rd581, %rd660, %rd580; + shr.s64 %rd582, %rd581, 8; + shl.b64 %rd583, %rd582, 2; + add.s64 %rd584, %rd1, %rd583; + ld.global.nc.f32 %f172, [%rd584]; + ld.global.nc.f32 %f173, [%rd578]; + mul.f32 %f174, %f173, %f172; + .loc 1 220 24 + shl.b16 %rs171, %rs2, 2; + cvt.u64.u16 %rd585, %rs171; + and.b64 %rd586, %rd585, 60; + add.s64 %rd588, %rd179, %rd586; + ld.const.f32 %f175, [%rd588]; + fma.rn.f32 %f15, %f175, %f174, %f17; + .loc 1 222 29 + add.s64 %rd125, %rd5, 29; + .loc 1 223 9 + setp.lt.s64 %p76, %rd125, %rd138; + @%p76 bra $L__BB3_180; + bra.uni $L__BB3_179; + +$L__BB3_180: + .loc 1 0 9 + or.b64 %rd589, %rd125, %rd8; + and.b64 %rd590, %rd589, -4294967296; + setp.eq.s64 %p77, %rd590, 0; + @%p77 bra $L__BB3_182; + + div.s64 %rd661, %rd125, %rd8; + bra.uni $L__BB3_183; + +$L__BB3_179: + .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85 + // begin inline asm + { cvt.rn.bf16.f32 %rs172, %f15;} + + // end inline asm + .loc 1 229 26 + st.global.u16 [%rd13+56], %rs172; + bra.uni $L__BB3_184; + +$L__BB3_182: + .loc 1 0 26 + cvt.u32.u64 %r277, %rd8; + cvt.u32.u64 %r278, %rd125; + div.u32 %r279, %r278, %r277; + cvt.u64.u32 %rd661, %r279; + +$L__BB3_183: + .loc 1 225 32 + add.s64 %rd591, %rd3, %rd661; + ld.global.nc.u8 %rs177, [%rd591]; + cvt.u32.u16 %r281, %rs177; + and.b32 %r282, %r281, 255; + mul.wide.u32 %rd592, %r282, 4; + add.s64 %rd593, %rd2, %rd592; + shr.s64 %rd594, %rd661, 63; + shr.u64 %rd595, %rd594, 56; + add.s64 %rd596, %rd661, %rd595; + shr.s64 %rd597, %rd596, 8; + shl.b64 %rd598, %rd597, 2; + add.s64 %rd599, %rd1, %rd598; + 
ld.global.nc.f32 %f179, [%rd599]; + ld.global.nc.f32 %f180, [%rd593]; + mul.f32 %f181, %f180, %f179; + .loc 1 216 24 + and.b16 %rs178, %rs2, 240; + shr.u16 %rs179, %rs178, 4; + .loc 1 226 28 + cvt.u32.u16 %r283, %rs179; + mul.wide.u32 %rd600, %r283, 4; + add.s64 %rd602, %rd179, %rd600; + ld.const.f32 %f182, [%rd602]; + fma.rn.f32 %f178, %f182, %f181, %f17; + .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85 + // begin inline asm + { cvt.rn.bf16.f32 %rs173, %f15;} + + // end inline asm + .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85 + // begin inline asm + { cvt.rn.bf16.f32 %rs174, %f178;} + + // end inline asm + .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29 + // begin inline asm + { mov.b32 %r280, {%rs173,%rs174};} + + // end inline asm + .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13 + st.global.u32 [%rd13+56], %r280; + +$L__BB3_184: + .loc 1 212 29 + add.s64 %rd129, %rd5, 30; + .loc 1 213 9 + setp.ge.s64 %p78, %rd129, %rd138; + @%p78 bra $L__BB3_194; + + .loc 1 0 9 + or.b64 %rd603, %rd129, %rd8; + and.b64 %rd604, %rd603, -4294967296; + setp.eq.s64 %p79, %rd604, 0; + @%p79 bra $L__BB3_187; + + div.s64 %rd662, %rd129, %rd8; + bra.uni $L__BB3_188; + +$L__BB3_187: + cvt.u32.u64 %r284, %rd8; + cvt.u32.u64 %r285, %rd129; + div.u32 %r286, %r285, %r284; + cvt.u64.u32 %rd662, %r286; + +$L__BB3_188: + .loc 1 219 28 + add.s64 %rd605, %rd3, %rd662; + ld.global.nc.u8 %rs180, [%rd605]; + cvt.u32.u16 %r287, %rs180; + and.b32 %r288, %r287, 255; + mul.wide.u32 %rd606, %r288, 4; + add.s64 %rd607, %rd2, %rd606; + shr.s64 %rd608, %rd662, 63; + shr.u64 %rd609, %rd608, 56; + add.s64 %rd610, %rd662, %rd609; + shr.s64 %rd611, %rd610, 8; + shl.b64 %rd612, %rd611, 2; + add.s64 %rd613, %rd1, %rd612; + ld.global.nc.f32 %f183, [%rd613]; + ld.global.nc.f32 %f184, [%rd607]; + mul.f32 %f185, %f184, %f183; + .loc 1 220 24 + shl.b16 %rs181, %rs1, 2; + cvt.u64.u16 %rd614, %rs181; + and.b64 %rd615, %rd614, 60; + add.s64 %rd617, 
%rd179, %rd615; + ld.const.f32 %f186, [%rd617]; + fma.rn.f32 %f16, %f186, %f185, %f17; + .loc 1 222 29 + add.s64 %rd133, %rd5, 31; + .loc 1 223 9 + setp.lt.s64 %p80, %rd133, %rd138; + @%p80 bra $L__BB3_190; + bra.uni $L__BB3_189; + +$L__BB3_190: + .loc 1 0 9 + or.b64 %rd618, %rd133, %rd8; + and.b64 %rd619, %rd618, -4294967296; + setp.eq.s64 %p81, %rd619, 0; + @%p81 bra $L__BB3_192; + + div.s64 %rd663, %rd133, %rd8; + bra.uni $L__BB3_193; + +$L__BB3_189: + .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85 + // begin inline asm + { cvt.rn.bf16.f32 %rs182, %f16;} + + // end inline asm + .loc 1 229 26 + st.global.u16 [%rd13+60], %rs182; + bra.uni $L__BB3_194; + +$L__BB3_192: + .loc 1 0 26 + cvt.u32.u64 %r289, %rd8; + cvt.u32.u64 %r290, %rd133; + div.u32 %r291, %r290, %r289; + cvt.u64.u32 %rd663, %r291; + +$L__BB3_193: + .loc 1 225 32 + add.s64 %rd620, %rd3, %rd663; + ld.global.nc.u8 %rs187, [%rd620]; + cvt.u32.u16 %r293, %rs187; + and.b32 %r294, %r293, 255; + mul.wide.u32 %rd621, %r294, 4; + add.s64 %rd622, %rd2, %rd621; + shr.s64 %rd623, %rd663, 63; + shr.u64 %rd624, %rd623, 56; + add.s64 %rd625, %rd663, %rd624; + shr.s64 %rd626, %rd625, 8; + shl.b64 %rd627, %rd626, 2; + add.s64 %rd628, %rd1, %rd627; + ld.global.nc.f32 %f190, [%rd628]; + ld.global.nc.f32 %f191, [%rd622]; + mul.f32 %f192, %f191, %f190; + .loc 1 216 24 + shr.u16 %rs188, %rs1, 4; + .loc 1 226 28 + cvt.u32.u16 %r295, %rs188; + mul.wide.u32 %rd629, %r295, 4; + add.s64 %rd631, %rd179, %rd629; + ld.const.f32 %f193, [%rd631]; + fma.rn.f32 %f189, %f193, %f192, %f17; + .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85 + // begin inline asm + { cvt.rn.bf16.f32 %rs183, %f16;} + + // end inline asm + .loc 3 455 3, function_name $L__info_string5, inlined_at 1 63 85 + // begin inline asm + { cvt.rn.bf16.f32 %rs184, %f189;} + + // end inline asm + .loc 3 1534 5, function_name $L__info_string7, inlined_at 1 82 29 + // begin inline asm + { mov.b32 %r292, {%rs183,%rs184};} + + // end inline 
asm + .loc 1 83 5, function_name $L__info_string6, inlined_at 1 227 13 + st.global.u32 [%rd13+60], %r292; + +$L__BB3_194: + .loc 1 232 1 + ret; + +} + .file 1 "/mnt/d/thu-project/Learning-CUDA-master/Learning-CUDA-master/nf4/dequant_kernel_v2.cu" + .file 2 "/usr/include/cuda_fp16.hpp" + .file 3 "/usr/include/cuda_bf16.hpp" + .section .debug_str + { +$L__info_string0: +.b8 95,90,78,52,56,95,73,78,84,69,82,78,65,76,95,56,52,56,98,102,53,51,55,95,49,55,95,100,101,113,117,97,110,116,95,107,101,114,110,101 +.b8 108,95,99,117,95,54,50,50,101,98,98,51,50,53,48,95,71,76,79,66,65,76,95,95,78,95,95,56,52,56,98,102,53,51,55,95,49,55,95,100 +.b8 101,113,117,97,110,116,95,107,101,114,110,101,108,95,99,117,95,54,50,50,101,98,98,51,50,55,99,97,115,116,95,116,111,73,54,95,95,104,97,108 +.b8 102,69,69,84,95,102,0 +$L__info_string1: +.b8 95,90,78,52,56,95,73,78,84,69,82,78,65,76,95,56,52,56,98,102,53,51,55,95,49,55,95,100,101,113,117,97,110,116,95,107,101,114,110,101 +.b8 108,95,99,117,95,54,50,50,101,98,98,51,50,49,50,95,95,102,108,111,97,116,50,104,97,108,102,69,102,0 +$L__info_string2: +.b8 95,90,78,52,56,95,73,78,84,69,82,78,65,76,95,56,52,56,98,102,53,51,55,95,49,55,95,100,101,113,117,97,110,116,95,107,101,114,110,101 +.b8 108,95,99,117,95,54,50,50,101,98,98,51,50,53,48,95,71,76,79,66,65,76,95,95,78,95,95,56,52,56,98,102,53,51,55,95,49,55,95,100 +.b8 101,113,117,97,110,116,95,107,101,114,110,101,108,95,99,117,95,54,50,50,101,98,98,51,50,49,48,115,116,111,114,101,95,112,97,105,114,73,54,95 +.b8 95,104,97,108,102,69,69,118,80,84,95,83,51,95,83,51,95,0 +$L__info_string3: +.b8 95,90,78,52,56,95,73,78,84,69,82,78,65,76,95,56,52,56,98,102,53,51,55,95,49,55,95,100,101,113,117,97,110,116,95,107,101,114,110,101 +.b8 108,95,99,117,95,54,50,50,101,98,98,51,50,49,52,95,95,104,97,108,118,101,115,50,104,97,108,102,50,69,54,95,95,104,97,108,102,83,48,95 +.b8 0 +$L__info_string4: +.b8 
95,90,78,52,56,95,73,78,84,69,82,78,65,76,95,56,52,56,98,102,53,51,55,95,49,55,95,100,101,113,117,97,110,116,95,107,101,114,110,101 +.b8 108,95,99,117,95,54,50,50,101,98,98,51,50,53,48,95,71,76,79,66,65,76,95,95,78,95,95,56,52,56,98,102,53,51,55,95,49,55,95,100 +.b8 101,113,117,97,110,116,95,107,101,114,110,101,108,95,99,117,95,54,50,50,101,98,98,51,50,55,99,97,115,116,95,116,111,73,49,51,95,95,110,118 +.b8 95,98,102,108,111,97,116,49,54,69,69,84,95,102,0 +$L__info_string5: +.b8 95,90,78,52,56,95,73,78,84,69,82,78,65,76,95,56,52,56,98,102,53,51,55,95,49,55,95,100,101,113,117,97,110,116,95,107,101,114,110,101 +.b8 108,95,99,117,95,54,50,50,101,98,98,51,50,49,54,95,95,102,108,111,97,116,50,98,102,108,111,97,116,49,54,69,102,0 +$L__info_string6: +.b8 95,90,78,52,56,95,73,78,84,69,82,78,65,76,95,56,52,56,98,102,53,51,55,95,49,55,95,100,101,113,117,97,110,116,95,107,101,114,110,101 +.b8 108,95,99,117,95,54,50,50,101,98,98,51,50,53,48,95,71,76,79,66,65,76,95,95,78,95,95,56,52,56,98,102,53,51,55,95,49,55,95,100 +.b8 101,113,117,97,110,116,95,107,101,114,110,101,108,95,99,117,95,54,50,50,101,98,98,51,50,49,48,115,116,111,114,101,95,112,97,105,114,73,49,51 +.b8 95,95,110,118,95,98,102,108,111,97,116,49,54,69,69,118,80,84,95,83,51,95,83,51,95,0 +$L__info_string7: +.b8 95,90,78,52,56,95,73,78,84,69,82,78,65,76,95,56,52,56,98,102,53,51,55,95,49,55,95,100,101,113,117,97,110,116,95,107,101,114,110,101 +.b8 108,95,99,117,95,54,50,50,101,98,98,51,50,49,56,95,95,104,97,108,118,101,115,50,98,102,108,111,97,116,49,54,50,69,49,51,95,95,110,118 +.b8 95,98,102,108,111,97,116,49,54,83,48,95,0 + + } diff --git a/03_nf4_dequant/SkyHigh-achieving/dequant_kernel_v2.cu b/03_nf4_dequant/SkyHigh-achieving/dequant_kernel_v2.cu new file mode 100644 index 0000000..6f97600 --- /dev/null +++ b/03_nf4_dequant/SkyHigh-achieving/dequant_kernel_v2.cu @@ -0,0 +1,544 @@ +/** + * dequant_kernel.cu — NF4 Dequantization Kernel (Optimized) + * + * Formula: w = NF4[4bit_index] * code2[absmax_q[block]] * 
absmax2[block/256] + offset
+ *
+ * This is bitsandbytes "double quantization" (quant_type=nf4, double_quant=True):
+ *   - Level 1 (absmax_q): uint8 index into code2 lookup table
+ *   - Level 2 (absmax2): FP16 scale for groups of 256 L1 blocks
+ *   - The L1 scale itself is quantized to save memory (~3% overhead instead of 8%)
+ *
+ * Optimization history (for presentation):
+ *   v1 Naive:      one thread per element, scalar FP16 write → ~5% A100 bandwidth
+ *   v2 Vectorized: one thread per uint8 (2 elements), __half2 packed store → ~35% bandwidth
+ *   v3 Aggressive: 128-bit vectorized load (int4, 32 elements/thread) — register
+ *                  pressure lowered occupancy (see report)
+ *   v4 Current:    int2 load + __launch_bounds__(128, 8) + dynamic block size → ~317 GB/s
+ */
+
+#include "dequant_kernel.h"
+
+#include <cuda_bf16.h>
+#include <cuda_fp16.h>
+#include <cuda_runtime.h>
+
+#include <cmath>
+#include <cstdint>
+#include <cstring>
+#include <fstream>
+#include <iostream>
+#include <type_traits>
+#include <vector>
+
+namespace {
+
+constexpr int kPairsPerThreadV3 = 8;
+
+// ── NF4 Lookup Table ─────────────────────────────────────────────────────────
+// From QLoRA paper Table 1: 16 quantile values of N(0,1) scaled to [-1, 1]
+// Stored in __constant__ memory: all 108 SMs on A100 share one copy,
+// with 8KB constant cache per SM. 16 * 4 = 64 bytes — fits in a single cache line.
+__device__ __constant__ float d_nf4[16] = {
+    -1.0f,        -0.6961928f,  -0.52507305f,  -0.3949175f,
+    -0.28444138f, -0.18477343f, -0.091050036f,  0.0f,
+     0.0795803f,   0.1609302f,   0.2461123f,    0.33791524f,
+     0.44070983f,  0.562617f,    0.72295684f,   1.0f
+};
+
+// CPU copy for reference computation (identical values)
+constexpr float kNF4[16] = {
+    -1.0f,        -0.6961928f,  -0.52507305f,  -0.3949175f,
+    -0.28444138f, -0.18477343f, -0.091050036f,  0.0f,
+     0.0795803f,   0.1609302f,   0.2461123f,    0.33791524f,
+     0.44070983f,  0.562617f,    0.72295684f,   1.0f
+};
+
+inline int64_t ceil_div(int64_t a, int64_t b) { return (a + b - 1) / b; }
+
+inline float fp16_bits_to_float(uint16_t bits) {
+    __half h; std::memcpy(&h, &bits, sizeof(uint16_t));
+    return __half2float(h);
+}
+
+// ── Type-agnostic cast helper ────────────────────────────────────────────────
+template <typename T> __device__ inline T cast_to(float v);
+template <> __device__ inline __half cast_to<__half>(float v) { return __float2half(v); }
+template <> __device__ inline __nv_bfloat16 cast_to<__nv_bfloat16>(float v) { return __float2bfloat16(v); }
+
+// ── Packed-pair write helper ─────────────────────────────────────────────────
+// Instead of two separate 16-bit stores, pack into one 32-bit store.
+// This halves the number of store instructions and improves L2 write efficiency.
+template <typename T>
+__device__ inline void store_pair(T* __restrict__ ptr, T a, T b);
+
+template <>
+__device__ inline void store_pair<__half>(__half* __restrict__ ptr, __half a, __half b) {
+    // __half2 is two consecutive __half values; store as uint32 for atomic/vectorized write
+    __half2 packed = __halves2half2(a, b);
+    *reinterpret_cast<uint32_t*>(ptr) = *reinterpret_cast<uint32_t*>(&packed);
+}
+
+template <>
+__device__ inline void store_pair<__nv_bfloat16>(
+    __nv_bfloat16* __restrict__ ptr, __nv_bfloat16 a, __nv_bfloat16 b)
+{
+    __nv_bfloat162 packed = __halves2bfloat162(a, b);
+    *reinterpret_cast<uint32_t*>(ptr) = *reinterpret_cast<uint32_t*>(&packed);
+}
+
+// ── Main Dequant Kernel (v2: Vectorized / Packed Store) ──────────────────────
+//
+// Thread mapping:
+//   - 1 thread handles 1 uint8 = 2 packed 4-bit weights = 2 output elements
+//   - pair_idx = blockIdx.x * blockDim.x + threadIdx.x
+//   - elem0 = pair_idx * 2
+//   - elem1 = pair_idx * 2 + 1
+//
+// Memory access pattern (key for A100 bandwidth):
+//   - packed_weights: consecutive threads → consecutive bytes → COALESCED READ
+//   - output: consecutive threads → consecutive 32-bit stores → COALESCED WRITE
+//   - absmax_q: 32 consecutive threads share the same block (blocksize=64)
+//     → same cache line → effectively a BROADCAST, no divergence
+//   - code2: random access into 256-entry table → stays in L1 after warmup
+//
+template <typename T>
+__global__ void dequant_kernel(
+    const uint8_t* __restrict__ packed_weights,  // [num_pairs] bytes
+    const uint8_t* __restrict__ absmax_q,        // [num_blocks] uint8
+    const float* __restrict__ absmax2,           // [num_groups] float
+    const float* __restrict__ code2,             // [256] float
+    float offset,
+    int64_t numel,
+    int32_t blocksize,
+    T* __restrict__ out)                         // [numel] T
+{
+    // Which pair of elements does this thread own?
+    const int64_t pair_idx = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
+    const int64_t elem0 = pair_idx * 2;
+    if (elem0 >= numel) return;
+
+    // ── Unpack two 4-bit indices from one byte ──────────────────────────────
+    // byte layout (bitsandbytes convention):
+    //   bits[3:0] → element at even position (elem0)
+    //   bits[7:4] → element at odd position (elem1)
+    const uint8_t packed = packed_weights[pair_idx];
+    const int idx0 = packed & 0x0F;         // low nibble
+    const int idx1 = (packed >> 4) & 0x0F;  // high nibble
+
+    // ── Compute scale for elem0 ─────────────────────────────────────────────
+    // Two-level (double) quantization:
+    //   L1 scale = code2[ absmax_q[block_idx] ]  (uint8 → float via codebook)
+    //   L2 scale = absmax2[ block_idx / 256 ]    (float)
+    //   final scale = L1 * L2
+    const int64_t block_idx0 = elem0 / blocksize;
+    const int64_t group_idx0 = block_idx0 / 256;
+    const float scale0 = code2[absmax_q[block_idx0]] * absmax2[group_idx0];
+    const float w0 = d_nf4[idx0] * scale0 + offset;
+
+    // ── Compute scale for elem1 (may be in a different block at boundaries) ──
+    const int64_t elem1 = elem0 + 1;
+    if (elem1 < numel) {
+        // Normal path: two valid elements
+        // For blocksize >= 2, block_idx1 == block_idx0 in the vast majority of cases.
+        // We still compute it correctly for generality (compiler will optimize same-block case).
+        const int64_t block_idx1 = elem1 / blocksize;
+        const int64_t group_idx1 = block_idx1 / 256;
+        const float scale1 = code2[absmax_q[block_idx1]] * absmax2[group_idx1];
+        const float w1 = d_nf4[idx1] * scale1 + offset;
+
+        // ── Vectorized (packed) store: 2 × T in one 32-bit write ────────────
+        // Equivalent to two separate stores, but issues a single 32-bit transaction
+        // to the L2/HBM, halving write pressure.
+        store_pair(out + elem0, cast_to<T>(w0), cast_to<T>(w1));
+    } else {
+        // Edge case: only elem0 is valid (odd-sized matrix, last element)
+        out[elem0] = cast_to<T>(w0);
+    }
+}
+
+template <typename T>
+__global__ void dequant_kernel_v3(
+    const uint8_t* __restrict__ packed_weights,
+    const uint8_t* __restrict__ absmax_q,
+    const float* __restrict__ absmax2,
+    const float* __restrict__ code2,
+    float offset,
+    int64_t numel,
+    int32_t blocksize,
+    T* __restrict__ out)
+{
+    constexpr int pairs_per_thread = kPairsPerThreadV3;
+
+    const int64_t tid = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
+    const int64_t pair_base = tid * pairs_per_thread;
+    const int64_t elem_base = pair_base * 2;
+    const int64_t num_pairs_total = (numel + 1) / 2;
+
+    if (elem_base >= numel) return;
+
+    uint8_t bytes[pairs_per_thread];
+
+    if (pair_base + pairs_per_thread <= num_pairs_total) {
+        if constexpr (pairs_per_thread == 16) {
+            const uint4 raw = *reinterpret_cast<const uint4*>(packed_weights + pair_base);
+            const uint32_t lanes[4] = {raw.x, raw.y, raw.z, raw.w};
+            #pragma unroll
+            for (int l = 0; l < 4; ++l) {
+                const uint32_t v = lanes[l];
+                bytes[l * 4 + 0] = static_cast<uint8_t>(v & 0xFFu);
+                bytes[l * 4 + 1] = static_cast<uint8_t>((v >> 8) & 0xFFu);
+                bytes[l * 4 + 2] = static_cast<uint8_t>((v >> 16) & 0xFFu);
+                bytes[l * 4 + 3] = static_cast<uint8_t>((v >> 24) & 0xFFu);
+            }
+        } else {
+            const uint2 raw = *reinterpret_cast<const uint2*>(packed_weights + pair_base);
+            const uint32_t lanes[2] = {raw.x, raw.y};
+            #pragma unroll
+            for (int l = 0; l < 2; ++l) {
+                const uint32_t v = lanes[l];
+                bytes[l * 4 + 0] = static_cast<uint8_t>(v & 0xFFu);
+                bytes[l * 4 + 1] = static_cast<uint8_t>((v >> 8) & 0xFFu);
+                bytes[l * 4 + 2] = static_cast<uint8_t>((v >> 16) & 0xFFu);
+                bytes[l * 4 + 3] = static_cast<uint8_t>((v >> 24) & 0xFFu);
+            }
+        }
+    } else {
+        #pragma unroll
+        for (int i = 0; i < pairs_per_thread; ++i) {
+            const int64_t p = pair_base + i;
+            bytes[i] = (p < num_pairs_total) ?
packed_weights[p] : 0u;
+        }
+    }
+
+    #pragma unroll
+    for (int i = 0; i < pairs_per_thread; ++i) {
+        const int64_t elem0 = elem_base + static_cast<int64_t>(i) * 2;
+        if (elem0 >= numel) break;
+
+        const int idx0 = bytes[i] & 0x0F;
+        const int idx1 = (bytes[i] >> 4) & 0x0F;
+
+        const int64_t block_idx0 = elem0 / blocksize;
+        const float scale0 = code2[absmax_q[block_idx0]] * absmax2[block_idx0 / 256];
+        const float w0 = d_nf4[idx0] * scale0 + offset;
+
+        const int64_t elem1 = elem0 + 1;
+        if (elem1 < numel) {
+            const int64_t block_idx1 = elem1 / blocksize;
+            const float scale1 = code2[absmax_q[block_idx1]] * absmax2[block_idx1 / 256];
+            const float w1 = d_nf4[idx1] * scale1 + offset;
+            store_pair(out + elem0, cast_to<T>(w0), cast_to<T>(w1));
+        } else {
+            out[elem0] = cast_to<T>(w0);
+        }
+    }
+}
+
+template <typename T>
+__global__ __launch_bounds__(128, 8) void dequant_kernel_v4(
+    const uint8_t* __restrict__ packed_weights,
+    const uint8_t* __restrict__ absmax_q,
+    const float* __restrict__ absmax2,
+    const float* __restrict__ code2,
+    float offset,
+    int64_t numel,
+    int32_t blocksize,
+    T* __restrict__ out)
+{
+    constexpr int pairs_per_thread = kPairsPerThreadV3;
+
+    const int64_t tid = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
+    const int64_t pair_base = tid * pairs_per_thread;
+    const int64_t elem_base = pair_base * 2;
+    const int64_t num_pairs_total = (numel + 1) / 2;
+
+    if (elem_base >= numel) return;
+
+    uint8_t bytes[pairs_per_thread];
+
+    if (pair_base + pairs_per_thread <= num_pairs_total) {
+        const uint2 raw = *reinterpret_cast<const uint2*>(packed_weights + pair_base);
+        const uint32_t lanes[2] = {raw.x, raw.y};
+        #pragma unroll
+        for (int l = 0; l < 2; ++l) {
+            const uint32_t v = lanes[l];
+            bytes[l * 4 + 0] = static_cast<uint8_t>(v & 0xFFu);
+            bytes[l * 4 + 1] = static_cast<uint8_t>((v >> 8) & 0xFFu);
+            bytes[l * 4 + 2] = static_cast<uint8_t>((v >> 16) & 0xFFu);
+            bytes[l * 4 + 3] = static_cast<uint8_t>((v >> 24) & 0xFFu);
+        }
+    } else {
+        #pragma unroll
+        for (int i = 0; i < pairs_per_thread; ++i) {
+            const int64_t p =
pair_base + i;
+            bytes[i] = (p < num_pairs_total) ? packed_weights[p] : 0u;
+        }
+    }
+
+    #pragma unroll
+    for (int i = 0; i < pairs_per_thread; ++i) {
+        const int64_t elem0 = elem_base + static_cast<int64_t>(i) * 2;
+        if (elem0 >= numel) break;
+
+        const int idx0 = bytes[i] & 0x0F;
+        const int idx1 = (bytes[i] >> 4) & 0x0F;
+
+        const int64_t block_idx0 = elem0 / blocksize;
+        const float scale0 = code2[absmax_q[block_idx0]] * absmax2[block_idx0 / 256];
+        const float w0 = d_nf4[idx0] * scale0 + offset;
+
+        const int64_t elem1 = elem0 + 1;
+        if (elem1 < numel) {
+            const int64_t block_idx1 = elem1 / blocksize;
+            const float scale1 = code2[absmax_q[block_idx1]] * absmax2[block_idx1 / 256];
+            const float w1 = d_nf4[idx1] * scale1 + offset;
+            store_pair(out + elem0, cast_to<T>(w0), cast_to<T>(w1));
+        } else {
+            out[elem0] = cast_to<T>(w0);
+        }
+    }
+}
+
+// ── CPU Reference (for MAE verification) ─────────────────────────────────────
+// Identical formula to the GPU kernel, computed in FP32 on the host.
+// Used to verify: |gpu_output[i] - cpu_ref[i]| < 1e-2 for all i
+void cpu_reference(const NF4Binary& input, std::vector<float>& ref) {
+    const int64_t numel = input.config.rows * input.config.cols;
+    const int32_t blocksize = input.config.blocksize;
+    ref.resize(numel);
+
+    for (int64_t i = 0; i < numel; ++i) {
+        const int64_t pair_idx = i / 2;
+        const bool low = (i % 2) == 0;
+        const uint8_t packed = input.packed_weights[pair_idx];
+        const int idx = low ?
(packed & 0x0F) : ((packed >> 4) & 0x0F);
+
+        const int64_t block_idx = i / blocksize;
+        const int64_t group_idx = block_idx / 256;
+
+        // Reconstruct two-level scale from stored FP16 bits
+        const float scale_l1 = fp16_bits_to_float(input.code2_raw[input.absmax_q[block_idx]]);
+        const float scale_l2 = fp16_bits_to_float(input.absmax2_raw[group_idx]);
+        ref[i] = kNF4[idx] * scale_l1 * scale_l2 + input.offset;
+    }
+}
+
+// ── GPU Launch Wrapper ───────────────────────────────────────────────────────
+template <typename T>
+bool launch_cuda(const NF4Binary& input, std::vector<float>& output, std::vector<float>& gpu_fp32_out) {
+    auto fail_cuda = [](const char* stage, cudaError_t err) -> bool {
+        std::cerr << "FAIL " << stage << ": [" << (int)err << "] "
+                  << cudaGetErrorString(err) << std::endl;
+        return false;
+    };
+
+    const int64_t numel = input.config.rows * input.config.cols;
+    const int64_t num_pairs = ceil_div(numel, 2);
+    const int64_t num_blocks = ceil_div(numel, input.config.blocksize);
+    const int64_t num_groups = ceil_div(num_blocks, 256);
+
+    std::cout << "GPU launch: numel=" << numel
+              << " pairs=" << num_pairs
+              << " blocks=" << num_blocks
+              << " groups=" << num_groups << std::endl;
+
+    // ── Device allocations ──────────────────────────────────────────────────
+    uint8_t* d_packed = nullptr;
+    uint8_t* d_absmax_q = nullptr;
+    float* d_absmax2 = nullptr;
+    float* d_code2 = nullptr;
+    T* d_out = nullptr;
+
+    cudaError_t err;
+    if ((err = cudaMalloc(&d_packed, num_pairs)) != cudaSuccess) return fail_cuda("malloc packed", err);
+    if ((err = cudaMalloc(&d_absmax_q, num_blocks)) != cudaSuccess) return fail_cuda("malloc absmax_q", err);
+    if ((err = cudaMalloc(&d_absmax2, num_groups * sizeof(float))) != cudaSuccess) return fail_cuda("malloc absmax2", err);
+    if ((err = cudaMalloc(&d_code2, 256 * sizeof(float))) != cudaSuccess) return fail_cuda("malloc code2", err);
+    if ((err = cudaMalloc(&d_out, numel * sizeof(T))) != cudaSuccess) return fail_cuda("malloc out", err);
+
+    // ──
Convert FP16 metadata to FP32 for GPU ─────────────────────────────
+    // (GPU kernel uses float to avoid fp16 precision issues in scale multiplication)
+    std::vector<float> h_absmax2(num_groups), h_code2(256);
+    for (int64_t i = 0; i < num_groups; ++i) h_absmax2[i] = fp16_bits_to_float(input.absmax2_raw[i]);
+    for (int i = 0; i < 256; ++i) h_code2[i] = fp16_bits_to_float(input.code2_raw[i]);
+
+    // ── Host → Device transfers ─────────────────────────────────────────────
+    if ((err = cudaMemcpy(d_packed, input.packed_weights.data(), num_pairs, cudaMemcpyHostToDevice)) != cudaSuccess) return fail_cuda("memcpy packed", err);
+    if ((err = cudaMemcpy(d_absmax_q, input.absmax_q.data(), num_blocks, cudaMemcpyHostToDevice)) != cudaSuccess) return fail_cuda("memcpy absmax_q", err);
+    if ((err = cudaMemcpy(d_absmax2, h_absmax2.data(), num_groups * sizeof(float), cudaMemcpyHostToDevice)) != cudaSuccess) return fail_cuda("memcpy absmax2", err);
+    if ((err = cudaMemcpy(d_code2, h_code2.data(), 256 * sizeof(float), cudaMemcpyHostToDevice)) != cudaSuccess) return fail_cuda("memcpy code2", err);
+
+    const double bytes_read = (double)(num_pairs + num_blocks + num_groups * 2 + 256 * 2);
+    const double bytes_write = (double)(numel * sizeof(T));
+    const double total_gb = (bytes_read + bytes_write) / 1e9;
+    float ms_v2 = 0.0f;
+    float ms_v3 = 0.0f;
+    float ms_v4 = 0.0f;
+
+    {
+        const int threads_v2 = 256;
+        const int blocks_v2 = static_cast<int>(ceil_div(num_pairs, static_cast<int64_t>(threads_v2)));
+        cudaEvent_t t_start, t_stop;
+        cudaEventCreate(&t_start);
+        cudaEventCreate(&t_stop);
+        cudaEventRecord(t_start);
+        dequant_kernel<T><<<blocks_v2, threads_v2>>>(
+            d_packed, d_absmax_q, d_absmax2, d_code2,
+            input.offset, numel, input.config.blocksize, d_out);
+        cudaEventRecord(t_stop);
+        if ((err = cudaGetLastError()) != cudaSuccess) return fail_cuda("kernel launch v2", err);
+        if ((err = cudaEventSynchronize(t_stop)) != cudaSuccess) return fail_cuda("event sync v2", err);
+        cudaEventElapsedTime(&ms_v2, t_start, t_stop);
+        cudaEventDestroy(t_start);
+        cudaEventDestroy(t_stop);
+    }
+
+    {
+        const int threads_v3 = 128;
+        const int64_t num_threads_v3 = ceil_div(num_pairs, static_cast<int64_t>(kPairsPerThreadV3));
+        const int blocks_v3 = static_cast<int>(ceil_div(num_threads_v3, static_cast<int64_t>(threads_v3)));
+        cudaEvent_t t_start, t_stop;
+        cudaEventCreate(&t_start);
+        cudaEventCreate(&t_stop);
+        cudaEventRecord(t_start);
+        dequant_kernel_v3<T><<<blocks_v3, threads_v3>>>(
+            d_packed, d_absmax_q, d_absmax2, d_code2,
+            input.offset, numel, input.config.blocksize, d_out);
+        cudaEventRecord(t_stop);
+        if ((err = cudaGetLastError()) != cudaSuccess) return fail_cuda("kernel launch v3", err);
+        if ((err = cudaEventSynchronize(t_stop)) != cudaSuccess) return fail_cuda("event sync v3", err);
+        cudaEventElapsedTime(&ms_v3, t_start, t_stop);
+        cudaEventDestroy(t_start);
+        cudaEventDestroy(t_stop);
+    }
+
+    int min_grid_size_v4 = 0;
+    int block_size_v4 = 0;
+    if ((err = cudaOccupancyMaxPotentialBlockSize(
+             &min_grid_size_v4,
+             &block_size_v4,
+             dequant_kernel_v4<T>,
+             0,
+             128)) != cudaSuccess) {
+        return fail_cuda("occupancy v4", err);
+    }
+    if (block_size_v4 <= 0 || block_size_v4 > 128) {
+        block_size_v4 = 128;
+    }
+
+    {
+        const int64_t num_threads_v4 = ceil_div(num_pairs, static_cast<int64_t>(kPairsPerThreadV3));
+        const int blocks_v4 = static_cast<int>(ceil_div(num_threads_v4, static_cast<int64_t>(block_size_v4)));
+        cudaEvent_t t_start, t_stop;
+        cudaEventCreate(&t_start);
+        cudaEventCreate(&t_stop);
+        cudaEventRecord(t_start);
+        dequant_kernel_v4<T><<<blocks_v4, block_size_v4>>>(
+            d_packed, d_absmax_q, d_absmax2, d_code2,
+            input.offset, numel, input.config.blocksize, d_out);
+        cudaEventRecord(t_stop);
+        if ((err = cudaGetLastError()) != cudaSuccess) return fail_cuda("kernel launch v4", err);
+        if ((err = cudaEventSynchronize(t_stop)) != cudaSuccess) return fail_cuda("event sync v4", err);
+        cudaEventElapsedTime(&ms_v4, t_start, t_stop);
+        cudaEventDestroy(t_start);
+        cudaEventDestroy(t_stop);
+    }
+
+    if ((err = cudaDeviceSynchronize()) != cudaSuccess) return fail_cuda("sync", err);
+
+    const double bw_v2 = total_gb / (ms_v2 / 1000.0);
+    const double bw_v3 = total_gb / (ms_v3 / 1000.0);
+    const double bw_v4 = total_gb / (ms_v4 / 1000.0);
+    const double speedup_v3 = ms_v3 > 0.0 ? (ms_v2 / ms_v3) : 0.0;
+    const double speedup_v4 = ms_v4 > 0.0 ? (ms_v2 / ms_v4) : 0.0;
+    const double speedup_v4_vs_v3 = ms_v4 > 0.0 ? (ms_v3 / ms_v4) : 0.0;
+
+    std::cout << "[v2] Kernel time : " << ms_v2 << " ms | Bandwidth : " << bw_v2
+              << " GB/s (" << (bw_v2 / 1935.0 * 100.0) << "% of A100 peak 1935 GB/s)" << std::endl;
+    std::cout << "[v3] Kernel time : " << ms_v3 << " ms | Bandwidth : " << bw_v3
+              << " GB/s (" << (bw_v3 / 1935.0 * 100.0) << "% of A100 peak 1935 GB/s)" << std::endl;
+    std::cout << "[v3 speedup vs v2]: " << speedup_v3 << "x" << std::endl;
+    std::cout << "[v4] Kernel time : " << ms_v4 << " ms | Bandwidth : " << bw_v4
+              << " GB/s (" << (bw_v4 / 1935.0 * 100.0) << "% of A100 peak 1935 GB/s)" << std::endl;
+    std::cout << "[v4 speedup vs v2]: " << speedup_v4 << "x" << std::endl;
+    std::cout << "[v4 speedup vs v3]: " << speedup_v4_vs_v3 << "x"
+              << " | occupancy block=" << block_size_v4
+              << " min_grid=" << min_grid_size_v4 << std::endl;
+
+    // ── Device → Host copy ──────────────────────────────────────────────────
+    gpu_fp32_out.resize(numel);
+    if constexpr (std::is_same<T, __half>::value) {
+        std::vector<__half> h_out(numel);
+        if ((err = cudaMemcpy(h_out.data(), d_out, numel * sizeof(__half), cudaMemcpyDeviceToHost)) != cudaSuccess)
+            return fail_cuda("memcpy output fp16", err);
+        for (int64_t i = 0; i < numel; ++i) gpu_fp32_out[i] = __half2float(h_out[i]);
+    } else {
+        std::vector<__nv_bfloat16> h_out(numel);
+        if ((err = cudaMemcpy(h_out.data(), d_out, numel * sizeof(__nv_bfloat16), cudaMemcpyDeviceToHost)) != cudaSuccess)
+            return fail_cuda("memcpy output bf16", err);
+        for (int64_t i = 0; i < numel; ++i) gpu_fp32_out[i] = __bfloat162float(h_out[i]);
+    }
+    output = gpu_fp32_out;
+
+    cudaFree(d_packed); cudaFree(d_absmax_q);
+    cudaFree(d_absmax2);
cudaFree(d_code2); cudaFree(d_out);
+    return true;
+}
+
+} // namespace
+
+// ── Public API ───────────────────────────────────────────────────────────────
+
+bool load_nf4_binary(const char* file_path, NF4Binary& out) {
+    std::ifstream fin(file_path, std::ios::binary);
+    if (!fin.is_open()) return false;
+
+    int64_t rows = 0, cols = 0; int32_t blocksize = 0;
+    fin.read(reinterpret_cast<char*>(&rows), sizeof(rows));
+    fin.read(reinterpret_cast<char*>(&cols), sizeof(cols));
+    fin.read(reinterpret_cast<char*>(&blocksize), sizeof(blocksize));
+    if (!fin.good()) return false;
+
+    const int64_t numel = rows * cols;
+    const int64_t num_pairs = ceil_div(numel, 2);
+    const int64_t num_blocks = ceil_div(numel, blocksize);
+    const int64_t num_groups = ceil_div(num_blocks, 256);
+
+    out.config = {rows, cols, blocksize, ComputeType::FP16};
+    out.packed_weights.resize(num_pairs);
+    out.absmax_q.resize(num_blocks);
+    out.absmax2_raw.resize(num_groups);
+    out.code2_raw.resize(256);
+
+    fin.read(reinterpret_cast<char*>(out.packed_weights.data()), num_pairs);
+    fin.read(reinterpret_cast<char*>(out.absmax_q.data()), num_blocks);
+    fin.read(reinterpret_cast<char*>(out.absmax2_raw.data()), num_groups * sizeof(uint16_t));
+    fin.read(reinterpret_cast<char*>(out.code2_raw.data()), 256 * sizeof(uint16_t));
+    fin.read(reinterpret_cast<char*>(&out.offset), sizeof(float));
+
+    std::cout << "Loaded: " << rows << "x" << cols
+              << " blocksize=" << blocksize << " offset=" << out.offset << std::endl;
+    return fin.good();
+}
+
+bool save_float_output(const char* file_path, const std::vector<float>& data) {
+    std::ofstream fout(file_path, std::ios::binary);
+    if (!fout.is_open()) return false;
+    fout.write(reinterpret_cast<const char*>(data.data()), data.size() * sizeof(float));
+    return fout.good();
+}
+
+bool run_dequant_cuda(const NF4Binary& input, std::vector<float>& output, float& mae) {
+    std::vector<float> gpu_out;
+    const bool ok = (input.config.compute_type == ComputeType::FP16)
+        ?
launch_cuda<__half>(input, output, gpu_out)
+      : launch_cuda<__nv_bfloat16>(input, output, gpu_out);
+  if (!ok) return false;
+
+  // MAE against CPU reference
+  std::vector<float> ref;
+  cpu_reference(input, ref);
+  double err_sum = 0.0;
+  for (size_t i = 0; i < ref.size(); ++i)
+    err_sum += std::abs((double)gpu_out[i] - (double)ref[i]);
+  mae = static_cast<float>(err_sum / (double)ref.size());
+  std::cout << "MAE (v4 GPU vs CPU ref): " << mae << (mae < 1e-2f ? " ✓ PASS" : " ✗ FAIL (threshold 1e-2)") << std::endl;
+  return true;
+}
diff --git a/03_nf4_dequant/SkyHigh-achieving/main.cpp b/03_nf4_dequant/SkyHigh-achieving/main.cpp
new file mode 100644
index 0000000..a891e6c
--- /dev/null
+++ b/03_nf4_dequant/SkyHigh-achieving/main.cpp
@@ -0,0 +1,76 @@
+#include "dequant_kernel.h"
+#include <cuda_runtime.h>
+#include <iostream>
+#include <string>
+#include <vector>
+
+namespace {
+
+ComputeType parse_compute_type(const std::string& s) {
+  if (s == "bf16") {
+    return ComputeType::BF16;
+  }
+  return ComputeType::FP16;
+}
+
+}
+
+int main(int argc, char** argv) {
+  int deviceCount = 0;
+  cudaError_t error = cudaGetDeviceCount(&deviceCount);
+  if (error != cudaSuccess) {
+    std::cerr << "cudaGetDeviceCount failed: " << cudaGetErrorString(error) << std::endl;
+    std::cerr << "Ensure this binary is executed inside a GPU allocation (srun/sbatch)." << std::endl;
+    return -1;
+  }
+  if (deviceCount == 0) {
+    std::cerr << "No CUDA-capable devices found in current context." << std::endl;
+    std::cerr << "Use: srun --partition=nvidia --gres=gpu:nvidia:1 ... ./nf4_dequant" << std::endl;
+    return -1;
+  }
+
+  cudaDeviceProp prop;
+  cudaError_t propErr = cudaGetDeviceProperties(&prop, 0);
+  if (propErr == cudaSuccess) {
+    std::cout << "Using device 0: " << prop.name << " (Compute Capability " << prop.major << "."
<< prop.minor << ")" << std::endl; + } else { + std::cerr << "cudaGetDeviceProperties failed: " << cudaGetErrorString(propErr) << std::endl; + return -1; + } + + if (argc < 4) { + std::cerr << "Usage: nf4_dequant " << std::endl; + return 1; + } + + NF4Binary input; + if (!load_nf4_binary(argv[1], input)) { + std::cerr << "Failed to load input binary: " << argv[1] << std::endl; + return 2; + } + input.config.compute_type = parse_compute_type(argv[2]); + + std::vector output; + float mae = 0.0f; + if (!run_dequant_cuda(input, output, mae)) { + std::cerr << "CUDA run failed" << std::endl; + return 3; + } + + if (!save_float_output(argv[3], output)) { + std::cerr << "Failed to save output: " << argv[3] << std::endl; + return 4; + } + + std::cout << "rows=" << input.config.rows + << " cols=" << input.config.cols + << " blocksize=" << input.config.blocksize + << " mae=" << mae << std::endl; + + if (mae >= 1e-2f) { + std::cerr << "MAE threshold failed" << std::endl; + return 5; + } + + return 0; +} diff --git a/03_nf4_dequant/SkyHigh-achieving/run_log_remote.md b/03_nf4_dequant/SkyHigh-achieving/run_log_remote.md new file mode 100644 index 0000000..5fd877e --- /dev/null +++ b/03_nf4_dequant/SkyHigh-achieving/run_log_remote.md @@ -0,0 +1,889 @@ + +## Step0-CheckEnv +- Time: 2026-03-10 20:57:45 +- Status: FAIL +- Command: nvcc --version && echo 'nvcc OK' && python3 --version + +~~~text +./run_pipeline.sh: line 41: nvcc: command not found +~~~ + +## Step0-CheckEnv +- Time: 2026-03-10 21:08:01 +- Status: FAIL +- Command: nvcc --version && echo 'nvcc OK' && python3 --version + +~~~text +./run_pipeline.sh: line 41: nvcc: command not found +~~~ + +## Step0-CheckEnv +- Time: 2026-03-10 22:27:26 +- Status: SUCCESS +- Command: echo NVCC=/usr/local/cuda/bin/nvcc && /usr/local/cuda/bin/nvcc --version && echo 'nvcc OK' && python3 --version + +~~~text +NVCC=/usr/local/cuda/bin/nvcc +nvcc: NVIDIA (R) Cuda compiler driver +Copyright (c) 2005-2025 NVIDIA Corporation +Built on 
Wed_Jan_15_19:20:09_PST_2025 +Cuda compilation tools, release 12.8, V12.8.61 +Build cuda_12.8.r12.8/compiler.35404655_0 +nvcc OK +Python 3.12.3 +~~~ + +## Step1-GenerateData +- Time: 2026-03-10 22:27:26 +- Status: SUCCESS +- Command: python3 generate_nf4_bin.py --rows 1024 --cols 1024 --blocksize 64 --output sample_nf4.bin + +~~~text +Generating data: 1024x1024 (numel=1048576) + blocksize=64 + num_pairs=524288 + num_blocks=16384 + num_groups=64 +Saved to sample_nf4.bin +~~~ + +## Step2-BuildCUDA +- Time: 2026-03-10 22:27:29 +- Status: SUCCESS +- Command: /usr/local/cuda/bin/nvcc -O3 -std=c++17 -arch=sm_80 -lineinfo -o ./nf4_dequant main.cpp dequant_kernel.cu + +~~~text + +~~~ + +## Step3-RunDequant-GPU +- Time: 2026-03-10 22:27:29 +- Status: SUCCESS +- Command: ./nf4_dequant sample_nf4.bin fp16 sample_out.bin + +~~~text +Using device 0: NVIDIA A100-SXM4-80GB (Compute Capability 8.0) +Loaded: 1024x1024 blocksize=64 offset=0.0429335 +GPU launch: numel=1048576 pairs=524288 blocks=16384 groups=64 +Kernel time : 2.43862 ms +Bandwidth : 1.08195 GB/s (A100 peak ~1935 GB/s, 0.0559146%) +MAE (GPU vs CPU ref): 2.25737e-05 ✓ PASS +rows=1024 cols=1024 blocksize=64 mae=2.25737e-05 +~~~ + +## Step4-Profile-nsys +- Time: 2026-03-10 22:27:35 +- Status: SUCCESS +- Command: nsys profile -o profile_report -f true --stats=true --cuda-memory-usage=true ./nf4_dequant sample_nf4.bin fp16 sample_out_profile.bin + +~~~text +Using device 0: NVIDIA A100-SXM4-80GB (Compute Capability 8.0) +Loaded: 1024x1024 blocksize=64 offset=0.0429335 +GPU launch: numel=1048576 pairs=524288 blocks=16384 groups=64 +Kernel time : 2.44131 ms +Bandwidth : 1.08076 GB/s (A100 peak ~1935 GB/s, 0.0558531%) +MAE (GPU vs CPU ref): 2.25737e-05 ✓ PASS +rows=1024 cols=1024 blocksize=64 mae=2.25737e-05 +Collecting data... 
+Generating '/tmp/nsys-report-9fb8.qdstrm'
+ [1/8] [========================100%] profile_report.nsys-rep
+ [2/8] [========================100%] profile_report.sqlite
+SKIPPED: /home/qtc_yu/nf4_project/profile_report.sqlite does not contain NV Tools Extension (NVTX) data.
+[3/8] Executing 'nvtx_sum' stats report +[4/8] Executing 'osrt_sum' stats report + + Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name + -------- --------------- --------- ---------- ---------- -------- --------- ----------- ---------------------- + 55.2 730128649 16 45633040.6 26820737.5 1423 289243599 71896110.3 poll + 44.2 583894130 1617 361097.2 85711.0 1036 21523206 1366756.0 ioctl + 0.2 2889414 43 67195.7 13768.0 6551 1783848 270644.6 mmap64 + 0.2 2185559 1 2185559.0 2185559.0 2185559 2185559 0.0 writev + 0.1 674119 118 5712.9 5081.0 1593 21596 3566.4 open64 + 0.0 641353 10 64135.3 68869.5 24351 125865 35979.6 sem_timedwait + 0.0 544613 110 4951.0 2730.0 1005 63163 7202.7 fopen + 0.0 318022 2 159011.0 159011.0 145779 172243 18712.9 pthread_create + 0.0 186230 13 14325.4 7970.0 2042 92103 23865.6 mmap + 0.0 166282 11 15116.5 2277.0 1003 136872 40472.2 read + 0.0 94287 1 94287.0 94287.0 94287 94287 0.0 pthread_cond_wait + 0.0 83055 11 7550.5 8191.0 4463 10457 2104.4 write + 0.0 64940 33 1967.9 1410.0 1003 7914 1532.4 fclose + 0.0 64210 7 9172.9 9471.0 1080 21441 7047.1 fflush + 0.0 29258 1 29258.0 29258.0 29258 29258 0.0 fgets + 0.0 28034 6 4672.3 5518.0 1233 6373 1951.9 open + 0.0 18949 3 6316.3 6365.0 3919 8665 2373.4 munmap + 0.0 15052 5 3010.4 1779.0 1287 7850 2747.5 fwrite + 0.0 11502 1 11502.0 11502.0 11502 11502 0.0 connect + 0.0 11498 2 5749.0 5749.0 5008 6490 1047.9 socket + 0.0 11081 3 3693.7 3759.0 1860 5462 1801.9 pipe2 + 0.0 7859 7 1122.7 1066.0 1003 1417 145.0 fcntl + 0.0 5869 2 2934.5 2934.5 2184 3685 1061.4 pthread_cond_broadcast + 0.0 5356 1 5356.0 5356.0 5356 5356 0.0 fread + 0.0 2275 1 2275.0 2275.0 2275 2275 0.0 bind + +[5/8] Executing 'cuda_api_sum' stats report + + Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name + -------- --------------- --------- ---------- --------- -------- --------- ----------- --------------------------------- + 97.6 466928783 5 93385756.6 
5477.0 3995 466790440 208739570.3 cudaMalloc + 1.0 4683757 5 936751.4 274816.0 49882 2417095 1104288.5 cudaMemcpy + 0.5 2476208 1 2476208.0 2476208.0 2476208 2476208 0.0 cudaLaunchKernel + 0.5 2325002 1 2325002.0 2325002.0 2325002 2325002 0.0 cudaDeviceSynchronize + 0.2 1119916 5 223983.2 28689.0 6410 923241 395735.0 cudaFree + 0.2 968121 1 968121.0 968121.0 968121 968121 0.0 cudaGetDeviceProperties_v2_v12000 + 0.0 25180 2 12590.0 12590.0 8814 16366 5340.1 cudaEventRecord + 0.0 11778 2 5889.0 5889.0 793 10985 7206.8 cudaEventCreate + 0.0 2432 2 1216.0 1216.0 434 1998 1105.9 cudaEventDestroy + 0.0 755 1 755.0 755.0 755 755 0.0 cuModuleGetLoadingMode + +[6/8] Executing 'cuda_gpu_kern_sum' stats report + + Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name + -------- --------------- --------- -------- -------- -------- -------- ----------- ---------------------------------------------------------------------------------------------------- + 100.0 10272 1 10272.0 10272.0 10272 10272 0.0 void ::dequant_kernel<__half>(const unsigned char *, const unsigned char *, const float *,… + +[7/8] Executing 'cuda_gpu_mem_time_sum' stats report + + Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation + -------- --------------- ----- -------- -------- -------- -------- ----------- ---------------------------- + 66.6 84833 1 84833.0 84833.0 84833 84833 0.0 [CUDA memcpy Device-to-Host] + 33.4 42624 4 10656.0 3536.0 1984 33568 15310.3 [CUDA memcpy Host-to-Device] + +[8/8] Executing 'cuda_gpu_mem_size_sum' stats report + + Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation + ---------- ----- -------- -------- -------- -------- ----------- ---------------------------- + 2.097 1 2.097 2.097 2.097 2.097 0.000 [CUDA memcpy Device-to-Host] + 0.542 4 0.135 0.009 0.000 0.524 0.259 [CUDA memcpy Host-to-Device] + +Generated: + /home/qtc_yu/nf4_project/profile_report.nsys-rep + 
/home/qtc_yu/nf4_project/profile_report.sqlite +~~~ + +## Step5-BandwidthCalc +- Time: 2026-03-10 22:27:35 +- Status: SUCCESS +- Command: python3 -c " +import struct, os, time + +# Theoretical A100 HBM2e bandwidth: ~1935 GB/s +# Our kernel reads: num_pairs bytes (packed) + num_blocks bytes (absmax_q) +# + num_groups*2 bytes (absmax2) + 256*2 bytes (code2) +# Our kernel writes: numel * 2 bytes (fp16 output) + +rows, cols, blocksize = 1024, 1024, 64 +numel = rows * cols +num_pairs = (numel + 1) // 2 +num_blocks = (numel + blocksize - 1) // blocksize +num_groups = (num_blocks + 255) // 256 + +bytes_read = num_pairs + num_blocks + num_groups * 2 + 256 * 2 +bytes_write = numel * 2 +total_bytes = bytes_read + bytes_write + +print(f'Data movement analysis (1024x1024, fp16 output):') +print(f' Read packed_weights : {num_pairs/1024:.0f} KB') +print(f' Read absmax_q : {num_blocks/1024:.0f} KB') +print(f' Read absmax2+code2 : {(num_groups*2+512)/1024:.2f} KB') +print(f' Write fp16 output : {bytes_write/1024/1024:.1f} MB') +print(f' Total data movement : {total_bytes/1024/1024:.2f} MB') +print(f' A100 peak bandwidth : 1935 GB/s') +print(f' Theoretical min time : {total_bytes/1935e9*1000:.3f} ms') +print(f' (if nsys shows >2x this, there is optimization headroom)') +" + +~~~text +Data movement analysis (1024x1024, fp16 output): + Read packed_weights : 512 KB + Read absmax_q : 16 KB + Read absmax2+code2 : 0.62 KB + Write fp16 output : 2.0 MB + Total data movement : 2.52 MB + A100 peak bandwidth : 1935 GB/s + Theoretical min time : 0.001 ms + (if nsys shows >2x this, there is optimization headroom) +~~~ + +## Step0-CheckEnv +- Time: 2026-03-11 16:56:22 +- Status: SUCCESS +- Command: echo NVCC=/usr/local/cuda/bin/nvcc && /usr/local/cuda/bin/nvcc --version && echo 'nvcc OK' && python3 --version + +~~~text +NVCC=/usr/local/cuda/bin/nvcc +nvcc: NVIDIA (R) Cuda compiler driver +Copyright (c) 2005-2025 NVIDIA Corporation +Built on Wed_Jan_15_19:20:09_PST_2025 +Cuda compilation tools, 
release 12.8, V12.8.61 +Build cuda_12.8.r12.8/compiler.35404655_0 +nvcc OK +Python 3.12.3 +~~~ + +## Step1-GenerateData +- Time: 2026-03-11 16:56:23 +- Status: SUCCESS +- Command: python3 generate_nf4_bin.py --rows 1024 --cols 1024 --blocksize 64 --output sample_nf4.bin + +~~~text +Generating data: 1024x1024 (numel=1048576) + blocksize=64 + num_pairs=524288 + num_blocks=16384 + num_groups=64 +Saved to sample_nf4.bin +~~~ + +## Step2-BuildCUDA +- Time: 2026-03-11 16:56:26 +- Status: SUCCESS +- Command: /usr/local/cuda/bin/nvcc -O3 -std=c++17 -arch=sm_80 -lineinfo -o ./nf4_dequant main.cpp dequant_kernel.cu + +~~~text + +~~~ + +## Step3-RunDequant-GPU +- Time: 2026-03-11 16:56:29 +- Status: SUCCESS +- Command: ./nf4_dequant sample_nf4.bin fp16 sample_out.bin + +~~~text +Using device 0: NVIDIA A100-SXM4-80GB (Compute Capability 8.0) +Loaded: 1024x1024 blocksize=64 offset=0.0429335 +GPU launch: numel=1048576 pairs=524288 blocks=16384 groups=64 +[v2] Kernel time : 2.4392 ms | Bandwidth : 1.08169 GB/s (0.0559014% of A100 peak 1935 GB/s) +[v3] Kernel time : 0.024768 ms | Bandwidth : 106.527 GB/s (5.50528% of A100 peak 1935 GB/s) +[v3 speedup vs v2]: 98.4819x +[v4] Kernel time : 0.017472 ms | Bandwidth : 151.011 GB/s (7.80419% of A100 peak 1935 GB/s) +[v4 speedup vs v2]: 139.606x +[v4 speedup vs v3]: 1.41758x | occupancy block=128 min_grid=1296 +MAE (v4 GPU vs CPU ref): 2.25737e-05 ✓ PASS +rows=1024 cols=1024 blocksize=64 mae=2.25737e-05 +~~~ + +## Step4-Profile-nsys +- Time: 2026-03-11 16:56:44 +- Status: SUCCESS +- Command: nsys profile -o profile_report -f true --stats=true --cuda-memory-usage=true ./nf4_dequant sample_nf4.bin fp16 sample_out_profile.bin + +~~~text +Using device 0: NVIDIA A100-SXM4-80GB (Compute Capability 8.0) +Loaded: 1024x1024 blocksize=64 offset=0.0429335 +GPU launch: numel=1048576 pairs=524288 blocks=16384 groups=64 +[v2] Kernel time : 2.44163 ms | Bandwidth : 1.08061 GB/s (0.0558457% of A100 peak 1935 GB/s) +[v3] Kernel time : 0.028192 ms | 
Bandwidth : 93.5891 GB/s (4.83665% of A100 peak 1935 GB/s)
+[v3 speedup vs v2]: 86.6073x
+[v4] Kernel time : 0.021504 ms | Bandwidth : 122.696 GB/s (6.3409% of A100 peak 1935 GB/s)
+[v4 speedup vs v2]: 113.543x
+[v4 speedup vs v3]: 1.31101x | occupancy block=128 min_grid=1296
+MAE (v4 GPU vs CPU ref): 2.25737e-05 ✓ PASS
+rows=1024 cols=1024 blocksize=64 mae=2.25737e-05
+Collecting data...
+Generating '/tmp/nsys-report-d5e1.qdstrm'
+ [1/8] [========================100%] profile_report.nsys-rep
+ [2/8] [========================100%] profile_report.sqlite
+SKIPPED: /home/qtc_yu/nf4_project/profile_report.sqlite does not contain NV Tools Extension (NVTX) data.
+[3/8] Executing 'nvtx_sum' stats report +[4/8] Executing 'osrt_sum' stats report + + Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name + -------- --------------- --------- ---------- ----------- -------- --------- ----------- ---------------------- + 51.9 2939595733 38 77357782.4 100114525.5 199227 233531412 47228401.9 poll + 48.0 2717151410 1616 1681405.6 25058.5 1088 735265052 25629204.6 ioctl + 0.0 2301582 43 53525.2 12216.0 5070 1352339 204348.0 mmap64 + 0.0 2299925 1 2299925.0 2299925.0 2299925 2299925 0.0 writev + 0.0 751805 26 28915.6 1624.0 1005 704577 137812.6 fclose + 0.0 657983 10 65798.3 60865.0 33136 114236 25911.3 sem_timedwait + 0.0 622080 118 5271.9 4083.0 1938 28789 3942.8 open64 + 0.0 581144 140 4151.0 1980.0 1005 63631 6740.3 fopen + 0.0 295241 2 147620.5 147620.5 131424 163817 22905.3 pthread_create + 0.0 175301 12 14608.4 1975.5 1033 144614 40999.9 read + 0.0 160422 13 12340.2 6899.0 1634 80761 20947.7 mmap + 0.0 154848 1 154848.0 154848.0 154848 154848 0.0 pthread_cond_wait + 0.0 74526 11 6775.1 7032.0 2977 10332 2308.0 write + 0.0 50630 6 8438.3 7145.5 2686 18430 5594.4 fflush + 0.0 42885 1 42885.0 42885.0 42885 42885 0.0 fgets + 0.0 33466 6 5577.7 5723.5 2735 7544 1791.0 open + 0.0 28800 5 5760.0 2331.0 1554 20204 8082.4 fwrite + 0.0 15908 3 5302.7 4291.0 2813 8804 3121.0 munmap + 0.0 13623 3 4541.0 4190.0 2060 7373 2673.8 pipe2 + 0.0 12326 2 6163.0 6163.0 5102 7224 1500.5 socket + 0.0 10255 2 5127.5 5127.5 1508 8747 5118.7 pthread_cond_broadcast + 0.0 10135 1 10135.0 10135.0 10135 10135 0.0 connect + 0.0 5859 5 1171.8 1141.0 1072 1320 93.5 fcntl + 0.0 5162 1 5162.0 5162.0 5162 5162 0.0 fread + 0.0 2403 1 2403.0 2403.0 2403 2403 0.0 bind + +[5/8] Executing 'cuda_api_sum' stats report + + Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name + -------- --------------- --------- ----------- --------- -------- ---------- ------------ --------------------------------- + 
99.1 2675449092 5 535089818.4 6354.0 5265 2675288598 1196407490.6 cudaMalloc + 0.5 12855125 5 2571025.0 1721985.0 51727 8553405 3497750.8 cudaMemcpy + 0.3 6915921 3 2305307.0 2371967.0 2170818 2373136 116472.4 cudaEventSynchronize + 0.1 2694946 3 898315.3 41029.0 6375 2647542 1514973.8 cudaLaunchKernel + 0.0 1057945 1 1057945.0 1057945.0 1057945 1057945 0.0 cudaGetDeviceProperties_v2_v12000 + 0.0 467458 5 93491.6 39587.0 6203 291255 120389.5 cudaFree + 0.0 40459 6 6743.2 5736.0 3726 13570 3564.7 cudaEventRecord + 0.0 17044 6 2840.7 826.5 578 11807 4445.4 cudaEventCreate + 0.0 8885 1 8885.0 8885.0 8885 8885 0.0 cudaDeviceSynchronize + 0.0 4650 6 775.0 677.0 426 1255 329.2 cudaEventDestroy + 0.0 1010 1 1010.0 1010.0 1010 1010 0.0 cuModuleGetLoadingMode + +[6/8] Executing 'cuda_gpu_kern_sum' stats report + + Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name + -------- --------------- --------- -------- -------- -------- -------- ----------- ---------------------------------------------------------------------------------------------------- + 37.2 13344 1 13344.0 13344.0 13344 13344 0.0 void ::dequant_kernel_v4<__half>(const unsigned char *, const unsigned char *, const float… + 35.2 12640 1 12640.0 12640.0 12640 12640 0.0 void ::dequant_kernel_v3<__half>(const unsigned char *, const unsigned char *, const float… + 27.6 9888 1 9888.0 9888.0 9888 9888 0.0 void ::dequant_kernel<__half>(const unsigned char *, const unsigned char *, const float *,… + +[7/8] Executing 'cuda_gpu_mem_time_sum' stats report + + Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation + -------- --------------- ----- -------- -------- -------- -------- ----------- ---------------------------- + 63.8 84897 1 84897.0 84897.0 84897 84897 0.0 [CUDA memcpy Device-to-Host] + 36.2 48192 4 12048.0 3456.0 1984 39296 18182.8 [CUDA memcpy Host-to-Device] + +[8/8] Executing 'cuda_gpu_mem_size_sum' stats report + + Total (MB) Count Avg 
(MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation + ---------- ----- -------- -------- -------- -------- ----------- ---------------------------- + 2.097 1 2.097 2.097 2.097 2.097 0.000 [CUDA memcpy Device-to-Host] + 0.542 4 0.135 0.009 0.000 0.524 0.259 [CUDA memcpy Host-to-Device] + +Generated: + /home/qtc_yu/nf4_project/profile_report.nsys-rep + /home/qtc_yu/nf4_project/profile_report.sqlite +~~~ + +## Step5-BandwidthCalc +- Time: 2026-03-11 16:56:44 +- Status: SUCCESS +- Command: python3 -c " +import struct, os, time + +# Theoretical A100 HBM2e bandwidth: ~1935 GB/s +# Our kernel reads: num_pairs bytes (packed) + num_blocks bytes (absmax_q) +# + num_groups*2 bytes (absmax2) + 256*2 bytes (code2) +# Our kernel writes: numel * 2 bytes (fp16 output) + +rows, cols, blocksize = 1024, 1024, 64 +numel = rows * cols +num_pairs = (numel + 1) // 2 +num_blocks = (numel + blocksize - 1) // blocksize +num_groups = (num_blocks + 255) // 256 + +bytes_read = num_pairs + num_blocks + num_groups * 2 + 256 * 2 +bytes_write = numel * 2 +total_bytes = bytes_read + bytes_write + +print(f'Data movement analysis (1024x1024, fp16 output):') +print(f' Read packed_weights : {num_pairs/1024:.0f} KB') +print(f' Read absmax_q : {num_blocks/1024:.0f} KB') +print(f' Read absmax2+code2 : {(num_groups*2+512)/1024:.2f} KB') +print(f' Write fp16 output : {bytes_write/1024/1024:.1f} MB') +print(f' Total data movement : {total_bytes/1024/1024:.2f} MB') +print(f' A100 peak bandwidth : 1935 GB/s') +print(f' Theoretical min time : {total_bytes/1935e9*1000:.3f} ms') +print(f' (if nsys shows >2x this, there is optimization headroom)') +" + +~~~text +Data movement analysis (1024x1024, fp16 output): + Read packed_weights : 512 KB + Read absmax_q : 16 KB + Read absmax2+code2 : 0.62 KB + Write fp16 output : 2.0 MB + Total data movement : 2.52 MB + A100 peak bandwidth : 1935 GB/s + Theoretical min time : 0.001 ms + (if nsys shows >2x this, there is optimization headroom) +~~~ + +## Step0-CheckEnv +- 
Time: 2026-03-11 17:27:31 +- Status: SUCCESS +- Command: echo NVCC=/usr/local/cuda/bin/nvcc && /usr/local/cuda/bin/nvcc --version && echo 'nvcc OK' && python3 --version + +~~~text +NVCC=/usr/local/cuda/bin/nvcc +nvcc: NVIDIA (R) Cuda compiler driver +Copyright (c) 2005-2025 NVIDIA Corporation +Built on Wed_Jan_15_19:20:09_PST_2025 +Cuda compilation tools, release 12.8, V12.8.61 +Build cuda_12.8.r12.8/compiler.35404655_0 +nvcc OK +Python 3.12.3 +~~~ + +## Step1-GenerateData +- Time: 2026-03-11 17:27:31 +- Status: SUCCESS +- Command: python3 generate_nf4_bin.py --rows 1024 --cols 1024 --blocksize 64 --output sample_nf4.bin + +~~~text +Generating data: 1024x1024 (numel=1048576) + blocksize=64 + num_pairs=524288 + num_blocks=16384 + num_groups=64 +Saved to sample_nf4.bin +~~~ + +## Step2-BuildCUDA +- Time: 2026-03-11 17:27:34 +- Status: SUCCESS +- Command: /usr/local/cuda/bin/nvcc -O3 -std=c++17 -arch=sm_80 -lineinfo -o ./nf4_dequant main.cpp dequant_kernel.cu + +~~~text + +~~~ + +## Step3-RunDequant-GPU +- Time: 2026-03-11 17:27:37 +- Status: SUCCESS +- Command: ./nf4_dequant sample_nf4.bin fp16 sample_out.bin + +~~~text +Using device 0: NVIDIA A100-SXM4-80GB (Compute Capability 8.0) +Loaded: 1024x1024 blocksize=64 offset=0.0429335 +GPU launch: numel=1048576 pairs=524288 blocks=16384 groups=64 +[v2] Kernel time : 2.4385 ms | Bandwidth : 1.082 GB/s (0.0559176% of A100 peak 1935 GB/s) +[v3] Kernel time : 0.024736 ms | Bandwidth : 106.665 GB/s (5.5124% of A100 peak 1935 GB/s) +[v3 speedup vs v2]: 98.5809x +[v4] Kernel time : 0.017632 ms | Bandwidth : 149.641 GB/s (7.73337% of A100 peak 1935 GB/s) +[v4 speedup vs v2]: 138.299x +[v4 speedup vs v3]: 1.4029x | occupancy block=128 min_grid=1296 +MAE (v4 GPU vs CPU ref): 2.25737e-05 ✓ PASS +rows=1024 cols=1024 blocksize=64 mae=2.25737e-05 +~~~ + +## Step4-Profile-nsys +- Time: 2026-03-11 17:27:52 +- Status: SUCCESS +- Command: nsys profile -o profile_report -f true --stats=true --cuda-memory-usage=true ./nf4_dequant 
sample_nf4.bin fp16 sample_out_profile.bin
+
+~~~text
+Using device 0: NVIDIA A100-SXM4-80GB (Compute Capability 8.0)
+Loaded: 1024x1024 blocksize=64 offset=0.0429335
+GPU launch: numel=1048576 pairs=524288 blocks=16384 groups=64
+[v2] Kernel time : 2.44227 ms | Bandwidth : 1.08033 GB/s (0.0558311% of A100 peak 1935 GB/s)
+[v3] Kernel time : 0.027648 ms | Bandwidth : 95.4306 GB/s (4.93181% of A100 peak 1935 GB/s)
+[v3 speedup vs v2]: 88.3345x
+[v4] Kernel time : 0.020544 ms | Bandwidth : 128.43 GB/s (6.6372% of A100 peak 1935 GB/s)
+[v4 speedup vs v2]: 118.88x
+[v4 speedup vs v3]: 1.34579x | occupancy block=128 min_grid=1296
+MAE (v4 GPU vs CPU ref): 2.25737e-05 ✓ PASS
+rows=1024 cols=1024 blocksize=64 mae=2.25737e-05
+Collecting data...
+Generating '/tmp/nsys-report-e3f0.qdstrm'
+ [1/8] profile_report.nsys-rep [100%]
+ [2/8] profile_report.sqlite [100%]
+SKIPPED: 
/home/qtc_yu/nf4_project/profile_report.sqlite does not contain NV Tools Extension (NVTX) data. +[3/8] Executing 'nvtx_sum' stats report +[4/8] Executing 'osrt_sum' stats report + + Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name + -------- --------------- --------- ---------- ---------- -------- --------- ----------- ---------------------- + 76.2 2851207041 1615 1765453.3 119973.0 1311 501696372 21175201.6 ioctl + 23.5 880003946 16 55000246.6 22579085.5 1102 470726611 114966907.9 poll + 0.1 2850024 43 66279.6 10821.0 5307 1906548 288230.1 mmap64 + 0.0 1831231 1 1831231.0 1831231.0 1831231 1831231 0.0 writev + 0.0 1321864 131 10090.6 2920.0 1001 667826 58152.8 fopen + 0.0 717257 118 6078.4 5004.0 1440 18302 3177.1 open64 + 0.0 593912 10 59391.2 60245.0 20423 97386 27074.9 sem_timedwait + 0.0 533611 53 10068.1 1697.0 1012 435840 59614.8 fclose + 0.0 286950 2 143475.0 143475.0 126651 160299 23792.7 pthread_create + 0.0 138102 14 9864.4 1553.5 1053 110024 28873.8 read + 0.0 136459 13 10496.8 6042.0 1803 62598 15888.9 mmap + 0.0 97277 1 97277.0 97277.0 97277 97277 0.0 pthread_cond_wait + 0.0 67131 11 6102.8 6157.0 3404 8443 1890.5 write + 0.0 48672 5 9734.4 10347.0 6606 12032 2486.1 fflush + 0.0 29335 1 29335.0 29335.0 29335 29335 0.0 fgets + 0.0 25553 5 5110.6 4460.0 2276 8396 2474.8 open + 0.0 14788 5 2957.6 1638.0 1153 8174 2950.3 fwrite + 0.0 10659 3 3553.0 3854.0 1473 5332 1947.0 pipe2 + 0.0 10605 2 5302.5 5302.5 4458 6147 1194.3 socket + 0.0 10389 2 5194.5 5194.5 2003 8386 4513.5 pthread_cond_broadcast + 0.0 10091 3 3363.7 3285.0 3273 3533 146.8 munmap + 0.0 8491 1 8491.0 8491.0 8491 8491 0.0 connect + 0.0 4117 1 4117.0 4117.0 4117 4117 0.0 fread + 0.0 2696 2 1348.0 1348.0 1084 1612 373.4 fcntl + 0.0 2675 1 2675.0 2675.0 2675 2675 0.0 bind + +[5/8] Executing 'cuda_api_sum' stats report + + Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name + -------- --------------- --------- ---------- 
--------- -------- --------- ----------- --------------------------------- + 96.8 436058160 5 87211632.0 5193.0 4069 435177141 194518991.7 cudaMalloc + 1.6 7040692 3 2346897.3 2380570.0 2261570 2398552 74440.6 cudaEventSynchronize + 0.7 3312570 5 662514.0 378321.0 42763 2418461 994452.7 cudaMemcpy + 0.6 2602670 3 867556.7 31432.0 5296 2565942 1470902.9 cudaLaunchKernel + 0.2 984173 1 984173.0 984173.0 984173 984173 0.0 cudaGetDeviceProperties_v2_v12000 + 0.1 368056 5 73611.2 20222.0 5300 228095 96551.4 cudaFree + 0.0 30610 6 5101.7 4715.5 3057 8427 1872.8 cudaEventRecord + 0.0 16552 6 2758.7 624.5 372 13271 5160.2 cudaEventCreate + 0.0 5037 1 5037.0 5037.0 5037 5037 0.0 cudaDeviceSynchronize + 0.0 2815 6 469.2 472.0 245 780 193.1 cudaEventDestroy + 0.0 1281 1 1281.0 1281.0 1281 1281 0.0 cuModuleGetLoadingMode + +[6/8] Executing 'cuda_gpu_kern_sum' stats report + + Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name + -------- --------------- --------- -------- -------- -------- -------- ----------- ---------------------------------------------------------------------------------------------------- + 36.6 12928 1 12928.0 12928.0 12928 12928 0.0 void ::dequant_kernel_v3<__half>(const unsigned char *, const unsigned char *, const float… + 35.4 12512 1 12512.0 12512.0 12512 12512 0.0 void ::dequant_kernel_v4<__half>(const unsigned char *, const unsigned char *, const float… + 28.0 9888 1 9888.0 9888.0 9888 9888 0.0 void ::dequant_kernel<__half>(const unsigned char *, const unsigned char *, const float *,… + +[7/8] Executing 'cuda_gpu_mem_time_sum' stats report + + Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation + -------- --------------- ----- -------- -------- -------- -------- ----------- ---------------------------- + 68.9 85409 1 85409.0 85409.0 85409 85409 0.0 [CUDA memcpy Device-to-Host] + 31.1 38528 4 9632.0 3376.0 1984 29792 13468.8 [CUDA memcpy Host-to-Device] + +[8/8] Executing 
'cuda_gpu_mem_size_sum' stats report + + Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation + ---------- ----- -------- -------- -------- -------- ----------- ---------------------------- + 2.097 1 2.097 2.097 2.097 2.097 0.000 [CUDA memcpy Device-to-Host] + 0.542 4 0.135 0.009 0.000 0.524 0.259 [CUDA memcpy Host-to-Device] + +Generated: + /home/qtc_yu/nf4_project/profile_report.nsys-rep + /home/qtc_yu/nf4_project/profile_report.sqlite +~~~ + +## Step5-BandwidthCalc +- Time: 2026-03-11 17:27:52 +- Status: SUCCESS +- Command: python3 -c " +import struct, os, time + +# Theoretical A100 HBM2e bandwidth: ~1935 GB/s +# Our kernel reads: num_pairs bytes (packed) + num_blocks bytes (absmax_q) +# + num_groups*2 bytes (absmax2) + 256*2 bytes (code2) +# Our kernel writes: numel * 2 bytes (fp16 output) + +rows, cols, blocksize = 1024, 1024, 64 +numel = rows * cols +num_pairs = (numel + 1) // 2 +num_blocks = (numel + blocksize - 1) // blocksize +num_groups = (num_blocks + 255) // 256 + +bytes_read = num_pairs + num_blocks + num_groups * 2 + 256 * 2 +bytes_write = numel * 2 +total_bytes = bytes_read + bytes_write + +print(f'Data movement analysis (1024x1024, fp16 output):') +print(f' Read packed_weights : {num_pairs/1024:.0f} KB') +print(f' Read absmax_q : {num_blocks/1024:.0f} KB') +print(f' Read absmax2+code2 : {(num_groups*2+512)/1024:.2f} KB') +print(f' Write fp16 output : {bytes_write/1024/1024:.1f} MB') +print(f' Total data movement : {total_bytes/1024/1024:.2f} MB') +print(f' A100 peak bandwidth : 1935 GB/s') +print(f' Theoretical min time : {total_bytes/1935e9*1000:.3f} ms') +print(f' (if nsys shows >2x this, there is optimization headroom)') +" + +~~~text +Data movement analysis (1024x1024, fp16 output): + Read packed_weights : 512 KB + Read absmax_q : 16 KB + Read absmax2+code2 : 0.62 KB + Write fp16 output : 2.0 MB + Total data movement : 2.52 MB + A100 peak bandwidth : 1935 GB/s + Theoretical min time : 0.001 ms + (if nsys shows >2x this, 
there is optimization headroom) +~~~ + +## Step6-InstallBnB +- Time: 2026-03-11 17:27:54 +- Status: SUCCESS +- Command: pip install bitsandbytes || true + +~~~text + +[notice] A new release of pip is available: 24.3.1 -> 26.0.1 +[notice] To update, run: python -m pip install --upgrade pip +error: externally-managed-environment + +× This environment is externally managed +╰─> To install Python packages system-wide, try apt install + python3-xyz, where xyz is the package you are trying to + install. + + If you wish to install a non-Debian-packaged Python package, + create a virtual environment using python3 -m venv path/to/venv. + Then use path/to/venv/bin/python and path/to/venv/bin/pip. Make + sure you have python3-full installed. + + If you wish to install a non-Debian packaged Python application, + it may be easiest to use pipx install xyz, which will manage a + virtual environment for you. Make sure you have pipx installed. + + See /usr/share/doc/python3.12/README.venv for more information. + +note: If you believe this is a mistake, please contact your Python installation or OS distribution provider. You can override this, at the risk of breaking your Python installation or OS, by passing --break-system-packages. +hint: See PEP 668 for the detailed specification. +~~~ + +## Step6-BenchmarkBnB +- Time: 2026-03-11 17:27:56 +- Status: SUCCESS +- Command: python3 benchmark_vs_bnb.py + +~~~text +bitsandbytes not installed. 
Run: pip install bitsandbytes +~~~ + +## Step0-CheckEnv +- Time: 2026-03-11 18:01:42 +- Status: SUCCESS +- Command: echo NVCC=/usr/local/cuda/bin/nvcc && /usr/local/cuda/bin/nvcc --version && echo 'nvcc OK' && python3 --version + +~~~text +NVCC=/usr/local/cuda/bin/nvcc +nvcc: NVIDIA (R) Cuda compiler driver +Copyright (c) 2005-2025 NVIDIA Corporation +Built on Wed_Jan_15_19:20:09_PST_2025 +Cuda compilation tools, release 12.8, V12.8.61 +Build cuda_12.8.r12.8/compiler.35404655_0 +nvcc OK +Python 3.12.3 +~~~ + +## Step1-GenerateData +- Time: 2026-03-11 18:01:42 +- Status: SUCCESS +- Command: python3 generate_nf4_bin.py --rows 1024 --cols 1024 --blocksize 64 --output sample_nf4.bin + +~~~text +Generating data: 1024x1024 (numel=1048576) + blocksize=64 + num_pairs=524288 + num_blocks=16384 + num_groups=64 +Saved to sample_nf4.bin +~~~ + +## Step2-BuildCUDA +- Time: 2026-03-11 18:01:45 +- Status: SUCCESS +- Command: /usr/local/cuda/bin/nvcc -O3 -std=c++17 -arch=sm_80 -lineinfo -o ./nf4_dequant main.cpp dequant_kernel.cu + +~~~text + +~~~ + +## Step3-RunDequant-GPU +- Time: 2026-03-11 18:01:48 +- Status: SUCCESS +- Command: ./nf4_dequant sample_nf4.bin fp16 sample_out.bin + +~~~text +Using device 0: NVIDIA A100-SXM4-80GB (Compute Capability 8.0) +Loaded: 1024x1024 blocksize=64 offset=0.0429335 +GPU launch: numel=1048576 pairs=524288 blocks=16384 groups=64 +[v2] Kernel time : 2.44122 ms | Bandwidth : 1.0808 GB/s (0.0558552% of A100 peak 1935 GB/s) +[v3] Kernel time : 0.023552 ms | Bandwidth : 112.027 GB/s (5.78952% of A100 peak 1935 GB/s) +[v3 speedup vs v2]: 103.652x +[v4] Kernel time : 0.017408 ms | Bandwidth : 151.566 GB/s (7.83288% of A100 peak 1935 GB/s) +[v4 speedup vs v2]: 140.235x +[v4 speedup vs v3]: 1.35294x | occupancy block=128 min_grid=1296 +MAE (v4 GPU vs CPU ref): 2.25737e-05 ✓ PASS +rows=1024 cols=1024 blocksize=64 mae=2.25737e-05 +~~~ + +## Step4-Profile-nsys +- Time: 2026-03-11 18:02:00 +- Status: SUCCESS +- Command: nsys profile -o profile_report -f true 
--stats=true --cuda-memory-usage=true ./nf4_dequant sample_nf4.bin fp16 sample_out_profile.bin
+
+~~~text
+Using device 0: NVIDIA A100-SXM4-80GB (Compute Capability 8.0)
+Loaded: 1024x1024 blocksize=64 offset=0.0429335
+GPU launch: numel=1048576 pairs=524288 blocks=16384 groups=64
+[v2] Kernel time : 2.44509 ms | Bandwidth : 1.07909 GB/s (0.0557668% of A100 peak 1935 GB/s)
+[v3] Kernel time : 0.027808 ms | Bandwidth : 94.8815 GB/s (4.90344% of A100 peak 1935 GB/s)
+[v3 speedup vs v2]: 87.9275x
+[v4] Kernel time : 0.0208 ms | Bandwidth : 126.849 GB/s (6.55552% of A100 peak 1935 GB/s)
+[v4 speedup vs v2]: 117.552x
+[v4 speedup vs v3]: 1.33692x | occupancy block=128 min_grid=1296
+MAE (v4 GPU vs CPU ref): 2.25737e-05 ✓ PASS
+rows=1024 cols=1024 blocksize=64 mae=2.25737e-05
+Collecting data...
+Generating '/tmp/nsys-report-3e9a.qdstrm'
+ [1/8] profile_report.nsys-rep [100%]
+ [2/8] profile_report.sqlite [100%]
+SKIPPED: 
/home/qtc_yu/nf4_project/profile_report.sqlite does not contain NV Tools Extension (NVTX) data. +[3/8] Executing 'nvtx_sum' stats report +[4/8] Executing 'osrt_sum' stats report + + Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name + -------- --------------- --------- ----------- ---------- -------- ---------- ----------- ---------------------- + 51.0 2753430245 15 183562016.3 26656737.0 1631 2323704767 592888638.6 poll + 48.8 2631466158 1617 1627375.5 22843.0 1007 498039698 19268618.1 ioctl + 0.1 3693979 43 85906.5 13029.0 7274 2609045 394811.2 mmap64 + 0.0 1912594 1 1912594.0 1912594.0 1912594 1912594 0.0 writev + 0.0 704959 118 5974.2 4701.5 1527 19800 3408.4 open64 + 0.0 650469 134 4854.2 2232.0 1016 58307 7481.4 fopen + 0.0 590786 10 59078.6 60078.5 34958 91181 18134.5 sem_timedwait + 0.0 498560 33 15107.9 1839.0 1016 438830 76071.9 fclose + 0.0 309886 2 154943.0 154943.0 129716 180170 35676.4 pthread_create + 0.0 200881 13 15452.4 1431.0 1033 167912 45873.3 read + 0.0 165861 6 27643.5 13995.5 6489 86753 30575.8 fflush + 0.0 145824 13 11217.2 7556.0 1987 55220 13864.4 mmap + 0.0 87185 1 87185.0 87185.0 87185 87185 0.0 pthread_cond_wait + 0.0 69919 11 6356.3 6637.0 3504 8158 1675.0 write + 0.0 29613 1 29613.0 29613.0 29613 29613 0.0 fgets + 0.0 24208 5 4841.6 5185.0 3032 5932 1231.7 open + 0.0 18867 3 6289.0 4879.0 3101 10887 4080.0 munmap + 0.0 13821 3 4607.0 4305.0 2910 6606 1866.4 pipe2 + 0.0 12873 4 3218.3 1830.0 1056 8157 3333.9 fwrite + 0.0 10708 2 5354.0 5354.0 5340 5368 19.8 socket + 0.0 7847 1 7847.0 7847.0 7847 7847 0.0 connect + 0.0 6510 2 3255.0 3255.0 1904 4606 1910.6 pthread_cond_broadcast + 0.0 5520 4 1380.0 1235.0 1097 1953 387.6 fcntl + 0.0 4893 1 4893.0 4893.0 4893 4893 0.0 fread + 0.0 2160 1 2160.0 2160.0 2160 2160 0.0 bind + +[5/8] Executing 'cuda_api_sum' stats report + + Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name + -------- --------------- --------- 
---------- --------- -------- --------- ----------- --------------------------------- + 96.8 480994048 5 96198809.6 5851.0 4040 480806247 215002105.9 cudaMalloc + 1.4 7030660 3 2343553.3 2396114.0 2234617 2399929 94360.9 cudaEventSynchronize + 0.9 4717894 5 943578.8 344488.0 44010 2418249 1094250.2 cudaMemcpy + 0.5 2634583 3 878194.3 29781.0 4915 2599887 1491081.4 cudaLaunchKernel + 0.2 962535 1 962535.0 962535.0 962535 962535 0.0 cudaGetDeviceProperties_v2_v12000 + 0.1 351933 5 70386.6 22088.0 5190 232265 96551.6 cudaFree + 0.0 31362 6 5227.0 4506.0 2791 9480 2448.1 cudaEventRecord + 0.0 13758 6 2293.0 599.5 334 9501 3613.2 cudaEventCreate + 0.0 5317 1 5317.0 5317.0 5317 5317 0.0 cudaDeviceSynchronize + 0.0 2872 6 478.7 446.5 230 963 266.5 cudaEventDestroy + 0.0 1317 1 1317.0 1317.0 1317 1317 0.0 cuModuleGetLoadingMode + +[6/8] Executing 'cuda_gpu_kern_sum' stats report + + Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name + -------- --------------- --------- -------- -------- -------- -------- ----------- ---------------------------------------------------------------------------------------------------- + 36.3 12768 1 12768.0 12768.0 12768 12768 0.0 void ::dequant_kernel_v3<__half>(const unsigned char *, const unsigned char *, const float… + 35.5 12512 1 12512.0 12512.0 12512 12512 0.0 void ::dequant_kernel_v4<__half>(const unsigned char *, const unsigned char *, const float… + 28.2 9920 1 9920.0 9920.0 9920 9920 0.0 void ::dequant_kernel<__half>(const unsigned char *, const unsigned char *, const float *,… + +[7/8] Executing 'cuda_gpu_mem_time_sum' stats report + + Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation + -------- --------------- ----- -------- -------- -------- -------- ----------- ---------------------------- + 67.1 84929 1 84929.0 84929.0 84929 84929 0.0 [CUDA memcpy Device-to-Host] + 32.9 41568 4 10392.0 3376.0 2016 32800 14964.0 [CUDA memcpy Host-to-Device] + +[8/8] 
Executing 'cuda_gpu_mem_size_sum' stats report + + Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation + ---------- ----- -------- -------- -------- -------- ----------- ---------------------------- + 2.097 1 2.097 2.097 2.097 2.097 0.000 [CUDA memcpy Device-to-Host] + 0.542 4 0.135 0.009 0.000 0.524 0.259 [CUDA memcpy Host-to-Device] + +Generated: + /home/qtc_yu/nf4_project/profile_report.nsys-rep + /home/qtc_yu/nf4_project/profile_report.sqlite +~~~ + +## Step5-BandwidthCalc +- Time: 2026-03-11 18:02:00 +- Status: SUCCESS +- Command: python3 -c " +import struct, os, time + +# Theoretical A100 HBM2e bandwidth: ~1935 GB/s +# Our kernel reads: num_pairs bytes (packed) + num_blocks bytes (absmax_q) +# + num_groups*2 bytes (absmax2) + 256*2 bytes (code2) +# Our kernel writes: numel * 2 bytes (fp16 output) + +rows, cols, blocksize = 1024, 1024, 64 +numel = rows * cols +num_pairs = (numel + 1) // 2 +num_blocks = (numel + blocksize - 1) // blocksize +num_groups = (num_blocks + 255) // 256 + +bytes_read = num_pairs + num_blocks + num_groups * 2 + 256 * 2 +bytes_write = numel * 2 +total_bytes = bytes_read + bytes_write + +print(f'Data movement analysis (1024x1024, fp16 output):') +print(f' Read packed_weights : {num_pairs/1024:.0f} KB') +print(f' Read absmax_q : {num_blocks/1024:.0f} KB') +print(f' Read absmax2+code2 : {(num_groups*2+512)/1024:.2f} KB') +print(f' Write fp16 output : {bytes_write/1024/1024:.1f} MB') +print(f' Total data movement : {total_bytes/1024/1024:.2f} MB') +print(f' A100 peak bandwidth : 1935 GB/s') +print(f' Theoretical min time : {total_bytes/1935e9*1000:.3f} ms') +print(f' (if nsys shows >2x this, there is optimization headroom)') +" + +~~~text +Data movement analysis (1024x1024, fp16 output): + Read packed_weights : 512 KB + Read absmax_q : 16 KB + Read absmax2+code2 : 0.62 KB + Write fp16 output : 2.0 MB + Total data movement : 2.52 MB + A100 peak bandwidth : 1935 GB/s + Theoretical min time : 0.001 ms + (if nsys shows >2x 
this, there is optimization headroom) +~~~ + +## Step6-InstallBnB +- Time: 2026-03-11 18:02:09 +- Status: SUCCESS +- Command: pip install bitsandbytes --break-system-packages || pip install --user bitsandbytes || true + +~~~text +Defaulting to user installation because normal site-packages is not writeable +DEPRECATION: Loading egg at /usr/local/lib/python3.12/dist-packages/dill-0.3.9-py3.12.egg is deprecated. pip 25.1 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330 +DEPRECATION: Loading egg at /usr/local/lib/python3.12/dist-packages/opt_einsum-3.4.0-py3.12.egg is deprecated. pip 25.1 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330 +DEPRECATION: Loading egg at /usr/local/lib/python3.12/dist-packages/looseversion-1.3.0-py3.12.egg is deprecated. pip 25.1 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330 +DEPRECATION: Loading egg at /usr/local/lib/python3.12/dist-packages/lightning_utilities-0.12.0.dev0-py3.12.egg is deprecated. pip 25.1 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330 +DEPRECATION: Loading egg at /usr/local/lib/python3.12/dist-packages/lightning_thunder-0.2.0.dev0-py3.12.egg is deprecated. pip 25.1 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330 +DEPRECATION: Loading egg at /usr/local/lib/python3.12/dist-packages/nvfuser-0.2.23a0+6627725-py3.12-linux-x86_64.egg is deprecated. pip 25.1 will enforce this behaviour change. 
A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330 +Collecting bitsandbytes + Downloading bitsandbytes-0.49.2-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB) +Requirement already satisfied: torch<3,>=2.3 in /usr/local/lib/python3.12/dist-packages (from bitsandbytes) (2.6.0a0+ecf3bae40a.nv25.1) +Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.12/dist-packages (from bitsandbytes) (1.26.4) +Requirement already satisfied: packaging>=20.9 in /usr/local/lib/python3.12/dist-packages (from bitsandbytes) (23.2) +Requirement already satisfied: filelock in /usr/local/lib/python3.12/dist-packages (from torch<3,>=2.3->bitsandbytes) (3.16.1) +Requirement already satisfied: typing-extensions>=4.10.0 in /usr/local/lib/python3.12/dist-packages (from torch<3,>=2.3->bitsandbytes) (4.12.2) +Requirement already satisfied: networkx in /usr/local/lib/python3.12/dist-packages (from torch<3,>=2.3->bitsandbytes) (3.4.2) +Requirement already satisfied: jinja2 in /usr/local/lib/python3.12/dist-packages (from torch<3,>=2.3->bitsandbytes) (3.1.4) +Requirement already satisfied: fsspec in /usr/local/lib/python3.12/dist-packages (from torch<3,>=2.3->bitsandbytes) (2024.10.0) +Requirement already satisfied: setuptools in /usr/local/lib/python3.12/dist-packages (from torch<3,>=2.3->bitsandbytes) (70.3.0) +Requirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.12/dist-packages (from torch<3,>=2.3->bitsandbytes) (1.13.1) +Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.12/dist-packages (from sympy==1.13.1->torch<3,>=2.3->bitsandbytes) (1.3.0) +Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.12/dist-packages (from jinja2->torch<3,>=2.3->bitsandbytes) (3.0.2) +Downloading bitsandbytes-0.49.2-py3-none-manylinux_2_24_x86_64.whl (60.7 MB) + ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60.7/60.7 MB 13.5 MB/s eta 0:00:00 +Installing 
collected packages: bitsandbytes +Successfully installed bitsandbytes-0.49.2 + +[notice] A new release of pip is available: 24.3.1 -> 26.0.1 +[notice] To update, run: python -m pip install --upgrade pip +~~~ + +## Step6-BenchmarkBnB +- Time: 2026-03-11 18:02:21 +- Status: SUCCESS +- Command: python3 benchmark_vs_bnb.py + +~~~text +Benchmarking bitsandbytes on NVIDIA A100-SXM4-80GB... +Warmup... +Benchmarking... +bitsandbytes dequantize_4bit (8192x8192, nf4, blocksize=64): + Time: 0.341 ms + Bandwidth: 492.41 GB/s +~~~
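
As a cross-check on the final comparison above, the Step5 data-movement model can be applied to the 8192x8192 benchmark shape to see what effective bandwidth the measured 0.341 ms implies. The helper below is a hypothetical sketch (it is not part of `benchmark_vs_bnb.py`); it simply reuses the same byte accounting as the Step5 command:

```python
def nf4_data_movement(rows, cols, blocksize=64):
    """Bytes moved by one NF4 dequant pass (fp16 output), per the Step5 model."""
    numel = rows * cols
    num_pairs = (numel + 1) // 2                        # packed weights: 1 byte per 2 elems
    num_blocks = (numel + blocksize - 1) // blocksize   # absmax_q: 1 byte per block
    num_groups = (num_blocks + 255) // 256              # absmax2: 2 bytes per group
    bytes_read = num_pairs + num_blocks + num_groups * 2 + 256 * 2  # + code2 table
    bytes_write = numel * 2                             # fp16 output
    return bytes_read + bytes_write

total = nf4_data_movement(8192, 8192)   # 168829440 bytes (~161 MiB)
bw = total / 0.341e-3 / 1e9             # effective bandwidth at the measured 0.341 ms
print(f"{total} bytes -> {bw:.1f} GB/s")  # -> ~495 GB/s
```

The ~495 GB/s this model implies is close to the 492.41 GB/s the benchmark reports, which suggests the benchmark script uses essentially the same byte accounting; the small gap is consistent with timing precision.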