Conversation
When a kernel has multiple top-level for loops, each becomes a separate GPU kernel. This adds a `name` parameter to `qd.loop_config()` that propagates through the compilation pipeline and appears in the compiled kernel names, making them identifiable in profilers and traces. Example: `qd.loop_config(name="compute_forces")` produces kernel names like `my_kernel_c0_abc_0_compute_forces_range_for` instead of the generic `my_kernel_c0_abc_0_range_for`. The loop_name also appears in IR dumps (QD_DUMP_IR=1) as `loop_name=...` on offloaded range_for/struct_for statements. Includes 6 unit tests covering correctness, name propagation in IR dumps, no-leak between loops, mixed named/unnamed, and serialize combo.
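The naming scheme described above can be modeled with a small helper. This is an illustrative Python sketch mirroring the fmt::format patterns quoted later in this thread; the helper name `task_kernel_name` and its signature are assumptions, not project code.

```python
def task_kernel_name(kernel_name: str, task_codegen_id: int,
                     task_name: str, loop_name: str = "",
                     suffix: str = "") -> str:
    """Model of the kernel-naming logic: the optional user loop name is
    inserted between the codegen id and the task type (e.g. range_for)."""
    if loop_name:
        return f"{kernel_name}_{task_codegen_id}_{loop_name}_{task_name}{suffix}"
    return f"{kernel_name}_{task_codegen_id}_{task_name}{suffix}"

# Named loop: name appears in the compiled kernel symbol.
print(task_kernel_name("my_kernel_c0_abc", 0, "range_for", "compute_forces"))
# my_kernel_c0_abc_0_compute_forces_range_for

# Unnamed loop: generic name, as before the PR.
print(task_kernel_name("my_kernel_c0_abc", 0, "range_for"))
# my_kernel_c0_abc_0_range_for
```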
The clone() methods for RangeForStmt, StructForStmt, and OffloadedStmt were not copying the loop_name field. Since the LLVM codegen clones each offloaded task before compilation, the user-provided loop name was lost.
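The bug pattern — a hand-written clone() silently dropping a newly added field — can be sketched in Python. This is a hypothetical stand-in for the C++ statement classes; only the field names `loop_name` and `reversed` come from the PR, everything else is made up for illustration.

```python
from dataclasses import dataclass, replace

@dataclass
class RangeForStmtModel:
    # Hypothetical Python model of a few RangeForStmt fields.
    begin: int
    end: int
    reversed: bool = False
    loop_name: str = ""  # the field the original clone() forgot to copy

    def clone(self) -> "RangeForStmtModel":
        # dataclasses.replace copies every field, so a newly added field
        # cannot be silently lost the way a field-by-field clone() can.
        return replace(self)

s = RangeForStmtModel(0, 128, loop_name="compute_forces")
print(s.clone().loop_name)  # compute_forces
```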
    parallelize=None,
    block_dim_adaptive=True,
    bit_vectorize=False,
    name=None,
    parallelize (int): The number of threads to use on CPU
    block_dim_adaptive (bool): Whether to allow backends set block_dim adaptively, enabled by default
    bit_vectorize (bool): Whether to enable bit vectorization of struct fors on quant_arrays.
    name (str): Optional name for this loop, used in GPU kernel names for profiling and debugging.
    if bit_vectorize:
        _bit_vectorize()

    if name is not None:
OK, so this is part of the qd.loop_config function that we put over for loops in Python scope, like qd.loop_config(name='foo', ...).
    auto task_kernel_name = fmt::format(
        "{}_{}_{}{}", kernel_name, task_codegen_id, stmt->task_name(), suffix);
    auto task_kernel_name =
note to self: loop name is coming from OffloadedStmt stmt
    task_name_(fmt::format("{}_t{:02d}",
                           params.ti_kernel_name,
                           params.task_id_in_kernel)) {
    task_name_(params.task_ir->loop_name.empty()
note to self: loop name coming from params.task_ir, but this is also an OffloadedStmt *, same as in codegen_llvm.cpp, so both get it from the OffloadedStmt.
    strictly_serialized(o.strictly_serialized),
    mem_access_opt(o.mem_access_opt),
    block_dim(o.block_dim) {
    block_dim(o.block_dim),
note to self: FrontendForStmt
    void FrontendForStmt::init_config(Arch arch, const ForLoopConfig &config) {
      is_bit_vectorized = config.is_bit_vectorized;
      strictly_serialized = config.strictly_serialized;
note to self: FrontendForStmt initialized from ForLoopConfig
    MemoryAccessOptions mem_access_opt;
    int block_dim{0};
    bool uniform{false};
    std::string loop_name{""};
note to self: this is the ForLoopConfig struct definition
    bool strictly_serialized;
    MemoryAccessOptions mem_access_opt;
    int block_dim;
    std::string loop_name;
note to self: FrontendForStmt definition; its constructors take in a ForLoopConfig.
    for_loop_dec_.config.block_dim = v;
    }

    void set_loop_name(const std::string &loop_name) {
note to self: this is in class ASTBuilder 🤔
note to self: for_loop_dec_ is a ForLoopDecoratorRecorder, and it contains a config of type ForLoopConfig.
So in ASTBuilder, calling set_loop_name sets the loop name on the ForLoopConfig contained in the ForLoopDecoratorRecorder.
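The chain traced above (ASTBuilder → ForLoopDecoratorRecorder → ForLoopConfig) can be modeled in Python. The class and field names mirror the C++ ones; the behavior shown (reset clearing the recorder between loops) is an illustrative sketch, not the project's actual code.

```python
from dataclasses import dataclass, field

@dataclass
class ForLoopConfigModel:
    block_dim: int = 0
    loop_name: str = ""

@dataclass
class ForLoopDecoratorRecorderModel:
    config: ForLoopConfigModel = field(default_factory=ForLoopConfigModel)

    def reset(self):
        # Clearing the recorder between loops is what prevents a name
        # from "leaking" onto the next, unnamed loop.
        self.config = ForLoopConfigModel()

class ASTBuilderModel:
    def __init__(self):
        self.for_loop_dec_ = ForLoopDecoratorRecorderModel()

    def set_loop_name(self, loop_name: str):
        # Mirrors ASTBuilder::set_loop_name writing into the recorder's config.
        self.for_loop_dec_.config.loop_name = loop_name

b = ASTBuilderModel()
b.set_loop_name("compute_forces")
print(b.for_loop_dec_.config.loop_name)  # compute_forces
b.for_loop_dec_.reset()
print(repr(b.for_loop_dec_.config.loop_name))  # ''
```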
    int block_dim,
    bool strictly_serialized,
    std::string range_hint)
    std::string range_hint,
note to self: RangeForStmt
Replace tempfile.TemporaryDirectory with the tmp_path fixture in the two offline IR dump tests. Made-with: Cursor
Per-test qd.reset() resets the default compile config, so saving and restoring debug_dump_path is unnecessary; keep restoring QD_DUMP_IR only. Made-with: Cursor
Use monkeypatch.setenv so the env var is restored automatically instead of a manual finally block. Made-with: Cursor
Add a third range-for with loop_config(block_dim=64) but no name; assert exactly two loop_name= annotations and each expected name appears once. Made-with: Cursor
Remove test_loop_config_name_correctness; assert fill_a/fill_b ndarray values in test_loop_config_name_ir_dump alongside c. Made-with: Cursor
Covered by test_loop_config_name_ir_dump (named + unnamed-for IR and values). Made-with: Cursor
Leak behavior is covered by IR dump tests (count loop_name=) and ndarray checks. Made-with: Cursor
Single- vs two-name leak cases are already covered by IR counts in test_loop_config_name_ir_dump. Made-with: Cursor
Serialize path range-fors are printed without OffloadedStmt loop_name=; include the stored loop name in the RangeForStmt line so after_offload dumps stay searchable. Made-with: Cursor
- Write PTX beside IR dumps under compile_config.debug_dump_path (with QD_DUMP_IR set) instead of hardcoded /tmp/ptx; match QUADRANTS_LOAD_PTX path.
- Add QD_DUMP_SPIRV=1 to emit *_before_opt.spirv / *_after_opt.spirv text without requiring QD_DUMP_IR.

Made-with: Cursor
- test_loop_config_name_cuda_ptx_dump: QD_DUMP_IR, asserts .version/.target in .ptx under debug_dump_path (CUDA arch only; skips when unsupported).
- test_loop_config_name_spirv_dump: QD_DUMP_SPIRV on metal/vulkan (vulkan excluded on Darwin), checks OpCapability in *_after_opt.spirv.

Made-with: Cursor
    spirv_files = list(tmp_path.glob("*_after_opt.spirv"))
    assert len(spirv_files) > 0, f"No *_after_opt.spirv under {tmp_path}"
    combined = "\n".join(f.read_text() for f in spirv_files)
    assert "OpCapability" in combined, combined
Irrelevant assert, and missing loop-name assert.
Offloaded task TaskAttributes::name already encodes loop_name (e.g. kernel_0_t00_myloop). Use it for *_before_opt.spirv / *_after_opt.spirv filenames so dumps match PTX-style discoverability. SPIR-V disassembly does not contain that string; assert on the dumped filename. Loop_config test adds name spirv_fill_loop and drops OpCapability check. Made-with: Cursor
    ptx_files = list(tmp_path.glob("*.ptx"))
    assert len(ptx_files) > 0, f"No .ptx files under debug_dump_path {tmp_path}"
    combined = "\n".join(f.read_text() for f in ptx_files)
    assert ".version" in combined, combined
Irrelevant assert, and missing loop-name assert.
    auto *root = params_.ir_root->as<Block>();

    const char *dump_ir_env = std::getenv(DUMP_IR_ENV.data());
    const char *dump_spirv_env = std::getenv(DUMP_SPIRV_ENV.data());
Don't think we want this.
    TaskCodegen cgen(tp);
    auto task_res = cgen.run();
    const std::string &spirv_dump_basename = task_res.task_attribs.name;
I have read every line added in this PR, and reviewed the lines. I take responsibility for the lines added and removed in this PR, and won't blame any issues on Opus.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c72d21a51b
    : fmt::format("{}_{}_{}_{}{}", kernel_name, task_codegen_id,
                  stmt->loop_name, stmt->task_name(), suffix);
Sanitize loop_name before embedding it in task symbols
Building LLVM task function names directly from stmt->loop_name allows characters that are later rewritten by CUDA JIT name sanitization (jit_cuda.cpp convert()), but the launcher still looks up kernels using the original unsanitized OffloadedTask::name (kernel_launcher.cpp). If a user sets qd.loop_config(name=...) with characters like -, space, or /, PTX symbols are renamed while launch-time lookup uses the old name, causing kernel lookup failures at runtime for CUDA.
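A sanitizer of the kind this finding asks for might look like the following Python sketch. The replacement table for '/', '@', and '<' is taken from the description of jit_cuda.cpp's convert() later in this thread; the handling of '-' and space is an assumption, and the function name is hypothetical.

```python
def sanitize_loop_name(name: str) -> str:
    """Rewrite characters that CUDA JIT symbol sanitization would change,
    so the PTX symbol and the launch-time lookup agree on one spelling."""
    table = {
        "/": "_xx_",  # replacements for '/', '@', '<' per the convert() description
        "@": "_at_",
        "<": "_lb_",
        "-": "_",     # assumed: collapse other problematic characters to '_'
        " ": "_",
    }
    return "".join(table.get(ch, ch) for ch in name)

print(sanitize_loop_name("net/layer"))  # net_xx_layer
print(sanitize_loop_name("my-loop"))    # my_loop
```

Sanitizing once, where the name is first set, would keep every downstream consumer (symbol names, dump filenames, lookup keys) consistent.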
      offloaded->body->insert(std::move(s->body->statements[j]));
    }
    offloaded->range_hint = s->range_hint;
    offloaded->loop_name = s->loop_name;
    root_block->insert(std::move(offloaded));
    } else if (auto st = stmt->cast<StructForStmt>()) {
      assemble_serial_statements();
🔴 When loop_config(serialize=True, name="foo") is used, the user-supplied name is silently dropped from compiled LLVM/PTX function names. The loop_name is correctly stored on the inner RangeForStmt (visible in IR dumps), but is never copied onto the outer serial OffloadedStmt that codegen uses to derive function names. The existing test test_loop_config_name_serialize_ir_dump only checks the *after_offload* IR text — where the inner RangeForStmt does print loop_name=serial_sum — so the gap in compiled kernel naming is undetected.
Extended reasoning...
What the bug is
When a user writes qd.loop_config(serialize=True, name="serial_sum") followed by a for loop, they expect the compiled GPU/CPU task function to include serial_sum in its name for profiling/debugging. The name is silently dropped: the resulting compiled function is named something like kernel_0_serial, not kernel_0_serial_sum_serial.
The code path that triggers it
In offload.cpp, the top-level offloading loop (starting line 93) branches on:
    if (auto s = stmt->cast<RangeForStmt>(); s && !s->strictly_serialized) {
      // ... creates range_for OffloadedStmt and copies s->loop_name (line 137)
    } else {
      pending_serial_statements->body->insert(std::move(stmt));  // line 165
    }

When strictly_serialized=true, the condition s && !s->strictly_serialized is false. The RangeForStmt (which carries loop_name) is inserted into pending_serial_statements->body. The serial OffloadedStmt (pending_serial_statements) never has its own loop_name field set; it remains an empty string.
Why existing code does not prevent it
The non-serialized range_for path at line 137 does offloaded->loop_name = s->loop_name; correctly. The emit_struct_for path at line 261 likewise does offloaded_struct_for->loop_name = for_stmt->loop_name;. The serialized path is the only case missing this assignment.
Impact
LLVM codegen (codegen_llvm.cpp lines 2020–2025) and SPIR-V codegen (spirv_codegen.cpp line 67) both read stmt->loop_name from the OffloadedStmt to construct the task function name. Because that field stays empty for the serial case, any call to loop_config(serialize=True, name=...) produces a compiled function whose name does not contain the user-supplied string. Profiling tools that match by kernel name, and any user relying on named serial kernels, get silently wrong results.
Why the test does not catch it
The test test_loop_config_name_serialize_ir_dump searches for "loop_name=serial_sum" in files matching *after_offload*. After offloading, the IR structure is:
    serial OffloadedStmt {                      // loop_name = "" (never set)
      body {
        RangeForStmt loop_name=serial_sum {     // printed here by ir_printer.cpp
          ...
        }
      }
    }
The IR printer for RangeForStmt does print loop_name=serial_sum, so the text appears in the dump and the assertion passes. But the actual compiled function name is derived from the outer OffloadedStmt::loop_name, which is empty.
The batching ambiguity concern (refutation addressed)
A refutation argues this is intentional because multiple serialized loops may be batched into one serial OffloadedStmt, making it ambiguous which inner loop name to propagate. This is a valid architectural concern. However, the argument does not change the observable failure: the feature documentation states name is "used in GPU kernel names for profiling and debugging", and there is no caveat that it is silently ignored for serialized loops. More importantly, the PR adds a test (test_loop_config_name_serialize_ir_dump) whose name and assertions give the impression the name-propagation feature works for serialized loops — it does not, and the test gap allows this to go undetected. At minimum, the test should either (a) explicitly assert the compiled function name does NOT contain the user-supplied name (documenting the intentional behavior), or (b) be accompanied by a note in the docstring that name has no effect on compiled function names when serialize=True.
Step-by-step proof
- User calls qd.loop_config(serialize=True, name="serial_sum").
- set_loop_name("serial_sum") stores the string in for_loop_dec_.config.loop_name.
- begin_frontend_range_for creates FrontendForStmt with loop_name="serial_sum" and strictly_serialized=true.
- lower_ast.cpp creates RangeForStmt with loop_name="serial_sum" and strictly_serialized=true.
- In offload.cpp line 96: s && !s->strictly_serialized → true && !true → false, so the else branch executes.
- Line 165: the RangeForStmt is moved into pending_serial_statements->body; pending_serial_statements->loop_name is never set.
- codegen_llvm.cpp line 2020: stmt->loop_name.empty() is true for the serial OffloadedStmt, so the name format is "{kernel}_{id}_serial", omitting "serial_sum" entirely.
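The missing assignment can be simulated with a toy Python model of the offload branch. This is a sketch, not the project's code; the "copy the name only when exactly one named loop is batched" rule is one possible resolution of the batching ambiguity discussed above, labeled here as an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class Loop:
    strictly_serialized: bool = False
    loop_name: str = ""

@dataclass
class Offloaded:
    task_type: str = "serial"
    loop_name: str = ""
    body: list = field(default_factory=list)

def offload_serialized(loops):
    """Toy model: batch serialized loops into one serial Offloaded task.
    Assumption: copy loop_name onto the serial task only when exactly one
    named loop is batched, since multiple loops make the name ambiguous."""
    serial = Offloaded()
    serial.body.extend(loops)
    names = [l.loop_name for l in serial.body if l.loop_name]
    if len(names) == 1:
        serial.loop_name = names[0]  # the assignment the PR omitted
    return serial

t = offload_serialized([Loop(strictly_serialized=True, loop_name="serial_sum")])
print(t.loop_name)  # serial_sum
```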
    }

    void visit(RangeForStmt *for_stmt) override {
      print("{} : {}for in range({}, {}) {}{}{{", for_stmt->name(),
      const std::string loop_name_suffix =
          for_stmt->loop_name.empty()
              ? std::string()
              : fmt::format(" loop_name={}", for_stmt->loop_name);
      print("{} : {}for in range({}, {}) {}{}{}{{", for_stmt->name(),
            for_stmt->reversed ? "reversed " : "", for_stmt->begin->name(),
            for_stmt->end->name(),
            for_stmt->is_bit_vectorized ? "(bit_vectorized) " : "",
            block_dim_info(for_stmt->block_dim));
            block_dim_info(for_stmt->block_dim), loop_name_suffix);
      for_stmt->body->accept(this);
      print("}}");
      dbg_info_printer_(for_stmt);
🟡 The visit(StructForStmt*) printer in ir_printer.cpp was not updated to print loop_name, even though this PR adds the field to StructForStmt and updates both visit(RangeForStmt*) and visit(OffloadedStmt*) to print it. Pre-offload IR dumps for struct-for loops with a user-defined name will silently omit the name, creating an inconsistency across loop types. To fix, apply the same loop_name_suffix pattern used in visit(RangeForStmt*) to visit(StructForStmt*).
Extended reasoning...
What the bug is
This PR adds a loop_name field to StructForStmt (statements.h:1063), propagates it through lower_ast.cpp (line 234) and offload.cpp (line 261), and updates visit(RangeForStmt*) in ir_printer.cpp (lines 427-439) and visit(OffloadedStmt*) (lines 690-706) to print loop_name. However, visit(StructForStmt*) at lines 442-451 was NOT updated and still prints only the snode name, bit-vectorized flag, scratch-pad info, and block_dim—never loop_name.
The specific code path
When a user writes qd.loop_config(name="my_struct_loop") before a struct-for loop, the name flows: ForLoopConfig.loop_name → FrontendForStmt.loop_name → StructForStmt.loop_name (via lower_ast) → OffloadedStmt.loop_name (via offload). The pre-offload IR dump calls visit(StructForStmt*), which omits the name entirely.
Why existing code does not prevent it
visit(StructForStmt*) was not touched in this PR's diff. The field exists on the C++ object but the printer simply never reads for_stmt->loop_name. There is no compile-time check enforcing consistency between what fields a statement carries and what the printer outputs.
Impact
This is a debug/tooling regression only. The actual functionality—loop name appearing in GPU kernel symbols—is unaffected because that reads from OffloadedStmt, which was correctly updated. However, anyone using QD_DUMP_IR to inspect the pre-offload IR for a struct-for loop with a name will see no loop_name= annotation, making it impossible to confirm the name was captured at that IR stage. This is inconsistent with RangeForStmt, which does show the name.
Step-by-step proof
- User calls qd.loop_config(name="fill_snode") before a struct-for loop.
- lower_ast creates a StructForStmt and sets new_for->loop_name = "fill_snode".
- Pre-offload IR dump calls IRPrinter::visit(StructForStmt*). Current output: $0 : struct for in dense block_dim=adaptive none{. No loop_name=fill_snode.
- After offload, OffloadedStmt correctly prints loop_name=fill_snode because that path was updated.
- By contrast, a RangeForStmt with the same name would print loop_name=fill_snode even at the pre-offload stage.
How to fix
Apply the same pattern as visit(RangeForStmt*): compute a loop_name_suffix string and append it to the format string in visit(StructForStmt*).
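The suffix pattern the fix calls for is tiny. A Python equivalent of the loop_name_suffix logic from the RangeForStmt printer, shown as a standalone sketch:

```python
def loop_name_suffix(loop_name: str) -> str:
    # Empty name yields an empty suffix, so unnamed loops print exactly as before.
    return f" loop_name={loop_name}" if loop_name else ""

# Named struct-for line gains the annotation; unnamed line is unchanged.
print(f"$0 : struct for in dense block_dim=adaptive{loop_name_suffix('fill_snode')}")
print(f"$0 : struct for in dense block_dim=adaptive{loop_name_suffix('')}")
```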
    task_ir_(params.task_ir),
    compiled_structs_(params.compiled_structs),
    ctx_attribs_(params.ctx_attribs),
    task_name_(fmt::format("{}_t{:02d}",
                           params.ti_kernel_name,
                           params.task_id_in_kernel)) {
    task_name_(params.task_ir->loop_name.empty()
                   ? fmt::format("{}_t{:02d}",
                                 params.ti_kernel_name,
                                 params.task_id_in_kernel)
                   : fmt::format("{}_t{:02d}_{}",
                                 params.ti_kernel_name,
                                 params.task_id_in_kernel,
                                 params.task_ir->loop_name)) {
    allow_undefined_visitor = true;
    invoke_default_visitor = true;
🟡 The loop_name string is embedded raw into dump file paths without sanitization in both the SPIR-V (spirv_codegen.cpp:64-77) and LLVM/CUDA (jit_cuda.cpp:94-128) backends. If a user passes name="foo/bar" to loop_config(), the SPIR-V dump silently fails to write because std::filesystem::path::operator/ interprets the embedded slash as a subdirectory separator, and in the CUDA path the _before_ptx.ll dump uses an unsanitized name while the .ptx dump uses the post-convert() sanitized name, producing inconsistent filenames for the same kernel. Both issues only manifest with QD_DUMP_IR=1 and a loop name containing path-separator characters; fix by sanitizing loop_name before embedding it in any dump path (e.g. replace / with _).
Extended reasoning...
Root cause
This PR introduces loop_name as a user-provided string that is embedded directly into two different dump-file naming paths. Neither path validates the string before using it as a filesystem path component.
SPIR-V path (spirv_codegen.cpp:64-77)
task_name_ is constructed as "{kernel}_t{id:02d}_{loop_name}" when loop_name is non-empty. This string becomes spirv_dump_basename (line 2507 of the modified file), which is then used in:
    std::filesystem::path filename =
        ir_dump_dir / (spirv_dump_basename + "_before_opt.spirv");

std::filesystem::path::operator/ does not treat the right-hand operand as a plain filename; it parses it for path separators. So if loop_name = "foo/bar", spirv_dump_basename becomes "kernel_0_t01_foo/bar" and the resulting path is ir_dump_dir/kernel_0_t01_foo/bar_before_opt.spirv. Only ir_dump_dir itself is created (line 2482); the subdirectory foo is never created, so the std::ofstream open silently fails. The if (std::ofstream out_file(filename); out_file) guard suppresses any error message, meaning the developer gets no dump and no diagnostic.
LLVM/CUDA path (jit_cuda.cpp:94-128)
The PR also embeds loop_name into LLVM function names (e.g. kernel_0_my-loop_range_for) in codegen_llvm.cpp. In add_module() in jit_cuda.cpp, moduleToDumpName(M.get()) is called at line ~98 before compile_module_to_ptx(M) runs. Inside compile_module_to_ptx(), convert() sanitizes all function names in-place (lines 241-244): it replaces / with _xx_, @ with _at_, < with _lb_, etc. The .ptx dump at line ~122 calls moduleToDumpName(M.get()) after that mutation, so:
- the _before_ptx.ll filename uses the original (unsanitized) function name
- the .ptx filename uses the sanitized function name
For any loop_name containing special characters, the two files are named differently for the same kernel. If loop_name contains /, the .ll path will also silently traverse into a subdirectory that does not exist (same failure mode as SPIR-V).
Proof-of-concept
- User writes qd.loop_config(name="net/layer") before a kernel loop.
- loop_name = "net/layer" propagates through FrontendForStmt → RangeForStmt → OffloadedStmt.
- In SPIR-V codegen, task_name_ = "my_kernel_t00_net/layer". Dump basename = "my_kernel_t00_net/layer". Path = <dump_dir>/my_kernel_t00_net/layer_before_opt.spirv. Directory <dump_dir>/my_kernel_t00_net/ does not exist → open fails silently.
- In LLVM codegen, LLVM function name = "my_kernel_0_net/layer_range_for". moduleToDumpName returns "my_kernel_0_net/layer_range_for". .ll dump path = <dump_dir>/my_kernel_0_net/layer_range_for_before_ptx.ll → fails silently. After compile_module_to_ptx, convert() renames to "my_kernel_0_net_xx_layer_range_for". .ptx dump path = <dump_dir>/my_kernel_0_net_xx_layer_range_for.ptx → succeeds but mismatches.
Why existing code does not prevent this
The Python-side loop_config() (misc.py:712-713) passes the name through without any validation. ForLoopDecoratorRecorder::reset() merely clears the string; there is no sanitization anywhere in the pipeline. The LLVM backend has convert() for its symbols, but this runs on the LLVM IR symbol for PTX correctness, not as a filename sanitizer, and it runs too late for the .ll dump. The SPIR-V backend has no equivalent sanitization at all.
Impact
Both issues are nit-level: they only affect the debug dump feature (QD_DUMP_IR=1) and only when the user provides a loop name containing path-separator characters. Failures are silent (no crash, no error message). Normal kernel execution is completely unaffected.
Fix
Sanitize loop_name before embedding it in any dump-related file name. A simple approach: when constructing spirv_dump_basename and dumpName (LLVM), replace / (and any other filesystem-unsafe characters) with _. Alternatively, sanitize loop_name once when it is first set (e.g. in ASTBuilder::set_loop_name) so all downstream consumers benefit.
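The silent-failure mode is easy to reproduce with plain pathlib, and the fix is a filename-level sanitize. In this sketch the helper name and the choice of '_' as the replacement character are assumptions; the path-splitting behavior itself is standard pathlib semantics.

```python
from pathlib import Path

def dump_path(dump_dir: str, basename: str, suffix: str) -> Path:
    # Keep the basename a single path component: an embedded '/' would
    # otherwise be parsed as a directory separator and escape dump_dir.
    safe = basename.replace("/", "_")
    return Path(dump_dir) / (safe + suffix)

# Unsanitized: the '/' inside the basename creates a subdirectory component.
raw = Path("/tmp/dumps") / ("kernel_t01_foo/bar" + "_before_opt.spirv")
print(raw)  # /tmp/dumps/kernel_t01_foo/bar_before_opt.spirv

# Sanitized: one file directly under the dump directory.
print(dump_path("/tmp/dumps", "kernel_t01_foo/bar", "_before_opt.spirv"))
```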
    begin, end, body->clone(), is_bit_vectorized, num_cpu_threads, block_dim,
    strictly_serialized);
    new_stmt->reversed = reversed;
    new_stmt->loop_name = loop_name;
    return new_stmt;
    }
🟣 This PR adds new_stmt->loop_name = loop_name to RangeForStmt::clone() but omits the analogous new_stmt->range_hint = range_hint, so cloned statements silently lose their range_hint. Since offload.cpp copies range_hint from RangeForStmt to OffloadedStmt (line 136), any clone that happens before offloading causes the field to arrive empty. This is a pre-existing bug, but the PR directly edited this exact clone() method and is the ideal place to add the missing copy.
Extended reasoning...
What the bug is
RangeForStmt::clone() (statements.cpp lines 244-251) constructs a new statement with only 7 positional arguments, leaving range_hint at its default empty-string value:
    auto new_stmt = std::make_unique<RangeForStmt>(
        begin, end, body->clone(), is_bit_vectorized, num_cpu_threads, block_dim,
        strictly_serialized);
    new_stmt->reversed = reversed;
    new_stmt->loop_name = loop_name;  // added by this PR
    // range_hint is never assigned!
    return new_stmt;

The constructor signature is RangeForStmt(..., std::string range_hint = "", std::string loop_name = ""). The PR correctly passes and stores loop_name but leaves range_hint defaulting to "".
The specific code path that triggers it
range_hint is set by lower_ast.cpp when lowering an external-tensor for loop to a RangeForStmt, with a hint like "arg (0)" (lower_ast.cpp ~line 273). The offload pass at offload.cpp:136 then copies range_hint from the RangeForStmt to the OffloadedStmt. Any clone of the RangeForStmt that occurs between lowering and offloading — notably auto_diff.cpp:1447 which calls clone() to create reversed gradient loops — will produce a statement with an empty range_hint.
Why existing code does not prevent it
The RangeForStmt constructor takes range_hint by value with a default of "". The clone() override bypasses this parameter to avoid re-supplying all arguments, but then neglects to manually copy range_hint the way it copies reversed and loop_name. There is no assertion or validation that would catch a missing range_hint on a cloned node.
Impact
Impact is limited to cases where range_hint is non-empty on a RangeForStmt that gets cloned before offloading. The confirmed case is autodiff: auto_diff.cpp:1447 clones a RangeForStmt to build the reversed gradient loop, and the resulting clone carries an empty range_hint. When that clone is offloaded, offload.cpp:136 reads s->range_hint and propagates the empty string, so the resulting OffloadedStmt loses its range label. Profiling output and IR dumps for differentiated external-tensor loops will be missing the hint, and make_cpu_multithreaded_range_for.cpp will also lose the label for CPU thread tasks.
Step-by-step proof
- User writes @qd.kernel def k(a: qd.types.ndarray(...)): for i in a: ... and applies autodiff.
- lower_ast.cpp lowers the FrontendForStmt to RangeForStmt with range_hint = "arg (0)".
- auto_diff.cpp:1447 calls range_for_stmt->clone() to build the reversed gradient loop.
- Inside clone(), the new RangeForStmt is constructed without range_hint, so new_stmt->range_hint == "".
- offload.cpp:136 does offloaded->range_hint = s->range_hint, reading the empty string, so the OffloadedStmt has no range hint.
- Any profiling tool, IR printer, or downstream pass using range_hint for the gradient task gets an empty string instead of "arg (0)".
How to fix it
Add one line immediately after new_stmt->loop_name = loop_name;:
    new_stmt->range_hint = range_hint;

This is exactly the pattern already used for reversed and loop_name, and matches how StructForStmt::clone() and OffloadedStmt::clone() copy their analogous fields.
Issue: #