
[Misc] Add named top-level loops #440

Merged
hughperkins merged 20 commits into main from hp/named-loops
Apr 2, 2026

Conversation

@hughperkins
Collaborator

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

When a kernel has multiple top-level for loops, each becomes a separate
GPU kernel. This adds a `name` parameter to `qd.loop_config()` that
propagates through the compilation pipeline and appears in the compiled
kernel names, making them identifiable in profilers and traces.

Example: `qd.loop_config(name="compute_forces")` produces kernel names
like `my_kernel_c0_abc_0_compute_forces_range_for` instead of the
generic `my_kernel_c0_abc_0_range_for`.

The loop_name also appears in IR dumps (QD_DUMP_IR=1) as
`loop_name=...` on offloaded range_for/struct_for statements.
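The naming scheme can be sketched in Python; this is an illustrative model of the format only, not the actual codegen code (which uses fmt::format in the C++ backend):

```python
# Illustrative model of task-name assembly: insert the loop name between
# the codegen id and the task type only when a name was provided.
def task_kernel_name(kernel_name: str, task_codegen_id: int,
                     task_name: str, loop_name: str = "",
                     suffix: str = "") -> str:
    if loop_name:
        return f"{kernel_name}_{task_codegen_id}_{loop_name}_{task_name}{suffix}"
    return f"{kernel_name}_{task_codegen_id}_{task_name}{suffix}"

print(task_kernel_name("my_kernel_c0_abc", 0, "range_for", "compute_forces"))
# my_kernel_c0_abc_0_compute_forces_range_for
```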

Includes 6 unit tests covering correctness, name propagation in IR
dumps, no leakage between loops, mixed named/unnamed loops, and the
serialize combination.

Also fixed here: the clone() methods for RangeForStmt, StructForStmt,
and OffloadedStmt were not copying the loop_name field. Since the LLVM
codegen clones each offloaded task before compilation, the
user-provided loop name was lost.
parallelize=None,
block_dim_adaptive=True,
bit_vectorize=False,
name=None,

parallelize (int): The number of threads to use on CPU
block_dim_adaptive (bool): Whether to allow backends set block_dim adaptively, enabled by default
bit_vectorize (bool): Whether to enable bit vectorization of struct fors on quant_arrays.
name (str): Optional name for this loop, used in GPU kernel names for profiling and debugging.

if bit_vectorize:
_bit_vectorize()

if name is not None:

OK, so this is part of the qd.loop_config function that we place above for loops in Python scope, like:

qd.loop_config(name='foo',...)


auto task_kernel_name = fmt::format(
"{}_{}_{}{}", kernel_name, task_codegen_id, stmt->task_name(), suffix);
auto task_kernel_name =

note to self: loop name is coming from OffloadedStmt stmt

task_name_(fmt::format("{}_t{:02d}",
params.ti_kernel_name,
params.task_id_in_kernel)) {
task_name_(params.task_ir->loop_name.empty()

note to self: loop name coming from params.task_ir , but this is also an OffloadedStmt *, same as in codegen_llvm.cpp, so both get from OffloadedStmt.
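The conditional naming in that constructor can be modeled in a few lines of Python (a sketch; parameter names follow the C++ code, the logic is the ternary shown in the hunk above):

```python
# Sketch of the SPIR-V task-name construction: append the loop name only
# when it is non-empty, mirroring the ternary in the C++ constructor.
def spirv_task_name(ti_kernel_name: str, task_id_in_kernel: int,
                    loop_name: str = "") -> str:
    base = f"{ti_kernel_name}_t{task_id_in_kernel:02d}"
    return f"{base}_{loop_name}" if loop_name else base

print(spirv_task_name("kernel_0", 0, "myloop"))  # kernel_0_t00_myloop
```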

strictly_serialized(o.strictly_serialized),
mem_access_opt(o.mem_access_opt),
block_dim(o.block_dim) {
block_dim(o.block_dim),

note to self: FrontEndForStmt


void FrontendForStmt::init_config(Arch arch, const ForLoopConfig &config) {
is_bit_vectorized = config.is_bit_vectorized;
strictly_serialized = config.strictly_serialized;

note to self: FrontendForStmt initialized from ForLoopConfig

MemoryAccessOptions mem_access_opt;
int block_dim{0};
bool uniform{false};
std::string loop_name{""};

note to self: this is ForLoopConfig


ForLoopConfig struct definition

bool strictly_serialized;
MemoryAccessOptions mem_access_opt;
int block_dim;
std::string loop_name;

FrontendForStmt definition.


note to self: FrontendForStmt constructors take in a ForLoopConfig.

for_loop_dec_.config.block_dim = v;
}

void set_loop_name(const std::string &loop_name) {

note to self: this is in class ASTBuilder 🤔


note to self: for_loop_dec_ is a ForLoopDecoratorRecorder.


note to self: and it contains a config, which is of type ForLoopConfig


So in ASTBuilder, calling set_loop_name sets the loop name on the ForLoopConfig contained in the ForLoopDecoratorRecorder.
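A minimal Python model of that chain of ownership (class and field names mirror the C++ side; the reduced set of fields is an assumption for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class ForLoopConfig:                 # reduced to the fields discussed here
    block_dim: int = 0
    loop_name: str = ""

@dataclass
class ForLoopDecoratorRecorder:      # holds the config for the next for loop
    config: ForLoopConfig = field(default_factory=ForLoopConfig)

class ASTBuilder:
    def __init__(self) -> None:
        self.for_loop_dec_ = ForLoopDecoratorRecorder()

    def set_loop_name(self, loop_name: str) -> None:
        # Writes through to the ForLoopConfig inside the recorder.
        self.for_loop_dec_.config.loop_name = loop_name

builder = ASTBuilder()
builder.set_loop_name("compute_forces")
assert builder.for_loop_dec_.config.loop_name == "compute_forces"
```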

int block_dim,
bool strictly_serialized,
std::string range_hint)
std::string range_hint,

note to self: RangeForStmt

Replace tempfile.TemporaryDirectory with the tmp_path fixture in the
two offline IR dump tests.

Made-with: Cursor
Per-test qd.reset() resets the default compile config, so saving and
restoring debug_dump_path is unnecessary; keep restoring QD_DUMP_IR only.

Made-with: Cursor
Use monkeypatch.setenv so the env var is restored automatically instead
of a manual finally block.

Made-with: Cursor
Add a third range-for with loop_config(block_dim=64) but no name; assert
exactly two loop_name= annotations and each expected name appears once.

Made-with: Cursor
Remove test_loop_config_name_correctness; assert fill_a/fill_b ndarray
values in test_loop_config_name_ir_dump alongside c.

Made-with: Cursor
Covered by test_loop_config_name_ir_dump (named + unnamed-for IR and values).

Made-with: Cursor
Leak behavior is covered by IR dump tests (count loop_name=) and ndarray checks.

Made-with: Cursor
Single- vs two-name leak cases are already covered by IR counts in
test_loop_config_name_ir_dump.

Made-with: Cursor
Serialize path range-fors are printed without OffloadedStmt loop_name=;
include the stored loop name in the RangeForStmt line so after_offload
dumps stay searchable.

Made-with: Cursor
- Write PTX beside IR dumps under compile_config.debug_dump_path (with
  QD_DUMP_IR set) instead of hardcoded /tmp/ptx; match QUADRANTS_LOAD_PTX path.
- Add QD_DUMP_SPIRV=1 to emit *_before_opt.spirv / *_after_opt.spirv text
  without requiring QD_DUMP_IR.

Made-with: Cursor
- test_loop_config_name_cuda_ptx_dump: QD_DUMP_IR, asserts .version/.target in
  .ptx under debug_dump_path (CUDA arch only; skips when unsupported).
- test_loop_config_name_spirv_dump: QD_DUMP_SPIRV on metal/vulkan (vulkan
  excluded on Darwin), checks OpCapability in *_after_opt.spirv.

Made-with: Cursor
spirv_files = list(tmp_path.glob("*_after_opt.spirv"))
assert len(spirv_files) > 0, f"No *_after_opt.spirv under {tmp_path}"
combined = "\n".join(f.read_text() for f in spirv_files)
assert "OpCapability" in combined, combined

Irrelevant assert, and missing loop-name assert.

Offloaded task TaskAttributes::name already encodes loop_name (e.g.
kernel_0_t00_myloop). Use it for *_before_opt.spirv / *_after_opt.spirv
filenames so dumps match PTX-style discoverability. SPIR-V disassembly
does not contain that string; assert on the dumped filename.

Loop_config test adds name spirv_fill_loop and drops OpCapability check.

Made-with: Cursor
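The filename-based assertion described above can be sketched like so (the dumped file here is fabricated so the snippet is self-contained; the real test globs the actual dump directory produced under QD_DUMP_IR):

```python
import tempfile
from pathlib import Path

tmp_path = Path(tempfile.mkdtemp())
# Simulate a SPIR-V dump whose basename carries the loop name via
# TaskAttributes::name (e.g. kernel_0_t00_spirv_fill_loop).
(tmp_path / "kernel_0_t00_spirv_fill_loop_after_opt.spirv").write_text("...")

# Assert on the filename, since the loop name does not appear inside
# the SPIR-V disassembly itself.
named = list(tmp_path.glob("*_spirv_fill_loop_after_opt.spirv"))
assert len(named) == 1, f"No named SPIR-V dump under {tmp_path}"
```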
ptx_files = list(tmp_path.glob("*.ptx"))
assert len(ptx_files) > 0, f"No .ptx files under debug_dump_path {tmp_path}"
combined = "\n".join(f.read_text() for f in ptx_files)
assert ".version" in combined, combined

Irrelevant assert, and missing loop-name assert.

auto *root = params_.ir_root->as<Block>();

const char *dump_ir_env = std::getenv(DUMP_IR_ENV.data());
const char *dump_spirv_env = std::getenv(DUMP_SPIRV_ENV.data());

Don't think we want this.


TaskCodegen cgen(tp);
auto task_res = cgen.run();
const std::string &spirv_dump_basename = task_res.task_attribs.name;

@hughperkins hughperkins marked this pull request as ready for review April 1, 2026 23:38
@hughperkins
Collaborator Author

I have read and reviewed every line added in this PR. I take responsibility for the lines added and removed in this PR, and won't blame any issues on Opus.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c72d21a51b


Comment on lines +2024 to +2025
: fmt::format("{}_{}_{}_{}{}", kernel_name, task_codegen_id,
stmt->loop_name, stmt->task_name(), suffix);

P1: Sanitize loop_name before embedding it in task symbols

Building LLVM task function names directly from stmt->loop_name allows characters that are later rewritten by CUDA JIT name sanitization (jit_cuda.cpp convert()), but the launcher still looks up kernels using the original unsanitized OffloadedTask::name (kernel_launcher.cpp). If a user sets qd.loop_config(name=...) with characters like -, space, or /, PTX symbols are renamed while launch-time lookup uses the old name, causing kernel lookup failures at runtime for CUDA.
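One way to avoid the mismatch is to sanitize the name once, before it enters the pipeline. A hedged sketch (the character mapping here is a simplification; the real convert() in jit_cuda.cpp uses multi-character replacements such as _xx_ for /):

```python
# Simplified sanitizer: map anything outside [A-Za-z0-9_] to '_', so the
# symbol used at codegen time matches the one used at launch-time lookup.
def sanitize_loop_name(name: str) -> str:
    return "".join(ch if ch.isalnum() or ch == "_" else "_" for ch in name)

print(sanitize_loop_name("compute forces/v2"))  # compute_forces_v2
```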


Comment on lines 134 to 140
offloaded->body->insert(std::move(s->body->statements[j]));
}
offloaded->range_hint = s->range_hint;
offloaded->loop_name = s->loop_name;
root_block->insert(std::move(offloaded));
} else if (auto st = stmt->cast<StructForStmt>()) {
assemble_serial_statements();

🔴 When loop_config(serialize=True, name="foo") is used, the user-supplied name is silently dropped from compiled LLVM/PTX function names. The loop_name is correctly stored on the inner RangeForStmt (visible in IR dumps), but is never copied onto the outer serial OffloadedStmt that codegen uses to derive function names. The existing test test_loop_config_name_serialize_ir_dump only checks the *after_offload* IR text — where the inner RangeForStmt does print loop_name=serial_sum — so the gap in compiled kernel naming is undetected.

Extended reasoning...

What the bug is

When a user writes qd.loop_config(serialize=True, name="serial_sum") followed by a for loop, they expect the compiled GPU/CPU task function to include serial_sum in its name for profiling/debugging. The name is silently dropped: the resulting compiled function is named something like kernel_0_serial, not kernel_0_serial_sum_serial.

The code path that triggers it

In offload.cpp, the top-level offloading loop (starting line 93) branches on:

if (auto s = stmt->cast<RangeForStmt>(); s && !s->strictly_serialized) {
    // ... creates range_for OffloadedStmt and copies s->loop_name (line 137)
} else {
    pending_serial_statements->body->insert(std::move(stmt));  // line 165
}

When strictly_serialized=true, the condition s && !s->strictly_serialized is false. The RangeForStmt (which carries loop_name) is inserted into pending_serial_statements->body. The serial OffloadedStmt (pending_serial_statements) never has its own loop_name field set — it remains an empty string.

Why existing code does not prevent it

The non-serialized range_for path at line 137 does offloaded->loop_name = s->loop_name; correctly. The emit_struct_for path at line 261 likewise does offloaded_struct_for->loop_name = for_stmt->loop_name;. The serialized path is the only case missing this assignment.

Impact

LLVM codegen (codegen_llvm.cpp lines 2020–2025) and SPIR-V codegen (spirv_codegen.cpp line 67) both read stmt->loop_name from the OffloadedStmt to construct the task function name. Because that field stays empty for the serial case, any call to loop_config(serialize=True, name=...) produces a compiled function whose name does not contain the user-supplied string. Profiling tools that match by kernel name, and any user relying on named serial kernels, get silently wrong results.

Why the test does not catch it

The test test_loop_config_name_serialize_ir_dump searches for "loop_name=serial_sum" in files matching *after_offload*. After offloading, the IR structure is:

serial OffloadedStmt {           // loop_name = ""  (never set)
  body {
    RangeForStmt loop_name=serial_sum {  // printed here by ir_printer.cpp
      ...
    }
  }
}

The IR printer for RangeForStmt does print loop_name=serial_sum, so the text appears in the dump and the assertion passes. But the actual compiled function name is derived from the outer OffloadedStmt::loop_name, which is empty.

The batching ambiguity concern (refutation addressed)

A refutation argues this is intentional because multiple serialized loops may be batched into one serial OffloadedStmt, making it ambiguous which inner loop name to propagate. This is a valid architectural concern. However, the argument does not change the observable failure: the feature documentation states name is "used in GPU kernel names for profiling and debugging", and there is no caveat that it is silently ignored for serialized loops. More importantly, the PR adds a test (test_loop_config_name_serialize_ir_dump) whose name and assertions give the impression the name-propagation feature works for serialized loops — it does not, and the test gap allows this to go undetected. At minimum, the test should either (a) explicitly assert the compiled function name does NOT contain the user-supplied name (documenting the intentional behavior), or (b) be accompanied by a note in the docstring that name has no effect on compiled function names when serialize=True.

Step-by-step proof

  1. User calls qd.loop_config(serialize=True, name="serial_sum")
  2. set_loop_name("serial_sum") stores the string in for_loop_dec_.config.loop_name
  3. begin_frontend_range_for creates FrontendForStmt with loop_name="serial_sum" and strictly_serialized=true
  4. lower_ast.cpp creates RangeForStmt with loop_name="serial_sum" and strictly_serialized=true
  5. In offload.cpp line 96: s && !s->strictly_serialized evaluates as true && !true → false, so the else branch executes
  6. Line 165: the RangeForStmt is moved into pending_serial_statements->body; pending_serial_statements->loop_name is never set
  7. codegen_llvm.cpp line 2020: stmt->loop_name.empty() is true for the serial OffloadedStmt → name format is "{kernel}_{id}_serial", omitting "serial_sum" entirely
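The drop can be demonstrated with a minimal Python model of the offload branching; class and field names mirror the IR statements named above, everything else is illustrative:

```python
class RangeFor:
    def __init__(self, loop_name: str = "", strictly_serialized: bool = False):
        self.loop_name = loop_name
        self.strictly_serialized = strictly_serialized

class Offloaded:
    def __init__(self, task_type: str):
        self.task_type = task_type
        self.loop_name = ""        # stays empty unless explicitly copied
        self.body = []

def offload(stmts):
    tasks, pending = [], None
    for s in stmts:
        if isinstance(s, RangeFor) and not s.strictly_serialized:
            task = Offloaded("range_for")
            task.loop_name = s.loop_name    # copied on the parallel path
            tasks.append(task)
        else:
            if pending is None:
                pending = Offloaded("serial")
            pending.body.append(s)          # loop_name is never copied here
    if pending is not None:
        tasks.append(pending)
    return tasks

tasks = offload([RangeFor(loop_name="serial_sum", strictly_serialized=True)])
assert tasks[0].task_type == "serial"
assert tasks[0].loop_name == ""             # user-supplied name is dropped
```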

Comment on lines 425 to 439
}

void visit(RangeForStmt *for_stmt) override {
print("{} : {}for in range({}, {}) {}{}{{", for_stmt->name(),
const std::string loop_name_suffix =
for_stmt->loop_name.empty()
? std::string()
: fmt::format(" loop_name={}", for_stmt->loop_name);
print("{} : {}for in range({}, {}) {}{}{}{{", for_stmt->name(),
for_stmt->reversed ? "reversed " : "", for_stmt->begin->name(),
for_stmt->end->name(),
for_stmt->is_bit_vectorized ? "(bit_vectorized) " : "",
block_dim_info(for_stmt->block_dim));
block_dim_info(for_stmt->block_dim), loop_name_suffix);
for_stmt->body->accept(this);
print("}}");
dbg_info_printer_(for_stmt);

🟡 The visit(StructForStmt*) printer in ir_printer.cpp was not updated to print loop_name, even though this PR adds the field to StructForStmt and updates both visit(RangeForStmt*) and visit(OffloadedStmt*) to print it. Pre-offload IR dumps for struct-for loops with a user-defined name will silently omit the name, creating an inconsistency across loop types. To fix, apply the same loop_name_suffix pattern used in visit(RangeForStmt*) to visit(StructForStmt*).

Extended reasoning...

What the bug is

This PR adds a loop_name field to StructForStmt (statements.h:1063), propagates it through lower_ast.cpp (line 234) and offload.cpp (line 261), and updates visit(RangeForStmt*) in ir_printer.cpp (lines 427-439) and visit(OffloadedStmt*) (lines 690-706) to print loop_name. However, visit(StructForStmt*) at lines 442-451 was NOT updated and still prints only the snode name, bit-vectorized flag, scratch-pad info, and block_dim—never loop_name.

The specific code path

When a user writes qd.loop_config(name="my_struct_loop") before a struct-for loop, the name flows: ForLoopConfig.loop_name → FrontendForStmt.loop_name → StructForStmt.loop_name (via lower_ast) → OffloadedStmt.loop_name (via offload). The pre-offload IR dump calls visit(StructForStmt*), which omits the name entirely.

Why existing code does not prevent it

visit(StructForStmt*) was not touched in this PR's diff. The field exists on the C++ object but the printer simply never reads for_stmt->loop_name. There is no compile-time check enforcing consistency between what fields a statement carries and what the printer outputs.

Impact

This is a debug/tooling regression only. The actual functionality—loop name appearing in GPU kernel symbols—is unaffected because that reads from OffloadedStmt, which was correctly updated. However, anyone using QD_DUMP_IR to inspect the pre-offload IR for a struct-for loop with a name will see no loop_name= annotation, making it impossible to confirm the name was captured at that IR stage. This is inconsistent with RangeForStmt, which does show the name.

Step-by-step proof

  1. User calls qd.loop_config(name="fill_snode") before a struct-for loop.
  2. lower_ast creates a StructForStmt and sets new_for->loop_name = "fill_snode".
  3. Pre-offload IR dump calls IRPrinter::visit(StructForStmt*). Current output: $0 : struct for in dense block_dim=adaptive none{. No loop_name=fill_snode.
  4. After offload, OffloadedStmt correctly prints loop_name=fill_snode because that path was updated.
  5. By contrast, a RangeForStmt with the same name would print loop_name=fill_snode even at the pre-offload stage.

How to fix

Apply the same pattern as visit(RangeForStmt*): compute a loop_name_suffix string and append it to the format string in visit(StructForStmt*).
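The suggested fix is the same suffix helper already used for range-fors; as a Python sketch (the printed format mirrors the IR printer's output, details assumed):

```python
# Build the optional " loop_name=..." suffix and append it to the printed
# struct-for line, matching what visit(RangeForStmt*) now does.
def loop_name_suffix(loop_name: str) -> str:
    return f" loop_name={loop_name}" if loop_name else ""

def print_struct_for(name: str, snode: str, loop_name: str = "") -> str:
    return f"{name} : struct for in {snode}{loop_name_suffix(loop_name)} {{"

print(print_struct_for("$0", "dense", "fill_snode"))
# $0 : struct for in dense loop_name=fill_snode {
```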

Comment on lines 64 to 77
task_ir_(params.task_ir),
compiled_structs_(params.compiled_structs),
ctx_attribs_(params.ctx_attribs),
task_name_(fmt::format("{}_t{:02d}",
params.ti_kernel_name,
params.task_id_in_kernel)) {
task_name_(params.task_ir->loop_name.empty()
? fmt::format("{}_t{:02d}",
params.ti_kernel_name,
params.task_id_in_kernel)
: fmt::format("{}_t{:02d}_{}",
params.ti_kernel_name,
params.task_id_in_kernel,
params.task_ir->loop_name)) {
allow_undefined_visitor = true;
invoke_default_visitor = true;


🟡 The loop_name string is embedded raw into dump file paths without sanitization in both the SPIR-V (spirv_codegen.cpp:64-77) and LLVM/CUDA (jit_cuda.cpp:94-128) backends. If a user passes name="foo/bar" to loop_config(), the SPIR-V dump silently fails to write because std::filesystem::path::operator/ interprets the embedded slash as a subdirectory separator, and in the CUDA path the _before_ptx.ll dump uses an unsanitized name while the .ptx dump uses the post-convert() sanitized name, producing inconsistent filenames for the same kernel. Both issues only manifest with QD_DUMP_IR=1 and a loop name containing path-separator characters; fix by sanitizing loop_name before embedding it in any dump path (e.g. replace / with _).

Extended reasoning...

Root cause

This PR introduces loop_name as a user-provided string that is embedded directly into two different dump-file naming paths. Neither path validates the string before using it as a filesystem path component.

SPIR-V path (spirv_codegen.cpp:64-77)

task_name_ is constructed as "{kernel}_t{id:02d}_{loop_name}" when loop_name is non-empty. This string becomes spirv_dump_basename (line 2507 of the modified file), which is then used in:

std::filesystem::path filename =
    ir_dump_dir / (spirv_dump_basename + "_before_opt.spirv");

std::filesystem::path::operator/ does not treat the right-hand operand as a plain filename — it parses it for path separators. So if loop_name = "foo/bar", spirv_dump_basename becomes "kernel_0_t01_foo/bar" and the resulting path is ir_dump_dir/kernel_0_t01_foo/bar_before_opt.spirv. Only ir_dump_dir itself is created (line 2482); the subdirectory foo is never created, so the std::ofstream open silently fails. The if (std::ofstream out_file(filename); out_file) guard suppresses any error message, meaning the developer gets no dump and no diagnostic.

LLVM/CUDA path (jit_cuda.cpp:94-128)

The PR also embeds loop_name into LLVM function names (e.g. kernel_0_my-loop_range_for) in codegen_llvm.cpp. In add_module() in jit_cuda.cpp, moduleToDumpName(M.get()) is called at line ~98 before compile_module_to_ptx(M) runs. Inside compile_module_to_ptx(), convert() sanitizes all function names in-place (lines 241-244): it replaces / with _xx_, @ with _at_, < with _lb_, etc. The .ptx dump at line ~122 calls moduleToDumpName(M.get()) after that mutation, so:

  • _before_ptx.ll filename uses the original (unsanitized) function name
  • .ptx filename uses the sanitized function name

For any loop_name containing special characters, the two files are named differently for the same kernel. If loop_name contains /, the .ll path will also silently traverse into a subdirectory that does not exist (same failure mode as SPIR-V).

Proof-of-concept

  1. User writes qd.loop_config(name="net/layer") before a kernel loop.
  2. loop_name = "net/layer" propagates through FrontendForStmt → RangeForStmt → OffloadedStmt.
  3. In SPIR-V codegen, task_name_ = "my_kernel_t00_net/layer". Dump basename = "my_kernel_t00_net/layer". Path = <dump_dir>/my_kernel_t00_net/layer_before_opt.spirv. Directory <dump_dir>/my_kernel_t00_net/ does not exist → open fails silently.
  4. In LLVM codegen, LLVM function name = "my_kernel_0_net/layer_range_for". moduleToDumpName returns "my_kernel_0_net/layer_range_for". .ll dump path = <dump_dir>/my_kernel_0_net/layer_range_for_before_ptx.ll → fails silently. After compile_module_to_ptx, convert() renames to "my_kernel_0_net_xx_layer_range_for". .ptx dump path = <dump_dir>/my_kernel_0_net_xx_layer_range_for.ptx → succeeds but mismatches.

Why existing code does not prevent this

The Python-side loop_config() (misc.py:712-713) passes the name through without any validation. ForLoopDecoratorRecorder::reset() merely clears the string; there is no sanitization anywhere in the pipeline. The LLVM backend has convert() for its symbols, but this runs on the LLVM IR symbol for PTX correctness, not as a filename sanitizer, and it runs too late for the .ll dump. The SPIR-V backend has no equivalent sanitization at all.

Impact

Both issues are nit-level: they only affect the debug dump feature (QD_DUMP_IR=1) and only when the user provides a loop name containing path-separator characters. Failures are silent (no crash, no error message). Normal kernel execution is completely unaffected.

Fix

Sanitize loop_name before embedding it in any dump-related file name. A simple approach: when constructing spirv_dump_basename and dumpName (LLVM), replace / (and any other filesystem-unsafe characters) with _. Alternatively, sanitize loop_name once when it is first set (e.g. in ASTBuilder::set_loop_name) so all downstream consumers benefit.
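The path-separator hazard is easy to reproduce with Python's pathlib, whose / operator splits on embedded separators much like std::filesystem::path::operator/ does here:

```python
from pathlib import PurePosixPath

dump_dir = PurePosixPath("/tmp/ir_dumps")
basename = "my_kernel_t00_net/layer"   # loop name contains '/'
filename = dump_dir / (basename + "_before_opt.spirv")

# The '/' inside the basename acts as a directory separator, so the dump
# targets a subdirectory that was never created and the write fails.
assert filename.parent == PurePosixPath("/tmp/ir_dumps/my_kernel_t00_net")
assert filename.name == "layer_before_opt.spirv"
```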

Comment on lines 246 to 252
begin, end, body->clone(), is_bit_vectorized, num_cpu_threads, block_dim,
strictly_serialized);
new_stmt->reversed = reversed;
new_stmt->loop_name = loop_name;
return new_stmt;
}


🟣 This PR adds new_stmt->loop_name = loop_name to RangeForStmt::clone() but omits the analogous new_stmt->range_hint = range_hint, so cloned statements silently lose their range_hint. Since offload.cpp copies range_hint from RangeForStmt to OffloadedStmt (line 136), any clone that happens before offloading causes the field to arrive empty. This is a pre-existing bug, but the PR directly edited this exact clone() method and is the ideal place to add the missing copy.

Extended reasoning...

What the bug is

RangeForStmt::clone() (statements.cpp lines 244-251) constructs a new statement with only 7 positional arguments, leaving range_hint at its default empty-string value:

auto new_stmt = std::make_unique<RangeForStmt>(
    begin, end, body->clone(), is_bit_vectorized, num_cpu_threads, block_dim,
    strictly_serialized);
new_stmt->reversed = reversed;
new_stmt->loop_name = loop_name;  // added by this PR
// range_hint is never assigned!
return new_stmt;

The constructor signature is RangeForStmt(..., std::string range_hint = "", std::string loop_name = ""). The PR correctly passes and stores loop_name but leaves range_hint defaulting to "".

The specific code path that triggers it

range_hint is set by lower_ast.cpp when lowering an external-tensor for loop to a RangeForStmt, with a hint like "arg (0)" (lower_ast.cpp ~line 273). The offload pass at offload.cpp:136 then copies range_hint from the RangeForStmt to the OffloadedStmt. Any clone of the RangeForStmt that occurs between lowering and offloading — notably auto_diff.cpp:1447 which calls clone() to create reversed gradient loops — will produce a statement with an empty range_hint.

Why existing code does not prevent it

The RangeForStmt constructor takes range_hint by value with a default of "". The clone() override bypasses this parameter to avoid re-supplying all arguments, but then neglects to manually copy range_hint the way it copies reversed and loop_name. There is no assertion or validation that would catch a missing range_hint on a cloned node.

Impact

Impact is limited to cases where range_hint is non-empty on a RangeForStmt that gets cloned before offloading. The confirmed case is autodiff: auto_diff.cpp:1447 clones a RangeForStmt to build the reversed gradient loop, and the resulting clone carries an empty range_hint. When that clone is offloaded, offload.cpp:136 reads s->range_hint and propagates the empty string, so the resulting OffloadedStmt loses its range label. Profiling output and IR dumps for differentiated external-tensor loops will be missing the hint, and make_cpu_multithreaded_range_for.cpp will also lose the label for CPU thread tasks.

Step-by-step proof

  1. User writes @qd.kernel def k(a: qd.types.ndarray(...)): for i in a: ... and applies autodiff.
  2. lower_ast.cpp lowers the FrontendForStmt to RangeForStmt with range_hint = "arg (0)".
  3. auto_diff.cpp:1447 calls range_for_stmt->clone() to build the reversed gradient loop.
  4. Inside clone(), the new RangeForStmt is constructed without range_hint, so new_stmt->range_hint == "".
  5. offload.cpp:136 does offloaded->range_hint = s->range_hint — reading the empty string, so the OffloadedStmt has no range hint.
  6. Any profiling tool, IR printer, or downstream pass using range_hint for the gradient task gets an empty string instead of "arg (0)".

How to fix it

Add one line immediately after new_stmt->loop_name = loop_name;:

new_stmt->range_hint = range_hint;

This is exactly the pattern already used for reversed and loop_name, and matches how StructForStmt::clone() and OffloadedStmt::clone() copy their analogous fields.
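The forgotten-field pattern can be modeled in a few lines (a sketch; the real classes carry many more fields and the clone signature differs):

```python
class RangeForStmt:
    def __init__(self, begin: int, end: int,
                 range_hint: str = "", loop_name: str = ""):
        self.begin, self.end = begin, end
        self.range_hint = range_hint
        self.loop_name = loop_name

    def clone(self) -> "RangeForStmt":
        new_stmt = RangeForStmt(self.begin, self.end)
        new_stmt.loop_name = self.loop_name
        # Bug modeled here: range_hint is never copied, so it silently
        # resets to "".  Fix: new_stmt.range_hint = self.range_hint
        return new_stmt

orig = RangeForStmt(0, 10, range_hint="arg (0)", loop_name="fwd")
cloned = orig.clone()
assert cloned.loop_name == "fwd"
assert cloned.range_hint == ""   # the hint was lost in the clone
```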

@hughperkins hughperkins enabled auto-merge (squash) April 2, 2026 01:05
@hughperkins hughperkins merged commit 8b8f12c into main Apr 2, 2026
47 checks passed
@hughperkins hughperkins deleted the hp/named-loops branch April 2, 2026 02:00