[MISC] Speedup tiled hessian kernel by using direct lower-triangle indexing.#2618

Merged
duburcqa merged 1 commit into Genesis-Embodied-AI:main from hughperkins:hp/hess-half1
Mar 30, 2026

@hughperkins
Collaborator

For diagonal tiles, iterate over exactly n_lower_tri elements using linear_to_lower_tri, instead of iterating over all n_dofs^2 elements and skipping the upper triangle. For n_dofs=62 this roughly halves the outer loop iterations (1953 vs 3844) and eliminates warp divergence from the i_d1 >= i_d2 branch.
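
As an illustration of the indexing described above, here is a minimal Python sketch. The PR names linear_to_lower_tri, but this conversation does not show its body, so the closed-form mapping and the row-major ordering used below are assumptions, not the kernel's actual code.

```python
import math


def linear_to_lower_tri(k: int) -> tuple[int, int]:
    """Hypothetical mapping: linear index k -> (row, col) with row >= col.

    Assumes row-major order over the lower triangle:
    (0,0), (1,0), (1,1), (2,0), ...
    """
    # Largest row r such that r * (r + 1) / 2 <= k, computed with an exact integer sqrt.
    row = (math.isqrt(8 * k + 1) - 1) // 2
    col = k - row * (row + 1) // 2
    return row, col


n_dofs = 62
n_lower_tri = n_dofs * (n_dofs + 1) // 2  # entries with row >= col
n_full = n_dofs * n_dofs                  # entries in the full square tile

# Old style: visit every (i_d1, i_d2) pair and skip the upper triangle with a branch.
branchy = [
    (pid // n_dofs, pid % n_dofs)
    for pid in range(n_full)
    if pid // n_dofs >= pid % n_dofs
]

# New style: visit only the lower-triangle entries; no branch is needed.
direct = [linear_to_lower_tri(k) for k in range(n_lower_tri)]

print(n_lower_tri, n_full, direct == branchy)
```

Both loops touch the same (row, col) pairs; the direct version simply never generates the skipped ones. A GPU kernel would more likely use a floating-point square root, which may need a rounding correction that this integer-only sketch avoids.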

Description

Related Issue

Resolves Genesis-Embodied-AI/Genesis#

Motivation and Context

How Has This Been / Can This Be Tested?

Screenshots (if appropriate):

Checklist:

  • I read the CONTRIBUTING document.
  • I followed the Submitting Code Changes section of CONTRIBUTING document.
  • I tagged the title correctly (including BUG FIX/FEATURE/MISC/BREAKING)
  • I updated the documentation accordingly or no change is needed.
  • I tested my changes and added instructions on how to test it for reviewers.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@hughperkins
Collaborator Author

Here are the results for this change:

results (16)

Comment on lines -1526 to -1533
pid = tid
numel = n_dofs_tile_row * n_dofs_tile_col
while pid < numel:
    i_d1_ = pid // n_dofs_tile_col
    i_d2_ = pid % n_dofs_tile_col
    i_d1 = i_d1_ + i_d1_start
    i_d2 = i_d2_ + i_d2_start
    if i_d1 >= i_d2:
Collaborator

For the story, doing this was necessary because of a bug on Apple Metal GPU which I think has been fixed since then.

Collaborator

Could you run all the unit tests and all benchmarks on Apple Metal just to be sure?

Collaborator Author

like:

pytest -sv tests/test_rigid_physics.py --backend gpu

?

Collaborator Author

Tests passed:

Screenshot 2026-03-28 at 21 59 02

Collaborator Author

trying now

pytest -sv -m benchmarks tests/test_rigid_benchmarks.py --backend gpu

Collaborator Author

Tests failing only on main, or only on the branch:

  Only on main (1):
  • test_kinematic_contact_probe_box_support[0] — f64 not supported

  Only on branch (2):
  • test_lidar_cache_offset_parallel_env — f64 not supported
  • test_elastomer_displacement_sensor_box_sphere[2] — f64 not supported
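
One way a main-versus-branch comparison like the one above might be produced is by diffing the FAILED lines of the two pytest runs. The sketch below is illustrative only: the helper name failed_tests and the test ids in the inline log excerpts are hypothetical placeholders, not taken from the actual runs.

```python
def failed_tests(log_text: str) -> set[str]:
    """Collect test ids from pytest summary lines of the form 'FAILED <id> - <message>'."""
    ids = set()
    for line in log_text.splitlines():
        if line.startswith("FAILED "):
            # Keep only the test id; drop the error message after the first ' - '.
            ids.add(line[len("FAILED "):].split(" - ", 1)[0].strip())
    return ids


# Hypothetical log excerpts standing in for the real pytest output of each run.
main_log = """\
FAILED tests/test_a.py::test_shared[0] - RuntimeError: Type f64 not supported.
FAILED tests/test_a.py::test_main_only - RuntimeError: Type f64 not supported.
"""
branch_log = """\
FAILED tests/test_a.py::test_shared[0] - RuntimeError: Type f64 not supported.
FAILED tests/test_b.py::test_branch_only[2] - RuntimeError: Type f64 not supported.
"""

only_on_main = failed_tests(main_log) - failed_tests(branch_log)
only_on_branch = failed_tests(branch_log) - failed_tests(main_log)

print("Only on main:", sorted(only_on_main))
print("Only on branch:", sorted(only_on_branch))
```

Failures that appear in both runs drop out of the set differences, which leaves only the tests whose status changed between main and the branch.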

Collaborator Author

The failure messages for those two above:

FAILED tests/test_sensors.py::test_lidar_cache_offset_parallel_env - RuntimeError: [spirv_ir_builder.cpp:get_primitive_type@307] Type f64 not supported.
FAILED tests/test_sensors.py::test_elastomer_displacement_sensor_box_sphere[2] - RuntimeError: [spirv_ir_builder.cpp:get_primitive_type@307] Type f64 not supported.

Collaborator Author

(The full list of failures for the branch:

FAILED tests/test_grad.py::test_differentiable_rigid[gpu] - genesis.GenesisException: Nan grad in qpos or dofs_vel found at step 95
FAILED tests/test_recorders.py::test_plotter - genesis.GenesisException: PyAV is not installed. Please install it with `pip install av`.
FAILED tests/test_recorders.py::test_file_writers - AssertionError: assert '1' in ('False', '0')
FAILED tests/test_recorders.py::test_video_writer - genesis.GenesisException: PyAV is not installed. Please install it with `pip install av`.
FAILED tests/test_sensors.py::test_raycaster_hits[2] - RuntimeError: [spirv_ir_builder.cpp:get_primitive_type@307] Type f64 not supported.
FAILED tests/test_sensors.py::test_elastomer_displacement_sensor_sphere_ground[2] - RuntimeError: [spirv_ir_builder.cpp:get_primitive_type@307] Type f64 not supported.
FAILED tests/test_sensors.py::test_raycaster_hits[0] - RuntimeError: [spirv_ir_builder.cpp:get_primitive_type@307] Type f64 not supported.
FAILED tests/test_utils.py::test_geom_numpy_vs_torch_consistency[()] - UserWarning: The operator 'aten::linalg_svd' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/mps/MPSFallback.mm:15.)
FAILED tests/test_sensors.py::test_elastomer_displacement_sensor_sphere_ground[0] - RuntimeError: [spirv_ir_builder.cpp:get_primitive_type@307] Type f64 not supported.
FAILED tests/test_utils.py::test_geom_numpy_vs_torch_consistency[(10, 40, 25)] - UserWarning: An output with one or more elements was resized since it had shape [10, 40, 25, 3, 3], which does not match the required output shape [10000, 3, 3]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/Resize.cpp:38.)
FAILED tests/test_sensors.py::test_contact_sensors_gravity_force[2] - AssertionError: ContactSensor for floor should not detect any contact yet.
assert not tensor(True, device='mps:0')
 +  where tensor(True, device='mps:0') = <built-in method any of Tensor object at 0x152753060>()
 +    where <built-in method any of Tensor object at 0x152753060> = tensor([[1],\n        [0]], device='mps:0', dtype=torch.int32).any
 +      where tensor([[1],\n        [0]], device='mps:0', dtype=torch.int32) = read()
 +        where read = \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 <gs.engine.sensors.contact_force.ContactSensor> \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n'is_built': <bool>: True\n.read
FAILED tests/test_sensors.py::test_lidar_bvh_parallel_env - RuntimeError: [spirv_ir_builder.cpp:get_primitive_type@307] Type f64 not supported.
FAILED tests/test_sensors.py::test_add_and_read_all_registered_sensors - TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.
FAILED tests/test_sensors.py::test_temperature_grid_sensor_contact_and_reset[2] - TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.
FAILED tests/test_sensors.py::test_elastomer_displacement_sensor_box_sphere[0] - RuntimeError: [spirv_ir_builder.cpp:get_primitive_type@307] Type f64 not supported.
FAILED tests/test_sensors.py::test_temperature_grid_simulate_all_link_temps[2] - TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.
FAILED tests/test_sensors.py::test_temperature_grid_simulate_all_link_temps[0] - TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.
FAILED tests/test_sensors.py::test_lidar_cache_offset_parallel_env - RuntimeError: [spirv_ir_builder.cpp:get_primitive_type@307] Type f64 not supported.
FAILED tests/test_sensors.py::test_elastomer_displacement_sensor_box_sphere[2] - RuntimeError: [spirv_ir_builder.cpp:get_primitive_type@307] Type f64 not supported.
FAILED tests/test_sensors.py::test_contact_sensors_gravity_force[0] - AssertionError: ContactSensor for floor should not detect any contact yet.
assert not tensor(True, device='mps:0')
 +  where tensor(True, device='mps:0') = <built-in method any of Tensor object at 0x158336d90>()
 +    where <built-in method any of Tensor object at 0x158336d90> = tensor([1], device='mps:0', dtype=torch.int32).any
 +      where tensor([1], device='mps:0', dtype=torch.int32) = read()
 +        where read = \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 <gs.engine.sensors.contact_force.ContactSensor> \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n'is_built': <bool>: True\n.read
================================================= 20 failed, 472 passed, 71 skipped, 1 xfailed, 1687 warnings in 883.56s (0:14:43) ===

)

Collaborator

rebased. when you say reinstall you mean [...]

No, I mean uv pip install -e '.[dev]'

Collaborator Author

I see. You want me to rerun either or both of main and/or branch? (45 minutes each...)

@github-actions

⚠️ Abnormal Benchmark Result Detected ➡️ Report

@hughperkins hughperkins marked this pull request as ready for review March 30, 2026 19:06
@hughperkins hughperkins requested a review from YilingQiao as a code owner March 30, 2026 19:06
@duburcqa duburcqa changed the title [MISC] Use direct lower-triangle indexing in hessian tiled kernel. [MISC] Speedup tiled hessian kernel by using direct lower-triangle indexing. Mar 30, 2026
@duburcqa duburcqa merged commit dbd7cbf into Genesis-Embodied-AI:main Mar 30, 2026
22 checks passed
@hughperkins hughperkins deleted the hp/hess-half1 branch March 30, 2026 20:41
@hughperkins
Collaborator Author

🙌

@github-actions

⚠️ Abnormal Benchmark Result Detected ➡️ Report
