
Qualcomm AI Engine Direct - calibration thread auto-tuning#18184

Open
abhinaykukkadapu wants to merge 1 commit intopytorch:mainfrom
abhinaykukkadapu:calibration-thread-tuning

Conversation


@abhinaykukkadapu abhinaykukkadapu commented Mar 14, 2026

TL;DR

Overall calibration time has been cut to roughly 10-25 minutes, down from about 2.5 hours, across various models (a ~10x speedup for the decode phase). These gains are the stacked result of multiple commits. The only remaining bottleneck is QNN SDK compilation, which is opaque to us.

Thread tuning

AR1 decode calibration is SGEMV-dominated and memory-bandwidth-bound. The default thread count (`os.cpu_count()`) causes massive OpenMP synchronization overhead on multi-core hosts. This change adds runtime auto-tuning that sweeps candidate thread counts via a quick microbenchmark and picks the fastest, with a CLI override via `--calibration_num_threads`.

On a 72-vCPU host, auto-tune selects 18-36 threads, yielding 4.6x faster calibration (24 min vs 1h51m) with no PPL regression.

| Host | Baseline threads | Candidates |
|---|---|---|
| 36-core VM (72 logical) | 72 | [1, 9, 18, 27, 36, 48, 54, 72] |
| Same host, 8 cores pinned | 8 | [1, 2, 3, 4, 5, 6, 8] |
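
The sweep described above can be sketched as a small microbenchmark harness. This is a sketch, not the PR's actual code: `set_threads` stands in for `torch.set_num_threads`, and `bench` is a placeholder for a short SGEMV-like workload representative of decode calibration.

```python
import time

def pick_fastest_thread_count(candidates, set_threads, bench, repeats=3):
    """Time `bench()` under each candidate thread count; return the fastest.

    In practice `set_threads` would be torch.set_num_threads and `bench`
    a short matrix-vector workload mirroring decode calibration.
    """
    best_n, best_t = candidates[0], float("inf")
    for n in candidates:
        set_threads(n)
        times = []
        for _ in range(repeats):
            t0 = time.perf_counter()
            bench()
            times.append(time.perf_counter() - t0)
        # Take the minimum over repeats to reduce timing noise.
        if min(times) < best_t:
            best_n, best_t = n, min(times)
    return best_n
```

The returned count is then used for the decode calibration loop only, so a noisy microbenchmark at worst costs a few seconds up front.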

Calibration times for a few models

| Model | Params | SeqMSE | Auto-tune | DECODE calibration | Minutes |
|---|---|---|---|---|---|
| smollm2_135m | 135M | 0 | auto | 565.8s | 9.4 |
| qwen2_5-0_5b | 0.5B | 0 | auto | 736.3s | 12.3 |
| qwen2_5-1_5b | 1.5B | 0 | auto | 933.6s | 15.6 |
| qwen3-0_6b | 0.6B | 50 | auto | 961.9s | 16.0 |
| gemma3-1b | 1B | 0 | auto | 987.8s | 16.5 |
| smollm3-3b | 3B | 0 | auto | 1,355.0s | 22.6 |
| llama3_2-1b_instruct | 1B | 50 | auto | 1,434.3s | 23.9 |
| llama3_2-3b_instruct | 3B | 0 | auto | 1,774.8s | 29.6 |

Llama3.2-1B PPL Validation

| Config | word_perplexity | byte_perplexity |
|---|---|---|
| Baseline (1000 cands, 72 thr) | 15.45 | 1.696 |
| + SeqMSE 50 + no PREFILL calib | 14.97 | 1.685 |
| + Thread auto-tune | 15.03 | 1.687 |

cc @cccclai @cbilgin @digantdesai @tanvirislam-meta

@abhinaykukkadapu abhinaykukkadapu added the module: qnn Issues related to Qualcomm's QNN delegate and code under backends/qualcomm/ label Mar 14, 2026
@github-project-automation github-project-automation bot moved this to To triage in ExecuTorch Core Mar 14, 2026

pytorch-bot bot commented Mar 14, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18184

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 02f6db3 with merge base 8bec69b:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 14, 2026
@abhinaykukkadapu abhinaykukkadapu linked an issue Mar 14, 2026 that may be closed by this pull request
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@abhinaykukkadapu abhinaykukkadapu moved this from To triage to In progress in ExecuTorch Core Mar 14, 2026
@abhinaykukkadapu abhinaykukkadapu marked this pull request as ready for review March 14, 2026 21:38
@abhinaykukkadapu abhinaykukkadapu force-pushed the calibration-thread-tuning branch from 64f1fb6 to 3b02283 Compare March 15, 2026 23:51
AR1 decode calibration is SGEMV-dominated and memory-bandwidth-bound.
The default thread count (os.cpu_count()) causes massive OpenMP sync
overhead on multi-core hosts.

Add runtime auto-tuning that samples fractions of the available thread
ceiling (1/8 through 1.0) via a quick microbenchmark before
prepare_pt2e — no observers exist yet, so synthetic benchmark inputs
cannot pollute calibration state. Uses sched_getaffinity when available
to respect cgroup/taskset constraints. Thread count is scoped to
calibration only and restored after decode calibration phase.

CLI override via --calibration_num_threads (0 = auto-tune, default).

On a 72-vCPU host, auto-tune selects 18-36 threads depending on the
workload, yielding 10.1x faster calibration (21.8 min vs 3h40m)
with no PPL regression.
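
The candidate generation described in the commit message can be sketched as follows. The exact fraction set is an assumption (the message says 1/8 through 1.0 without listing the steps), and `thread_candidates` is a hypothetical helper name, not the PR's function.

```python
import os

def thread_candidates():
    """Build candidate thread counts as fractions (1/8 .. 1.0) of the ceiling.

    Uses sched_getaffinity where available so cgroup/taskset constraints
    are respected; falls back to os.cpu_count() on platforms without it
    (e.g. macOS). The fraction steps here are illustrative.
    """
    try:
        ceiling = len(os.sched_getaffinity(0))
    except AttributeError:
        ceiling = os.cpu_count() or 1
    fractions = (1/8, 1/4, 3/8, 1/2, 5/8, 3/4, 7/8, 1.0)
    # Dedupe and clamp to at least one thread for small ceilings.
    return sorted({max(1, round(ceiling * f)) for f in fractions})
```

On a 72-logical-CPU host this yields candidates like [9, 18, 27, 36, 45, 54, 63, 72], consistent in spirit with the table above.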
@abhinaykukkadapu abhinaykukkadapu force-pushed the calibration-thread-tuning branch from 3b02283 to 02f6db3 Compare March 16, 2026 16:11
Comment on lines +630 to +631:

```python
original_threads = torch.get_num_threads()
torch.set_num_threads(calib_threads)
```
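
The save/restore pair quoted above can be wrapped so the thread count is scoped to calibration and restored even if calibration raises. A minimal sketch; `get_threads`/`set_threads` stand in for `torch.get_num_threads`/`torch.set_num_threads` so the snippet runs without torch installed.

```python
from contextlib import contextmanager

@contextmanager
def scoped_thread_count(n, get_threads, set_threads):
    """Set the thread count for the duration of the block, then restore it."""
    original = get_threads()
    set_threads(n)
    try:
        yield
    finally:
        # Restore even on error, matching the diff's save/restore intent.
        set_threads(original)

# Usage with a fake thread-count store (swap in torch's getter/setter):
state = {"threads": 72}
with scoped_thread_count(18, lambda: state["threads"],
                         lambda v: state.update(threads=v)):
    assert state["threads"] == 18  # decode calibration would run here
assert state["threads"] == 72      # restored afterwards
```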

What does it actually do and mean? How is it different between cpu and gpu? Can we use gpu to calibrate still?

I checked a bit more and this is what claude said

PyTorch uses a heuristic that depends on the environment:

  • Locally / outside containers: It typically defaults to the number of logical CPU cores (os.cpu_count()), which counts hyperthreaded cores.
  • In containers / limited environments (like Docker with CPU limits, Kubernetes, or certain cloud VMs): PyTorch tries to respect CPU affinity and cgroup limits, so the thread count may be lower.
  • With OpenMP: If PyTorch is compiled with OpenMP (common on Linux), the thread count may be governed by OMP_NUM_THREADS, which, if unset, OpenMP often sets to the logical core count.

It seems this is specific to PyTorch builds with OpenMP.

Curious what Qualcomm folks set up is. @haowhsu-quic

@abhinaykukkadapu (Contributor, Author) commented Mar 16, 2026

Yeah, in my experiments the high per-iteration time is due to threads waiting at the barrier (you can see the large pillar named mkl_blas_sgemv in the flamegraph in the linked GH issue). This is a matrix-vector multiply, specific to decode, where the workloads are smaller than the conv2d kernels; PyTorch seems to default to high thread counts assuming larger workloads.

@haowhsu-quic can you please pull this PR on top of main (I just merged my coarse + fine PR) and see if tuning works on other VMs?


How about GPU? Does it make a difference?


> pytorch seems to default high thread counts assuming larger workloads.

what is PyTorch logic here?


Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. module: qnn Issues related to Qualcomm's QNN delegate and code under backends/qualcomm/

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

Optimize decode loop in calibration