
Qualcomm AI Engine Direct - calibration thread auto-tuning#18184

Open
abhinaykukkadapu wants to merge 1 commit intopytorch:mainfrom
abhinaykukkadapu:calibration-thread-tuning

Conversation


@abhinaykukkadapu abhinaykukkadapu commented Mar 14, 2026

TL;DR

Overall calibration time has been cut to roughly 10-25 minutes, down from about 2.5 hours, across various models (a ~10x speedup for the decode phase). These gains are the stacked result of multiple commits. The only remaining bottleneck is QNN SDK compilation, which is opaque to us.

Thread tuning

AR1 decode calibration is SGEMV-dominated and memory-bandwidth-bound. The default thread count (`os.cpu_count()`) causes massive OpenMP synchronization overhead on multi-core hosts. This change adds runtime auto-tuning that sweeps candidate thread counts via a quick microbenchmark and picks the fastest, with a CLI override via `--calibration_num_threads`.

On a 72-vCPU host, auto-tune selects 18-36 threads, yielding 4.6x faster calibration (24 min vs 1h51m) with no PPL regression.

| Host | Baseline threads | Candidates |
|---|---|---|
| 36-core VM (72 logical) | 72 | [1, 9, 18, 27, 36, 48, 54, 72] |
| Same host, 8 cores pinned | 8 | [1, 2, 3, 4, 5, 6, 8] |
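
The sweep described above can be sketched as a small microbenchmark harness. This is a sketch, not the PR's actual code: `set_threads` stands in for `torch.set_num_threads`, and `bench` is a placeholder for a short SGEMV-like workload representative of decode calibration.

```python
import time

def pick_fastest_thread_count(candidates, set_threads, bench, repeats=3):
    """Time `bench()` under each candidate thread count; return the fastest.

    In practice `set_threads` would be torch.set_num_threads and `bench`
    a short matrix-vector workload mirroring decode calibration.
    """
    best_n, best_t = candidates[0], float("inf")
    for n in candidates:
        set_threads(n)
        times = []
        for _ in range(repeats):
            t0 = time.perf_counter()
            bench()
            times.append(time.perf_counter() - t0)
        # Take the minimum over repeats to reduce timing noise.
        if min(times) < best_t:
            best_n, best_t = n, min(times)
    return best_n
```

The returned count is then used for the decode calibration loop only, so a noisy microbenchmark at worst costs a few seconds up front.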

Calibration times for a few models

| Model | Params | SeqMSE | Auto-tune | DECODE calibration | Minutes |
|---|---|---|---|---|---|
| smollm2_135m | 135M | 0 | auto | 565.8s | 9.4 |
| qwen2_5-0_5b | 0.5B | 0 | auto | 736.3s | 12.3 |
| qwen2_5-1_5b | 1.5B | 0 | auto | 933.6s | 15.6 |
| qwen3-0_6b | 0.6B | 50 | auto | 961.9s | 16.0 |
| gemma3-1b | 1B | 0 | auto | 987.8s | 16.5 |
| smollm3-3b | 3B | 0 | auto | 1,355.0s | 22.6 |
| llama3_2-1b_instruct | 1B | 50 | auto | 1,434.3s | 23.9 |
| llama3_2-3b_instruct | 3B | 0 | auto | 1,774.8s | 29.6 |

Llama3.2-1B PPL Validation

| Config | word_perplexity | byte_perplexity |
|---|---|---|
| Baseline (1000 cands, 72 thr) | 15.45 | 1.696 |
| + SeqMSE 50 + no PREFILL calib | 14.97 | 1.685 |
| + Thread auto-tune | 15.03 | 1.687 |

cc @cccclai @cbilgin @digantdesai @tanvirislam-meta

@abhinaykukkadapu abhinaykukkadapu added the module: qnn Issues related to Qualcomm's QNN delegate and code under backends/qualcomm/ label Mar 14, 2026
@github-project-automation github-project-automation bot moved this to To triage in ExecuTorch Core Mar 14, 2026

pytorch-bot bot commented Mar 14, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18184

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 02f6db3 with merge base 8bec69b:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 14, 2026
@abhinaykukkadapu abhinaykukkadapu linked an issue Mar 14, 2026 that may be closed by this pull request
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@abhinaykukkadapu abhinaykukkadapu moved this from To triage to In progress in ExecuTorch Core Mar 14, 2026
@abhinaykukkadapu abhinaykukkadapu marked this pull request as ready for review March 14, 2026 21:38
@abhinaykukkadapu abhinaykukkadapu force-pushed the calibration-thread-tuning branch from 64f1fb6 to 3b02283 Compare March 15, 2026 23:51
AR1 decode calibration is SGEMV-dominated and memory-bandwidth-bound.
The default thread count (os.cpu_count()) causes massive OpenMP sync
overhead on multi-core hosts.

Add runtime auto-tuning that samples fractions of the available thread
ceiling (1/8 through 1.0) via a quick microbenchmark before
prepare_pt2e — no observers exist yet, so synthetic benchmark inputs
cannot pollute calibration state. Uses sched_getaffinity when available
to respect cgroup/taskset constraints. Thread count is scoped to
calibration only and restored after decode calibration phase.

CLI override via --calibration_num_threads (0 = auto-tune, default).

On a 72-vCPU host, auto-tune selects 18-36 threads depending on the
workload, yielding 10.1x faster calibration (21.8 min vs 3h40m)
with no PPL regression.
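
The candidate generation described in the commit message can be sketched as follows. The exact fraction set is an assumption (the message says 1/8 through 1.0 without listing the steps), and `thread_candidates` is a hypothetical helper name, not the PR's function.

```python
import os

def thread_candidates():
    """Build candidate thread counts as fractions (1/8 .. 1.0) of the ceiling.

    Uses sched_getaffinity where available so cgroup/taskset constraints
    are respected; falls back to os.cpu_count() on platforms without it
    (e.g. macOS). The fraction steps here are illustrative.
    """
    try:
        ceiling = len(os.sched_getaffinity(0))
    except AttributeError:
        ceiling = os.cpu_count() or 1
    fractions = (1/8, 1/4, 3/8, 1/2, 5/8, 3/4, 7/8, 1.0)
    # Dedupe and clamp to at least one thread for small ceilings.
    return sorted({max(1, round(ceiling * f)) for f in fractions})
```

On a 72-logical-CPU host this yields candidates like [9, 18, 27, 36, 45, 54, 63, 72], consistent in spirit with the table above.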
@abhinaykukkadapu abhinaykukkadapu force-pushed the calibration-thread-tuning branch from 3b02283 to 02f6db3 Compare March 16, 2026 16:11
Comment on lines +630 to +631:

```python
original_threads = torch.get_num_threads()
torch.set_num_threads(calib_threads)
```
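
The save/restore pair quoted above can be wrapped so the thread count is scoped to calibration and restored even if calibration raises. A minimal sketch; `get_threads`/`set_threads` stand in for `torch.get_num_threads`/`torch.set_num_threads` so the snippet runs without torch installed.

```python
from contextlib import contextmanager

@contextmanager
def scoped_thread_count(n, get_threads, set_threads):
    """Set the thread count for the duration of the block, then restore it."""
    original = get_threads()
    set_threads(n)
    try:
        yield
    finally:
        # Restore even on error, matching the diff's save/restore intent.
        set_threads(original)

# Usage with a fake thread-count store (swap in torch's getter/setter):
state = {"threads": 72}
with scoped_thread_count(18, lambda: state["threads"],
                         lambda v: state.update(threads=v)):
    assert state["threads"] == 18  # decode calibration would run here
assert state["threads"] == 72      # restored afterwards
```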

What does it actually do and mean? How is it different between cpu and gpu? Can we use gpu to calibrate still?

I checked a bit more and this is what claude said

PyTorch uses a heuristic that depends on the environment:

  • Locally / outside containers: It typically defaults to the number of logical CPU cores (os.cpu_count()), which counts hyperthreaded cores.
  • In containers / limited environments (like Docker with CPU limits, Kubernetes, or certain cloud VMs): PyTorch tries to respect CPU affinity and cgroup limits, so the thread count may be lower.
  • With OpenMP: If PyTorch is compiled with OpenMP (common on Linux), the thread count may be governed by OMP_NUM_THREADS, which, if unset, OpenMP often sets to the logical core count.

It seems this is specific to PyTorch builds with OpenMP.

Curious what Qualcomm folks set up is. @haowhsu-quic

@abhinaykukkadapu (Contributor, Author) commented Mar 16, 2026

Yeah, in my experiments the high per-iteration time is due to threads waiting at the barrier (you can see the large pillar named mkl_blas_sgemv in the flamegraph in the linked GH issue). This is a matrix-vector multiply, specific to decode, where the workloads are smaller than the conv2d kernels; PyTorch seems to default to high thread counts assuming larger workloads.

@haowhsu-quic can you please pull this PR on top of main (I just merged my coarse + fine PR) and see if tuning works on other VMs?


How about GPU? Does it make a difference?


> pytorch seems to default high thread counts assuming larger workloads.

what is PyTorch logic here?


Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. module: qnn Issues related to Qualcomm's QNN delegate and code under backends/qualcomm/

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

Optimize decode loop in calibration