Qualcomm AI Engine Direct - calibration thread auto-tuning#18184
abhinaykukkadapu wants to merge 1 commit into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18184
Note: Links to docs will display an error until the docs builds have been completed. ❌ 1 new failure as of commit 02f6db3 with merge base 8bec69b.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed: 64f1fb6 to 3b02283
AR1 decode calibration is SGEMV-dominated and memory-bandwidth-bound. The default thread count (os.cpu_count()) causes massive OpenMP sync overhead on multi-core hosts. Add runtime auto-tuning that samples fractions of the available thread ceiling (1/8 through 1.0) via a quick microbenchmark before prepare_pt2e — no observers exist yet, so synthetic benchmark inputs cannot pollute calibration state. Uses sched_getaffinity when available to respect cgroup/taskset constraints. Thread count is scoped to calibration only and restored after decode calibration phase. CLI override via --calibration_num_threads (0 = auto-tune, default). On a 72-vCPU host, auto-tune selects 18-36 threads depending on the workload, yielding 10.1x faster calibration (21.8 min vs 3h40m) with no PPL regression.
Force-pushed: 3b02283 to 02f6db3
```python
original_threads = torch.get_num_threads()
torch.set_num_threads(calib_threads)
```
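For context, these two calls resize PyTorch's global intra-op (OpenMP) thread pool. A scoped wrapper, shown here as a sketch rather than the code in this PR, would guarantee the original count is restored even if calibration raises:

```python
import torch


def with_num_threads(fn, num_threads: int):
    """Run fn with a temporary intra-op thread count, then restore it.
    (Illustrative sketch; the PR restores the count after the decode
    calibration phase rather than using a wrapper like this.)"""
    original_threads = torch.get_num_threads()
    torch.set_num_threads(num_threads)
    try:
        return fn()
    finally:
        torch.set_num_threads(original_threads)
```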
What does it actually do and mean? How is it different between CPU and GPU? Can we still use the GPU to calibrate?
I checked a bit more, and this is what Claude said:
PyTorch uses a heuristic that depends on the environment:
- Locally / outside containers: It typically defaults to the number of logical CPU cores (os.cpu_count()), which counts hyperthreaded cores.
- In containers / limited environments (like Docker with CPU limits, Kubernetes, or certain cloud VMs): PyTorch tries to respect CPU affinity and cgroup limits, so the thread count may be lower.
- With OpenMP: If PyTorch is compiled with OpenMP (common on Linux), the thread count may be governed by OMP_NUM_THREADS, which, if unset, OpenMP often sets to the logical core count.
It seems like this is specific to PyTorch builds with OpenMP.
Curious what the Qualcomm folks' setup is. @haowhsu-quic
Yeah, in my experiments the high per-iteration time is due to threads waiting at the barrier (you can see the large pillar, named mkl_blas_sgemv, in the flamegraph from the linked GH issues). That kernel is matrix-vector multiply and is specific to decode, where the workloads are smaller than the conv2d kernels; PyTorch seems to default to high thread counts assuming larger workloads.
@haowhsu-quic can you please pull this PR on top of main (I just merged my coarse + fine PR) and see if tuning works on other VMs?
How about GPU? Does it make a difference?
> pytorch seems to default high thread counts assuming larger workloads.
What is PyTorch's logic here?
TL;DR
Overall calibration time has been cut to roughly 10-25 minutes, down from the previous ~2.5 h, across various models (a 10x speedup for the decode phase). These optimizations are the stacked result of multiple commits. The only remaining bottleneck is the QNN SDK compile, which is opaque to us.
Thread tuning
AR1 decode calibration is SGEMV-dominated and memory-bandwidth-bound. The default thread count (os.cpu_count()) causes massive OpenMP sync overhead on multi-core hosts. Add runtime auto-tuning that sweeps candidate thread counts via a quick microbenchmark and picks the fastest. CLI override via --calibration_num_threads. On a 72-vCPU host, auto-tune selects 18-36 threads, yielding 4.6x faster calibration (24 min vs 1h51m) with no PPL regression.
Calibration times for a few models
Llama3.2-1B PPL Validation
cc @cccclai @cbilgin @digantdesai @tanvirislam-meta