GH-48277: [C++][Parquet] unpack with shuffle algorithm #47994
AntoinePrv wants to merge 81 commits into apache:main
@pitrou apart from R-lint, this is looking pretty good.
@ursabot please benchmark lang=C++
Benchmark runs are scheduled for commit a4bfe8a. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.
Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit a4bfe8a. There were 37 benchmark results indicating a performance regression. The full Conbench report has more details.
@pitrou I'm running this locally, and I made an error when fixing the ASAN over-read problem.
@ursabot please benchmark lang=C++
Benchmark runs are scheduled for commit dd3ec0d. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.
Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit dd3ec0d. There were 19 benchmark results indicating a performance regression. The full Conbench report has more details.
@ursabot please benchmark lang=C++
Benchmark runs are scheduled for commit 408ef04. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.
@ursabot please benchmark lang=C++
Thanks for your patience. Conbench analyzed the 0 benchmarking runs that have been run so far on PR commit 408ef04. None of the specified runs were found on the Conbench server. The full Conbench report has more details.
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
@github-actions crossbow submit -g cpp
Revision: b638570. Submitted crossbow builds: ursacomputing/crossbow @ actions-158ce97c66
Unfortunately the "AMD64 Windows R release" failure looks related to this PR: it's deterministic (I've restarted it twice), it doesn't occur on git main, and the test where it fails/crashes uses Parquet. See arrow/r/tests/testthat/test-dplyr-summarize.R, lines 260 to 271 in ebaaf07.
@ursabot please benchmark lang=C++
Benchmark runs are scheduled for commit b638570. Watch https://buildkite.com/apache-arrow and https://conbench.arrow-dev.org for updates. A comment will be posted here when the runs are complete.
Rationale for this change
The current bit-unpacking algorithm (which is implemented as a C++ code generator script in Python) does not fully leverage SIMD operations: all loads and some bitshifts use scalar operations, leaving performance on the table.
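For readers unfamiliar with the operation, here is a minimal scalar sketch of what bit-unpacking computes. It is not the code being replaced: `UnpackScalar` is a hypothetical name, and the only assumption carried over from Parquet is the bit layout of its RLE/bit-packed hybrid encoding (values stored back to back, least-significant bit first). It is deliberately slow, processing one bit at a time, and mainly useful as a reference against which a faster version can be checked.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference unpacker (not part of Arrow's API).
// Reads `count` values of `bit_width` bits each, packed back to back with
// the least-significant bit first, and widens each value to uint32_t.
std::vector<uint32_t> UnpackScalar(const uint8_t* packed, std::size_t count,
                                   int bit_width) {
  assert(packed != nullptr);
  assert(bit_width >= 0 && bit_width <= 32);
  std::vector<uint32_t> out(count);
  std::size_t bit_pos = 0;  // absolute bit offset into `packed`
  for (std::size_t i = 0; i < count; ++i) {
    uint32_t value = 0;
    for (int b = 0; b < bit_width; ++b, ++bit_pos) {
      // Extract one bit from the input and place it at position b.
      const uint32_t bit = (packed[bit_pos / 8] >> (bit_pos % 8)) & 1u;
      value |= bit << b;
    }
    out[i] = value;
  }
  return out;
}
```

With a bit width of 3, the three bytes 0x88, 0xC6, 0xFA unpack to the values 0 through 7, which is the worked example given in the Parquet encoding specification.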
What changes are included in this PR?
Devise new bit-unpacking algorithms that fully leverage SIMD operations, for various parameter values of (packed bit width, destination integer width, SIMD register size). Different algorithms are necessary for different parameter values, because of straddling issues with some bit offsets.
Implement these new algorithms entirely in C++ using metaprogramming: the tables necessary for efficient SIMD swizzling and shifting are computed at compile-time using constexpr code (the exception is AVX-512, which still uses the legacy Python code generation script).
Implement low-level generic fallbacks for SIMD operations that are not available in all SIMD instruction sets, such as some flavors of bit-shifting. These fallbacks are also being contributed to xsimd, but have not been merged there yet.
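To make the idea of compile-time swizzle and shift tables concrete, here is a small sketch under stated assumptions. It is not the PR's implementation: `UnpackPlan`, `MakeUnpackPlan` and `ApplyPlan` are hypothetical names, the register is assumed to hold four 32-bit lanes, and the SIMD byte shuffle and per-lane shift are emulated with scalar loops so the example stays portable. The `straddles` flag illustrates one way a straddling issue can arise: when a value's bit offset within its first byte plus its width exceeds 32, the value spans five bytes and no longer fits in a single 32-bit lane after one shuffle.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical plan for unpacking kLanes values of kBitWidth bits into
// 32-bit lanes: one source byte index per output byte (the shuffle table)
// and one right-shift amount per lane.
template <int kBitWidth, int kLanes = 4>
struct UnpackPlan {
  uint8_t byte_index[kLanes * 4] = {};
  uint8_t shift[kLanes] = {};
  bool straddles = false;  // true if some value spans more than 4 bytes
};

// Computes the plan entirely at compile time.
template <int kBitWidth, int kLanes = 4>
constexpr UnpackPlan<kBitWidth, kLanes> MakeUnpackPlan() {
  static_assert(kBitWidth >= 1 && kBitWidth <= 32, "unsupported bit width");
  UnpackPlan<kBitWidth, kLanes> plan{};
  for (int lane = 0; lane < kLanes; ++lane) {
    const int first_bit = lane * kBitWidth;
    const int first_byte = first_bit / 8;
    plan.shift[lane] = static_cast<uint8_t>(first_bit % 8);
    for (int b = 0; b < 4; ++b) {
      plan.byte_index[lane * 4 + b] = static_cast<uint8_t>(first_byte + b);
    }
    if (first_bit % 8 + kBitWidth > 32) plan.straddles = true;
  }
  return plan;
}

// Scalar emulation of "byte shuffle, then per-lane shift and mask".
// `packed` must have at least 16 readable bytes, mirroring a 128-bit load.
template <int kBitWidth, int kLanes = 4>
void ApplyPlan(const uint8_t* packed, uint32_t* out) {
  assert(packed != nullptr && out != nullptr);
  constexpr auto plan = MakeUnpackPlan<kBitWidth, kLanes>();
  static_assert(!plan.straddles, "this width needs a different algorithm");
  constexpr uint32_t mask =
      static_cast<uint32_t>((uint64_t{1} << kBitWidth) - 1);
  for (int lane = 0; lane < kLanes; ++lane) {
    uint32_t word = 0;
    for (int b = 0; b < 4; ++b) {  // gather the lane's four source bytes
      word |= static_cast<uint32_t>(packed[plan.byte_index[lane * 4 + b]])
              << (8 * b);
    }
    out[lane] = (word >> plan.shift[lane]) & mask;
  }
}
```

In this sketch, widths up to 25 never straddle (the largest in-byte offset is 7, and 7 + 25 = 32), while a width of 27 does at the third value (bit offset 54, so 6 + 27 = 33). Note also that the emulated load reads a fixed 16 bytes regardless of how many bytes the four values occupy, so the input buffer needs enough readable padding.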
Benchmark results on an AVX2 CPU (AMD Zen 2) on Linux (Ubuntu 24.04):
Parquet decoding:
Parquet reading:
Are these changes tested?
Yes, by the current extensive bit-unpacking and Parquet decoding tests.
Are there any user-facing changes?
No