Arrow IPC binary fetch path for DataFrame execution #1489

Open
Martozar wants to merge 4 commits into gooddata:master from Martozar:c.mze-cq-105

Conversation

Contributor

@Martozar Martozar commented Mar 30, 2026

Summary

Adds a native Arrow IPC binary fetch path to gooddata-pandas, providing a faster alternative to the existing JSON-paged AFM path for large result sets.

What changed

gooddata-sdk — binary fetch

  • BareExecutionResponse.read_result_arrow() fetches execution results from the server's binary IPC endpoint and returns a pyarrow.Table.

gooddata-pandas — Arrow→DataFrame conversion

  • DataFrameFactory.for_exec_def_arrow() — new public method that mirrors for_exec_def() but uses the binary path.
  • for_arrow_table() — pure conversion from pa.Table to (pd.DataFrame, DataFrameMetadata), enabling callers to bring their own Arrow data.
  • convert_arrow_table_to_dataframe() — low-level converter that reconstructs row/column MultiIndex, subtotals, primary labels, and types from Arrow field metadata.

Why

The JSON paging path serialises every result to JSON and pages it in chunks, which is CPU-heavy and slow for wide or deep result sets. Arrow IPC transfers binary columnar data in a single round-trip. End-to-end benchmarks against the GoodData demo workspace show a 1.3×–33× speedup depending on table shape, with larger tables benefiting most.

Test coverage

  • 140 unit tests covering: missing metadata keys (all three required keys), self_destruct mode, _build_field_index edge cases (subtotal padding, asymmetric depth), compute_row_totals_indexes with empty dimensions, for_arrow_table correctness across flat/transposed/subtotals/both-dim-totals cases.
  • 47 ground-truth fixture cases generated against the live API and committed to tests/dataframe/fixtures/arrow/, including 3-metric tables, 3-level nested subtotals, multi-aggregation multi-metric tables, and asymmetric totals (different levels/aggregations per metric).
  • IPC test fixture updated to use ipc.new_file to match the server format.

@Martozar Martozar force-pushed the c.mze-cq-105 branch 3 times, most recently from 7453528 to 0380d40 Compare March 30, 2026 09:22
codecov bot commented Mar 30, 2026

Codecov Report

❌ Patch coverage is 91.55405% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.38%. Comparing base (49ea0d5) to head (7dddd3a).
⚠️ Report is 3 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...s/gooddata-pandas/src/gooddata_pandas/dataframe.py | 46.42% | 15 Missing ⚠️ |
| ...ta-sdk/src/gooddata_sdk/compute/model/execution.py | 76.47% | 4 Missing ⚠️ |
| ...es/gooddata-pandas/src/gooddata_pandas/__init__.py | 50.00% | 3 Missing ⚠️ |
| ...data-pandas/src/gooddata_pandas/arrow_convertor.py | 99.12% | 2 Missing ⚠️ |
| ...ddata-sdk/src/gooddata_sdk/compute/model/filter.py | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1489      +/-   ##
==========================================
+ Coverage   78.13%   78.38%   +0.25%     
==========================================
  Files         228      230       +2     
  Lines       14926    15212     +286     
==========================================
+ Hits        11662    11924     +262     
- Misses       3264     3288      +24     


@Martozar Martozar marked this pull request as draft March 30, 2026 10:47
]

[project.optional-dependencies]
arrow = ["pyarrow>=16.1.0"]
Contributor

Nitpick: this is a new dependency. Consider setting a higher minimum version, e.g. pyarrow>=23.0.1.
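
For context on what a minimum-version floor implies at runtime, here is a minimal stdlib-only sketch of checking an optional dependency against a threshold. The helpers `parse_version` and `meets_minimum` are hypothetical and not part of this PR; the parsing is deliberately simplified (numeric release segments only), and a real check would more likely use the `packaging` library.

```python
from importlib import metadata
from itertools import takewhile


def parse_version(raw: str) -> tuple[int, ...]:
    """Parse the leading numeric release segments, e.g. '16.1.0' -> (16, 1, 0).

    Simplified on purpose: suffixes such as 'rc1' are ignored.
    """
    parts = []
    for piece in raw.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(package: str, minimum: str) -> bool:
    """Return True if `package` is installed at version >= `minimum`."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        # An optional dependency that is absent does not meet any minimum.
        return False
    return parse_version(installed) >= parse_version(minimum)
```

A caller could guard the Arrow code path with `meets_minimum("pyarrow", "16.1.0")` and fall back to the JSON path otherwise; whether the SDK does anything like this is not stated here.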

@Martozar Martozar changed the title from "C.mze cq 105" to "Arrow IPC binary fetch path for DataFrame execution" Apr 1, 2026
@Martozar Martozar marked this pull request as ready for review April 1, 2026 11:01
@Martozar Martozar force-pushed the c.mze-cq-105 branch 2 times, most recently from d7fbc76 to 4e99271 Compare April 1, 2026 13:54
Martozar added 4 commits April 1, 2026 15:57
Switch read_result_arrow to explicitly request application/vnd.apache.arrow.stream
via Accept header and pipe the HTTP response directly into ipc.open_stream(),
eliminating the intermediate BytesIO buffer. Update tests accordingly.
3 participants