perf: Speed up purge_table with manifest dedup, parallel deletion, and schema caching#3233

Open
damahua wants to merge 1 commit into apache:main from damahua:perf/purge-table-parallel-delete
Conversation

damahua commented Apr 11, 2026

Closes #

Rationale for this change

purge_table is significantly slower than necessary due to three compounding inefficiencies:

  1. Manifest deduplication: When iterating through snapshots to collect manifests, the same manifest file appears in every subsequent snapshot's manifest list. For a table with 200 snapshots, this means 20,100 manifest file opens for only 200 unique files (100× redundancy).

  2. Sequential file deletion: Data files, manifest files, and metadata files are deleted one at a time in sequential loops. The Java reference implementation (CatalogUtil.dropTableData) already deletes files concurrently using a worker thread pool.

  3. Redundant Avro schema parsing: Every manifest file open triggers JSON parsing of the embedded Avro schema, conversion to Iceberg schema, and resolution of a reader tree — even though all manifests of the same type share the identical schema.
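The manifest-redundancy arithmetic in point 1 can be sketched with a quick back-of-envelope model (assuming, as in the benchmark setup, that each append snapshot adds exactly one new manifest, so snapshot *i*'s manifest list references *i* manifests):

```python
def naive_manifest_opens(num_snapshots: int) -> int:
    # Snapshot i's manifest list references all i manifests written so far,
    # so a walk over every snapshot opens 1 + 2 + ... + N manifests.
    return sum(range(1, num_snapshots + 1))

def deduped_manifest_opens(num_snapshots: int) -> int:
    # After deduplicating by manifest path, each unique file is opened once.
    return num_snapshots

print(naive_manifest_opens(200))    # 20100
print(deduped_manifest_opens(200))  # 200
```

For 200 snapshots this is 20,100 opens versus 200, matching the ~100× redundancy quoted above.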

Benchmark

Table with 200 snapshots, 200 data files, 200 manifests (generated by appending 200 batches of 20K rows each):

| Metric | Before | After |
| --- | --- | --- |
| purge_table mean (N=3) | 7.187s ± 0.165s | 0.133s ± 0.004s |
| Speedup | | 54× |
| Manifest opens | 20,100 | 200 |
| Schema conversions | 21,732 | ~2 (cache hits) |

Changes

pyiceberg/catalog/__init__.py:

  • delete_data_files(): Deduplicate manifests by manifest_path before iterating. Collect all unique data file paths, then delegate to delete_files().
  • delete_files(): Use ExecutorFactory.get_or_create().map() for concurrent deletion — the same ThreadPoolExecutor pattern already used in plan_files(), to_arrow(), and _write_added_manifest().
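A minimal sketch of the two changes above. This is not the PR's actual diff: it uses a plain `ThreadPoolExecutor` in place of pyiceberg's `ExecutorFactory.get_or_create()`, and the manifest-reading method names approximate the pyiceberg API.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, Set

logger = logging.getLogger(__name__)

def delete_files(io, files_to_delete: Set[str], file_type: str) -> None:
    """Delete paths concurrently; an OSError is logged as a warning and
    deletion continues with the remaining files."""
    def _delete(path: str) -> None:
        try:
            io.delete(path)
        except OSError as exc:
            logger.warning("Failed to delete %s file %s: %s", file_type, path, exc)

    # The PR reuses the shared executor via ExecutorFactory.get_or_create();
    # a local ThreadPoolExecutor shows the same map() pattern.
    with ThreadPoolExecutor() as executor:
        list(executor.map(_delete, files_to_delete))  # drain to wait for completion

def delete_data_files(io, manifests: Iterable) -> None:
    # Deduplicate by path first: the same manifest file appears in the
    # manifest list of every subsequent snapshot.
    unique_manifests = {m.manifest_path: m for m in manifests}.values()
    data_paths = {
        entry.data_file.file_path
        for manifest in unique_manifests
        for entry in manifest.fetch_manifest_entry(io)
    }
    delete_files(io, data_paths, "data")
```

Collecting all unique paths before delegating to `delete_files()` also means the thread pool is filled once with the full work set rather than per manifest.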

pyiceberg/avro/file.py:

  • Cache Avro-to-Iceberg schema conversion in a module-level dict keyed by the raw schema JSON string.
  • Cache resolved reader trees keyed by (file_schema, read_schema, read_types, read_enums).
  • Both caches use threading.Lock with double-checked locking for thread safety across all Python implementations (not just CPython).
  • Reader objects are stateless (read() takes a decoder argument, no mutation of self), so sharing cached readers across calls is safe.
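The double-checked locking described above can be sketched as follows. The cache and function names here are illustrative, not the PR's actual identifiers; the shape is what matters: a lock-free first read, then a re-check under the lock so only one thread pays for the expensive conversion.

```python
import threading
from typing import Callable, Dict

# Module-level cache keyed by the raw Avro schema JSON string (names assumed)
_schema_conversion_cache: Dict[str, object] = {}
_cache_lock = threading.Lock()

def cached_schema_conversion(raw_schema_json: str, convert: Callable[[str], object]) -> object:
    """Return the cached conversion for raw_schema_json, computing it at most once.

    The lock-free first read is safe because individual dict reads and writes
    are thread-safe; the explicit Lock (rather than relying on the GIL) keeps
    this correct on all Python implementations, not just CPython.
    """
    result = _schema_conversion_cache.get(raw_schema_json)
    if result is None:
        with _cache_lock:
            # Re-check: another thread may have populated the cache while
            # we were waiting for the lock.
            result = _schema_conversion_cache.get(raw_schema_json)
            if result is None:
                result = convert(raw_schema_json)
                _schema_conversion_cache[raw_schema_json] = result
    return result
```

The reader-tree cache follows the same pattern, with the tuple `(file_schema, read_schema, read_types, read_enums)` as the key.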

Are these changes tested?

Yes:

  • All 130 existing Avro tests pass
  • All 12 SQL catalog tests pass (including purge-related tests)
  • All 7 catalog base tests pass
  • Benchmarked with N=3 runs on a 200-snapshot table, distributions completely non-overlapping

Are there any user-facing changes?

No API changes. purge_table produces the same result (all files deleted), just faster. The delete_files function now deletes concurrently instead of sequentially, with the same error handling (OSError logged as warning, continues with remaining files).

…elizing file deletion

Three changes to reduce purge_table wall time from ~7s to ~0.13s (54x) on a table with 200 snapshots:

1. Deduplicate manifests by path before iterating in delete_data_files().
   The same manifest appears across many snapshots' manifest lists.
   For 200 snapshots this reduces 20,100 manifest opens to 200.

2. Parallelize file deletion using the existing ExecutorFactory
   ThreadPoolExecutor, matching the pattern already used for manifest
   reading in plan_files() and data file reading in to_arrow().
   This aligns with the Java reference implementation (CatalogUtil.dropTableData)
   which also deletes files concurrently via a worker thread pool.

3. Cache Avro-to-Iceberg schema conversion and reader tree resolution.
   All manifests of the same type share the same Avro schema, but it was
   being JSON-parsed, converted, and resolved into a reader tree on every
   open. Uses explicit threading.Lock for thread safety across all Python
   implementations.
damahua force-pushed the perf/purge-table-parallel-delete branch from 7ec0217 to 576fd03 on April 11, 2026 at 05:28
