Implement per column compression #3396

Open
rahil-c wants to merge 1 commit into apache:master from rahil-c:rahil/per-column-compression

@rahil-c rahil-c commented Feb 16, 2026

Rationale for this change

Issue Raised here: apache/parquet-format#553

The Parquet spec already supports per-column compression: each column chunk stores its own CompressionCodecName in the footer metadata. However, the parquet-java writer API currently forces a single compression codec for all columns in a file. This PR addresses that gap by exposing per-column compression configuration through the existing ColumnProperty infrastructure.

What changes are included in this PR?

  • ParquetProperties: Added a ColumnProperty for the compression codec, following the same pattern used for dictionary encoding and bloom filters.
  • ColumnChunkPageWriteStore: Added a new constructor that accepts a CompressionCodecFactory and ParquetProperties.
  • InternalParquetRecordWriter: Added a new constructor accepting CompressionCodecFactory instead of a single BytesInputCompressor.
  • ParquetWriter: Added withCompressionCodec(String,CompressionCodecName) builder method. Updated the core constructor to pass the CompressionCodecFactory through to the writer stack.
  • ParquetOutputFormat: Added a ColumnConfigParser entry so per-column compression can be configured via Hadoop config keys (parquet.compression#<column_path>=<CODEC_NAME>).
  • ParquetRecordWriter: Updated to pass CompressionCodecFactory to InternalParquetRecordWriter.
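
To illustrate the "default plus per-column override" pattern that the ParquetProperties change relies on, here is a minimal, self-contained sketch. It is not the parquet-java ColumnProperty class: `ColumnPropertySketch`, its builder, and the `Codec` enum are hypothetical stand-ins, and columns are keyed by a plain dotted path string for simplicity.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a per-column property: one default value
// plus optional overrides keyed by column path.
public class ColumnPropertySketch<T> {
    // Illustrative codec names only; not the real CompressionCodecName enum.
    enum Codec { UNCOMPRESSED, SNAPPY, ZSTD }

    private final T defaultValue;
    private final Map<String, T> overrides;

    private ColumnPropertySketch(T defaultValue, Map<String, T> overrides) {
        this.defaultValue = defaultValue;
        this.overrides = overrides;
    }

    // Returns the override for this column if one was set, else the default.
    public T getValue(String columnPath) {
        return overrides.getOrDefault(columnPath, defaultValue);
    }

    public static <T> Builder<T> builder(T defaultValue) {
        return new Builder<>(defaultValue);
    }

    public static final class Builder<T> {
        private final T defaultValue;
        private final Map<String, T> overrides = new HashMap<>();

        private Builder(T defaultValue) {
            this.defaultValue = defaultValue;
        }

        // Registers an override for a single column path.
        public Builder<T> withValue(String columnPath, T value) {
            overrides.put(columnPath, value);
            return this;
        }

        public ColumnPropertySketch<T> build() {
            // Copy the map so later builder changes do not leak into the result.
            return new ColumnPropertySketch<>(defaultValue, new HashMap<>(overrides));
        }
    }

    public static void main(String[] args) {
        ColumnPropertySketch<Codec> codecs = builder(Codec.SNAPPY)
                .withValue("embeddings", Codec.UNCOMPRESSED)
                .build();
        System.out.println("id -> " + codecs.getValue("id"));
        System.out.println("embeddings -> " + codecs.getValue("embeddings"));
    }
}
```

With a lookup like this, the writer stack can ask for the codec of each column as it creates that column's page writer, instead of receiving one compressor for the whole file.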

Are these changes tested?

  • Tests are added in this PR.

Are there any user-facing changes?

Two new public APIs are introduced:

  ParquetWriter.builder(path)
      .withCompressionCodec(CompressionCodecName.SNAPPY)  // default for all columns
      .withCompressionCodec("embeddings", CompressionCodecName.UNCOMPRESSED)  // per-column override
      .build();

  Hadoop configuration (new key pattern):
  parquet.compression#<column_path>=<CODEC_NAME>
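
As a rough illustration of how keys of this shape could be read, the sketch below splits a flat key/value configuration into a file-wide default and per-column overrides. It is an assumption-laden example rather than the PR's ColumnConfigParser code: it uses a plain `Map` in place of a Hadoop `Configuration`, the class and method names are hypothetical, and falling back to UNCOMPRESSED when no default is set is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: reads "parquet.compression" and
// "parquet.compression#<column_path>" entries from a flat config map.
public class PerColumnCompressionKeys {
    static final String DEFAULT_KEY = "parquet.compression";
    static final String OVERRIDE_PREFIX = DEFAULT_KEY + "#";

    // Collects column path -> codec name for every per-column key.
    static Map<String, String> parseOverrides(Map<String, String> conf) {
        Map<String, String> overrides = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : conf.entrySet()) {
            String key = entry.getKey();
            if (!key.startsWith(OVERRIDE_PREFIX)) {
                continue; // unrelated key, or the file-wide default
            }
            String columnPath = key.substring(OVERRIDE_PREFIX.length());
            if (!columnPath.isEmpty()) {
                overrides.put(columnPath, normalize(entry.getValue()));
            }
        }
        return overrides;
    }

    // File-wide codec; UNCOMPRESSED fallback is an assumption of this sketch.
    static String defaultCodec(Map<String, String> conf) {
        return normalize(conf.getOrDefault(DEFAULT_KEY, "UNCOMPRESSED"));
    }

    // Codec names are compared case-insensitively here.
    private static String normalize(String codecName) {
        return codecName.trim().toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("parquet.compression", "snappy");
        conf.put("parquet.compression#embeddings", "uncompressed");
        System.out.println("default: " + defaultCodec(conf));
        System.out.println("overrides: " + parseOverrides(conf));
    }
}
```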

cc @julienledem @emkornfield

@rahil-c rahil-c marked this pull request as ready for review February 17, 2026 02:44
@rahil-c rahil-c changed the title from "[draft] Implement per column compression" to "Implement per column compression" Feb 17, 2026