
HG-3394: Cache FileStatus in Footer to reduce redundant NameNode RPC calls #3395

Open
wangyum wants to merge 1 commit into apache:master from wangyum:getFileStatus

wangyum commented Feb 16, 2026

Rationale for this change

When reading Parquet files from HDFS, getFileStatus() is called twice for each file:

  1. During footer reading in ParquetFileReader.readAllFootersInParallel()
  2. During split generation in ParquetInputFormat.getSplits()

The second call is a redundant NameNode RPC. For workloads that process thousands of files, this redundancy adds to NameNode load and job startup time.
This PR caches the FileStatus obtained during footer reading in the Footer object, so split generation can reuse it instead of issuing another RPC.

What changes are included in this PR?

  1. Footer.java: Added FileStatus field with backward-compatible constructors
  2. ParquetFileReader.java: Pass FileStatus when creating Footer objects
  3. ParquetInputFormat.java: Reuse cached FileStatus instead of calling fs.getFileStatus() again
  4. TestFooterFileStatusCaching.java: New test suite with 5 tests proving RPC reduction
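
For illustration only, the backward-compatible constructor pattern described in item 1 could look roughly like the sketch below. It does not reproduce the actual Footer.java from this PR: `StatusStub` and `FooterSketch` are hypothetical stand-ins (the real classes are `org.apache.hadoop.fs.FileStatus` and `Footer`), and `getCachedStatus()` is an assumed accessor name.

```java
import java.util.Optional;

// Hypothetical stand-in for org.apache.hadoop.fs.FileStatus (path and length only).
final class StatusStub {
    final String path;
    final long length;

    StatusStub(String path, long length) {
        this.path = path;
        this.length = length;
    }
}

// Hypothetical sketch of a footer that can optionally carry a cached status.
final class FooterSketch {
    private final String file;
    private final StatusStub status; // null when built through the legacy constructor

    // Legacy-style constructor: existing callers keep compiling; nothing is cached.
    FooterSketch(String file) {
        this(file, null);
    }

    // New constructor: the reader passes along the status it already fetched.
    FooterSketch(String file, StatusStub status) {
        this.file = file;
        this.status = status;
    }

    String getFile() {
        return file;
    }

    // Empty when no status was cached, so callers can fall back to a fresh lookup.
    Optional<StatusStub> getCachedStatus() {
        return Optional.ofNullable(status);
    }
}
```

Returning an `Optional` here is one possible design choice: it makes the "no cached status" case explicit for callers that still construct footers the old way.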

Are these changes tested?

Yes. This PR adds a test suite, TestFooterFileStatusCaching, with 5 test cases:

  • ✅ Footer stores and returns FileStatus correctly
  • ✅ ParquetFileReader passes FileStatus to Footer
  • ✅ Cached FileStatus is reused (saves 3 RPCs in test)
  • ✅ Complete workflow verification (saves 5 RPCs in test)
  • ✅ Backward compatibility verified
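
As a rough sketch of how an RPC saving of this kind can be measured, the example below counts status lookups with and without caching. It is not the code in TestFooterFileStatusCaching: `CountingFs`, `CachedFooter`, and `SplitPlanner` are hypothetical names, and the counter simply stands in for NameNode `getFileStatus()` RPCs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical file system stand-in: each length lookup counts as one RPC.
final class CountingFs {
    final AtomicInteger statusCalls = new AtomicInteger();

    long getFileLength(String path) {
        statusCalls.incrementAndGet();
        return path.length() * 100L; // arbitrary deterministic length
    }
}

// Hypothetical footer holding an optionally cached file length.
final class CachedFooter {
    final String path;
    final Long cachedLength; // null means nothing was cached

    CachedFooter(String path, Long cachedLength) {
        this.path = path;
        this.cachedLength = cachedLength;
    }
}

final class SplitPlanner {
    // Footer reading always needs one status lookup per file.
    static List<CachedFooter> readFooters(CountingFs fs, List<String> paths, boolean cache) {
        List<CachedFooter> footers = new ArrayList<>();
        for (String path : paths) {
            long length = fs.getFileLength(path);
            Long cached = cache ? Long.valueOf(length) : null;
            footers.add(new CachedFooter(path, cached));
        }
        return footers;
    }

    // Split planning reuses the cached value and only falls back to a lookup if absent.
    static long totalLength(CountingFs fs, List<CachedFooter> footers) {
        long total = 0;
        for (CachedFooter footer : footers) {
            if (footer.cachedLength != null) {
                total += footer.cachedLength;
            } else {
                total += fs.getFileLength(footer.path);
            }
        }
        return total;
    }
}
```

With three files, the uncached path performs six lookups (three while reading footers, three while planning splits) and the cached path performs three.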

Are there any user-facing changes?

No.

Closes #3394

wangyum commented Feb 16, 2026

cc @wgtmac
