[lake/iceberg] Add REST catalog cache #2622
Conversation
```java
/**
 * Config key for table cache TTL in milliseconds. After this duration, the cached table is
 * reloaded on next use. Set to 0 to disable TTL (cache never expires). Default: 5 minutes.
 */
public static final String TABLE_CACHE_TTL_MS_KEY = "iceberg.catalog.table-cache-ttl-ms";
```
I don't think this is actually needed anymore. In Iceberg 1.11.0 this is already solved natively in the REST catalog through freshness-aware table loading using ETags (apache/iceberg#14398). The Iceberg community is planning to release 1.11.0, with that feature included, within the next few weeks.
Yes, this issue will be resolved once Apache Iceberg 1.11.0 is released, though we are unsure of the exact timeline. Additionally, are there any plans to update Fluss to use Iceberg 1.11.0?
Once 1.11.0 is released I'll update it in Fluss right away. I work in the Iceberg community and we're planning to release 1.11.0 within the next few weeks.
Purpose
Querying the data lake table is currently much slower than querying remote storage directly. This PR adds per-task lazy caching of the Iceberg Catalog and Table inside IcebergLakeSource, so that createRecordReader reuses a single loadTable call across all lake splits in a Flink source task, eliminating O(splits) REST round-trips when using a REST catalog.
Before: N splits → N × (createCatalog + loadTable) → N REST calls per task.
After: N splits → 1 × (createCatalog + loadTable) on the first split, then N-1 cache hits → 1 REST loadTable per task. With TTL enabled, the cached table is reloaded after the TTL elapses, so table metadata changed externally is eventually picked up.
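The lazy-load-with-TTL behavior described above can be sketched as follows. This is a minimal illustration, not the actual Fluss implementation: the class name `TtlCache` and the `Supplier`-based loader are assumptions standing in for the real catalog/table plumbing in IcebergLakeSource.

```java
import java.util.function.Supplier;

/**
 * Hypothetical sketch: lazily loads a value on first use and reloads it
 * once the TTL has elapsed. A TTL of 0 disables expiry (load exactly once).
 */
class TtlCache<T> {
    private final long ttlMs;            // 0 => cache never expires
    private final Supplier<T> loader;    // e.g. () -> catalog.loadTable(id)
    private T cached;
    private long loadedAtMs;

    TtlCache(long ttlMs, Supplier<T> loader) {
        this.ttlMs = ttlMs;
        this.loader = loader;
    }

    synchronized T get() {
        long now = System.currentTimeMillis();
        boolean expired = cached != null && ttlMs > 0 && now - loadedAtMs >= ttlMs;
        if (cached == null || expired) {
            cached = loader.get();       // the single REST loadTable per task / TTL window
            loadedAtMs = now;
        }
        return cached;
    }
}
```

With this shape, N splits hitting `get()` in one task trigger exactly one load (until the TTL, if any, expires), which is the O(splits) → O(1) reduction the change log claims.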
Linked issue: close #2619
Brief change log
Tests
API and Format
Documentation