Table.scan(options=...) silently ignores S3 properties for FileIO during data materialization (to_pandas / to_arrow) #3166

@dongsupkim-onepredict

Description

Apache Iceberg version

0.11.0 (latest release)

Please describe the bug 🐞

Description:
When passing an options dictionary to Table.scan(options=...), the properties (such as s3.connect-timeout or s3.request-timeout) are accepted by the DataScan object but are never propagated to the underlying FileIO (e.g., PyArrowFileIO) when actual data materialization occurs via methods like to_pandas() or to_arrow().
Because ArrowScan is initialized with the FileIO created during catalog instantiation (table.io), any S3-specific configuration provided at the scan level is completely bypassed. As a result, scans that read numerous manifest files run with the AWS C++ SDK default timeouts (often 10-30 s), leading to unexpected curlCode: 28 (Timeout was reached) errors even when generous timeouts are explicitly requested in the scan options.

Steps to Reproduce:

# 1. Load catalog with default (or no) S3 timeout properties

from pyiceberg.catalog import load_catalog
catalog = load_catalog("my_catalog", **{
    "uri": "...",
    "s3.endpoint": "..."
})
table = catalog.load_table("my_namespace.my_table")

# 2. Attempt to scan with explicit S3 timeout options

scan_options = {
    "s3.connect-timeout": "600.0",
    "s3.request-timeout": "600.0"
}


# The options are accepted by DataScan...

scan = table.scan(options=scan_options)
# 3. ...but completely ignored during S3 I/O operations (ArrowScan)
# This may throw a timeout error if RGW/S3 latency spikes, ignoring the 600s setting above.
df = scan.to_pandas()
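Until scan-level options propagate, one workaround consistent with the behavior described above is to supply the s3.* timeout properties when loading the catalog, since properties present at catalog creation do reach the FileIO. This is only a sketch; the uri and endpoint values are placeholders:

```python
# Workaround sketch: pass the S3 timeouts at catalog load time instead of
# at scan time, so they are baked into the FileIO the catalog creates.
catalog_properties = {
    "uri": "...",                     # placeholder
    "s3.endpoint": "...",             # placeholder
    "s3.connect-timeout": "600.0",    # honored: present at FileIO creation
    "s3.request-timeout": "600.0",
}
# catalog = load_catalog("my_catalog", **catalog_properties)
```

This obviously applies the timeouts to every table loaded from that catalog, not just one scan, which is why scan-level propagation is still the desired fix.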

Expected Behavior:

Properties passed via options in Table.scan() should cascade down and either update or override the table.io.properties for the duration of the scan. Specifically, s3.* configurations should be respected by the underlying FileIO (e.g., PyArrowFileIO) when downloading manifest lists or data files.

Actual Behavior:

The options passed to Table.scan() are stored in the DataScan instance but are never passed to the ArrowScan class or the FileIO instance during to_arrow() / to_pandas().
The ArrowScan relies entirely on the unmodified self.io object originally initialized by the catalog:

# In pyiceberg/table/__init__.py -> DataScan.to_arrow()
return ArrowScan(
    self.table_metadata,
    self.io,  # <--- scan-level options are missing here!
    self.projection(),
    self.row_filter,
    self.case_sensitive,
    self.limit,
).to_table(self.plan_files())

Environment:

  • PyIceberg Version: 0.11.1 (and earlier)
  • PyArrow Version: 18.0.0
  • Storage: Ceph S3 / Rados Gateway (RGW)

Suggested Fix:

Ideally, DataScan should merge its options with self.io.properties and instantiate a new FileIO, or ArrowScan should be modified to accept the scan-level options and apply them dynamically to the FileSystem instance before reading files.
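The intended merge semantics can be sketched in plain Python (illustrative only, not pyiceberg's actual API; the property values below are made-up examples): scan-level options are layered over the catalog-level FileIO properties, with the scan-level values winning on conflict, and the merged dict would then be used to build or reconfigure the FileIO for the duration of the scan.

```python
def merge_scan_properties(io_properties: dict, scan_options: dict) -> dict:
    """Return FileIO properties with scan-level options taking precedence."""
    merged = dict(io_properties)   # start from catalog-level properties
    merged.update(scan_options)    # scan-level options override on conflict
    return merged

# Catalog created with a short default timeout...
io_properties = {"s3.endpoint": "http://rgw:7480", "s3.connect-timeout": "10.0"}
# ...overridden by the options passed to Table.scan()
scan_options = {"s3.connect-timeout": "600.0", "s3.request-timeout": "600.0"}

merged = merge_scan_properties(io_properties, scan_options)
# merged["s3.connect-timeout"] is now "600.0"; keys absent from the scan
# options (e.g. s3.endpoint) are preserved from the catalog configuration.
```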

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
