
SNOW-3237416: Support structured type schema string #4155

Open

sfc-gh-wshangguan wants to merge 14 commits into main from
wshangguan-SNOW-3237416-infer-schema-parquet-structured-type-snowpark-python
Conversation

@sfc-gh-wshangguan (Collaborator) commented Apr 2, 2026

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

SNOW-3237416

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
    • If adding any arguments to public Snowpark APIs or creating new public Snowpark APIs, I acknowledge that I have ensured my changes include AST support. Follow the link for more information: AST Support Guidelines
  3. Please describe how your code solves the related issue.

Summary

INFER_SCHEMA can now return structured type strings (for example, OBJECT(...), MAP(...), ARRAY(...)) for Parquet/JSON files with nested columns. The legacy parser in _infer_schema_for_file_format relied on simple string splitting and could fail on nested structured types.
This PR:

  • Adds a recursive structured type parser for OBJECT(...), MAP(...), ARRAY(...) with nesting and NOT NULL handling.
  • Adds safer fallback handling for bare structured keywords (OBJECT/MAP/ARRAY) by mapping them to VariantType and using $1:{name} (no cast) to preserve column access.
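As an illustration of the bare-keyword fallback described above, here is a minimal sketch; the helper name and quoting style are hypothetical, not the PR's exact code:

```python
# Hypothetical sketch of the bare-structured fallback: for a bare
# OBJECT/MAP/ARRAY keyword we skip the cast (casting to the bare keyword
# can fail) and keep plain column access; other inferred types still cast.
def select_expression(name: str, inferred_type: str) -> str:
    if inferred_type.upper() in ("OBJECT", "MAP", "ARRAY"):
        # maps to VariantType on the Snowpark side; no cast in SQL
        return f'$1:"{name}"'
    return f'$1:"{name}"::{inferred_type}'

print(select_expression("profile", "OBJECT"))    # $1:"profile"
print(select_expression("age", "NUMBER(3, 0)"))  # $1:"age"::NUMBER(3, 0)
```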

Problem

When INFER_SCHEMA returns nested structured strings like OBJECT(bio TEXT, social OBJECT(twitter TEXT, linkedin TEXT)), the old parser path can break because it assumes parenthesized values are numeric precision/scale.
Even when parsing succeeds, cast generation for structured strings can be brittle (for example, NOT NULL annotations inside structured casts).
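A minimal, hypothetical reproduction of that failure mode (not the actual legacy code, just the assumption it encodes):

```python
# The legacy path assumes the parenthesized part of an inferred type
# string holds numeric precision/scale, as in NUMBER(10, 2).
def legacy_params(type_str: str):
    base, _, rest = type_str.partition("(")
    return base, [int(p) for p in rest.rstrip(")").split(",")]

print(legacy_params("NUMBER(10, 2)"))  # ('NUMBER', [10, 2])

# A nested structured string violates that assumption and blows up.
try:
    legacy_params("OBJECT(bio TEXT, social OBJECT(twitter TEXT, linkedin TEXT))")
except ValueError:
    print("parse failed on nested structured type")
```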

Solution

New parser path (_parse_structured_type_str -> _sf_type_to_type_object):

  • Recursively parses nested structured type strings into Snowpark StructType / ArrayType / MapType with structured semantics.
  • Handles Snowflake metadata aliases (TEXT -> StringType, REAL -> DoubleType, FIXED -> LongType).
  • Falls back to convert_sf_to_sp_type for simple and parameterized scalar types (backward compatible).

Structured infer-schema identifier handling:

  • For structured types, preserves the raw inferred type text while stripping NOT NULL from the cast text.
  • For bare structured keywords (older backend behavior), returns VariantType and uses $1:{name} (no cast) to avoid failures and preserve column names.
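The recursion above can be sketched as follows. This is a simplified, hypothetical stand-in that returns tuples instead of Snowpark's StructType/ArrayType/MapType objects so it stays self-contained; the real parser lives in dataframe_reader.py:

```python
# Simplified sketch of the recursive structured-type parsing described
# above. Tuples stand in for Snowpark's StructType/ArrayType/MapType.
ALIASES = {"TEXT": "StringType", "REAL": "DoubleType", "FIXED": "LongType"}

def strip_not_null(s):
    s = s.strip()
    if s.upper().endswith("NOT NULL"):
        return s[: -len("NOT NULL")].rstrip(), False
    return s, True

def split_top_level(s):
    # split on commas only at parenthesis depth 0
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(s):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(s[start:i].strip())
            start = i + 1
    parts.append(s[start:].strip())
    return parts

def parse(type_str):
    type_str, _ = strip_not_null(type_str)
    base, _, rest = type_str.partition("(")
    base = base.strip().upper()
    inner = rest[:-1] if rest.endswith(")") else None
    if base == "OBJECT" and inner is not None:
        fields = []
        for field_def in split_top_level(inner):
            name, _, ftype = field_def.partition(" ")
            ftype, nullable = strip_not_null(ftype)
            fields.append((name, parse(ftype), nullable))
        return ("StructType", fields)
    if base == "ARRAY" and inner is not None:
        return ("ArrayType", parse(inner))
    if base == "MAP" and inner is not None:
        key, value = split_top_level(inner)
        return ("MapType", parse(key), parse(value))
    return ALIASES.get(base, base)  # scalar fallback

print(parse("OBJECT(bio TEXT, social OBJECT(twitter TEXT NOT NULL))"))
```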

Changes

Core implementation:

  • src/snowflake/snowpark/dataframe_reader.py
    • Added _extract_paren_content()
    • Added _sf_type_to_type_object()
    • Added _parse_structured_type_str()
    • Updated _infer_schema_for_file_format to use the structured parser path
    • Fixed format.lower -> format.lower()
    • Added bare-structured fallback behavior (VariantType + no-cast identifier)

Supporting wiring/tests:

  • src/snowflake/snowpark/session.py
    • Added session parameter _PYTHON_SNOWPARK_USE_STRUCTURED_TYPE_INFER_SCHEMA
  • tests/unit/test_dataframe_reader_type_parsing.py
    • Added parser and infer-schema coverage (nested structured, nullability, bare structured fallback, parquet/json path behavior)

Backward Compatibility

  • Simple scalar types (NUMBER, VARCHAR, BOOLEAN, etc.) continue to use existing conversion logic.
  • Legacy behavior remains available when structured infer-schema flag is disabled.
  • Non-structured infer-schema results are unaffected.

@github-actions


Thank you for your submission; we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a pull request comment in the same format as below.


I have read the CLA Document and I hereby sign the CLA


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

@sfc-gh-wshangguan changed the title from "(WIP) first attempt, still need parameter protection?" to "SNOW-3237416: Support structured type schema string" on Apr 11, 2026
@sfc-gh-wshangguan marked this pull request as ready for review on April 11, 2026 at 00:18
@sfc-gh-wshangguan force-pushed the wshangguan-SNOW-3237416-infer-schema-parquet-structured-type-snowpark-python branch from 8230e3c to d5773e7 on April 11, 2026 at 08:52
@sfc-gh-wshangguan force-pushed the wshangguan-SNOW-3237416-infer-schema-parquet-structured-type-snowpark-python branch from d5773e7 to 129d7ef on April 14, 2026 at 01:08
Comment on lines 98 to 107
from snowflake.snowpark.table import Table
from snowflake.snowpark.types import (
ArrayType,
DataType,
DecimalType,
DoubleType,
LongType,
MapType,
StringType,
StructType,
Contributor

Missing required imports for types used in the new code. StructField is used on line 219 but not imported. VariantType is used on lines 245, 257, and 1558 but not imported. This will cause NameError at runtime when these functions are called.

from snowflake.snowpark.types import (
    ArrayType,
    DataType,
    DecimalType,
    DoubleType,
    LongType,
    MapType,
    StringType,
    StructField,  # Add this
    StructType,
    VariantType,  # Add this
)
Suggested change
from snowflake.snowpark.table import Table
from snowflake.snowpark.types import (
ArrayType,
DataType,
DecimalType,
DoubleType,
LongType,
MapType,
StringType,
StructType,
from snowflake.snowpark.table import Table
from snowflake.snowpark.types import (
ArrayType,
DataType,
DecimalType,
DoubleType,
LongType,
MapType,
StringType,
StructField,
StructType,
VariantType,

Spotted by Graphite

depth -= 1
if depth == 0:
return base, type_str[paren_idx + 1 : i]
return None
Contributor

in reality if this line of code gets executed and depth != 0, does it mean we have a broken type_str (imbalanced paren) from backend?
in this case shall we error out explicitly?

Collaborator Author

Good suggestion. I'll raise an error for this case.
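A self-contained sketch of that agreed-upon behavior, raising on an imbalanced type string instead of silently returning None (names are illustrative, not the merged implementation):

```python
# Balanced-parenthesis extraction: return (base, inner) for "BASE(inner)",
# and raise if the backend hands us an imbalanced type string.
from typing import Tuple

def extract_paren_content(type_str: str) -> Tuple[str, str]:
    paren_idx = type_str.find("(")
    if paren_idx == -1:
        raise ValueError(f"no parenthesized content in {type_str!r}")
    base = type_str[:paren_idx].strip()
    depth = 0
    for i in range(paren_idx, len(type_str)):
        if type_str[i] == "(":
            depth += 1
        elif type_str[i] == ")":
            depth -= 1
            if depth == 0:
                return base, type_str[paren_idx + 1 : i]
    raise ValueError(f"imbalanced parentheses in {type_str!r}")

print(extract_paren_content("ARRAY(NUMBER(10, 2))"))  # ('ARRAY', 'NUMBER(10, 2)')
```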

Comment on lines +173 to +176
# Strip NOT NULL before parsing — the caller is responsible for using
# the nullable info. For top-level calls from ARRAY/MAP/OBJECT handlers
# below, NOT NULL is handled before recursing.
type_str, _ = extract_nullable_keyword(type_str)
Contributor

I'm wondering what would happen to the top level nullability, you mentioned some caller would handle it, do you know where that is?

Collaborator Author

There is actually another nullable column in the result of INFER_SCHEMA, which represents the top level nullability. And the existing snowpark python code (caller) handles the nullable column for top level.

# Columns for r [column_name, type, nullable, expression, filenames, order_id]
column_name, type, nullable, expression = r[0], r[1], r[2], r[3]

raise ValueError(f"'{type_str}' is not a supported type")

base, inner = result
base_upper = base.upper()
Contributor

do we think we can do type_str = type_str.lower() at the beginning of the function such that we can avoid calling lower()/upper() in the following up operations?

Collaborator Author

I might want to keep the current shape: type_str is reused as part of the cast identifier we hand to SQL, and the two normalizations (upper for keyword dispatch, lower+strip-spaces for the DATA_TYPE_STRING_OBJECT_MAPPINGS lookup) are different.

Comment on lines +210 to +212
field_def = field_def.strip()
if not field_def:
continue
Contributor

curious what is this case, is it something like ", , ," -- is this considered valid?

Collaborator Author

Changed to raise an exception for this case.

Comment on lines +244 to +245
if not type_str:
return VariantType()
Contributor

is this possible?

Collaborator Author

Probably not, just a defensive check.

Comment thread src/snowflake/snowpark/session.py Outdated
)

self._use_structured_type_infer_schema: bool = (
self._conn._get_client_side_session_parameter(
Contributor

is this feature available in prod? shall we have integ test for it?

Collaborator Author

It's currently disabled. QQ for the snowpark CI: does the test account have permission to alter Snowflake internal parameters? If so, we can add integ tests.

Comment thread src/snowflake/snowpark/session.py Outdated

self._use_structured_type_infer_schema: bool = (
self._conn._get_client_side_session_parameter(
_PYTHON_SNOWPARK_USE_STRUCTURED_TYPE_INFER_SCHEMA, False
Collaborator

do we enable this feature by:
alter session set PYTHON_SNOWPARK_USE_STRUCTURED_TYPE_INFER_SCHEMA=True ?
And like Adam said, can we add an integ test to test the end to end behavior, we can use mock if it is not available on prod

Collaborator Author

Actually no. The flag PYTHON_SNOWPARK_USE_STRUCTURED_TYPE_INFER_SCHEMA only guards the code change in Snowpark Python. Separate account-level parameters control whether the backend returns structured type schema strings. So, as mentioned in https://github.com/snowflakedb/snowpark-python/pull/4155/changes/BASE..129d7ef3c2cc04f98811eb63d3ae522b3d412d3c#r3103032414, we can add an integ test if the test account used in Snowpark is able to alter Snowflake non-public parameters.

Collaborator Author

Updated to another way to avoid the confusion here.

}


def _extract_paren_content(type_str: str) -> Optional[Tuple[str, str]]:
@sfc-gh-yuwang (Collaborator) commented Apr 16, 2026

i feel like added helper functions should be moved to snowflake/snowpark/_internal/type_utils.py


codecov-commenter commented Apr 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.43%. Comparing base (c35990f) to head (3c92300).
⚠️ Report is 9 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4155      +/-   ##
==========================================
+ Coverage   95.15%   95.43%   +0.28%     
==========================================
  Files         171      171              
  Lines       43587    43720     +133     
  Branches     7457     7488      +31     
==========================================
+ Hits        41476    41725     +249     
+ Misses       1291     1220      -71     
+ Partials      820      775      -45     


Labels: None yet

Projects: None yet

Development: Successfully merging this pull request may close these issues.

4 participants