SNOW-3237416: Support structured type schema string#4155
sfc-gh-wshangguan wants to merge 14 commits into main from
Conversation
I have read the CLA Document and I hereby sign the CLA. You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.
8230e3c to d5773e7
This reverts commit c5ab05de9c185ee004ddde5e6458615b39566686.
d5773e7 to 129d7ef
```python
from snowflake.snowpark.table import Table
from snowflake.snowpark.types import (
    ArrayType,
    DataType,
    DecimalType,
    DoubleType,
    LongType,
    MapType,
    StringType,
    StructType,
```
Missing required imports for types used in the new code. StructField is used on line 219 but not imported. VariantType is used on lines 245, 257, and 1558 but not imported. This will cause a NameError at runtime when these functions are called.
```python
from snowflake.snowpark.types import (
    ArrayType,
    DataType,
    DecimalType,
    DoubleType,
    LongType,
    MapType,
    StringType,
    StructField,  # Add this
    StructType,
    VariantType,  # Add this
)
```
Spotted by Graphite
```python
        depth -= 1
        if depth == 0:
            return base, type_str[paren_idx + 1 : i]
    return None
```
In reality, if this line of code gets executed and depth != 0, does that mean we have a broken type_str (imbalanced parentheses) from the backend?
In that case, shall we error out explicitly?
Good suggestion. I'll raise an error for this case.
```python
    # Strip NOT NULL before parsing — the caller is responsible for using
    # the nullable info. For top-level calls from ARRAY/MAP/OBJECT handlers
    # below, NOT NULL is handled before recursing.
    type_str, _ = extract_nullable_keyword(type_str)
```
I'm wondering what happens to the top-level nullability. You mentioned some caller would handle it; do you know where that is?
There is actually another nullable column in the result of INFER_SCHEMA, which represents the top-level nullability. The existing Snowpark Python code (the caller) handles that nullable column for the top level.
snowpark-python/src/snowflake/snowpark/dataframe_reader.py
Lines 1345 to 1346 in c5362e4
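A hypothetical sketch of what the extract_nullable_keyword helper used above could look like; the real signature and behavior in the codebase may differ:

```python
import re
from typing import Tuple

# Match a trailing NOT NULL only, so annotations nested inside
# OBJECT(...) field definitions are left untouched.
_NOT_NULL_RE = re.compile(r"\s+NOT\s+NULL\s*$", re.IGNORECASE)


def extract_nullable_keyword(type_str: str) -> Tuple[str, bool]:
    """Strip a trailing NOT NULL and report nullability.

    Returns (type_str_without_not_null, nullable). Illustrative only.
    """
    stripped = _NOT_NULL_RE.sub("", type_str)
    nullable = stripped == type_str  # no NOT NULL found => nullable
    return stripped.strip(), nullable
```

Because the regex is anchored at the end of the string, `"OBJECT(a TEXT NOT NULL)"` passes through unchanged while `"TEXT NOT NULL"` becomes `("TEXT", False)`.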
```python
        raise ValueError(f"'{type_str}' is not a supported type")
    # ...
    base, inner = result
    base_upper = base.upper()
```
Do we think we can do type_str = type_str.lower() at the beginning of the function, so we can avoid calling lower()/upper() in the follow-up operations?
I might want to keep the current shape: type_str is reused as part of the cast identifier we hand to SQL, and the two normalizations (upper for keyword dispatch, lower+strip-spaces for the DATA_TYPE_STRING_OBJECT_MAPPINGS lookup) are different.
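To illustrate the point, a small sketch of the two different normalizations (the variable names here are hypothetical):

```python
# Illustrative only: keyword dispatch and the mapping lookup normalize
# the same string differently, so a single lower() up front would not
# eliminate both calls.
type_str = "Decimal(10, 2)"

keyword = type_str.split("(", 1)[0].upper()      # "DECIMAL", for dispatch
lookup_key = type_str.lower().replace(" ", "")   # "decimal(10,2)", for the mapping lookup
```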
```python
        field_def = field_def.strip()
        if not field_def:
            continue
```
Curious what this case is. Is it something like ", , ," — and is that considered valid?
Changed to raise an exception for this case.
```python
    if not type_str:
        return VariantType()
```
Probably not, just a defensive check.
```python
        )
        # ...
        self._use_structured_type_infer_schema: bool = (
            self._conn._get_client_side_session_parameter(
```
Is this feature available in prod? Shall we have an integ test for it?
It's currently disabled for now. QQ for the Snowpark CI: does the test account have permission to alter Snowflake internal parameters? If so, we can add integ tests.
```python
        self._use_structured_type_infer_schema: bool = (
            self._conn._get_client_side_session_parameter(
                _PYTHON_SNOWPARK_USE_STRUCTURED_TYPE_INFER_SCHEMA, False
```
Do we enable this feature by running alter session set PYTHON_SNOWPARK_USE_STRUCTURED_TYPE_INFER_SCHEMA=True?
And like Adam said, can we add an integ test to cover the end-to-end behavior? We can use mock if it is not available in prod.
Actually no. The flag PYTHON_SNOWPARK_USE_STRUCTURED_TYPE_INFER_SCHEMA is just there to guard the code change in Snowpark Python. There are other account-level parameters that control whether the backend returns structured type schema strings. So, as mentioned in https://github.com/snowflakedb/snowpark-python/pull/4155/changes/BASE..129d7ef3c2cc04f98811eb63d3ae522b3d412d3c#r3103032414, we can add an integ test if the test account used in Snowpark is able to alter Snowflake non-public parameters.
Updated to another way to avoid the confusion here.
```python
}
# ...


def _extract_paren_content(type_str: str) -> Optional[Tuple[str, str]]:
```
I feel like the added helper functions should be moved to snowflake/snowpark/_internal/type_utils.py.
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #4155      +/-   ##
==========================================
+ Coverage   95.15%   95.43%   +0.28%
==========================================
  Files         171      171
  Lines       43587    43720     +133
  Branches     7457     7488      +31
==========================================
+ Hits        41476    41725     +249
+ Misses       1291     1220      -71
+ Partials      820      775      -45
```
SNOW-3237416
Summary

INFER_SCHEMA can now return structured type strings (for example, OBJECT(...), MAP(...), ARRAY(...)) for parquet/json files with nested columns. The legacy parser in _infer_schema_for_file_format was based on simple string splitting and could fail on nested structured types. This PR:

- Parses OBJECT(...), MAP(...), ARRAY(...) with nesting and NOT NULL handling.
- Handles unparsable structured strings (OBJECT/MAP/ARRAY) by mapping them to VariantType and using $1:{name} (no cast) to preserve column access.

Problem

When INFER_SCHEMA returns nested structured strings like OBJECT(bio TEXT, social OBJECT(twitter TEXT, linkedin TEXT)), the old parser path can break because it assumes parenthesized values are numeric precision/scale.

Even when parsing succeeds, cast generation for structured strings can be brittle (for example, NOT NULL annotations inside structured casts).

Solution

New parser path (_parse_structured_type_str -> _sf_type_to_type_object):

- Builds StructType/ArrayType/MapType with structured semantics.
- Maps scalar type names (TEXT -> StringType, REAL -> DoubleType, FIXED -> LongType).
- Falls back to convert_sf_to_sp_type for simple and parameterized scalar types (backward compatible).

Structured infer-schema identifier handling:

- Strips NOT NULL from cast text.
- On parse failure, falls back to VariantType and uses $1:{name} (no cast) to avoid failures and preserve column names.

Changes

Core implementation (src/snowflake/snowpark/dataframe_reader.py):

- _extract_paren_content()
- _sf_type_to_type_object()
- _parse_structured_type_str()
- _infer_schema_for_file_format updated to use the structured parser path
- format.lower -> format.lower() bug fix
- Fallback handling (VariantType + no-cast identifier)

Supporting wiring/tests:

- src/snowflake/snowpark/session.py: _PYTHON_SNOWPARK_USE_STRUCTURED_TYPE_INFER_SCHEMA
- tests/unit/test_dataframe_reader_type_parsing.py

Backward Compatibility

Scalar types (NUMBER, VARCHAR, BOOLEAN, etc.) continue to use existing conversion logic.
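The recursive parser path described above can be illustrated with a self-contained sketch. String tags stand in for snowpark's StringType/DoubleType/LongType/VariantType objects, and the names parse/split_top are hypothetical; the real implementation in dataframe_reader.py differs in detail:

```python
from typing import List

# String tags stand in for the snowpark type objects.
SCALARS = {"TEXT": "StringType", "REAL": "DoubleType", "FIXED": "LongType"}


def split_top(inner: str) -> List[str]:
    """Split a field list on commas at nesting depth 0."""
    parts, depth, start = [], 0, 0
    for i, c in enumerate(inner):
        if c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
        elif c == "," and depth == 0:
            parts.append(inner[start:i].strip())
            start = i + 1
    parts.append(inner[start:].strip())
    return parts


def parse(type_str: str):
    """Recursively parse OBJECT/MAP/ARRAY strings; assumes balanced parens."""
    type_str = type_str.strip()
    if type_str.upper().endswith("NOT NULL"):
        type_str = type_str[: -len("NOT NULL")].rstrip()
    if "(" not in type_str:
        return SCALARS.get(type_str.upper(), "VariantType")
    base, rest = type_str.split("(", 1)
    inner = rest[: rest.rfind(")")]
    base = base.strip().upper()
    if base == "ARRAY":
        return ("array", parse(inner))
    if base == "MAP":
        key_str, value_str = split_top(inner)
        return ("map", parse(key_str), parse(value_str))
    if base == "OBJECT":
        fields = {}
        for field_def in split_top(inner):
            name, field_type = field_def.split(" ", 1)
            fields[name] = parse(field_type)
        return ("object", fields)
    # Parameterized scalars like NUMBER(10, 2) would delegate to the
    # existing conversion logic; collapsed here for brevity.
    return SCALARS.get(base, "VariantType")
```

On the nested example from the Problem section, this yields a struct-of-struct shape rather than the parse failure the legacy string-splitting path could hit.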