SNOW-3237416: Support structured type schema string#4155
sfc-gh-wshangguan wants to merge 14 commits into main from
Conversation
I have read the CLA Document and I hereby sign the CLA. You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.
8230e3c to d5773e7
This reverts commit c5ab05de9c185ee004ddde5e6458615b39566686.
d5773e7 to 129d7ef
```python
from snowflake.snowpark.table import Table
from snowflake.snowpark.types import (
    ArrayType,
    DataType,
    DecimalType,
    DoubleType,
    LongType,
    MapType,
    StringType,
    StructType,
```
Missing required imports for types used in the new code. StructField is used on line 219 but not imported. VariantType is used on lines 245, 257, and 1558 but not imported. This will cause a NameError at runtime when these functions are called.
```python
from snowflake.snowpark.types import (
    ArrayType,
    DataType,
    DecimalType,
    DoubleType,
    LongType,
    MapType,
    StringType,
    StructField,  # Add this
    StructType,
    VariantType,  # Add this
)
```
Spotted by Graphite
```python
        depth -= 1
        if depth == 0:
            return base, type_str[paren_idx + 1 : i]
    return None
```
In reality, if this line of code gets executed and depth != 0, does that mean we have a broken type_str (imbalanced parentheses) from the backend?
In that case, shall we error out explicitly?
Good suggestion. I'll raise an error for this case.
```python
    # Strip NOT NULL before parsing — the caller is responsible for using
    # the nullable info. For top-level calls from ARRAY/MAP/OBJECT handlers
    # below, NOT NULL is handled before recursing.
    type_str, _ = extract_nullable_keyword(type_str)
```
I'm wondering what happens to the top-level nullability. You mentioned some caller would handle it; do you know where that is?
There is actually another nullable column in the result of INFER_SCHEMA, which represents the top-level nullability. The existing Snowpark Python code (the caller) handles that nullable column for the top level.
snowpark-python/src/snowflake/snowpark/dataframe_reader.py
Lines 1345 to 1346 in c5362e4
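A hypothetical sketch of what the extract_nullable_keyword helper used above could look like; the real signature and behavior in the codebase may differ:

```python
import re
from typing import Tuple

# Match a trailing NOT NULL only, so annotations nested inside
# OBJECT(...) field definitions are left untouched.
_NOT_NULL_RE = re.compile(r"\s+NOT\s+NULL\s*$", re.IGNORECASE)


def extract_nullable_keyword(type_str: str) -> Tuple[str, bool]:
    """Strip a trailing NOT NULL and report nullability.

    Returns (type_str_without_not_null, nullable). Illustrative only.
    """
    stripped = _NOT_NULL_RE.sub("", type_str)
    nullable = stripped == type_str  # no NOT NULL found => nullable
    return stripped.strip(), nullable
```

Because the regex is anchored at the end of the string, `"OBJECT(a TEXT NOT NULL)"` passes through unchanged while `"TEXT NOT NULL"` becomes `("TEXT", False)`.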
```python
        raise ValueError(f"'{type_str}' is not a supported type")
    # ...
    base, inner = result
    base_upper = base.upper()
```
Do we think we can do type_str = type_str.lower() at the beginning of the function, so we can avoid calling lower()/upper() in the follow-up operations?
I might want to keep the current shape: type_str is reused as part of the cast identifier we hand to SQL, and the two normalizations (upper for keyword dispatch, lower+strip-spaces for the DATA_TYPE_STRING_OBJECT_MAPPINGS lookup) are different.
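To illustrate the point, a small sketch of the two different normalizations (the variable names here are hypothetical):

```python
# Illustrative only: keyword dispatch and the mapping lookup normalize
# the same string differently, so a single lower() up front would not
# eliminate both calls.
type_str = "Decimal(10, 2)"

keyword = type_str.split("(", 1)[0].upper()      # "DECIMAL", for dispatch
lookup_key = type_str.lower().replace(" ", "")   # "decimal(10,2)", for the mapping lookup
```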
```python
        field_def = field_def.strip()
        if not field_def:
            continue
```
Curious what this case is. Is it something like ", , ," — and is that considered valid?
Changed to raise an exception for this case.
```python
    if not type_str:
        return VariantType()
```
Probably not, just a defensive check.
```python
        )
        # ...
        self._use_structured_type_infer_schema: bool = (
            self._conn._get_client_side_session_parameter(
```
Is this feature available in prod? Shall we have an integ test for it?
It's currently disabled for now. QQ for the Snowpark CI: does the test account have permission to alter Snowflake internal parameters? If so, we can add integ tests.
```python
        self._use_structured_type_infer_schema: bool = (
            self._conn._get_client_side_session_parameter(
                _PYTHON_SNOWPARK_USE_STRUCTURED_TYPE_INFER_SCHEMA, False
```
Do we enable this feature by running alter session set PYTHON_SNOWPARK_USE_STRUCTURED_TYPE_INFER_SCHEMA=True?
And like Adam said, can we add an integ test to cover the end-to-end behavior? We can use mock if it is not available in prod.
Actually no. The flag PYTHON_SNOWPARK_USE_STRUCTURED_TYPE_INFER_SCHEMA is just there to guard the code change in Snowpark Python. There are other account-level parameters that control whether the backend returns structured type schema strings. So, as mentioned in https://github.com/snowflakedb/snowpark-python/pull/4155/changes/BASE..129d7ef3c2cc04f98811eb63d3ae522b3d412d3c#r3103032414, we can add an integ test if the test account used in Snowpark is able to alter Snowflake non-public parameters.
Updated to another way to avoid the confusion here.
```python
}
# ...


def _extract_paren_content(type_str: str) -> Optional[Tuple[str, str]]:
```
I feel like the added helper functions should be moved to snowflake/snowpark/_internal/type_utils.py.
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #4155      +/-   ##
==========================================
+ Coverage   95.15%   95.43%   +0.28%
==========================================
  Files         171      171
  Lines       43587    43720     +133
  Branches     7457     7488      +31
==========================================
+ Hits        41476    41725     +249
+ Misses       1291     1220      -71
+ Partials      820      775      -45
```
SNOW-3237416
Summary

INFER_SCHEMA can now return structured type strings (for example, OBJECT(...), MAP(...), ARRAY(...)) for parquet/json files with nested columns. The legacy parser in _infer_schema_for_file_format was based on simple string splitting and could fail on nested structured types. This PR:

- Parses OBJECT(...), MAP(...), ARRAY(...) with nesting and NOT NULL handling.
- Handles unparsable structured strings (OBJECT/MAP/ARRAY) by mapping them to VariantType and using $1:{name} (no cast) to preserve column access.

Problem

When INFER_SCHEMA returns nested structured strings like OBJECT(bio TEXT, social OBJECT(twitter TEXT, linkedin TEXT)), the old parser path can break because it assumes parenthesized values are numeric precision/scale.

Even when parsing succeeds, cast generation for structured strings can be brittle (for example, NOT NULL annotations inside structured casts).

Solution

New parser path (_parse_structured_type_str -> _sf_type_to_type_object):

- Builds StructType/ArrayType/MapType with structured semantics.
- Maps scalar type names (TEXT -> StringType, REAL -> DoubleType, FIXED -> LongType).
- Falls back to convert_sf_to_sp_type for simple and parameterized scalar types (backward compatible).

Structured infer-schema identifier handling:

- Strips NOT NULL from cast text.
- On parse failure, falls back to VariantType and uses $1:{name} (no cast) to avoid failures and preserve column names.

Changes

Core implementation (src/snowflake/snowpark/dataframe_reader.py):

- _extract_paren_content()
- _sf_type_to_type_object()
- _parse_structured_type_str()
- _infer_schema_for_file_format updated to use the structured parser path
- format.lower -> format.lower() bug fix
- Fallback handling (VariantType + no-cast identifier)

Supporting wiring/tests:

- src/snowflake/snowpark/session.py: _PYTHON_SNOWPARK_USE_STRUCTURED_TYPE_INFER_SCHEMA
- tests/unit/test_dataframe_reader_type_parsing.py

Backward Compatibility

Scalar types (NUMBER, VARCHAR, BOOLEAN, etc.) continue to use existing conversion logic.
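The recursive parser path described above can be illustrated with a self-contained sketch. String tags stand in for snowpark's StringType/DoubleType/LongType/VariantType objects, and the names parse/split_top are hypothetical; the real implementation in dataframe_reader.py differs in detail:

```python
from typing import List

# String tags stand in for the snowpark type objects.
SCALARS = {"TEXT": "StringType", "REAL": "DoubleType", "FIXED": "LongType"}


def split_top(inner: str) -> List[str]:
    """Split a field list on commas at nesting depth 0."""
    parts, depth, start = [], 0, 0
    for i, c in enumerate(inner):
        if c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
        elif c == "," and depth == 0:
            parts.append(inner[start:i].strip())
            start = i + 1
    parts.append(inner[start:].strip())
    return parts


def parse(type_str: str):
    """Recursively parse OBJECT/MAP/ARRAY strings; assumes balanced parens."""
    type_str = type_str.strip()
    if type_str.upper().endswith("NOT NULL"):
        type_str = type_str[: -len("NOT NULL")].rstrip()
    if "(" not in type_str:
        return SCALARS.get(type_str.upper(), "VariantType")
    base, rest = type_str.split("(", 1)
    inner = rest[: rest.rfind(")")]
    base = base.strip().upper()
    if base == "ARRAY":
        return ("array", parse(inner))
    if base == "MAP":
        key_str, value_str = split_top(inner)
        return ("map", parse(key_str), parse(value_str))
    if base == "OBJECT":
        fields = {}
        for field_def in split_top(inner):
            name, field_type = field_def.split(" ", 1)
            fields[name] = parse(field_type)
        return ("object", fields)
    # Parameterized scalars like NUMBER(10, 2) would delegate to the
    # existing conversion logic; collapsed here for brevity.
    return SCALARS.get(base, "VariantType")
```

On the nested example from the Problem section, this yields a struct-of-struct shape rather than the parse failure the legacy string-splitting path could hit.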