Add identifier expression indexes for high-cardinality dataset types #66
Merged
- SqlAlchemySessionProvider.create_identifier_indexes(): creates composite expression indexes on identifier JSONB keys (Postgres only, IF NOT EXISTS)
- DatasetStore.create_indexes(): delegates to repository, configured via identifier_index_configs from dataset_types config
- `ingestify sync-indexes` CLI command to trigger index creation explicitly (never automatic to avoid locking large tables)
- identifier_index: true option in dataset_types config
- test-postgres job in test.yml with Postgres 15 service
- IdentifierTransformer now stores and returns declared key_type per key
- register_transformation() accepts optional key_type ('str' or 'int')
- Repository query building uses declared key_type for JSONB cast instead
of inferring from Python value type at runtime
- create_identifier_indexes() generates typed expressions:
(identifier->>'key') for str, ((identifier->>'key')::integer) for int
- main.py passes key_type from config to both transformer and index configs
- Tests updated to use new dict key format {name, key_type}
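The typed expressions described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the function name `identifier_expression`, the example key names, and the name validation are assumptions; only the two expression shapes and the `{name, key_type}` dict format come from the commit message.

```python
import re
from typing import Dict

# Hypothetical guard: only allow simple key names, since the key is
# interpolated into SQL text rather than passed as a bind parameter.
_SAFE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def identifier_expression(key: Dict[str, str]) -> str:
    """Build the SQL expression for one identifier key.

    `key` uses the dict format {name, key_type}; key_type is 'str' or 'int'.
    """
    name = key["name"]
    key_type = key["key_type"]
    if not _SAFE_NAME.fullmatch(name):
        raise ValueError(f"unsafe identifier key: {name!r}")
    if key_type == "str":
        # ->> returns text, so no cast is needed for string keys.
        return f"(identifier->>'{name}')"
    if key_type == "int":
        # Cast the extracted text to integer for int keys.
        return f"((identifier->>'{name}')::integer)"
    raise ValueError(f"unsupported key_type: {key_type!r}")


# Example keys are made up for illustration.
print(identifier_expression({"name": "season_id", "key_type": "str"}))
print(identifier_expression({"name": "match_id", "key_type": "int"}))
```

Using the same function for both query building and index creation would keep the two expressions textually identical, which Postgres requires before it will consider an expression index for a query.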
Limit each index to a single dataset_type by making it a partial index, so the index is smaller and dataset_type is an implicit condition of the index rather than a post-scan filter.
Two different providers can share the same dataset_type name, so the partial index WHERE clause now matches both provider and dataset_type. Index name uses provider_dataset_type to avoid collisions.
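A partial index scoped to one provider and dataset_type might be generated roughly as below. This is a sketch under stated assumptions: the table name `dataset`, the column names `provider` and `dataset_type`, the `idx_` name prefix, and the helper names are all hypothetical; the source only states that the WHERE clause matches both values and that the index name is built from provider and dataset_type.

```python
import re
from typing import Dict, List

# Hypothetical guard: values are interpolated into DDL text, so restrict
# them to simple names.
_SAFE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _checked(name: str) -> str:
    if not _SAFE_NAME.fullmatch(name):
        raise ValueError(f"unsafe name: {name!r}")
    return name


def _expression(key: Dict[str, str]) -> str:
    # Typed expression per key: text for 'str', integer cast for 'int'.
    name = _checked(key["name"])
    if key["key_type"] == "int":
        return f"((identifier->>'{name}')::integer)"
    return f"(identifier->>'{name}')"


def build_partial_index_ddl(
    provider: str,
    dataset_type: str,
    keys: List[Dict[str, str]],
    table: str = "dataset",  # assumed table name
) -> str:
    provider = _checked(provider)
    dataset_type = _checked(dataset_type)
    # Include both provider and dataset_type in the name so two providers
    # sharing a dataset_type name do not collide.
    index_name = f"idx_{table}_{provider}_{dataset_type}"
    columns = ", ".join(_expression(k) for k in keys)
    return (
        f"CREATE INDEX IF NOT EXISTS {index_name} "
        f"ON {table} ({columns}) "
        f"WHERE provider = '{provider}' AND dataset_type = '{dataset_type}'"
    )


# Provider, dataset type, and keys below are made-up example values.
print(
    build_partial_index_ddl(
        "statsbomb",
        "match",
        [
            {"name": "competition_id", "key_type": "int"},
            {"name": "season_id", "key_type": "str"},
        ],
    )
)
```

For the planner to use such an index, a query has to filter on the same provider and dataset_type values that appear in the index's WHERE clause.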