Feature/string bucket transformation#69

Closed
koenvo wants to merge 7 commits into main from feature/string-bucket-transformation

@koenvo koenvo commented Apr 7, 2026

No description provided.

koenvo added 7 commits April 5, 2026 21:22
Sources can now wrap a loader function in BatchLoader(loader_fn, batch_size)
and share the instance across DatasetResources that should be batched
together. Ingestify groups those resources, chunks them into groups of
batch_size, wraps each chunk in a BatchTask, and calls the loader_fn once
per batch with lists of file_resources / current_files / dataset_resources.
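The grouping and chunking described above could look roughly like the following. This is a minimal sketch, not Ingestify's actual code: `Resource`, `chunked`, and `run_batches` are hypothetical stand-ins, and the real `BatchTask` wrapper is reduced to a direct call of `loader_fn`.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Optional


class BatchLoader:
    """Simplified stand-in: pairs a loader function with a batch size."""

    def __init__(self, loader_fn: Callable, batch_size: int):
        self.loader_fn = loader_fn
        self.batch_size = batch_size


@dataclass
class Resource:
    """Hypothetical stand-in for a DatasetResource holding one file."""
    file_resource: str
    loader: BatchLoader
    current_file: Optional[str] = None


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_batches(resources):
    # Resources that share the same BatchLoader instance are batched together.
    groups = defaultdict(list)
    for resource in resources:
        groups[id(resource.loader)].append(resource)

    results = []
    for group in groups.values():
        loader = group[0].loader
        for chunk in chunked(group, loader.batch_size):
            # One loader_fn call per batch, with parallel lists as arguments.
            results.append(loader.loader_fn(
                [r.file_resource for r in chunk],
                [r.current_file for r in chunk],
                chunk,
            ))
    return results
```

With five resources sharing one loader and `batch_size=2`, the loader function in this sketch is called three times, with two, two, and one resources.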

load_file() now passes dataset_resource to loaders that accept it
(signature introspection with lru_cache, so existing loaders continue to
work without changes).
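One way to do cached signature introspection is sketched below. The helper names `accepts_dataset_resource` and `call_loader` are hypothetical, and the positional `(file_resource, current_file)` loader signature is an assumption made for illustration.

```python
import inspect
from functools import lru_cache


@lru_cache(maxsize=None)
def accepts_dataset_resource(loader_fn) -> bool:
    """Return True if loader_fn can receive a dataset_resource keyword."""
    params = inspect.signature(loader_fn).parameters
    if "dataset_resource" in params:
        return True
    # A **kwargs parameter also accepts the extra keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def call_loader(loader_fn, file_resource, current_file, dataset_resource):
    # Only pass dataset_resource to loaders that declare it, so loaders
    # written against the older signature keep working unchanged.
    if accepts_dataset_resource(loader_fn):
        return loader_fn(file_resource, current_file,
                         dataset_resource=dataset_resource)
    return loader_fn(file_resource, current_file)
```

Because functions are hashable, `lru_cache` inspects each loader only once, however many files it loads.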

Different BatchTasks sharing the same loader write and read different
keys (id(file_resource)); CPython dict operations on distinct keys
are atomic under the GIL.
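The pattern of several tasks writing to one shared dict under distinct `id(...)` keys can be illustrated as follows. This is a sketch under assumptions: `FileResource` and `load_in_threads` are hypothetical, and the real loader's bookkeeping may differ.

```python
from concurrent.futures import ThreadPoolExecutor


class FileResource:
    """Hypothetical stand-in for a file resource."""

    def __init__(self, name: str):
        self.name = name


def load_in_threads(batches):
    results = {}  # shared by all worker threads

    def run_batch(batch):
        for file_resource in batch:
            # Each thread only touches keys derived from its own objects,
            # so no two threads ever write the same key.
            results[id(file_resource)] = file_resource.name.upper()

    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(run_batch, batches))  # list() waits for completion
    return results
```

The objects must stay alive while their `id()` values are used as keys, since CPython may reuse an id after an object is garbage collected.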

Uses an MD5 hash for stable distribution when the value cannot be cast to
int. Integer values continue to use direct modulo.
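A minimal sketch of this fallback, assuming a hypothetical function `bucket_for` and a caller-supplied bucket count (neither name comes from the PR):

```python
import hashlib


def bucket_for(value, n_buckets: int) -> int:
    """Map a value to a bucket index in [0, n_buckets)."""
    try:
        # Values that can be cast to int use direct modulo.
        return int(value) % n_buckets
    except (TypeError, ValueError):
        # Anything else is hashed with MD5, which gives the same result
        # in every process and on every machine.
        digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
        return int(digest, 16) % n_buckets
```

Python's built-in `hash()` would not be suitable here, because string hashes are randomized per process unless `PYTHONHASHSEED` is fixed, so the same value could land in different buckets across runs.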

Prevents special characters, spaces, $, Unicode, etc. from causing
issues in GCS/S3 paths. Simple values like integers stay readable.

Transliterates Unicode (ü→u, é→e), strips non-alphanumeric characters,
and takes the first N characters. Falls back to '_' for empty results
(e.g. all-special-character or CJK-only strings).
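The steps above can be approximated with the standard library alone. This is a sketch under assumptions: the function name `safe_prefix` is hypothetical, and NFKD normalization is one possible way to transliterate, not necessarily the one used in the PR.

```python
import unicodedata


def safe_prefix(value, length: int = 3) -> str:
    """Return a path-safe ASCII prefix of `value`, or '_' if nothing remains."""
    # NFKD splits accented letters into base letter + combining mark (ü -> u + ¨).
    decomposed = unicodedata.normalize("NFKD", str(value))
    # Dropping non-ASCII code points removes the marks and scripts such as CJK.
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Keep only letters and digits, then take the first `length` characters.
    cleaned = "".join(ch for ch in ascii_text if ch.isalnum())
    return cleaned[:length] or "_"
```

Letters without a Unicode decomposition, such as ø or ß, are dropped rather than transliterated by this approach, so a dedicated transliteration library may behave differently.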
@koenvo koenvo closed this Apr 7, 2026