I have successfully implemented the variant(array(json)) benchmarking approach for the 100M dataset as requested. This approach stores all 100M JSON objects as a single array in one row, providing a unique way to benchmark JSON array performance.
- Database:
bluesky_100m_variant_array - Table:
bluesky_array_data - Schema:
data Variant(Array(JSON)) - Storage Pattern: 1 row containing an array of 100M JSON objects
| Approach | Rows | Columns | JSON Storage |
|---|---|---|---|
| JSON Baseline | 100M | 1 | Individual JSON objects |
| Variant Direct | 100M | 1 | Individual Variant(JSON) |
| Variant Array | 1 | 1 | Single Array[100M JSON objects] |
- Added
variant_arrayapproach to the benchmark suite - Implemented
load_data_variant_array()method - Added
create_variant_array_queries_100m()method - Enhanced error handling and user guidance
-- Q1: Count by kind using array element access
SELECT toString(arrayElement(data.Array, i).kind) as kind, count()
FROM bluesky_100m_variant_array.bluesky_array_data
ARRAY JOIN arrayEnumerate(data.Array) AS i
GROUP BY kind ORDER BY count() DESC;
-- Q2-Q5: Similar pattern with different field access- Validation script using ClickHouse local mode
- Tests schema creation and query syntax
- Verifies array access functionality
- Preparation: Reads
bluesky_100m_combined.jsonl - Aggregation: Combines all JSON objects into a single array
- Storage: Creates
bluesky_100m_array.jsonwith structure:{"data": [json1, json2, ..., json100000000]} - Loading: Inserts as single row using JSONEachRow format
All queries use the same pattern:
SELECT /* fields */
FROM bluesky_100m_variant_array.bluesky_array_data
ARRAY JOIN arrayEnumerate(data.Array) AS i
WHERE /* conditions using arrayElement(data.Array, i).field */SELECT toString(arrayElement(data.Array, i).kind) as kind, count()
FROM bluesky_100m_variant_array.bluesky_array_data
ARRAY JOIN arrayEnumerate(data.Array) AS i
GROUP BY kind ORDER BY count() DESC;SELECT toString(arrayElement(data.Array, i).commit.collection) as collection, count()
FROM bluesky_100m_variant_array.bluesky_array_data
ARRAY JOIN arrayEnumerate(data.Array) AS i
WHERE collection != ''
GROUP BY collection ORDER BY count() DESC LIMIT 10;SELECT count()
FROM bluesky_100m_variant_array.bluesky_array_data
ARRAY JOIN arrayEnumerate(data.Array) AS i
WHERE toString(arrayElement(data.Array, i).kind) = 'commit';SELECT count()
FROM bluesky_100m_variant_array.bluesky_array_data
ARRAY JOIN arrayEnumerate(data.Array) AS i
WHERE toUInt64(arrayElement(data.Array, i).time_us) > 1700000000000000;SELECT toString(arrayElement(data.Array, i).commit.operation) as op,
toString(arrayElement(data.Array, i).commit.collection) as coll, count()
FROM bluesky_100m_variant_array.bluesky_array_data
ARRAY JOIN arrayEnumerate(data.Array) AS i
WHERE op != '' AND coll != ''
GROUP BY op, coll ORDER BY count() DESC LIMIT 5;# After starting ClickHouse server
python3 benchmark_100m.py# Validate with sample data
python3 test_variant_array_fixed.py# Option 1: Local server
export TZ=UTC && cd clickhouse && ./clickhouse server --daemon
# Option 2: System-wide installation
sudo apt-get install clickhouse-server clickhouse-client
sudo systemctl start clickhouse-server- Storage Efficiency: Single row reduces metadata overhead
- Memory Locality: All data in one contiguous structure
- Array Operations: Native ClickHouse array functions
- Memory Usage: Requires loading entire 100M array for queries
- Query Pattern: All operations require ARRAY JOIN
- Scalability: Single-threaded processing for array operations
✅ Complete: All components implemented and ready for benchmarking ✅ Tested: Schema and query syntax validated ✅ Documented: Full implementation details provided ✅ Integrated: Added to existing benchmark suite
The variant(array(json)) approach is now ready to benchmark against the existing JSON Object baseline and Variant Direct approaches, providing insights into how ClickHouse handles massive JSON arrays versus individual JSON records.