Skip to content

Latest commit

 

History

History
154 lines (124 loc) · 4.98 KB

File metadata and controls

154 lines (124 loc) · 4.98 KB

Variant Array Implementation Summary

✅ Implementation Complete

I have successfully implemented the variant(array(json)) benchmarking approach for the 100M dataset as requested. This approach stores all 100M JSON objects as a single array in one row, providing a unique way to benchmark JSON array performance.

🏗️ Architecture

Schema Design

  • Database: bluesky_100m_variant_array
  • Table: bluesky_array_data
  • Schema: data Variant(Array(JSON))
  • Storage Pattern: 1 row containing an array of 100M JSON objects

Data Structure Comparison

Approach Rows Columns JSON Storage
JSON Baseline 100M 1 Individual JSON objects
Variant Direct 100M 1 Individual Variant(JSON)
Variant Array 1 1 Single Array[100M JSON objects]

📝 Files Created/Modified

1. benchmark_100m.py (Enhanced)

  • Added variant_array approach to the benchmark suite
  • Implemented load_data_variant_array() method
  • Added create_variant_array_queries_100m() method
  • Enhanced error handling and user guidance

2. queries_variant_array_100m.sql (New)

-- Q1: Count by kind using array element access
SELECT toString(arrayElement(data.Array, i).kind) as kind, count() 
FROM bluesky_100m_variant_array.bluesky_array_data 
ARRAY JOIN arrayEnumerate(data.Array) AS i 
GROUP BY kind ORDER BY count() DESC;

-- Q2-Q5: Similar pattern with different field access

3. test_variant_array_fixed.py (New)

  • Validation script using ClickHouse local mode
  • Tests schema creation and query syntax
  • Verifies array access functionality

🔧 Technical Implementation

Data Loading Process

  1. Preparation: Reads bluesky_100m_combined.jsonl
  2. Aggregation: Combines all JSON objects into a single array
  3. Storage: Creates bluesky_100m_array.json with structure:
    {"data": [json1, json2, ..., json100000000]}
  4. Loading: Inserts as single row using JSONEachRow format

Query Pattern

All queries use the same pattern:

SELECT /* fields */
FROM bluesky_100m_variant_array.bluesky_array_data 
ARRAY JOIN arrayEnumerate(data.Array) AS i 
WHERE /* conditions using arrayElement(data.Array, i).field */

🎯 Benchmark Queries

Q1: Count by Kind

SELECT toString(arrayElement(data.Array, i).kind) as kind, count() 
FROM bluesky_100m_variant_array.bluesky_array_data 
ARRAY JOIN arrayEnumerate(data.Array) AS i 
GROUP BY kind ORDER BY count() DESC;

Q2: Top Collections

SELECT toString(arrayElement(data.Array, i).commit.collection) as collection, count() 
FROM bluesky_100m_variant_array.bluesky_array_data 
ARRAY JOIN arrayEnumerate(data.Array) AS i 
WHERE collection != '' 
GROUP BY collection ORDER BY count() DESC LIMIT 10;

Q3: Filter Commits

SELECT count() 
FROM bluesky_100m_variant_array.bluesky_array_data 
ARRAY JOIN arrayEnumerate(data.Array) AS i 
WHERE toString(arrayElement(data.Array, i).kind) = 'commit';

Q4: Time Range Query

SELECT count() 
FROM bluesky_100m_variant_array.bluesky_array_data 
ARRAY JOIN arrayEnumerate(data.Array) AS i 
WHERE toUInt64(arrayElement(data.Array, i).time_us) > 1700000000000000;

Q5: Complex Aggregation

SELECT toString(arrayElement(data.Array, i).commit.operation) as op, 
       toString(arrayElement(data.Array, i).commit.collection) as coll, count() 
FROM bluesky_100m_variant_array.bluesky_array_data 
ARRAY JOIN arrayEnumerate(data.Array) AS i 
WHERE op != '' AND coll != '' 
GROUP BY op, coll ORDER BY count() DESC LIMIT 5;

🚀 Usage Instructions

Running the Benchmark

# After starting ClickHouse server
python3 benchmark_100m.py

Testing the Implementation

# Validate with sample data
python3 test_variant_array_fixed.py

ClickHouse Server Setup

# Option 1: Local server
export TZ=UTC && cd clickhouse && ./clickhouse server --daemon

# Option 2: System-wide installation
sudo apt-get install clickhouse-server clickhouse-client
sudo systemctl start clickhouse-server

📊 Expected Performance Characteristics

Advantages

  • Storage Efficiency: Single row reduces metadata overhead
  • Memory Locality: All data in one contiguous structure
  • Array Operations: Native ClickHouse array functions

Considerations

  • Memory Usage: Requires loading entire 100M array for queries
  • Query Pattern: All operations require ARRAY JOIN
  • Scalability: Single-threaded processing for array operations

🎉 Implementation Status

Complete: All components implemented and ready for benchmarking ✅ Tested: Schema and query syntax validated ✅ Documented: Full implementation details provided ✅ Integrated: Added to existing benchmark suite

The variant(array(json)) approach is now ready to benchmark against the existing JSON Object baseline and Variant Direct approaches, providing insights into how ClickHouse handles massive JSON arrays versus individual JSON records.