The Unbearable Weight of Massive Time-Series Data
Working with IoT telemetry data from 50,000 connected vehicles at Harman was an incredible lesson in the physics of big data. You start by thinking about algorithms, but you spend all your time thinking about infrastructure.
The Challenge
Every vehicle pushed metrics (speed, CAN bus anomalies, engine temps) every second. That added up to over 1.2TB of raw JSON hitting our AWS S3 buckets every single day. Our initial Pandas-based jobs crashed immediately with out-of-memory (OOM) exceptions.
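For context, each record was a small, flat JSON document, one per vehicle per second. The sketch below shows roughly what we were dealing with; the field names are illustrative stand-ins for the metrics above, not our production schema. Declaring the schema explicitly also matters at this scale, since it spares Spark a schema-inference scan over millions of files.

```python
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType,
    TimestampType, IntegerType,
)

# Illustrative schema for one per-second telemetry record.
# Field names are hypothetical stand-ins for the metrics described above.
telemetry_schema = StructType([
    StructField("vehicle_id", StringType(), nullable=False),
    StructField("event_time", TimestampType(), nullable=False),
    StructField("speed_kph", DoubleType(), nullable=True),
    StructField("engine_temp_c", DoubleType(), nullable=True),
    StructField("can_anomaly_code", IntegerType(), nullable=True),
])
```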
Shifting to PySpark
We transitioned the entire pipeline to Apache Spark running on AWS EMR.
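The first cut of the Spark job was essentially a direct port of the Pandas logic: point Spark at the raw bucket and read. A minimal sketch, assuming the bucket path is a placeholder and `telemetry_schema` comes from the earlier sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("vehicle-telemetry-etl").getOrCreate()

# Naive first pass: read the raw Firehose JSON drops directly.
# The explicit schema avoids a costly inference scan over every file,
# but the read is still dominated by the sheer number of tiny objects --
# the problem described next.
raw = spark.read.schema(telemetry_schema).json("s3://my-telemetry-bucket/raw/")
```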
The most critical optimization was dealing with the "small files problem." AWS Firehose was dumping thousands of tiny JSON files into S3 every minute, and at that granularity Spark spends more time listing and opening files than actually processing the data inside them.
We wrote an initial compaction routine that consolidated these tiny JSON files into daily-partitioned Parquet files with Snappy compression, along the lines of the sketch below.
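This is a minimal sketch of that routine, not our production code. The bucket layout, the target of 64 output files per day, and the `telemetry_schema` variable are assumptions carried over from the earlier sketches; in practice the file count is tuned so each Parquet file lands around 128 MB.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("telemetry-compaction").getOrCreate()

# Read one day's worth of tiny Firehose JSON drops.
# Path layout and schema are assumptions from the sketches above.
raw = (
    spark.read
    .schema(telemetry_schema)
    .json("s3://my-telemetry-bucket/raw/2020/06/01/")
)

# Derive a date column to partition the output by.
daily = raw.withColumn("event_date", F.to_date("event_time"))

# Collapse thousands of tiny input splits into a handful of large output
# files, then write daily-partitioned Parquet. Snappy is Spark's default
# Parquet codec; we set it explicitly for clarity.
(
    daily
    .repartition(64, "event_date")
    .write
    .mode("overwrite")
    .partitionBy("event_date")
    .option("compression", "snappy")
    .parquet("s3://my-telemetry-bucket/compacted/")
)
```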
The Results
By shifting from JSON to partitioned Parquet, and leveraging PySpark's lazy evaluation, we reduced our daily ETL window from 4 hours to just under 2 minutes. This meant our downstream XGBoost and CNN models could train on perfectly clean, windowed data every afternoon instead of in the middle of the night.
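Lazy evaluation also pays off on the read side: because the Parquet output is partitioned by date, a training job that filters on the partition column never touches the files it doesn't need. A hedged sketch, with paths and column names carried over from the sketches above and an illustrative 60-second window:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("telemetry-features").getOrCreate()

# Nothing is read yet -- Spark only builds a query plan here.
telemetry = spark.read.parquet("s3://my-telemetry-bucket/compacted/")

# Filtering on the partition column lets Spark prune whole directories,
# so only the requested day's Parquet files are ever opened.
one_day = telemetry.filter(F.col("event_date") == "2020-06-01")

# Example windowed feature: average speed per vehicle over 60-second
# windows -- the kind of clean, windowed input the downstream models
# trained on. The window length is illustrative.
features = (
    one_day
    .groupBy("vehicle_id", F.window("event_time", "60 seconds"))
    .agg(F.avg("speed_kph").alias("avg_speed_kph"))
)

# Execution happens only when an action runs, e.g. writing the output.
features.write.mode("overwrite").parquet("s3://my-telemetry-bucket/features/")
```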