When Small Parquet Files Become a Big Problem (and How I Ended Up Writing a Compactor in PyArrow)
It all began with a fairly normal data pipeline. Events came in through Kafka and landed in AWS S3 as Parquet files after some lightweight micro-batch processing. It looked clean at first glance. Efficient. Predictable. But one day I opened one of the hourly folders and saw the