2 Comments
Bao
Nov 12 (edited)

Awesome, I'm seeing the same performance drag and mounting request costs when scanning lots of small files on S3 laid out with datetime partitioning. It takes a large number of files just to fill a single partition for each core to process.

I guess partitioning too early, without actual use cases, may hurt read performance and increase read cost. It only pays off when the total data volume is large enough for each partition (or, ideally, each file) to approach the optimal 128 MB size, and when downstream consumers can actually leverage column-level filtering.
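A rough sketch of that arithmetic, assuming data is spread evenly across partitions. The function name `partition_report` and the example numbers (10 GiB total, hourly partitions over 30 days, 20 files each) are made up for illustration; only the 128 MB target comes from the comment above.

```python
import math

TARGET_FILE_BYTES = 128 * 1024 * 1024  # 128 MB target file size


def partition_report(total_bytes: int, num_partitions: int, files_per_partition: int) -> dict:
    """Average file size for an evenly distributed, partitioned dataset."""
    bytes_per_partition = total_bytes / num_partitions
    avg_file_bytes = bytes_per_partition / files_per_partition
    # How many files each partition would need if every file were ~128 MB.
    ideal_files = max(1, math.ceil(bytes_per_partition / TARGET_FILE_BYTES))
    return {
        "avg_file_mb": avg_file_bytes / (1024 * 1024),
        "ideal_files_per_partition": ideal_files,
        "undersized": avg_file_bytes < TARGET_FILE_BYTES,
    }


# Hypothetical workload: 10 GiB, hourly partitions for 30 days, 20 files per partition.
report = partition_report(10 * 1024**3, 30 * 24, 20)
print(report)
```

With these assumed numbers each file averages well under 1 MB, and a whole partition holds far less than one 128 MB file, which is the situation where the partitioning costs more than it saves.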

Huong Vuong

True, small files and over-partitioned table layouts are the killers. The solution is to use a modern table format like Delta Lake or Iceberg, whose metadata layer holds your table stats and enables better pruning.
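A toy sketch of what stats-based pruning means, assuming the metadata layer keeps a min/max value per file for the filter column. `FileStats` and `prune` are hypothetical names, not the Delta Lake or Iceberg API; the real formats keep these stats in their own metadata, and the query engine does the skipping.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    min_ts: str  # ISO-8601 timestamps compare correctly as plain strings
    max_ts: str


def prune(files, lo, hi):
    """Keep only files whose [min_ts, max_ts] range overlaps [lo, hi]."""
    return [f.path for f in files if f.max_ts >= lo and f.min_ts <= hi]


# Hypothetical file listing with per-file stats.
files = [
    FileStats("part-0.parquet", "2024-11-01T00:00:00", "2024-11-05T23:59:59"),
    FileStats("part-1.parquet", "2024-11-06T00:00:00", "2024-11-12T11:59:59"),
    FileStats("part-2.parquet", "2024-11-12T12:00:00", "2024-11-20T23:59:59"),
]

# A query for Nov 12 only needs the two files whose ranges overlap that day.
selected = prune(files, "2024-11-12T00:00:00", "2024-11-12T23:59:59")
print(selected)
```

The point is that the files to read are chosen from metadata alone, without listing or opening every object on S3.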