Discussion about this post

Bao · Nov 12 (edited)

Awesome, I’m running into the same performance drag and mounting request costs when scanning many small files on S3 that follow datetime partitioning. It takes a large number of those files just to fill a single partition for each core to process.

My guess is that partitioning too early, without actual use cases, may hurt read performance and increase read costs. It only pays off when the total data volume is large enough for each partition (or, ideally, each file) to approach the optimal 128 MB size, and when downstream consumers can actually take advantage of column-level filtering.
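To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes one file per partition by default and treats the 128 MB figure as 128 MiB; the function names and the 50 GiB example dataset are hypothetical, chosen only for illustration.

```python
import math

# Target size per file/partition, taken from the 128 MB figure in the comment
# (interpreted here as 128 MiB, which is an assumption).
TARGET_BYTES = 128 * 1024 * 1024


def avg_file_bytes(total_bytes: int, num_partitions: int,
                   files_per_partition: int = 1) -> float:
    """Average file size if the data is spread evenly across all files."""
    return total_bytes / (num_partitions * files_per_partition)


def max_useful_partitions(total_bytes: int,
                          target_bytes: int = TARGET_BYTES) -> int:
    """Largest partition count that still keeps each partition near the target."""
    return max(1, total_bytes // target_bytes)


def files_to_fill_target(file_bytes: float,
                         target_bytes: int = TARGET_BYTES) -> int:
    """How many files of this size a reader must open to reach the target."""
    return max(1, math.ceil(target_bytes / file_bytes))


if __name__ == "__main__":
    total = 50 * 1024**3          # hypothetical 50 GiB dataset
    hourly = 365 * 24             # one partition per hour for a year
    daily = 365                   # one partition per day for a year

    for label, parts in (("hourly", hourly), ("daily", daily)):
        size = avg_file_bytes(total, parts)
        print(f"{label}: {size / 1024**2:.1f} MiB per file, "
              f"{files_to_fill_target(size)} file(s) per 128 MiB read")
    print("max useful partitions:", max_useful_partitions(total))
```

With these assumed numbers, hourly partitioning leaves files of under 6 MiB each, so a reader has to open 22 of them to reach 128 MiB, while daily partitioning already lands slightly above the target with a single file. Since each file costs at least one S3 request to read, the file count is a reasonable proxy for request cost.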
