The usual game plan is pretty straightforward: organize your data smartly and pick the right join strategies to keep data shuffling to a minimum. But let's dive headfirst into exploring Spark's I/O bottlenecks and figure out the real best practices.
Awesome, I'm hitting the same performance drag and mounting request costs when scanning numerous small files on S3 laid out with datetime partitioning. It takes a large number of small files just to fill a single input partition for each core to process.
I guess unless the total data volume is substantial enough for each partition (or, ideally, each file) to approach the optimal ~128 MB size, and downstream consumers can actually leverage partition pruning in their filters, partitioning too early without a real use case may hurt read performance and increase read costs.
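To make that 128 MB target concrete, here's a minimal sketch of the arithmetic for choosing an output partition count before writing. The helper name `target_partitions` is my own, not a Spark API; the idea is that you'd pass the result to something like `df.repartition(n)` before the write so each output file lands near the target size.

```python
import math

# Hypothetical helper: pick a partition count so each output file
# approaches a target size (default 128 MiB, the commonly cited sweet spot).
def target_partitions(total_bytes: int, target_file_bytes: int = 128 * 1024 ** 2) -> int:
    # At least one partition, even for tiny datasets.
    return max(1, math.ceil(total_bytes / target_file_bytes))

# e.g. ~10 GiB of data -> 80 partitions of roughly 128 MiB each
print(target_partitions(10 * 1024 ** 3))  # 80
```

In Spark you'd typically estimate `total_bytes` from the input size (or per-partition-key sizes) and then do `df.repartition(target_partitions(size)).write...`; the exact estimate is workload-dependent, this just shows the sizing logic.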
True, small files and an over-partitioned table layout are the killers. The solution is to adopt a modern table format like Delta or Iceberg and leverage the metadata layer, which carries table stats and enables better pruning.
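Both formats also ship compaction commands that fold small files into larger ones without rewriting the whole table. As a sketch (table and column names here are placeholders, not from the thread):

```sql
-- Delta Lake: compact small files, optionally clustering on a filter column
OPTIMIZE events ZORDER BY (event_ts);

-- Apache Iceberg: the equivalent maintenance procedure, run via a catalog
CALL my_catalog.system.rewrite_data_files(table => 'db.events');
```

Running these periodically (or enabling the formats' auto-compaction options where available) addresses the small-file problem at the storage layer, while the stats in the metadata layer let the reader skip files entirely instead of issuing an S3 request per file.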