A tool to configure degree of parallelism for hybrid layouts

Alberto Abelló, Rana Faisal Munir, Oscar Romero

  • Description

    Distributed processing frameworks (e.g., Hadoop, Spark) divide the data into multiple partitions and process each partition in a separate task. These tasks can be executed in parallel and thus speed up the analysis. Furthermore, the advent of hybrid layouts has sped up analysis even further by allowing certain operations (i.e., projection, selection) to read less data. Yet, distributed frameworks do not consider the data actually read when creating the tasks that process the partitions: the number of tasks is always derived from the total file size, not from the data being read. This may lead to launching more tasks than needed, which in turn may increase the job execution time and waste significant computing resources, since each task introduces extra overhead (e.g., initialization, garbage collection).
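
    To make the mismatch concrete, below is a minimal PySpark sketch of this behavior. The file path and column name are hypothetical placeholders; the point is that the scan is planned from the total file size even when the columnar layout physically reads only one column's chunks.

        # Minimal sketch (hypothetical path and column name). Spark splits the
        # input into partitions of at most spark.sql.files.maxPartitionBytes
        # (~128 MB by default) based on total file size.
        from pyspark.sql import SparkSession

        spark = (
            SparkSession.builder
            .appName("dop-illustration")
            .config("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024)
            .getOrCreate()
        )

        df = spark.read.parquet("/data/lineitem.parquet")  # hypothetical path

        # Only one column is projected, and Parquet will physically read just
        # that column's chunks, yet the scan is still planned over the full
        # file size, so the task count matches that of a full-table scan.
        projected = df.select("l_quantity")
        print("tasks for projected scan:", projected.rdd.getNumPartitions())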

    To allow a more efficient use of resources and reduce the job execution time, we propose a method that decides the number of tasks based on the data being read. To this end, we first propose a cost-based model for estimating the size of the data read from hybrid layouts. Next, we feed the estimated read size into a multi-objective optimization method that decides the number of tasks and the computational resources to be used. To show the effectiveness of our approach, we prototype it for Apache Parquet and Apache Spark.
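
    The following Python sketch illustrates the estimate-then-decide idea under stated assumptions: the read-size estimate simply sums the on-disk sizes of the projected column chunks and scales them by a predicate selectivity, and the task count is derived from that estimate rather than from the total file size. The function names, the 128 MB target partition size, and the per-column sizes are illustrative assumptions, not the paper's exact cost model.

        # Illustrative sketch: estimate the bytes actually read from a hybrid
        # (column-oriented) layout, then derive the task count from that
        # estimate instead of from the total file size. All numbers are made up.
        import math

        def estimated_read_bytes(column_bytes, projected_columns, selectivity=1.0):
            # Sum the on-disk sizes of the projected column chunks, scaled by
            # the fraction of data that survives predicate pushdown.
            return selectivity * sum(column_bytes[c] for c in projected_columns)

        def recommended_tasks(read_bytes, target_partition_bytes=128 * 1024 * 1024,
                              available_cores=None):
            # One task per target-sized chunk of the *estimated* read,
            # optionally capped by the cores we are willing to allocate.
            tasks = max(1, math.ceil(read_bytes / target_partition_bytes))
            return min(tasks, available_cores) if available_cores else tasks

        # Hypothetical per-column chunk sizes (bytes) for a ~6 GB Parquet file.
        column_bytes = {
            "l_orderkey": 1_200_000_000,
            "l_quantity": 400_000_000,
            "l_comment": 4_500_000_000,
        }

        read = estimated_read_bytes(column_bytes, ["l_quantity"], selectivity=0.5)
        print(recommended_tasks(read))  # 2 tasks instead of ~46 for a full scan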

    Our experimental evaluation on TPC-H shows that, on average, our recommended configurations are only 5.6% away from the Pareto front and provide a 2.1x speedup over the default configurations.

