Justyna  Kałużka

Justyna Kałużka

Master in Innovation and Research in Informatics (MIRI)
MSc Thesis 2015/2016

Enabling in-advance data shipping in Hadoop Distributed File System

Moving large volumes of data through the network represents one of the major challenges for the distributed data processing systems, as it may largely defer the execution of affected data processing tasks. For this reason, current implementations like Hadoop typically favour data locality and transferring computing to the data (i.e., function shipping) instead of transferring data closer to computing nodes (i.e., data shipping). However, in complex, multi-tenant environments, where independent client applications submit their jobs to the cluster, the distribution of the data as well as the workload over the cluster resources can easily become skewed, and thus seamlessly favouring data locality can largely affect the performance of executing jobs.

Hadoop Distributed File System (HDFS) is the most used storage system from the Hadoop stack enabling typical operations like distributed reading and writing. When data locality cannot be obeyed, the reading incurs additional transfer of input data from other nodes in the cluster, which considerably defers the execution of an awaiting task. At the same time, the network is typically less occupied resource inside the cluster, and depending on its physical characteristics independent data transfers may be easily parallelized.

To this end, we propose an extension to the current HDFS implementation with a new functionality that takes advantage of the potentially idle network resources and enables in-advance shipping of data closer to data processing tasks, and hence avoiding unnecessary delays of tasks' execution. This project requires working directly on the source code of the HDFS module, and introducing the extensions that implement the above functionality. Therefore, a thorough analysis of the complex HDFS source code is initially required. Such code analysis will result in a full conceptualization of the processes enacting inside the HDFS during the job execution. Based on the results of this code analysis, we will develop the extensions required to implement the new functionalities inside HDFS. The project will follow the Open-source software development principles, and finally committing the extensions back to the Hadoop community. We believe that such contributions will boost the current scheduling techniques, and stimulate the reconsideration of the current policies of favouring function shipping in Hadoop ecosystem. 

Coauthorship graph