Université Libre de Bruxelles (ULB)
Esteban Zimányi, Stijn Vansummeren, Toon Calders
Research Collaboration: MapReduce Data Flow Scheduling
This collaboration aims to provide “proactive” scheduling mechanisms for data-intensive flows running on shared, distributed resources.
We tackle the problem of scheduling data-intensive flows by focusing on self-adapting data distribution inside the cluster, driven by the provided and/or predicted workload (i.e., both data shipping and function shipping). Adapting the data distribution to the workload in a timely manner will improve the performance of distributed data-intensive flows, which depend largely on the locality of their input data (e.g., MapReduce).
We consider a typical distributed data processing system (e.g., Hadoop) in which different clients submit data flows for execution (multi-tenancy). In this setting, the collaboration pursues the following goals:
- Improving the utilization, load and data balancing of the cluster resources.
- Maximizing the throughput of a distributed data processing system.
- Maximizing the satisfaction of data flows' Service Level Agreements (SLAs).
- Enabling the system and its scheduling policies to self-adapt in a timely manner, so as to provide optimal data flow execution and guarantee that the data flows' SLAs are satisfied.
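To make the idea of workload-driven data distribution more concrete, the following Python sketch pairs a greedy locality-aware task placement with a proactive replication step. It is only an illustration under simplifying assumptions, not the collaboration's actual mechanism: the function names `schedule` and `plan_replication`, the slot-based capacity model, and the use of a predicted read count per block are all hypothetical.

```python
def schedule(tasks, block_locations, slots):
    """Greedily place each task on a node holding its input block, if possible.

    tasks: list of (task_id, block_id) pairs
    block_locations: dict mapping block_id -> set of nodes holding a replica
    slots: dict mapping node -> number of free task slots
    Returns a dict mapping task_id -> (node, is_local).
    """
    free = dict(slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    for task_id, block_id in tasks:
        holders = sorted(block_locations.get(block_id, ()))
        local = [n for n in holders if free.get(n, 0) > 0]
        if local:
            # Function shipping: run the task where the data already is.
            node, is_local = max(local, key=lambda n: free[n]), True
        else:
            # Data shipping: fall back to the node with the most free slots.
            candidates = [n for n in sorted(free) if free[n] > 0]
            if not candidates:
                raise RuntimeError("no free slots left in the cluster")
            node, is_local = max(candidates, key=lambda n: free[n]), False
        free[node] -= 1
        assignment[task_id] = (node, is_local)
    return assignment


def plan_replication(predicted_reads, block_locations, slots):
    """Suggest extra replicas for blocks whose predicted readers exceed
    the slot capacity of the nodes currently holding them (assumed model).

    Returns (plan, new_locations), where plan is a list of (block_id, node).
    """
    locations = {b: set(nodes) for b, nodes in block_locations.items()}
    plan = []
    for block_id, reads in sorted(predicted_reads.items()):
        holders = locations.setdefault(block_id, set())
        capacity = sum(slots.get(n, 0) for n in holders)
        while capacity < reads:
            others = [n for n in sorted(slots)
                      if n not in holders and slots[n] > 0]
            if not others:
                break  # every usable node already holds a replica
            target = max(others, key=lambda n: slots[n])
            holders.add(target)
            capacity += slots[target]
            plan.append((block_id, target))
    return plan, locations


if __name__ == "__main__":
    slots = {"n1": 2, "n2": 2, "n3": 2}
    block_locations = {"b1": {"n1"}, "b2": {"n2"}}
    predicted_reads = {"b1": 4, "b2": 1}
    tasks = [("t1", "b1"), ("t2", "b1"), ("t3", "b1"), ("t4", "b1"), ("t5", "b2")]

    plan, new_locations = plan_replication(predicted_reads, block_locations, slots)
    before = schedule(tasks, block_locations, slots)
    after = schedule(tasks, new_locations, slots)
    print("replication plan:", plan)
    print("local tasks before:", sum(loc for _, loc in before.values()))
    print("local tasks after:", sum(loc for _, loc in after.values()))
```

In this toy setting, replicating a heavily read block ahead of time raises the number of tasks that can run next to their input. A real scheduler would also have to weigh the cost of moving data against the expected gain and account for the SLAs of competing tenants.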
Research Collaboration: Self-Optimizing Data Stream Processing
This collaboration aims to extend the Lambda-architecture with semantic-aware, self-optimizing capabilities for optimal data stream processing.
- Refine the Lambda-architecture in order to provide semantic awareness to raw data.
- Study all characteristics that represent a data stream and that can be drivers of the self-optimizing process. Assess available options, study their interdependence and propose extensions.
- Study available benchmarks capable of varying the identified characteristics.
- Develop self-optimizing capabilities for data stream processing in the architecture.
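As background for the goals above, the Lambda architecture is commonly described as a batch layer that periodically recomputes views over an append-only master dataset, a speed layer that covers records not yet seen by the batch layer, and a serving step that merges both at query time. The Python sketch below illustrates that pattern for a simple per-key count. It is a single-process toy, not the refined architecture this collaboration targets, and the class name `LambdaCounter` and its methods are hypothetical.

```python
from collections import Counter


class LambdaCounter:
    """Toy per-key counter organised along the Lambda architecture layers."""

    def __init__(self):
        self.master = []                # append-only master dataset
        self.batch_view = Counter()     # recomputed from scratch by the batch layer
        self.realtime_view = Counter()  # incremental view kept by the speed layer

    def ingest(self, key):
        """Store the raw record and update the speed layer incrementally."""
        self.master.append(key)
        self.realtime_view[key] += 1

    def run_batch(self):
        """Recompute the batch view over all data, then drop the realtime
        view, since every record it covered is now in the batch view."""
        self.batch_view = Counter(self.master)
        self.realtime_view = Counter()

    def query(self, key):
        """Serving step: merge the batch and realtime views."""
        return self.batch_view[key] + self.realtime_view[key]


if __name__ == "__main__":
    counter = LambdaCounter()
    for key in ["a", "b", "a"]:
        counter.ingest(key)
    counter.run_batch()
    counter.ingest("a")
    print(counter.query("a"))
```

Self-optimization would then amount to adjusting parameters of such a pipeline, for instance how often the batch recomputation runs, in response to observed stream characteristics. Which characteristics to use is exactly what the study items above leave open.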