Daria Glushkova

Home University: Universitat Politècnica de Catalunya (UPC) 

Research Interests: Distributed Computing, Business Intelligence 
Research Topic: Dynamic Task Scheduling and Data Collocation in Distributed Computing Systems
 
Advisors
Home University: Alberto Abelló, Petar Jovanovic (UPC)
Host University: Wolfgang Lehner (TUD)
 
EDUCATION
Dec 2016 to present
Doctoral Candidate, IT4BI. Universitat Politècnica de Catalunya, Technische Universität Dresden
 
 
 
Sep 2014 to Sep 2016
MS, IT4BI. Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
Thesis title: “MapReduce Performance Models for Hadoop 2.0”
 
Sep 2003 to Feb 2009
Chelyabinsk State University, Faculty of Mathematics (http://www.csu.ru/en), 
Chelyabinsk, Russia 
Mathematician, Specialist in Computer Security

 

 
WORK EXPERIENCE
July 2011 – June 2014
Programmer
“Smart Leads” Inc., Geneva Place, Waterfront, P.O. Box 3469, Road Town, British Virgin Islands
 
July 2009 – June 2013
Assistant, Computer Security and Applied Algebra Department
Chelyabinsk State University, Chelyabinsk, Russia
 
January 2010 – May 2013
Junior Researcher
Development and implementation of a biometric identification module:
voice identification, fingerprint identification, and identification by signature.
Chelyabinsk State University, Chelyabinsk, Russia
 
January 2009 – July 2011
Programmer, Game development
“I-Jet” Ltd., Chelyabinsk, Russia
 
RESEARCH
Distributed data processing systems have emerged as a necessity for processing large-scale data volumes in reasonable time. MapReduce is a programming paradigm for distributed processing of large data sets. It operates in two steps: a Map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a Reduce function that merges all intermediate values associated with the same intermediate key. Programming in MapReduce is essentially a matter of adapting an algorithm to this two-phase processing model; programs written in this functional style are automatically parallelized and executed on computer clusters.

Apache Hadoop is one of the most popular open-source implementations of the MapReduce paradigm. All modules in Hadoop are designed under the fundamental assumption that hardware failures are common and should therefore be handled automatically by the framework in software. Hadoop thus provides strong support for fault tolerance, reliability, and scalability in distributed data processing scenarios. In the first version of Hadoop, the MapReduce programming paradigm and the resource management were tightly coupled. To improve the overall performance of Hadoop, additional requirements were introduced, such as high cluster utilization, a high level of reliability/availability, support for programming-model diversity, backward compatibility, and a flexible resource model. Consequently, the architecture of Hadoop has undergone significant changes: it decouples the programming model from the resource management infrastructure and delegates many scheduling functions to per-application components.
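To make the two-phase model above concrete, the following is a minimal word-count sketch against the standard Hadoop 2.x MapReduce API (org.apache.hadoop.mapreduce); apart from that API, the class and variable names are illustrative only.

// Minimal word-count job: the Map step emits (word, 1) pairs, the Reduce step
// merges all counts that share the same intermediate key.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map: for each input line, emit an intermediate (token, 1) pair per token.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum all intermediate values associated with the same key.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}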
The most important difference between Hadoop 1.x and Hadoop 2.0 is the introduction of the Yet Another Resource Negotiator (YARN) module, which changed the architecture significantly; MapReduce has undergone a complete overhaul in Hadoop 2.0. The fundamental idea of YARN is to split up the two major functionalities, resource management and job scheduling/monitoring, between a global ResourceManager and a per-application ApplicationMaster.
Achieving the minimal possible execution time is vital for all data processing applications. One of the main requirements for optimizing the execution time is to estimate it as accurately as possible. To estimate the execution time accurately, we need to build performance cost models that follow the programming model of the data processing applications. There are existing efforts to develop performance models for MapReduce that take Hadoop 1.x settings into account. Given the architectural changes in the second version of Hadoop, it is necessary to adapt and tune these existing performance cost models. Thus, I am going to concentrate on defining and evaluating cost models for Hadoop 2.x during my master thesis.
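As a rough illustration of what such a cost model computes (not the model developed in the thesis), the sketch below assumes a simple wave-based execution: with a fixed number of containers, N tasks finish in ceil(N / containers) sequential waves, and the job time is the sum of the map-phase and reduce-phase estimates. All parameter names and example values are hypothetical.

// Deliberately simplified, wave-based estimate of MapReduce job execution time.
// Task counts, container counts, and average per-task durations are assumed to be
// profiled or configured beforehand; this is only an illustrative sketch.
public final class SimpleMapReduceCostModel {

  // Estimates the job duration in seconds.
  public static double estimateJobSeconds(int mapTasks, int reduceTasks,
                                          int mapContainers, int reduceContainers,
                                          double avgMapSeconds, double avgReduceSeconds) {
    // With C containers, N tasks execute in ceil(N / C) sequential "waves".
    int mapWaves = (int) Math.ceil((double) mapTasks / mapContainers);
    int reduceWaves = (int) Math.ceil((double) reduceTasks / reduceContainers);
    return mapWaves * avgMapSeconds + reduceWaves * avgReduceSeconds;
  }

  public static void main(String[] args) {
    // Hypothetical example: 200 map tasks and 20 reduce tasks on a cluster that can
    // run 40 map and 10 reduce containers concurrently.
    double seconds = estimateJobSeconds(200, 20, 40, 10, 30.0, 60.0);
    System.out.printf("Estimated job time: %.0f s%n", seconds);
  }
}

Published models for Hadoop 1.x refine this idea by breaking each phase into finer sub-steps (read, collect, spill, merge, shuffle, write); revisiting such decompositions under the Hadoop 2.x/YARN architecture is the focus of this work.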
 
