Daria Glushkova

Home University: Universitat Politècnica de Catalunya (UPC) 

Research Interests: Distributed Computing, Business Intelligence 
Research Topic: Dynamic Task Scheduling and Data Collocation in Distributed Computing Systems
 
Advisors
Home University: Alberto Abelló, Petar Jovanovic (UPC)
Host University: Wolfgang Lehner (TUD)
 
EDUCATION
Dec 2016 to present
Doctoral Candidate, IT4BI. Universitat Politècnica de Catalunya, Technische Universität Dresden
 
 
 
Sep 2014 to Sep 2016
MS, IT4BI. Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
Thesis title: “MapReduce Performance Models for Hadoop 2.0”
 
Sep 2003 to Feb 2009
Chelyabinsk State University, Faculty of Mathematics (http://www.csu.ru/en), 
Chelyabinsk, Russia 
Mathematician, Specialist in Computer Security

 

 
WORK EXPERIENCE
July 2011 – June 2014
Programmer
“Smart Leads” Inc., Geneva Place, Waterfront, P.O. Box 3469, Road Town, British Virgin Islands
 
July 2009 – June 2013
Assistant, Computer Security and Applied Algebra Department
Chelyabinsk State University, Chelyabinsk, Russia
 
January 2010 – May 2013
Junior Researcher
Development and implementation of a biometric identification module:
voice identification, fingerprint identification, and identification by signature.
Chelyabinsk State University, Chelyabinsk, Russia
 
January 2009 – July 2011
Programmer, Game development
“I-Jet” Ltd., Chelyabinsk, Russia
 
RESEARCH
Distributed data processing systems have emerged as a necessity for processing large-scale data volumes in reasonable time. MapReduce is a programming paradigm for distributed processing of large data sets. It operates in two steps: a Map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a Reduce function that merges all intermediate values associated with the same intermediate key. Programming in MapReduce is essentially a matter of adapting an algorithm to this two-phase processing model; programs written in this functional style are automatically parallelized and executed on computer clusters.

Apache Hadoop is one of the most popular open-source implementations of the MapReduce paradigm. All modules in Hadoop are designed under the fundamental assumption that hardware failures are common and should therefore be handled automatically by the framework in software. Hadoop thus provides strong support for fault tolerance, reliability, and scalability in distributed data processing scenarios. In the first version of Hadoop, the MapReduce programming paradigm and the resource management were tightly coupled. To improve the overall performance of Hadoop, additional requirements were introduced, such as high cluster utilization, a high level of reliability/availability, support for programming-model diversity, backward compatibility, and a flexible resource model. Consequently, the architecture of Hadoop has undergone significant changes: it decouples the programming model from the resource management infrastructure and delegates many scheduling functions to per-application components.
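To make the two-phase model above concrete, the following is a minimal word-count sketch against the standard Hadoop 2.x MapReduce API (org.apache.hadoop.mapreduce); apart from that API, the class and variable names are illustrative only.

// Minimal word-count job: the Map step emits (word, 1) pairs, the Reduce step
// merges all counts that share the same intermediate key.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map: for each input line, emit an intermediate (token, 1) pair per token.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum all intermediate values associated with the same key.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}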
The most important difference between Hadoop 1.x and Hadoop 2.0 is the introduction of the Yet Another Resource Negotiator (YARN) module, which changed the architecture significantly; MapReduce has undergone a complete overhaul in Hadoop 2.0. The fundamental idea of YARN is to split up the two major functionalities, resource management and job scheduling/monitoring, between a global ResourceManager and a per-application ApplicationMaster.
Achieving the minimal possible execution time is vital for all data processing applications. One of the main requirements for optimizing the execution time is to estimate it as accurately as possible. To estimate the execution time accurately, we need to build performance cost models that follow the programming model of the data processing applications. There are existing efforts to develop performance models for MapReduce that take Hadoop 1.x settings into account. Given the architectural changes in the second version of Hadoop, it is necessary to adapt and tune these existing performance cost models. Thus, I am going to concentrate on defining and evaluating cost models for Hadoop 2.x during my master thesis.
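As a rough illustration of what such a cost model computes (not the model developed in the thesis), the sketch below assumes a simple wave-based execution: with a fixed number of containers, N tasks finish in ceil(N / containers) sequential waves, and the job time is the sum of the map-phase and reduce-phase estimates. All parameter names and example values are hypothetical.

// Deliberately simplified, wave-based estimate of MapReduce job execution time.
// Task counts, container counts, and average per-task durations are assumed to be
// profiled or configured beforehand; this is only an illustrative sketch.
public final class SimpleMapReduceCostModel {

  // Estimates the job duration in seconds.
  public static double estimateJobSeconds(int mapTasks, int reduceTasks,
                                          int mapContainers, int reduceContainers,
                                          double avgMapSeconds, double avgReduceSeconds) {
    // With C containers, N tasks execute in ceil(N / C) sequential "waves".
    int mapWaves = (int) Math.ceil((double) mapTasks / mapContainers);
    int reduceWaves = (int) Math.ceil((double) reduceTasks / reduceContainers);
    return mapWaves * avgMapSeconds + reduceWaves * avgReduceSeconds;
  }

  public static void main(String[] args) {
    // Hypothetical example: 200 map tasks and 20 reduce tasks on a cluster that can
    // run 40 map and 10 reduce containers concurrently.
    double seconds = estimateJobSeconds(200, 20, 40, 10, 30.0, 60.0);
    System.out.printf("Estimated job time: %.0f s%n", seconds);
  }
}

Published models for Hadoop 1.x refine this idea by breaking each phase into finer sub-steps (read, collect, spill, merge, shuffle, write); revisiting such decompositions under the Hadoop 2.x/YARN architecture is the focus of this work.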
 
