Big Data Management

Victor Herrero, Alberto Abelló, Ayman Alserafi, Moditha Hewasinghage, Petar Jovanovic, Rana Faisal Munir, Sergi Nadal, Oscar Romero, Francesc Trull

  • Description

    Companies have always considered data analysis an important asset for evaluating and improving their results. In recent years, however, much emphasis has been put on its possibilities, and most companies now see it as a clear competitive advantage. This is due to the shift the digital world is experiencing towards a more data-oriented paradigm (e.g., large volumes of digital data from social networks, smartphone sensors, web applications, etc.), which, in turn, has opened the door to data-driven business models. These are based on data analysis, which gives companies better insights into their customers. Maintaining more detailed information tailored to each customer brings benefits in many different ways; for instance, a company can foresee its customers' buying interests and make personalized offers in advance.

    In this new context, data management and analysis requirements have changed completely, giving birth to the Big Data concept. Commonly defined by the three V's, Big Data deals with huge amounts (Volume) of heterogeneous (Variety) data, while query performance must not degrade and real-time data analysis must remain possible (Velocity). Furthermore, it must enable more advanced and sophisticated methods for data analysis. Typical OLAP summarizations are still necessary, but systems must be endowed with data mining and machine learning capabilities, too. As a consequence, many alternatives to relational DBMSs have bloomed, since the new analytical goals require non-relational data models (e.g., for graph exploration). NOSQL databases and the Hadoop ecosystem are the most widespread platforms among these alternatives. To meet all of those requirements, current solutions usually rely on a data lake, a recent concept in the Big Data world that refers to a generic storage system where all sorts of raw data can be saved (e.g., a distributed file system) and that defers data processing until it is clear how the data are going to be exploited. However, this deferral is also its main drawback, since such data processing (data cooking) can only be achieved by means of ad-hoc solutions. This puts the health of a data lake at risk, favouring its turning into a data swamp: a data lake where no control over the stored data is maintained. Hence, current Big Data solutions must necessarily be complemented with automation for data governance.

    The aim of this research area is therefore to revisit data management systems to meet those Big Data requirements. Research efforts have been conducted in several directions. First, traditional database tuning techniques have been rethought for adoption by the new MapReduce and Apache Spark data processing frameworks from the Hadoop ecosystem. Results clearly showed that exploiting the raw horsepower of the cloud as an alternative for data processing does not pay off, as secondary indexes and cost-based optimization models still yield higher performance improvements than brute-force methods. Second, a reference architecture has been proposed to facilitate the design of Big Data systems. This architecture builds on the data lake concept and extends it with the mechanisms needed to prevent it from turning into a data swamp. An instantiation of this reference architecture has already been deployed in a real project, where the main successes were an extremely high gain in query performance and the capability of dealing with multisource data in a single integrated view. Third, NOSQL database design is also of interest to the group, and a research line currently under way seeks design patterns for different data models. Finally, the following theses also belong to this research area.

    Research line: Self-Tuning BI Systems

    To realize BI for the masses, exploratory BI, self-service BI, and similar concepts, all users must be able to support themselves in the analytical and maintenance tasks they need to perform. The user-centric nature of these systems seeks to enable non-technical users to analyze data on demand. Thus, next-generation BI systems should provide flexible means for such users to create the desired reports and data analyses. This means that the system should be self-configurable and react to day-to-day usage. For this reason, the system must be continuously monitored in order to overcome potential bottlenecks of any kind (such as performance, information, design or quality bottlenecks).

    In this work, we propose to monitor the BI system, gather relevant metadata for its assessment, and develop self-tuning features based on past evidence. Several tasks must be undertaken to fulfill this objective. First, the main storage alternatives must be characterized (including NoSQL trends). However, this classification should not only be model-based (e.g., relational, key-value, document stores, graph databases, etc., as is usually done) but should also consider other decisions such as the system architecture (e.g., hash-based, clustered, in-memory, disk-based), design (e.g., fragmentation and replication capabilities, indexing), optimizations implemented by the query execution engine, etc. Once a clear characterization is available, and it is known which factors are relevant when choosing between different storage options for a given workload (i.e., past evidence gathered in the system), the desired output would be a deterministic algorithm (probably cost-based) to enable self-tuning BI systems and, in turn, more user-friendly BI tools that bridge the gap between business needs and IT limitations.
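
    As a rough illustration of what a deterministic, cost-based choice between storage options could look like, the following minimal sketch scores each option against a workload summary and picks the cheapest one. It is not the algorithm pursued in this research line: the StorageOption fields, the per-operation cost figures and the workload keys are all hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageOption:
    """Hypothetical characterization of one storage alternative."""
    name: str
    point_lookup_cost: float  # assumed cost units per point lookup
    scan_cost: float          # assumed cost units per full scan
    write_cost: float         # assumed cost units per write


def workload_cost(option, workload):
    """Estimated total cost of running the observed workload on one option."""
    return (workload["point_lookups"] * option.point_lookup_cost
            + workload["scans"] * option.scan_cost
            + workload["writes"] * option.write_cost)


def choose_storage(options, workload):
    """Pick the cheapest option; ties are broken by name, so the result is deterministic."""
    return min(options, key=lambda o: (workload_cost(o, workload), o.name))


# Illustrative figures only, not measurements of any real system.
OPTIONS = [
    StorageOption("key-value", point_lookup_cost=1.0, scan_cost=50.0, write_cost=1.0),
    StorageOption("columnar", point_lookup_cost=8.0, scan_cost=5.0, write_cost=4.0),
]

# Workload summaries, standing in for the past evidence gathered by monitoring.
lookup_heavy = {"point_lookups": 1000, "scans": 2, "writes": 100}
scan_heavy = {"point_lookups": 10, "scans": 200, "writes": 100}

if __name__ == "__main__":
    print(choose_storage(OPTIONS, lookup_heavy).name)
    print(choose_storage(OPTIONS, scan_heavy).name)
```

    With these assumed figures, the lookup-heavy workload favours the key-value option and the scan-heavy workload favours the columnar one. A realistic cost model would also have to account for the architectural and design factors listed above.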

    Research line: Requirement-driven Design and Optimization of Data-Intensive Flows

    The efficient, adaptable and optimal design of data-intensive flows, led by real business requirements, is critical in order to keep up with the dynamicity of today’s business environments. Currently there are many technologies and tools dealing with the design of data-intensive flows for BI systems, either integrated within BI platforms or as separate components. However, even though they usually provide intuitive graphical interfaces, such tools typically require considerable manual effort from users to translate their business needs into the corresponding designs. Moreover, these tools often do not provide any automated support for the efficient evolution and optimization of data-intensive flows, which is crucial for next-generation BI systems. To this end, in this project we propose an end-to-end system for assisting both designers and business users during these difficult tasks. Such a system would facilitate the early stages of BI projects, when only a few initial requirements should lead the design of the BI system’s components (e.g., data stores and data flows) from scratch, as well as the complete design lifecycle, when the design of the system’s components must be efficiently adapted to new or changed business needs. The project studies the automation of the design of target data stores (e.g., MD schema) and data-intensive flows (e.g., ETL) in BI systems from information requirements. First, we propose a module for the incremental, requirement-driven creation of data-intensive flows (CoAl). CoAl considers each requirement separately and iteratively builds the respective data flow design to satisfy all given requirements. At the same time, we consider how the semantic-aware integration of MD schemata could enrich this design process, and thus we propose an ontology-based and requirement-driven approach to the integration of MD schemata (ORE). Lastly, we study the optimal execution of data-intensive flows, focusing on the optimal scheduling of data flows on distributed data processing systems.

    Research line: Evaluation of data placement optimizations in in-memory databases

    The memory hierarchy orders the different memory levels at which data can be stored. At the top there are commonly the CPU registers, whereas at the bottom there is the hard disk. Each of these levels involves a trade-off between performance and capacity. CPU registers are well known to respond far faster than hard disks, as the former are purely electronic components while the latter involve mechanical parts that lead to higher latencies. However, a single hard disk can store up to terabytes of data, as opposed to a register, which can hold only a few bytes. Typically, in top-down order, the memory hierarchy comprises the CPU registers, the cache memory (which in turn can be split into several levels), the main memory and the hard disk. In in-memory databases (IMDB), the storage layout resides entirely in the cache and the main memory. Additionally, IMDB are often distributed databases composed of several interconnected nodes, each with its own resources. This results in a very complex architecture, and an IMDB must be aware of it in order to boost performance. An intuitive approach to this type of optimization is to intelligently place and distribute data across the different database nodes (data placement optimizations) with respect to the query workload. Performance then benefits from spreading memory requests across several resources working in parallel. This work, hence, explores and evaluates existing approaches to data placement optimization in IMDB in the presence of real workloads, covering operational, analytical and hybrid workloads. The analysis derived from this evaluation will be used to refine these optimizations so that IMDB can place their data smartly across the whole system.
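
    To make the idea of workload-aware data placement concrete, the sketch below spreads data partitions across nodes with a simple greedy heuristic: the most frequently accessed partition goes first, and each partition is assigned to the node with the lowest accumulated load. This is only an illustration and not one of the approaches evaluated in this work; the partition names, the access counts and the helper functions place_partitions and node_loads are hypothetical.

```python
import heapq


def place_partitions(access_counts, num_nodes):
    """Greedily assign partitions to nodes, hottest partition first.

    access_counts maps a partition name to how often the workload touches it.
    Returns a dict mapping each partition name to a node index.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    # Min-heap of (accumulated load, node index): the least-loaded node is on top.
    heap = [(0, node) for node in range(num_nodes)]
    heapq.heapify(heap)
    placement = {}
    # Sort by descending access count; the name breaks ties deterministically.
    ordered = sorted(access_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for partition, count in ordered:
        load, node = heapq.heappop(heap)
        placement[partition] = node
        heapq.heappush(heap, (load + count, node))
    return placement


def node_loads(access_counts, placement, num_nodes):
    """Total access count that ends up on each node under a given placement."""
    loads = [0] * num_nodes
    for partition, node in placement.items():
        loads[node] += access_counts[partition]
    return loads


# Hypothetical access frequencies derived from a query workload.
ACCESS_COUNTS = {"orders": 900, "customers": 500, "items": 400, "logs": 100}

if __name__ == "__main__":
    placement = place_partitions(ACCESS_COUNTS, num_nodes=2)
    print(placement)
    print(node_loads(ACCESS_COUNTS, placement, num_nodes=2))
```

    In this example the two hottest partitions land on different nodes, so requests for them can be served in parallel. Real placement strategies would also need to consider co-access patterns between partitions, replication and the memory capacity of each node.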

    Related publications
    Daria Glushkova, Petar Jovanovic, Alberto Abelló: MapReduce Performance Models for Hadoop 2.x. EDBT/ICDT Workshops 2017
    Ayman Alserafi, Toon Calders, Alberto Abelló, Oscar Romero: DS-Prox: Dataset Proximity Mining for Governing the Data Lake. SISAP 2017
    Sergi Nadal, Alberto Abelló, Oscar Romero, Jovan Varga: Big Data Management Challenges in SUPERSEDE. EDBT/ICDT Workshops 2017
    Sergi Nadal, Oscar Romero, Alberto Abelló, Panos Vassiliadis, Stijn Vansummeren: An Integration-Oriented Ontology to Govern Evolution in Big Data Ecosystems. EDBT/ICDT Workshops 2017
    Sergi Nadal, Victor Herrero, Oscar Romero, Alberto Abelló, Xavier Franch, Stijn Vansummeren, Danilo Valerio: A software reference architecture for semantic-aware Big Data systems. Information & Software Technology 2017
    Ayman Alserafi, Alberto Abelló, Oscar Romero, Toon Calders: Towards Information Profiling: Data Lake Content Metadata Management. ICDM Workshops 2016
    Victor Herrero, Alberto Abelló, Oscar Romero: NOSQL Design for Analytical Workloads: Variability Matters. ER 2016
    Petar Jovanovic, Oscar Romero, Toon Calders, Alberto Abelló: H-WorD: Supporting Job Scheduling in Hadoop with Workload-Driven Data Redistribution. ADBIS 2016
    Rana Faisal Munir, Oscar Romero, Alberto Abelló, Besim Bilalli, Maik Thiele, Wolfgang Lehner: ResilientStore: A Heuristic-Based Data Format Selector for Intermediate Results. MEDI 2016
    Alberto Abelló: Big Data Design. DOLAP 2015
    Oscar Romero, Victor Herrero, Alberto Abelló, Jaume Ferrarons: Tuning small analytics on Big Data: Data partitioning and secondary indexes in the Hadoop ecosystem. Inf. Syst. 2015
    Alberto Abelló: NoSQL: The death of the Star. EDA 2011
    Alberto Abelló, Jaume Ferrarons, Oscar Romero: Building cubes with MapReduce. DOLAP 2011