Data-Intensive Flows
Alberto Abelló, Besim Bilalli, Petar Jovanovic, Sergi Nadal, Oscar Romero

Description
Data-intensive flows are central processes in today's business intelligence (BI) systems, deploying different technologies to deliver data, from a multitude of data sources, in user-preferred and analysis-ready formats. In general, a data-intensive flow starts by extracting data from individual data sources; it then transforms, cleans, and conforms the extracted data to satisfy certain quality standards and business requirements; and it finally delivers the data to end users.
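As a minimal illustration of this extract-transform-deliver structure, the following Python sketch wires the three stages together. The file names, field names, and cleaning rules are hypothetical placeholders for real sources and quality rules, not part of any system described here:

import csv

def extract(path):
    """Extract raw records from one data source (here, a CSV file)."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(records):
    """Clean and conform records to a simple quality standard:
    drop rows missing the key field and normalize the amount field."""
    for r in records:
        if not r.get("customer_id"):
            continue  # hypothetical quality rule: reject incomplete records
        r["amount"] = round(float(r["amount"]), 2)  # conform the format
        yield r

def load(records, path):
    """Deliver the analysis-ready records to the target (here, a CSV file)."""
    records = list(records)
    if not records:
        return  # nothing survived cleaning; skip writing
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)

if __name__ == "__main__":
    load(transform(extract("sales_source.csv")), "sales_ready.csv")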
To meet the complex requirements of next-generation BI systems, we often need an effective combination of the traditional, batch-oriented extract-transform-load (ETL) processes that populate a data warehouse (DW) from integrated data sources with more real-time, operational data flows that integrate source data and provide results to users at runtime.
The most critical challenges related to data-intensive flows in next-generation BI systems are automating the flow design from users' information requirements, and optimizing the deployment and execution of data-intensive flows so that the agreed quality requirements are satisfied.
Research line: Requirement-driven Design and Optimization of Data-Intensive Flows
The efficient, adaptable, and optimal design of data-intensive flows, led by real business requirements, is critical for coping with the dynamicity of today's business environments. Many technologies and tools currently deal with the design of data-intensive flows for BI systems, either integrated within BI platforms or as separate components. However, even though they usually provide intuitive graphical interfaces, such tools typically require considerable manual effort from users to translate their business needs into the corresponding designs. Moreover, these tools often provide no automated support for the efficient evolution and optimization of data-intensive flows, which is crucial for next-generation BI systems.

To this end, in this project we propose an end-to-end system for assisting both designers and business users during these difficult tasks. Such a system would facilitate the early stages of BI projects, when only a few initial requirements must drive the design of the BI system's components (e.g., data stores and data flows) from scratch, as well as the complete design lifecycle, when the design of the system's components must be efficiently accommodated in the face of new or changed business needs.

The project studies the automation of the design of target data stores (e.g., MD schemata) and data-intensive flows (e.g., ETL) in BI systems from information requirements. First, we propose a module for the incremental, requirement-driven creation of data-intensive flows (CoAl). CoAl considers each requirement separately and iteratively builds the respective data flow design to satisfy all given requirements (a simplified sketch of this consolidation idea follows below). At the same time, we consider how the semantics-aware integration of MD schemata can enrich such a design process, and thus we propose an ontology-based, requirement-driven approach to the integration of MD schemata (ORE). Lastly, we study the optimal execution of data-intensive flows, focusing on the optimal scheduling of data flows on distributed data processing systems.
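The Python sketch below illustrates, under heavy simplification, the incremental consolidation idea behind CoAl: requirements are processed one at a time, and each requirement's sequence of operations is merged into the consolidated flow by reusing the longest shared prefix of operations. The operation labels and the prefix-tree representation are our own illustrative assumptions; the actual CoAl algorithm works over richer flow graphs and cost models:

def consolidate(flow, requirement_ops):
    """Merge one requirement (a list of operation labels) into the
    consolidated flow, represented as a prefix tree (dict of dicts)."""
    node = flow
    for op in requirement_ops:
        # Reuse an existing operation node when possible; otherwise
        # extend the flow with a new branch for this requirement.
        node = node.setdefault(op, {})
    return flow

if __name__ == "__main__":
    flow = {}
    requirements = [
        ["extract(sales)", "filter(year=2024)", "aggregate(sum)"],
        ["extract(sales)", "filter(year=2024)", "join(customers)"],
    ]
    for ops in requirements:
        consolidate(flow, ops)
    # The two requirements share the extract and filter steps,
    # which appear only once in the consolidated flow.
    print(flow)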
Research line: Automating User-Centered Design of Data-Intensive Processes
Business Intelligence (BI) enables an organization to collect and analyse internal and external business data to generate knowledge and business value, and to provide decision support at the strategic, tactical, and operational levels. The consolidation of data coming from many sources as a result of managerial and operational business processes, usually referred to as Extract-Transform-Load (ETL), is itself a statically defined business process, and knowledge workers have little to no control over the characteristics of the presentable data to which they have access.
Two main reasons dictate the reassessment of this rigid approach in the context of modern business environments. First, the service-oriented nature of today's business, combined with the increasing volume of available data, makes it impossible for an organization to proactively design efficient data management processes that are specific to its internal operational scope. Second, enterprises can benefit significantly from analysing the behaviour of their business processes in order to foster their optimization.
The aim of this work has been the definition of models, techniques, and tools to support the alignment of user requirements with the runtime characteristics of business processes for data management. Hence, we took a first step towards quality-aware ETL process design automation by defining, through a systematic literature review, a set of ETL process quality characteristics and the relationships between them, and by providing quantitative measures for each characteristic. Subsequently, we produced a model that represents ETL process quality characteristics and the dependencies among them, and we showcased, through the application of a goal model with quantitative components (i.e., indicators), how our model can provide the basis for subsequent analysis to reason about and make informed ETL design decisions.
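To make the idea concrete, the sketch below encodes a toy version of such a model in Python: quality characteristics carry quantitative indicators, and signed dependencies record whether improving one characteristic helps or hurts another. The characteristics, indicator values, and dependencies shown are hypothetical illustrations, not the ones derived in our literature review:

# Quality characteristics with hypothetical indicator values.
indicators = {
    "performance": 0.7,  # e.g., normalized throughput
    "reliability": 0.9,  # e.g., fraction of successful runs
    "freshness":   0.5,  # e.g., inverse of data latency
}

# dependencies[(a, b)] = effect of improving a on b (+1 helps, -1 hurts);
# both entries are invented examples of such dependencies.
dependencies = {
    ("performance", "freshness"): +1,    # faster flows deliver fresher data
    ("reliability", "performance"): -1,  # recovery points add overhead
}

def impact_of_improving(characteristic):
    """List which other characteristics a design decision that improves
    the given characteristic is expected to help or hurt."""
    return {
        b: ("helps" if sign > 0 else "hurts")
        for (a, b), sign in dependencies.items()
        if a == characteristic
    }

print(impact_of_improving("reliability"))  # {'performance': 'hurts'}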
In addition, we introduced our holistic view of quality-aware ETL process design by presenting a framework for user-centered, declarative ETL. This included the definition of an architecture and a methodology for the rapid, incremental, qualitative improvement of ETL process models, promoting automation and reducing complexity, as well as a clear separation of business user (BU) and IT roles, where each user is presented with appropriate views and assigned fitting tasks.
In this direction, we presented a prototype of our tool POIESIS, which can improve the quality of an ETL process by automatically generating optimization patterns integrated into the ETL flow, resulting in thousands of alternative ETL flows. We applied an iterative model in which users are the key participants, through well-defined collaborative interfaces, and, based on estimated measures for different quality characteristics, we showcased how our tool facilitates the incremental, quantitative improvement of ETL process models.
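A hedged sketch of this generate-and-rank idea follows: alternative flows are produced by inserting flow-level optimization patterns at candidate positions, each alternative is scored with an estimated quality measure, and the alternatives are ranked. The flow, the patterns, and the scoring function are invented for illustration and are not POIESIS's actual model or API:

from itertools import product

base_flow = ["extract", "join", "aggregate", "load"]
patterns = ["add_checkpoint", "add_parallelizer"]  # hypothetical patterns

def alternatives(flow, patterns):
    """Yield every flow obtained by inserting one pattern instance at
    each possible position (a small slice of the full search space)."""
    for pattern, pos in product(patterns, range(1, len(flow))):
        yield flow[:pos] + [pattern] + flow[pos:]

def estimated_quality(flow):
    """Toy estimate: parallelizers favor performance, checkpoints favor
    reliability; here both are folded into a single numeric score."""
    return flow.count("add_parallelizer") * 2 + flow.count("add_checkpoint")

ranked = sorted(alternatives(base_flow, patterns),
                key=estimated_quality, reverse=True)
print(ranked[0])  # the best-scoring alternative flow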
When it comes to evaluating the different quality characteristics of an ETL process design, we proposed Bijoux, an automated data generation framework for evaluating ETL processes. To this end, we classified operations based on the part of the input data they access for processing, which helps Bijoux during data generation both to identify the constraints that specific operation semantics imply over input data and to decide at which level the data should be generated (e.g., single field, single tuple, complete dataset). Bijoux offers data generation capabilities in a modular and configurable manner, which can be used to evaluate the quality of different parts of an ETL process.
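As a small illustration of this constraint-driven generation, the sketch below derives, for a single filter operation, input data that exercises both the satisfying and the violating branches of the filter's predicate. The predicate, field name, and value ranges are hypothetical; Bijoux itself handles complete ETL flows and decides per operation whether to generate single fields, tuples, or whole datasets:

import random

def generate_for_filter(field, threshold, n=4):
    """Generate tuples at field level so that the hypothetical filter
    `field > threshold` sees both passing and failing inputs."""
    passing = [{field: threshold + random.uniform(1, 100)} for _ in range(n)]
    failing = [{field: threshold - random.uniform(0, 100)} for _ in range(n)]
    return passing + failing

dataset = generate_for_filter("amount", threshold=50.0)
survivors = [t for t in dataset if t["amount"] > 50.0]
assert 0 < len(survivors) < len(dataset)  # both branches are exercised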
Collectively, these contributions have raised the level of abstraction of ETL process components, revealing their quality characteristics at a granular level and allowing for evaluation and automated (re-)design that takes BU quality goals into consideration.
Related publications
Joseph Giovanelli, Besim Bilalli, Alberto Abelló, Fernando Silva-Coira, Guillermo de Bernardo: Reproducible experiments for generating pre-processing pipelines for AutoETL. Inf. Syst., 2024.
Rediana Koçi, Xavier Franch, Petar Jovanovic, Alberto Abelló: Web API evolution patterns: A usage-driven approach. J. Syst. Softw., 2023.
Rediana Koçi, Xavier Franch, Petar Jovanovic, Alberto Abelló: PatternLens: Inferring evolutive patterns from web API usage logs. CAiSE Forum, 2021.
Vasileios Theodorou, Alberto Abelló, Maik Thiele, Wolfgang Lehner: POIESIS: A Tool for Quality-aware ETL Process Redesign. EDBT, 2015.
Oscar Romero, Alkis Simitsis, Alberto Abelló: GEM: Requirement-Driven Generation of ETL and Multidimensional Conceptual Designs. DaWaK, 2011.