Business Intelligence
Alberto Abelló, Petar Jovanovic, Sergi Nadal, Oscar Romero

Description
Business Intelligence (BI) is the set of techniques and tools that empowers an organisation with the capability of collecting and analyzing internal and external data to generate knowledge and value, providing decision support at the strategic, tactical, and operational levels. This traditionally includes the areas of Data Warehousing, OLAP (descriptive analysis), and Data Mining (predictive analysis). In the group, we mainly focus on the two main areas.
Data Warehousing refers to the extraction of data from the sources and its storage and management in a common, integrated, long-lasting repository with temporal capabilities (both Valid Time and Transaction Time), with the ultimate purpose of analyzing it. Sometimes, from a theoretical point of view, a Data Warehouse is simply defined as a set of Materialized Views. Optimally selecting and updating such Materialized Views has been an active research area in recent years.
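As a minimal illustration of this view (a toy sketch with invented data, not the group's formalism), a materialized view is a stored query result that must be kept consistent with its sources; incremental maintenance applies only the delta instead of recomputing the view from scratch:

```python
from collections import defaultdict

# Toy source "fact table": (product, region, amount) rows.
sales = [
    ("tv", "EU", 100),
    ("tv", "US", 200),
    ("phone", "EU", 150),
]

def compute_view(rows):
    """Full recomputation: total amount per product."""
    view = defaultdict(int)
    for product, _region, amount in rows:
        view[product] += amount
    return view

mv = compute_view(sales)  # materialized: stored once, not re-derived per query

def refresh(view, delta_rows):
    """Incremental maintenance: fold in only the newly arrived rows."""
    for product, _region, amount in delta_rows:
        view[product] += amount

delta = [("phone", "US", 50)]
sales.extend(delta)
refresh(mv, delta)
assert mv == compute_view(sales)  # incremental refresh matches recomputation
```

The trade-off sketched here (query speed from precomputation versus the cost of keeping the view fresh) is precisely what makes view selection and maintenance an optimization problem.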
On the other hand, OLAP (standing for On-Line Analytical Processing) tools are those that allow the navigation of data by means of the Multidimensional Model, based on the Data Cube metaphor. Thus, cubes are defined in terms of a Star Schema composed of a Fact, the subject of analysis, and different Dimensions around it that facilitate operations like Roll-up, Drill-down, Slice, Dice, etc. Such a conceptual schema can be implemented in different technologies (named ROLAP if the DBMS is Relational), but in any case, it must result in high performance of the aggregation operations (most of the time obtained by precomputing query results).
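The core operations can be sketched over a toy star schema (invented data and names, independent of any particular OLAP tool): a Sales fact with Time and Place dimensions, each carrying an aggregation hierarchy (day to month, city to country):

```python
from collections import defaultdict

# Toy fact data: (day, city, amount).
facts = [
    ("2024-01-03", "Barcelona", 10),
    ("2024-01-15", "Madrid",    20),
    ("2024-02-07", "Barcelona", 30),
]

def month_of(day):
    return day[:7]  # "2024-01-03" -> "2024-01"

country_of = {"Barcelona": "ES", "Madrid": "ES"}  # city -> country level

cube = {(d, c): a for d, c, a in facts}

def roll_up(cube):
    """Roll-up: aggregate from (day, city) up to (month, country)."""
    out = defaultdict(int)
    for (day, city), amount in cube.items():
        out[(month_of(day), country_of[city])] += amount
    return dict(out)

def slice_(cube, city):
    """Slice: fix one dimension value, dropping that dimension."""
    return {day: a for (day, c), a in cube.items() if c == city}

print(roll_up(cube))            # {('2024-01', 'ES'): 30, ('2024-02', 'ES'): 30}
print(slice_(cube, "Barcelona"))  # {'2024-01-03': 10, '2024-02-07': 30}
```

Precomputing the `roll_up` result and storing it is exactly the aggregate-materialization strategy mentioned above for achieving high performance.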
Finally, it is important to acknowledge the relevance of automation in this context, given that the users (typically executives) of these tools are not necessarily experts in Information Technologies. Thus, some effort is being devoted to providing Self-service capabilities that hide the technological complexity underneath.
Research line: Multidimensional Conceptual Modelling
We have proposed YAM², a multidimensional conceptual model for OLAP defined as an extension of UML (Unified Modeling Language). The aim was to benefit from Object-Oriented concepts and relationships to allow the definition of semantically rich multi-star schemas. Thus, the usage of Generalization, Association, Derivation, and Flow relationships (in UML terminology) was studied.
An architecture based on different levels of schemas was also proposed and the characteristics of its different levels defined. The benefits of this architecture are twofold. Firstly, it relates Federated Information Systems with Data Warehousing, so that advances in one area can also be used in the other. Moreover, the Data Mart schemas are defined so that they can be implemented on different Database Management Systems, while still offering a common integrated vision that allows users to navigate through the different stars.
The main concepts of any multidimensional model are facts and dimensions. Both were analyzed separately, based on the assumption that relationships between aggregation levels are part-whole (or composition) relationships. Thus, mereology axioms were used in that analysis to prove some properties.
Besides structures, operations and integrity constraints were also defined for YAM². Since a data cube was defined as a function, the operations (i.e., Drill-across, ChangeBase, Roll-up, Projection, and Selection) were defined over functions. The set of integrity constraints reflects the importance of summarizability (or aggregability) of measures, and pays special attention to it.

Research Line: Automating the Multidimensional Design of Data Warehouses
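The cube-as-function view can be sketched as follows (a toy rendering with invented data; YAM² itself is defined over UML, not code): a cube maps dimension coordinates to a measure, and operations map cubes to cubes.

```python
from collections import defaultdict

# A cube as a partial function from coordinates (month, country) to a measure.
cube = {
    ("2024-01", "ES"): 30,
    ("2024-02", "ES"): 30,
    ("2024-01", "FR"): 5,
}

def selection(cube, pred):
    """Selection: restrict the cube's domain by a predicate on coordinates."""
    return {c: m for c, m in cube.items() if pred(c)}

def roll_up(cube, to_level, agg=sum):
    """Roll-up: compose coordinates with a level mapping, aggregating each group."""
    groups = defaultdict(list)
    for coord, measure in cube.items():
        groups[to_level(coord)].append(measure)
    return {c: agg(ms) for c, ms in groups.items()}

# Roll Time up from month to year, summing the measure:
yearly = roll_up(cube, lambda c: (c[0][:4], c[1]))
print(yearly)  # {('2024', 'ES'): 60, ('2024', 'FR'): 5}
```

Passing the aggregation function explicitly (`agg`) is also where summarizability constraints bite: not every measure may be legally summed along every dimension.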
Previous experiences in the data warehouse field have shown that the data warehouse multidimensional conceptual schema must be derived from a hybrid approach: i.e., by considering both the end-user requirements and the data sources as first-class citizens. Like in any other system, requirements guarantee that the system devised meets the end-user needs. In addition, since the data warehouse design task is a reengineering process, it must consider the underlying data sources of the organization: (i) to guarantee that the data warehouse can be populated from data available within the organization, and (ii) to allow the end-user to discover unknown additional analysis capabilities.
Several methods for supporting the data warehouse modeling task have been provided. However, they suffer from some significant drawbacks. In short, requirement-driven approaches assume that requirements are exhaustive (and therefore do not consider that the data sources may contain alternative interesting evidence for analysis), whereas data-driven approaches (i.e., those leading the design task from a thorough analysis of the data sources) rely on discovering as much multidimensional knowledge as possible from the data sources. As a consequence, data-driven approaches generate too many results, which misleads the user. Furthermore, the automation of the design task is essential in this scenario, as it removes the dependency on an expert's ability to properly apply the method chosen, as well as the need to analyze the data sources, which is a tedious and time-consuming task (and can be unfeasible when working with large databases). In this sense, current automatable methods follow a data-driven approach, whereas current requirement-driven approaches overlook the process automation, since they tend to work with requirements at a high level of abstraction. Indeed, this scenario is repeated in the data-driven and requirement-driven stages within current hybrid approaches, which suffer from the same drawbacks as pure data-driven or requirement-driven approaches.
In this research line we introduced two different approaches for automating the multidimensional design of the data warehouse: MDBE (Multidimensional Design Based on Examples) and AMDO (Automating the Multidimensional Design from Ontologies). Both approaches were devised to overcome the limitations previously discussed. On the one hand, we rely on the end-user requirements, but we do not deny that the data sources may also contain hidden analysis capabilities that, eventually, may be of interest. Nevertheless, in no case do we generate overwhelming amounts of results from the sources. On the contrary, we aim at filtering, by means of objective evidence, the results obtained by analyzing the sources. Importantly, our approaches start from opposite initial assumptions, but both consider the end-user requirements and the data sources as first-class citizens. Furthermore, we also focus on the automation of the process, to facilitate the designer's task as much as possible.
Related publications
Alberto Abelló, James Cheney: Eris: efficiently measuring discord in multidimensional sources. VLDB J. 2024
Amine Ghrab, Oscar Romero, Sabri Skhiri, Esteban Zimányi: TopoGraph: an End-To-End Framework to Build and Analyze Graph Cubes. Inf. Syst. Frontiers 2021
Enrico Gallinucci, Matteo Golfarelli, Stefano Rizzi, Alberto Abelló, Oscar Romero: Interactive multidimensional modeling of linked data for exploratory OLAP. Inf. Syst. 2018
Petar Jovanovic, Oscar Romero, Alkis Simitsis, Alberto Abelló: Requirement-Driven Creation and Deployment of Multidimensional and ETL Designs. ER Workshops 2012
Oscar Romero, Alberto Abelló: MDBE: Automatic Multidimensional Modeling. ER 2008
Alberto Abelló, José Samos, Fèlix Saltor: Benefits of an Object-Oriented Multidimensional Data Model. Objects and Databases 2000