
Publications

2019
  • Anam Haq, Szymon Wilk, Alberto Abelló. Fusion of clinical data: A case study to predict the type of treatment of bone fractures. International Journal of Applied Mathematics and Computer Science 29(1). Sciendo, 2019. Pages 51-67. ISSN: 2083-8492. DOI: 10.2478/amcs-2019-0004
    A prominent characteristic of clinical data is their heterogeneity - such data include structured examination records and laboratory results, unstructured clinical notes, raw and tagged images, and genomic data. This heterogeneity poses a formidable challenge when constructing diagnostic and therapeutic decision models, which are currently based on single modalities and are not able to use data in different formats and structures. This limitation may be addressed using data fusion methods. In this paper, we describe a case study in which we aimed at developing data fusion models that resulted in various therapeutic decision models for predicting the type of treatment (surgical vs. non-surgical) for patients with bone fractures. We considered six different approaches to integrate clinical data: one fusion model based on combination of data (COD) and five models based on combination of interpretation (COI). Experimental results showed that decision models constructed following the COI fusion approach are more accurate than decision models employing COD. Moreover, statistical analysis using the one-way ANOVA test revealed that there were two groups of constructed decision models, each containing a set of three different models. The results highlighted that the behavior of models within a group can be similar, although it may vary between different groups.
  • Yassine Ouhammou, Ladjel Bellatreche, Mirjana Ivanovic, Alberto Abelló. Model and data engineering for advanced data-intensive systems and applications. Computing 101(10). Springer, 2019. Pages 1391-1395. ISSN: 1436-5057. DOI: 10.1007/s00607-019-00726-3
  • Besim Bilalli, Alberto Abelló, Tomàs Aluja-Banet, Robert Wrembel. PRESISTANT: Learning based assistant for data pre-processing. Data and Knowledge Engineering, 123. Elsevier, 2019. ISSN: 0169-023X. DOI: 10.1016/j.datak.2019.101727
    Data pre-processing is one of the most time consuming and relevant steps in a data analysis process (e.g., classification task). A given data pre-processing operator can have positive, negative, or zero impact on the final result of the analysis. Expert users have the required knowledge to find the right pre-processing operators. However, non-experts are overwhelmed by the amount of pre-processing operators and it is challenging for them to find operators that would positively impact their analysis (e.g., increase the predictive accuracy of a classifier). Existing solutions either assume that users have expert knowledge, or they recommend pre-processing operators that are only "syntactically" applicable to a dataset, without taking into account their impact on the final analysis. In this work, we aim at providing assistance to non-expert users by recommending data pre-processing operators that are ranked according to their impact on the final analysis. We developed a tool, PRESISTANT, that uses Random Forests to learn the impact of pre-processing operators on the performance (e.g., predictive accuracy) of five different classification algorithms, namely Decision Tree (J48), Naive Bayes, PART, Logistic Regression, and Nearest Neighbor (IBk). Extensive evaluations of the recommendations provided by our tool show that PRESISTANT can effectively help non-experts achieve improved results in their analytic tasks.
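
    To make the idea above concrete, the following minimal sketch (not the actual PRESISTANT implementation) trains a random-forest meta-model on a synthetic meta-dataset and ranks hypothetical pre-processing operators by their predicted impact on accuracy; all operator names, meta-features, and numbers are illustrative assumptions.

    # Minimal sketch of meta-learning-based ranking of pre-processing operators.
    # The synthetic meta-dataset stands in for the one PRESISTANT would build from
    # past experiments; operators and meta-features are illustrative only.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    OPERATORS = ["discretize", "normalize", "replace_missing", "pca"]

    # Meta-dataset rows: [n_instances, n_attributes, class_entropy, operator_id] -> observed accuracy gain.
    X_meta = rng.random((200, 3))
    op_ids = rng.integers(0, len(OPERATORS), size=200)
    X_train = np.column_stack([X_meta, op_ids])
    y_gain = rng.normal(0.0, 0.05, size=200)  # accuracy delta after applying the operator

    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_gain)

    def rank_operators(dataset_meta_features):
        """Rank candidate operators by predicted impact on classification accuracy."""
        rows = [np.append(dataset_meta_features, i) for i in range(len(OPERATORS))]
        preds = model.predict(np.array(rows))
        return sorted(zip(OPERATORS, preds), key=lambda p: -p[1])

    print(rank_operators(np.array([0.4, 0.7, 0.2])))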
  • Sergi Nadal, Oscar Romero, Alberto Abelló, Panos Vassiliadis, Stijn Vansummeren. An integration-oriented ontology to govern evolution in Big Data ecosystems. Information Systems 79. Elsevier, 2019. Pages 3-19. ISSN: 0306-4379. DOI: 10.1016/j.is.2018.01.006
    Big Data architectures allow heterogeneous data, from multiple sources, to be flexibly stored and processed in their original format. The structure of those data, commonly supplied by means of REST APIs, is continuously evolving. Thus, data analysts need to adapt their analytical processes after each API release. This gets more challenging when performing an integrated or historical analysis. To cope with such complexity, in this paper we present the Big Data Integration ontology, the core construct to govern the data integration process under schema evolution by systematically annotating it with information regarding the schema of the sources. We present a query rewriting algorithm that, using the annotated ontology, converts queries posed over the ontology to queries over the sources. To cope with syntactic evolution in the sources, we present an algorithm that semi-automatically adapts the ontology upon new releases. This guarantees that ontology-mediated queries correctly retrieve data from the most recent schema version, as well as correctness in historical queries. A functional and performance evaluation on real-world APIs is performed to validate our approach.
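
    As a rough illustration of the ontology-mediated rewriting idea (not the paper's RDF-based algorithm), the sketch below maps ontology-level attribute names to source-level fields per schema version, so the same query can be resolved against each release; the endpoints, attributes, and versions are hypothetical.

    # Illustrative sketch of query rewriting under schema evolution: ontology-level
    # attributes are translated to source-level fields for a chosen schema version.
    WRAPPERS = {
        "v1": {"endpoint": "https://api.example.com/v1/posts",
               "attributes": {"post_id": "id", "likes": "like_count"}},
        "v2": {"endpoint": "https://api.example.com/v2/posts",
               "attributes": {"post_id": "identifier", "likes": "reactions.likes"}},
    }

    def rewrite(query_attrs, version):
        """Translate ontology-level attribute names into a source-level query."""
        wrapper = WRAPPERS[version]
        try:
            fields = [wrapper["attributes"][a] for a in query_attrs]
        except KeyError as missing:
            raise ValueError(f"attribute {missing} not mapped in schema version {version}")
        return {"endpoint": wrapper["endpoint"], "select": fields}

    # The same ontology-level query resolves against both schema versions.
    for v in ("v1", "v2"):
        print(rewrite(["post_id", "likes"], v))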
  • Daria Glushkova, Petar Jovanovic, Alberto Abelló. Mapreduce performance model for Hadoop 2.x. Information Systems 79. Elsevier, 2019. Pages 32-43. ISSN: 0306-4379. DOI: 10.1016/j.is.2017.11.006
    MapReduce is a popular programming model for distributed processing of large data sets. Apache Hadoop is one of the most common open-source implementations of this paradigm. Performance analysis of concurrent job executions has been recognized as a challenging problem that, at the same time, may provide reasonably accurate job response time estimation at significantly lower cost than experimental evaluation of real setups. In this paper, we tackle the challenge of defining a MapReduce performance model for Hadoop 2.x. While there are several efficient approaches for modeling the performance of MapReduce workloads in Hadoop 1.x, they cannot be applied to Hadoop 2.x due to fundamental architectural changes and dynamic resource allocation in Hadoop 2.x. Thus, the proposed solution is based on an existing performance model for Hadoop 1.x, but takes into consideration the architectural changes and captures the execution flow of a MapReduce job by using a queueing network model. This way, the cost model reflects the intra-job synchronization constraints that occur due to contention at shared resources. The accuracy of our solution is validated via comparison of our model estimates against measurements in a real Hadoop 2.x setup.
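
    For intuition only, the sketch below gives a back-of-the-envelope, wave-based job time estimate under a fixed number of YARN containers; it is not the queueing-network model of the paper, and all task counts and durations are made up.

    # Wave-based estimate of MapReduce job time with a limited number of containers.
    # Reduces cannot finish before all maps complete (intra-job synchronization barrier).
    import math

    def estimate_job_time(n_map, n_reduce, avg_map_s, avg_reduce_s, containers):
        map_waves = math.ceil(n_map / containers)
        reduce_waves = math.ceil(n_reduce / containers)
        return map_waves * avg_map_s + reduce_waves * avg_reduce_s

    # 400 map tasks and 50 reduce tasks on 20 concurrently available containers.
    print(estimate_job_time(400, 50, avg_map_s=12.0, avg_reduce_s=30.0, containers=20))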
  • Robert Wrembel, Alberto Abelló, Il-Yeol Song. DOLAP data warehouse research over two decades: Trends and challenges. Information Systems 85. Elsevier, 2019. Pages 44-47. ISSN: 0306-4379. DOI: 10.1016/j.is.2019.06.004
    This paper introduces the Information Systems special issue including the four best papers submitted to DOLAP 2018. Additionally, the 20th anniversary of DOLAP motivated an analysis of DOLAP topics, as follows. First, the DOLAP topics of the last five years were compared with those of VLDB, SIGMOD, and ICDE. Next, the DOLAP topics were analyzed over its 20 years of history. Finally, the analysis is concluded with the list of the most frequent research topics of the aforementioned conferences and still open research problems.
  • Rana Faisal Munir, Alberto Abelló, Oscar Romero, Maik Thiele, Wolfgang Lehner. Automatically Configuring Parallelism for Hybrid Layouts. Short paper in New Trends in Databases and Information Systems (ADBIS). Communications in Computer and Information Science, vol 1064. Springer, 2019. Pages 120-125. ISBN: 978-3-030-30278-8. DOI: 10.1007/978-3-030-30278-8_15
    Distributed processing frameworks process data in parallel by dividing it into multiple partitions, and each partition is processed in a separate task. The number of tasks is always determined based on the total file size. However, in the case of hybrid layouts this can lead to launching more tasks than needed, because hybrid layouts allow reading less data for certain operations (e.g., projection, selection). The over-provisioning of tasks may increase the job execution time and induce significant waste of computing resources, the latter because each task introduces extra overhead (e.g., initialization, garbage collection).

    To allow a more efficient use of resources and reduce the job execution time, we propose a cost-based approach that decides the number of tasks based on the data being read. The proposed cost-model can be utilized in a multi-objective approach to decide both the number of tasks and number of machines for execution.
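
    A minimal sketch of the core idea, assuming a simple scan: size the number of tasks from an estimate of the bytes actually read (after projection and selection on a hybrid layout) instead of the total file size; the target partition size and the read estimates are illustrative, not the paper's cost model.

    # Decide the number of tasks from the data being read rather than the file size.
    import math

    def tasks_for_scan(total_bytes, selected_columns_fraction, selectivity,
                       target_bytes_per_task=128 * 1024 * 1024):
        bytes_read = total_bytes * selected_columns_fraction * selectivity
        return max(1, math.ceil(bytes_read / target_bytes_per_task))

    # A 100 GB file where the query touches 20% of the columns and 30% of the row groups.
    print(tasks_for_scan(100 * 1024**3, selected_columns_fraction=0.2, selectivity=0.3))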
  • Rediana Koçi, Xavier Franch, Petar Jovanovic, Alberto Abelló. Classification of Changes in API Evolution. Short paper in IEEE 23rd International Enterprise Distributed Object Computing Conference (EDOC). IEEE, 2019. Pages 243-249. ISBN: 978-1-7281-2702-6. ISSN: 2325-6362. DOI: 10.1109/EDOC.2019.00037
    Applications typically communicate with each other, accessing and exposing data and features by using Application Programming Interfaces (APIs). Even though API consumers expect APIs to be steady and well established, APIs are prone to continuous changes, experiencing different evolutionary phases through their lifecycle. These changes are of different types, are caused by different needs, and affect consumers in different ways. In this paper, we identify and classify the changes that often happen to APIs, and investigate how all these changes are reflected in the documentation, release notes, issue tracker, and API usage logs. The analysis of each step of a change, from its implementation to the impact that it has on API consumers, will help us to get a bigger picture of API evolution. Thus, we review the current state of the art in API evolution and, as a result, we define a classification framework considering both the changes that may occur to APIs and the reasons behind them. In addition, we exemplify the framework using a software platform offering a Web API, called District Health Information System (DHIS2), used collaboratively by several departments of the World Health Organization (WHO).
  • Ayman Alserafi, Alberto Abelló, Oscar Romero, Toon Calders. Keeping the Data Lake in Form: DS-kNN Datasets Categorization Using Proximity Mining. 9th International Conference on Model and Data Engineering (MEDI). LNCS 11815. Springer, 2019. Pages 35-49. ISBN: 978-3-030-32065-2. DOI: 10.1007/978-3-030-32065-2_3
    With the growth of the number of datasets stored in data repositories, there has been a trend of using Data Lakes (DLs) to store such data. DLs store datasets in their raw formats without any transformations or preprocessing, with accessibility available using schema-on-read. This makes it difficult for analysts to find datasets that can be crossed and that belong to the same topic. To support them in this DL governance challenge, we propose in this paper an algorithm for categorizing datasets in the DL into pre-defined topic-wise categories of interest. We utilise a k-NN approach for this task which uses a proximity score for computing similarities of datasets based on metadata. We test our algorithm on a real-life DL with a known ground-truth categorization. Our approach is successful in detecting the correct categories for datasets and outliers with a precision of more than 90% and recall rates exceeding 75% in specific settings.
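
    The toy example below conveys the DS-kNN flavor under a strong simplification: a proximity score computed as the Jaccard similarity of attribute names, with a k-NN vote over already-categorized datasets; the real proximity mining combines richer metadata.

    # k-NN categorization of a dataset by metadata proximity (attribute-name Jaccard).
    from collections import Counter

    def proximity(meta_a, meta_b):
        a, b = set(meta_a["attributes"]), set(meta_b["attributes"])
        return len(a & b) / len(a | b) if a | b else 0.0

    def categorize(new_meta, labeled, k=3):
        ranked = sorted(labeled, key=lambda d: proximity(new_meta, d), reverse=True)[:k]
        votes = Counter(d["category"] for d in ranked)
        return votes.most_common(1)[0][0]

    lake = [
        {"attributes": ["patient_id", "age", "diagnosis"], "category": "health"},
        {"attributes": ["account", "amount", "currency"], "category": "finance"},
        {"attributes": ["patient_id", "treatment", "age"], "category": "health"},
    ]
    print(categorize({"attributes": ["age", "diagnosis", "hospital"]}, lake, k=3))  # -> health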
  • Sergi Nadal, Alberto Abelló. Integration-Oriented Ontology. Encyclopedia of Big Data Technologies (Editors-in-chief: Sherif Sakr, Albert Zomaya). Springer, 2019. Pages: 1-5. ISBN: 978-3-319-63962-8. DOI: 10.1007/978-3-319-63962-8_13-1
  • Sergi Nadal. Metadata-Driven Data Integration. PhD Thesis, Universitat Politècnica de Catalunya. Barcelona, May 2019.
    Data has an undeniable impact on society. Storing and processing large amounts of available data is currently one of the key success factors for an organization. In order to carry out these data exploitation tasks, organizations first perform data integration, combining data from multiple sources to yield a unified view over them. Nonetheless, we are recently witnessing a change represented by huge and heterogeneous amounts of data. Indeed, 90% of the data in the world has been generated in the last two years. This requires revisiting the traditional integration assumptions to cope with new requirements posed by such data-intensive settings.

    This PhD thesis aims to provide a novel framework for data integration in the context of data-intensive ecosystems, which entails dealing with vast amounts of heterogeneous data, from multiple sources and in their original format. To this end, we advocate an integration process consisting of sequential activities governed by a shared repository of metadata. From a stewardship perspective, these activities are the deployment of a data integration architecture, followed by the population of such shared metadata. From a data consumption perspective, the activities are virtual and materialized data integration, the former an exploratory task and the latter a consolidation one. Following the proposed framework, we focus on providing contributions to each of the four activities. We begin by proposing a software reference architecture for semantic-aware data-intensive systems. Such architecture serves as a blueprint to deploy a stack of systems with metadata as a first-class citizen. Next, we propose a graph-based metadata model as a formalism for metadata management. We put the focus on supporting schema and data source evolution, a predominant factor in the heterogeneous sources at hand. For virtual integration, we propose query rewriting algorithms that rely on the previously proposed metadata model. We additionally consider semantic heterogeneities in the data sources, which the proposed algorithms are capable of automatically resolving. Finally, the thesis focuses on the materialized integration activity and, to this end, proposes a method to select intermediate results to materialize in data-intensive flows. Overall, the results of this thesis serve as a contribution to the field of data integration in current data-intensive ecosystems.
  • Faisal Munir. Storage format selection and optimization for materialized intermediate results in data-intensive flows. PhD Thesis, Universitat Politècnica de Catalunya. Barcelona, December 2019.
    Modern organizations produce and collect large volumes of data that need to be processed repeatedly and quickly for gaining business insights. For such processing, typically, Data-intensive Flows (DIFs) are deployed on distributed processing frameworks. The DIFs of different users have many computation overlaps (i.e., parts of the processing are duplicated), thus wasting computational resources and increasing the overall cost. The output of these computation overlaps (known as intermediate results) can be materialized for reuse, which, if properly done, helps to reduce the cost and save computational resources. Furthermore, the way such outputs are materialized must be considered, as different storage layouts (i.e., horizontal, vertical, and hybrid) can be used to reduce the I/O cost.

    In this PhD work, we first propose a novel approach for automatically materializing the intermediate results of DIFs through a multi-objective optimization method, which can tackle multiple and conflicting quality metrics. Next, we study the behavior of the different operators of DIFs that are the first to process the loaded materialized results. Based on this study, we devise a rule-based approach that decides the storage layout for materialized results based on the subsequent operation types. Despite improving the cost in general, the heuristic rules do not consider the amount of data read while making the choice, which could lead to a wrong decision. Thus, we design a cost model that is capable of finding the right storage layout for every scenario. The cost model uses data and workload characteristics to estimate the I/O cost of a materialized intermediate result with different storage layouts and chooses the one with minimum cost. The results show that storage layouts help to reduce the loading time of materialized results and, overall, they improve the performance of DIFs. The thesis also focuses on the optimization of the configurable parameters of hybrid layouts. We propose ATUN-HL (Auto TUNing Hybrid Layouts), which, based on the same cost model and given the workload and the characteristics of data, finds the optimal values for the configurable parameters of hybrid layouts (i.e., Parquet).

    Finally, the thesis also studies the impact of parallelism in DIFs and hybrid layouts. Our proposed cost model helps to devise an approach for fine-tuning the parallelism by deciding the number of tasks and machines to process the data. Thus, the cost model proposed in this thesis enables choosing the best possible storage layout for materialized intermediate results, tuning the configurable parameters of hybrid layouts, and estimating the number of tasks and machines for the execution of DIFs.
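
    The following toy comparison only illustrates the shape of such a layout decision: estimate the bytes read by each layout for a given workload and pick the cheapest. The formulas ignore row-group statistics, encoding, compression, and seek overheads that the thesis' cost model accounts for, and the workload figures are invented.

    # Choose a storage layout for a materialized intermediate result by estimated I/O cost.
    def io_cost(layout, rows, row_size, cols_read_frac, rows_read_frac):
        total = rows * row_size                # total bytes of the intermediate result
        if layout == "horizontal":             # row-oriented: read all columns of the selected rows
            return total * rows_read_frac
        if layout == "vertical":               # column-oriented: read the selected columns in full
            return total * cols_read_frac
        if layout == "hybrid":                 # read only the selected columns of the selected row groups
            return total * cols_read_frac * rows_read_frac
        raise ValueError(layout)

    def best_layout(**workload):
        return min(("horizontal", "vertical", "hybrid"),
                   key=lambda layout: io_cost(layout, **workload))

    # Projection-heavy, selective workload.
    print(best_layout(rows=10**7, row_size=400, cols_read_frac=0.1, rows_read_frac=0.2))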
2018
  • Rana Faisal Munir, Alberto Abelló, Oscar Romero, Maik Thiele, Wolfgang Lehner. ATUN-HL: Auto Tuning of Hybrid Layouts Using Workload and Data Characteristics. In Proceedings of 22nd European Conference on Advances in Databases and Information Systems (ADBIS), Budapest (Hungary), September 2-5, 2018. LNCS 11019, Springer 2018. Pages 200-215. ISBN 978-3-319-98397-4. DOI 10.1007/978-3-319-98398-1_14
    Ad-hoc analysis implies processing data in near real-time. Thus, raw data (i.e., neither normalized nor transformed) is typically dumped into a distributed engine, where it is generally stored in a hybrid layout. Hybrid layouts divide data into horizontal partitions and, inside each partition, data are stored vertically. They keep statistics for each horizontal partition and also support encoding (i.e., dictionary) and compression to reduce the size of the data. Their built-in support for many ad-hoc operations (e.g., selection, projection, aggregation) makes hybrid layouts the best choice for most operations. The horizontal partition and dictionary sizes of hybrid layouts are configurable and can directly impact the performance of analytical queries. Hence, their default configuration cannot be expected to be optimal for all scenarios. In this paper, we present ATUN-HL (Auto TUNing Hybrid Layouts), which, based on a cost model and given the workload and the characteristics of data, finds the best values for these parameters. We prototyped ATUN-HL for Apache Parquet, which is an open-source implementation of hybrid layouts in the Hadoop Distributed File System, to show its effectiveness. Our experimental evaluation shows that ATUN-HL provides on average 85% of all the potential performance improvement, and a 1.2x average speedup over the default configuration.
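
    The snippet below only shows where such tuned values would be applied when writing Parquet with pyarrow; the row group size and dictionary-encoded columns are placeholders, not values produced by the ATUN-HL cost model.

    # Writing Parquet with an explicitly chosen horizontal partition (row group) size
    # and dictionary encoding restricted to a low-cardinality column.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"user_id": list(range(1_000)),
                      "country": ["ES", "DE", "DK", "PL"] * 250})

    pq.write_table(
        table,
        "events.parquet",
        row_group_size=500,            # rows per horizontal partition
        use_dictionary=["country"],    # dictionary-encode only this column
    )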
  • Besim Bilalli, Alberto Abelló, Tomàs Aluja-Banet, Rana Faisal Munir, Robert Wrembel. PRESISTANT: Data Pre-processing Assistant. In Proceedings of Information Systems in the Big Data Era (CAiSE Forum), demo session, Tallinn (Estonia), June 11-15, 2018. LNBIP 317, Springer 2018. Pages 57-65. ISBN 978-3-319-92900-2. DOI 10.1007/978-3-319-92901-9_6
    A concrete classification algorithm may perform differently on datasets with different characteristics, e.g., it might perform better on a dataset with continuous attributes rather than with categorical attributes, or the other way around. Typically, in order to improve the results, datasets need to be pre-processed. Taking into account all the possible pre-processing operators, there exists a staggeringly large number of alternatives and non-experienced users become overwhelmed. Trial and error is not feasible in the presence of big amounts of data. We developed a method and a tool, PRESISTANT, with the aim of answering the need for user assistance during data pre-processing. Leveraging ideas from meta-learning, PRESISTANT is capable of assisting the user by recommending pre-processing operators that ultimately improve the classification performance. The user selects a classification algorithm, from the ones considered, and then PRESISTANT proposes candidate transformations to improve the result of the analysis. In the demonstration, participants will experience, first hand, how PRESISTANT easily and effectively ranks the pre-processing operators.
  • Xavier Franch, Jolita Ralyté, Anna Perini, Alberto Abelló, David Ameller, Jesús Gorroñogoitia, Sergi Nadal, Marc Oriol, Norbert Seyff, Alberto Siena, Angelo Susi. A Situational Approach for the Definition and Tailoring of a Data-Driven Software Evolution Method. In Proceedings of 30th International Conference Advanced Information Systems Engineering (CAiSE), Tallinn (Estonia), June 11-15, 2018. LNCS 10816, Springer 2018. Pages 603-618. ISBN: 978-3-319-91562-3. DOI 10.1007/978-3-319-91563-0_37
    Successful software evolution heavily depends on the selection of the right features to be included in the next release. Such selection is difficult, and companies often report bad experiences about user acceptance. To overcome this challenge, there is an increasing number of approaches that propose intensive use of data to drive evolution. This trend has motivated the SUPERSEDE method, which proposes the collection and analysis of user feedback and monitoring data as the baseline to elicit and prioritize requirements, which are then used to plan the next release. However, every company may be interested in tailoring this method depending on factors like project size, scope, etc. In order to provide a systematic approach, we propose the use of Situational Method Engineering to describe SUPERSEDE and guide its tailoring to a particular context.
  • Sergi Nadal, Alberto Abelló, Oscar Romero, Stijn Vansummeren, Panos Vassiliadis. MDM: Governing Evolution in Big Data Ecosystems. In Proceedings of the 21st International Conference on Extending Database Technology (EDBT), demo session, Vienna (Austria), March 26-29, 2018. OpenProceedings.org, 2018. Pages 682-685. ISBN 978-3-89318-078-3. ISSN: 2367-2005. DOI 10.5441/002/edbt.2018.84
    On-demand integration of multiple data sources is a critical requirement in many Big Data settings. This has been coined the data variety challenge, which refers to the complexity of dealing with a heterogeneous set of data sources to enable their integrated analysis. In Big Data settings, data sources are commonly represented by external REST APIs, which provide data in their original format and continuously apply changes to their structure (i.e., schema). Thus, data analysts face the challenge of integrating such multiple sources, and then continuously adapting their analytical processes to changes in the schema. To address these challenges, in this paper, we present the Metadata Management System, shortly MDM, a tool that supports data stewards and analysts in managing the integration and analysis of multiple heterogeneous sources under schema evolution. MDM adopts a vocabulary-based integration-oriented ontology to conceptualize the domain of interest and relies on local-as-view mappings to link it with the sources. MDM provides user-friendly mechanisms to manage the ontology and mappings. Finally, a query rewriting algorithm ensures that queries posed to the ontology are correctly resolved to the sources in the presence of multiple schema versions, a process transparent to data analysts. On-site, we will showcase using real-world examples how MDM facilitates the management of multiple evolving data sources and enables their integrated analysis.
  • Emmanouil Valsomatzis, Torben Bach Pedersen, Alberto Abelló. Day-ahead Trading of Aggregated Energy Flexibility. In Proceedings of the Ninth International Conference on Future Energy Systems (e-Energy), Karlsruhe (Germany), June 12-15, 2018. ACM, 2018. Pages 134-138. ISBN: 978-1-4503-5767-8. DOI 10.1145/3208903.3208936
    Flexibility of small loads, in particular from Electric Vehicles (EVs), has recently attracted a lot of interest due to their possibility of participating in the energy market and the new commercial potential. Different from existing work, the aggregation technique proposed in this paper produces flexible aggregated loads from EVs taking into account technical market requirements. The flexible aggregated loads can be further traded in the day-ahead market by a Balance Responsible Party (BRP) via so-called flexible orders. As a result, the BRP can achieve more than 19% cost reduction in energy purchase based on the 2017 real electricity prices from the Danish electricity market.
  • Moditha Hewasinghage, Jovan Varga, Alberto Abelló, Esteban Zimányi. Managing Polyglot Systems Metadata with Hypergraphs. In Proceedings of 37th International Conference on Conceptual Modeling (ER), Xi'an (China), October 22-25, 2018. LNCS 11157. Springer, 2018. Pages 463-478. ISBN (printed): 978-3-030-00846-8. ISBN (electronic): 978-3-030-00847-5. DOI 10.1007/978-3-030-00847-5_33
    A single type of data store can hardly fulfill all end-user requirements in the NoSQL world. Therefore, polyglot systems use different types of NoSQL datastores in combination. However, the heterogeneity of the data storage models makes managing the metadata a complex task in such systems, with only a handful of research efforts addressing this. In this paper, we propose a hypergraph-based approach for representing the catalog of metadata in a polyglot system. Taking an existing common programming interface to NoSQL systems, we extend and formalize it as hypergraphs for managing metadata. Then, we define design constraints and query transformation rules for three representative data store types. Furthermore, we propose a simple query rewriting algorithm using the catalog itself for these data store types and provide a prototype implementation. Finally, we show the feasibility of our approach on a use case of an existing polyglot system.
  • Marc Oriol, Melanie J. C. Stade, Farnaz Fotrousi, Sergi Nadal, Jovan Varga, Norbert Seyff, Alberto Abelló, Xavier Franch, Jordi Marco, Oleg Schmidt. FAME: Supporting Continuous Requirements Elicitation by Combining User Feedback and Monitoring. 26th International Requirements Engineering Conference (RE), Banff (Canada), August 20-24, 2018. IEEE, 2018. Pages 217-227. ISBN (printed): 978-1-5386-7419-2, ISBN (electronic): 978-1-5386-7418-5. DOI 10.1109/RE.2018.00030
    Context: Software evolution ensures that software systems in use stay up to date and provide value for end-users. However, it is challenging for requirements engineers to continuously elicit needs for systems used by heterogeneous end-users who are out of organisational reach. Objective: We aim at supporting continuous requirements elicitation by combining user feedback and usage monitoring. Online feedback mechanisms enable end-users to remotely communicate problems, experiences, and opinions, while monitoring provides valuable information about runtime events. It is argued that bringing both information sources together can help requirements engineers to understand end-user needs better. Method/Tool: We present FAME, a framework for the combined and simultaneous collection of feedback and monitoring data in web and mobile contexts to support continuous requirements elicitation. In addition to a detailed discussion of our technical solution, we present the first evidence that FAME can be successfully introduced in real-world contexts. Therefore, we deployed FAME in a web application of a German small and medium-sized enterprise (SME) to collect user feedback and usage data. Results/Conclusion: Our results suggest that FAME not only can be successfully used in industrial environments but that bringing feedback and monitoring data together helps the SME to improve their understanding of end-user needs, ultimately supporting continuous requirements elicitation.
  • Besim Bilalli, Alberto Abelló, Tomàs Aluja-Banet, Robert Wrembel. Intelligent assistance for data pre-processing. Computer Standards & Interfaces 57. Elsevier, 2018. Pages 101-109. ISSN: 0920-5489. DOI: 10.1016/j.csi.2017.05.004
    A data mining algorithm may perform differently on datasets with different characteristics, e.g., it might perform better on a dataset with continuous attributes rather than with categorical attributes, or the other way around. Typically, a dataset needs to be pre-processed before being mined. Taking into account all the possible pre-processing operators, there exists a staggeringly large number of alternatives. As a consequence, non-experienced users become overwhelmed with pre-processing alternatives. In this paper, we show that the problem can be addressed by automating the pre-processing with the support of meta-learning. To this end, we analyzed a wide range of data pre-processing techniques and a set of classification algorithms. For each classification algorithm that we consider and a given dataset, we are able to automatically suggest the transformations that improve the quality of the results of the algorithm on the dataset. Our approach will help non-expert users to more effectively identify the transformations appropriate to their applications, and hence to achieve improved results.
  • Rana Faisal Munir, Sergi Nadal, Oscar Romero, Alberto Abelló, Petar Jovanovic, Maik Thiele, Wolfgang Lehner. Intermediate Results Materialization Selection and Format for Data-Intensive Flows. Fundamenta Informaticae 163(2). IOS Press, 2018. Pages 111-138. ISSN: 0169-2968. DOI: 10.3233/FI-2018-1734
    Data-intensive flows deploy a variety of complex data transformations to build information pipelines from data sources to different end users. As data are processed, these workflows generate large intermediate results, typically pipelined from one operator to the following ones. Materializing intermediate results, shared among multiple flows, brings benefits not only in terms of performance but also in resource usage and consistency. Similar ideas have been proposed in the context of data warehouses, which are studied under the materialized view selection problem. With the rise of Big Data systems, new challenges emerge due to new quality metrics captured by service level agreements which must be taken into account. Moreover, the way such results are stored must be reconsidered, as different data layouts can be used to reduce the I/O cost. In this paper, we propose a novel approach for automatic selection of multi-objective materialization of intermediate results in data-intensive flows, which can tackle multiple and conflicting quality objectives. In addition, our approach chooses the optimal storage data format for selected materialized intermediate results based on subsequent access patterns. The experimental results show that our approach provides 40% better average speedup with respect to the current state-of-the-art, as well as an improvement on disk access time of 18% as compared to fixed format solutions.
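
    As a deliberately simplified illustration of the selection problem (a weighted sum instead of the paper's multi-objective optimization), the sketch below scores candidate intermediate results by the recomputation time they save across flows against their storage footprint; the candidates, weights, and figures are invented.

    # Score candidate intermediate results for materialization and rank them.
    CANDIDATES = [
        {"name": "joined_clickstream", "reuse_count": 5, "size_gb": 120, "recompute_min": 40},
        {"name": "cleaned_customers",  "reuse_count": 9, "size_gb": 2,   "recompute_min": 6},
        {"name": "aggregated_sales",   "reuse_count": 3, "size_gb": 15,  "recompute_min": 25},
    ]

    def score(c, w_time=1.0, w_storage=0.5):
        saved_minutes = c["reuse_count"] * c["recompute_min"]      # benefit of reuse
        return w_time * saved_minutes - w_storage * c["size_gb"]   # penalize storage cost

    for c in sorted(CANDIDATES, key=score, reverse=True):
        print(c["name"], round(score(c), 1))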
  • Enrico Gallinucci, Matteo Golfarelli, Stefano Rizzi, Alberto Abelló, Oscar Romero. Interactive multidimensional modeling of linked data for exploratory OLAP. In Information Systems 77. Elsevier, 2018. Pages 86-104. ISSN: 0306-4379. DOI: 10.1016/j.is.2018.06.004
    Exploratory OLAP aims at coupling the precision and detail of corporate data with the information wealth of LOD. While some techniques to create, publish, and query RDF cubes are already available, little has been said about how to contextualize these cubes with situational data in an on-demand fashion. In this paper we describe an approach, called iMOLD, that enables non-technical users to enrich an RDF cube with multidimensional knowledge by discovering aggregation hierarchies in LOD. This is done through a user-guided process that recognizes in the LOD the recurring modeling patterns that express roll-up relationships between RDF concepts, then translates these patterns into aggregation hierarchies to enrich the RDF cube. Two families of aggregation patterns are identified, based on associations and generalization respectively, and the algorithms for recognizing them are described. To evaluate iMOLD in terms of efficiency and effectiveness we compare it with a related approach in the literature, we propose a case study based on DBpedia, and we discuss the results of a test made with real users.
  • Alberto Abelló, Xavier de Palol, Mohand-Saïd Hacid. Approximating the Schema of a Set of Documents by Means of Resemblance. In Journal on Data Semantics 7(2). Springer, 2018. Pages 87-105. ISSN: 1861-2032. DOI: 10.1007/s13740-018-0088-0
    The WWW contains a huge amount of documents. Some of them share the same subject, but are generated by different people or even by different organizations. A semi-structured model allows sharing documents that do not have exactly the same structure. However, it does not facilitate the understanding of such heterogeneous documents. In this paper, we offer a characterization and an algorithm to obtain a representative (in terms of a resemblance function) of a set of heterogeneous semi-structured documents. We approximate the representative so that the resemblance function is maximized. Then, the algorithm is generalized to deal with repetitions and different classes of documents. Although an exact representative could always be found using an unlimited number of optional elements, it would cause an overfitting problem. The size of an exact representative for a set of heterogeneous documents may even make it useless. Our experiments show that, for users, it is easier and faster to deal with smaller representatives, even compensating for the loss due to the approximation.
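
    A toy approximation of the representative idea, assuming documents are just sets of element names: keep the elements that appear in at least a given fraction of the documents. The actual paper maximizes a resemblance function and handles repetitions and document classes; this only conveys the size/fit trade-off.

    # Approximate a representative of heterogeneous documents by element support.
    from collections import Counter

    def representative(docs, min_support=0.5):
        counts = Counter(k for d in docs for k in d)
        threshold = min_support * len(docs)
        return {k for k, c in counts.items() if c >= threshold}

    docs = [
        {"title", "author", "year"},
        {"title", "author", "publisher"},
        {"title", "year", "isbn"},
    ]
    # Lowering min_support grows the representative (better fit, larger size).
    print(representative(docs, 0.5))   # {'title', 'author', 'year'}
    print(representative(docs, 0.9))   # {'title'}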
  • Il-Yeol Song, Alberto Abelló, Robert Wrembel. Proceedings of the 20th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data co-located with 10th EDBT/ICDT Joint Conference (EDBT/ICDT 2018), Vienna, Austria, March 26-29, 2018. CEUR Workshop Proceedings 2062, CEUR-WS.org, 2018
  • Besim Bilalli. Learning the impact of data pre-processing in data analysis. PhD Thesis, Universitat Politècnica de Catalunya. Barcelona, June 2018.
    There is a clear correlation between data availability and data analytics, and hence, with the increase of data availability (unavoidable according to Moore's law), the need for data analytics increases too. This certainly engages many more people, not necessarily experts, in performing analytics tasks. However, the different, challenging, and time-consuming steps of the data analytics process overwhelm non-experts, who require support (e.g., through automation or recommendations).

    A very important and time-consuming step that marks itself out from the rest is the data pre-processing step. Data pre-processing is challenging but, at the same time, has a heavy impact on the overall analysis. In this regard, previous works have focused on providing user assistance in data pre-processing, but without being concerned with its impact on the analysis. Hence, the goal has generally been to enable analysis through data pre-processing and not to improve it. In contrast, this thesis aims at developing methods that provide assistance in data pre-processing with the sole goal of improving the result of the overall analysis (e.g., increasing the predictive accuracy of a classifier).

    To this end, we propose a method and define an architecture that leverages ideas from meta-learning to learn the relationship between transformations (i.e., pre-processing operators) and mining algorithms (i.e., classification algorithms). This eventually enables ranking and recommending transformations according to their potential impact on the analysis.

    To reach this goal, we first study the currently available methods and systems that provide user assistance, either for the individual steps of data analytics or for the whole process altogether. Next, we classify the metadata these different systems use and then specifically focus on the metadata used in meta-learning. We apply a method to study the predictive power of these metadata and we extract and select the metadata that are most relevant.

    Finally, we focus on user assistance in the pre-processing step. We devise an architecture and build a tool, PRESISTANT, that, given a classification algorithm, is able to recommend pre-processing operators that, once applied, positively impact the final results (e.g., increase the predictive accuracy). Our results show that providing assistance in data pre-processing with the goal of improving the result of the analysis is feasible and also very useful for non-experts. Furthermore, this thesis is a step towards demystifying the non-trivial task of pre-processing, which has been an exclusive asset in the hands of experts.
  • Rohit Kumar. Temporal Graph Mining and Distributed Processing. PhD Thesis, Universitat Politècnica de Catalunya. Barcelona, June 2018.
    With the recent growth of social media platforms and the human desire to interact with the digital world, a lot of human-human and human-device interaction data is being generated every second. With the boom of Internet of Things (IoT) devices, a lot of device-device interactions are also now on the rise. All these interactions are nothing but a representation of how the underlying network is connecting different entities over time. These interactions, when modeled as an interaction network, present many unique opportunities to uncover interesting patterns and to understand the dynamics of the network. Understanding the dynamics of the network is very important because it encapsulates the way we communicate, socialize, consume information, and get influenced. To this end, in this PhD thesis, we focus on analyzing an interaction network to understand how the underlying network is being used. We define an interaction network as a sequence of time-stamped interactions over the edges E of a static graph G = (V, E). Interaction networks can be used to model many real-world networks; for example, in a social network or a communication network, each interaction over an edge represents an interaction between two users, e.g., emailing, making a call, re-tweeting, or, in the case of a financial network, an interaction between two accounts to represent a transaction.

    We analyze interaction networks under two settings. In the first setting, we study the interaction network under a sliding window model. We assume a node can pass information to other nodes if they are connected to it using edges present in a time window. In this model, we study how the importance or centrality of a node evolves over time. In the second setting, we put additional constraints on how information flows between nodes. We assume a node can pass information to other nodes only if there is a temporal path between them. To restrict the length of the temporal paths, we consider a time window in this approach as well. We apply this model to solve the time-constrained influence maximization problem. By analyzing the interaction network data under our model, we find the top-k most influential nodes. We test our model both on human-human interaction using social network data as well as on location-location interaction using location-based social network (LBSN) data. In the same setting, we also mine temporal cyclic paths to understand the communication patterns in a network. Temporal cycles have many applications and appear naturally in communication networks, where one person posts a message and after a while reacts to a thread of reactions from peers on the post. In financial networks, on the other hand, the presence of a temporal cycle could be indicative of certain types of fraud. We provide efficient algorithms for all our analyses and test their efficiency and effectiveness on real-world data.

    Finally, given that many of the algorithms we study have huge computational demands, we also studied distributed graph processing algorithms. An important aspect of these algorithms is to correctly partition the graph data between different machines. A lot of research has been done on efficient graph partitioning strategies, but there is no single partitioning strategy that is good for all kinds of graphs and algorithms. Choosing the best partitioning strategy is non-trivial and is mostly a trial-and-error exercise. To address this problem, we provide a cost-model-based approach to give a better understanding of how a given partitioning strategy performs for a given graph and algorithm.
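
    As a minimal illustration of the first setting only, the sketch below maintains degree counts over the interactions that fall inside a sliding time window of an interaction stream; it is a simplification of the centrality measures studied in the thesis, and the stream is made up.

    # Sliding-window degree counts over a stream of time-stamped interactions.
    from collections import defaultdict, deque

    def windowed_degrees(interactions, window):
        """interactions: iterable of (timestamp, u, v), assumed sorted by timestamp."""
        active = deque()
        degree = defaultdict(int)
        for t, u, v in interactions:
            active.append((t, u, v))
            degree[u] += 1
            degree[v] += 1
            while active and active[0][0] <= t - window:   # expire old interactions
                _, ou, ov = active.popleft()
                degree[ou] -= 1
                degree[ov] -= 1
            yield t, dict(degree)

    stream = [(1, "a", "b"), (2, "b", "c"), (5, "a", "c"), (9, "c", "d")]
    for t, deg in windowed_degrees(stream, window=4):
        print(t, deg)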
2017
  • Besim Bilalli, Alberto Abelló, Tomàs Aluja-Banet. On the predictive power of meta-features in OpenML. International Journal of Applied Mathematics and Computer Science 27(4). Walter de Gruyter, 2017. Pages 697-712. ISSN: 2083-8492. DOI: 10.1515/amcs-2017-0048
    The demand for performing data analysis is steadily rising. As a consequence, people of different profiles (i.e., non-experienced users) have started to analyze their data. However, this is challenging for them. A key step that poses difficulties and determines the success of the analysis is data mining (the model/algorithm selection problem). Meta-learning is a technique used for assisting non-expert users in this step. The effectiveness of meta-learning is, however, largely dependent on the description/characterization of datasets (i.e., the meta-features used for meta-learning). There is a need for improving the effectiveness of meta-learning by identifying and designing more predictive meta-features. In this work, we use a method from exploratory factor analysis to study the predictive power of different meta-features collected in OpenML, which is a collaborative machine learning platform that is designed to store and organize meta-data about datasets, data mining algorithms, models and their evaluations. We first use the method to extract latent features, which are abstract concepts that group together meta-features with common characteristics. Then, we study and visualize the relationship of the latent features with three different performance measures of four classification algorithms on hundreds of datasets available in OpenML, and we select the latent features with the highest predictive power. Finally, we use the selected latent features to perform meta-learning and we show that our method improves the meta-learning process. Furthermore, we design an easy-to-use application for retrieving different meta-data from OpenML, the biggest source of data in this domain.
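
    The sketch below shows the mechanics of extracting latent features from a matrix of dataset meta-features; the random matrix is a stand-in for the OpenML meta-features studied in the paper, and scikit-learn's FactorAnalysis is used here only as a convenient approximation of the exploratory factor analysis method.

    # Extract latent features (abstract groupings of meta-features) via factor analysis.
    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(1)
    # Rows: datasets; columns: meta-features (e.g., #instances, #attributes, entropy, ...).
    meta_features = rng.random((300, 12))

    fa = FactorAnalysis(n_components=3, random_state=1)
    latent = fa.fit_transform(meta_features)   # three latent features per dataset

    print(latent.shape)          # (300, 3)
    print(fa.components_.shape)  # loadings of each meta-feature on each latent feature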
  • Vasileios Theodorou, Alberto Abelló, Maik Thiele, Wolfgang Lehner. Frequent patterns in ETL workflows: An empirical approach. In Data Knowledge Engineering, volume 112. Elsevier, 2017. Pages 1-16. ISSN: 0169-023X. DOI: 10.1016/j.datak.2017.08.004
    The complexity of Business Intelligence activities has driven the proposal of several approaches for the effective modeling of Extract-Transform-Load (ETL) processes, based on the conceptual abstraction of their operations. Apart from fostering automation and maintainability, such modeling also provides the building blocks to identify and represent frequently recurring patterns. Despite some existing work on classifying ETL components and functionality archetypes, the issue of systematically mining such patterns and their connection to quality attributes such as performance has not yet been addressed. In this work, we propose a methodology for the identification of ETL structural patterns. We logically model the ETL workflows using labeled graphs and employ graph algorithms to identify candidate patterns and to recognize them on different workflows. We showcase our approach through a use case that is applied on implemented ETL processes from the TPC-DI specification and we present mined ETL patterns. Decomposing ETL processes to identified patterns, our approach provides a stepping stone for the automatic translation of ETL logical models to their conceptual representation and to generate fine-grained cost models at the granularity level of patterns.
  • Sergi Nadal, Victor Herrero, Oscar Romero, Alberto Abelló, Xavier Franch, Stijn Vansummeren, Danilo Valerio. A software reference architecture for semantic-aware Big Data systems. In Information & Software Technology, volume 90. Elsevier, 2017. Pages 75-92. ISSN: 0950-5849. DOI: 10.1016/j.infsof.2017.06.001
    Context: Big Data systems are a class of software systems that ingest, store, process and serve massive amounts of heterogeneous data, from multiple sources. Despite their undisputed impact in current society, their engineering is still in its infancy and companies find it difficult to adopt them due to their inherent complexity. Existing attempts to provide architectural guidelines for their engineering fail to take into account important Big Data characteristics, such as the management, evolution and quality of the data.
    Objective: In this paper, we follow software engineering principles to refine the Lambda-architecture, a reference model for Big Data systems, and use it as seed to create Bolster, a software reference architecture (SRA) for semantic-aware Big Data systems.
    Method: By including a new layer into the Lambda-architecture, the Semantic Layer, Bolster is capable of handling the most representative Big Data characteristics (i.e., Volume, Velocity, Variety, Variability and Veracity).
    Results: We present the successful implementation of Bolster in three industrial projects, involving five organizations. The validation results show high level of agreement among practitioners from all organizations with respect to standard quality factors.
    Conclusion: As an SRA, Bolster allows organizations to design concrete architectures tailored to their specific needs. A distinguishing feature is that it provides semantic-awareness in Big Data Systems. These are Big Data system implementations that have components to simplify data definition and exploitation. In particular, they leverage metadata (i.e., data describing data) to enable (partial) automation of data exploitation and to aid the user in their decision making processes. This simplification supports the differentiation of responsibilities into cohesive roles enhancing data governance.
  • Vasileios Theodorou, Petar Jovanovic, Alberto Abelló, Emona Nakuçi. Data generator for evaluating ETL process quality. In Information Systems, volume 63. Elsevier, 2017. Pages 80-100. ISSN: 0306-4379. DOI: 10.1016/j.is.2016.04.005
    Obtaining the right set of data for evaluating the fulfillment of different quality factors in the extract-transform-load (ETL) process design is rather challenging. First, the real data might be out of reach due to different privacy constraints, while manually providing a synthetic set of data is known to be a labor-intensive task that needs to take various combinations of process parameters into account. More importantly, having a single dataset usually does not represent the evolution of data throughout the complete process lifespan, hence missing the plethora of possible test cases. To facilitate such a demanding task, in this paper we propose an automatic data generator (i.e., Bijoux). Starting from a given ETL process model, Bijoux extracts the semantics of data transformations, analyzes the constraints they imply over input data, and automatically generates testing datasets. Bijoux is highly modular and configurable to enable end-users to generate datasets for a variety of interesting test scenarios (e.g., evaluating specific parts of an input ETL process design, with different input dataset sizes, different distributions of data, and different operation selectivities). We have developed a running prototype that implements the functionality of our data generation framework and here we report our experimental findings showing the effectiveness and scalability of our approach.
  • Vasileios Theodorou. Automating User-Centered Design of Data-Intensive Processes. PhD Thesis, Universitat Politècnica de Catalunya. Barcelona, January 2017.
    Business Intelligence (BI) enables organizations to collect and analyze internal and external business data to generate knowledge and business value, and to provide decision support at the strategic, tactical, and operational levels. The consolidation of data coming from many sources as a result of managerial and operational business processes, usually referred to as Extract-Transform-Load (ETL), is itself a statically defined process, and knowledge workers have little to no control over the characteristics of the presentable data to which they have access.
    There are two main reasons that dictate the reassessment of this stiff approach in the context of modern business environments. The first reason is that the service-oriented nature of today's business, combined with the increasing volume of available data, makes it impossible for an organization to proactively design efficient data management processes. The second reason is that enterprises can benefit significantly from analyzing the behavior of their business processes, fostering their optimization. Hence, we took a first step towards quality-aware ETL process design automation by defining, through a systematic literature review, a set of ETL process quality characteristics and the relationships between them, as well as by providing quantitative measures for each characteristic. Subsequently, we produced a model that represents ETL process quality characteristics and the dependencies among them, and we showcased, through the application of a Goal Model with quantitative components (i.e., indicators), how our model can provide the basis for subsequent analysis to reason about and make informed ETL design decisions.
    In addition, we introduced our holistic view for a quality-aware design of ETL processes by presenting a framework for user-centered declarative ETL. This included the definition of an architecture and methodology for the rapid, incremental, qualitative improvement of ETL process models, promoting automation and reducing complexity, as well as a clear separation of business users and IT roles where each user is presented with appropriate views and assigned with fitting tasks. In this direction, we built a tool "POIESIS" which facilitates incremental, quantitative improvement of ETL process models with users being the key participants through well-defined collaborative interfaces.
    When it comes to evaluating different quality characteristics of the ETL process design, we proposed an automated data generation framework for evaluating ETL processes (i.e., Bijoux). To this end, we classified the operations based on the part of input data they access for processing, which facilitated Bijoux during data generation processes both for identifying the constraints that specific operation semantics imply over input data, as well as for deciding at which level the data should be generated (e.g., single field, single tuple, complete dataset). Bijoux offers data generation capabilities in a modular and configurable manner, which can be used to evaluate the quality of different parts of an ETL process.
    Moreover, we introduced a methodology that can apply to concrete contexts, building a repository of patterns and rules. This generated knowledge base can be used during the design and maintenance phases of ETL processes, automatically exposing understandable conceptual representations of the processes and providing useful insight for design decisions.
    Collectively, these contributions have raised the level of abstraction of ETL process components, revealing their quality characteristics in a granular level and allowing for evaluation and automated (re-)design, taking under consideration business users’ quality goals.
  • Emmanouil Valsomatzis. Aggregation Techniques for Energy Flexibility. PhD Thesis, Aalborg University. Aalborg, December 2017.
    Over the last few years, the cost of energy from renewable resources, such as sunlight and wind, has declined resulting in an increasing use of Renewable Energy Sources (RES). As a result, the energy produced by RES is fed into the power grid while their share is expected to significantly increase in the future.
    However, RES are characterized by power fluctuations and their integration into the power grid might lead to power quality issues, e.g., imbalances. At the same time, new energy-hungry devices such as heat pumps and Electric Vehicles (EVs) become more and more popular. As a result, their power demand, especially during peak times, might lead to electrical grid overloads and congestions. In order to confront these new challenges, the power grid is being transformed into the so-called Smart Grid. A major role in the Smart Grid is played by the Demand Response (DR) concept.
    According to DR, the Smart Grid better matches energy demand and supply by using energy flexibility. Energy flexibility exists in many individual prosumers (producers and/or consumers). For instance, an owner of an EV plugs in the EV for more time than is actually needed. Thus, the EV charging can be shifted in time. The load demanded for charging could be moved to time periods when production from wind turbines is high, or away from peak hours. Thus, the RES share is increased and/or the electrical grid operation is improved.
    The Ph.D. project is sponsored by the Danish TotalFlex project (http://totalflex.dk). The main goal of the TotalFlex project is to design and establish a flexibility market framework where flexibility from individual prosumers, e.g., household devices, can be traded among different market actors such as Balance Responsible Parties (BRPs) and distribution system operators. In order for that to be achieved, the TotalFlex project utilizes the flex-offer concept.
    Based on the flex-offer concept, flexibility from individual prosumers is captured and represented by a generic model. However, the flexible loads from individual prosumers capture very small energy amounts and thus cannot be directly traded in the market. Therefore, aggregation becomes essential. The Ph.D. project focuses on developing aggregation techniques for energy flexibilities that will provide the opportunity for individual prosumers to participate in such a flexibility market. First, the thesis introduces several flexibility measurements in order to quantify the flexibility captured by the flex-offer model and to compare flex-offers with each other, both on an individual and on an aggregated level. Flexibility is both the input and the output of the aggregation techniques. Aggregation techniques aggregate energy flexibility to achieve their goals and, at the same time, they try to retain as much flexibility as possible to be traded in the market. Thus, second, the thesis describes baseline flex-offer aggregation techniques and presents balance aggregation techniques that focus on balancing out energy supply and demand. Third, since there are cases where electrical grid congestions occur, the thesis presents two constraint-based aggregation techniques. The techniques efficiently aggregate large amounts of flex-offers taking into account physical constraints of the electrical grid. The produced aggregated flex-offers are still flexible and, when scheduled, a normal grid operation is achieved. Finally, the thesis examines the financial benefits of the aggregation techniques. It introduces flex-offer aggregation techniques that take into account real market technical requirements. As a result, individual small flexible loads can be indirectly traded in the energy market through aggregation.
    The proposed aggregation techniques for energy flexibilities can contribute to the use of flexibility in the Smart Grid in both current and future market frameworks. The designed techniques can improve the services offered to the prosumers and avoid the very costly upgrades of the distribution network.
  • Rohit Kumar, Alberto Abelló, Toon Calders. Cost Model for Pregel on GraphX. In proceedings of 21st European Conference on Advances in Databases and Information Systems (ADBIS), Nicosia, Cyprus, September 24-27, 2017. Lecture Notes in Computer Science 10509, Springer 2017. Pages 153-166. ISBN 978-3-319-66916-8. DOI: 10.1007/978-3-319-66917-5_11
    The graph partitioning strategy plays a vital role in the overall execution of an algorithm in a distributed graph processing system. Choosing the best strategy is very challenging, as no one strategy is always the best fit for all kinds of graphs or algorithms. In this paper, we help users choosing a suitable partitioning strategy for algorithms based on the Pregel model by providing a cost model for the Pregel implementation in Spark-GraphX. The cost model shows the relationship between four major parameters: (1) input graph (2) cluster configuration (3) algorithm properties and (4) partitioning strategy. We validate the accuracy of the cost model on 17 different combinations of input graph, algorithm, and partition strategy. As such, the cost model can serve as a basis for yet to be developed optimizers for Pregel.
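
    The toy formula below mirrors only the shape of such a cost model: a per-superstep cost with a compute term that scales down with the number of machines and a communication term driven by the edges cut by the partitioning strategy; the constants and figures are arbitrary, not the calibrated model of the paper.

    # Per-superstep cost of a Pregel-style computation under a given partitioning.
    def superstep_cost(active_vertices, cut_edges, machines,
                       compute_per_vertex=1e-6, network_per_message=5e-6):
        compute = (active_vertices / machines) * compute_per_vertex   # parallel local compute
        communication = cut_edges * network_per_message               # messages across partitions
        return compute + communication

    # A partitioning strategy with fewer cut edges lowers the communication term.
    print(superstep_cost(active_vertices=10**7, cut_edges=2 * 10**6, machines=16))
    print(superstep_cost(active_vertices=10**7, cut_edges=5 * 10**5, machines=16))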
  • Alberto Abelló, Claudia P. Ayala, Carles Farré, Cristina Gómez, Marc Oriol, Oscar Romero. A Data-Driven Approach to Improve the Process of Data-Intensive API Creation and Evolution. Proceedings of the Forum and Doctoral Consortium Papers Presented at the 29th International Conference on Advanced Information Systems Engineering (CAiSE), Essen, Germany, June 12-16, 2017. CEUR Workshop Proceedings 1848, CEUR-WS.org 2017. Pages 1-8. ISSN: 1613-0073
    The market of data-intensive Application Programming Interfaces (APIs) has recently experienced exponential growth, but the creation and evolution of such APIs is still done ad hoc, with little automated support and reported deficiencies. These drawbacks hinder the productivity of the developers of those APIs and of the services built on top of them. In this exploratory paper, we promote a data-driven approach to improve the automation of data-intensive API creation and evolution. In a release cycle, data coming from API usage and from developers will be gathered to compute several indicators whose analysis will guide the planning of the next release. These data will also help to generate complete documentation, facilitating API adoption by third parties.
  • Daria Glushkova, Petar Jovanovic, Alberto Abelló. MapReduce Performance Models for Hadoop 2.x. In 19th International Workshop On Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP), Venice, Italy, March 21-24, 2017. CEUR Workshop Proceedings 1810, CEUR-WS.org 2017. ISSN: 1613-0073
    MapReduce is a popular programming model for the distributed processing of large data sets. Apache Hadoop is one of the most common open-source implementations of this paradigm. Performance analysis of concurrent job executions has been recognized as a challenging problem; at the same time, it may provide reasonably accurate job response time estimates at a significantly lower cost than experimental evaluation of real setups.
    In this paper, we tackle the challenge of defining MapReduce performance models for Hadoop 2.x. While there are several efficient approaches for modeling the performance of MapReduce workloads in Hadoop 1.x, the fundamental architectural changes of Hadoop 2.x require that the cost models be reconsidered as well. The proposed solution is based on an existing performance model for Hadoop 1.x, but it takes into consideration the architectural changes of Hadoop 2.x and captures the execution flow of a MapReduce job by using a queuing network model. This way, the cost model adheres to the intra-job synchronization constraints that occur due to contention at shared resources.
    The accuracy of our solution is validated by comparing our model estimates against measurements in a real Hadoop 2.x setup. According to our evaluation results, the proposed model produces estimates of average job response time with an error in the range of 11%-13.5%.
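    To give an idea of what such an estimate looks like, the sketch below uses a deliberately simpler wave-based approximation instead of the paper's queuing network model; every name and number in it is an assumption.
      import math

      # Toy approximation: tasks run in waves of at most `containers` parallel
      # containers, and reducers only start after the last map wave finishes.
      def estimate_job_response_time(n_map, n_reduce, containers,
                                     avg_map_s, avg_reduce_s, overhead_s=0.0):
          map_waves = math.ceil(n_map / containers)
          reduce_waves = math.ceil(n_reduce / containers)
          return map_waves * avg_map_s + reduce_waves * avg_reduce_s + overhead_s

      print(estimate_job_response_time(n_map=200, n_reduce=16, containers=32,
                                       avg_map_s=25.0, avg_reduce_s=60.0, overhead_s=10.0))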
  • Sergi Nadal, Oscar Romero, Alberto Abelló, Panos Vassiliadis, Stijn Vansummeren. An Integration-Oriented Ontology to Govern Evolution in Big Data Ecosystems. In 19th International Workshop On Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP), Venice, Italy, March 21-24, 2017. CEUR Workshop Proceedings 1810, CEUR-WS.org 2017. ISSN: 1613-0073
    Big Data architectures allow heterogeneous data from multiple sources to be flexibly stored and processed in their original format. The structure of those data, commonly supplied by means of REST APIs, is continuously evolving, forcing data analysts to adapt their analytical processes after each release. This becomes more challenging when aiming to perform an integrated or historical analysis of multiple sources. To cope with such complexity, in this paper we present the Big Data Integration ontology, the core construct of a data governance protocol that systematically annotates and integrates data from multiple sources in their original format. To cope with syntactic evolution in the sources, we present an algorithm that semi-automatically adapts the ontology upon new releases. A functional evaluation on real-world APIs is performed in order to validate our approach.
  • Sergi Nadal, Alberto Abelló, Oscar Romero, Jovan Varga. Big Data Management Challenges in SUPERSEDE. In 1st International Workshop on Big Data Management in European Projects (EuroPro) at EDBT/ICDT, Venice, Italy, March 21-24, 2017. CEUR Workshop Proceedings 1810, CEUR-WS.org 2017. ISSN: 1613-0073
    The H2020 SUPERSEDE (www.supersede.eu) project aims to support decision-making in the evolution and adaptation of software services and applications by exploiting end-user feedback and runtime data, with the overall goal of improving the end-users' quality of experience (QoE). QoE is defined as the overall performance of a system from the point of view of its users, and must consider both the feedback and the runtime data gathered. End-user feedback is extracted from online forums, app stores, social networks and novel direct feedback channels, which connect software application and service users to developers. Runtime data are primarily gathered by monitoring environmental sensors, infrastructures and usage logs. Hereafter, we discuss our solutions for the main data management challenges in SUPERSEDE.
  • Ayman Alserafi, Toon Calders, Alberto Abelló, Oscar Romero. DS-Prox: Dataset Proximity Mining for Governing the Data Lake. In 10th International Conference Similarity Search and Applications (SISAP), Munich, Germany, October 4-6, 2017. Lecture Notes in Computer Science 10609, Springer 2017. Pages 284-299. ISBN 978-3-319-68473-4. DOI: 10.1007/978-3-319-68474-1_20
    With the arrival of Data Lakes (DL) there is an increasing need for efficient dataset classification to support data analysis and information retrieval. Our goal is to use meta-features describing datasets to detect whether they are similar. We utilise a novel proximity mining approach to assess the similarity of datasets. The proximity scores are used as an efficient first step, where pairs of datasets with high proximity are selected for further time-consuming schema matching and deduplication. The proposed approach helps in early pruning of unnecessary computations, thus improving the efficiency of similar-schema search. We evaluate our approach in experiments using the OpenML online DL, showing efficiency gains above 25% compared to matching without early pruning, and recall rates higher than 90% in certain scenarios.
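    As a rough illustration of the proximity-mining idea, the following sketch derives a cheap similarity score from a handful of dataset meta-features and uses it to prune the pairs sent to costly schema matching; the chosen meta-features and the threshold are invented for this sketch and are not those of DS-Prox.
      import math

      def meta_features(rows):
          # A few illustrative meta-features: number of rows, number of columns,
          # and the fraction of purely numeric columns.
          cols = list(rows[0]) if rows else []
          numeric = sum(1 for c in cols
                        if all(isinstance(r[c], (int, float)) for r in rows))
          return [len(rows), len(cols), numeric / len(cols) if cols else 0.0]

      def proximity(rows_a, rows_b):
          # Cheap score in (0, 1]; 1.0 means identical meta-feature vectors.
          fa, fb = meta_features(rows_a), meta_features(rows_b)
          dist = math.sqrt(sum(((a - b) / (abs(a) + abs(b) + 1e-9)) ** 2
                               for a, b in zip(fa, fb)))
          return 1.0 / (1.0 + dist)

      def pairs_for_schema_matching(datasets, threshold=0.8):
          # Early pruning: only pairs above the threshold reach the expensive
          # schema matching and deduplication step.
          names = list(datasets)
          return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
                  if proximity(datasets[a], datasets[b]) >= threshold]

      d1 = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
      d2 = [{"c": 3.0, "d": "z"}, {"c": 4.0, "d": "w"}]
      print(pairs_for_schema_matching({"d1": d1, "d2": d2}))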
  • Carme Quer, Alberto Abelló, Xavier Burgués, Maria José Casany, Carme Martín, Elena Rodríguez, Oscar Romero, Antoni Urpí. E-assessment of Relational Database skills by means of LEARNSQL. In Proceedings of 9th International Conference on Education and New Learning Technologies (EDULEARN), Barcelona, Spain, July 3-5, 2017. IATED, 2017. Pages 9443-9448. ISBN: 978-84-697-3777-4. ISSN: 2340-1117. DOI: 10.21125/edulearn.2017.0779
    In database-related courses, students require analytical, creative, and constructive skills that cannot be assessed via multiple-choice tests or equivalent forms of basic assessment techniques. From a technological point of view, this requires more complex e-assessment systems. LearnSQL (Learning Environment for Automatic Rating of Notions of SQL) is a software tool that allows the automatic and efficient e-learning and e-assessment of relational database skills. It has been used at FIB for 18 semesters with an average of 200 students per semester.
  • Alberto Abelló, Oscar Romero. On-Line Analytical Processing (OLAP). In Encyclopedia of Database Systems (editors-in-chief: Tamer Ozsu & Ling Liu). Springer, 2009 (reedited in 2017). Pages 1949-1954. ISBN: 978-0-387-39940-9 (Reedited: 978-1-4899-7993-3). DOI: 978-1-4899-7993-3
  • Yassine Ouhammou, Mirjana Ivanovic, Alberto Abelló, Ladjel Bellatreche. Proceedings of the 7th International Conference on Model and Data Engineering (MEDI), Barcelona, Spain, October 4-6, 2017. Lecture Notes in Computer Science 10563, Springer 2017. ISBN 978-3-319-66853-6. DOI: 10.1007/978-3-319-66854-3
2016
  • Vasileios Theodorou, Alberto Abelló, Wolfgang Lehner, Maik Thiele. Quality measures for ETL processes: from goals to implementation. In Concurrency and Computation: Practice and Experience, 28(15). John Wiley & Sons, 2016. Pages 3969-3993. ISSN: 1532-0634. DOI: 10.1002/cpe.3729
    Extraction-transformation-loading (ETL) processes play an increasingly important role in the support of modern business operations. These business processes are centred around artifacts with high variability and diverse lifecycles, which correspond to key business entities. The apparent complexity of these activities has been examined through the prism of business process management, mainly focusing on functional requirements and performance optimization. However, the quality dimension has not yet been thoroughly investigated, and there is a need for a more human-centric approach to bring these processes closer to business users' requirements. In this paper, we take a first step in this direction by defining a sound model for ETL process quality characteristics and quantitative measures for each characteristic, based on existing literature. Our model shows dependencies among quality characteristics and can provide the basis for subsequent analysis using goal modeling techniques. We showcase the use of goal modeling for ETL process design through a use case, where we employ a goal model that includes quantitative components (i.e., indicators) for the evaluation and analysis of alternative design decisions.
  • Petar Jovanovic, Oscar Romero, Alkis Simitsis, Alberto Abelló. Incremental Consolidation of Data-Intensive Multi-Flows. In Transactions on Knowledge and Data Engineering, 28(5). IEEE Press, May 2016. Pages 1203-1216. ISSN: 1041-4347. DOI: 10.1109/TKDE.2016.2515609
    Business intelligence (BI) systems depend on the efficient integration of disparate and often heterogeneous data. The integration of data is governed by data-intensive flows and is driven by a set of information requirements. Designing such flows is in general a complex process which, due to the complexity of business environments, is hard to do manually. In this paper, we deal with the challenge of efficient design and maintenance of data-intensive flows and propose an incremental approach, namely CoAl, for semi-automatically consolidating data-intensive flows satisfying a given set of information requirements. CoAl works at the logical level and consolidates data flows from either high-level information requirements or platform-specific programs. As CoAl integrates a new data flow, it opts for maximal reuse of existing flows and applies a customizable cost model tuned for minimizing the overall cost of a unified solution. We demonstrate the efficiency and effectiveness of our approach through an experimental evaluation using our implemented prototype.
  • Petar Jovanovic, Oscar Romero, Alberto Abelló. A Unified View of Data-Intensive Flows in Business Intelligence Systems: A Survey. In Transactions on Large-Scale Data- and Knowledge-Centered Systems XXIX. Lecture Notes in Computer Science 10120. Springer, 2016. Pages 66-107. ISBN (printed): 978-3-662-54036-7. ISBN (online): 978-3-662-54037-4. ISSN: 0302-9743. DOI: 10.1007/978-3-662-54037-4_3
    Data-intensive flows are central processes in today’s business intelligence (BI) systems, deploying different technologies to deliver data, from a multitude of data sources, in user-preferred and analysis-ready formats. To meet the complex requirements of next-generation BI systems, we often need an effective combination of the traditionally batched extract-transform-load (ETL) processes that populate a data warehouse (DW) from integrated data sources, and more real-time and operational data flows that integrate source data at runtime. Both academia and industry thus must have a clear understanding of the foundations of data-intensive flows and the challenges of moving towards next-generation BI environments. In this paper we present a survey of today’s research on data-intensive flows and the related fundamental fields of database theory. The study is based on a proposed set of dimensions describing the important challenges of data-intensive flows in the next-generation BI setting. As a result of this survey, we envision an architecture of a system for managing the lifecycle of data-intensive flows. The results further provide a comprehensive understanding of data-intensive flows, recognizing the challenges that are still to be addressed and showing how current solutions can be applied to address them.
  • Ayman Alserafi, Alberto Abelló, Oscar Romero, Toon Calders. Towards Information Profiling: Data Lake Content Metadata Management. In the 3rd Workshop on Data Integration and Applications (DINA), held in conjunction with the IEEE International Conference on Data Mining Workshops (ICDMW). Barcelona, December 12-15, 2016. IEEE, 2016. ISBN (online): 978-1-5090-5910-2. ISBN: 978-1-5090-5911-9. DOI: 10.1109/ICDMW.2016.0033
    There is currently a burst of Big Data (BD) processed and stored in huge raw data repositories, commonly called Data Lakes (DL). These BD require new techniques of data integration and schema alignment in order to make the data usable by their consumers and to discover the relationships linking their content. This can be provided by metadata services which discover and describe the content of the lake. However, there is currently no systematic approach for this kind of metadata discovery and management. Thus, we propose a framework for profiling the informational content stored in the DL, which we call information profiling. The profiles are stored as metadata to support data analysis. We formally define a metadata management process which identifies the key activities required to handle this effectively. We demonstrate the alternative techniques and the performance of our process using a prototype implementation handling a real-life case study from the OpenML DL, which showcases the value and feasibility of our approach.
  • Besim Bilalli, Alberto Abelló, Tomàs Aluja-Banet, Robert Wrembel. Towards Intelligent Data Analysis: The Metadata Challenge. In International Conference on Internet of Things and Big Data (IoTBD). Rome (Italy), April 23-25, 2016. ScitePress, 2016. Pages 331-338. ISBN: 978-989-758-183-0. DOI: 10.5220/0005876203310338
    Once analyzed correctly, data can yield substantial benefits. The process of analyzing data and transforming it into knowledge is known as Knowledge Discovery in Databases (KDD). The plethora and subtleties of algorithms in the different steps of KDD render it challenging. Effective user support is of crucial importance, even more so now that analysis is performed on Big Data. Metadata is the necessary component to drive such user support. In this paper we study the metadata required to provide user support at every stage of the KDD process. We show that intelligent systems addressing the problem of user assistance in KDD are incomplete in this regard: they do not use the full potential of metadata to enable assistance during the whole process. We present a comprehensive classification of all the metadata required to provide user support. Furthermore, we present our implementation of a metadata repository for storing and managing this metadata and explain its benefits in a real Big Data analytics project.
  • Petar Jovanovic. Requirement-Driven Design and Optimization of Data-Intensive Flows. PhD Thesis, Universitat Politècnica de Catalunya. Barcelona, September 2016.
    Data have become the number-one asset of today's business world. Thus, their exploitation and analysis have attracted the attention of people from different fields and with different technical backgrounds. Data-intensive flows are central processes in today's business intelligence (BI) systems, deploying different technologies to deliver data, from a multitude of data sources, in user-preferred and analysis-ready formats. However, designing and optimizing such data flows, to satisfy both users' information needs and agreed quality standards, is known to be a burdensome task, typically left to the manual efforts of a BI system designer. These tasks have become even more challenging for next-generation BI systems, where data flows typically need to combine data from in-house transactional storages and data coming from external sources, in a variety of formats (e.g., social media, governmental data, news feeds). Moreover, to make an impact on business outcomes, data flows are expected to answer unanticipated analytical needs of a broader set of business users and to deliver valuable information in near real-time (i.e., at the right time). These challenges clearly indicate a need for boosting the automation of the design and optimization of data-intensive flows. This PhD thesis aims at providing automatable means for managing the lifecycle of data-intensive flows. The study first analyzes the remaining challenges to be solved in the field of data-intensive flows, by performing a survey of current literature, and envisions an architecture for managing the lifecycle of data-intensive flows. Following the proposed architecture, we further focus on providing automatic techniques for covering different phases of the data-intensive flows' lifecycle. In particular, the thesis first proposes an approach (CoAl) for the incremental design of data-intensive flows, by means of multi-flow consolidation. CoAl not only facilitates the maintenance of data flow designs in front of changing information needs, but also supports the multi-flow optimization of data-intensive flows by maximizing their reuse. Next, in the data warehousing (DW) context, we propose a complementary method (ORE) for the incremental design of the target DW schema, along with systematically tracing the evolution metadata, which can further facilitate the design of back-end data-intensive flows (i.e., ETL processes). The thesis then studies the problem of implementing data-intensive flows in the deployable formats of different execution engines, and proposes the BabbleFlow system for translating logical data-intensive flows into executable formats, spanning single or multiple execution engines. Lastly, the thesis focuses on managing the execution of data-intensive flows on distributed data processing platforms, and to this end proposes an algorithm (H-WorD) for supporting the scheduling of data-intensive flows by workload-driven redistribution of data in computing clusters. The overall outcome of this thesis is an end-to-end platform for managing the lifecycle of data-intensive flows, called Quarry. The techniques proposed in this thesis, plugged into the Quarry platform, largely reduce manual effort and assist users of different technical skills in their analytical tasks. Finally, the results of this thesis contribute to the field of data-intensive flows in today's BI systems, and advocate for further attention by both academia and industry to the problems of designing and optimizing data-intensive flows.
  • Alberto Abelló, Xavier Burgués, María José Casany, Carme Martín, Maria Carme Quer, M. Elena Rodríguez, Oscar Romero, Antoni Urpí. A software tool for e-assessment of relational database skills. In International Journal of Engineering Education, 32(3). Tempus Publications, February 2016. Pages 1289-1312. ISSN: 0949-149X/91
    The objective of this paper is to present a software tool for the e-assessment of the relational database skills of students. The tool is referred to as LearnSQL (Learning Environment for Automatic Rating of Notions of SQL). LearnSQL is able to correct, grade, and provide automatic feedback on the responses to relational database exercises. It can assess the acquisition of knowledge and practical skills in relational databases that are not assessed by other systems. The paper also reports on the impact of using the tool over the past 8 years by 2500 students.
  • Petar Jovanovic, Oscar Romero, Toon Calders, Alberto Abelló. H-WorD: Supporting Job Scheduling in Hadoop with Workload-Driven Data Redistribution. In 20th East European Conference on Advances in Databases and Information Systems (ADBIS). Prague (Czech Republic), August 28-31, 2016. Lecture Notes in Computer Science 9809, Springer, 2016. Pages 306-320. ISBN: 978-3-319-44038-5. DOI: 10.1007/978-3-319-44039-2_21
    Today’s distributed data processing systems typically follow a query shipping approach and exploit data locality for reducing network traffic. In such systems the distribution of data over the cluster resources plays a significant role, and when skewed, it can harm the performance of executing applications. In this paper, we address the challenges of automatically adapting the distribution of data in a cluster to the workload imposed by the input applications. We propose a generic algorithm, named H-WorD, which, based on the estimated workload over resources, suggests alternative execution scenarios for tasks and hence identifies the required transfers of input data a priori, bringing data close to the execution in a timely manner. We exemplify our algorithm in the context of MapReduce jobs in a Hadoop ecosystem. Finally, we evaluate our approach and demonstrate the performance gains of automatic data redistribution.
  • Emmanouil Valsomatzis, Torben Bach Pedersen, Alberto Abelló, Katja Hose, Laurynas Siksnys. Towards constraint-based aggregation of energy flexibilities. In poster session in Seventh International Conference on Future Energy Systems (e-Energy 2016). Waterloo, ON (Canada), June 21-24, 2016. ACM, 2016. Pages 6:1-6:2. ISBN: 978-1-4503-4417-3. DOI: 10.1145/2939912.2942351
    The aggregation of energy flexibilities enables individual producers and/or consumers with small loads to directly participate in the emerging energy markets. On the other hand, the aggregation of such flexibilities might also create problems for the operation of the electrical grid. In this paper, we present the problem of aggregating energy flexibilities taking into account grid capacity limitations and introduce a heuristic aggregation technique. We show through an experimental setup that our proposed technique, compared to a baseline approach, not only leads to a valid unit commitment result that respects the grid constraint, but also improves the quality of the result.
  • Victor Herrero, Alberto Abelló, Oscar Romero. NOSQL Design for Analytical Workloads: Variability Matters. In 35th International Conference on Conceptual Modeling (ER). Gifu (Japan), November 14-17, 2016. Lecture Notes in Computer Science 9974. Springer, 2016. Pages 50-64. ISBN: 978-3-319-46396-4. DOI: 10.1007/978-3-319-46397-1_4
    Big Data has recently gained popularity and has strongly questioned relational databases as universal storage systems, especially in the presence of analytical workloads. As a result, co-relational alternatives, commonly known as NOSQL (Not Only SQL) databases, are extensively used for Big Data. As the primary focus of NOSQL is on performance, NOSQL databases are directly designed at the physical level, and consequently the resulting schema is tailored to the dataset and access patterns of the problem at hand. However, we believe that NOSQL design can also benefit from traditional design approaches. In this paper we present a method to design databases for analytical workloads. Starting from the conceptual model and adopting the classical 3-phase design used for relational databases, we propose a novel design method considering the new features brought by NOSQL and encompassing relational and co-relational design altogether.
  • Rana Faisal Munir, Oscar Romero, Alberto Abelló, Besim Bilalli, Maik Thiele, Wolfgang Lehner. ResilientStore: A Heuristic-Based Data Format Selector for Intermediate Results. In 6th International Conference on Model and Data Engineering (MEDI). Almería (Spain), September 21-23, 2016. Lecture Notes in Computer Science 9893. Springer, 2016. Pages 42-56. ISBN: 978-3-319-45546-4. DOI: 10.1007/978-3-319-45547-1_4
    Large-scale data analysis is an important activity in many organizations that typically requires the deployment of data-intensive workflows. As data is processed, these workflows generate large intermediate results, which are typically pipelined from one operator to the next. However, if materialized, these results become reusable; hence, subsequent workflows need not recompute them. There are already many solutions that materialize intermediate results, but all of them assume a fixed data format. A fixed format, however, may not be the optimal one for every situation. For example, it is well known that different data fragmentation strategies (e.g., horizontal and vertical) behave better or worse according to the access patterns of the subsequent operations. In this paper, we present ResilientStore, which assists in selecting the most appropriate data format for materializing intermediate results. Given a workflow and a set of materialization points, it uses rule-based heuristics to choose the best storage data format based on subsequent access patterns. We have implemented ResilientStore for HDFS and three different data formats: SequenceFile, Parquet and Avro. Experimental results show that our solution gives 18% better performance than any solution based on a single fixed format.
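    A minimal sketch of what rule-based format selection can look like; the rules below are invented for illustration and are not the heuristics implemented in ResilientStore.
      def choose_format(reads_few_columns, full_row_scans, schema_evolves):
          # Hypothetical rules mapping the access pattern of subsequent operators
          # onto one of the three supported HDFS formats.
          if reads_few_columns and not full_row_scans:
              return "Parquet"       # columnar layout favors projections and aggregations
          if schema_evolves:
              return "Avro"          # row format with good schema-evolution support
          return "SequenceFile"      # simple key-value container for whole-row pipelines

      print(choose_format(reads_few_columns=True, full_row_scans=False, schema_evolves=False))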
  • Besim Bilalli, Alberto Abelló, Tomàs Aluja-Banet, Robert Wrembel. Automated Data Pre-processing via Meta-learning. In 6th International Conference on Model and Data Engineering (MEDI). Almería (Spain), September 21-23, 2016. Lecture Notes in Computer Science 9893. Springer, 2016. Pages 194-208. ISBN: 978-3-319-45546-4. DOI: 10.1007/978-3-319-45547-1_16
    A data mining algorithm may perform differently on datasets with different characteristics; e.g., it might perform better on a dataset with continuous attributes rather than with categorical attributes, or the other way around. As a matter of fact, a dataset usually needs to be pre-processed. Taking into account all the possible pre-processing operators, there exists a staggeringly large number of alternatives, and inexperienced users become overwhelmed. We show that this problem can be addressed by an automated approach, leveraging ideas from meta-learning. Specifically, we consider a wide range of data pre-processing techniques and a set of data mining algorithms. For each data mining algorithm and selected dataset, we are able to predict the transformations that improve the result of the algorithm on the respective dataset. Our approach will help non-expert users to more effectively identify the transformations appropriate to their applications, and hence to achieve improved results.
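    The following sketch illustrates the meta-learning idea with scikit-learn, assuming a meta-dataset of past experiments has already been collected; the meta-features, operator identifiers and labels are invented and do not correspond to the actual experimental setup of the paper.
      import numpy as np
      from sklearn.ensemble import RandomForestClassifier

      # Each row: dataset meta-features (rows, attributes, fraction of categorical
      # attributes, class entropy) plus the id of a candidate pre-processing
      # operator (0 = discretization, 1 = normalization in this toy encoding).
      X_meta = np.array([
          [1000,  10, 0.8, 0.9, 0],
          [1000,  10, 0.8, 0.9, 1],
          [50000, 40, 0.1, 0.4, 0],
          [50000, 40, 0.1, 0.4, 1],
      ])
      # Label: did applying the operator improve the algorithm's result?
      y_meta = np.array([1, 0, 0, 1])

      model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_meta, y_meta)

      # Rank candidate operators for a new dataset by predicted probability of improvement.
      new_dataset = [20000, 25, 0.5, 0.7]
      ranking = sorted(((model.predict_proba([new_dataset + [op]])[0][1], op)
                        for op in (0, 1)), reverse=True)
      print(ranking)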
  • Stefano Rizzi, Enrico Gallinucci, Matteo Golfarelli, Alberto Abelló, Oscar Romero. Towards Exploratory OLAP on Linked Data. In 24th Italian Symposium on Advanced Database Systems (SEBD). Ugento, Lecce (Italy), June 19-22, 2016. Matematicamente.it, 2016. Pages 86-93. ISBN: 9788896354889
    In the context of exploratory OLAP, coupling the information wealth of linked data with the precision and detail of corporate data can greatly improve the effectiveness of the decision-making process. In this paper we outline an approach that enables users to extend the hierarchies in their corporate cubes through a user-guided process that explores selected linked data and derives hierarchies from them. This is done by identifying in the linked data the recurring modeling patterns that express roll-up relationships between RDF concepts and translating them into multidimensional knowledge.
  • Emmanouil Valsomatzis, Torben Bach Pedersen, Alberto Abelló, Katja Hose. Aggregating energy flexibilities under constraints. In 2016 IEEE International Conference on Smart Grid Communications (SmartGridComm 2016). Sydney (Australia), 6-9 November 2016. IEEE, 2016. Pages 484-490. ISBN: 978-1-5090-4075-9. DOI: 10.1109/SmartGridComm.2016.7778808
    The flexibility of individual energy prosumers (producers and/or consumers) has drawn a lot of attention in recent years. Aggregation of such flexibilities provides prosumers with the opportunity to directly participate in the energy market and at the same time reduces the complexity of scheduling the energy units. However, aggregated flexibility should support normal grid operation. In this paper, we build on the flex-offer (FO) concept to model the inherent flexibility of a prosumer (e.g., a single flexible consumption device such as a clothes washer). An FO captures flexibility in both time and amount dimensions. We define the problem of aggregating FOs taking into account grid power constraints. We also propose two constraint-based aggregation techniques that efficiently aggregate FOs while retaining flexibility. We show through a comprehensive evaluation that our techniques, in contrast to state-of-the-art techniques, respect the constraints imposed by the electrical grid. Moreover, our techniques also reduce the scheduling input size significantly and improve the quality of scheduling results.
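    A toy sketch of constraint-aware flex-offer aggregation under a simplified reading of the FO model; the data structure, the alignment rule and the handling of the grid limit are assumptions made for illustration, not the techniques proposed in the paper.
      from dataclasses import dataclass
      from typing import List, Tuple

      @dataclass
      class FlexOffer:
          # Time flexibility: a window of allowed start slots.
          earliest_start: int
          latest_start: int
          # Amount flexibility: (min_kWh, max_kWh) bounds per consecutive slot.
          profile: List[Tuple[float, float]]

      def aggregate(a: FlexOffer, b: FlexOffer, grid_limit_kwh: float) -> FlexOffer:
          # Naive aggregation: align both FOs at their earliest start, sum the
          # per-slot bounds, intersect the start-time windows, and cap the maxima
          # at the grid limit (empty time windows are not handled in this sketch).
          slots = max(len(a.profile), len(b.profile))
          pad = lambda p: p + [(0.0, 0.0)] * (slots - len(p))
          profile = []
          for (amin, amax), (bmin, bmax) in zip(pad(a.profile), pad(b.profile)):
              if amin + bmin > grid_limit_kwh:
                  raise ValueError("even the minimum demand violates the grid constraint")
              profile.append((amin + bmin, min(amax + bmax, grid_limit_kwh)))
          return FlexOffer(max(a.earliest_start, b.earliest_start),
                           min(a.latest_start, b.latest_start), profile)

      washer = FlexOffer(2, 6, [(0.5, 2.0), (0.5, 2.0)])
      heater = FlexOffer(3, 8, [(0.2, 1.5)])
      print(aggregate(washer, heater, grid_limit_kwh=3.0))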
  • Esteban Zimányi, Alberto Abelló (Editors). Business Intelligence. Tutorial Lectures of 5th European Summer School in Business Intelligence (eBISS). Barcelona (Spain), July 5-10, 2015. In Lecture Notes in Business Information Processing, 253. Springer, 2016. ISBN: 978-3-319-39242-4. DOI: 10.1007/978-3-319-39243-1
2015
  • Oscar Romero, Victor Herrero, Alberto Abelló, Jaume Ferrarons. Tuning small analytics on Big Data: Data partitioning and secondary indexes in the Hadoop ecosystem. In Information Systems, 54. Pages 336-356. Elsevier, December 2015. ISSN: 0306-4379. DOI: 10.1016/j.is.2014.09.005
    In recent years, the problems of using generic (i.e., relational) storage techniques for very specific applications have been detected and outlined and, as a consequence, some alternatives to relational DBMSs (e.g., HBase) have bloomed. Most of these alternatives sit on the cloud and benefit from cloud computing, which is nowadays a reality that helps us save money by eliminating hardware and software fixed costs and just paying per use. On top of this, specific querying frameworks to exploit the brute force of the cloud (e.g., MapReduce) have also been devised. The question arising next is whether this (rather naive) exploitation of the cloud is an alternative to tuning DBMSs or whether it still makes sense to consider other options when retrieving data in these settings. In this paper, we study the feasibility of solving OLAP queries with Hadoop (the Apache project implementing MapReduce) while benefiting from secondary indexes and partitioning in HBase. Our main contribution is the comparison of different access plans and the definition of criteria (i.e., cost estimation) to choose among them in terms of consumed resources (namely CPU, bandwidth and I/O).
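    For illustration, the sketch below compares the estimated cost of a full scan against a secondary-index access and picks the cheaper plan; the cost formulas and constants are simplistic assumptions, not the criteria defined in the paper.
      def full_scan_cost(n_rows, row_size_bytes, scan_throughput_bps):
          # I/O-bound cost (in seconds) of scanning the whole table.
          return n_rows * row_size_bytes / scan_throughput_bps

      def index_access_cost(n_rows, selectivity, index_lookup_s, random_read_s):
          # Probe the secondary index, then fetch only the qualifying rows.
          return index_lookup_s + n_rows * selectivity * random_read_s

      def choose_plan(n_rows, selectivity, row_size_bytes=200,
                      scan_throughput_bps=100e6, index_lookup_s=0.5, random_read_s=0.002):
          scan = full_scan_cost(n_rows, row_size_bytes, scan_throughput_bps)
          index = index_access_cost(n_rows, selectivity, index_lookup_s, random_read_s)
          return ("index", index) if index < scan else ("scan", scan)

      print(choose_plan(n_rows=50_000_000, selectivity=0.0001))   # selective query -> index
      print(choose_plan(n_rows=50_000_000, selectivity=0.2))      # broad query -> scan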
  • Alberto Abelló, Oscar Romero, Torben Bach Pedersen, Rafael Berlanga Llavori, Victoria Nebot, María José Aramburu Cabo, Alkis Simitsis. Using Semantic Web Technologies for Exploratory OLAP: A Survey. In IEEE Transactions on Knowledge and Data Engineering, 27(2). Pages 571-588. IEEE, February 2015. ISSN: 1041-4347. DOI: 10.1109/TKDE.2014.2330822
    This paper describes the convergence of some of the most influential technologies in the last few years, namely data warehousing (DW), on-line analytical processing (OLAP), and the Semantic Web (SW). OLAP is used by enterprises to derive important business-critical knowledge from data inside the company. However, the most interesting OLAP queries can no longer be answered on internal data alone; external data must also be discovered (most often on the web), acquired, integrated, and (analytically) queried, resulting in a new type of OLAP, exploratory OLAP. When using external data, an important issue is knowing the precise semantics of the data. Here, SW technologies come to the rescue, as they allow semantics (ranging from very simple to very complex) to be specified for web-available resources. SW technologies not only support capturing the "passive" semantics, but also support active inference and reasoning on the data. The paper first presents a characterization of DW/OLAP environments, followed by an introduction to the relevant SW foundation concepts. Then, it describes the relationship of multidimensional (MD) models and SW technologies, including the relationship between MD models and SW formalisms. Next, the paper goes on to survey the use of SW technologies for data modeling and data provisioning, including semantic data annotation and semantic-aware extract, transform, and load (ETL) processes. Finally, all the findings are discussed and a number of directions for future research are outlined, including SW support for intelligent MD querying, using SW technologies for providing context to data warehouses, and scalability issues.
  • Alberto Abelló. Big Data Design. In 18th International Workshop on Data Warehousing and OLAP (DOLAP). Melbourne (Australia), November 2015. ACM Press, 2015. Pages 35-38. ISBN: 978-1-4503-3785-4. DOI 10.1145/2811222.2811235
    It is widely accepted today that relational databases are not appropriate in highly distributed shared-nothing architectures of commodity hardware that need to handle poorly structured heterogeneous data. This has brought about the blooming of NoSQL systems with the purpose of mitigating this problem, especially in the presence of analytical workloads. Thus, the change in the data model and the new analytical needs beyond OLAP lead us to rethink the methods and models used to design and manage these newborn repositories. In this paper, we analyze the state of the art and future research directions.
  • Vasileios Theodorou, Alberto Abelló, Maik Thiele, Wolfgang Lehner. POIESIS: a Tool for Quality-aware ETL Process Redesign. In demonstration session in 18th International Conference on Extending Database Technology (EDBT). Brussels (Belgium), March 2015. Open Proceedings, 2015. Pages 545-548. ISBN 978-3-89318-067-7
    We present a tool, called POIESIS, for automatic ETL process enhancement. ETL processes are essential data-centric activities in modern business intelligence environments and they need to be examined through a viewpoint that concerns their quality characteristics (e.g., data quality, performance, manageability) in the era of Big Data. POIESIS responds to this need by providing a user-centered environment for quality-aware analysis and redesign of ETL flows. It generates thousands of alternative flows by adding flow patterns to the initial flow, in varying positions and combinations, thus creating alternative design options in a multidimensional space of different quality attributes. Through the demonstration of POIESIS we introduce the tool's capabilities and highlight its efficiency, usability and modifiability, thanks to its polymorphic design.
  • Petar Jovanovic, Oscar Romero, Alkis Simitsis, Alberto Abelló, Héctor Candón, Sergi Nadal. Quarry: Digging Up the Gems of Your Data Treasury. In demonstration session in 18th International Conference on Extending Database Technology (EDBT). Brussels (Belgium), March 2015. Open Proceedings, 2015. Pages 549-552. ISBN 978-3-89318-067-7
    The design lifecycle of a data warehousing (DW) system is primarily led by the requirements of its end-users and the complexity of the underlying data sources. The process of designing a multidimensional (MD) schema and back-end extract-transform-load (ETL) processes is a long-term and mostly manual task. As enterprises shift to more real-time and ’on-the-fly’ decision making, business intelligence (BI) systems require automated means for efficiently adapting a physical DW design to frequent changes of business needs. To address this problem, we present Quarry, an end-to-end system for assisting users of various technical skills in managing the incremental design and deployment of MD schemata and ETL processes. Quarry automates the physical design of a DW system from high-level information requirements. Moreover, Quarry provides tools for efficiently accommodating MD schema and ETL process designs to new or changed information needs of its end-users. Finally, Quarry facilitates the deployment of the generated DW design over an extensible list of execution engines. On-site, we will use a variety of examples to show how Quarry helps to manage the complexity of the DW design lifecycle.
2014
  • Ruth Raventós, Stephany García, Oscar Romero, Alberto Abelló, and Jaume Viñas. On the Complexity of Requirements Engineering for Decision-Support Systems: The CID Case Study. In Fourth European Business Intelligence Summer School (eBISS'14). Lecture Notes in Business Information Processing, Volume 205. Springer, July 2015. Pages 1-38. ISBN (printed): 978-3-319-17551-5. ISBN (electronic): 978-3-319-17550-8. DOI: 10.1007/978-3-319-17551-5
    Chagas disease is classified as a life-threatening disease by the World Health Organization (WHO) and is currently causing death to 534,000 people every year. In order to advance with disease control, the WHO presented a strategy that included the development of the Chagas Information Database (CID) for surveillance, to raise awareness about Chagas. CID is defined as a decision-support system to support national and international authorities in both their day-to-day and long-term decision making. The requirements engineering for this project was particularly complex, and Pohl’s framework was followed. This paper describes the results of applying the framework in this project; thus, it focuses on the requirements engineering stage. The difficulties found motivated the further study and analysis of the complexity of requirements engineering in decision-support systems and the feasibility of using said framework.
  • Petar Jovanovic, Oscar Romero, Alkis Simitsis, Alberto Abelló, Daria Mayorova. A requirement-driven approach to the design and evolution of data warehouses. Information Systems, Volume 44. Pages 94-119. Elsevier, August 2014. ISSN: 0306-4379. DOI: 10.1016/j.is.2014.01.004
    Designing data warehouse (DW) systems in highly dynamic enterprise environments is not an easy task. At each moment, the multidimensional (MD) schema needs to satisfy the set of information requirements posed by the business users. At the same time, the diversity and heterogeneity of the data sources need to be considered in order to properly retrieve the needed data. The frequent arrival of new business needs requires that the system be adaptable to changes. To cope with such an inevitable complexity (both at the beginning of the design process and when potential evolution events occur), in this paper we present a semi-automatic method called ORE for creating DW designs in an iterative fashion based on a given set of information requirements. Requirements are first considered separately. For each requirement, ORE expects the set of possible MD interpretations of the source data needed for that requirement (in a form similar to an MD schema). Incrementally, ORE builds the unified MD schema that satisfies the entire set of requirements and meets some predefined quality objectives. We have implemented ORE and performed a number of experiments to study our approach. We have also conducted a limited-scale case study to investigate its usefulness to designers.
  • Alberto Abelló, Boualem Benatallah, Ladjel Bellatreche (Eds.). Special Issue on: Model and Data Engineering. J. Data Semantics 3(3). Springer, 2014. Pages 141-142. ISSN (printed): 1861-2032. ISBN (electronic): 1861-2040. DOI 10.1007/s13740-013-0033-1

  • Vasileios Theodorou, Alberto Abelló, Wolfgang Lehner. Quality Measures for ETL Processes. 16th International Conference on Data Warehousing and Knowledge Discovery (DaWaK). Munich (Germany), September 2-4, 2014. Pages 9-22. Lecture Notes in Computer Science 8646, Springer 2014. ISBN (printed): 978-3-319-10159-0. ISBN (electronic): 978-3-319-10160-6. DOI: 10.1007/978-3-319-10160-6_2
    ETL processes play an increasingly important role in the support of modern business operations. These business processes are centred around artifacts with high variability and diverse lifecycles, which correspond to key business entities. The apparent complexity of these activities has been examined through the prism of Business Process Management, mainly focusing on functional requirements and performance optimization. However, the quality dimension has not yet been thoroughly investigated, and there is a need for a more human-centric approach to bring these processes closer to business users' requirements. In this paper we take a first step in this direction by defining a sound model for ETL process quality characteristics and quantitative measures for each characteristic, based on existing literature. Our model shows dependencies among quality characteristics and can provide the basis for subsequent analysis using Goal Modeling techniques.
  • Emona Nakuçi, Vasileios Theodorou, Petar Jovanovic, Alberto Abelló. Bijoux: Data Generator for Evaluating ETL Process Quality. In 17th International Workshop on Data Warehousing and OLAP (DOLAP). Shanghai (China), November 2014. ACM Press, 2014. Pages 23-32. ISBN: 978-1-4503-0999-8. DOI: 10.1145/2666158.2666183
    Obtaining the right set of data for evaluating the fulfillment of different quality standards in extract-transform-load (ETL) process design is rather challenging. First, the real data might be out of reach due to different privacy constraints, while providing a synthetic set of data is known to be a labor-intensive task that needs to take various combinations of process parameters into account. Additionally, having a single dataset usually does not represent the evolution of data throughout the complete process lifespan, hence missing a plethora of possible test cases. To facilitate such a demanding task, in this paper we propose an automatic data generator (i.e., Bijoux). Starting from a given ETL process model, Bijoux extracts the semantics of data transformations, analyzes the constraints they imply over data, and automatically generates testing datasets. At the same time, it considers different dataset and transformation characteristics (e.g., size, distribution, selectivity, etc.) in order to cover a variety of test scenarios. We report our experimental findings showing the effectiveness and scalability of our approach.
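    A minimal sketch of constraint-driven test-data generation: given one hypothetical filter condition of an ETL flow, it produces tuples that satisfy it and tuples that violate it, so both branches of the operator are exercised; Bijoux derives such constraints automatically from the process model, which this sketch does not attempt.
      import random

      def generate_rows(n, satisfy=True):
          # Hypothetical selection predicate: age > 18 AND country == 'ES'.
          rows = []
          for _ in range(n):
              if satisfy:
                  rows.append({"age": random.randint(19, 90), "country": "ES"})
              else:
                  rows.append({"age": random.randint(0, 18),
                               "country": random.choice(["ES", "FR"])})
          return rows

      # A test dataset mixing tuples that pass the filter and tuples that do not.
      dataset = generate_rows(80, satisfy=True) + generate_rows(20, satisfy=False)
      print(len(dataset), dataset[0])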
  • Vasileios Theodorou, Alberto Abelló, Maik Thiele, Wolfgang Lehner. A Framework for User-Centered Declarative ETL. In 17th International Workshop on Data Warehousing and OLAP (DOLAP). Shanghai (China), November 2014. ACM Press, 2014. Pages 67-70. ISBN: 978-1-4503-0999-8. DOI: 10.1145/2666158.2666178
    As business requirements evolve with increasing information density and velocity, there is a growing need for efficiency and automation of Extract-Transform-Load (ETL) processes. Current approaches for the modeling and optimization of ETL processes provide platform-independent optimization solutions for the (semi-)automated transition among different abstraction levels, focusing on cost and performance. However, the suggested representations are not abstract enough to communicate business requirements, and the role of process quality in a user-centered perspective has not yet been adequately examined. In this paper, we introduce a novel methodology for the end-to-end design of ETL processes that takes into consideration both functional and non-functional requirements. Based on existing work, we raise the level of abstraction for the conceptual representation of ETL operations and we show how process quality characteristics can generate specific patterns in the process design.
  • Alberto Abelló, Ramon Bragós, Margarita Cabrera, Antonia Cortés, Àlex Fabra, Josep Fernández, José Lázaro, Jordi Amorós, Neus Arroyo, Francesc Garófano, Daniel González, Aleix Guash, Ferran Recio. Plataforma per a la interoperabilitat de laboratoris virtuals i remots [Platform for the interoperability of virtual and remote laboratories]. Revista de Tecnologia, Número 5, 2014. Pages 35-43. ISSN (printed): 1698-2045. ISSN (electronic): 2013-9861. DOI: 10.2436/20.2004.01.14

2013
  • Unleashing the Potential of Big Data. A white paper based on the 2013 World Summit on Big Data and Organization Design. http://www.e-pages.dk/aarhusuniversitet/775/
    "While knowledge is the engine of the economy, Big Data is its fuel." This characterization of Big Data was made by Ms. Neelie Kroes, European Commission Vice President in charge of the digital agenda for Europe. Kroes calls Big Data the "new oil". For traditional industries and the service sector, Big Data will create a huge number of commercial opportunities. For the public sector, Big Data offers a promising route to service improvement and transparency as well as a tool for making infrastructure and other investments. Politicians and policymakers are aware of both the potential and the dangers of Big Data. In 2012, the Obama Administration launched the Big Data Research and Development Initiative in the United States, and the European Commission (EC) is taking steps to remove obstacles to the use of Big Data through legislation, standards setting, and its R&D programmes. Hand-in-hand with new data-protection legislation, the EC wants to formulate an overall cybersecurity strategy to ensure that individual and organizational data are properly used and protected. Alongside harmonized rules for how data is handled, the EC is pushing for standards to allow the interoperability and integration of data. Other government initiatives focus on technological development and infrastructure projects. This White Paper offers ideas and recommendations to further increase the value of Big Data initiatives while protecting against their risks. Governments, universities, and business all have a role to play in this endeavor, and we hope that decision makers will find the paper helpful as they pursue their respective tasks.
  • Alberto Abelló, Jérôme Darmont, Lorena Etcheverry, Matteo Golfarelli, José-Norberto Mazón, Felix Naumann, Torben Bach Pedersen, Stefano Rizzi, Juan Trujillo, Panos Vassiliadis, and Gottfried Vossen. Fusion Cubes: Towards Self-Service Business Intelligence. In International Journal on Data Warehousing and Mining (IJDWM), volume 9, number 2. Idea Group, 2013. Pages 66-88. ISSN: 1548-3924 DOI: 10.4018/jdwm.2013040104
    Self-service business intelligence is about enabling non-expert users to make well-informed decisions by enriching the decision process with situational data, i.e., data that have a narrow focus on a specific business problem and, typically, a short lifespan for a small group of users. Often, these data are not owned and controlled by the decision maker; their search, extraction, integration, and storage for reuse or sharing should be accomplished by decision makers without any intervention by designers or programmers. The goal of this paper is to present the framework we envision to support self-service business intelligence and the related research challenges; the underlying core idea is the notion of fusion cubes, i.e., multidimensional cubes that can be dynamically extended both in their schema and their instances, and in which situational data and metadata are associated with quality and provenance annotations.
  • Oscar Romero, Alberto Abelló. Semantic Aware Business Intelligence. In Third European Business Intelligence Summer School (eBISS'13). Lecture Notes in Business Information Processing, Volume 172. Pages 121-149. Springer, July 2014. ISBN (printed): 978-3-319-05460-5. ISBN (electronic): 978-3-319-05461-2. DOI: 10.1007/978-3-319-05461-2_4
    The vision of an interconnected and open Web of data is still a chimera far from being accomplished. Fortunately, though, one can find several pieces of evidence in this direction and, despite the technical challenges behind such an approach, recent advances have shown its feasibility. Semantic-aware formalisms (such as RDF and ontology languages) have been successfully put into practice in approaches such as Linked Data, whereas movements like Open Data have stressed the need for a new open access paradigm to guarantee free access to Web data.

    In view of such a promising scenario, traditional business intelligence (BI) techniques and methods have been shown not to be appropriate. BI was born to support decision making within organizations, and the data warehouse, the most popular IT construct to support BI, has typically been nurtured with data either owned by or accessible within the organization. With the new linked open data paradigm, BI systems must meet new requirements, such as providing on-demand analysis tasks over any relevant (either internal or external) data source in right-time. In this paper we discuss the technical challenges behind such requirements, which we refer to as exploratory BI, and envision a new kind of BI system to support this scenario.
  • Carme Martín, Toni Urpí, M. José Casany, Xavier Burgués, Carme Quer, M. Elena Rodríguez and Alberto Abelló. Improving Learning in a Database Course using Collaborative Learning Techniques. In International Journal of Engineering Education (IJEE), volume 29, number 4. Tempus publications, 2013. Pages 1-12. ISSN: 0949-149X
    In recent years, European universities have been adapting their curricula to the new European Higher Education Area, which implies the use of active learning methodologies. In most database courses, project-based learning is the active methodology most widely used, but the authors of this paper face contextual constraints against its use. This paper presents a quantitative and qualitative analysis of the results obtained from the use of collaborative learning for both the cross-curricular competences and the subject-specific ones in the "Introduction to Databases" course of the Barcelona School of Informatics. Relevantly, this analysis demonstrates the positive impact this methodology had, allowing us to conclude that project-based learning is not the only methodology that fits this kind of course.
2012
  • Alberto Abelló, Oscar Romero. Ontology driven search of compound IDs. Knowledge and Information Systems, Volume 32, Issue 1. Pages 191-216. Springer, July 2012. ISSN (printed): 0219-1377. ISSN (electronic): 0219-3116. DOI: 10.1007/s10115-011-0418-0
    Object identification is a crucial step in most information systems. Nowadays, we have many different ways to identify entities such as surrogates, keys, and object identifiers. However, not all of them guarantee the entity identity. Many works have been introduced in the literature for discovering meaningful identifiers (i.e., guaranteeing the entity identity according to the semantics of the universe of discourse), but all of them work at the logical or data level and they share some constraints inherent to the kind of approach. Addressing it at the logical level, we may miss some important data dependencies, while the cost to identify data dependencies purely at the data level may not be affordable. In this paper, we propose an approach for discovering meaningful identifiers driven by domain ontologies. In our approach, we guide the process at the conceptual level and we introduce a set of pruning rules for improving the performance by reducing the number of identifier hypotheses generated and to be verified with data. Finally, we also introduce a simulation over a case study to show the feasibility of our method.
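    For illustration, the sketch below verifies candidate identifiers against data by checking the uniqueness of value combinations; it covers only the final data-level verification step, none of the ontology-driven pruning the paper contributes, and all names in it are invented (minimality of the returned sets is not enforced).
      from itertools import combinations

      def is_identifier(rows, attrs):
          # An attribute set identifies the entities if no two rows share the
          # same combination of values for those attributes.
          seen = set()
          for row in rows:
              key = tuple(row[a] for a in attrs)
              if key in seen:
                  return False
              seen.add(key)
          return True

      def candidate_identifiers(rows, max_size=2):
          attrs = sorted(rows[0].keys())
          return [c for size in range(1, max_size + 1)
                  for c in combinations(attrs, size) if is_identifier(rows, c)]

      people = [{"name": "Ana", "dept": "DB", "office": 1},
                {"name": "Ana", "dept": "AI", "office": 2},
                {"name": "Joan", "dept": "DB", "office": 3}]
      print(candidate_identifiers(people))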
  • Petar Jovanovic, Oscar Romero, Alkis Simitsis, Alberto Abelló. Integrating ETL Processes from Information Requirements. 14th International Conference on Data Warehousing and Knowledge Discovery (DaWaK). Vienna, Austria, September 3-6, 2012. Lecture Notes in Computer Science 7448. Springer, 2012. Pages 65-80. ISBN (printed): 978-3-642-32583-0. ISBN (electronic): 978-3-642-32584-7. DOI 10.1007/978-3-642-32584-7_6
    Data warehouse (DW) design is based on a set of requirements expressed as service level agreements (SLAs) and business level objects (BLOs). Populating a DW system from a set of information sources is realized with extract-transform-load (ETL) processes based on SLAs and BLOs. The entire task is complex, time consuming, and hard to perform manually. This paper presents our approach to the requirement-driven creation of ETL designs. Each requirement is considered separately and a respective ETL design is produced. We propose an incremental method for consolidating these individual designs and creating an ETL design that satisfies all given requirements. Finally, the design produced is sent to an ETL engine for execution. We illustrate our approach through an example based on TPC-H and report on our experimental findings, which show the effectiveness and quality of our approach.
  • Petar Jovanovic, Oscar Romero, Alkis Simitsis, Alberto Abelló. ORE: an iterative approach to the design and evolution of multi-dimensional schemas. In 15th International Workshop on Data Warehousing and OLAP (DOLAP). Maui (USA), October 2012. ACM Press, 2012. Pages 1-8. ISBN: 978-1-4503-1721-4. DOI 10.1145/2390045.2390047
    Designing a data warehouse (DW) highly depends on the information requirements of its business users. However, tailoring a DW design that satisfies all business requirements is not an easy task. In addition, complex and evolving business environments result in a continuous emergence of new or changed business needs. Furthermore, for building a correct multidimensional (MD) schema for a DW, the designer should deal with the semantics and heterogeneity of the underlying data sources. To cope with such an inevitable complexity, both at the beginning of the design process and when a potential evolution event occurs, in this paper we present a semi-automatic method, named ORE, for constructing the MD schema in an iterative fashion based on the information requirements. In our approach, we consider each requirement separately and incrementally build the unified MD schema satisfying the entire set of requirements.
  • Petar Jovanovic, Oscar Romero, Alkis Simitsis, Alberto Abelló. Requirement-Driven Creation and Deployment of Multidimensional and ETL Designs. In 31st International Conference on Conceptual Modeling (ER) Workshops. Springer 2012. Pages 391-395. ISBN: 978-3-642-33999-8
    We present our tool, GEM, for assisting designers in the error-prone and time-consuming tasks carried out at the early stages of a data warehousing project. Our tool semi-automatically produces multidimensional (MD) and ETL conceptual designs from a given set of business requirements (like SLAs) and data source descriptions. Subsequently, our tool translates both the MD and ETL conceptual designs produced into physical designs, so they can be further deployed on a DBMS and an ETL engine. In this paper, we describe the system architecture and present our demonstration proposal by means of an example.
  • Alberto Abelló, Ladjel Bellatreche, Boualem Benatallah (Eds.). Model and Data Engineering - 2nd International Conference, MEDI 2012, Poitiers, France, October 3-5, 2012. Proceedings. Lecture Notes in Computer Science 7602, Springer 2012. ISBN (printed): 978-3-642-33608-9. ISBN (electronic): 978-3-642-33609-6. DOI: 10.1007/978-3-642-33609-6
  • José Fernández, Ramón Bragós, Margarita Cabrera, Alberto Abelló, Neus Arroyo, Daniel González, Francesc Garófano, A. Cortés, A. Fabra. Interoperability platform for virtual and remote laboratories. In 9th International Conference on Remote Engineering and Virtual Instrumentation (REV). IEEE, 2012. Pages 1-7. ISBN: 978-1-4673-2542-4
    This communication describes the interoperability platform that has been developed at the Technical University of Catalonia (UPC) to integrate the access to different virtual and remote laboratories. Up to eleven laboratories belonging to GilabViR, the interest group on virtual and remote laboratories in our University, have been analyzed to generate a set of specifications and to develop the architecture and the applications that allow access to them through the university LMS. Although the current LMS platform (Atenea) is implemented over Moodle 1.9, the new modules have been developed using Moodle 2.2.1, given that the migration to this version will be done in the coming months. The interoperability platform defines new Moodle modules that allow the interconnection between the LMS and a set of laboratories and provide the intrinsic LMS features (user identification, activity recording, educational materials repository, ...). There are modules that allow the interaction with a web service interface giving access to the laboratory, others for Java applet virtual laboratories, and others that make possible the link with LabView-based remote laboratories, all of them recording experiment parameters in SQL databases placed in the experiment servers.
  • Carme Martín, Antoni Urpi, Alberto Abelló, Xavier Burgués, M. José Casañ, Carme Quer, M. Elena Rodríguez. Avaluació de la incorporació d'activitats d'aprenentatge actiu i cooperatiu a les assignatures de bases de dades de la Facultat d'Informàtica de Barcelona. In VII Congrés Internacional de Docència Universitària i Innovació (CIDUI). 2012. Pages: 1-38. ISBN: 9788499213002
  • Alberto Abelló, and Oscar Romero. Service-Oriented Business Intelligence. In First European Business Intelligence Summer School (eBISS'11). Lecture Notes in Business Information Processing Volume 96. Springer, 2012. Pages 156-185. ISSN: 1865-1348. ISBN (paper): 978-3-642-27357-5. ISBN (Electronic): 978-3-642-27358-2. DOI: 10.1007/978-3-642-27358-2_8
    The traditional way to manage Information Technologies (IT) in companies is to have a data center and to license monolithic applications based on the number of CPUs, allowed connections, etc. This also holds for Business Intelligence environments. Nevertheless, technologies have evolved and today other approaches are possible. Specifically, the service paradigm allows outsourcing hardware as well as software in a pay-as-you-go model. In this work, we will introduce the concepts related to this paradigm and analyze how they affect Business Intelligence (BI). We will analyze the specificity of services and present specific techniques for engineering service systems (e.g., Cloud Computing, Service-Oriented Architectures -SOA- and Business Process Modeling -BPM-). Then, we will also analyze to what extent it is possible to consider Business Intelligence just a service and use these same techniques on it. Finally, we will explore the other way round: since service companies represent around 70% of the Gross Domestic Product (GDP) in the world, special attention must be paid to their characteristics and to how BI techniques can be adapted to enhance services.
2011
  • Alberto Abelló, Jaume Ferrarons, Oscar Romero. Building cubes with MapReduce. In 14th International Workshop on Data Warehousing and OLAP (DOLAP). Glasgow (United Kingdom), October 2011. ACM Press, 2011. Pages 18-24. ISBN: 978-1-4503-0963-9. DOI: 10.1145/2064676.2064680
    In the last years, the problems of using generic storage techniques for very specific applications have been detected and outlined. Thus, some alternatives to relational DBMSs (e.g., BigTable) are blooming. On the other hand, cloud computing is already a reality that helps to save money by eliminating the hardware as well as software fixed costs and just paying per use. Indeed, specific software tools to exploit a cloud are also here. The trend in this case is toward using tools based on the MapReduce paradigm developed by Google. In this paper, we explore the possibility of having data in a cloud by using BigTable to store the corporate historical data and MapReduce as an agile mechanism to deploy cubes in ad-hoc Data Marts. Our main contribution is the comparison of three different approaches to retrieve data cubes from BigTable by means of MapReduce and the definition of criteria to choose among them.
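    A minimal sketch (ours, not the paper's code) of the kind of map/reduce aggregation compared in the paper: the map function emits the dimension values of each record as the key and the measure as the value, and the reduce function aggregates the measures sharing a key. The record fields (shop, month, amount) are invented for illustration, and the shuffle phase is simulated in memory.

      from collections import defaultdict

      def map_record(record, dims, measure):
          # Emit (dimension values, measure) for one detailed record.
          return tuple(record[d] for d in dims), record[measure]

      def reduce_group(values):
          # Aggregate all measure values sharing the same dimension key.
          return sum(values)

      def build_cube(records, dims, measure):
          groups = defaultdict(list)
          for r in records:
              k, v = map_record(r, dims, measure)
              groups[k].append(v)      # shuffle/sort phase, simulated in memory
          return {k: reduce_group(vs) for k, vs in groups.items()}

      sales = [{"shop": "A", "month": "2011-01", "amount": 10},
               {"shop": "A", "month": "2011-01", "amount": 5},
               {"shop": "B", "month": "2011-02", "amount": 7}]
      print(build_cube(sales, ["shop", "month"], "amount"))
      # {('A', '2011-01'): 15, ('B', '2011-02'): 7}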
  • Oscar Romero, Patrick Marcel, Alberto Abelló, Verónika Peralta, Ladjel Bellatreche. Describing Analytical Sessions Using a Multidimensional Algebra. In 13th International Conference on Data Warehousing and Knowledge Discovery (DaWaK). Toulouse, France, August 29-September 2, 2011. Lecture Notes in Computer Science 6862, Springer 2011. Pages 224-239. ISBN: 978-3-642-23543-6. DOI:10.1007/978-3-642-23544-3_17
    Recent efforts to support analytical tasks over relational sources have pointed out the necessity to come up with flexible, powerful means for analyzing the issued queries and exploiting them in decision-oriented processes (such as query recommendation or physical tuning). Issued queries should be decomposed, stored and manipulated in a dedicated subsystem. With this aim, we present a novel approach for representing SQL analytical queries in terms of a multidimensional algebra, which better characterizes the analytical efforts of the user. In this paper we discuss how an SQL query can be formulated as a multidimensional algebraic characterization. Then, we discuss how to normalize them in order to bridge (i.e., collapse) several SQL queries into a single characterization (representing the analytical session), according to their logical connections.
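    As a rough illustration of what such a characterization captures (a simplified sketch of ours, not the algebra or the normalization procedure of the paper), the snippet below reduces a simple aggregation query to its GROUP BY attributes (candidate dimension levels), its aggregated expressions (candidate measures) and its selection predicate; the table and column names are invented.

      import re

      def characterize(sql):
          sql = " ".join(sql.split())
          measures = re.findall(r"\b(SUM|AVG|MIN|MAX|COUNT)\s*\(([^)]*)\)", sql, re.I)
          group_by = re.search(r"GROUP\s+BY\s+(.*?)(ORDER\s+BY|$)", sql, re.I)
          where = re.search(r"WHERE\s+(.*?)(GROUP\s+BY|ORDER\s+BY|$)", sql, re.I)
          return {"dimensions": [a.strip() for a in group_by.group(1).split(",")] if group_by else [],
                  "measures": [f"{f.upper()}({a.strip()})" for f, a in measures],
                  "selection": where.group(1).strip() if where else None}

      q = """SELECT c.region, t.month, SUM(s.amount)
             FROM sales s JOIN customer c ON s.cust = c.id JOIN time t ON s.day = t.day
             WHERE t.year = 2011
             GROUP BY c.region, t.month"""
      print(characterize(q))
      # {'dimensions': ['c.region', 't.month'], 'measures': ['SUM(s.amount)'], 'selection': 't.year = 2011'}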
  • Oscar Romero, Alkis Simitsis, Alberto Abelló. GEM: Requirement-Driven Generation of ETL and Multidimensional Conceptual Designs. In 13th International Conference on Data Warehousing and Knowledge Discovery (DaWaK). Toulouse, France, August 29-September 2, 2011. Lecture Notes in Computer Science 6862. Springer, 2011. Pages 80-95. ISBN: 978-3-642-23543-6. DOI:10.1007/978-3-642-23544-3_7
    At the early stages of a data warehouse design project, the main objective is to collect the business requirements and needs, and translate them into an appropriate conceptual, multidimensional design. Typically, this task is performed manually, through a series of interviews involving two different parties: the business analysts and technical designers. Producing an appropriate conceptual design is an error-prone task that undergoes several rounds of reconciliation and redesigning, until the business needs are satisfied. It is of great importance for the business of an enterprise to facilitate and automate such a process. The goal of our research is to provide designers with a semi-automatic means for producing conceptual multidimensional designs and also, conceptual representation of the extract-transform-load (ETL) processes that orchestrate the data flow from the operational sources to the data warehouse constructs. In particular, we describe a method that combines information about the data sources along with the business requirements, for validating and completing -if necessary- these requirements, producing a multidimensional design, and identifying the ETL operations needed. We present our method in terms of the TPC-DS benchmark and show its applicability and usefulness.
  • Oscar Romero, Alberto Abelló. A Comprehensive Framework on Multidimensional Modeling. In Advances in Conceptual Modeling. Recent Developments and New Directions - ER 2011 Workshops (MoRE-BI). Brussels, Belgium, October 31 - November 3, 2011. Lecture Notes in Computer Science 6999. Springer, 2011. Pages 108-117. ISBN: 978-3-642-24573-2. DOI: 10.1007/978-3-642-24574-9_14
    In this paper we discuss what current multidimensional design approaches provide and what their major flaws are. Our contribution lies in a comprehensive framework that does not focus on how these approaches work but on what they provide for usage in real data warehouse projects. Thus, we do not aim to compare current approaches but rather to set up a framework (based on four criteria: the role played by end-user requirements and data sources, the degree of automation achieved, and the quality of the output produced) highlighting their drawbacks and the need for further research in this area.
  • Oscar Romero, Alberto Abelló. Data-Driven Multidimensional Design for OLAP. In poster session in 23rd International Conference Scientific and Statistical Database Management (SSDBM). Portland, OR, USA, July 2011. Lecture Notes in Computer Science 6809. Springer, 2011. Pages 594-595. ISBN: 978-3-642-22350-1. DOI:10.1007/978-3-642-22351-8_51. See poster.
    OLAP is a popular technology to query scientific and statistical databases, but its success heavily depends on a proper design of the underlying multidimensional (MD) databases (i.e., based on the fact / dimension paradigm). Relevantly, different approaches to automatically identify facts are nowadays available, but all MD design methods rely on discovering functional dependencies (FDs) to identify dimensions. However, an unbound FD search generates a combinatorial explosion and, accordingly, these methods produce MD schemas with too many dimensions whose meaning has not been analyzed in advance. In contrast, i) we use the available ontological knowledge to drive the FD search and avoid the combinatorial explosion, and ii) we only propose dimensions of interest to analysts by performing a statistical study of the data.
  • A. Abelló, X. Burgués. Puntuación entre iguales para la evaluación del trabajo en equipo. In XVII Jornadas de Enseñanza Universitaria de la Informática (JENUI), Sevilla (España), July 2011. Pages 73-80. ISBN: 978-84-694-5156-4

    The move to the European Higher Education Area (EHEA) and the adoption of a competence-based assessment system, some of these competences being non-technical, forces us to consider changes not only in the way we teach but also in the way we assess. Assessing, for instance, attitude towards work, teamwork or the capacity for innovation by means of an exam is clearly inappropriate, if not impossible. In this sense, we have experimented for two semesters with peer assessment of the generic competence "teamwork". In this work, we present the experience and the conclusions drawn.
  • A. Abelló. NOSQL: The death of the Star. As Invited speaker in VII journées francophones sur les entrepots de Données et Analyses en ligne (EDA), Clermont-Ferrand (France), June 2011. Pages 1-2. Hermann, 2011. ISBN: 978-27056-81-2
    In the last years, the problems of using generic storage techniques for very specific applications have been detected and outlined. Thus, some alternatives to relational DBMSs (e.g. BigTable and C-Store) are blooming. On the other hand, cloud computing is already a reality that helps to save money by eliminating the hardware as well as software fixed costs and just paying per use. Thus, specific software tools to exploit the cloud have also appeared. The trend in this case is to use implementations based on the MapReduce paradigm developed by Google. The basic goal of this talk will be the introduction and discussion of these ideas from the point of view of Data Warehousing and OLAP. We will see advantages, disadvantages and some possibilities it offers.
  • Oscar Romero, Alberto Abelló. Multidimensional Design Methods for Data Warehousing. Chapter 5 in Integrations of Data Warehousing, Data Mining and Database Technologies: Innovative Approaches. Editors David Taniar, Li Chen. IGI Global, 2011. Pages 78-105. ISBN (printed): 978-1-60960-537-7. ISBN (electronic): 978-1-60960-538-4. DOI: 10.4018/978-1-60960-537-7.ch005
    In the last years, data warehousing systems have gained relevance to support decision making within organizations. The core component of these systems is the data warehouse, and nowadays it is widely assumed that the data warehouse design must follow the multidimensional paradigm. Thus, many methods have been presented to support the multidimensional design of the data warehouse. The first methods introduced were requirement-driven, but the semantics of the data warehouse (since the data warehouse is the result of homogenizing and integrating relevant data of the organization in a single, detailed view of the organization business) also require considering the data sources during the design process. Considering the data sources gave rise to several data-driven methods that automate the data warehouse design process, mainly from relational data sources. Currently, research on multidimensional modeling is still a hot topic and there are two main research lines. On the one hand, new hybrid automatic methods have been introduced that propose combining data-driven and requirement-driven approaches. These methods focus on automating the whole process and improving the feedback retrieved by each approach to produce better results. On the other hand, some new approaches focus on considering alternative scenarios to relational sources. These methods also consider (semi-)structured data sources, such as ontologies or XML, that have gained relevance in the last years. Thus, they introduce innovative solutions for overcoming the heterogeneity of the data sources. All in all, we discuss the current scenario of multidimensional modeling by carrying out a survey of multidimensional design methods. We present the most relevant methods introduced in the literature and a detailed comparison showing the main features of each approach.
  • Rafael Berlanga, Oscar Romero, Alkis Simitsis, Victoria Nebot, Torben Bach Pedersen, Alberto Abelló, María José Aramburu. Semantic Web Technologies for Business Intelligence. Chapter 14 in Business Intelligence Applications and the Web: Models, Systems, and Technologies. Editors Marta E. Zorrilla, Jose-Norberto Mazón, Óscar Ferrández, Irene Garrigós, Florian Daniel, Juan Trujillo. IGI Global, 2011. Pages 310-339. ISBN (printed): 978-1-61350-038-5. ISBN (electronic): 978-1-61350-039-2. ISBN (perpetual access): 978-1-61350-040-8. DOI: 10.4018/978-1-61350-038-5.ch014
    This chapter describes the convergence of two of the most influential technologies in the last decade, namely business intelligence (BI) and the Semantic Web (SW). Business intelligence is used by almost any enterprise to derive important business-critical knowledge from both internal and (increasingly) external data. When using external data, most often found on the Web, the most important issue is knowing the precise semantics of the data. Without this, the results cannot be trusted. Here, Semantic Web technologies come to the rescue, as they allow semantics ranging from very simple to very complex to be specified for any web-available resource. SW technologies do not only support capturing the "passive" semantics, but also support active inference and reasoning on the data. The chapter first presents a motivating running example, followed by an introduction to the relevant SW foundation concepts. The chapter then goes on to survey the use of SW technologies for data integration, including semantic data annotation and semantics-aware extract, transform, and load processes (ETL). Next, the chapter describes the relationship of multidimensional (MD) models and SW technologies, including the relationship between MD models and SW formalisms, and the use of advanced SW reasoning functionality on MD models. Finally, the chapter describes in detail a number of directions for future research, including SW support for intelligent BI querying, using SW technologies for providing context to data warehouses, and scalability issues. The overall conclusion is that SW technologies are very relevant for the future of BI, but that several new developments are needed to reach the full potential.
2010
  • Alberto Abelló, Oscar Romero. Using ontologies to discover fact IDs. In 13th International Workshop on Data Warehousing and OLAP (DOLAP 2010). Toronto (Canada), October 2010. ACM Press, 2010. Pages 3-10. ISBN: 978-1-4503-0383-5. DOI: 10.1145/1871940.1871944
    Object identification is a crucial step in most information systems. Nowadays, we have many different ways to identify entities, such as surrogates, keys and object identifiers. However, not all of them guarantee the entity identity. Many works have been introduced in the literature for discovering meaningful IDs, but all of them work at the logical or data level and they share some constraints inherent to that kind of approach. Addressing it at the logical level, we may miss some important data dependencies, while the cost to identify data dependencies at the data level may not be affordable. In this paper, we propose an approach for discovering fact IDs from domain ontologies. In our approach, we guide the process at the conceptual level and we introduce a set of pruning rules for improving the performance by reducing the number of ID hypotheses that are generated and need to be verified with data. Finally, we also introduce a simulation over a case study to show the feasibility of our method.
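    For contrast with the conceptual-level approach of the paper, the sketch below shows the purely data-level alternative it improves upon: candidate IDs are enumerated as growing attribute combinations, supersets of already-found IDs are pruned, and the remaining hypotheses are verified against data. The example data and attribute names are invented; note how the spurious ID (amount) is accepted only because it happens to be unique in this sample, which is the kind of pitfall that motivates guiding the search at the conceptual level.

      from itertools import combinations

      def is_unique(rows, attrs):
          keys = [tuple(r[a] for a in attrs) for r in rows]
          return len(keys) == len(set(keys))

      def discover_ids(rows, attributes, max_size=3):
          found = []
          for size in range(1, max_size + 1):
              for combo in combinations(attributes, size):
                  # Pruning: a superset of a known ID cannot be a minimal ID.
                  if any(set(f) <= set(combo) for f in found):
                      continue
                  if is_unique(rows, combo):
                      found.append(combo)
          return found

      rows = [{"shop": "A", "day": 1, "amount": 10},
              {"shop": "A", "day": 2, "amount": 5},
              {"shop": "B", "day": 1, "amount": 7}]
      print(discover_ids(rows, ["shop", "day", "amount"]))
      # [('amount',), ('shop', 'day')]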
  • A. Abelló, X. Burgués, M. E. Rodríguez. Utilización de glosarios de Moodle para incentivar la participación y dedicación de los estudiantes. In XVI Jornadas de Enseñanza Universitaria de la Informática (JENUI), Santiago de Compostela (Spain), 2010. Pages 309-316. ISBN: 84-693-3741-7

    The move to the European Higher Education Area (EHEA) and the adoption of the new ECTS credit system, which measures the hours of student dedication rather than those of the teacher, forces us to consider new teaching methods that encourage, while also bounding and monitoring, the students' dedication outside the classroom. In this sense, we have experimented with the use of the glossaries provided by Moodle to encourage students to review at home the theory presented in class, continuously throughout the course (and not only just before the final exam).
  • Xavier Burgués, Carme Quer, Carme Martín, Alberto Abelló, M. José Casany, M. Elena Rodríguez, Toni Urpí. Adapting LEARN-SQL to Database computer supported cooperative learning. In Workshop on Methods and Cases in Computing Education (MCCE). Cadiz (Spain), July 2010.
    LEARN-SQL is a tool that we have been using for three years in several database courses, and that has shown its positive effects on the learning of different database topics. This tool allows proposing remote questionnaires to students, which are automatically corrected, giving them feedback and promoting the self-learning and self-assessment of their work. However, as currently used, this tool does not offer the possibility of proposing structured exercises to teams in a way that promotes cooperative learning. In this paper, we present our adaptation of the LEARN-SQL tool to allow some Computer-Supported Collaborative Learning techniques.
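    A minimal sketch (not the LEARN-SQL implementation, which also handles updates, procedures and triggers) of the basic idea of accepting any equivalent formulation: execute the student's statement next to a reference solution on the same database and compare the result sets. The schema and queries are invented.

      import sqlite3

      def same_answer(db_setup, reference_sql, student_sql, ordered=False):
          con = sqlite3.connect(":memory:")
          con.executescript(db_setup)
          expected = con.execute(reference_sql).fetchall()
          obtained = con.execute(student_sql).fetchall()
          con.close()
          return expected == obtained if ordered else sorted(expected) == sorted(obtained)

      setup = """CREATE TABLE emp(name TEXT, dept TEXT, salary INT);
                 INSERT INTO emp VALUES ('Ann','IT',30), ('Bob','IT',20), ('Eve','HR',25);"""
      print(same_answer(setup,
                        "SELECT dept, MAX(salary) FROM emp GROUP BY dept",
                        "SELECT e.dept, (SELECT MAX(salary) FROM emp WHERE dept = e.dept) "
                        "FROM emp e GROUP BY e.dept"))   # True: different SQL, same answer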
  • Carme Martín, Alberto Abelló, Xavier Burgués, M. José Casany, Carme Quer, M. Elena Rodríguez, Toni Urpí. Adaptació d'assignatures de bases de dades a l'EEES. In VII Congreso Internacional de Docencia Universitaria e Innovación (CIDUI). Barcelona (Spain), July 2010.
    The recent changes in the curricula of UPC and UOC take the new European Higher Education Area (EHEA) into account. One of the direct consequences of these changes is the need to bound and optimize the time devoted to learning activities that require the active participation of the student and that are carried out continuously throughout the semester. Moreover, the EHEA highlights the importance of practical work, interpersonal relationships and the ability to work in teams, suggesting a reduction of lectures and an increase of activities that foster both the personal work of the student and cooperative work. In the field of computer science teaching, in database courses the problem is especially complex because exercise statements do not usually have a single solution. We have developed a tool, called LEARN-SQL, whose goal is to automatically correct any kind of SQL statement (queries, updates, stored procedures, triggers, etc.) and to discern whether the answer provided by the student is correct or not, regardless of the specific solution he or she proposes. In this way we promote self-learning and self-assessment, making supervised blended learning possible and facilitating individualized learning according to the needs of each student. Additionally, this tool helps teachers design assessment tests, also allowing the option of qualitatively reviewing the solutions provided by the students. Finally, the system helps students learn from their own mistakes by providing quality feedback.
  • Oscar Romero. Automating the multidimensional design of data warehouses. PhD Thesis, Universitat Politècnica de Catalunya. Barcelona, February 2010.

    Previous experiences in the data warehouse field have shown that the data warehouse multidimensional conceptual schema must be derived from a hybrid approach, i.e., by considering both the end-user requirements and the data sources as first-class citizens. As in any other system, requirements guarantee that the system devised meets the end-user needs. In addition, since the data warehouse design task is a reengineering process, it must consider the underlying data sources of the organization: (i) to guarantee that the data warehouse can be populated from data available within the organization, and (ii) to allow the end-user to discover unknown additional analysis capabilities.

    Currently, several methods for supporting the data warehouse modeling task have been provided. However, they suffer from some significant drawbacks. In short, requirement-driven approaches assume that requirements are exhaustive (and therefore do not consider that the data sources may contain alternative interesting evidence of analysis), whereas data-driven approaches (i.e., those leading the design task from a thorough analysis of the data sources) rely on discovering as much multidimensional knowledge as possible from the data sources. As a consequence, data-driven approaches generate too many results, which mislead the user. Furthermore, the design task automation is essential in this scenario, as it removes the dependency on an expert's ability to properly apply the method chosen, and the need to analyze the data sources, which is a tedious and time-consuming task (and can be unfeasible when working with large databases). In this sense, current automatable methods follow a data-driven approach, whereas current requirement-driven approaches overlook the process automation, since they tend to work with requirements at a high level of abstraction. Indeed, this scenario is repeated regarding the data-driven and requirement-driven stages within current hybrid approaches, which suffer from the same drawbacks as pure data-driven or requirement-driven approaches.

    In this thesis we introduce two different approaches for automating the multidimensional design of the data warehouse: MDBE (Multidimensional Design Based on Examples) and AMDO (Automating the Multidimensional Design from Ontologies). Both approaches were devised to overcome the limitations from which current approaches suffer. Importantly, our approaches consider opposite initial assumptions, but both consider the end-user requirements and the data sources as first-class citizens.

    1. MDBE follows a classical approach, in which the end-user requirements are well known beforehand. This approach benefits from the knowledge captured in the data sources, but guides the design task according to requirements; consequently, it is able to work with and handle semantically poorer data sources. In other words, given high-quality end-user requirements, we can guide the process from the knowledge they contain and compensate for data sources of poor semantic quality.

    2. AMDO, in contrast, assumes a scenario in which the available data sources are semantically richer. Thus, the proposed approach is guided by a thorough analysis of the data sources, which is properly adapted to shape the output result according to the end-user requirements. In this context, given high-quality data sources, we can compensate for the lack of expressive end-user requirements.

    Importantly, our methods establish a combined and comprehensive framework that can be used to decide, according to the inputs provided in each scenario, which is the best approach to follow. For example, we cannot follow the same approach in a scenario where the end-user requirements are clear and well known and in a scenario in which the end-user requirements are not evident or cannot be easily elicited (e.g., this may happen when the users are not aware of the analysis capabilities of their own sources). Interestingly, the need to have requirements beforehand is softened by the availability of semantically rich data sources; in the absence of such sources, requirements gain relevance for extracting the multidimensional knowledge from the sources.

    Thus, we claim to provide two approaches whose combination turns out to be exhaustive with regard to the scenarios discussed in the literature.
  • Oscar Romero, Alberto Abelló. A framework for multidimensional design of data warehouses from ontologies. In Data & Knowledge Engineering, Volume 69, Issue 11. Elsevier, 2010. Pages 1138-1157. ISSN: 0169-023X. DOI: 10.1016/j.datak.2010.07.007
    The data warehouse design task needs to consider both the end-user requirements and the organization data sources. For this reason, the data warehouse design has been traditionally considered a reengineering process, guided by requirements, from the data sources.

    Most current design methods available demand highly expressive end-user requirements as input in order to carry out the exploration and analysis of the data sources. However, eliciting the end-user information requirements might turn out to be a laborious task. Importantly, in the data warehousing context, the analysis capabilities of the target data warehouse depend on what kind of data is available in the data sources. Thus, in those scenarios where the analysis capabilities of the data sources are not (fully) known, it is possible to help the data warehouse designer identify and elicit unknown analysis capabilities.

    In this paper we introduce a user-centered approach to support the end-user requirements elicitation and the data warehouse multidimensional design tasks. Our proposal is based on a reengineering process that derives the multidimensional schema from a conceptual formalization of the domain. It starts by fully analyzing the data sources to identify, without considering requirements yet, the multidimensional knowledge they capture (i.e., data likely to be analyzed from a multidimensional point of view). Next, we propose to exploit this knowledge in order to support the requirements elicitation task. In this way, we are already conciliating requirements with the data sources, and we are able to fully exploit the analysis capabilities of the sources. Once requirements are clear, we automatically create the data warehouse conceptual schema according to the multidimensional knowledge extracted from the sources.
  • Oscar Romero, Alberto Abelló. Automatic validation of requirements to support multidimensional design. In Data & Knowledge Engineering, Volume 69, Issue 9. Elsevier, 2010. Pages 917-942. ISSN: 0169-023X. DOI: 10.1016/j.datak.2010.03.006
    It is widely accepted that the conceptual schema of a data warehouse must be structured according to the multidimensional model. Moreover, it has been suggested that the ideal scenario for deriving the multidimensional conceptual schema of the data warehouse would consist of a hybrid approach (i.e., a combination of data-driven and requirement-driven paradigms). Thus, the resulting multidimensional schema would satisfy the end-user requirements and would be conciliated with the data sources. Most current methods follow either a data-driven or requirement-driven paradigm and only a few use a hybrid approach. Furthermore, hybrid methods are unbalanced and do not benefit from all of the advantages brought by each paradigm.

    In this paper we present our approach for multidimensional design. The most relevant step in our framework is Multidimensional Design by Examples (MDBE), which is a novel method for deriving multidimensional conceptual schemas from relational sources according to end-user requirements. MDBE introduces several advantages over previous approaches, which can be summarized as three main contributions. (i) The MDBE method is a fully automatic approach that handles and analyzes the end-user requirements automatically. (ii) Unlike data-driven methods, we focus on data of interest to the end-user. However, the user may not be aware of all the potential analyses of the data sources and, in contrast to requirement-driven approaches, MDBE can propose new multidimensional knowledge related to concepts already queried by the user. (iii) Finally, MDBE proposes meaningful multidimensional schemas derived from a validation process. Therefore, the proposed schemas are sound and meaningful.
  • Alberto Abelló, Il-Yeol Song. Data warehousing and OLAP (DOLAP'08). In Data & Knowledge Engineering, Volume 69, Issue 1. Elsevier, 2010. Pages 1-2. ISSN: 0169-023X. DOI: 10.1016/j.datak.2009.08.011
2009
  • Oscar Romero, Diego Calvanese, Alberto Abelló, Mariano Rodriguez-Muro. Discovering functional dependencies for multidimensional design. In 12th International Workshop on Data Warehousing and OLAP (DOLAP 2009). Hong Kong (China), November 2009. ACM Press, 2009. Pages 1-8. ISBN: 978-1-60558-801-8
    Nowadays, it is widely accepted that the data warehouse design task should be largely automated. Furthermore, the data warehouse conceptual schema must be structured according to the multidimensional model and, as a consequence, the most common way to automatically look for subjects and dimensions of analysis is by discovering functional dependencies (as dimensions functionally depend on the fact) over the data sources. Most advanced methods for automating the design of the data warehouse carry out this process from relational OLTP systems, assuming that an RDBMS is the most common kind of data source we may find, and taking a relational schema as the starting point. In contrast, in our approach we propose to rely instead on a conceptual representation of the domain of interest formalized through a domain ontology expressed in the DL-Lite Description Logic. In our approach, we propose an algorithm to discover functional dependencies from the domain ontology that exploits the inference capabilities of DL-Lite, thus fully taking into account the semantics of the domain. We also provide an evaluation of our approach in a real-world scenario.
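    A toy sketch of the underlying intuition (our own simplification; the paper works on a DL-Lite ontology and exploits its inference services, which are not reproduced here): starting from a concept chosen as the fact, functional (to-one) roles are followed transitively, since each such chain behaves like a functional dependency and is therefore a candidate dimension hierarchy. The ontology below is invented.

      def dimension_paths(ontology, fact):
          # ontology: concept -> list of (role, target concept, is_functional)
          paths, stack = [], [(fact, [])]
          while stack:
              concept, path = stack.pop()
              extended = False
              for role, target, functional in ontology.get(concept, []):
                  if functional and target not in [c for _, c in path]:
                      stack.append((target, path + [(role, target)]))
                      extended = True
              if path and not extended:
                  paths.append(path)
          return paths

      onto = {"Sale":  [("soldIn", "Shop", True), ("soldOn", "Day", True)],
              "Shop":  [("locatedIn", "City", True)],
              "Day":   [("inMonth", "Month", True)],
              "Month": [("inYear", "Year", True)]}
      for p in dimension_paths(onto, "Sale"):
          print(" -> ".join(f"{role}:{level}" for role, level in p))
      # soldOn:Day -> inMonth:Month -> inYear:Year
      # soldIn:Shop -> locatedIn:City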
  • A. Abelló, X. Burgués, M. J. Casany, C. Martín, C. Quer, T. Urpí, M. E. Rodríguez. LEARN-SQL: Herramienta de gestión de ejercicios de SQL con autocorrección. In XV Jornadas de Enseñanza Universitaria de la Informática (JENUI), Barcelona (Spain), 2009. Pages 353-360. ISBN: 978-84-692-2758-9

    Some automatic-correction tools already exist in the field of computer science teaching. However, in database courses the problem is especially complex due to the great variety of exercise types (existing systems are limited to queries) and to the fact that these exercises do not have a single solution. Our system aims to automatically correct any kind of SQL statement (queries, updates, procedures, triggers, index creation, etc.) and to discern whether the answer provided by the student is correct or not, regardless of the specific solution he or she proposes. In this communication we specifically present the module in charge of exercise management and all the exercise typologies we are currently using.
  • Oscar Romero, Alberto Abelló. A Survey of Multidimensional Modeling Methodologies. In International Journal on Data Warehousing and Mining (IJDWM), volume 5, number 2. Idea Group, 2009. Pages 1-23. ISSN: 1548-3924

    Many methodologies have been presented to support the multidimensional design of the data warehouse. The first methodologies introduced were requirement-driven, but the semantics of a data warehouse also require considering the data sources during the design process. In the following years, data sources gained relevance in multidimensional modeling and gave rise to several data-driven methodologies that automate the data warehouse design process from relational sources. Currently, research on multidimensional modeling is still a hot topic and there are two main research lines. On the one hand, new hybrid automatic methodologies have been introduced that propose combining data-driven and requirement-driven approaches. On the other hand, new approaches focus on considering other kinds of structured data sources that have gained relevance in the last years, such as ontologies or XML. In this article we present the most relevant methodologies introduced in the literature and a detailed comparison showing the main features of each approach.
2008
  • Il-Yeol Song and Alberto Abelló. Foreword. In 11th International Workshop on Data Warehousing and OLAP (DOLAP). Napa (USA), November 2008. ACM Press, 2008. ISBN: 978-1-60558-387-7.

  • Oscar Romero and Alberto Abelló. MDBE: Automatic Multidimensional Modeling. In 27th International Conference on Conceptual Modeling (ER). Barcelona (Spain), October 2008. LNCS 5231. Springer, 2008. Pages 534-535. ISSN: 0302-9743.

    The goal of this demonstration is to present MDBE, a tool implementing our methodology for automatically deriving multidimensional schemas from relational sources, bearing in mind the end-user requirements. Our approach starts gathering the end-user information requirements that will be mapped over the data sources as SQL queries. Based on the constraints that a query must preserve to make multidimensional sense, MDBE automatically derives multidimensional schemas which agree with both the input requirements and the data sources.
  • Alberto Abelló, M. Elena Rodríguez, Toni Urpí, Xavier Burgués, M. José Casany, Carme Martín, Carme Quer. LEARN-SQL: Automatic Assessment of SQL Based on IMS QTI Specification. Poster session in 8th International Conference on Advanced Learning Technologies (ICALT). Santander (Spain), July 2008. IEEE, 2008. Pages 592-593. ISBN: 978-0-7695-3167-0. See poster

    In this paper we present LEARN-SQL, a system conforming to the IMS QTI specification that allows on-line learning and assessment of students on SQL skills in an automatic, interactive, informative, scalable and extensible manner.
  • Xavier Burgués, Carme Quer, Alberto Abelló, M. José Casany, Carme Martín, M. Elena Rodríguez, Toni Urpí. Uso de LEARN-SQL en el aprendizaje cooperativo de Bases de Datos. In XIV Jornadas de Enseñanza Universitaria de la Informática (JENUI). Granada (Spain), July 2008. FER fotocomposición, 2008. Pages 359-366. ISBN: 978-84-612-4475-1

    This article describes the changes made in some courses of the database area along two lines: organizational and technological. In the first, the main goal has been the introduction of cooperative learning techniques. In the second, the goal has been to promote self-learning and self-assessment through the LEARN-SQL tool. So far, the changes related to the two lines have been applied to different courses. The article closes with an evaluation of the results obtained and an outline of future changes aimed at combining the two lines.
  • M. José Casany, Carme Martín, Alberto Abelló, Xavier Burgués, Carme Quer, M. Elena Rodríguez, Toni Urpí. LEARN-SQL: A blended learning tool for the database area. In V Congreso Internacional de Docencia Universitaria e Innovación (CIDUI). Lleida (Spain), July 2008. ISBN: 978-84-8458-279-3.

    The academic programs of the UPC and the UOC are adapting to the European Credit Transfer System (ECTS). One of the changes introduced in the academic programs of these universities tries to optimize the time of the activities that require the active participation of the students. The definition of these activities is a very complex task, especially when dealing with database teaching in ICT engineering degrees, because the questions usually do not have a unique solution. LEARN-SQL is the tool developed by our group that automatically evaluates the correctness of any SQL statement (queries, updates, stored procedures, triggers, etc.) independently of the student's particular solution. Furthermore, LEARN-SQL helps teachers design their tests and allows them to review the solutions provided by the students. Finally, the system provides students with valuable feedback, so that they can learn from their mistakes.
2007
  • Oscar Romero and Alberto Abelló. Automating Multidimensional Design from Ontologies. In 10th International Workshop on Data Warehousing and OLAP (DOLAP). Lisbon (Portugal), November 2007. ACM Press, 2007. Pages 1-8. ISBN: 1-59593-827-5.

    This paper presents a new approach to automate the multidimensional design of Data Warehouses. We propose a semi-automatable method aimed at finding the business multidimensional concepts from a domain ontology representing different and potentially heterogeneous data sources of our business domain. In short, our method identifies business multidimensional concepts from heterogeneous data sources that have nothing in common except that they are all described by an ontology.
  • Alberto Abelló, Toni Urpí, M. Elena Rodríguez, and Marc Estévez. Extensión de Moodle para facilitar la corrección automática de cuestionarios y su aplicación en el ámbito de las bases de datos. In MoodleMoot (Moodle). Cáceres (Spain), October 2007.

    Moodle 1.5 provides a quiz module that facilitates managing a set of questions for later use in different quizzes, which can be defined according to the needs of each course. Basically, questions can be multiple choice or short answer. In the case of short-answer questions, the mere presence of one blank space too many or too few in the student's answer (with respect to the solution previously entered by the teacher) causes it to be considered incorrect. In computer science teaching, in courses such as "programming" or "databases", the problem is especially acute, because exercise statements do not usually have a single solution. This is why we considered developing a new Moodle module allowing richer correction than a simple character-by-character comparison against the solution provided by the teacher. Thus, we have developed a new type of quiz whose questions reside in a repository external to Moodle. Each of these questions has one or more associated Web Services that are able to discern whether the student's answer is correct or not. In our case, we were interested in correcting SQL queries over a database, but with the same module connecting to a different Web Service any kind of question can be corrected, not necessarily from the database field. Basically, it only requires that the correction be objective and, consequently, that a procedure exists to carry it out automatically.
  • Oscar Romero, and Alberto Abelló. MDBE: Una herramienta Automática para el Modelado Multidimensional. Demonstration in Jornadas de Ingeniería del Software y Bases de Datos (JISBD). Zaragoza (Spain), September 2007. Thomson Editores, 2007. Pages 387-388. ISBN: 978-84-9732-595-0.

    To facilitate the multidimensional modeling process of a DW, in this work we present MDBE (Multidimensional Design By Examples): our proposal of a tool to validate multidimensional requirements provided by the end user and expressed as SQL queries over the operational data sources. MDBE decomposes the input SQL query to extract the relevant multidimensional knowledge it contains and, according to that information, derives a set of multidimensional schemas that satisfy the user's requirements (queries). That is, it automatically proposes possible multidimensional schemas.
  • Oscar Romero and Alberto Abelló. On the Need of a Reference Algebra for OLAP. In 9th International Conference on Data Warehousing and Knowledge Discovery (DaWaK). Regensburg (Germany), September, 2007. Lecture Notes in Computer Science volume 4654. Springer, 2007. Pages 99-110. ISSN: 0302-9743. ISBN: 3-540-28566-0.

    Although multidimensionality has been widely accepted as the best solution for conceptual modeling, there is no such agreement about the set of operators to handle multidimensional data. This paper presents a comparison of the existing multidimensional algebras, trying to find a common backbone, and discusses the necessity of a reference multidimensional algebra and the current state of the art.
  • Oscar Romero and Alberto Abelló. Generating Multidimensional Schemas from the Semantic Web. Poster session in 19th Conference on Advanced Information Systems Engineering (CAiSE). Trondheim (Norway), June 2007.

    In this paper, we introduce a semi-automatable method aimed at finding the business multidimensional concepts from an ontology representing the organization domain. With these premises, our approach falls into the Semantic Web research area, where ontologies play a key role in providing a common vocabulary that describes the meaning of relevant terms and the relationships among them.
2006
  • Stefano Rizzi, Alberto Abelló, Jens Lechtenbörger, and Juan Trujillo. Research in Data Warehouse Modeling and Design: Dead or Alive? In 9th International Workshop on Data Warehousing and OLAP (DOLAP). Arlington (USA), November 2006. ACM Press, 2006. Pages 3-10. ISBN: 1-59593-530-4.

    Multidimensional modeling requires specialized design techniques. Though a lot has been written about how a data warehouse should be designed, there is no consensus on a design method yet. This paper follows from a wide discussion that took place in Dagstuhl, during the Perspectives Workshop "Data Warehousing at the Crossroads", and is aimed at outlining some open issues in modeling and design of data warehouses. More precisely, issues regarding conceptual models, logical models, methods for design, interoperability, and design for new architectures and applications are considered.
  • Alberto Abelló, Roberto García, Rosa Gil, Marta Oliva, and Ferran Perdix. Semantic Data Integration in a Newspaper Content Management System. In poster session in 5th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE). Lyon (France), October, 2006. Lecture Notes in Computer Science volume 4277. Springer, 2006. Pages 41-41. ISSN: 0302-9743. ISBN: 3-540-28566-0. See poster

    A newspaper content management system has to deal with a very heterogeneous information space as the experience in the Diari Segre newspaper has shown us. The greatest problem is to harmonise the different ways the involved users (journalist, archivists) structure the newspaper information space, i.e. news, topics, headlines, etc. Our approach is based on ontology and differentiated universes of discourse (UoD). Users interact with the system and, from this interaction, integration rules are derived. These rules are based on Description Logic ontological relations for subsumption and equivalence. They relate the different UoD and produce a shared conceptualisation of the newspaper information domain.
  • Oscar Romero and Alberto Abelló. Multidimensional Design by Examples. In 8th International Conference on Data Warehousing and Knowledge Discovery (DaWaK). Krakow (Poland), September, 2006. Lecture Notes in Computer Science volume 4081. Springer, 2006. Pages 85-94. ISSN: 0302-9743, ISBN: 3-540-28566-0.

    In this paper we present a method to validate user multidimensional requirements expressed in terms of SQL queries. Furthermore, our approach automatically generates and proposes the set of multidimensional schemas satisfying the user requirements, from the organizational operational schemas. If no multidimensional schema is generated for a query, we can state that the requirement is not multidimensional.
  • Alberto Abelló, José Samos, and Fèlix Saltor. YAM²: A Multidimensional Conceptual Model Extending UML. In Information Systems 31 (6), September, 2006. Elsevier, 2006. Pages 541-567. ISSN: 0306-4379.

    This paper presents a multidimensional conceptual Object-Oriented model for Data Warehousing and OLAP tools, with its structures, integrity constraints and query operations. It has been developed as an extension of UML core metaclasses to facilitate its usage and to try to fill the absence of a standard model. Being a UML extension allows reusing modeling constructs and techniques, and integrating multidimensional modeling in more general modeling processes. Moreover, while existing multidimensional models are restricted to the modeling of isolated stars, this paper investigates the representation of several semantically related star schemas. Summarizability and identification constraints can also be represented in the model, and a closed and complete set of algebraic operations has been defined in terms of functions (so that mathematical properties of functions can be smoothly applied).
  • Adriana Marotta, Federico Piedrabuena, and Alberto Abelló. Managing Quality Properties in a ROLAP Environment. In 18th Conference on Advanced Information Systems Engineering (CAiSE). Luxemburg, June 2006. Lecture Notes in Computer Science volume 4001. Springer, 2006. Pages 127-141. ISSN: 0302-9743, ISBN: 3-540-28566-0.

    In this work we propose, for an environment where multidimensional queries are made over multiple Data Marts, techniques for providing the user with quality information about the retrieved data. This meta-information behaves as an added value over the obtained information or as an additional element to take into account during the proposition of the queries. The quality properties considered are freshness, availability and accuracy. We provide a set of formulas that allow estimating or calculating the values of these properties, for the result of any multidimensional operation of a predefined basic set.
  • Oscar Romero and Alberto Abelló. On the Mismatch Between Multidimensionality and SQL. Technical Report LSI-06-32-R. Dept Llenguatges i Sistemes Informàtics (Universitat Politècnica de Catalunya), June 2006.

    ROLAP tools are intended to ease information analysis and navigation through the whole Data Warehouse. These tools automatically generate a query according to the multidimensional operations performed by the end-user, using relational database technology to implement multidimensionality and, consequently, automatically translating multidimensional operations to SQL. In this paper, we consider this automatic translation process in detail and, to do so, we present an exhaustive comparison (both theoretical and practical) between the multidimensional algebra and the relational one. Firstly, we discuss the necessity of a multidimensional algebra with regard to the relational one and later, we thoroughly study the considerations to be made to guarantee the correctness of a cube-query (an SQL query making multidimensional sense). With this aim, we analyze the expressiveness of the multidimensional algebra with regard to SQL, pointing out the features a query must satisfy to make multidimensional sense, and we also focus on those problems that can arise in a cube-query due to SQL intrinsic restrictions. The SQL translation of an isolated operation does not represent a problem, but when mixing up the modifications brought about by a set of operations in a single cube-query, some conflicts derived from SQL could emerge depending on the operations involved. Therefore, if these problems are not detected and treated appropriately, the automatic translation can retrieve unexpected results.
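    As a small illustration of where such conflicts come from (an invented example, not the report's formalization), the sketch below composes a selection on a dimension level and a selection over an aggregated measure into one cube-query: the former must end up in the WHERE clause and the latter in the HAVING clause, and confusing the two placements when several operations are collapsed into a single statement is one typical way the translation can return unexpected results.

      def cube_query(fact, measure, group_by, dim_joins, where=None, having=None):
          sql = [f"SELECT {', '.join(group_by)}, SUM({measure}) AS total",
                 f"FROM {fact} " + " ".join(f"JOIN {t} ON {c}" for t, c in dim_joins)]
          if where:
              sql.append("WHERE " + " AND ".join(where))     # predicates on dimension levels
          sql.append("GROUP BY " + ", ".join(group_by))
          if having:
              sql.append("HAVING " + " AND ".join(having))   # predicates on aggregated measures
          return "\n".join(sql)

      print(cube_query("sales s", "s.amount",
                       group_by=["t.month"],
                       dim_joins=[("time t", "s.day = t.day"), ("shop p", "s.shop = p.id")],
                       where=["p.city = 'Barcelona'"],
                       having=["SUM(s.amount) > 1000"]))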
  • Alberto Abelló, and Fernando Carpani. Using OWL to integrate relational Schemas. Technical Report LSI-06-10-R. Dept Llenguatges i Sistemes Informàtics (Universitat Politècnica de Catalunya), March 2006.

    Ontologies offer two contributions to the Semantic Web. On the one hand, they reflect a vocabulary consensus within a community. On the other hand, they provide reasoning capabilities. In this paper we present a completely automatic translation from relational schemas to OWL, so that inference mechanisms can be used to integrate different schemas by dealing with structural heterogeneities. The output of the translation algorithm, which makes the functional dependencies of the relational schema explicit, belongs to OWL Full.
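    A minimal sketch of the general flavour of such a translation, with our own simplified conventions (the report's algorithm additionally makes functional dependencies explicit and targets OWL Full): each table becomes a class, each plain column a datatype property, and each foreign key an object property between the two corresponding classes. The example schema is invented.

      def schema_to_owl(schema, ns="http://example.org/schema#"):
          turtle = [f"@prefix : <{ns}> .",
                    "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
                    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> ."]
          for table, info in schema.items():
              turtle.append(f":{table} a owl:Class .")
              for col in info.get("columns", []):
                  turtle.append(f":{table}_{col} a owl:DatatypeProperty ; rdfs:domain :{table} .")
              for col, target in info.get("fks", {}).items():
                  turtle.append(f":{table}_{col} a owl:ObjectProperty ; "
                                f"rdfs:domain :{table} ; rdfs:range :{target} .")
          return "\n".join(turtle)

      schema = {"employee":   {"columns": ["name", "salary"], "fks": {"dept": "department"}},
                "department": {"columns": ["name"]}}
      print(schema_to_owl(schema))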
2005
  • Oscar Romero, and Alberto Abelló. Improving automatic SQL translation for ROLAP tools. In Proceedings of Jornadas de Ingeniería del Software y Bases de Datos (JISBD). Granada (Spain), September 2005. Thomson Editores, 2005. Pages 123-130. ISBN: 84-9732-434-X

    In the last years, despite a vast amount of work have been devoted to modeling multidimensionality, multidimensional algebra translation to SQL have been overlooked. ROLAP tools automatically generate a cubequery according to the operations performed by the user. The SQL translation does not represent a problem when treating isolated operations but when mixing up together modifications brought about by a set of operations in the same cube-query, some conflicts could emerge depending on the operations involved. Therefore, if these problems are not detected and treated appropriately, the automatic translation can retrieve unexpected results. In this paper, we define and classify conflicts raised when automatically translating a multidimensional algebra to SQL, and analyze how to solve or minimize their impact.
  • Alberto Abelló, Xavi de Palol, and Mohand-Saïd Hacid. On the Midpoint of a Set of XML Documents. In 16th International Conference on Database and Expert Systems Applications (DEXA). Copenhagen (Denmark), August 2005. Lecture Notes in Computer Science volume 3588. Springer, 2005. Pages 441-450. ISSN: 0302-9743, ISBN: 3-540-28566-0

    The WWW contains a huge amount of documents. Some of them share the same subject, but are generated by different people or even organizations. To guarantee the interchange of such documents, we can use XML, which allows sharing documents that do not have the same structure. However, this makes it difficult to understand the core of such heterogeneous documents (in general, a schema is not available). In this paper, we offer a characterization and an algorithm to obtain the midpoint (in terms of a resemblance function) of a set of semi-structured, heterogeneous documents without optional elements. The trivial case of midpoint would be the elements common to all documents. Nevertheless, in cases with several heterogeneous documents this may result in an empty set. Thus, we consider that those elements present in a given amount of documents belong to the midpoint. An exact schema could always be found by generating optional elements. However, the exact schema of the whole set may result in overspecialization (lots of optional elements), which would make it useless.
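    A minimal sketch of the thresholded notion of midpoint (a simplification of the paper's resemblance-based characterization): count the element paths occurring across the documents and keep those present in at least a given fraction of them. The sample documents are invented.

      import xml.etree.ElementTree as ET
      from collections import Counter

      def element_paths(xml_text):
          def walk(node, prefix):
              path = f"{prefix}/{node.tag}"
              yield path
              for child in node:
                  yield from walk(child, path)
          return set(walk(ET.fromstring(xml_text), ""))

      def midpoint(documents, threshold=0.5):
          counts = Counter(p for doc in documents for p in element_paths(doc))
          needed = threshold * len(documents)
          return sorted(p for p, c in counts.items() if c >= needed)

      docs = ["<book><title/><author/><year/></book>",
              "<book><title/><author/></book>",
              "<book><title/><isbn/></book>"]
      print(midpoint(docs, threshold=2/3))
      # ['/book', '/book/author', '/book/title']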
  • Alberto Abelló, Xavi de Palol, and Mohand-Saïd Hacid. Approximating the DTD of a set of XML documents. Technical Report LSI-05-7-R. Dept Llenguatges i Sistemes Informàtics (Universitat Politècnica de Catalunya), March 2005.

    Extended/preliminary version of the previous paper: "On the Midpoint of a Set of XML Documents".
2003
  • Alberto Abelló, and Carme Martín. The Data Warehouse: A Temporal Database. In Proceedings of Jornadas de Ingeniería del Software y Bases de Datos (JISBD). Alacant (Spain), November 2003. Campobell S.L., 2003. Pages 675-684. ISBN: 84-688-3836-5

    The aim of this paper is to bring together two research areas, i.e. "Data Warehouses" and "Temporal Databases", both involving the representation of time. In order to achieve this goal, data warehouse and temporal database research results have been surveyed. Looking at temporal aspects within a data warehouse, more similarities than differences between temporal databases and data warehouses have been found. The first closeness between these areas consists in the possibility of a data warehouse redefinition in terms of a bitemporal database. Another relation is the use of temporal languages in data warehousing. Moreover, the correspondence between advances in temporal evolution and storage, and data warehouses is presented. Finally, Object-Oriented temporal data models contribute to add the integration and subject-orientation that is required by a data warehouse. Therefore, this paper is focused on how contributions of temporal database research could benefit data warehouses.
  • Alberto Abelló, José Samos, and Fèlix Saltor. Implementing Operations to Navigate Semantic Star Schemas. In 6th International Workshop on Data Warehousing and OLAP (DOLAP). New Orleans (USA), November 2003. ACM Press, 2003. Pages 56-62. ISBN: 1-58113-727-3

    In the last years, a lot of work has been devoted to multidimensional modeling, star shape schemas and OLAP operations. However, drill-across has not captured as much attention as other operations. This operation allows changing the subject of analysis while keeping the same analysis space we were using to analyze another subject. It is assumed that this can be done if both subjects share exactly the same analysis dimensions. In this paper, besides the implementation of an algebraic set of operations on an RDBMS, we show when and how we can change the subject of analysis in the presence of semantic relationships, even if the analysis dimensions do not exactly coincide.
  • Carme Martín, and Alberto Abelló. A Temporal Study of Data Sources to Load a Corporate Data Warehouse. In 5th International Conference on Data Warehousing and Knowledge Discovery (DaWaK). Prague (Czech Republic), September 2003. Lecture Notes in Computer Science volume 2737. Springer, 2003. Pages 109-118. ISSN: 0302-9743. ISBN: 3-540-40807-X

    The input data of the corporate data warehouse is provided by the data sources, which are integrated. In the temporal database research area, a bitemporal database is a database supporting valid time and transaction time. Valid time is the time when the fact is true in the modeled reality, while transaction time is the time when the fact is stored in the database. Defining a data warehouse as a bitemporal database containing integrated and subject-oriented data in support of the decision making process, transaction time in the data warehouse can always be obtained, because it is internal to a given storage system. When an event is loaded into the data warehouse, its valid time is transformed into a bitemporal element by adding transaction time, generated by the database management system of the data warehouse. However, depending on whether the data sources manage transaction time and valid time or not, we may or may not be able to obtain the valid time for the data warehouse. The aim of this paper is to present a temporal study of the different kinds of data sources to load a corporate data warehouse, using a bitemporal storage structure.
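    A minimal sketch of the bitemporal element built at load time, with an invented record structure: the valid-time interval (when the fact holds in the modeled reality) comes from the source when it records it, whereas the transaction time is always obtainable because the warehouse generates it itself.

      from dataclasses import dataclass
      from datetime import date, datetime
      from typing import Optional

      @dataclass
      class BitemporalFact:
          payload: dict
          valid_from: date
          valid_to: Optional[date]      # None = still valid in the modeled reality
          tx_time: datetime             # stamped by the data warehouse at load time

      def load(event, valid_from, valid_to=None):
          # Transaction time is internal to the warehouse storage system.
          return BitemporalFact(event, valid_from, valid_to, datetime.now())

      print(load({"customer": 42, "segment": "premium"}, date(2003, 1, 1)))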
  • Alberto Abelló, Elena Rodríguez, Fèlix Saltor, Marta Oliva, Cecilia Delgado, Eladio Garví and José Samos. On Operations to Conform Object-Oriented Schemas. In International Conference on Enterprise Information Systems (ICEIS). Angers (France), April 2003. Selected among the best papers of the conference to be published in "Enterprise Information Systems V", Kluwer Academic Publishers, 2004. Pages 49-56. ISBN: 1-4020-1726-X

    To build a Cooperative Information System from several preexisting, heterogeneous systems, the schemas of these systems must be integrated. Operations used for this purpose include conforming operations, which change the form of a schema. In this paper we present a systematic approach to establish which conforming operations for Object-Oriented schemas are needed, and which of them can be considered as primitive, all others being derivable from these. We organize these operations in matrixes according to the Object-Oriented dimensions -Generalization/Specialization, Aggregation/Decomposition- on which they operate.
  • Alberto Abelló, and Carme Martín. A Bitemporal Storage Structure for a Corporate Data Warehouse. Short paper in International Conference on Enterprise Information Systems (ICEIS). Angers (France), April 2003.

    This paper brings together two research areas, i.e. "Data Warehouses" and "Temporal Databases", both involving the representation of time. Looking at temporal aspects within a data warehouse, more similarities than differences between temporal databases and data warehouses have been found. The first closeness between these areas consists in the possibility of redefining a data warehouse in terms of a bitemporal database. A bitemporal storage mechanism is proposed in this paper. In order to meet this goal, a temporal study of data sources is developed. Moreover, we will show how Object-Oriented temporal data models contribute to add the integration and subject-orientation that is required by a data warehouse.
2002
  • Alberto Abelló, Francisco Araque, Cecilia Delgado, Eladio Garví, Marta Oliva, Elena Rodríguez, Emilia Ruíz, Fèlix Saltor, José Samos, and Manolo Torres. Operaciones para Conformar Esquemas Orientados a Objetos. In Taller sobre Integración Semántica de Fuentes de Datos Distribuidas y Heterogéneas de las Jornadas de Ingeniería del Software y Bases de Datos (JISBD2002). El Escorial (Spain), November 2002. (In Spanish)
  • Alberto Abelló, José Samos, and Fèlix Saltor. On Relationships Offering New Drill-across Possibilities. In 5th International Workshop on Data Warehousing and OLAP (DOLAP). McLean (USA), November 2002. ACM Press, 2002. Pages 7-13. ISBN: 1-58113-590-4

    OLAP tools divide concepts based on whether they are used as analysis dimensions or as the fact subject of analysis, which gives rise to star-shaped schemas. Operations are always provided to navigate inside such star schemas. However, the navigation among different stars is usually overlooked. This paper studies different kinds of Object-Oriented conceptual relationships (part of the UML standard) between stars, namely Derivation, Generalization, Association, and Flow, that allow drilling across them.
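    A toy illustration (hypothetical data and function names, not the paper's formalism) of drilling across two stars that share the same analysis dimensions: each cube maps a dimension coordinate to a measure, and drill-across pairs the measures of both subjects over the common coordinates.

        # Two cubes (subjects of analysis) over the same dimensions: (month, city).
        sales = {("2002-10", "McLean"): 120.0, ("2002-11", "McLean"): 95.0}
        costs = {("2002-10", "McLean"): 80.0, ("2002-11", "McLean"): 70.0}

        def drill_across(cube_a, cube_b):
            # Keep only the coordinates present in both cubes and pair their measures.
            shared = cube_a.keys() & cube_b.keys()
            return {coord: (cube_a[coord], cube_b[coord]) for coord in shared}

        print(drill_across(sales, costs))
        # e.g. {('2002-10', 'McLean'): (120.0, 80.0), ('2002-11', 'McLean'): (95.0, 70.0)}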
  • Carme Martín, and Alberto Abelló. The Data Warehouse: A Temporal Database. Technical Report LSI-02-66-R. Dept Llenguatges i Sistemes Informàtics (Universitat Politècnica de Catalunya), November 2002.

    Extended version of the paper of the same name published in 2003.
  • Alberto Abelló, José Samos, and Fèlix Saltor. YAM² (Yet Another Multidimensional Model): An extension of UML. In International Database Engineering & Applications Symposium (IDEAS). Edmonton (Canada), July 2002. Mario A. Nascimento, M. Tamer Özsu, Osmar Zaïane Editors. IEEE Computer Society Press, 2002. Pages 172-181. ISBN: 0-7695-1638-6. ISSN: 1098-8086

    This paper presents a multidimensional conceptual Object-Oriented model, its structures, integrity constraints and query operations. It has been developed as an extension of UML core metaclasses to facilitate its usage, as well as to avoid the introduction of completely new concepts. YAM² allows the representation of several semantically related star schemas, as well as summarizability and identification constraints.
  • Alberto Abelló. YAM²: A Multidimensional Conceptual Model. PhD Thesis, Universitat Politècnica de Catalunya. Barcelona, April 2002.

    This thesis proposes YAM², a multidimensional conceptual model for OLAP (On-Line Analytical Processing). It is defined as an extension of UML (Unified Modeling Language). The aim is to benefit from Object-Oriented concepts and relationships to allow the definition of semantically rich multi-star schemas. Thus, the usage of Generalization, Association, Derivation, and Flow relationships (in UML terminology) is studied.

    An architecture based on different levels of schemas is proposed and the characteristics of its different levels are defined. The benefits of this architecture are twofold. Firstly, it relates Federated Information Systems with Data Warehousing, so that advances in one area can also be used in the other. Moreover, the Data Mart schemas are defined so that they can be implemented on different Database Management Systems, while still offering a common integrated vision that allows navigating through the different stars.

    The main concepts of any multidimensional model are facts and dimensions. Both are analyzed separately, based on the assumption that relationships between aggregation levels are part-whole (or composition) relationships. Thus, mereology axioms are used in that analysis to prove some properties.

    Besides structures, operations and integrity constraints are also defined for YAM². Since a data cube is defined in this thesis as a function, the operations (i.e., Drill-across, ChangeBase, Roll-up, Projection, and Selection) are defined over functions. The set of integrity constraints reflects the importance of the summarizability (or aggregability) of measures and pays special attention to it.
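    Since the abstract above defines a data cube as a function, a small sketch (hypothetical, not the thesis formalism) can illustrate how operations such as Selection and Roll-up become operations over that function, here modeled as a mapping from dimension coordinates to measures.

        from collections import defaultdict

        # Cube over the dimensions (year, quarter, city).
        cube = {("2002", "Q1", "Barcelona"): 10,
                ("2002", "Q2", "Barcelona"): 15,
                ("2002", "Q1", "Granada"): 7}

        def selection(cube, predicate):
            # Keep only the cells whose coordinates satisfy the predicate.
            return {coord: v for coord, v in cube.items() if predicate(coord)}

        def roll_up(cube, position, aggregate=sum):
            # Drop one dimension position and aggregate the resulting groups
            # (this assumes the measure is summarizable, cf. the constraints above).
            groups = defaultdict(list)
            for coord, v in cube.items():
                groups[coord[:position] + coord[position + 1:]].append(v)
            return {key: aggregate(values) for key, values in groups.items()}

        print(roll_up(selection(cube, lambda c: c[0] == "2002"), position=1))
        # {('2002', 'Barcelona'): 25, ('2002', 'Granada'): 7}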
2001
  • Alberto Abelló, Francisco Araque, José Samos, and Fèlix Saltor. Bases de Datos Federadas, Almacenes de Datos y Análisis Multidimensional. In Taller de Almacenes de Datos y Tecnologia OLAP de las Jornadas de Ingeniería del Software y Bases de Datos (JISBD2001). Almagro (Spain), November 2001. (In Spanish)
  • Alberto Abelló, José Samos, and Fèlix Saltor. Understanding Facts in a Multidimensional Object-Oriented Model. In 4th International Workshop on Data Warehousing and OLAP (DOLAP 2001). Atlanta (USA), November 2001. Pages 32-39. ACM Press, 2001. ISBN 1-58113-437-1.

    "On-Line Analytical Processing" tools are used to extract information from the "Data Warehouse" in order to help in the decision making process. These tools are based on multidimensional concepts, i.e. facts and dimensions. In this paper we study the meaning of facts, and the dependencies in multidimensional data. This study is used to find relationships between cubes (in an Object-Oriented framework) and explain navigation operations.
  • Alberto Abelló, José Samos, and Fèlix Saltor. Multi-star Conceptual Schemas for OLAP Systems. Technical Report LSI-01-45-R. Dept Llenguatges i Sistemes Informàtics (Universitat Politècnica de Catalunya), October 2001.

    Extended version of the paper published in 2002: "On Relationships Offering New Drill-across Possibilities".
  • Alberto Abelló, José Samos, and Fèlix Saltor. YAM2 (Yet Another Multidimensional Model): An extension of UML. Technical Report LSI-01-43-R. Dept Llenguatges i Sistemes Informàtics (Universitat Politècnica de Catalunya), October 2001.

    Extended version of the paper of the same name published in 2002.
  • Elena Rodríguez, Alberto Abelló, Marta Oliva, Fèlix Saltor, Cecilia Delgado, Eladio Garví and José Samos. On Operations along the Generalization/Specialization Dimension. In International Workshop on Engineering Federated Information Systems (EFIS). Berlin (Germany), October 2001. Pages 70-83. ISBN: 3-89838-027-0

    The need to derive a database schema from one or more existing schemas arises in Federated Database Systems as well as in other contexts. Operations used for this purpose include conforming operations, which change the form of a schema. In this paper we present a systematic approach to establish a set of primitive conforming operations that operate along the Generalization/Specialization dimension in the context of Object-Oriented schemas.
  • Alberto Abelló, José Samos, and Fèlix Saltor. A Framework for the Classification and Description of Multidimensional Data Models. In 12th International Conference on Database and Expert Systems Applications (DEXA). Munich (Germany), September 2001. Lecture Notes in Computer Science volume 2113. Springer, 2001. Pages 668-677. ISSN: 0302-9743, ISBN: 3-540-42527-6

    The words On-Line Analytical Processing bring together a set of tools that use multidimensional modeling in the management of information to improve the decision-making process. Lately, a lot of work has been devoted to modeling the multidimensional space. The aim of this paper is twofold. On the one hand, it compiles and classifies some of that work with regard to the design phase in which it is used. On the other hand, it allows comparing the different terminology used by each author by placing all the terms in a common framework.
  • Alberto Abelló, José Samos, and Fèlix Saltor. Understanding Analysis Dimensions in a Multidimensional Object-Oriented Model. In 3rd International Workshop on Design and Management of Data Warehouses (DMDW). Interlaken (Switzerland), June 2001. SwissLife, 2001. ISSN: 1424-4691

    OLAP defines a set of data warehousing query tools characterized by providing a multidimensional view of data. Information can be shown at different aggregation levels (often called granularities) for each dimension. In this paper, we try to outline the benefits of understanding the relationships between those aggregation levels as Part-Whole relationships, and how this helps address some semantic problems. Moreover, we propose the usage of other Object-Oriented constructs to keep as much semantics as possible in analysis dimensions.
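    A brief hypothetical sketch of the idea above: if each aggregation level is related to the next by a Part-Whole relationship (e.g., each city is part of a country), rolling a measure up the hierarchy amounts to aggregating over all the parts of each whole.

        # Part-Whole relationship between two aggregation levels of a dimension.
        part_of = {"Barcelona": "Spain", "Granada": "Spain", "Interlaken": "Switzerland"}
        sales_by_city = {"Barcelona": 10, "Granada": 7, "Interlaken": 4}

        def roll_up_level(measures, part_of, aggregate=sum):
            # Group each part's measure under its whole and aggregate per whole.
            groups = {}
            for part, value in measures.items():
                groups.setdefault(part_of[part], []).append(value)
            return {whole: aggregate(values) for whole, values in groups.items()}

        print(roll_up_level(sales_by_city, part_of))
        # {'Spain': 17, 'Switzerland': 4}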
2000
  • Alberto Abelló, José Samos, and Fèlix Saltor. A Data Warehouse Multidimensional Data Models Classification. Technical Report LSI-2000-6. Dept. Lenguajes y Sistemas Informáticos (Universidad de Granada), December 2000.

    The words On-Line Analytical Processing (OLAP) bring together a set of tools that use multidimensional modeling in the extraction of information from the Data Warehouse. Lately, a lot of work has been devoted to modeling the multidimensional space. The aim of this paper is twofold. On the one hand, it compiles and classifies most of that work. On the other hand, it allows comparing the different terminology used by each author by placing all the terms in a common framework.
  • Elena Rodríguez, Alberto Abelló, and Marta Oliva. Resumen del Simposium en Objetos y Bases de Datos del ECOOP'2000. In Taller de Bases de Datos Orientadas a Objetos dentro de las Jornadas de Ingeniería del Software y Bases de Datos (JISBD2000). Valladolid (Spain), November 2000. (In Spanish)

  • Alberto Abelló, and Elena Rodríguez. Describing BLOOM99 with regard to UML Semantics. In Proceedings of Jornadas de Ingeniería del Software y Bases de Datos (JISBD). Valladolid (Spain), November 2000. Gráficas Andrés Martín S.L., 2000. Pages 307-319. ISBN: 84-8448-065-8

    In this paper, we describe the BLOOM metaclasses with regard to the Unified Modeling Language (UML) semantics. We concentrate essentially on the Generalization/Specialization and Aggregation/Decomposition dimensions, because they are used to guide the integration process BLOOM was intended for. Here we focus on the conceptual data modeling constructs that UML offers. Although UML provides many more abstractions than BLOOM, we show that BLOOM still has some abstractions that UML does not. For some of these abstractions, we sketch how UML can be extended to deal with the semantics that BLOOM adds.
  • Fèlix Saltor, Marta Oliva, Alberto Abelló, and José Samos. Building Secure Data Warehouse Schemas from Federated Information Systems. In International CODATA Conference on Data and Information for the Coming Knowledge Millennium (CODATA), Baveno (Italy), October 2000 (Extended abstract). "Heterogeneous Information Exchange and Organizational Hubs", Bestougeff, Dubois and Thuraisingham Editors. Kluwer Academic Publishers, 2002. Pages 123-134. ISBN: 1-4020-0649-7

    There are similarities between architectures for Federated Information Systems and architectures for Data Warehousing. In the context of an integrated architecture for both Federated Information Systems and Data Warehousing, we discuss how additional schema levels provide security, and operations to convert from one level to the next.
  • Alberto Abelló, José Samos, and Fèlix Saltor. Benefits of an Object-Oriented Multidimensional Data Model. In Objects and Databases - International Symposium - in 14th European Conference on Object-Oriented Programming (ECOOP). Sophia Antipolis and Cannes (France), June 2000. Lecture Notes in Computer Science volume 1944. Springer, 2000. Pages 141-152. ISSN: 0302-9743. ISBN: 3-540-41664-1

    In this paper, we try to outline the benefits of using an O-O model in designing multidimensional Data Marts. We argue that multidimensional modeling is lacking in semantics, which can be obtained by using the O-O paradigm. Some benefits that could be obtained by doing this are classified into six O-O dimensions (i.e., Classification/Instantiation, Generalization/Specialization, Aggregation/Decomposition, Caller/Called, Derivability, and Dynamicity), and exemplified with specific cases.
  • Alberto Abelló, Marta Oliva, José Samos, and Fèlix Saltor. Information System Architecture for Data Warehousing from a Federation. In Proc. of the Int. Workshop on Engineering Federated Information Systems (EFIS). Dublin (Ireland), June 2000. IOS Press, 2000. Pages 33-40. ISBN: 1-58603-075-2

    This paper is devoted to Data Warehousing architecture and its data schemas. We relate a federated databases architecture to Data Warehouse schemas, which allows us to provide a better understanding of the characteristics of every schema, as well as of the way they should be defined. Because of the confidentiality of the data used to make decisions, and the federated architecture used, we also pay attention to data protection.
  • Alberto Abelló, Marta Oliva, José Samos, and Fèlix Saltor. Information System Architecture for Secure Data Warehousing. Technical Report LSI-00-26-R. Dept Llenguatges i Sistemes Informàtics (Universitat Politècnica de Catalunya), April 2000.

    Extended version of the previous paper: "Information System Architecture for Data Warehousing from a Federation".
1999
  • José Samos, Alberto Abelló, Marta Oliva, Elena Rodríguez, Fèlix Saltor, Jaume Sistac, Francisco Araque, Cecilia Delgado, Eladio Garví and Emilia Ruíz. Sistema Cooperativo para la Integración de Fuentes Heterogéneas de Información y Almacenes de Datos. In Novatica, 142 (Nov-Dec 1999). Asociación de Técnicos de Informática (ATI), 1999. Pages 44-49. (In Spanish). ISSN: 0211-2124

    This work presents our proposal for building a prototype of a cooperative system for the integration of heterogeneous information sources and data warehouses, on which our research is currently focused. The general goal is to provide a software layer that enables cooperation among several information sources interconnected through a network of communication lines. Each source has its own services for answering the queries its users pose over its data and, additionally, we want to offer certain users the ability to access the whole set of data in a uniform way (integrated access), either in real time or through data warehouses.
  • Alberto Abelló, Marta Oliva, Elena Rodríguez, and Fèlix Saltor. The syntax of BLOOM99 schemas. Technical Report LSI-99-34-R. Dept Llenguatges i Sistemes Informàtics (Universitat Politècnica de Catalunya), July 1999.

    The BLOOM (BarceLona Object Oriented Model) data model was developed to be the Canonical Data Model (CDM) of a Federated Database Management System prototype. Its design provides the features that a data model should have to be suitable as a CDM. The initial version of the model (BLOOM91) has evolved into the present version, BLOOM99.

    This report specifies the syntax of the schema definition language of BLOOM99. In our model, a schema is a set of classes, related through two dimensions: the generalization/specialization dimension, and the aggregation/decomposition dimension. BLOOM supports several features in each of these dimensions, through their corresponding metaclasses.

    Even though users are expected to define and modify schemas interactively, using a Graphical User Interface, a linear schema definition language is clearly needed. Syntax diagrams are used in this report to specify the language; an alternative using grammar productions appears as Appendix A. A possible graphical notation is given in Appendix B.

    A comprehensive running example illustrates the model, the language and its syntax, and the graphical notation.
  • Alberto Abelló, Marta Oliva, Elena Rodríguez, and Fèlix Saltor. The BLOOM model revisited: An evolution proposal (poster session). In Workshop Reader of the 13th European Conference on Object-Oriented Programming (ECOOP). Lisbon (Portugal), June 1999. Lecture Notes in Computer Science, Vol. 1743. Springer, 2000. Pages 376-378. ISBN: 3-540-66954-X

    Once the desirable characteristics of a suitable CDM had been argued, the BLOOM model (BarceLona Object Oriented Model) was progressively defined. It is an extension of an object-oriented model with a semantically rich set of abstractions. BLOOM was not developed as a whole but underwent extensions in different phases. Its abstractions were conceived for building the FDBS on an as-needed basis. This led to a lack of unity and to differences in nomenclature.

    The need to revise the BLOOM model emerged during the design of the directory of the FDBS. Such a storage system is essential because of the amount of information needed to build and operate an FDBS. The directory is the core of our FDBS architecture and it must contain the different schema levels as well as the mappings among them. Therefore, the model had to be fixed in order to store those schemas and mappings in a structured manner.
  • Alberto Abelló. CORBA: A middleware for an heterogeneous cooperative system. Technical Report LSI-99-21-R. Dept Llenguatges i Sistemes Informàtics (Universitat Politècnica de Catalunya), May 1999.

    Two kinds of heterogeneities interfere with the integration of different information sources: those in systems and those in semantics. They generate different problems and require different solutions. This paper tries to separate them by proposing the usage of a distinct tool for each one (i.e., CORBA and BLOOM, respectively), and by analyzing how they could collaborate. CORBA offers many ways to deal with distributed objects and their potential needs, while BLOOM takes care of the semantic heterogeneities. Therefore, it seems promising to handle the system heterogeneities by wrapping the components of the BLOOM execution architecture into CORBA objects.
  • Alberto Abelló, and Fèlix Saltor. Implementation of the BLOOM data model on ObjectStore. Technical Report LSI-99-7-T. Dept Llenguatges i Sistemes Informàtics (Universitat Politècnica de Catalunya), May 1999.

    BLOOM is a semantically enriched object-oriented data model. It offers extra semantic abstractions to better represent the real world. Those abstractions are not implemented in any commercial product. This paper explains how all of them could be simulated with a software layer on top of an object-oriented database management system. Concretely, this approach proved to work on ObjectStore.
1998

"A celebrity is a person who works hard all his life to become known, then wears dark glasses to avoid being recognized."

Copyright © 1997, Alberto Abelló Gamazo
Dept. Enginyeria de Serveis i Sistemes d'Informació.
Universitat Politècnica de Catalunya.
All rights reserved.
Revised: May 21st, 2020
URL: http://www.essi.upc.edu/~aabello/publications/home.html
Please send comments and suggestions to: aabello [at] essi.upc.edu