DTIM | UPC

Advanced Data Management

Description

This research line focuses on data management for non-traditional data formats. In our group we focus on three main topics: graph data warehouses, flexible energy data management, and information extraction from spreadsheets.

Research line: Graph Data Warehouses

Graphs are widely used to represent domains with complex structural properties. Applications include emerging topics such as social network analysis, ontology management and bioinformatics. The greater expressive power of graphs reveals valuable insights into both the data and its structural representation. However, graph data modeling, querying and processing become more complex.

Graph analysis is performed by traversing the network structure. Queries such as k-neighborhood or pattern matching are not straightforward to express in traditional query languages such as SQL. The analysis relies on arbitrary traversals of the graph structure, which cannot be performed efficiently using block reads. As a result, traditional data management approaches cannot naturally handle graph data efficiently. This calls for new database models, query languages and processing frameworks designed specifically for graph-structured data.
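To make the k-neighborhood query mentioned above concrete, the following is a minimal sketch of it as a breadth-first traversal over an in-memory adjacency list. The function name `k_neighborhood` and the toy `friends` graph are illustrative assumptions, not part of any system described on this page.

```python
from collections import deque


def k_neighborhood(adjacency, source, k):
    """Return the nodes reachable from `source` in at most `k` hops.

    `adjacency` maps each node to an iterable of its neighbors
    (hypothetical input format chosen for this sketch).
    """
    depth_of = {source: 0}          # hop distance of every visited node
    queue = deque([source])
    while queue:
        node = queue.popleft()
        depth = depth_of[node]
        if depth == k:              # do not expand beyond k hops
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in depth_of:
                depth_of[neighbor] = depth + 1
                queue.append(neighbor)
    return {node for node in depth_of if node != source}


# Toy social network used only for illustration.
friends = {
    "ana": ["bob", "carl"],
    "bob": ["ana", "dina"],
    "carl": ["ana"],
    "dina": ["bob", "eve"],
    "eve": ["dina"],
}

two_hops = k_neighborhood(friends, "ana", 2)
```

Each hop follows pointers to nodes whose location depends on the data itself, which is why this access pattern maps poorly onto sequential block reads. In SQL, the same query typically needs k self-joins or a recursive common table expression.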

At the multidimensional level, traditional OLAP frameworks provide a multi-level, multi-perspective view of the data. They place the relevant measures within the multidimensional space and support their navigation and summarization following the cube metaphor. Graphs provide, in addition to numerical measures, a new class of complex structural measures such as the shortest path between nodes or centrality. Computing and aggregating these measures requires specific algorithms capable of processing and aggregating graphs. ROLAP is widely accepted as the most common implementation approach for data warehouses. Its star and snowflake schemas are built on the relational model and are designed to handle numerical data. They are not well equipped to support the analysis and aggregation of structural properties of graphs. Therefore, ROLAP systems in their current state are not ready for efficient multidimensional analysis of graph data.
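As a rough illustration of what aggregating a graph along a dimension could look like, the sketch below rolls a user-level graph up to a coarser level by grouping nodes and counting the edges between groups. The function `roll_up`, the `node_city` mapping and the edge list are hypothetical and chosen only for this example; they do not reproduce any specific graph OLAP framework.

```python
from collections import Counter


def roll_up(node_level, edges):
    """Aggregate a directed graph to a coarser dimension level.

    `node_level` maps each node to its group at the coarser level;
    `edges` is a list of (source, target) pairs.
    Returns (nodes per group, edges per ordered pair of groups).
    """
    group_sizes = Counter(node_level.values())
    group_edges = Counter((node_level[u], node_level[v]) for u, v in edges)
    return group_sizes, group_edges


# Illustrative data: users rolled up to the city they live in.
node_city = {
    "ana": "Barcelona",
    "bob": "Barcelona",
    "carl": "Brussels",
    "dina": "Brussels",
    "eve": "Dresden",
}
edges = [("ana", "bob"), ("ana", "carl"), ("bob", "dina"), ("dina", "eve")]

city_sizes, city_edges = roll_up(node_city, edges)
```

The result is itself a graph (cities as nodes, weighted edges between them), unlike a classical roll-up that yields a single number per cell. Structural measures such as shortest paths or centrality would then have to be recomputed on the aggregated graph.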

These limitations, at both the database and multidimensional levels, have called for the development of next-generation data warehousing systems that can provide the required features and performance.

Research line: Flexible energy data management

Nowadays, the usage of energy produced by renewable sources such as wind and solar is increasing. Furthermore, new technologies such as electric vehicles and heat pumps may overload the power grid in the future, especially in peak demand situations. In this emerging energy landscape, the power grid is gradually being transformed into a Smart Grid that uses information and communication technologies to improve existing energy services.

Within the Smart Grid, we aim to provide an alternative using the flex-offer (Micro-request) concept. It is based on the idea that energy consumption does not have to occur in fixed time slots: it can be flexible in time, so that part of the consumption is shifted away from demand peaks or closer to production peaks. Furthermore, flex-offers can also be flexible regarding the amount of energy or even its price. For example, a consumer could run the dishwasher a few hours later than intended, because more wind energy will be produced during the shifted time period. As a result, the future energy market will need to manage, store and process large amounts of data representing such flexibilities. Furthermore, the introduction of a new commodity (the flex-offer) will create a new energy market model in which business intelligence techniques will help ensure its best operation. Specifically, we focus on advanced aggregation techniques over complex energy-related data.
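The time flexibility described above can be sketched as follows. This is a simplified model under stated assumptions: the `FlexOffer` class, its fields, the hourly slots and the `best_start` heuristic are all hypothetical, and the forecast values are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FlexOffer:
    earliest_start: int    # first slot in which consumption may begin
    latest_start: int      # last slot in which consumption may begin
    profile: List[float]   # energy needed in each consecutive slot (kWh)


def best_start(offer, production):
    """Pick the start slot whose profile is best covered by forecast production.

    `production` holds the forecast energy per slot and must extend at least
    to `latest_start + len(profile) - 1`. Ties go to the earliest slot.
    """
    def covered(start):
        return sum(
            min(need, production[start + i])
            for i, need in enumerate(offer.profile)
        )

    candidates = range(offer.earliest_start, offer.latest_start + 1)
    return max(candidates, key=covered)


# A dishwasher run of two hourly slots that may start between 18:00 and 22:00.
dishwasher = FlexOffer(earliest_start=18, latest_start=22, profile=[1.0, 0.5])

# Made-up wind forecast for 24 hourly slots, with a peak around 21:00.
wind = [0.2] * 24
wind[21] = 1.2
wind[22] = 0.8

start = best_start(dishwasher, wind)  # slot 21 in this example
```

A real setting would also have to handle amount and price flexibility and schedule many flex-offers against the same production, which is where aggregating flex-offers becomes relevant.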

Research line: Automating Information Extraction from Spreadsheets

Spreadsheet applications have evolved into tools of great importance for businesses, open data and scientific communities. Using these applications, users can perform various transformations, address quality issues, generate new content, and format the data so that they are visually comprehensible. The same data can be presented in different ways, depending on the preferences and intentions of the user.

All this makes spreadsheet applications user-friendly, but not machine-friendly. When it comes to integrating spreadsheets with other sources, their structural and formatting flexibility is a disadvantage. In other words, it is rather difficult to interpret the contents of these files algorithmically. Current practices require manual intervention, which is cumbersome and time-consuming.

Overall, the lack of an automatic processing method limits our ability to explore and reuse the large amount of rich data stored in partially structured documents such as spreadsheets. In this research line we aim to solve this issue by developing a system able to understand the characteristics (e.g., structure and content type) of the data in spreadsheets. Such a system has to automatically perform many consecutive tasks, each dealing with a different aspect (challenge), before being able to extract the data in a usable form. However, we should consider that not all spreadsheets contain meaningful data. They are used not only to work with tabular data, but also to create forms, scorecards, graphs and other structures that are not genuine tables. The intended solution should be able to discard these files.

In this research project, we are particularly interested in those spreadsheets containing data that can be transformed into the relational model. This allows us, on the one hand, to put spreadsheet data under the control of DBMSs and, on the other hand, to provide these data to a wide range of applications for data analysis, entity augmentation, etc. Since spreadsheets that contain relational knowledge can exhibit different characteristics, we need a flexible workflow of different transformation activities.

Finally, we aim for a solution able to work with large spreadsheet corpora. This will enable us to build a system that can be used at an enterprise level or that can be an integral component of research projects from related areas, such as information retrieval and data management.

Related publications

2024
Besim Bilalli, Petar Jovanovic, Sergi Nadal, Anna Queralt, Oscar Romero: There is no Data Science without Data Governance: a Proposal Based on Knowledge Graphs. DOLAP 2024

2023
Sergi Nadal, Alberto Abelló, Oscar Romero, Stijn Vansummeren, Panos Vassiliadis: Graph-Driven Federated Data Management. IEEE Trans. Knowl. Data Eng. 2023

2022
Sergi Nadal, Alberto Abelló, Oscar Romero, Stijn Vansummeren, Panos Vassiliadis: Graph-Driven Federated Data Management (Extended Abstract). ICDE 2022
Yalei Li, Sergi Nadal, Oscar Romero: A Data Quality Framework for Graph-Based Virtual Data Integration Systems. ADBIS 2022
Javier Flores, Emmanuel Jamin, Sergi Nadal, Oscar Romero: The Knowledge Graph Lifecycle in NTT DATA. ISWC (Posters/Demos/Industry) 2022

2021
Amine Ghrab, Oscar Romero, Sabri Skhiri, Esteban Zimányi: TopoGraph: an End-To-End Framework to Build and Analyze Graph Cubes. Inf. Syst. Frontiers 2021
Moditha Hewasinghage, Alberto Abelló, Jovan Varga, Esteban Zimányi: Managing polyglot systems metadata with hypergraphs. Data Knowl. Eng. 2021

2019
Elvis Koci, Dana Kuban, Nico Luettig, Dominik Olwig, Maik Thiele, Julius Gonsior, Wolfgang Lehner, Oscar Romero: XLIndy: Interactive Recognition and Information Extraction in Spreadsheets. DocEng 2019
Elvis Koci, Maik Thiele, Oscar Romero, Wolfgang Lehner: A Genetic-Based Search for Adaptive Table Recognition in Spreadsheets. ICDAR 2019
Elvis Koci, Maik Thiele, Josephine Rehak, Oscar Romero, Wolfgang Lehner: DECO: A Dataset of Annotated Spreadsheets for Layout and Table Recognition. ICDAR 2019
Carles Farré, Jovan Varga, Robert Almar: GraphQL Schema Generation for Data-Intensive Web APIs. MEDI 2019

2018
Elvis Koci, Maik Thiele, Wolfgang Lehner, Oscar Romero: Table Recognition in Spreadsheets via a Graph Representation. DAS 2018
Emmanouil Valsomatzis, Torben Bach Pedersen, Alberto Abelló: Day-ahead Trading of Aggregated Energy Flexibility. e-Energy 2018
Moditha Hewasinghage, Jovan Varga, Alberto Abelló, Esteban Zimányi: Managing Polyglot Systems Metadata with Hypergraphs. ER 2018
Emmanouil Valsomatzis, Torben Bach Pedersen, Alberto Abelló: Day-ahead Trading of Aggregated Energy Flexibility - Full Version. CoRR 2018
Amine Ghrab, Oscar Romero, Salim Jouili, Sabri Skhiri: Graph BI & Analytics: Current State and Future Challenges. DaWaK 2018
Rohit Kumar: Temporal graph mining and distributed processing. 2018

2017
Elvis Koci, Maik Thiele, Oscar Romero, Wolfgang Lehner: Table Identification and Reconstruction in Spreadsheets. CAiSE 2017
Rohit Kumar, Alberto Abelló, Toon Calders: Cost Model for Pregel on GraphX. ADBIS 2017

2016
Elvis Koci, Maik Thiele, Oscar Romero, Wolfgang Lehner: A Machine Learning Approach for Layout Inference in Spreadsheets. KDIR 2016
Elvis Koci, Maik Thiele, Oscar Romero, Wolfgang Lehner: Cell Classification for Layout Recognition in Spreadsheets. IC3K 2016
Emmanouil Valsomatzis, Torben Bach Pedersen, Alberto Abelló, Katja Hose, Laurynas Siksnys: Towards constraint-based aggregation of energy flexibilities. e-Energy (Posters) 2016
Emmanouil Valsomatzis, Torben Bach Pedersen, Alberto Abelló, Katja Hose: Aggregating energy flexibilities under constraints. SmartGridComm 2016
Amine Ghrab, Oscar Romero, Sabri Skhiri, Alejandro A. Vaisman, Esteban Zimányi: GRAD: On Graph Database Modeling. CoRR 2016

2015
Amine Ghrab, Oscar Romero, Sabri Skhiri, Alejandro A. Vaisman, Esteban Zimányi: A Framework for Building OLAP Cubes on Graphs. ADBIS 2015