DTIM | UPC

Data Pre-processing

Description

Our capability to gather data has grown enormously, whereas our ability to analyze it lags far behind. Storing huge volumes of data is worth the effort only if we are able to identify valid, novel, potentially useful, and ultimately understandable patterns in the data or, put differently, if we are able to transform data into knowledge. The process of transforming data into knowledge was formerly known as Knowledge Discovery in Databases (KDD), but nowadays is referred to as Data Analysis or even just Data Mining. Whatever researchers call it, this process generally consists of the following steps: data selection, data pre-processing, data mining, and evaluation or interpretation. Each step requires careful attention, since any change, however small, can hugely impact the final result.

The process starts by selecting, or sifting, the relevant data from the whole range of available data. The selection of the data affects the results, but the representation and the quality of the data affect them even more. The data to be analyzed commonly arrives in raw form, meaning that it may be irrelevant, redundant, or incomplete, and hence requires pre-processing. It is commonly estimated that between 50% and 80% of data analysis time is spent on pre-processing, which makes it the most time-consuming step of the whole process. Yet that is not the only concern. Equally important, novice users often lack the background to know which pre-processing operations will positively or negatively impact their final result. Pre-processing performed blindly serves no purpose.
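To make the three raw-data problems named above concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the group's tooling: the function `preprocess`, the column names, and the choice of fixes (dropping exact duplicates for redundancy, dropping constant columns for irrelevance, mean imputation for incompleteness) are all hypothetical.

```python
from statistics import mean


def preprocess(rows):
    """Clean a list of dict records (None marks a missing value).

    Hypothetical helper: each fix is one simple choice among many.
    """
    # Redundant data: drop exact duplicate records, keeping the first one.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))
    if not unique:
        return []

    # Irrelevant data: drop columns whose observed values never vary.
    kept = []
    for col in unique[0]:
        observed = {r[col] for r in unique if r[col] is not None}
        if len(observed) > 1:
            kept.append(col)
    cleaned = [{c: r[c] for c in kept} for r in unique]

    # Incomplete data: fill missing numeric values with the column mean.
    for col in kept:
        present = [r[col] for r in cleaned if r[col] is not None]
        if all(isinstance(v, (int, float)) for v in present):
            fill = mean(present)
            for r in cleaned:
                if r[col] is None:
                    r[col] = fill
    return cleaned


raw = [
    {"source": "A", "age": 30, "income": 1000.0},
    {"source": "A", "age": 30, "income": 1000.0},    # exact duplicate
    {"source": "A", "age": None, "income": 2000.0},  # missing age
    {"source": "A", "age": 50, "income": 3000.0},
]
clean = preprocess(raw)
print(clean)
```

Each of these fixes can help or hurt depending on the data and the downstream algorithm, which is exactly why applying them blindly is risky.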

Taking these two concerns into account, the research in our group focuses on, first, reducing the time spent on pre-processing by providing user support and, second, providing goal-oriented support (e.g., improving the final result) by ultimately “automating” the pre-processing step. We know that complete automation is challenging and perhaps not even feasible because of the interactive and iterative nature of the problem; however, we contend that any move towards it is of utmost importance.

After pre-processing comes the step of selecting the most adequate mining algorithm for a given problem. Many different algorithms are available and their performance can vary considerably. No single algorithm outperforms the rest on every problem, so one has to be chosen over the others for the task at hand. The final step is evaluating or interpreting the generated models.
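One common way to choose among candidate algorithms is to compare their cross-validated accuracy on the data at hand. The sketch below assumes a toy two-class dataset and two deliberately simple candidates (a majority-class baseline and a 1-nearest-neighbour classifier); all function names are hypothetical and the code is only meant to illustrate the selection step.

```python
from collections import Counter


def majority_classifier(train_X, train_y):
    """Baseline: always predict the most frequent training label."""
    label = Counter(train_y).most_common(1)[0][0]
    return lambda x: label


def nearest_neighbour_classifier(train_X, train_y):
    """Predict the label of the closest training point (squared Euclidean)."""
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in train_X]
        return train_y[dists.index(min(dists))]
    return predict


def cross_val_accuracy(make_classifier, X, y, folds=3):
    """Mean accuracy over `folds` interleaved train/test splits."""
    scores = []
    for f in range(folds):
        test_idx = set(range(f, len(X), folds))
        train_X = [x for i, x in enumerate(X) if i not in test_idx]
        train_y = [l for i, l in enumerate(y) if i not in test_idx]
        predict = make_classifier(train_X, train_y)
        hits = sum(predict(X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / folds


def select_algorithm(candidates, X, y, folds=3):
    """Score every candidate and return the best name plus all scores."""
    scores = {name: cross_val_accuracy(make, X, y, folds)
              for name, make in candidates.items()}
    return max(scores, key=scores.get), scores


# Toy data: two well-separated clusters (an assumption for illustration).
X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.3)]
y = ["a", "a", "a", "b", "b", "b"]
candidates = {"majority": majority_classifier,
              "1-nn": nearest_neighbour_classifier}
best, scores = select_algorithm(candidates, X, y)
print(best, scores)
```

On a different dataset the ranking could easily flip, which is the practical meaning of "no single algorithm outperforms the rest".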

Research line: ETL for Advanced Data Analytics

Data Analysis poses difficulties for experts, let alone novice users, because in-depth knowledge of Machine Learning, Statistics, and Database Systems is generally required. Our research aims to prevent Data Analysis from being an asset only in the hands of experts by making it available to novice users. This entails hiding the complexities underlying the process of data analysis by automating it at best, or semi-automating it at worst. We aim to develop an intelligent system that provides user assistance during the whole process of data analysis.

We are pursuing two main directions:

1) Since the effectiveness of, and need for, domain knowledge in data analysis have been confirmed in past research efforts, we are exploring how to embed and incorporate domain knowledge into systems that aim at supporting the user.

2) In order to automate the selection and composition of pre-processing and data mining operators (e.g., algorithms), we are exploring different techniques such as Case-Based Reasoning and Meta-Learning.
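As a rough illustration of the second direction, the sketch below recommends a pre-processing operator for a new dataset by retrieving the most similar previously analyzed datasets, described by simple meta-features. This is a generic nearest-neighbour form of meta-learning that also resembles the retrieval step of case-based reasoning; it is not the method of any system listed below. The meta-features, the `case_base` entries, and the operator names are all made-up assumptions.

```python
import math


def meta_features(rows):
    """Describe a dataset (equal-length tuples, None = missing) numerically.

    Hypothetical meta-features: log10 of row count, column count,
    and the fraction of missing cells.
    """
    n_rows = len(rows)
    n_cols = len(rows[0])
    n_missing = sum(v is None for r in rows for v in r)
    return (math.log10(n_rows), float(n_cols), n_missing / (n_rows * n_cols))


def recommend(case_base, features, k=1):
    """Return the operator most often recorded as best among the k nearest cases."""
    # Note: a real system would normalize meta-features before measuring distance.
    ranked = sorted(case_base,
                    key=lambda case: math.dist(case["features"], features))
    votes = {}
    for case in ranked[:k]:
        op = case["best_operator"]
        votes[op] = votes.get(op, 0) + 1
    return max(votes, key=votes.get)


# Invented past experience: meta-features -> operator that helped most.
case_base = [
    {"features": (2.0, 5.0, 0.00), "best_operator": "standardize"},
    {"features": (2.0, 5.0, 0.30), "best_operator": "impute_mean"},
    {"features": (4.0, 50.0, 0.01), "best_operator": "feature_selection"},
]

# New dataset: 100 rows, 5 columns, two columns missing on even rows.
new_dataset = [
    (float(i),
     None if i % 2 == 0 else 1.0,
     None if i % 2 == 0 else 2.0,
     3.0,
     4.0)
    for i in range(100)
]
suggestion = recommend(case_base, meta_features(new_dataset))
print(suggestion)
```

The quality of such a recommendation depends entirely on how well the chosen meta-features capture what makes an operator useful, and on how representative the stored cases are.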

Ultimately, our goal is to generate complete workflows for users who want to perform analysis. This means that the system will help the user combine different pre-processing and mining algorithms in order to achieve better results.

Related publications

2022
Joseph Giovanelli, Besim Bilalli, Alberto Abelló: Data pre-processing pipeline generation for AutoETL. Inf. Syst. 2022
Yalei Li, Sergi Nadal, Oscar Romero: A Data Quality Framework for Graph-Based Virtual Data Integration Systems. ADBIS 2022

2021
Lidia López, Martí Manzano, Cristina Gómez, Marc Oriol, Carles Farré, Xavier Franch, Silverio Martínez-Fernández, Anna Maria Vollmer: QaSD: A Quality-aware Strategic Dashboard for supporting decision makers in Agile Software Development. Sci. Comput. Program. 2021
Joseph Giovanelli, Besim Bilalli, Alberto Abelló: Effective data pre-processing for AutoML. DOLAP 2021

2019
Besim Bilalli, Alberto Abelló, Tomàs Aluja-Banet, Robert Wrembel: PRESISTANT: Learning based assistant for data pre-processing. Data Knowl. Eng. 2019

2018
Besim Bilalli, Alberto Abelló, Tomàs Aluja-Banet, Robert Wrembel: Intelligent assistance for data pre-processing. Comput. Stand. Interfaces 2018
Besim Bilalli, Alberto Abelló, Tomàs Aluja-Banet, Rana Faisal Munir, Robert Wrembel: PRESISTANT: Data Pre-processing Assistant. CAiSE Forum 2018
Besim Bilalli, Alberto Abelló, Tomàs Aluja-Banet, Robert Wrembel: PRESISTANT: Learning based assistant for data pre-processing. CoRR 2018
Besim Bilalli: Learning the impact of data pre-processing in data analysis. 2018

2017
Besim Bilalli, Alberto Abelló, Tomàs Aluja-Banet: On the predictive power of meta-features in OpenML. Int. J. Appl. Math. Comput. Sci. 2017

2016
Elvis Koci, Maik Thiele, Oscar Romero, Wolfgang Lehner: A Machine Learning Approach for Layout Inference in Spreadsheets. KDIR 2016
Besim Bilalli, Alberto Abelló, Tomàs Aluja-Banet, Robert Wrembel: Automated Data Pre-processing via Meta-learning. MEDI 2016
Vasileios Theodorou, Alberto Abelló, Wolfgang Lehner, Maik Thiele: Quality measures for ETL processes: from goals to implementation. Concurr. Comput. Pract. Exp. 2016

2015
Vasileios Theodorou, Alberto Abelló, Maik Thiele, Wolfgang Lehner: POIESIS: a Tool for Quality-aware ETL Process Redesign. EDBT 2015

2014
Vasileios Theodorou, Alberto Abelló, Wolfgang Lehner: Quality Measures for ETL Processes. DaWaK 2014