Data Pre-processing

Alberto Abelló, Tomàs Aluja, Besim Bilalli, Vasileios Theodorou

  • Description

    Our capacity to gather data has grown enormously, whereas our ability to analyze it lags far behind. Storing huge volumes of data is worth the effort only if we can identify valid, novel, potentially useful, and ultimately understandable patterns in it or, put differently, if we can transform data into knowledge. The process of transforming data into knowledge was formerly known as Knowledge Discovery in Databases (KDD), but nowadays it is referred to as Data Analysis or even just Data Mining. Whatever researchers call it, this process generally consists of the following steps: data selection, data pre-processing, data mining, and evaluation or interpretation. Each step requires careful attention, since any change, however small, can hugely impact the final result.

    The process starts by selecting, or sifting, the relevant data from the whole range of available data. The selection of the data affects the results, but the representation and the quality of the data affect them even more. The data to be analyzed is commonly available only in raw form, meaning that it may contain irrelevant, redundant, and incomplete parts, and hence requires pre-processing. It is commonly reported that pre-processing takes between 50% and 80% of the data analysis time, which makes it the most time-consuming step of the whole process. Yet that is not the only concern. Equally important is that novice users lack the background to know which pre-processing operations will positively or negatively impact their final result. Pre-processing performed blindly serves no purpose.
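
    To make the three defects named above concrete, the following minimal sketch handles redundant rows, incomplete values, and irrelevant columns on a toy table. It is an illustration under stated assumptions, not the group's method: the function name `preprocess`, the use of `None` for missing values, mean imputation, and the "constant column is irrelevant" rule are all choices made only for this example.

```python
from statistics import mean


def preprocess(rows):
    """Clean a list of equal-length numeric rows; None marks a missing value."""
    # Redundant data: drop exact duplicate rows, keeping first occurrences.
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(list(row))
    if not unique:
        return []

    # Incomplete data: replace each missing value with its column mean
    # (assumption: mean imputation; a column with no values gets 0.0).
    n_cols = len(unique[0])
    for c in range(n_cols):
        present = [r[c] for r in unique if r[c] is not None]
        fill = mean(present) if present else 0.0
        for r in unique:
            if r[c] is None:
                r[c] = fill

    # Irrelevant data: drop columns that hold a single constant value
    # (assumption: a constant column carries no information).
    keep = [c for c in range(n_cols) if len({r[c] for r in unique}) > 1]
    return [[r[c] for c in keep] for r in unique]


raw = [
    [1.0, 7.0, 10.0],
    [1.0, 7.0, 10.0],  # duplicate of the first row
    [2.0, 7.0, None],  # missing third value
    [3.0, 7.0, 30.0],
]
cleaned = preprocess(raw)
print(cleaned)
```

    Even in this toy case the order of the operations matters (imputing before or after removing duplicates changes the column mean), which hints at why choosing pre-processing blindly is risky.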

    Taking these two issues into account, the research in our group focuses on, first, reducing the time spent on pre-processing by providing user support and, second, making that support goal-oriented (e.g., aimed at improving the final result), ultimately “automating” the pre-processing step. We know that complete automation is challenging and may not even be feasible because of the interactive and iterative nature of the problem; however, we contend that any step towards it is of utmost importance.

    After pre-processing comes the step of selecting the most adequate mining algorithm for a given problem. Many different algorithms are available and their performance can vary considerably. No single algorithm outperforms the rest on every problem, so one has to be chosen for the problem at hand. The final step is evaluating or interpreting the generated models.
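
    One simple way to choose among candidate algorithms is to compare them empirically on held-out data. The sketch below assumes two toy classifiers (a majority-class baseline and a 1-nearest-neighbour rule) and a tiny invented dataset; the names `select_algorithm`, `accuracy`, and the data are hypothetical and serve only to illustrate the selection step.

```python
from collections import Counter


def majority_classifier(train):
    """Always predict the most frequent label seen in training."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def nearest_neighbour_classifier(train):
    """Predict the label of the closest training point (squared Euclidean)."""
    def predict(x):
        closest = min(
            train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x))
        )
        return closest[1]
    return predict


def accuracy(model, data):
    """Fraction of examples in data that the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


def select_algorithm(candidates, train, validation):
    """Train every candidate and return the best name plus all scores."""
    scores = {
        name: accuracy(build(train), validation)
        for name, build in candidates.items()
    }
    return max(scores, key=scores.get), scores


# Invented toy data: (features, label) pairs.
train = [((0.0,), "a"), ((0.1,), "a"), ((0.2,), "a"), ((1.0,), "b"), ((1.1,), "b")]
validation = [((0.05,), "a"), ((0.9,), "b"), ((1.2,), "b")]

candidates = {
    "majority": majority_classifier,
    "1-nn": nearest_neighbour_classifier,
}
best, scores = select_algorithm(candidates, train, validation)
print(best, scores)
```

    On a different dataset the ranking could easily be reversed, which is the practical meaning of the statement that no single algorithm wins everywhere.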

    Research line: ETL for Advanced Data Analytics

    Data Analysis poses difficulties even for experts, let alone novice users, because in-depth knowledge of Machine Learning, Statistics, and Database Systems is generally required. Our research aims to prevent Data Analysis from being an asset only in the hands of experts by making it accessible to novice users. This entails hiding the complexities underlying the data analysis process by automating it at best or semi-automating it at worst. We aim to develop an intelligent system that provides user assistance throughout the whole process of data analysis.

    We are pursuing two main directions:

    1) Since past research has confirmed both the effectiveness of and the need for domain knowledge in data analysis, we are exploring how to embed domain knowledge into systems that aim to support the user.

    2) To automate the selection and composition of pre-processing and data mining operators (e.g., algorithms), we are exploring different techniques such as Case-Based Reasoning and Meta-Learning.
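
    The second direction can be illustrated with a very small sketch that combines the two ideas: each past dataset is described by meta-features and stored with the operator that worked best on it, and a new dataset receives the recommendation of its most similar past case. This is a simplified illustration, not the system described in the publications below; the three meta-features, the case base, and the operator names are all hypothetical.

```python
import math


def meta_features(rows):
    """Describe a dataset by a few simple, hypothetical meta-features."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    missing = sum(value is None for row in rows for value in row)
    # (log10 of row count, column count, fraction of missing cells)
    return (math.log10(n_rows), float(n_cols), missing / (n_rows * n_cols))


def recommend(new_features, case_base):
    """Return the operator stored with the most similar past case."""
    nearest = min(
        case_base,
        key=lambda case: math.dist(case["features"], new_features),
    )
    return nearest["best_operator"]


# Invented case base: meta-features of past datasets and the
# pre-processing operator assumed to have helped most on each.
case_base = [
    {"features": (2.0, 5.0, 0.30), "best_operator": "impute_missing"},
    {"features": (4.0, 50.0, 0.00), "best_operator": "feature_selection"},
    {"features": (3.0, 4.0, 0.00), "best_operator": "normalize"},
]

# New dataset: 100 rows, 5 columns, one missing value per row.
new_dataset = [[1.0, None, 3.0, 4.0, 5.0] for _ in range(100)]
suggestion = recommend(meta_features(new_dataset), case_base)
print(suggestion)
```

    A realistic version would need many more cases, scaled meta-features, and a learned model in place of the single nearest case; the sketch only shows the shape of the idea.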

    Ultimately, our goal is to generate complete workflows for users who want to perform analysis. This means that the system will help the user combine different pre-processing and mining algorithms in order to achieve better results.

    Related publications
    Elvis Koci, Maik Thiele, Oscar Romero, Wolfgang Lehner: A Machine Learning Approach for Layout Inference in Spreadsheets. KDIR 2016
    Besim Bilalli, Alberto Abelló, Tomàs Aluja-Banet, Robert Wrembel: Automated Data Pre-processing via Meta-learning. MEDI 2016
    Vasileios Theodorou, Alberto Abelló, Wolfgang Lehner, Maik Thiele: Quality measures for ETL processes: from goals to implementation. Concurrency and Computation: Practice and Experience 2016
    Vasileios Theodorou, Alberto Abelló, Maik Thiele, Wolfgang Lehner: POIESIS: a Tool for Quality-aware ETL Process Redesign. EDBT 2015
    Vasileios Theodorou, Alberto Abelló, Wolfgang Lehner: Quality Measures for ETL Processes. DaWaK 2014