Data Discovery
Javier Flores, Cristina Gómez, Sergi Nadal, Raquel Panadero, Oscar Romero
Description
Data discovery is the task of identifying interesting or relevant datasets that enable informed data analysis. Discovering and integrating datasets is nowadays a largely manual and arduous task that consumes up to 80% of a data scientist's time. The problem is aggravated by the proliferation of large repositories of heterogeneous data, such as data lakes or open data initiatives. Given the unprecedented web-scale volumes of heterogeneous data sources, manual data discovery becomes infeasible and calls for automation. Hence, we focus on the very first task of data discovery: the problem of discovering joinable attributes among structured datasets in a data lake.
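To make the notion of joinable attributes concrete, the following is a minimal sketch of a naive baseline, not the method of the publications listed on this page. It assumes that two attributes are join candidates when a large fraction of the distinct values of one also appears in the other (a containment score). The function names, the 0.5 threshold, and the toy tables are all hypothetical and chosen only for illustration.

```python
from itertools import product


def containment(a, b):
    """Fraction of the distinct values of `a` that also appear in `b`."""
    set_a, set_b = set(a), set(b)
    if not set_a:
        return 0.0
    return len(set_a & set_b) / len(set_a)


def joinable_pairs(left, right, threshold=0.5):
    """Return attribute pairs whose containment reaches the threshold.

    `left` and `right` map attribute names to lists of values.
    The threshold is an illustrative assumption, not a recommended setting.
    """
    pairs = []
    for (name_l, col_l), (name_r, col_r) in product(left.items(), right.items()):
        score = containment(col_l, col_r)
        if score >= threshold:
            pairs.append((name_l, name_r, score))
    # Highest-scoring candidates first
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Hypothetical toy datasets standing in for two files in a data lake
cities = {
    "city": ["Barcelona", "Madrid", "Girona"],
    "population": [1600000, 3200000, 100000],
}
weather = {
    "location": ["Barcelona", "Madrid", "Sevilla", "Valencia"],
    "temp_c": [18, 15, 22, 19],
}

candidates = joinable_pairs(cities, weather)
print(candidates)
```

Comparing every pair of attributes by their full value sets scales poorly with the number and size of datasets, which is one reason manual or brute-force discovery does not carry over to data lakes.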
Related publications
2023
Sergi Nadal, Raquel Panadero, Javier Flores, Oscar Romero: Measuring and Predicting the Quality of a Join for Data Discovery. CoRR, 2023.
2020
Javier Flores, Sergi Nadal, Oscar Romero: Scalable Data Discovery Using Profiles. CoRR, 2020.