NextiaJD: Effective and Scalable Data Discovery
Javier Flores, Sergi Nadal, Oscar Romero
Description
NextiaJD is a system that supports data discovery over data lakes (i.e., large-scale heterogeneous data repositories). Its novelty lies in a learning-based approach to data discovery that relies on dataset profiles: succinct representations that capture the underlying characteristics of the schemata and data values of datasets, and that can be efficiently extracted in a parallel and distributed fashion. Profiles are then compared to predict the quality of a join operation between a pair of attributes from different datasets. NextiaJD is implemented as an extension of Apache Spark.
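To make the profile-based idea more concrete, here is a minimal single-machine sketch in plain Python. It is not NextiaJD's API or implementation: the names `Profile`, `build_profile`, `compare_profiles`, and `toy_join_score` are hypothetical, the three features are chosen only for illustration, and the hand-written score is a stand-in for the learned model that NextiaJD uses to predict join quality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical attribute profile: a few cheap summary features."""
    cardinality: int    # number of distinct values
    uniqueness: float   # distinct values / total values
    avg_length: float   # mean string length of the values


def build_profile(values):
    """Summarise one attribute (column) so its raw values are no longer needed."""
    as_text = [str(v) for v in values]
    if not as_text:
        raise ValueError("cannot profile an empty attribute")
    distinct = set(as_text)
    return Profile(
        cardinality=len(distinct),
        uniqueness=len(distinct) / len(as_text),
        avg_length=sum(len(v) for v in as_text) / len(as_text),
    )


def _relative_difference(x, y):
    """Difference scaled to [0, 1] by the larger magnitude."""
    largest = max(abs(x), abs(y))
    return 0.0 if largest == 0 else abs(x - y) / largest


def compare_profiles(a, b):
    """Pairwise features: one relative difference per profile feature."""
    return {
        "cardinality": _relative_difference(a.cardinality, b.cardinality),
        "uniqueness": _relative_difference(a.uniqueness, b.uniqueness),
        "avg_length": _relative_difference(a.avg_length, b.avg_length),
    }


def toy_join_score(a, b):
    """Stand-in for a trained predictor: 1 minus the mean feature difference."""
    differences = compare_profiles(a, b)
    return 1.0 - sum(differences.values()) / len(differences)


if __name__ == "__main__":
    # Two attributes from different (made-up) datasets.
    country_codes = build_profile(["ES", "FR", "DE", "IT"])
    customer_country = build_profile(["ES", "FR", "DE", "PT", "ES"])
    print(compare_profiles(country_codes, customer_country))
    print(toy_join_score(country_codes, customer_country))
```

The point of the sketch is the shape of the pipeline: each attribute is reduced once to a small profile, and candidate joins are then ranked by comparing profiles rather than by scanning the underlying values. In a distributed setting the per-attribute summaries would be computed by the processing engine instead of in-memory Python lists.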
Related publications
Javier Flores, Sergi Nadal, Oscar Romero: Scalable Data Discovery Using Profiles. CoRR, 2020.