DTIM | UPC

NextiaJD: Effective and Scalable Data Discovery

Javier Flores, Sergi Nadal, Oscar Romero

Description
Nextia_JD is a system that supports data discovery over data lakes (i.e., large-scale heterogeneous data repositories). Its novelty lies in a learning-based approach to data discovery that relies on dataset profiles: succinct representations that capture the underlying characteristics of the schemata and data values of datasets, and that can be efficiently extracted in a parallel and distributed fashion. Profiles are then compared to predict the quality of a join operation between a pair of attributes from different datasets. Nextia_JD is implemented as an extension of Apache Spark.
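
To make the profile idea concrete, here is a minimal sketch in plain Python. It is not Nextia_JD's API and it does not use Spark. The `Profile` fields, the `profile` and `join_score` functions, and the sample columns are all hypothetical, chosen only for illustration. In particular, `join_score` is a hand-written similarity that stands in for the learned predictor described above, and the three features are far simpler than a real profile would be.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical, minimal profile of one attribute (column)."""
    cardinality: int     # number of distinct non-null values
    uniqueness: float    # distinct values / non-null values
    mean_length: float   # average string length of the non-null values


def profile(values):
    """Summarise a column into a few numbers, without keeping its values."""
    present = [str(v) for v in values if v is not None]
    if not present:
        return Profile(0, 0.0, 0.0)
    distinct = set(present)
    return Profile(
        cardinality=len(distinct),
        uniqueness=len(distinct) / len(present),
        mean_length=sum(len(v) for v in present) / len(present),
    )


def _ratio(a, b):
    """Similarity in [0, 1] between two non-negative numbers."""
    if a == b:
        return 1.0
    return min(a, b) / max(a, b)


def join_score(p, q):
    """Assumed stand-in for a learned model: mean per-feature similarity."""
    return (
        _ratio(p.cardinality, q.cardinality)
        + _ratio(p.uniqueness, q.uniqueness)
        + _ratio(p.mean_length, q.mean_length)
    ) / 3


# Illustrative columns taken from three imaginary datasets.
countries_a = ["Spain", "France", "Italy", "Spain"]
countries_b = ["Spain", "France", "Italy", "Portugal"]
order_ids = list(range(1000000, 1000020))

same_domain = join_score(profile(countries_a), profile(countries_b))
different_domain = join_score(profile(countries_a), profile(order_ids))

print(f"countries_a vs countries_b: {same_domain:.2f}")
print(f"countries_a vs order_ids:   {different_domain:.2f}")
```

Only the profiles are compared, never the raw values, which is what lets the profiling step be computed independently per column. This sketch cannot tell whether two columns actually share values. It only illustrates the comparison step, and a trained model over richer profiles would take the place of `join_score`.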

Read more about Nextia_JD here

Related publications
2020
Javier Flores, Sergi Nadal, Oscar Romero: Scalable Data Discovery Using Profiles. CoRR 2020