NextiaJD: Effective and Scalable Data Discovery

Javier Flores, Sergi Nadal, Oscar Romero

More info: GitHub page

  • Description

    NextiaJD is a system that supports data discovery over data lakes (i.e., large-scale heterogeneous data repositories). NextiaJD's novelty lies in a learning-based approach to data discovery that relies on dataset profiles. These are succinct representations that capture the underlying characteristics of the schemata and data values of datasets, and they can be extracted efficiently in a parallel and distributed fashion. Profiles are then compared to predict the quality of a join operation between a pair of attributes from different datasets. NextiaJD is implemented as an extension of Apache Spark.
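    The idea of comparing profiles instead of raw data can be illustrated with a minimal sketch in plain Python. This is not NextiaJD's API or its actual feature set: the functions `profile`, `profile_distances`, and `containment`, as well as the three statistics chosen, are hypothetical and picked only for illustration. The sketch assumes that join quality can be approximated by value containment, and it stops at building the feature vector that a learned predictor would consume; the predictor itself is not reproduced here.

```python
from statistics import mean


def _clean(values):
    """Drop nulls and normalize values to strings (assumed preprocessing)."""
    return [str(v) for v in values if v is not None]


def profile(values):
    """Summarize a column with a few cheap statistics (illustrative features only)."""
    non_null = _clean(values)
    distinct = set(non_null)
    return {
        "cardinality": len(distinct),
        "uniqueness": len(distinct) / len(non_null) if non_null else 0.0,
        "avg_length": mean(len(v) for v in non_null) if non_null else 0.0,
    }


def profile_distances(p, q):
    """Feature vector for a column pair: absolute difference per profile feature."""
    return {name: abs(p[name] - q[name]) for name in p}


def containment(left, right):
    """Fraction of distinct left values that also occur in right.

    Used here as an assumed stand-in for the join quality a model would
    learn to predict from profile distances, without scanning the data.
    """
    l, r = set(_clean(left)), set(_clean(right))
    return len(l & r) / len(l) if l else 0.0


# Two hypothetical attributes from different datasets.
countries = ["ES", "FR", "DE", "IT"]
codes = ["ES", "FR", "DE", "PT", "ES"]

features = profile_distances(profile(countries), profile(codes))
print(features)                       # input a learned predictor would receive
print(containment(countries, codes))  # quantity such a predictor would estimate
```

    The point of the design is that each profile is computed once per attribute and is small, so candidate pairs can be compared without touching the underlying values again; only the computation of the profiles themselves needs a pass over the data.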

    Read more about NextiaJD here


    Related publications
    2020
    Javier Flores, Sergi Nadal, Oscar Romero: Scalable Data Discovery Using Profiles. CoRR 2020