NextiaJD: Scalable Data Discovery Using Profiles

NextiaJD is a system that supports data discovery over data lakes (i.e., large-scale heterogeneous data repositories). This website is a companion to the research and demonstration papers submitted to EDBT 2021, where we present the method underlying our approach. NextiaJD's novelty lies in a learning-based approach to data discovery that relies on dataset profiles. These are succinct representations that capture the underlying characteristics of the schemata and data values of datasets, and they can be efficiently extracted in a parallel and distributed fashion. Profiles are then compared to predict the quality of a join operation between a pair of attributes from different datasets. NextiaJD is implemented as an extension of Apache Spark.
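To make the idea of comparing profiles more concrete, here is a minimal sketch in plain Scala. It is not NextiaJD's API, feature set, or predictive model: the `Profile` case class, its three statistics, and the use of containment as a notion of join quality are all assumptions chosen for illustration only.

```scala
// Illustrative sketch only: not NextiaJD's API, feature set, or predictive model.
object ProfileSketch {

  // A hypothetical, tiny attribute profile: three summary statistics of a column.
  final case class Profile(cardinality: Int, uniqueness: Double, avgLength: Double)

  // Build the profile of a string-valued column.
  def profile(values: Seq[String]): Profile = {
    require(values.nonEmpty, "column must not be empty")
    val distinct = values.distinct.size
    Profile(
      cardinality = distinct,
      uniqueness = distinct.toDouble / values.size,
      avgLength = values.map(_.length).sum.toDouble / values.size
    )
  }

  // Compare two profiles feature by feature (absolute differences).
  // A learned model could consume such distances; here we only compute them.
  def distances(a: Profile, b: Profile): Vector[Double] =
    Vector(
      math.abs(a.cardinality - b.cardinality).toDouble,
      math.abs(a.uniqueness - b.uniqueness),
      math.abs(a.avgLength - b.avgLength)
    )

  // Containment of a in b: fraction of a's distinct values that also occur in b.
  // Treated here as one possible notion of join quality (an assumption).
  def containment(a: Seq[String], b: Seq[String]): Double = {
    val da = a.toSet
    if (da.isEmpty) 0.0 else da.intersect(b.toSet).size.toDouble / da.size
  }

  def main(args: Array[String]): Unit = {
    val countries = Seq("ES", "FR", "DE", "IT")
    val codes     = Seq("ES", "FR", "FR", "PT")
    println(distances(profile(countries), profile(codes)))
    println(containment(countries, codes))
  }
}
```

In this toy setting the profiles are cheap to compute per column, while containment requires looking at the actual values of both columns. That difference is the intuition behind predicting join quality from profiles instead of computing it directly.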

People

Publications

2021

Towards Scalable Data Discovery. Short paper published in EDBT 2021.
Effective and scalable data discovery with NextiaJD. Demo paper published in EDBT 2021.

2020

An integration data tool for joinable tables based on Apache Spark. Master thesis.

Resources

Software repository

The source code of the system can be found in the following GitHub repository.

The easiest way to use NextiaJD is with Maven. For SBT, just add the following dependency to your build.sbt:

libraryDependencies += "edu.upc.essi.dtim.nextiajd" % "nextiajd_2.12" % "1.0.1"

For more ways to add NextiaJD using Maven, please go here

You can check how to use NextiaJD here, or follow the step-by-step explanation in the Zeppelin notebook (see the Demonstration section).

Ground truth datasets

The datasets used in this work have been obtained from open data repositories with no copyright, such as Kaggle and OpenML. The datasets used both to generate our ground truth and to evaluate our method are available at the following links:

Training dataset
Testbed XS (datasets with file size 0-1MB)
Testbed S (datasets with file size 1-100MB)
Testbed M (datasets with file size 100MB-1GB)
Testbed L (datasets with file size >1GB)

Reproducibility

We believe in transparent and shareable research [1], [2]. Hence, we provide you with detailed instructions on how to reproduce the experiments presented in our work:

Demonstration

Live demo

Live demos of NextiaJD are available as Zeppelin notebooks. Bear in mind that, in order to access them, you must first log in with the following credentials (user: user1, password: nextiajd).

Interactive GUI

A live demo of the user interface is available here.

Library usage

We also provide a code-oriented demo showcasing how proficient data analysts can take full advantage of our tool here.

Videos

Last update: 2021/03/14 by Javier Flores