Disclaimer

This is an anonymized version of the original website, prepared to comply with SIGMOD 2024's double-blind review requirements. No details on participants or past publications are provided.

Links to GitHub pages have been processed using the Anonymous GitHub service.

Resources

Software repository

The source code of the system can be found in the following GitHub repository.

The easiest way to use NextiaJD is through Maven. For SBT, just add the following dependency to your build.sbt:

libraryDependencies += "edu.upc.essi.dtim.nextiajd" % "nextiajd_2.12" % "1.0.1"

For more ways to add NextiaJD using Maven, please go here
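
For reference, the SBT coordinates above translate to the Maven dependency below. This is a sketch derived only from the group, artifact, and version in that SBT line. It assumes the artifact is published under the same coordinates in a repository your build can resolve.

```xml
<!-- Derived from the SBT line above; repository availability is an assumption -->
<dependency>
  <groupId>edu.upc.essi.dtim.nextiajd</groupId>
  <artifactId>nextiajd_2.12</artifactId>
  <version>1.0.1</version>
</dependency>
```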

You can learn how to use NextiaJD from the anonymized GitHub repository, or follow the step-by-step explanation in the Zeppelin notebook (see the Demonstration section).

Ground truth datasets

The datasets used in this work were obtained from open data repositories with no copyright, such as Kaggle and OpenML. The datasets used both to generate our ground truth and to evaluate our method are available at the following links:

Training dataset
Testbed XS (datasets with file size 0-1MB)
Testbed S (datasets with file size 1-100MB)
Testbed M (datasets with file size 100MB-1GB)
Testbed L (datasets with file size >1GB)

Reproducibility

We believe in transparent and shareable research [1], [2]. Hence, we provide you with detailed instructions on how to reproduce the experiments presented in our work:

Demonstration

We provide NextiaJD in two modes of operation: a) as a standalone Pickle ML model that can be integrated into any Python application, and b) as an Apache Spark extension.

Standalone Pickle model

We provide the learning model that, given a vector of profile distances, predicts the join quality for a pair of attributes.

In the following GitHub repository, we provide an API that wraps NextiaJD's services so that they can be used from other programming languages (e.g., Python) by invoking commands from the terminal. These services are required to compute the profiles and their distances.

ML Model

The model can be downloaded from the following link (see the following link for more details on how to use it).
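
As a rough illustration of the integration described above, the following Python sketch loads a pickled model and asks it for a prediction on one vector of profile distances. It is not taken from the repository. It assumes the pickled object exposes a scikit-learn-style `predict` method that takes one row per attribute pair, and that the library used to train the model is installed so that unpickling succeeds. The helper names, the file name, and the distance values are hypothetical.

```python
import pickle
from typing import Sequence


def load_model(path: str):
    """Load a pickled model from disk. Only unpickle files from a trusted source."""
    with open(path, "rb") as f:
        return pickle.load(f)


def predict_join_quality(model, distances: Sequence[float]):
    """Return the model's prediction for a single attribute pair.

    Assumption: `model.predict` accepts a 2D array-like (one row per
    attribute pair) and returns one prediction per row.
    """
    row = [float(d) for d in distances]
    return model.predict([row])[0]


# Example usage (hypothetical file name and made-up distance values):
# model = load_model("model.pkl")
# quality = predict_join_quality(model, [0.12, 0.40, 0.03])
```

The distance vector must contain the same features, in the same order, as those the model was trained on. The API mentioned above is what computes the profiles and their distances.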

Apache Spark extension

Live demos of NextiaJD are available as Zeppelin notebooks. Bear in mind that, in order to access them, you must first log in with the following credentials (user: user1, password: nextiajd).

Interactive GUI

A live demo of the user interface is available here.

Library usage

We also provide a code-oriented demo, available here, showcasing how proficient data analysts can take full advantage of our tool.

Videos

Last update: hidden for double-blind reviews

Scalable Data Discovery Using Profiles