ProxMine: Data Lake Metadata Management System for Dataset Discovery
Alberto Abelló, Ayman Elserafi, Oscar Romero
Description
ProxMine is a tool for governing a data lake through metadata management for dataset discovery. It collects descriptive statistics about tabular datasets and their attributes. From these statistics it computes an overall similarity score for each pair of datasets, using proximity models based on machine learning techniques. It then uses the computed similarities and a k-Nearest-Neighbour algorithm to assign datasets to categories that already exist in the data lake. Finally, it builds proximity graph visualisations that summarise the overall structure of the data lake, with datasets and their categories shown as nodes and relationships (similarities) modelled as edges.
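The flow from statistics to similarity to category can be illustrated with a minimal Python sketch. It is not ProxMine's implementation: the three statistics in `profile`, the hand-written `similarity` function (standing in for the learned proximity models) and all function names are assumptions chosen for illustration.

```python
from collections import Counter
from statistics import mean


def profile(rows):
    """Collect a few descriptive statistics for a dataset given as a list of dicts.

    The choice of statistics is illustrative only.
    """
    columns = list(rows[0].keys())
    numeric = [c for c in columns
               if all(isinstance(r[c], (int, float)) for r in rows)]
    # Average share of distinct values per attribute, relative to the row count.
    distinct_ratio = mean(len({r[c] for r in rows}) / len(rows) for c in columns)
    return {
        "n_attributes": len(columns),
        "numeric_fraction": len(numeric) / len(columns),
        "distinct_ratio": distinct_ratio,
    }


def similarity(p, q):
    """Assumed stand-in for a learned proximity model.

    Returns 1 minus the mean relative difference of the statistics, in [0, 1].
    """
    diffs = []
    for key in p:
        largest = max(abs(p[key]), abs(q[key]))
        diffs.append(0.0 if largest == 0 else abs(p[key] - q[key]) / largest)
    return 1.0 - mean(diffs)


def categorize(query_profile, labelled_profiles, k=3):
    """k-Nearest-Neighbour vote over (profile, category) pairs."""
    ranked = sorted(labelled_profiles,
                    key=lambda pc: similarity(query_profile, pc[0]),
                    reverse=True)
    votes = Counter(category for _, category in ranked[:k])
    return votes.most_common(1)[0][0]
```

The pairwise `similarity` scores could also serve as edge weights for a proximity graph, with one node per dataset and per category.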
The tool supports data lake users in three ways. It extracts useful metadata in the form of descriptive statistics about the content of the stored datasets. It helps data wranglers and curators find relevant datasets by retrieving the datasets in the data lake that are similar to an input query dataset. It also provides schema matching support by showing, for a pair of datasets, which attributes overlap in their names and in their data, using name-based and content-based analysis techniques such as those described in the related publications below.
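The name-based and content-based comparison of attributes can be sketched as follows. This is an illustrative approximation under stated assumptions, not the technique from the publications: it uses Python's `difflib.SequenceMatcher` for name similarity and Jaccard overlap of distinct values for content similarity, and the function names and the `threshold` parameter are hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Name-based score: case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def content_overlap(values_a, values_b):
    """Content-based score: Jaccard overlap of the distinct values."""
    set_a, set_b = set(values_a), set(values_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def matching_attributes(ds_a, ds_b, threshold=0.5):
    """Return candidate attribute pairs between two datasets.

    Each dataset is a dict mapping attribute name to a list of values.
    A pair is kept when its name score or content score reaches the
    (assumed) threshold.
    """
    pairs = []
    for name_a, vals_a in ds_a.items():
        for name_b, vals_b in ds_b.items():
            n = name_similarity(name_a, name_b)
            c = content_overlap(vals_a, vals_b)
            if max(n, c) >= threshold:
                pairs.append((name_a, name_b, round(n, 2), round(c, 2)))
    return pairs
```

For example, an attribute `country` in one dataset and `Country` in another would be paired on name alone, and the content score would then indicate how many of their values are shared.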
The tool is licensed under the open-source Creative Commons BY-NC-SA 3.0 terms, and its copyright is protected by this license.
To request access to a demonstration of the tool using an OpenML sample of datasets, e-mail us at: alserafi [at] essi [dot] upc [dot] edu.
Related publications
2020 Ayman Alserafi, Alberto Abelló, Oscar Romero, Toon Calders: Keeping the Data Lake in Form: Proximity Mining for Pre-Filtering Schema Matching. ACM Trans. Inf. Syst. 2020
2019 Ayman Alserafi, Alberto Abelló, Oscar Romero, Toon Calders: Keeping the Data Lake in Form: DS-kNN Datasets Categorization Using Proximity Mining. MEDI 2019
2017 Ayman Alserafi, Toon Calders, Alberto Abelló, Oscar Romero: DS-Prox: Dataset Proximity Mining for Governing the Data Lake. SISAP 2017