An Automatic Data Discovery Approach to Enhance Barcelona's Data Ecosystem (DiscoveryGNN)
February, 2021
Cristina Gómez, Sergi Nadal, Raquel Panadero, Oscar Romero
Description
The importance of data is not new, yet much of the predicted economic value that can be extracted from it remains unrealized. Barcelona is today a major European hub for Data Science and Artificial Intelligence (AI). Within its constantly evolving data ecosystem, SMEs, large organizations, startups and research groups are building data-driven solutions and promoting a data-driven culture. Nonetheless, we believe there is still a big gap in terms of accessing this data deluge.
There is currently no unified way to access this plethora of data. To become a data-driven society, Barcelona needs to democratize access to all this data via a centralized entry point, with two main objectives: (i) to manage the city (i.e., its public services) more efficiently, and (ii) to enable third parties to access and cross-reference a wealth of data that will eventually benefit providers, consumers and citizens who use services based on such data. This need is closely aligned with the urban challenges that Barcelona is currently facing.
We propose a novel research line on Data Discovery that will democratize access to data. The proposal is twofold: (i) a flexible, shared and accessible data hub, under the town council’s control, where private and public actors publish their datasets, for which we propose to rely on the Open Data BCN dataset catalog; and (ii) an innovative semi-automated Data Discovery approach to effectively cross-reference disparate, heterogeneous and intersectoral data sources without manually processing the data. We will automatically scrutinize each dataset, i.e., its data, its definition (or schema) and its hidden relationships, in order to profile it. We plan to use Graph Neural Networks (GNNs), an advanced AI technique that generalizes the deep neural network model to graph-structured data, exploiting further aspects such as topology and connectivity. GNNs are currently a hot research topic, but they have not yet been explored in the context of Data Discovery.
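To make the idea of profiling datasets and exploiting their connectivity more concrete, the following is a minimal sketch and not the project's actual method. It assumes a hypothetical profile of three statistics per column (distinct ratio, numeric fraction, null fraction) and shows a single round of mean-neighbor aggregation, which is only the message-passing core of a GNN; a real GNN would add learned weights and non-linearities. The column names, values and the `profile_column` and `message_passing_step` helpers are all invented for illustration.

```python
from statistics import mean


def profile_column(values):
    """Summarize one column as a small numeric feature vector (hypothetical profile)."""
    non_null = [v for v in values if v is not None]
    numeric = [v for v in non_null if isinstance(v, (int, float))]
    return [
        len(set(non_null)) / len(non_null) if non_null else 0.0,  # distinct ratio
        len(numeric) / len(non_null) if non_null else 0.0,        # numeric fraction
        1.0 - len(non_null) / len(values) if values else 0.0,     # null fraction
    ]


def message_passing_step(features, edges):
    """One round of mean aggregation: each node averages its own
    feature vector with those of its direct neighbors."""
    neighbors = {node: [node] for node in features}  # include a self-loop
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    return {
        node: [mean(features[n][i] for n in nbrs) for i in range(len(features[node]))]
        for node, nbrs in neighbors.items()
    }


# Invented toy columns from two hypothetical datasets.
columns = {
    "bike_stations.district": ["Gràcia", "Eixample", "Gràcia", None],
    "census.district": ["Gràcia", "Eixample", "Sants", "Gràcia"],
    "census.population": [10, 20, 30, 10],
}

# Nodes are columns; here an edge links columns of the same dataset (an assumption).
features = {name: profile_column(vals) for name, vals in columns.items()}
edges = [("census.district", "census.population")]

updated = message_passing_step(features, edges)
for name, vector in updated.items():
    print(name, [round(x, 3) for x in vector])
```

After one step, each connected column's vector blends in information from its neighbor, while the isolated column keeps its original profile. In a discovery setting, such updated vectors could then be compared across datasets to suggest candidate relationships, although how the graph is built and which features are used are open design choices not specified here.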