ODIN: A Dataspace Management System

ODIN (On-demand Data Integration) is a system that supports the incremental, pay-as-you-go integration of data sources into dataspaces and provides user-friendly mechanisms for querying the resulting dataspaces. This website is a companion to a demonstration paper submitted to ISWC 2019, in which we describe some of ODIN's characteristics and underlying assumptions, including the user interactions required. ODIN's novelty lies in a largely automated bottom-up approach (i.e., driven by the sources at hand) that includes the user in the loop for disambiguation purposes. ODIN relies on the concept of traceability graphs, which are generic metadata abstractions (i.e., not tailored for a specific task) about the integration of a particular set of data sources. From these graphs, ODIN is capable of generating target-oriented metadata constructs. In this demonstration we focus on those for query answering over dataspaces.

People


Software

Sources

All sources are available on the following GitHub page

Demonstration

WISCENTD - WHO Information System to Control and Eliminate Neglected Tropical Diseases

The WHO Information System to Control and Eliminate Neglected Tropical Diseases (WISCENTD) is an ambitious project following the Sixty-sixth World Health Assembly resolution (from May 2013), which urged Member States to further strengthen the disease surveillance system, especially on neglected tropical diseases targeted for eradication, and requested WHO to monitor progress in achieving the targets for neglected tropical diseases set in WHO's roadmap (...) and to provide support to Member States in their efforts to collect, validate and analyse data from national surveillance systems. For the first time, WHO highlighted the relevance of data. However, Neglected Tropical Diseases (NTDs) are still very often neglected by national health information systems, or covered by surveillance systems that are too weak to ensure good quality data collection, flow, validation, use and dissemination. Official data reported by health ministries are, therefore, often incomplete. WHO must integrate the available official data with data from other providers of additional information (e.g., non-governmental organizations, researchers) and from other sources (e.g., pharmacovigilance systems, vector distribution in the territories), but these data can be largely fragmented and heterogeneous. WISCENTD was born to integrate data about NTDs more efficiently.

The following demo simulates the day-to-day work of a WHO data analyst and shows how ODIN is used first to collect and integrate different sources relevant to a certain NTD, and later to query across them. Specifically, we use the following datasets: UN Data (open-data JSON datasets) about health economics indicators and migrant information per country, data about diagnosis and treatment per country periodically extracted from WIDP, and data about drug distribution periodically extracted from WIMEDS.

Source bootstrapping and alignment of the provenance graph

To overcome the heterogeneity of the input sources, we extract the schema from each source and represent it using a more expressive data model (i.e., RDFS). The following video shows how source graphs are produced from data sources in various formats (JSON, XML, CSV and relational). The resulting RDFS models comply with the RDFS meta-model, as the translation is accompanied by production rules at the meta-model level. Once the source graphs are bootstrapped, potential confidence-based alignments between concepts or between properties are identified. In the demo, the system thus identifies the correct correspondences between the geographical and temporal variables of the chosen WHO datasets. User intervention is required to accept or reject the proposed alignments. The resulting provenance graph is created incrementally from the accepted alignments, in the form of taxonomies.
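To make the bootstrapping step more concrete, the following is a minimal sketch of how the schema of a single JSON object could be translated into RDFS-style triples. It is not ODIN's implementation: the function name bootstrap_json, the ex: namespace, the naming scheme for properties and nested classes, and the sample record are all assumptions made for illustration. Arrays and the other source formats are not handled.

```python
# Illustrative sketch only: names and the IRI scheme below are hypothetical,
# not taken from ODIN.
RDF_TYPE = "rdf:type"
RDFS_CLASS = "rdfs:Class"
RDF_PROPERTY = "rdf:Property"
RDFS_DOMAIN = "rdfs:domain"
RDFS_RANGE = "rdfs:range"

# Mapping from Python types (as produced by a JSON parser) to XSD datatypes.
XSD_BY_PYTHON_TYPE = {
    bool: "xsd:boolean",
    int: "xsd:integer",
    float: "xsd:double",
    str: "xsd:string",
}


def bootstrap_json(name, record, ns="ex:"):
    """Return a set of (subject, predicate, object) triples describing
    the schema of one JSON object."""
    cls = f"{ns}{name}"
    triples = {(cls, RDF_TYPE, RDFS_CLASS)}
    for key, value in record.items():
        prop = f"{cls}#{key}"
        triples.add((prop, RDF_TYPE, RDF_PROPERTY))
        triples.add((prop, RDFS_DOMAIN, cls))
        if isinstance(value, dict):
            # A nested object becomes its own class, used as the range.
            nested = f"{name}_{key}"
            triples.add((prop, RDFS_RANGE, f"{ns}{nested}"))
            triples |= bootstrap_json(nested, value, ns)
        else:
            # Unknown value types (e.g. lists, null) fall back to rdfs:Literal.
            datatype = XSD_BY_PYTHON_TYPE.get(type(value), "rdfs:Literal")
            triples.add((prop, RDFS_RANGE, datatype))
    return triples


# Toy record with made-up values, not real data.
record = {"country": "ES", "population": 100, "location": {"lat": 40.4}}
source_graph = bootstrap_json("Country", record)
```

Each key becomes a property whose domain is the class of the enclosing object, which mirrors the idea of lifting a source schema into a more expressive model before any alignment takes place.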


Video link

Query answering

From the provenance graph, the necessary constructs for query answering are generated. The following video shows the query answering phase. First, we show the generated constructs: global graph, data sources, wrappers, and LAV mappings. The global graph is generated from the provenance graph, although ODIN's interface allows further manual refinement (for example, to update the graph labels). Its main constructs are concepts and features (similar to classes and their attributes). Data sources represent the connection point to the datasets, while wrappers encode the query that extracts their data and expose a first normal form relation of their schemata. Finally, LAV mappings encode (a) the connections between the wrappers' attributes and the global graph features, and (b) the fragment of the global graph that each wrapper covers. The ontology-mediated query interface depicts the following queries: (1) population and immigration per country; (2) population, immigration and drug distribution per country; (3) diagnosis and drug distribution per year.
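The relationship between wrappers, LAV mappings and queries over the global graph can be illustrated with a small sketch. It is a deliberately naive stand-in, not ODIN's query rewriting algorithm: the Wrapper class, the answer function, the feature names and the rows are hypothetical, and only part (a) of a LAV mapping (attribute to feature) is modelled. Wrappers that provide a requested feature are simply joined on a shared identifier feature.

```python
from dataclasses import dataclass


@dataclass
class Wrapper:
    """Hypothetical wrapper: a first normal form relation plus the
    attribute-to-feature part of its LAV mapping."""
    name: str
    rows: list     # list of flat dicts, one per tuple
    same_as: dict  # wrapper attribute -> global graph feature

    def features(self):
        return set(self.same_as.values())

    def project(self):
        """Rename wrapper attributes to global graph features."""
        return [
            {self.same_as[a]: v for a, v in row.items() if a in self.same_as}
            for row in self.rows
        ]


def answer(requested, join_feature, wrappers):
    """Join, on join_feature, every wrapper that provides at least one
    requested feature, then keep only the requested features."""
    result = None
    for w in wrappers:
        if not (w.features() & set(requested)) or join_feature not in w.features():
            continue
        rows = {r[join_feature]: r for r in w.project()}
        if result is None:
            result = rows
        else:
            # Inner join on the identifier values present on both sides.
            result = {k: {**result[k], **rows[k]} for k in result.keys() & rows.keys()}
    keep = [join_feature, *requested]
    return sorted(
        ({f: r[f] for f in keep if f in r} for r in (result or {}).values()),
        key=lambda r: r[join_feature],
    )


# Toy wrappers with made-up values, not real statistics.
w_population = Wrapper(
    "w_population",
    rows=[{"iso": "ES", "pop": 100}, {"iso": "FR", "pop": 200}],
    same_as={"iso": "Country.code", "pop": "Country.population"},
)
w_migration = Wrapper(
    "w_migration",
    rows=[{"c": "ES", "immigrants": 10}],
    same_as={"c": "Country.code", "immigrants": "Country.immigration"},
)

result = answer(
    ["Country.population", "Country.immigration"],
    "Country.code",
    [w_population, w_migration],
)
```

The analyst's query is phrased only in terms of global graph features, in the spirit of query (1) above; the mappings decide which wrappers contribute and how their attributes are renamed.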


Video link


Publications


Acknowledgements

This work is partly supported by the GENESIS project, funded by the Spanish Ministerio de Ciencia e Innovación under project TIN2016-79269-R. It is also supported by the Erasmus Mundus Joint Doctorate in Information Technologies for Business Intelligence – Doctoral College (IT4BI-DC), and the Erasmus Mundus Master in Big Data Management and Analytics (BDMA).


Last update: 2019/06/27 by Sergi Nadal