SemData http://semdata-project.eu en Improving Quality and Scalability in Semantic Data Management http://semdata-project.eu/?q=documentation/improving-quality-and-scalability-semantic-data-management <div class="field field-name-field-author-s- field-type-text field-label-above"><div class="field-label">Author(s):&nbsp;</div><div class="field-items"><div class="field-item even">Vassilis Christophides</div><div class="field-item odd">Oscar Corcho</div><div class="field-item even">Vasilis Efthymiou</div><div class="field-item odd">Vadim Ermolayev</div><div class="field-item even">Andreas Harth</div><div class="field-item odd">Luis-Daniel Ibáñez</div><div class="field-item even">Natalya Keberle</div><div class="field-item odd">Kostas Stefanidis</div><div class="field-item even">Ilias Tachmazidis</div></div></div><div class="field field-name-field-datepublished field-type-datetime field-label-above"><div class="field-label">Date published:&nbsp;</div><div class="field-items"><div class="field-item even"><span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2016-05-01T00:00:00+02:00">05/2016</span></div></div></div><div class="field field-name-body field-type-text-with-summary field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>The full collection of the tutorial materials can be accessed at <a href="http://km.aifb.kit.edu/sites/qs-sdm/" target="_blank">http://km.aifb.kit.edu/sites/qs-sdm</a></p> <h2><strong>Tutorial Objectives.</strong></h2> <p>This tutorial offers a walk through several techniques in the field of Semantic Data Management.
These techniques span coping with semantic data provenance and ontology learning; dealing with dynamics and evolution; improving the quality of semantic and linked data in decentralized settings, at scale, while allowing trust and autonomy and guaranteeing consistency; improving quality by applying multi-type and large-scale entity resolution, also at scale; and, finally, enabling the feasible assertion of new knowledge through large-scale parallel reasoning. Datasets developed in the context of the EU IRSES project SemData will be used as running examples throughout the tutorial.</p> <p>The goal of this tutorial is to present and discuss several complementary techniques, developed in the frame of the <a href="http://semdata-project.eu"><strong>SemData project</strong></a>, which help solve the challenges of improving the quality and scalability of Semantic Data Management, taking into account the distributed nature, dynamics, and evolution of the data sets. In particular, the objectives are:</p> <ol><li><strong>Refined temporal representation</strong>: to present the developments regarding the temporal features required to cope with dataset dynamics and evolution.</li> <li><strong>Ontology learning and knowledge extraction</strong>: to demonstrate a methodology for robustly extracting a consensual set of community requirements from a relevant corpus of professional documents, refining the ontology accordingly, and evaluating the quality of this refinement.</li> <li><strong>Distribution, autonomy, consistency and trust</strong>: to present an approach to implementing Read/Write Linked Open Data that copes with participant autonomy and trust at scale.</li> <li><strong>Entity resolution</strong>: to discuss how traditional entity resolution workflows have to be revisited in order to cope with the new challenges stemming from the openness and heterogeneity of the Web, as well as from data variety, complexity, and scale.
</li> <li><strong>Large-scale reasoning</strong>: to overview the variety of existing platforms and techniques that allow parallel and scalable data processing, enabling large-scale reasoning based on rule and data partitioning over various logics.</li> </ol><h2><strong>Program.</strong></h2> <p>The tutorial comprises a short introduction followed by four topical parts, each focusing on a different but complementary technique for Semantic Data Management. The parts are scheduled as follows:</p> <ul><li>Introduction (10 min)</li> <li><a href="http://semdata-project.eu/?q=content/refining-temporal-representations-using-ontoelect%E2%80%8B">Refining Temporal Representations using OntoElect</a> (45 min)</li> <li><a href="http://semdata-project.eu/?q=content/distribution-autonomy-consistency-and-trust-%E2%80%8Breadwrite-lod">Distribution, Autonomy, Consistency, and Trust in a Read/Write LOD</a> (45 min)</li> <li>Coffee Break (30 min)</li> <li><a href="http://semdata-project.eu/?q=content/multi%C2%ADtype-and-large%C2%ADscale-%E2%80%8Bentity-resolution%E2%80%8B-web-data">Multi-Type and Large-Scale Entity Resolution</a> (45 min)</li> <li><a href="http://semdata-project.eu/?q=content/large%C2%ADscale-reasoning%E2%80%8B-over-rdfs%C2%AD-and-owl%C2%ADbased-semantic-data">Large-Scale Reasoning over RDF(S)- and OWL-based Semantic Data</a> (45 min)</li> <li>Roundtable Discussion (20 min)</li> </ul><p>The preparation of this tutorial has been funded by the EU Marie Curie IRSES project SemData.</p> </div></div></div> Thu, 10 Dec 2015 10:34:41 +0000 ocorcho 42 at http://semdata-project.eu http://semdata-project.eu/?q=documentation/improving-quality-and-scalability-semantic-data-management#comments What is SemData http://semdata-project.eu/?q=what-is-semdata <div class="field field-name-body field-type-text-with-summary field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>SemData is a four-year project, started in
October 2013, funded under the International Research Staff Exchange Scheme (IRSES) of the EU Marie Curie Actions. As such, it is mainly focused on enabling exchanges of members among the participating institutions, bringing together research leaders from across the globe from the relevant communities: linked data, the semantic web, and database systems.</p> <p>SemData is all about semantic data management, a term by which we refer to a range of techniques for the manipulation and usage of data based on its meaning. For the purposes of SemData, semantic data will be data expressed in RDF, the lingua franca of Linked Open Data and hence the default data model for annotating data on the Web.</p> <p>Now that the foundational concepts and technologies are in place, and a critical amount of semantic, linked data has been published, the next crucial step has to do with <strong>maturity and quality.</strong> In fact, many aspects of Linked Data management face great challenges in mastering the varying maturity and quality of the data sources exposed online. One of the reasons for this state of affairs is the ‘publish-first-refine-later’ philosophy promoted by the Linked Open Data movement, complemented by the open, decentralized nature of the environment in which we are operating. While both have led to amazing developments in terms of the amount of data made available online, the maturity and quality of the actual data and of the links connecting data sets is something that the community is often left to resolve. In this context, maturity and quality are understood under the general term of ‘fitness-for-use’, which covers features related to data completeness, the presence of (obvious) inconsistencies, timeliness, and relevance to the application domain. These features may span different data sets, given the prominent role that linkage plays in Linked Data management.
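To make these ‘fitness-for-use’ features concrete, here is a minimal, purely illustrative sketch in plain Python — the dataset, class names, and check functions are all hypothetical, not project code. RDF triples are modelled as (subject, predicate, object) tuples, and two toy checks correspond to the features named above: completeness and the presence of an obvious inconsistency.

```python
# Illustrative sketch: RDF triples as plain tuples, with two toy
# "fitness-for-use" checks. All data and class names are hypothetical.

TYPE = "rdf:type"
# Assumed disjoint classes for the inconsistency check.
DISJOINT = {("ex:Person", "ex:Organisation")}

triples = [
    ("ex:alice", TYPE, "ex:Person"),
    ("ex:alice", TYPE, "ex:Organisation"),   # obviously inconsistent
    ("ex:bob",   TYPE, "ex:Person"),
    ("ex:bob",   "foaf:name", "Bob"),
]

def missing_names(triples):
    """Completeness: every ex:Person should have a foaf:name."""
    persons = {s for s, p, o in triples if p == TYPE and o == "ex:Person"}
    named = {s for s, p, o in triples if p == "foaf:name"}
    return persons - named

def disjointness_violations(triples):
    """Obvious inconsistency: membership in two disjoint classes."""
    classes = {}
    for s, p, o in triples:
        if p == TYPE:
            classes.setdefault(s, set()).add(o)
    return {s for s, cs in classes.items()
            for a, b in DISJOINT if a in cs and b in cs}

print(missing_names(triples))            # {'ex:alice'}
print(disjointness_violations(triples))  # {'ex:alice'}
```

In a real Linked Data setting such checks would of course run over cross-dataset links as well, which is exactly where the quality problems described above arise.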
In the context of this project, secondments have allowed project partners to produce results in the areas of curation, preservation and provenance, which include:</p> <p>- A survey on temporal ontologies, which was clearly needed given the current state of the art.</p> <p>- A survey (in the form of a book) on entity resolution techniques for the Web of Data.</p> <p>- Some initial work on the relationship between ontology design patterns and the publication of high-quality Linked Data (including licensing data), with examples from the domain of chess.</p> <p>- A workshop proposal on the use of crowdsourcing and citizen science for knowledge engineering in different disciplines (in preparation).</p> <p>In addition, semantic data management faces new challenges from the dynamic data that becomes available from sensor and social networks. It also has to be able to cope with the huge amounts of data that become available in modern large-scale applications, e.g. in the area of smart cities. These challenges require, again, new levels of <strong>maturity, robustness and scalability</strong>. In the context of this project, secondments have allowed project partners to produce results in:</p> <p>- A theoretical characterisation of the relationship between two W3C recommendations (R2RML and the Direct Mapping) that can be used for the generation of semantic data (RDF) for the Semantic Web and the Web of Linked Data.</p> <p>- The combination of two existing approaches for the processing of sensor-based data (morph-streams and SensorCloud) coming from two of the project partners/beneficiaries.</p> <p>Therefore, the goal of this project is to address the main challenges regarding the maturity, quality, robustness and scalability of semantic data management, so as to enable the sustained adoption of semantic (linked) data in a broad range of practical settings.
Particular objectives are:</p> <ol><li>To develop solutions for managing quality through the adaptation and use of techniques from preservation, database systems, provenance management and crowdsourcing.</li> <li>To address the particular semantic management requirements of dynamic (sensor and social) data, thus increasing the robustness and maturity of semantic data management.</li> <li>To develop efficient, sound solutions for managing semantic data, including the aspects of objectives 1 and 2 above, thus addressing the key quality criteria of efficiency and adequacy.</li> </ol><p>For each of these objectives we have created a specific work package, whose detailed description can be found here.</p> <p>Finally, the large size of the data being published also challenges existing hardware and software infrastructures for efficient semantic data management. This includes the need to analyse how high-performance and cloud computing infrastructures support such management, in terms of massively parallel semantic reasoning, with methods capable of dealing with imperfect, imprecise, temporal and/or spatial semantic data. In the context of this project, secondments have allowed project partners to produce results in:</p> <p>- A survey of large-scale reasoning with semantic data (currently under review by the prestigious Journal of Web Semantics), which gives, for the first time, a comprehensive and systematic overview and analysis of mass-parallelization techniques applied to a variety of reasoning methods.</p> <p>- More efficient processing of streams for reasoning using GPUs.</p> <p>- New approaches for the efficient processing of the OWL 2 RL and EL profiles.</p> </div></div></div> Wed, 13 Nov 2013 13:53:05 +0000 admin 1 at http://semdata-project.eu http://semdata-project.eu/?q=what-is-semdata#comments