What is SemData?

SemData is a four-year project, started in October 2013, funded under the International Research Staff Exchange Scheme (IRSES) of the EU Marie Curie Actions. As such, it is mainly focused on enabling exchanges of members among the participating institutions, bringing together research leaders from across the globe in the relevant communities: Linked Data, the Semantic Web, and database systems.

SemData is all about semantic data management, a term by which we refer to a range of techniques for the manipulation and usage of data based on its meaning. For the purposes of SemData, semantic data is data expressed in RDF, the lingua franca of Linked Open Data and hence the default data model for annotating data on the Web.
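At its core, RDF represents every statement as a subject-predicate-object triple. As a minimal illustration of that model (not project code; the IRIs and data are hypothetical examples), a tiny in-memory triple store with wildcard pattern matching can be sketched in plain Python:

```python
# Minimal sketch of the RDF triple model: every statement is a
# (subject, predicate, object) triple. IRIs are hypothetical examples.
EX = "http://example.org/"

triples = {
    (EX + "Berlin", EX + "capitalOf", EX + "Germany"),
    (EX + "Berlin", EX + "population", "3500000"),
    (EX + "Germany", EX + "memberOf", EX + "EU"),
}

def match(s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [(ts, tp, to) for (ts, tp, to) in triples
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)]

# Which statements are made about Berlin?
for triple in match(s=EX + "Berlin"):
    print(triple)
```

Pattern matching of this kind is the basic building block on which SPARQL query evaluation over RDF graphs rests.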

Now that the foundational concepts and technologies have been laid, and a critical amount of semantic, linked data has been published, the next crucial step concerns maturity and quality. In fact, many aspects of Linked Data management face great challenges in mastering the varying maturity and quality of the data sources exposed online. One reason for this state of affairs is the ‘publish-first-refine-later’ philosophy promoted by the Linked Open Data movement, compounded by the open, decentralized nature of the environment in which we are operating. While both have led to amazing developments in terms of the amount of data made available online, ensuring the maturity and quality of the actual data and of the links connecting data sets is something the community is often left to resolve. In this context, maturity and quality are understood under the general term of ‘fitness-for-use’, which covers features such as data completeness, the presence of (obvious) inconsistencies, timeliness, and relevance to the application domain. These features may span different data sets, given the prominent role that linkage plays in Linked Data management. In the context of this project, secondments have allowed project partners to produce results in the areas of curation, preservation and provenance, which include:

- A survey on temporal ontologies, filling a clear gap in the current state of the art.

- A survey (in the form of a book) on entity resolution techniques for the Web of Data.

- Some initial work on the relationship between ontology design patterns and the publication of high-quality Linked Data (including licensing data), with examples coming from the domain of chess.

- A workshop proposal for the use of crowdsourcing and citizen science in knowledge engineering in different disciplines (in preparation).

In addition, semantic data management faces new challenges from dynamic data that becomes available from sensor and social networks. It also has to cope with the huge amounts of data that arise in modern large-scale applications, e.g. in the area of smart cities. These challenges require, again, new levels of maturity, robustness and scalability. In the context of this project, secondments have allowed project partners to produce results in:

- A theoretical characterisation of the relationship between two W3C recommendations (R2RML and the Direct Mapping) that can be used for the generation of semantic data (RDF) for the Semantic Web and the Web of Linked Data.

- The combination of two existing approaches for the processing of sensor-based data (morph-streams and SensorCloud) coming from two of the project partners/beneficiaries.
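The idea behind the W3C Direct Mapping mentioned above can be sketched in a few lines: each row of a relational table becomes an RDF subject IRI, and each column value yields one triple. The sketch below is a simplified illustration only (the table, base IRI and data are hypothetical; the actual recommendation also covers foreign keys, NULLs and datatype handling, and R2RML additionally lets users customise the mapping):

```python
# Simplified sketch of the W3C Direct Mapping: each row of a table
# becomes an RDF subject IRI, and each (column, value) pair a triple.
# Base IRI and data are hypothetical; foreign keys, NULLs and
# datatype handling from the actual recommendation are omitted.
BASE = "http://example.org/db/"

def direct_mapping(table, pk, rows):
    """Map a list of row dicts from one table to a list of RDF triples."""
    triples = []
    for row in rows:
        subject = f"{BASE}{table}/{pk}={row[pk]}"   # row IRI from primary key
        for column, value in row.items():
            triples.append((subject, f"{BASE}{table}#{column}", value))
    return triples

people = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
]
for triple in direct_mapping("People", "id", people):
    print(triple)
```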

Therefore, the goal of this project is to address the main challenges regarding maturity, quality, robustness and scalability of semantic data management, to enable the sustained adoption of semantic (linked) data in a broad range of practical settings. Particular objectives are:

  1. To develop solutions for managing quality through the adaptation and use of techniques from preservation, database systems, provenance management and crowdsourcing.
  2. To address the particular semantic management requirements of dynamic (sensor and social) data, thus increasing the robustness and maturity of semantic data management.
  3. To develop efficient, sound solutions for managing semantic data, including the aspects of objectives 1 and 2 above, thus addressing the key quality criteria of efficiency and adequacy.

For each of these objectives we have created a specific workpackage, whose detailed description can be found here.

Finally, the large size of the data being published also challenges existing hardware and software infrastructures for efficient semantic data management. This includes the need to analyse how high-performance and cloud computing infrastructures support such management, in terms of massively parallel semantic reasoning, with methods capable of dealing with imperfect, imprecise, temporal and/or spatial semantic data. In the context of this project, secondments have allowed project partners to produce results in:

- A survey of large-scale reasoning with semantic data (currently under review by the prestigious Journal of Web Semantics), which gives, for the first time, a comprehensive and systematic overview and analysis of mass-parallelization techniques applied to a variety of reasoning methods.

- More efficient stream processing for reasoning using GPUs.

- New approaches for the efficient processing of the OWL 2 RL and EL profiles.
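The OWL 2 RL profile is designed so that reasoning can be implemented as forward-chaining rules over triples, which is also what makes it amenable to mass parallelization. As a toy illustration of that rule-based style (two rules only, with hypothetical data; real RL reasoners implement the full rule set from the W3C profiles specification):

```python
# Toy illustration of rule-based reasoning in the OWL 2 RL style:
# forward chaining two rules (type propagation and subClassOf
# transitivity) to a fixpoint. Data is hypothetical.
SUBCLASS, TYPE = "rdfs:subClassOf", "rdf:type"

triples = {
    ("ex:Dog", SUBCLASS, "ex:Mammal"),
    ("ex:Mammal", SUBCLASS, "ex:Animal"),
    ("ex:rex", TYPE, "ex:Dog"),
}

def forward_chain(kb):
    """Apply the rules until no new triple is derived (fixpoint)."""
    kb = set(kb)
    while True:
        derived = set()
        for (s, p, o) in kb:
            for (s2, p2, o2) in kb:
                # cax-sco: (x type C1), (C1 subClassOf C2) => (x type C2)
                if p == TYPE and p2 == SUBCLASS and o == s2:
                    derived.add((s, TYPE, o2))
                # scm-sco: subClassOf is transitive
                if p == SUBCLASS and p2 == SUBCLASS and o == s2:
                    derived.add((s, SUBCLASS, o2))
        if derived <= kb:
            return kb
        kb |= derived

closure = forward_chain(triples)
print(("ex:rex", TYPE, "ex:Animal") in closure)  # rex is inferred to be an Animal
```

The naive nested loop above is quadratic per iteration; the large-scale reasoning work surveyed by the project is precisely about evaluating such rule sets efficiently over billions of triples on parallel hardware.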