Multi-Type and Large-Scale Entity Resolution on the Web of Data.
Over the past decade, numerous knowledge bases (KBs) have been built to power a new generation of Web applications that provide entity-centric search and recommendation services. These KBs offer comprehensive, machine-readable descriptions of a large variety of real-world entities (e.g., persons, places, products, events), published on the Web as Linked Data (LD). Even when derived from the same data source (e.g., a Wikipedia entry), KBs such as DBpedia, YAGO2, or Freebase may provide multiple, non-identical descriptions of the same real-world entity. This is due to the different information extraction tools and curation policies employed by KBs, which result in complementary and sometimes conflicting entity descriptions. Entity resolution (ER) aims to identify different descriptions that refer to the same real-world entity, and it emerges as a central data-processing task for an entity-centric organization of Web data. ER is needed to enrich the interlinking of data elements describing entities, even by third parties, so that the Web of data can be accessed by machines as a global data space using standard languages such as SPARQL. ER can also facilitate automated KB construction by integrating entity descriptions from legacy KBs with Web content published as HTML documents.
The objective of this part of the tutorial is to present the new ER challenges stemming from the openness of the Web: an unbounded number of KBs describe a multitude of entity types across domains, and descriptions are highly heterogeneous (semantically and structurally), even for the same types of entities. The scale, diversity, and graph structure of entity descriptions published according to the Linked Data paradigm challenge the core ER tasks, namely: (i) how descriptions can be effectively compared for similarity; and (ii) how resolution algorithms can efficiently filter the candidate pairs of descriptions that need to be compared.
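To make the two core tasks concrete, the following is a minimal sketch, not a method prescribed by the tutorial. It assumes that each description is a flat dictionary of attribute-value pairs, uses token blocking (one block per token) as one possible way to filter candidate pairs, and uses Jaccard similarity over value tokens as one possible schema-agnostic comparison. The entity identifiers, attribute names, and function names are all hypothetical.

```python
import re
from collections import defaultdict
from itertools import combinations

# Hypothetical descriptions from three KBs; attribute names deliberately differ.
descriptions = {
    "dbpedia:Athens": {"label": "Athens", "country": "Greece"},
    "yago:Athens": {"name": "Athens", "locatedIn": "Greece", "type": "city"},
    "fb:Paris": {"title": "Paris", "containedBy": "France"},
}

def tokens(description):
    """Collect lower-cased alphanumeric tokens from all values, ignoring attribute names."""
    toks = set()
    for value in description.values():
        toks.update(re.findall(r"[a-z0-9]+", str(value).lower()))
    return toks

def token_blocking(descs):
    """Task (ii): keep only pairs of descriptions that share at least one token."""
    blocks = defaultdict(set)
    for entity_id, desc in descs.items():
        for tok in tokens(desc):
            blocks[tok].add(entity_id)
    candidates = set()
    for ids in blocks.values():
        # Sorting makes each unordered pair appear in one canonical order.
        candidates.update(combinations(sorted(ids), 2))
    return candidates

def jaccard(desc_a, desc_b):
    """Task (i): compare two descriptions by the overlap of their value tokens."""
    ta, tb = tokens(desc_a), tokens(desc_b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

candidate_pairs = token_blocking(descriptions)
scores = {(a, b): jaccard(descriptions[a], descriptions[b]) for a, b in candidate_pairs}
```

In this toy input, only the two Athens descriptions share tokens, so only that pair is compared; the Paris description is never paired with either of them. On real Web data, very frequent tokens produce oversized blocks, so a sketch like this would need additional pruning of blocks or comparisons.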
In multi-type and large-scale entity resolution, we need to examine whether two entity descriptions are somewhat (or nearly) similar without resorting to domain-specific similarity functions and/or mapping rules. Furthermore, the resolution of some entity descriptions might influence the resolution of other, neighboring descriptions. This setting clearly goes beyond deduplication (or record linkage) of collections of descriptions that usually refer to a single entity type and differ only slightly in their attribute values. It essentially requires leveraging the similarity of descriptions in both their content and their structure. It also forces us to revisit traditional ER workflows consisting of separate indexing (for pruning the number of candidate pairs) and matching (for resolving entity descriptions) phases.
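The influence of resolved neighbors can be illustrated with a small sketch. It is an assumption-laden toy, not the tutorial's algorithm: each description is assumed to hold a list of literal `values` and a list of `neighbors` (identifiers of linked descriptions), the score is an equally weighted mix of content and neighbor Jaccard similarity, and the threshold of 0.5 is arbitrary. All identifiers and function names are hypothetical.

```python
import re

# Hypothetical descriptions from two KBs, "a" and "b"; films link to their director.
entities = {
    "a:kubrick": {"values": ["Stanley Kubrick"], "neighbors": []},
    "b:kubrick": {"values": ["Kubrick, Stanley"], "neighbors": []},
    "a:shining": {"values": ["The Shining"], "neighbors": ["a:kubrick"]},
    "b:shining": {"values": ["Shining 1980"], "neighbors": ["b:kubrick"]},
}
pairs = [("a:shining", "b:shining"), ("a:kubrick", "b:kubrick")]

def tokens(values):
    """Lower-cased alphanumeric tokens of all literal values."""
    return {t for v in values for t in re.findall(r"[a-z0-9]+", v.lower())}

def jaccard(x, y):
    return len(x & y) / len(x | y) if x | y else 0.0

def score(a, b, entities, canonical, alpha=0.5):
    """Mix content similarity with similarity of neighbors after applying known matches."""
    content = jaccard(tokens(entities[a]["values"]), tokens(entities[b]["values"]))
    na = {canonical.get(n, n) for n in entities[a]["neighbors"]}
    nb = {canonical.get(n, n) for n in entities[b]["neighbors"]}
    return alpha * content + (1 - alpha) * jaccard(na, nb)

def resolve(pairs, entities, threshold=0.5):
    """Re-score candidate pairs until no new match appears."""
    canonical, matches = {}, set()
    changed = True
    while changed:
        changed = False
        for a, b in pairs:
            if (a, b) not in matches and score(a, b, entities, canonical) >= threshold:
                matches.add((a, b))
                canonical[b] = canonical.get(a, a)  # map b to a's representative
                changed = True
    return matches

matches = resolve(pairs, entities)
```

Here the two film descriptions are too dissimilar in content to match on the first pass. Once the two director descriptions are matched, the films share a resolved neighbor and pass the threshold on the second pass. The `canonical` dictionary is a simplification; longer chains of matches would call for a proper union-find structure.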