ResourcR: How to support data linking and entity reconciliation algorithms

Data linking and entity reconciliation are key tasks for enriching an input dataset with data from another source while resolving semantic mismatches. In previous posts, we introduced the tools that implement the LinkR components of the enRichMyData toolkit, which support these tasks, and provided more details about how we approach them in our work, using knowledge graphs as valuable abstractions of the external source to link to and reconcile against. This post presents our ResourcR component, which aims to back data linking and entity reconciliation algorithms effectively and efficiently.

Data linking and entity reconciliation algorithms often must make difficult decisions, for example, weighing which one among several similar entities best matches an input record. Consider the example in the picture: the value “jurassic world” in the table below must be reconciled against millions of entities in Wikidata, and three of these entities have exactly the same name, which also perfectly matches the value in the table row. To decide which entity is the best match, an algorithm should consider different and heterogeneous clues, which requires inspecting entity features stored in the knowledge graph, such as attributes, categories, relations to other entities, and so on.
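To make the idea of combining heterogeneous clues more concrete, here is a minimal sketch in Python. It is not how LinkR or LamAPI rank candidates: the `Candidate` class, the `score` function, the weights, the `ex:` identifiers and the toy candidate data are all assumptions made up for illustration. The sketch only shows that, when the name alone cannot separate candidates, entity types and literal values that overlap with the rest of the row can.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Hypothetical view of an entity as returned by a lookup service."""
    entity_id: str
    label: str
    types: set = field(default_factory=set)
    literals: set = field(default_factory=set)


def score(candidate, mention, row_context, expected_type):
    """Combine three clues into one score (the weights are arbitrary assumptions)."""
    # Clue 1: exact, case-insensitive name match.
    name_score = 1.0 if candidate.label.lower() == mention.lower() else 0.0
    # Clue 2: the candidate has the type expected for the column.
    type_score = 1.0 if expected_type in candidate.types else 0.0
    # Clue 3: share of the other row values found among the entity's literals.
    context = {value.lower() for value in row_context}
    literals = {value.lower() for value in candidate.literals}
    context_score = len(context & literals) / len(context) if context else 0.0
    return 0.4 * name_score + 0.3 * type_score + 0.3 * context_score


# Toy candidates sharing the same label (illustrative data, not real identifiers).
candidates = [
    Candidate("ex:A", "Jurassic World", {"film"}, {"2015", "Colin Trevorrow"}),
    Candidate("ex:B", "Jurassic World", {"video game"}, {"2015"}),
    Candidate("ex:C", "Jurassic World", {"film series"}, set()),
]

# Assumed content of the other cells in the same table row.
row_context = ["2015", "Colin Trevorrow"]

best = max(
    candidates,
    key=lambda c: score(c, "jurassic world", row_context, "film"),
)
print(best.entity_id)
```

All three candidates get the same name score, so the decision rests entirely on the type and context clues, which is exactly the kind of information a linking algorithm needs to retrieve from the knowledge graph.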

LamAPI, the ResourcR tool of the enRichMyData toolkit, has the goal of digesting a data source that we want to link to, in our case a knowledge graph, and preparing data structures that support linking and reconciliation. Its primary feature is indexing the knowledge graph to support fast entity lookup and follow-up matching operations. However, LamAPI also implements a variety of other features that help to back the linking task. It supports the retrieval of useful data from the indexed graph, including entity relations and literal values. It stores links across graphs (e.g., sameAs), so that they can be exploited for cross-graph lookup. More recent features support entity embeddings, such as RDF2vec, which can be useful to estimate the relatedness between candidate matches for different values in the table. Finally, it also provides services to analyze the input data, such as a data type identification service, which recognizes specific data types such as numbers, dates, emails, telephone numbers, street addresses and URLs. LamAPI has been extensively tested and engineered with large knowledge graphs such as Wikidata and DBpedia, which helps to make its configuration easier when a user wants to make a new knowledge graph, e.g., a proprietary graph, available for linking and reconciliation.
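As an illustration of what a data type identification step can look like, here is a minimal rule-based sketch in Python. It is not LamAPI's implementation or API: the `identify_type` function, the regular expressions and the returned labels are assumptions chosen for this example, and it covers only a few of the types mentioned above (numbers, ISO-formatted dates, URLs and emails).

```python
import re
from datetime import datetime

# Deliberately simple patterns; real-world values need more robust rules.
_NUMBER = re.compile(r"^[+-]?\d+(\.\d+)?$")
_URL = re.compile(r"^https?://\S+$", re.IGNORECASE)
_EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def identify_type(value: str) -> str:
    """Return a coarse type label for a single cell value."""
    text = value.strip()
    if _NUMBER.match(text):
        return "number"
    try:
        # Only ISO dates (YYYY-MM-DD) are recognized in this sketch.
        datetime.strptime(text, "%Y-%m-%d")
        return "date"
    except ValueError:
        pass
    if _URL.match(text):
        return "url"
    if _EMAIL.match(text):
        return "email"
    # Anything else is treated as free text, i.e. a candidate entity mention.
    return "string"


for cell in ["jurassic world", "2015-06-12", "124", "https://www.wikidata.org"]:
    print(cell, "->", identify_type(cell))
```

A step like this lets a linking pipeline tell apart the cells that should be looked up as entity mentions from the cells that are better compared against literal values in the graph.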

In conclusion, the enRichMyData ResourcR component, implemented by LamAPI, offers a comprehensive suite of data retrieval functionalities to support data linking and entity reconciliation against knowledge graphs, exploiting the speed and reliability of well-known information retrieval (IR) technologies and combining them with recent AI-based models for content analysis.
