DiscoverR tools are the components of the enRichMyData toolbox that help users find and understand data that they can use in their data enrichment processes.
Since knowledge graphs (KGs) play a crucial role in data enrichment, either as target data sources of interest or as bridges to reach additional sources, the first DiscoverR tool, ABSTAT, supports pattern-based profiling of even very large KGs, as well as explorative searches on top of these profiles. In the backend, a profile lists all the schema-level connections that exist in a graph, together with several statistics, thus providing a complete schema-level summary of the data stored in the KG. In the frontend, explorative queries help humans and machines search for relevant data (“Which connections does the graph represent between cities and sports teams?”) and filter and browse all available connections (e.g., finding all the properties used to describe entities of the class dbo:City, or finding that the only property connecting dbo:City and dbo:SportTeam in DBpedia is dbo:wikiPageWikiLink and that about 9000 of these connections exist). These profiles have proven useful in helping humans formulate queries over KGs with complex schemas, annotate tables, and detect quality problems, and in helping machines select the most relevant features. In enRichMyData, profiles of well-established KGs like Wikidata and DBpedia will be considered, along with profiles of KGs produced and used in the project.
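To make the idea of a schema-level profile more concrete, the following minimal Python sketch counts (subject class, property, object class) patterns over a handful of toy triples and then answers a question of the kind quoted above. It is only an illustration under simplifying assumptions: it is not ABSTAT's implementation or API, each entity is assumed to have a single class, and the `ex:` entities, the `dbo:ground` triple, and the function names `build_profile` and `patterns_between` are hypothetical. The class and property names `dbo:City`, `dbo:SportTeam`, and `dbo:wikiPageWikiLink` are copied from the text.

```python
from collections import Counter

# Toy data, invented for illustration; it is not real DBpedia content.
# Simplifying assumption: every entity has exactly one class.
TYPES = {
    "ex:Milan": "dbo:City",
    "ex:Turin": "dbo:City",
    "ex:TeamA": "dbo:SportTeam",
    "ex:TeamB": "dbo:SportTeam",
}
TRIPLES = [
    ("ex:Milan", "dbo:wikiPageWikiLink", "ex:TeamA"),
    ("ex:Turin", "dbo:wikiPageWikiLink", "ex:TeamB"),
    ("ex:TeamA", "dbo:ground", "ex:Milan"),
]


def build_profile(triples, types):
    """Count how often each (subject class, property, object class) pattern occurs."""
    counts = Counter()
    for subject, prop, obj in triples:
        # Entities without a known class fall back to the generic owl:Thing.
        pattern = (types.get(subject, "owl:Thing"), prop, types.get(obj, "owl:Thing"))
        counts[pattern] += 1
    return counts


def patterns_between(profile, subject_class, object_class):
    """Return {property: count} for patterns connecting the two given classes."""
    return {
        prop: n
        for (s_cls, prop, o_cls), n in profile.items()
        if s_cls == subject_class and o_cls == object_class
    }


profile = build_profile(TRIPLES, TYPES)
city_to_team = patterns_between(profile, "dbo:City", "dbo:SportTeam")
print(city_to_team)  # properties linking cities to sports teams, with counts
```

The profile is much smaller than the graph itself because it stores one entry per pattern rather than one per triple, which is what makes filtering and browsing at the schema level cheap in this sketch.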
A second DiscoverR tool is SemTUI, a framework that provides a User Interface (UI) to let users enrich tables by combining data linking and data extension services. In this case, the discovery process happens while the user is enriching a data sample with the user interface: the user can discover which linking algorithms are available and use them to bridge to existing data sources (e.g., linking cities described in a column to their identifiers in the DBpedia KG); once the links are found using the selected linking service, additional data can be fetched from the reference data source: the user can explore the data available in the data source and specify the data they want to add to the table (e.g., fetching the population of each city from DBpedia). Although this “link and extend” mechanism is inspired by the principles of web-based data exploration of linked open data, SemTUI is not limited to exploiting linked data sources. For example, we tested linking and extension services from private company KGs and included the HERE geocoding service for linking addresses to coordinates; once two columns have geocoordinates, data can be enriched with the shortest route distance (as calculated by a route planning service). In other words, explorative functionalities offered by SemTUI support the discovery of (1) linking and extension services, (2) data fetched from external sources, and (3) possible flaws in the data enrichment process (e.g., wrong links). SemTUI implements, extends, and improves functionalities previously available in a similar tool named ASIA, providing a better user experience and the possibility of translating the enrichment operations into code that can also be manipulated from developer-friendly interfaces like a notebook.
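The “link and extend” mechanism can be sketched in a few lines of Python, in the spirit of the notebook-style code mentioned above. This is a hedged illustration only: it does not use SemTUI's actual services or generated code, the in-memory `REFERENCE` dictionary stands in for an external data source, the population figures are made-up placeholders, and the names `link_column` and `extend_column` are hypothetical.

```python
# Hypothetical reference source keyed by normalized label.
# Identifiers and population values are placeholders, not real data.
REFERENCE = {
    "milan": {"id": "ex:Milan", "population": 100},
    "turin": {"id": "ex:Turin", "population": 200},
}


def link_column(rows, column, reference):
    """Linking step: add '<column>_id' with the matched identifier, or None."""
    linked = []
    for row in rows:
        # Exact match on a normalized label; real linking services are fuzzier.
        match = reference.get(row[column].strip().lower())
        linked.append({**row, f"{column}_id": match["id"] if match else None})
    return linked


def extend_column(rows, id_column, attribute, reference):
    """Extension step: fetch an attribute for every row that has a link."""
    by_id = {entry["id"]: entry for entry in reference.values()}
    extended = []
    for row in rows:
        entry = by_id.get(row[id_column])
        extended.append({**row, attribute: entry[attribute] if entry else None})
    return extended


table = [{"city": "Milan"}, {"city": " Turin "}, {"city": "Atlantis"}]
linked = link_column(table, "city", REFERENCE)
enriched = extend_column(linked, "city_id", "population", REFERENCE)

# Rows without a link are candidates for manual review, which corresponds
# to discovering possible flaws in the enrichment process.
unlinked = [row["city"] for row in enriched if row["city_id"] is None]
print(enriched)
print(unlinked)
```

Keeping linking and extension as separate steps mirrors the text: the link column can be inspected and corrected before any data is fetched, and the same links can be reused for several extensions.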
The University of Milano-Bicocca contributes to the development of the DiscoverR tools by bringing expertise in data discovery and linking solutions, drawing on its research and associated tools in data profiling, data integration, semantic annotation of tabular data, and service-based architectures.