TOOLBOX
enRichMyData delivers its capabilities as a set of interoperable tools and services that form the enRichMyData Toolbox:
- a collection of tools that provide functional capabilities needed to support the design of pipelines;
- a set of infrastructure services that provide non-functional capabilities needed to support the effective and efficient deployment and execution of pipelines.
The toolbox is designed as a set of loosely coupled but interoperable tools and services that can be combined and customized to handle complex data enrichment scenarios.
Tool collection
DiscoverR
DiscoverR assists users in searching datasets, ontologies, and enrichment services and provides insights on their content to support their use in the enrichment pipeline. The user can search for keywords over descriptions of catalogued datasets/ontologies/services (particular features such as metadata, formats, ontology terms, and quality features), or browse specific descriptions from a visual interface. Catalogued datasets/services/ontologies encompass well-established knowledge bases (e.g., Wikidata, DBpedia, Schema.org) and data used in and created with pipelines. DiscoverR provides semantic data profiling techniques to enrich basic descriptions based on metadata (e.g., DCAT) with ontology usage patterns (e.g., connections between concepts) and statistics (e.g., frequency and cardinality). These profiling techniques are applied to semantic sources as well as non-semantic sources; the latter are profiled by inferring the semantics of their schema using the annotation services of the LinkR component. Profiling complies with and promotes the FAIR principles.
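The statistics mentioned above (frequency, cardinality) can be illustrated with a minimal Python sketch. It is not DiscoverR's implementation or API; the function name `profile_column` and the sample values are assumptions made purely for illustration.

```python
from collections import Counter


def profile_column(values):
    """Compute a few basic profile statistics for one column (hypothetical helper)."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "count": len(values),                  # total number of cells
        "nulls": len(values) - len(non_null),  # number of missing cells
        "cardinality": len(counts),            # number of distinct non-null values
        # most frequent value together with its frequency, or None if the column is empty
        "top": counts.most_common(1)[0] if counts else None,
    }


# Illustrative data only: a "city" column with one missing value.
profile = profile_column(["Oslo", "Milan", "Oslo", None])
print(profile)
```

A profile of this kind could be attached to a catalogue entry alongside its metadata description, so that users can judge a dataset before adding it to a pipeline.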
CleanR
CleanR supports the specification of data manipulation transformations, including data cleaning operations and the generation of knowledge graphs from various data formats. Users specify transformations interactively from a user interface, and the specifications will be stored in a machine-readable format so that they can be replicated and reused. CleanR provides a broad set of AI-enabled data transformations (e.g., ML-based recommendations) and integrates them with generic linking and extension functionalities provided by the ResourcR. CleanR enables data cleaning and enrichment operations to be shared (as an asset, text, or executable), managed, and, if needed, incorporated as steps in the data pipelines in the ScalR component.
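To show what storing a transformation in a machine-readable format and replaying it might look like, here is a minimal Python sketch. The JSON layout, the operation names (`strip`, `title_case`) and the function `apply_spec` are hypothetical and do not reflect CleanR's actual specification format.

```python
import json

# Hypothetical machine-readable specification: an ordered list of cleaning steps.
SPEC_JSON = '[{"op": "strip", "column": "city"}, {"op": "title_case", "column": "city"}]'

# Hypothetical registry mapping operation names to plain string functions.
OPERATIONS = {
    "strip": str.strip,       # remove leading and trailing whitespace
    "title_case": str.title,  # normalise capitalisation
}


def apply_spec(rows, spec):
    """Replay every step of the specification on a copy of each row."""
    result = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        for step in spec:
            operation = OPERATIONS[step["op"]]
            new_row[step["column"]] = operation(new_row[step["column"]])
        result.append(new_row)
    return result


spec = json.loads(SPEC_JSON)
cleaned = apply_spec([{"city": "  oslo "}, {"city": "MILAN"}], spec)
print(cleaned)
```

Because the specification is data rather than code, the same steps can be stored, shared, and replayed on another dataset with the same column.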
StructR
StructR is the counterpart of LinkR for unstructured data. It generates structured data from the unstructured input text through semantic annotation, linking and extension. The text is processed by linguistic and semantic tools, and concept mentions are identified and disambiguated from context. Furthermore, the text is projected into an embedding space using representation learning. StructR supports a range of different pre-computed embeddings to represent the text and expand the dataset. Extension with custom annotation services is supported through a labelling interface for creating and editing text annotations which can then be used to build new annotation models in a human-in-the-loop fashion.
WrappR
WrappR provides data access using a virtual semantic layer and ensures secure access. WrappR is delivered as a semantic graph database with efficient reasoning, cluster and external index synchronization support. It provides various types of APIs and access methods, as well as different types of data federation and virtualization. Through semantic data access and integration, WrappR provides a practical, robust and versatile tool to improve access to data.
LinkR
LinkR provides capabilities for semantic annotation of structured and semi-structured data using reference knowledge graphs and category schemes. Annotations consist of links from elements of the input data to elements of well-established knowledge bases and ontologies (e.g., Wikidata, DBpedia, and GeoNames) or of user-defined knowledge graphs made available through the ResourcR; they include schema-level annotations with ontology terms and instance-level annotations with identifiers. LinkR supports annotation through ML algorithms that recommend annotations and a human-in-the-loop approach that enables users to fine-tune the recommendation algorithms and revise the results, ensuring high-quality annotations while minimizing the users' effort even on very large data volumes. Annotations will be converted into data transformations, to be used as part of enrichment pipelines.
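The combination of recommended annotations and human revision can be sketched as follows. This is only an illustration under stated assumptions: plain string similarity from Python's standard `difflib` stands in for the ML algorithms, and the identifiers, the threshold, and the function `recommend_link` are all hypothetical rather than part of LinkR.

```python
from difflib import SequenceMatcher

# Hypothetical reference knowledge graph: identifier -> label.
KNOWLEDGE_BASE = {"ex:Q1": "Oslo", "ex:Q2": "Milan", "ex:Q3": "Ljubljana"}


def recommend_link(mention, knowledge_base, auto_accept=0.9):
    """Recommend the best-matching entity and flag weak matches for human review."""
    best_id, best_score = None, 0.0
    for entity_id, label in knowledge_base.items():
        # String similarity is a stand-in for a learned matching model.
        score = SequenceMatcher(None, mention.lower(), label.lower()).ratio()
        if score > best_score:
            best_id, best_score = entity_id, score
    # Only confident matches are accepted automatically; the rest go to a reviewer.
    status = "accepted" if best_score >= auto_accept else "needs_review"
    return {"mention": mention, "entity": best_id, "status": status}


for cell in ["Oslo", "Mil"]:
    print(recommend_link(cell, KNOWLEDGE_BASE))
```

In this sketch, a reviewer only sees the cells marked `needs_review`, which is one simple way to keep manual effort low on large tables.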
ClassifiR
ClassifiR supports data classification as a service and complements StructR. Whereas StructR identifies properties of parts of the text, ClassifiR labels the documents as a whole. The labels can be part of standard taxonomies, industry classifications, and custom sets of labels for which a classifier is built. Custom classification is supported by an interactive graphical interface which allows users to explore a document corpus and create ontologies through clustering, labelling and querying. ClassifiR automates the classification process and exposes it through a common endpoint independent of the classification used.
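The idea of one endpoint that works regardless of the classification in use can be sketched in Python as a registry of classifiers behind a single function. The scheme names, the toy keyword rules, and the `classify` function are assumptions for illustration only, not ClassifiR's interface.

```python
from typing import Callable, Dict

# Hypothetical registry: classification scheme name -> classifier function.
CLASSIFIERS: Dict[str, Callable[[str], str]] = {}


def register(scheme: str):
    """Decorator that registers a classifier under a scheme name."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        CLASSIFIERS[scheme] = func
        return func
    return decorator


@register("topic")
def classify_topic(text: str) -> str:
    # Toy keyword rule standing in for a trained model.
    return "finance" if "bank" in text.lower() else "other"


@register("length")
def classify_length(text: str) -> str:
    return "long" if len(text.split()) > 5 else "short"


def classify(text: str, scheme: str) -> dict:
    """Single entry point: callers name a scheme and receive a label for the whole document."""
    if scheme not in CLASSIFIERS:
        raise KeyError(f"unknown classification scheme: {scheme}")
    return {"scheme": scheme, "label": CLASSIFIERS[scheme](text)}


print(classify("The bank raised rates", "topic"))
print(classify("The bank raised rates", "length"))
```

Adding a custom set of labels then amounts to registering one more classifier; callers of `classify` do not change.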
Infrastructure services
ScalR
ScalR provides infrastructure components for executing cleaning, transformation and linking at a large scale. ScalR provides horizontal scalability of data enrichment pipelines using software containers, and support for managing the different procedures associated with the execution of data enrichment pipelines flexibly on heterogeneous computing infrastructures. ScalR provides integrated support for specific data enrichment operations in the form of a pipeline through the development of reusable standard templates for setting up such pipelines. ScalR promotes the reuse and modification of existing data enrichment pipelines by exposing them as an integrated deployable unit, as opposed to ad-hoc, non-reusable pieces of code.
StreamR
StreamR provides infrastructure components for streaming support in data enrichment pipelines. It pipes data streams from/to appropriate endpoints and ensures high throughput, providing a configurable set of tools for setting up custom streams for new applications.
ReusR
ReusR provides infrastructure components for search and recommendation of assets (e.g., datasets and transformations) related to setting up and running data enrichment pipelines. It provides user login, public/private access to data management assets, and support for editing, sharing and versioning them. ReusR enables users of the enRichMyData toolbox to edit pipelines and promotes their reuse across use cases.
GreenR
GreenR provides infrastructure components to support monitoring of data enrichment pipelines in terms of their environmental impact. It monitors the carbon footprint of the various components in the pipeline and presents the results to the user through a dashboard, making it possible to log and modulate the environmental impact of the heavy computations within the pipelines.
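A common way to estimate the carbon footprint of a computation is to multiply the energy it consumes by the carbon intensity of the electricity used. The Python sketch below applies that estimate per pipeline component; it is not GreenR's method, and the component names, energy figures, and intensity value are illustrative assumptions rather than measurements.

```python
def carbon_footprint_g(energy_kwh, intensity_g_per_kwh):
    """Estimated emissions in grams CO2e: energy used times grid carbon intensity."""
    return energy_kwh * intensity_g_per_kwh


# Illustrative values only (not measured): energy per pipeline component in kWh.
ENERGY_KWH = {"cleaning": 0.5, "linking": 2.0, "classification": 1.5}
GRID_INTENSITY = 300.0  # assumed grams CO2e per kWh

report = {
    name: carbon_footprint_g(kwh, GRID_INTENSITY)
    for name, kwh in ENERGY_KWH.items()
}
total = sum(report.values())

for name, grams in report.items():
    print(f"{name}: {grams:.0f} g CO2e")
print(f"total: {total:.0f} g CO2e")
```

A per-component breakdown of this kind is what a dashboard would need in order to show which steps of a pipeline dominate its environmental impact.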
ResourcR
ResourcR provides infrastructure components to support the creation of linking services for a given dataset from a data provider as well as access mechanisms such as search and query. ResourcR enables performant linking and search functionalities with limited effort and exposes them as search and linking APIs. The combination of ResourcR and LinkR makes it possible to turn semantic data produced with the toolbox into resources immediately available for reuse.
The consortium
The consortium consists of 13 partners from 11 countries. It has three strong university partners specialised in Big Data, distributed computing, and high-productivity languages, led by a research institute. Additionally, one research institute and one international organisation are involved. enRichMyData gathers three SMEs and five large companies that prioritise the business focus of the project in achieving high business impacts.