I am Roberto Avogadro. I work at SINTEF as a Research Scientist, and my organization is the coordinator of the EU-funded enRichMyData project, whose aim is to develop a comprehensive toolbox for data enrichment pipelines.
In simple terms, this means a set of tools and services for running data pipelines that enrich data with added value.
The initial version of the toolbox is already released. How does it help data scientists?
First, the initial version of the toolbox helps data scientists gain valuable insights from their data. Once these insights are discovered, the next step is to scale up the process to efficiently handle large volumes of data.
An interesting example comes from one of our project partners, Spend Network (SN), where it is essential to reconcile public bodies against a reference dataset such as Wikidata, that is, to match each public body in their data to the corresponding entry in the reference dataset. This process provides proof of existence for public bodies and enables access to more detailed information about these entities.
Which businesses can benefit from this toolbox and how?
In the project, we have so-called “business case” partners whose role is to demonstrate the added value of adopting the toolbox in various business areas such as digital marketing, manufacturing, predictive maintenance, public procurement, innovation ecosystems, and mineral processing. Based on their experience with enriched data operations, other companies in the same or similar sectors can also benefit from the toolbox.
What is the added value of the enRichMyData toolbox compared to other existing sets of tools?
The main advantage of the enRichMyData toolbox is that it is tailored to specific business cases. It allows you to process data very efficiently, which matters all the more now that sustainability is becoming an increasingly relevant concern.
Large language models are a good example: everyone can see how beneficial they can be, but they often don’t scale well. If you need to process a large amount of data, it will take a significant amount of time and may not be sustainable. Additionally, large language models can sometimes produce inaccurate or wrong information, a problem known as “hallucination”. This is where the enRichMyData toolbox has a crucial advantage: its robust components ensure data accuracy and reliability.
For instance, the LinkR component is designed to efficiently link entities to reference data, ensuring high accuracy and consistency. Meanwhile, the ResourceR component provides access to high-quality reference data, such as Wikidata, enhancing the overall data enrichment process.
Moreover, the toolbox’s ability to integrate and enrich data from diverse sources ensures that users can leverage comprehensive and high-quality datasets, which is vital for accurate analytics and decision-making.