Could you give a quick outline of your history and the skills you bring to the enRichMyData project?
I have been a data engineer at Ontotext for the last 7 years and have been part of countless data integration projects, both small and relatively large in scale. Ontotext is an unusual company in that we are at once a software producer (Ontotext GraphDB), a solutions company solving data-related tasks in a bespoke manner for different clients, and a research-oriented company that has been part of tens of consortia in EC-funded projects, starting from the FP7 program in the early 2000s.
This means that the variety of data-related tasks and issues we encounter at Ontotext is huge, and with it the depth of the collective knowledge and skills accumulated in the organization.
What factors inspired you to embark on the enRichMyData endeavor, and how does Ontotext contribute as a partner within this project?
The EMD project is a perfect fit for an organization like ours. We struggle constantly with building data-enrichment and cleaning pipelines. No two projects have the same requirements and no two solutions are identical. As a result, we have accumulated bits of solutions that we constantly reuse and adapt. When the EMD consortium invited us to systematize these tools together with others of a similar nature and build a toolbox for data enrichment, we were very happy to accept.
Can you provide an explanation of how the CleanR and WrappR tools tackle data enrichment, including the specific techniques or approaches being employed for this objective?
CleanR is dedicated to data cleaning and transformation. In my view, this is the step that defines real-world data problems and sets them apart from the ones in the textbook. The quality of real-world data is usually quite low, yet there are no excuses not to use it. Cleaning and fixing it is therefore the only option, and that always involves some manual intervention. This is where CleanR helps: it aims to make the most of the human input without wasting precious human attention on stupid tasks.
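As a rough illustration of what "optimizing the human input" can mean in practice, the sketch below groups spelling variants of the same value so that a person reviews one group instead of every row. It is not CleanR's actual implementation; the function names `fingerprint` and `review_queue` and the sample values are hypothetical, chosen only for this example.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Reduce a string to a key on which superficial spelling variants collide."""
    # Decompose accented characters and drop the non-ASCII remainder.
    ascii_text = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Lowercase, replace punctuation with spaces, then sort the unique tokens
    # so that word order and repeated words no longer matter.
    tokens = re.sub(r"[^\w\s]", " ", ascii_text.lower()).split()
    return " ".join(sorted(set(tokens)))


def review_queue(values):
    """Return only the groups that contain more than one distinct spelling.

    Values that are already consistent never reach the human reviewer.
    """
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}
```

Calling `review_queue(["Ontotext AD", "ontotext, ad", "AD Ontotext", "Sofia", "Sófia", "Plovdiv"])` yields two groups to review (the Ontotext variants and the two spellings of Sofia), while the unambiguous "Plovdiv" is left alone. The person then makes two decisions instead of inspecting six rows.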
WrappR is a collection of tools and techniques assembled around the idea of data virtualization. Depending on the data integration project, it might make more sense to keep part of the data in its native format and in its native system, while accessing it through a “bridge” or connector from a different system. There are many variations on this theme, and WrappR aims to provide solutions to the most common ones.
What obstacles have you come across during the data enrichment process, and what strategies have you employed to overcome these challenges?
People tend to take data quality for granted. The truth is that most often the data is messy, and in order to fulfil a given project’s requirements it needs cleaning and enriching, stitching together and integrating from various sources. The obstacles to this are many. Sometimes it is the scale of the data; sometimes the velocity, or the speed at which it needs to be updated; sometimes it is an incompatibility of formats; and sometimes it is even an incompatibility at the ontological level, where the same reality has been modeled differently. There is no single strategy to overcome such obstacles. It is more a question of the expertise and experience of the knowledge engineer, who by the end of the EMD project will be substantially happier with a shiny new set of tools in their toolbox.
What makes CleanR a user-friendly tool and what is its main purpose?
CleanR is built around tools such as OpenRefine, which have been decades in the making and have hundreds of users. What is unique about the approach chosen by OpenRefine is that the interface is designed to be very user-friendly and usable by non-experts, while at the same time allowing advanced users to perform powerful data transformations.
How does WrappR’s inclusion in the enRichMyData toolbox promise to enhance data enrichment processes and elevate data quality?
WrappR adds various data virtualization capabilities covering a wide range of scenarios from and to a graph data format. We might want to read an RDF knowledge graph using a tool which only understands relational data. Or, inversely, we might want to integrate some tables from a high-velocity SQL database with complex, interconnected graph data, while indexing parts of it in a full-text search engine and dynamically exposing it as a REST API. In either case, there will be a WrappR component for the job.
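To make the idea of virtualization more concrete, here is a minimal sketch of one direction of such a bridge: rows stay in a relational database and are presented as graph-style triples only at the moment they are read, without being copied into a graph store. It uses Python's standard `sqlite3` module as a stand-in for the SQL source and is not a WrappR component; the function `virtual_triples`, the IRI scheme and the `base_iri` parameter are assumptions made for illustration.

```python
import sqlite3
from typing import Iterator, Tuple

Triple = Tuple[str, str, str]


def virtual_triples(
    conn: sqlite3.Connection, table: str, key_column: str, base_iri: str
) -> Iterator[Triple]:
    """Lazily expose the rows of a relational table as (subject, predicate, object).

    Nothing is materialized: each triple is derived from the live table when
    the caller iterates, so updates in the database are visible on the next read.
    Table and column names are interpolated directly, so they must be trusted.
    """
    cursor = conn.execute(f'SELECT * FROM "{table}" ORDER BY "{key_column}"')
    columns = [description[0] for description in cursor.description]
    for row in cursor:
        record = dict(zip(columns, row))
        # Hypothetical IRI scheme: one subject per row, one predicate per column.
        subject = f"{base_iri}{table}/{record[key_column]}"
        for column, value in record.items():
            if column != key_column and value is not None:
                yield (subject, f"{base_iri}{table}#{column}", str(value))
```

A caller would open a connection, for example `sqlite3.connect("sensors.db")` (a hypothetical file), and iterate over `virtual_triples(conn, "sensor", "id", "http://example.org/")`. A production bridge would also need datatype handling, query push-down and a proper mapping language, all of which this sketch leaves out.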
How does Ontotext contribute to the development of those two tools, and what expertise do they bring to the project?
Ontotext invests heavily in its product development and, as part of the EMD project, will dedicate part of its resources to the specific needs of the use cases presented by the project.
Collaboration frequently plays a crucial role in research. In what ways do you envision the partnership between collaborators enhancing the progression of the field?
We already see how the rich and diverse backgrounds of the partners in the project enrich each other’s contributions, both to the tools and to the use cases. As a product company, we expect to learn a lot from the academic partners in the project. In turn, we will provide technical expertise and commercial-grade tooling to the toolbox.
To wrap up, do you have any additional insights you’d like to offer regarding the enRichMyData project, your involvement as a partner, or the larger significance of your work within the domain?
Data engineering is a craft, and one should always be on the lookout for new tools. We are happy and grateful to be part of such a meaningful effort to create them.