- Could you briefly share your journey and the skills you bring to the enRichMyData project?
The CS GROUP – ROMANIA team has extensive experience in designing and implementing software solutions for big-data processing. It brings to the enRichMyData project the experience gained while implementing toolboxes and processing chains for the Earth Observation domain, where we deal with terabytes and even petabytes of data processed on multiple machines.
- What personally motivated you to join the enRichMyData project, and in what unique ways does CS GROUP – ROMANIA enrich the project as a partner?
TAO (Tool Augmentation by user enhancements and Orchestration) is a powerful open-source, lightweight, generic, extensible and distributed orchestration framework. It already provided some of the scaling and reuse capabilities required by the enRichMyData project. However, it was designed mainly for Earth Observation processing chains. Our goal within enRichMyData is to make it more flexible and applicable to pipeline implementation in other domains.
- The endeavor your team is undertaking involves crafting two tools (ScalR and ReusR) for the enRichMyData toolbox. Could you shed light on the complexities you face in terms of juggling between the different natures of those tools and their intended use?
The two tools are actually provided as a single tool (TAO) that offers both reuse capabilities, through its web interface, and scaling capabilities, which are less visible to the user. The scaling part allows horizontal scaling of pipelines through various resource managers, which are in charge of execution on various platforms, depending on the chosen implementation.
- As the creator of ScalR, what aspect of the tool’s functionality do you find most innovative and impactful?
We believe the visual designer for workflows can be an important feature for users less familiar with workflow languages. In addition, being able to define a workflow once and execute it on various backends is very useful, making it easier to switch between job runners if necessary.
- Have there been any specific challenges you’ve encountered while developing ScalR, and how have you addressed them?
TAO has a strong feature set but was originally designed with a specific problem domain in mind. Because of this, it’s sometimes heavy on jargon and can be confusing for users from other domains. We are working to make it more approachable for new users, which should address this.
- Can you share a personal experience where ScalR successfully managed to achieve higher data quality and improved data flow readability?
ScalR is still under development (as a sub-functionality of TAO), and its role is not to improve data quality but to provide horizontal scalability of pipeline executions. One of our major achievements in the enRichMyData project so far is the first successful integration of Kubernetes as a Resource Manager in TAO, allowing TAO workflows to be executed in a Kubernetes cluster. Although the integration is at an early stage, the initial results are very promising. We also recently added WebSocket-based notifications to TAO (still under development, but the first results are likewise encouraging). This lets the web interface react more promptly to changes during executions and to other notifications.
- As the creator of ReusR, what specific aspects of the tool’s features do you believe provide the most value to users?
ReusR is also covered by the TAO framework, which makes it possible to version various components (application containers, processing components, workflows, etc.). This gives users the option to reuse versioned components to easily create new workflows or to clone existing ones. By replacing one component with another, a cloned workflow can be used to compare results between pipeline executions, or simply to extend the current workflow with new steps.
- What excites you the most about the integration of ReusR within the enRichMyData toolkit and its potential impact?
The ability to version and reuse various components (processing components, workflows, different application containers) makes them easy to interchange. The user can make changes by simply removing some components on a dashboard and replacing them with a simple drag and drop. Compatibility validations are also performed during these operations. Some operations will not be possible, however, for example when a component version becomes obsolete and can no longer be integrated in workflows. Nevertheless, the ease of integrating new components and (re)using them makes the tool a very powerful one.
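
As a rough illustration of the clone-and-swap idea described above, here is a minimal Python sketch. It is not TAO's API: the `Component` and `Workflow` classes, the `validate` and `clone_with_swap` functions, and the rule of matching input and output types are all hypothetical names and assumptions chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Component:
    """Hypothetical versioned component (not a TAO class)."""
    name: str
    version: str
    input_type: str
    output_type: str
    obsolete: bool = False


@dataclass(frozen=True)
class Workflow:
    """Hypothetical linear workflow: an ordered chain of components."""
    name: str
    steps: Tuple[Component, ...]


def validate(workflow: Workflow) -> List[str]:
    """Return a list of compatibility problems (empty if the chain is valid)."""
    errors = []
    # Assumption: obsolete component versions may not be used in a workflow.
    for step in workflow.steps:
        if step.obsolete:
            errors.append(f"{step.name} {step.version} is obsolete")
    # Assumption: each step's output type must match the next step's input type.
    for upstream, downstream in zip(workflow.steps, workflow.steps[1:]):
        if upstream.output_type != downstream.input_type:
            errors.append(
                f"{upstream.name} outputs {upstream.output_type}, "
                f"but {downstream.name} expects {downstream.input_type}"
            )
    return errors


def clone_with_swap(workflow: Workflow, index: int,
                    new_component: Component, new_name: str) -> Workflow:
    """Clone a workflow, replace one step, and validate the result."""
    steps = list(workflow.steps)
    steps[index] = new_component
    candidate = Workflow(new_name, tuple(steps))
    errors = validate(candidate)
    if errors:
        raise ValueError("; ".join(errors))
    return candidate


# Usage: swap version 1.0 of a processing step for version 2.0 in a clone.
read = Component("reader", "1.0", "path", "raster")
ndvi_v1 = Component("ndvi", "1.0", "raster", "raster")
ndvi_v2 = Component("ndvi", "2.0", "raster", "raster")
write = Component("writer", "1.0", "raster", "file")

base = Workflow("ndvi-chain", (read, ndvi_v1, write))
variant = clone_with_swap(base, 1, ndvi_v2, "ndvi-chain-v2")
```

The original workflow is left untouched, so both versions can be executed and their results compared, while an obsolete or incompatible replacement is rejected before the clone is created.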