Executing cleaning, transformation, and linking at large scale requires infrastructural components that support scalability. Scalability is the ability of a system to sustain increasing workloads by making use of additional resources; building this property into a big data pipeline is an essential step to avoid common performance bottlenecks.
There are two main ways of scaling:
- Scaling up, or vertical scaling: using more powerful hardware with more memory. This approach offers the best performance, since everything runs on the same machine. Its main limitation is the pace at which the workload grows: for a fast-growing process it is only a short-term solution, as frequent upgrades become increasingly expensive and eventually hit hardware limits.
- Scaling out, or horizontal scaling: adding machines to the infrastructure rather than upgrading a single one. This solution uses parallel computing to increase the performance of the infrastructure and remains viable in the long term. At the same time, moving from a single machine to a distributed system comes at the cost of lower per-task speed and higher complexity.
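The difference between the two mindsets can be illustrated in miniature with Python's standard library: a serial loop corresponds to relying on a single (faster) machine, while a worker pool stands in for spreading the same work across several executors. The thread pool below is only a stand-in for distributing work across nodes; the function names are illustrative, not part of any framework.

```python
from concurrent.futures import ThreadPoolExecutor

def clean_record(record: str) -> str:
    # A toy "cleaning" step: trim whitespace and normalize case.
    return record.strip().lower()

def clean_serial(records):
    # Vertical-scaling mindset: one worker processes everything.
    return [clean_record(r) for r in records]

def clean_parallel(records, workers: int = 4):
    # Horizontal-scaling mindset: add workers and split the load.
    # (Threads here stand in for nodes in a distributed system.)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(clean_record, records))

records = ["  Alice ", "BOB", "  Carol"]
result = clean_parallel(records)
```

Both versions produce the same output; the parallel one scales by adding workers, at the price of coordination overhead, mirroring the trade-off described above.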
The main goal of ScalR is to provide horizontal scalability for data enrichment pipelines using software containers, and to support flexible management of the procedures associated with executing such pipelines on heterogeneous computing infrastructures.
This goal is achieved by promoting the reuse and modification of existing data enrichment pipelines, exposing them as integrated deployable units instead of ad-hoc, non-reusable pieces of code. For this reason, most tools that handle scalability can also be considered data orchestrators: it is usually more convenient to orchestrate a pipeline with an automated tool than to combine different scripts by hand. An orchestrator lets the user mix components written in different languages, such as batch or Python scripts, within the same pipeline with no difficulty. It can accept different types of files as input, handle scalability, and create and schedule the steps of the entire data pipeline. Further advantages of using a data orchestrator are good readability of the data flow, limited room for error, since a great number of activities are managed automatically, and better visibility into data quality.
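The idea of mixing components written in different languages within one pipeline can be sketched as follows. This is a hypothetical minimal orchestrator, not any particular tool's API: each step is either a Python callable or a shell command, and the runner pipes the data through them in the declared order.

```python
import subprocess

def run_pipeline(steps, data):
    """Run named steps in order; a step is a Python callable or a shell command."""
    for name, step in steps:
        if callable(step):
            data = step(data)
        else:
            # Shell step: feed the current data to the command via stdin.
            proc = subprocess.run(step, input=data, capture_output=True,
                                  text=True, shell=True, check=True)
            data = proc.stdout
    return data

steps = [
    ("clean", lambda s: s.strip()),                  # Python step
    ("upper", "tr '[:lower:]' '[:upper:]'"),         # shell step
    ("tag", lambda s: f"enriched:{s}"),              # Python step
]
result = run_pipeline(steps, "  hello  ")
```

A real orchestrator adds scheduling, retries, and distribution on top of this basic pattern, but the core benefit is the same: heterogeneous steps composed into one readable pipeline.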
In the enRichMyData project a tool from CS GROUP–ROMANIA called TAO (which stands for Tool Augmentation by user enhancements and Orchestration) is used to provide the main ScalR functionalities. TAO is an open-source, lightweight, generic, extensible, and distributed orchestration framework. It allows the reuse (i.e., integration) of commonly used toolboxes (such as, but not limited to, Earth Observation processing tools like SNAP, Orfeo Toolbox, GDAL, and PolSARPro). The framework supports processing composition and distribution in such a way that end users can define processing workflows by themselves and easily integrate additional processing modules, without requiring any programming knowledge.
The TAO platform provides a means of orchestrating heterogeneous processing components and libraries to process scientific data. This is achieved in the following steps:
1. Preparation of resources (including processing components or tools) and input data.
2. Definition of a workflow as a processing chain.
3. Execution of workflows, scaling out from one node to as many as are made available.
4. Retrieval and visualization of the results, showing what is executing in the system and where, as well as the system resource usage (CPU, memory, storage space).
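The steps above can be sketched as a minimal workflow runner. The class and method names below are illustrative, not the actual TAO API: components are registered (preparation), chained into a workflow (definition), executed across a set of nodes (execution), and an execution log records what ran and where (retrieval/visualization).

```python
class Workflow:
    def __init__(self):
        self.components = {}   # step 1: registered processing components
        self.chain = []        # step 2: the processing chain
        self.log = []          # step 4: what ran, and on which node

    def register(self, name, func):
        self.components[name] = func

    def add_step(self, name):
        self.chain.append(name)

    def execute(self, data, nodes=("node-1",)):
        # Step 3: run the chain, cycling over however many nodes are
        # available (a real system would dispatch to remote workers).
        for i, name in enumerate(self.chain):
            node = nodes[i % len(nodes)]
            data = self.components[name](data)
            self.log.append((name, node))
        return data

wf = Workflow()
wf.register("dedup", lambda xs: sorted(set(xs)))
wf.register("link", lambda xs: [(x, f"id-{x}") for x in xs])
wf.add_step("dedup")
wf.add_step("link")
out = wf.execute([3, 1, 3, 2], nodes=("node-1", "node-2"))
```

Inspecting `wf.log` after execution corresponds to step 4: it shows which component ran and on which node, independently of the result data itself.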
An important TAO component is DRMAA (Distributed Resource Management Application API), which provides standardized access to DRM systems for execution resources. It focuses on job submission, job control, reservation management, job retrieval, and machine monitoring information. Currently, DRMAA implementations are provided for local, remote SSH, Torque, and SLURM execution; support for Kubernetes and for CWL (Common Workflow Language) is in progress. The DRMAA implementations are packaged as plugins, which gives high flexibility: the current implementation can easily be swapped for another.
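The plugin idea can be sketched with a common executor interface and interchangeable back ends, so that switching from local to SLURM execution is only a matter of selecting a different plugin. This is a hedged illustration of the pattern, not TAO's or DRMAA's actual interface; the class names and the `PLUGINS` registry are made up, and the SLURM back end is stubbed out since it would require a real cluster.

```python
from abc import ABC, abstractmethod
import subprocess

class ExecutorPlugin(ABC):
    """Common interface every execution back end must implement."""
    @abstractmethod
    def submit(self, command: list) -> str: ...

class LocalExecutor(ExecutorPlugin):
    def submit(self, command):
        # Run the job on the local machine and return its output.
        result = subprocess.run(command, capture_output=True, text=True)
        return result.stdout.strip()

class SlurmExecutor(ExecutorPlugin):
    def submit(self, command):
        # A real plugin would wrap sbatch/DRMAA calls; stubbed here.
        raise NotImplementedError("requires a SLURM cluster")

# Registry of available plugins; adding a back end means adding an entry.
PLUGINS = {"local": LocalExecutor, "slurm": SlurmExecutor}

def get_executor(kind: str) -> ExecutorPlugin:
    return PLUGINS[kind]()   # swapping back ends is just a key change

output = get_executor("local").submit(["echo", "hello"])
```

Because callers only depend on the `submit` interface, replacing the local executor with an SSH, Torque, or SLURM plugin requires no change to the orchestration logic, which is the flexibility the plugin design provides.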