Introducing SIM-PIPE: A New Approach to Data Pipeline Optimization

enRichMyData has released the next generation of tools and services in its Toolbox V2. As part of this EU-funded project, we offer innovative resources to enhance data enrichment processes, optimize data workflows, and promote sustainable practices.

Developed by SINTEF, SIM-PIPE is a robust tool designed to streamline data pipeline management. It empowers users to simulate or “dry-run” data workflows, providing precise metrics on resource usage at each pipeline stage. These insights allow users to anticipate and manage resource needs effectively, ensuring greater efficiency and predictability in data processes.

Our team has focused on integrating SIM-PIPE with TAO to improve its prediction capabilities. We have developed dedicated backend functionality for this purpose, so that SIM-PIPE and TAO can work together on resource prediction.

Enhanced Resource Prediction and Data Handling

In Version 2, SIM-PIPE features API endpoints for detailed resource consumption predictions based on dry-run data. This addition enables users to predict and optimize resource needs more accurately. We have also implemented a dedicated pipeline to split tabular input data for dry runs, making it easier to handle large datasets and predict specific resource requirements.
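To illustrate the idea behind splitting tabular input for dry runs, here is a minimal Python sketch. It is not SIM-PIPE's actual pipeline or API: the function names `split_tabular`, `fit_line`, and `predict` are hypothetical, and the linear relationship between input size and resource usage is an assumption made only for the example. The sketch cuts a CSV file into samples of increasing size and then extrapolates from measurements taken on those samples.

```python
import csv
import io


def split_tabular(csv_text, fractions):
    """Return CSV samples holding the header plus the first fraction of rows.

    Hypothetical helper: each sample could serve as input to one dry run.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    samples = []
    for fraction in fractions:
        count = max(1, round(len(body) * fraction))
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(body[:count])
        samples.append(buffer.getvalue())
    return samples


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Extrapolate resource usage for an input of size x (assumes linearity)."""
    return slope * x + intercept
```

In practice, the row counts of the samples would be paired with the resource metrics measured during each dry run, and the fitted line would give a rough estimate for the full dataset. Real pipelines may scale non-linearly, so such an estimate should be treated as a first approximation.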

Broader Use Cases and Sustainability Initiatives

Version 2 also brings the ability to run data pipelines on emulated hardware. This lets users obtain resource consumption predictions for different computing architectures, which is useful for testing configurations across diverse hardware environments.

In line with our commitment to sustainable data practices, we are also exploring the integration of carbontracker within SIM-PIPE. Once implemented, carbontracker will report the carbon footprint of pipeline dry runs, so that users can weigh environmental impact alongside resource optimization.

Figure: SIM-PIPE frontend showing resource consumption and logs

SIM-PIPE Availability and Usage

SIM-PIPE is available as open-source software on GitHub: https://github.com/DataCloud-project/SIM-PIPE.

The tool can be accessed through its GraphQL API, whose schema documentation can be browsed with clients such as Insomnia or Apollo GraphQL Studio, or through its graphical frontend. A snapshot of the frontend is shown in the figure above.
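For readers who prefer a script to a GraphQL client, the schema can also be explored with a standard GraphQL introspection query. The Python sketch below shows one way to do this using only the standard library. The endpoint URL is a placeholder, the helper names are hypothetical, and any authentication a given deployment requires is left out; only the `__schema` introspection fields come from the GraphQL specification, not from SIM-PIPE itself.

```python
import json
import urllib.request

# Standard GraphQL introspection query: lists the types the schema exposes.
INTROSPECTION_QUERY = "{ __schema { queryType { name } types { name kind } } }"


def build_introspection_request(endpoint):
    """Build a POST request carrying the introspection query.

    `endpoint` is whatever URL your SIM-PIPE deployment serves GraphQL on
    (an assumption; check your own installation).
    """
    payload = json.dumps({"query": INTROSPECTION_QUERY}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def list_type_names(response_body):
    """Extract user-facing type names from an introspection response."""
    data = json.loads(response_body)
    types = data["data"]["__schema"]["types"]
    # Names starting with "__" are GraphQL's own introspection types.
    return sorted(t["name"] for t in types if not t["name"].startswith("__"))
```

A possible use, assuming a locally running instance at a placeholder address: pass the request to `urllib.request.urlopen`, read the response body, and hand it to `list_type_names` to see which types are available before writing real queries.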

For more information and to start using SIM-PIPE, visit our GitHub repository: https://github.com/DataCloud-project/SIM-PIPE.
