Introduction
In previous articles, we introduced Ploosh as an automated testing framework, highlighting its role in preventing regressions and improving the quality of deliveries in complex data projects. We also demonstrated its effectiveness in a data migration context, where it was used to test data flows between a legacy system and a cloud platform.
In this article, we will explore new approaches to using Ploosh in more traditional data projects, focusing on regression testing, data quality checks, and validation through test datasets.
Context

In this traditional data project, Ploosh was used to manage complex data flows feeding multiple layers of data, including sources in the form of files, a data warehouse, and a data mart. Each modification to this ETL chain, composed of complex calculations and specific business rules, introduced a risk of regression.
Initially, Ploosh was used to minimize these regression risks, ensuring optimal quality in the delivery of new features and fixes. Later, its use was extended to more operational checks, such as managing data rejections and verifying data completeness across the different layers.
Ploosh was fully integrated into Azure DevOps, with tests planned and executed automatically in the pipelines. The test cases, stored in a Git repository, were designed and maintained by the entire team (developers, testers, and business analysts), promoting seamless collaboration.
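As an illustration, such a pipeline step could look like the sketch below. This is only an assumption of how the integration might be wired: the pip package name, the command-line options (--cases, --connections) and the repository paths are hypothetical and should be checked against the Ploosh wiki.
# Hypothetical Azure DevOps step running Ploosh; the package name, CLI options and paths
# are assumptions for illustration only, see the Ploosh wiki for the actual syntax.
steps:
  - script: |
      pip install ploosh
      ploosh --cases ./tests/cases --connections ./tests/connections.yml
    displayName: Run Ploosh test cases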
Approaches
Ploosh is a tool designed to simplify test creation, but the person writing the tests remains central to the process. Although Ploosh allows for the creation of any type of test, the method of implementation depends on the approach chosen. In this project, three main approaches were used, although other options are also possible.
By recalculation
This approach involves validating the results produced by the ETL processes by recalculating the same operations using independent SQL queries. The idea is to replicate in SQL the transformations performed by the ETL to ensure that the final results are accurate and meet expectations. This ensures that the business rules and complex calculations applied during the transformations are correctly implemented.
Principle
The recalculation approach involves writing an SQL query that reproduces the calculations and transformations performed by the ETL on a given dataset. Then, the results obtained by this SQL query are compared to the results from the ETL process. If the results are identical, the test is validated; otherwise, it fails, indicating a potential error in the ETL process.
This method is particularly useful in cases where complex calculations or aggregations are performed, as it ensures that the ETL follows the business rules and that the data is handled correctly at each stage.
Example
Let’s suppose an ETL process aggregates sales data and groups it by category and region. To verify the accuracy of the results, you can recalculate these aggregations with a simple SQL query:
Test sales aggregation:
  source:
    connection: demo_sql
    type: MSSQL
    query: |
      SELECT category, region, SUM(sales_amount) AS total_sales
      FROM dwh.fact_sales
      GROUP BY category, region
  expected:
    connection: demo_sql
    type: MSSQL
    query: |
      SELECT category, region, total_sales
      FROM dmt.fact_sales_aggregated
In this example, the source query recalculates the aggregation directly on the warehouse data, reproducing the logic that the ETL process is supposed to apply, while the expected query reads the aggregated results produced by the ETL in the data mart. The results of the two queries are then compared to ensure that the ETL correctly applied the business rules during the aggregation of sales data.
Use cases
This approach can be used in several types of scenarios, such as:
- Verification of complex aggregations: For example, calculating monthly sales by product and by region.
- Calculation of ratios or indicators: Checking the accuracy of KPIs (Key Performance Indicators) derived from multiple data sources (a sketch follows this list).
- Validation of business rules: Ensuring that business rules are properly applied at each stage of data processing.
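To illustrate the second use case, a recalculation test for a simple ratio could look like the sketch below; the tables and columns (dwh.fact_sales, dmt.kpi_sales, margin_amount, margin_rate) are hypothetical and only serve to show the pattern:
# Hypothetical KPI recalculation: the margin rate is recomputed from the warehouse facts
# and compared to the value stored in the data mart (names are assumptions for illustration).
Test margin rate KPI:
  source:
    connection: demo_sql
    type: MSSQL
    query: |
      SELECT category, SUM(margin_amount) / NULLIF(SUM(sales_amount), 0) AS margin_rate
      FROM dwh.fact_sales
      GROUP BY category
  expected:
    connection: demo_sql
    type: MSSQL
    query: |
      SELECT category, margin_rate
      FROM dmt.kpi_sales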
Disadvantages
The main disadvantage of this approach is the time and effort required to write SQL queries that replicate complex transformations. However, it provides fine control and allows you to identify exactly where errors may occur in the ETL processes.
By empty
This approach stands out for its simplicity and effectiveness, as it only requires half of a typical test case. Instead of defining both a source query and an expected query, this method relies on the “empty” connector for the expected part of the test. The idea is to write a source query that should return no rows: any row it does return represents data considered incorrect. If Ploosh receives data in response to this query, the test fails.
Principle
The test is based on the assumption that the data source should not contain any results that meet the query’s criteria. If the query returns data, this indicates that an anomaly has been detected, causing the test to fail. Conversely, if no data is returned, the test is validated.
This method is particularly suitable for checks such as:
- Detecting incorrect or invalid data (e.g., values outside the allowed ranges),
- Verifying the absence of duplicates,
- Ensuring data consistency (absence of conflicts or inconsistencies in the datasets).
Example
Let’s take an example to verify the absence of rejected records in a table called fact_sales_rejects, which contains entries rejected during the ETL process.
Test sales rejects:
  source:
    connection: demo_sql
    type: MSSQL
    query: |
      SELECT *
      FROM dwh.fact_sales_rejects
      WHERE date_insert BETWEEN DATEADD(day, -1, GETDATE()) AND GETDATE()
  expected:
    type: empty
In this example, the source query checks the fact_sales_rejects table to retrieve data inserted in the last 24 hours. The empty connector is used for the expected part of the test, meaning Ploosh expects no data to be returned. If any data is found, this signals a problem (abnormal or unresolved rejections), and the test fails. If no data is returned, the test passes.
Use cases
This approach is extremely flexible and can be applied to various control scenarios:
- Data quality checks: Ensuring that invalid data is not present in target tables.
- Duplicate verification: Ensuring that no duplicate records exist in a given table (a sketch follows this list).
- Consistency checks: Ensuring that relationships and business rules are respected in the datasets.
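For instance, the duplicate check mentioned in this list could be written with the empty connector as in the sketch below; the table and key columns are hypothetical:
# Hypothetical duplicate check: a row is returned only if a business key appears more than once,
# so the empty connector makes any duplicate fail the test (names are assumptions for illustration).
Test sales duplicates:
  source:
    connection: demo_sql
    type: MSSQL
    query: |
      SELECT category, region, sale_date, COUNT(*) AS nb_rows
      FROM dwh.fact_sales
      GROUP BY category, region, sale_date
      HAVING COUNT(*) > 1
  expected:
    type: empty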
Advantages
- Simplicity: You only need to define a source query without creating an expected query. The empty connector significantly simplifies the test definition.
- Fast implementation: This method allows for rapid creation of tests to check for the absence of undesirable or inconsistent data.
- Flexibility: It can be used in various contexts to quickly verify critical aspects of the data system.
Variants
You can adapt this approach by slightly modifying the criteria in the source query to test different aspects of the data. For example, it can be used to check for the absence of NULL values in specific columns or to ensure that certain thresholds or business rules are not violated.
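As a minimal sketch of such a variant, assuming a hypothetical mandatory column customer_id in dwh.fact_sales, a NULL check could be written as follows:
# Hypothetical variant: the test fails as soon as the mandatory column contains NULL values
# (the column name is an assumption for illustration).
Test sales customer_id not null:
  source:
    connection: demo_sql
    type: MSSQL
    query: |
      SELECT *
      FROM dwh.fact_sales
      WHERE customer_id IS NULL
  expected:
    type: empty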
By test data
This approach is particularly effective for validating the accuracy of calculations and business rules in data projects. However, it requires a solid functional understanding of the system being tested. The idea is to insert predefined test data into the project’s data sources to simulate specific scenarios. These data are then compared to expected results, which have already been calculated, to ensure that the processes comply with the established business rules.
One major advantage of this method is that it allows for the control of specific cases while ensuring that the sampling of test data remains consistent and reliable. By doing so, potential regressions can be quickly identified. The test data are injected permanently into the data flow (for example, via a Y-flow) without altering the real production data.
Implementation steps
- Set up a Y-flow to inject the test source files, ensuring that these data do not modify the production sources.
- Create test source files that contain the datasets necessary to populate the relevant tables (e.g., the fact_sales table); a sketch of such a file follows this list.
- Develop the expected results files, which will be used to compare the test output with known and validated data.
- Write the Ploosh test case, which will compare the data generated from the test files with the expected results.
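As an illustration of the second step, a test source file might look like the excerpt below. The column layout and values are assumptions; the only real constraint is that the rows are identifiable as test data (here through a dedicated test customer) so they can be isolated downstream.
sales_headers.csv (hypothetical excerpt, columns and values for illustration only):
sale_id,customer,sale_date,region
TST-0001,Test Customer,2024-01-15,North
TST-0002,Test Customer,2024-01-16,South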

Example
Suppose the fact_sales table in a data warehouse is populated by two source files (sales_headers.csv and sales_content.csv), which are ingested via ETL processes. These processes apply complex business rules to transform the data.
Here’s how a test can be implemented to verify the integrity of this process:
Test sales:
  source:
    connection: demo_sql
    type: MSSQL
    query: |
      SELECT *
      FROM dwh.fact_sales
      WHERE customer = 'Test Customer'
  expected:
    type: csv
    path: ./data/dwh/facts/fact_sales.csv
In this example, the SQL query retrieves data for a specific test customer (Test Customer) from the fact_sales table in the data warehouse. The results obtained are then compared to a pre-existing CSV file, which contains the expected results. This file has already been validated, ensuring that any discrepancies detected during test execution indicate a potential regression in the ETL processes.
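A minimal sketch of what such an expected file could contain is shown below; the columns and values are purely illustrative and would in practice mirror the structure of dwh.fact_sales.
fact_sales.csv (hypothetical excerpt, columns and values for illustration only):
sale_id,customer,category,region,sales_amount
TST-0001,Test Customer,Electronics,North,1250.00
TST-0002,Test Customer,Furniture,South,830.50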
Advantages
- Quick regression detection: Since the input data (test files) and expected results (output files) are fixed, any unexpected variation in the results can be quickly identified and corrected.
- Controlled sampling: Using predefined test datasets ensures that the tests cover a wide range of scenarios, including edge cases, exceptions, or special cases.
- Reliability: Because the test datasets and expected results are constant, tests can be run with each code modification to ensure system stability, without needing to rewrite the tests at each iteration.
Conclusion
Ploosh is a versatile tool, extremely useful for securing ETL workflow changes and ensuring data integrity at each stage of processing. The various approaches presented here demonstrate how Ploosh can adapt to a wide range of needs, whether for regression testing, data quality control, or specific validations. Its technology-agnostic, flexible design makes it a valuable asset for technical teams and business analysts, who can collaborate on the creation and execution of tests.
As Ploosh evolves, new features and connectors will further enhance its capabilities. In the meantime, you can install it (see the guide on the wiki) or check out the source code available on GitHub.