ploosh. – A YAML-based framework for automating the testing process in data projects.

Ploosh is an open-source, YAML-based framework for automating the testing process in data projects.

What’s ploosh?

Ploosh is an innovative testing framework designed to automate the validation processes in your data projects. Built around YAML configuration, Ploosh allows you to quickly compare datasets and ensure they meet expected results, all with minimal code.

Why Choose Ploosh?

Effortless Automation: Ploosh simplifies test automation, especially for projects handling large volumes of data. Its YAML-based structure offers clear, easy-to-manage test cases and configuration files.
Fast and Powerful: With its intuitive interface and streamlined commands, Ploosh drastically reduces the time required to set up and run comprehensive tests. Results can be exported in multiple formats (JSON, CSV), giving you full flexibility for analysis.
Seamless Integration: Whether you’re working with SQL databases, file systems, or APIs, Ploosh integrates smoothly into your existing data pipelines. It can be employed in both DevOps environments and traditional software development workflows.

Key Features

YAML-Based Testing: Define your connections and test cases in well-structured YAML files.
Data Comparison: Automatically compare source datasets against expected outcomes with high precision.
Result Exporting: Generate detailed test reports in various formats to facilitate analysis and traceability.

Get started

Steps

Install the ploosh package
Set up a connection file
Set up the test cases
Run tests
Get results

Installation

Install the ploosh package from PyPI with the following command:

pip install ploosh

Set up a connection file

Add a YAML file named “connections.yml” with the following content:

mysql_demo:
  type: mysql
  hostname: my_server_name.database.windows.net
  database: my_database_name
  username: my_user_name
  # using a parameter is highly recommended
  password: $var.my_sql_server_password
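
The `$var.<name>` placeholder keeps the secret out of the connection file; its value is supplied when the tests are run. The sketch below only illustrates that naming convention with a plain string substitution. The helper `resolve_placeholders` is hypothetical and is not ploosh's own code.

```python
import re

# Hypothetical helper: illustrates the $var.<name> convention only,
# it is not how ploosh itself is implemented.
VAR_PATTERN = re.compile(r"\$var\.([A-Za-z_][A-Za-z0-9_]*)")


def resolve_placeholders(text: str, params: dict) -> str:
    """Replace each $var.<name> in text with params[<name>]."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in params:
            # Fail loudly rather than leave a secret placeholder unresolved.
            raise KeyError(f"no value supplied for parameter '{name}'")
        return params[name]

    return VAR_PATTERN.sub(substitute, text)


line = "password: $var.my_sql_server_password"
print(resolve_placeholders(line, {"my_sql_server_password": "mypassword"}))
```

Raising an error on a missing parameter is a design choice of this sketch: a test run that silently connects with a literal `$var...` string is harder to diagnose.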

Set up the test cases

Add a folder named “test_cases” containing a YAML file with any name (“example.yaml” in this example). Add the following content:

Test aggregated data:
  options:
    sort:
      - gender
      - domain
  source:
    connection: mysql_demo
    type: mysql
    query: | 
      select gender, right(email, length(email) - position("@" in email)) as domain, count(*) as count
        from users
        group by gender, domain
  expected:
    type: csv
    path: ./data/test_target_agg.csv

Test invalid data:
  source:
    connection: mysql_demo
    type: mysql
    query: | 
      select id, first_name, last_name, email, gender, ip_address
        from users 
        where email like "%%.gov"
  expected:
    type: empty

Run tests

ploosh --connections "connections.yml" --cases "test_cases" --export "JSON" --p_my_sql_server_password "mypassword"
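
In a pipeline, the password is usually read from the environment rather than typed on the command line. The sketch below builds the same command from Python, using only the flags shown above and assuming each `--p_<name>` argument supplies the value for the matching `$var.<name>` placeholder. The function `build_ploosh_command` and the environment variable `MY_SQL_SERVER_PASSWORD` are hypothetical names.

```python
import os
import shlex


def build_ploosh_command(connections, cases, export, params):
    """Return the ploosh command line as an argument list."""
    command = ["ploosh", "--connections", connections, "--cases", cases, "--export", export]
    for name, value in params.items():
        # One --p_<name> argument per parameter.
        command += [f"--p_{name}", value]
    return command


# Hypothetical environment variable; falls back to the placeholder value from the example.
password = os.environ.get("MY_SQL_SERVER_PASSWORD", "mypassword")
command = build_ploosh_command(
    "connections.yml", "test_cases", "JSON", {"my_sql_server_password": password}
)

# Print the command with the secret masked so it never reaches the logs.
print(shlex.join(command[:-1] + ["***"]))

# To actually run the tests (requires ploosh to be installed):
# import subprocess
# subprocess.run(command, check=True)
```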

Test results

[
  {
    "name": "Test aggregated data",
    "state": "passed",
    "source": {
      "start": "2024-02-05T17:08:36Z",
      "end": "2024-02-05T17:08:36Z",
      "duration": 0.0032982
    },
    "expected": {
      "start": "2024-02-05T17:08:36Z",
      "end": "2024-02-05T17:08:36Z",
      "duration": 6.0933333333333335e-05
    },
    "compare": {
      "start": "2024-02-05T17:08:36Z",
      "end": "2024-02-05T17:08:36Z",
      "duration": 0.00046468333333333334
    }
  },
  {
    "name": "Test invalid data",
    "state": "failed",
    "source": {
      "start": "2024-02-05T17:08:36Z",
      "end": "2024-02-05T17:08:36Z",
      "duration": 0.00178865
    },
    "expected": {
      "start": "2024-02-05T17:08:36Z",
      "end": "2024-02-05T17:08:36Z",
      "duration": 1.49e-05
    },
    "compare": {
      "start": "2024-02-05T17:08:36Z",
      "end": "2024-02-05T17:08:36Z",
      "duration": 1.8333333333333333e-07
    },
    "error": {
      "type": "count",
      "message": "The count in source dataset (55) is different than the count in the expected dataset (0)"
    }
  }
]
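
A JSON export like this is easy to post-process, for example to fail a CI job when any case fails. The following sketch assumes the structure shown above ("name", "state", and an optional "error" object) and uses a trimmed inline sample rather than a real export file, since the export location is not specified here. The function `summarize` is a hypothetical helper.

```python
import json

# Trimmed sample with the same fields as the export above.
results_json = """
[
  {"name": "Test aggregated data", "state": "passed"},
  {"name": "Test invalid data", "state": "failed",
   "error": {"type": "count",
             "message": "The count in source dataset (55) is different than the count in the expected dataset (0)"}}
]
"""


def summarize(results):
    """Count cases per state and collect (name, message) for every non-passed case."""
    counts = {}
    failures = []
    for case in results:
        state = case["state"]
        counts[state] = counts.get(state, 0) + 1
        if state != "passed":
            message = case.get("error", {}).get("message", "")
            failures.append((case["name"], message))
    return counts, failures


counts, failures = summarize(json.loads(results_json))
print(counts)
for name, message in failures:
    print(f"{name}: {message}")

# In a CI script, a non-empty failures list could be turned into a failing exit code:
# import sys; sys.exit(1 if failures else 0)
```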

Run with Spark

Tests can also run on Spark. To do so, install the Spark package or use a platform that already provides Spark, such as Databricks or Microsoft Fabric.

See the Spark connector for more information.

Read our blog

ploosh: three key approaches to automating tests in data projects
by Charlie Collier
1 October 2024
Introduction: In previous articles, we introduced Ploosh as an automated testing framework, highlighting its role in preventing regressions and improving the quality of deliveries in complex data projects. We also demonstrated its effectiveness in a data migration context, where it was used to test data flows between a legacy system and a cloud platform. In…

ploosh: how to simplify your migration testing?
by Charlie Collier
17 September 2024
In a previous article, I introduced Ploosh, a tool I developed to facilitate testing in the data domain. Today, I will show you a use case where Ploosh was used to improve efficiency during testing phases. This article presents how Ploosh was implemented in a migration project. In a future article, I will demonstrate how…

ploosh: a framework to automatize tests in data project
by Charlie Collier
13 February 2024
In this article, I will present the issues related to testing in data projects and introduce one of my tools to address them. In a future article, I will return to the various approaches that can be applied as well as some use cases. 1. The problem of automated testing in data: Testing tools for…