What is Ploosh? #

Ploosh is a YAML-based framework for automating tests in data projects. It is designed to be simple to use, easy to integrate into any CI/CD pipeline, and easy to extend with new data connectors.

Connectors #

Type	Native connectors	Spark connectors
Databases
Files
Others
Not yet but soon

Get started #

Steps #

Install Ploosh package
Setup connection file
Setup test cases
Run tests
Get results

Install Ploosh package #

Install from the PyPI package index:

pip install ploosh

Setup connection file #

Add a YAML file named “connections.yml” with the following content:

mysql_demo:
  type: mysql
  hostname: my_server_name.database.windows.net
  database: my_database_name
  username: my_user_name
  # using a parameter is highly recommended
  password: $var.my_sql_server_password
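
The `$var.my_sql_server_password` placeholder above is filled in at run time from a command-line parameter. To make the idea concrete, here is a minimal Python sketch of that kind of substitution. It is not Ploosh's implementation: `resolve_placeholders` and `VAR_PATTERN` are hypothetical names invented for this illustration, and the exact placeholder grammar Ploosh accepts may differ.

```python
import re

# Hypothetical helper, NOT part of Ploosh: it only illustrates the idea of
# resolving "$var.<name>" placeholders from values supplied at run time.
VAR_PATTERN = re.compile(r"\$var\.([A-Za-z_][A-Za-z0-9_]*)")


def resolve_placeholders(text, params):
    """Replace every $var.<name> in text with params[<name>]."""

    def substitute(match):
        name = match.group(1)
        if name not in params:
            # Fail loudly instead of leaving a raw placeholder in a password field.
            raise KeyError(f"missing value for parameter '{name}'")
        return params[name]

    return VAR_PATTERN.sub(substitute, text)


line = "password: $var.my_sql_server_password"
print(resolve_placeholders(line, {"my_sql_server_password": "mypassword"}))
# prints: password: mypassword
```

Keeping the secret out of the connection file means the file can be committed to version control while the actual value comes from the CI/CD secret store.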

Setup test cases #

Add a folder named “test_cases” containing a YAML file with any name (in this example, “example.yaml”), with the following content:

Test aggregated data:
  options:
    sort:
      - gender
      - domain
  source:
    connection: mysql_demo
    type: mysql
    query: | 
      select gender, right(email, length(email) - position("@" in email)) as domain, count(*) as count
        from users
        group by gender, domain
  expected:
    type: csv
    path: ./data/test_target_agg.csv

Test invalid data:
  source:
    connection: mysql_demo
    type: mysql
    query: | 
      select id, first_name, last_name, email, gender, ip_address
        from users 
        where email like "%%.gov"
  expected:
    type: empty
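
Each test case above pairs a `source` dataset with an `expected` one, optionally sorting both on the listed columns before comparing. The Python sketch below shows one plausible way such a comparison could work; it is a conceptual illustration only, not Ploosh's code. The function `compare_datasets` and the `"data"` error type are assumptions made for this example; only the `"count"` error type appears in the sample results on this page.

```python
# Conceptual sketch only: this is NOT how Ploosh is implemented.
def compare_datasets(source_rows, expected_rows, sort_keys=None):
    """Compare two lists of dict rows and return (state, error)."""
    # A differing row count fails immediately, which also covers the
    # "expected: empty" case (an empty expected dataset has zero rows).
    if len(source_rows) != len(expected_rows):
        return "failed", {
            "type": "count",
            "message": (
                f"source has {len(source_rows)} rows, "
                f"expected has {len(expected_rows)}"
            ),
        }

    # Sorting both sides on the same columns makes the check order-independent.
    if sort_keys:
        def sort_key(row):
            return tuple(row[k] for k in sort_keys)

        source_rows = sorted(source_rows, key=sort_key)
        expected_rows = sorted(expected_rows, key=sort_key)

    for index, (src, exp) in enumerate(zip(source_rows, expected_rows)):
        if src != exp:
            # "data" is a hypothetical error type used only in this sketch.
            return "failed", {"type": "data", "message": f"row {index} differs"}

    return "passed", None


source = [
    {"gender": "M", "domain": "b.org", "count": 2},
    {"gender": "F", "domain": "a.com", "count": 3},
]
expected = list(reversed(source))

print(compare_datasets(source, expected, sort_keys=["gender", "domain"]))
print(compare_datasets([{"id": 1}], []))
```

The first call passes because both sides are sorted on `gender` and `domain` before the row-by-row check; the second fails on the count, mirroring what happens when a query that should return nothing returns rows.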

Run tests #

ploosh --connections "connections.yml" --cases "test_cases" --export "JSON" --p_my_sql_server_password "mypassword"
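
In a CI/CD job it is usually preferable not to type the password on the command line. The following Python sketch wraps the command above and reads the password from an environment variable. It uses only the flags shown on this page; the variable name `PLOOSH_SQL_PASSWORD` and the helper names are assumptions made for this example, and whether the `ploosh` exit code reflects failed tests is not stated here, so check the results file as well.

```python
import os
import subprocess


def build_ploosh_command(password):
    """Build the argument list for the ploosh command shown above."""
    return [
        "ploosh",
        "--connections", "connections.yml",
        "--cases", "test_cases",
        "--export", "JSON",
        "--p_my_sql_server_password", password,
    ]


def run_ploosh():
    """Run ploosh with the password taken from the environment."""
    # PLOOSH_SQL_PASSWORD is a hypothetical variable name chosen for this sketch.
    password = os.environ["PLOOSH_SQL_PASSWORD"]
    # Passing a list (no shell=True) avoids quoting problems with special characters.
    completed = subprocess.run(build_ploosh_command(password))
    return completed.returncode
```

A pipeline step could then call `run_ploosh()` and pass its return value to `sys.exit`.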


Test results #

[
  {
    "name": "Test aggregated data",
    "state": "passed",
    "source": {
      "start": "2024-02-05T17:08:36Z",
      "end": "2024-02-05T17:08:36Z",
      "duration": 0.0032982
    },
    "expected": {
      "start": "2024-02-05T17:08:36Z",
      "end": "2024-02-05T17:08:36Z",
      "duration": 6.0933333333333335e-05
    },
    "compare": {
      "start": "2024-02-05T17:08:36Z",
      "end": "2024-02-05T17:08:36Z",
      "duration": 0.00046468333333333334
    }
  },
  {
    "name": "Test invalid data",
    "state": "failed",
    "source": {
      "start": "2024-02-05T17:08:36Z",
      "end": "2024-02-05T17:08:36Z",
      "duration": 0.00178865
    },
    "expected": {
      "start": "2024-02-05T17:08:36Z",
      "end": "2024-02-05T17:08:36Z",
      "duration": 1.49e-05
    },
    "compare": {
      "start": "2024-02-05T17:08:36Z",
      "end": "2024-02-05T17:08:36Z",
      "duration": 1.8333333333333333e-07
    },
    "error": {
      "type": "count",
      "message": "The count in source dataset (55) is different than the count in the expected dataset (0)"
    }
  }
]
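
Because the export is plain JSON, it is straightforward to post-process, for example to print a short summary in a pipeline log. The Python sketch below does this using only the fields visible in the output above (`name`, `state`, `error.message`). The function name `summarize` is hypothetical, the sample is a trimmed copy of the results shown here, and the location of the exported file is not covered on this page, so the sketch works on a string.

```python
import json


def summarize(results_json):
    """Return (passed_count, failures) from a Ploosh-style JSON result string."""
    results = json.loads(results_json)
    # Anything that is not "passed" is treated as a failure in this sketch.
    failures = [
        (case["name"], case.get("error", {}).get("message", ""))
        for case in results
        if case["state"] != "passed"
    ]
    passed = len(results) - len(failures)
    return passed, failures


# Trimmed copy of the results shown above (timing fields omitted).
sample = """
[
  {"name": "Test aggregated data", "state": "passed"},
  {"name": "Test invalid data", "state": "failed",
   "error": {"type": "count",
             "message": "The count in source dataset (55) is different than the count in the expected dataset (0)"}}
]
"""

passed, failures = summarize(sample)
print(passed, "passed,", len(failures), "failed")
for name, message in failures:
    print(f"{name}: {message}")
```

A CI step can exit with a non-zero status whenever `failures` is non-empty, so that a failed data test blocks the pipeline.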

Run with spark #

Tests can also be run with Spark. To do so, install the Spark package or use a platform where it is already installed, such as Databricks or Microsoft Fabric.

See the Spark connector for more information.
