What is Ploosh? #
Ploosh is a YAML-based framework for automating tests in data projects. It is designed to be simple to use, easy to integrate into any CI/CD pipeline, and easy to extend with new data connectors.
Connectors #
Ploosh provides native connectors and Spark connectors for databases, files, and other sources. More connectors of both kinds are planned.
Get started #
Steps #
- Install the Ploosh package
- Set up the connection file
- Set up the test cases
- Run the tests
- Get the results
Install Ploosh package #
Install from PyPI with pip:
pip install ploosh
Set up connection file #
Add a YAML file named “connections.yml” with the following content:
mysql_demo:
  type: mysql
  hostname: my_server_name.database.windows.net
  database: my_database_name
  username: my_user_name
  # using a parameter is highly recommended
  password: $var.my_sql_server_password
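The `$var.my_sql_server_password` placeholder is filled from the `--p_my_sql_server_password` argument passed when the tests are run. As an illustration only (this is not Ploosh's actual implementation, and `resolve_placeholders` is a hypothetical helper), such a substitution could look roughly like this in Python:

```python
import re

# Matches placeholders such as $var.my_sql_server_password
PLACEHOLDER = re.compile(r"\$var\.([A-Za-z_][A-Za-z0-9_]*)")


def resolve_placeholders(text: str, params: dict) -> str:
    """Replace every $var.<name> in text with params[<name>].

    Hypothetical helper for illustration. It raises KeyError when a
    placeholder has no matching parameter, so a missing secret fails
    loudly instead of being sent to the server as a literal string.
    """
    def substitute(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"no value supplied for parameter '{name}'")
        return params[name]

    return PLACEHOLDER.sub(substitute, text)


line = "password: $var.my_sql_server_password"
resolved = resolve_placeholders(line, {"my_sql_server_password": "mypassword"})
print(resolved)
```

Keeping the secret out of the file means “connections.yml” can be committed to version control while the password itself is supplied at run time.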
Set up test cases #
Add a folder named “test_cases” containing a YAML file; the file name is free, and this example uses “example.yaml”. Add the following content:
Test aggregated data:
  options:
    sort:
      - gender
      - domain
  source:
    connection: mysql_demo
    type: mysql
    query: |
      select gender, right(email, length(email) - position("@" in email)) as domain, count(*) as count
      from users
      group by gender, domain
  expected:
    type: csv
    path: ./data/test_target_agg.csv

Test invalid data:
  source:
    connection: mysql_demo
    type: mysql
    query: |
      select id, first_name, last_name, email, gender, ip_address
      from users
      where email like "%%.gov"
  expected:
    type: empty
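To make the two cases concrete: the first lists `gender` and `domain` under `sort`, so row order should not affect the comparison, and the second expects the query to return no rows. The Python sketch below mimics that behaviour on plain lists of dictionaries. It is an illustration under those assumptions, not Ploosh's comparison engine; `compare_datasets`, the `"data"` error label, and the sample rows are all hypothetical.

```python
def compare_datasets(source, expected, sort_keys=None):
    """Return (state, error) for two lists of row dictionaries."""
    # Check the row counts first; a mismatch is reported as a "count" error.
    if len(source) != len(expected):
        return "failed", {
            "type": "count",
            "message": (
                f"The count in source dataset ({len(source)}) is different "
                f"than the count in the expected dataset ({len(expected)})"
            ),
        }
    if sort_keys:
        # Sort both sides by the same columns so row order does not matter.
        def key(row):
            return tuple(row[k] for k in sort_keys)

        source = sorted(source, key=key)
        expected = sorted(expected, key=key)
    if source != expected:
        # "data" is a made-up label for this sketch.
        return "failed", {"type": "data", "message": "Rows differ"}
    return "passed", None


# Same rows in a different order: equal once both sides are sorted.
source_rows = [
    {"gender": "Male", "domain": "example.org", "count": 2},
    {"gender": "Female", "domain": "example.com", "count": 3},
]
expected_rows = [
    {"gender": "Female", "domain": "example.com", "count": 3},
    {"gender": "Male", "domain": "example.org", "count": 2},
]
aggregated = compare_datasets(source_rows, expected_rows, ["gender", "domain"])

# An "empty" expectation behaves like an expected dataset with zero rows.
invalid = compare_datasets([{"id": 1, "email": "someone@agency.gov"}], [])

print(aggregated)
print(invalid)
```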
Run tests #
ploosh --connections "connections.yml" --cases "test_cases" --export "JSON" --p_my_sql_server_password "mypassword"
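In a CI/CD pipeline the password would normally come from a secret store rather than being typed on the command line. The sketch below assembles the same command from Python and reads the secret from an environment variable. The variable name `PLOOSH_SQL_PASSWORD` and the helpers `build_ploosh_command` and `run_ploosh` are assumptions made for illustration; only the flags shown in the command above are used.

```python
import os
import subprocess


def build_ploosh_command(password):
    """Build the argument list for the ploosh invocation shown above."""
    return [
        "ploosh",
        "--connections", "connections.yml",
        "--cases", "test_cases",
        "--export", "JSON",
        "--p_my_sql_server_password", password,
    ]


def run_ploosh():
    """Run ploosh and return its exit code.

    PLOOSH_SQL_PASSWORD is a hypothetical secret name that the CI
    system is assumed to inject into the environment.
    """
    password = os.environ["PLOOSH_SQL_PASSWORD"]
    completed = subprocess.run(build_ploosh_command(password), check=False)
    return completed.returncode


# Show the command with the secret masked; run_ploosh() is not called here.
print(" ".join(build_ploosh_command("***")))
```

A pipeline step could then call `raise SystemExit(run_ploosh())` so that the step reports whatever exit code the command returned.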
Test results #
[
  {
    "name": "Test aggregated data",
    "state": "passed",
    "source": {
      "start": "2024-02-05T17:08:36Z",
      "end": "2024-02-05T17:08:36Z",
      "duration": 0.0032982
    },
    "expected": {
      "start": "2024-02-05T17:08:36Z",
      "end": "2024-02-05T17:08:36Z",
      "duration": 6.0933333333333335e-05
    },
    "compare": {
      "start": "2024-02-05T17:08:36Z",
      "end": "2024-02-05T17:08:36Z",
      "duration": 0.00046468333333333334
    }
  },
  {
    "name": "Test invalid data",
    "state": "failed",
    "source": {
      "start": "2024-02-05T17:08:36Z",
      "end": "2024-02-05T17:08:36Z",
      "duration": 0.00178865
    },
    "expected": {
      "start": "2024-02-05T17:08:36Z",
      "end": "2024-02-05T17:08:36Z",
      "duration": 1.49e-05
    },
    "compare": {
      "start": "2024-02-05T17:08:36Z",
      "end": "2024-02-05T17:08:36Z",
      "duration": 1.8333333333333333e-07
    },
    "error": {
      "type": "count",
      "message": "The count in source dataset (55) is different than the count in the expected dataset (0)"
    }
  }
]
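A JSON export like this is straightforward to post-process, for example to print a summary or to fail a build. The following Python sketch works on an abbreviated copy of the results above. `summarize` is a hypothetical helper, and loading the data from a file is left out because the export location is not stated here.

```python
import json
from collections import Counter

# Abbreviated copy of the export shown above (timing fields omitted).
RESULTS_JSON = """
[
  {"name": "Test aggregated data", "state": "passed"},
  {"name": "Test invalid data", "state": "failed",
   "error": {"type": "count",
             "message": "The count in source dataset (55) is different than the count in the expected dataset (0)"}}
]
"""


def summarize(results):
    """Count results per state and collect the cases that did not pass."""
    counts = Counter(case["state"] for case in results)
    failures = [
        (case["name"], case.get("error", {}).get("message", ""))
        for case in results
        if case["state"] != "passed"
    ]
    return counts, failures


counts, failures = summarize(json.loads(RESULTS_JSON))
print(dict(counts))
for name, message in failures:
    print(f"{name}: {message}")
```

Treating every state other than `passed` as a failure is a deliberately conservative choice for this sketch, since only the `passed` and `failed` states appear in the example.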
Run with Spark #
Tests can also be run with Spark. To do so, either install the Spark package or use a platform where it is already installed, such as Databricks or Microsoft Fabric.
See the Spark connector for more information.