ploosh.
Documentation
Data quality checks
Ploosh can be used for operational data quality monitoring, ensuring data consistency, completeness, and correctness across your data platform.
Approaches
Absence of anomalies (empty approach)
Use the empty connector to verify that no problematic data exists:
Test no NULL emails:
source:
connection: dwh
type: mssql
query: |
SELECT *
FROM dim_customer
WHERE email IS NULL AND is_active = 1
expected:
type: emptyTest no duplicate customer IDs:
source:
connection: dwh
type: mssql
query: |
SELECT customer_id, COUNT(*) AS cnt
FROM dim_customer
GROUP BY customer_id
HAVING COUNT(*) > 1
expected:
type: empty
Test no future dates:
source:
connection: dwh
type: mssql
query: |
SELECT *
FROM fact_orders
WHERE order_date > GETDATE()
expected:
type: empty
Completeness checks
Verify that record counts match across layers:
Test row count consistency:
source:
connection: raw
type: mssql
query: |
SELECT sourcetable, COUNT(*) AS rowcount
FROM raw.load_log
WHERE load_date = CAST(GETDATE() AS DATE)
GROUP BY source_table
ORDER BY source_table
expected:
connection: dwh
type: mssql
query: |
SELECT sourcetable, COUNT(*) AS rowcount
FROM dwh.load_log
WHERE load_date = CAST(GETDATE() AS DATE)
GROUP BY source_table
ORDER BY source_table
Referential integrity
Verify that foreign key relationships are respected:
Test no orphan orders:
source:
connection: dwh
type: mssql
query: |
SELECT f.orderid, f.productid
FROM fact_orders f
LEFT JOIN dimproduct p ON f.productid = p.product_id
WHERE p.product_id IS NULL
expected:
type: empty
Range validation
Verify that values fall within expected ranges:
Test valid percentages:
source:
connection: dwh
type: mssql
query: |
SELECT *
FROM fact_metrics
WHERE successrate < 0 OR successrate > 1
expected:
type: empty
Spark mode for Fabric
The same quality checks work in Microsoft Fabric with Spark connectors:
Test no NULL customer IDs (Fabric):
source:
type: sql_spark
query: |
SELECT *
FROM lakehouse.dim_customer
WHERE customer_id IS NULL
expected:
type: empty_spark
See Testing a Fabric data platform for a complete Fabric use case.
Scheduling quality checks
For operational monitoring, schedule quality checks to run regularly:
- Daily: After nightly batch processing completes
- After each pipeline run: As a post-processing step
- On demand: When investigating data issues
Use the disabled option to temporarily skip checks that are under investigation:
Temporarily disabled check:
disabled: true
source:
connection: dwh
type: mssql
query: SELECT * FROM problematic_table
expected:
type: empty