ploosh.
Documentation
Testing a Fabric data platform
This guide demonstrates a complete use case of Ploosh in Microsoft Fabric: validating workloads that write data to Lakehouse tables and events to KQL databases, distributed across multiple workspaces.
Architecture
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Ingestion WS │ │ Transform WS │ │ Reporting WS │
│ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │
│ │ Raw │ │ │ │ DW │ │ │ │ Datamart │ │
│ │ Lakehouse │ │ │ │ Lakehouse │ │ │ │ Lakehouse │ │
│ └───────────┘ │ │ └───────────┘ │ │ └───────────┘ │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
│ shortcuts │ shortcuts │
└───────────────┬───────┴───────────────────────┘
│
┌──────────▼──────────┐
│ Ploosh WS │
│ ┌───────────────┐ │
│ │ Ploosh │ │
│ │ Lakehouse │ │
│ │ ├ cases/ │ │
│ │ ├ outputs/ │ │
│ │ └ shortcuts/ │ │
│ ├───────────────┤ │
│ │ Notebook │ │
│ │ Semantic Model│ │
│ │ PBI Report │ │
│ └───────────────┘ │
└─────────────────────┘
Test scenarios
1. Cross-layer validation
Verify that the transformation layer correctly aggregates raw data:
Test employee aggregation:
options:
sort:
- department
source:
type: sql_spark
query: |
SELECT department, COUNT(*) AS employee_count
FROM raw_lakehouse.employees
GROUP BY department
expected:
type: sql_spark
query: |
SELECT department, employee_count
FROM dwlakehouse.dimdepartment_summary
2. KQL event completeness
Verify that pipeline events are correctly logged:
Test pipeline events logged:
source:
type: fabrickqlspark
connection: kql_events
query: |
PipelineEvents
| where Timestamp > ago(1d)
| summarize event_count = count() by PipelineName
| order by PipelineName asc
expected:
type: sql_spark
query: |
SELECT pipelinename AS PipelineName, expectedcount AS event_count
FROM plooshresources.expecteddaily_events
ORDER BY PipelineName ASC
3. Data quality checks
Detect anomalies using the empty connector:
Test no NULL customer IDs:
source:
type: sql_spark
query: |
SELECT *
FROM dwlakehouse.factorders
WHERE customer_id IS NULL
expected:
type: empty_sparkTest no duplicate orders:
source:
type: sql_spark
query: |
SELECT order_id, COUNT(*) AS cnt
FROM dwlakehouse.factorders
GROUP BY order_id
HAVING cnt > 1
expected:
type: empty_spark
4. Reference data validation
Compare Lakehouse data against known reference files:
Test product categories:
options:
sort:
- category_id
source:
type: sql_spark
query: |
SELECT categoryid, categoryname, is_active
FROM dwlakehouse.dimproduct_category
expected:
type: csv_spark
path: /lakehouse/default/Files/plooshresources/expectedcategories.csv
header: true
inferSchema: true
Connections file
kql_events:
type: fabrickqlspark
connection_mode: native
kusto_uri: https://mycluster.kusto.windows.net
databaseid: eventsdatabase
Spark SQL queries against Lakehouse tables (via shortcuts) do not require a connection definition.
Results exploitation
After execution, the results are:
- Exported as JSON for immediate review
- Persisted to a Delta table (
ploosh_results) for historical tracking - Visualized in Power BI through a Semantic Model
See Fabric reporting for dashboard setup details.
End-to-end automation
- Upstream pipeline completes data processing
- Fabric Pipeline triggers the Ploosh notebook
- Ploosh executes all test cases and exports results
- History tracking appends results to the Delta table
- Power BI refreshes and shows updated quality metrics
- Alerts notify the team if tests fail