Spark mode overview

Ploosh supports two execution modes: Native (Pandas) and Spark (PySpark). Spark mode is designed to run within distributed environments like Microsoft Fabric, Databricks, or a local Spark session, enabling validation of large-scale datasets.

Why Spark mode?

Benefit	Description
Distributed processing	Leverage Spark clusters to validate large volumes of data
Native platform access	Query Lakehouse tables, KQL databases, and Delta files directly
No data movement	Data stays within the platform, avoiding costly exports
Integrated execution	Run tests directly from notebooks alongside your data pipelines

When to use Spark mode?

Scenario	Recommended mode
CI/CD pipeline on a build agent	Native
Local development with small datasets	Native
Microsoft Fabric notebooks	Spark
Databricks notebooks	Spark
Large datasets (millions of rows)	Spark
Querying Lakehouse/KQL/Delta files on a cluster	Spark

Spark connectors

Spark mode uses dedicated connectors. You cannot mix Spark and native connectors in the same test case.

Connector	Type	Description
`csv_spark`	File	Read CSV files via Spark
`json_spark`	File	Read JSON files via Spark
`parquet_spark`	File	Read Parquet files via Spark
`delta_spark`	File	Read Delta tables via Spark
`sql_spark`	Query	Execute Spark SQL queries
`fabric_kql_spark`	Database	Query Fabric KQL databases
`dremio_spark`	Database	Query Dremio via Arrow Flight SQL
`empty_spark`	Utility	Return an empty DataFrame
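
The rule against mixing connectors can be checked before a run. The sketch below is a hypothetical pre-flight helper, not part of Ploosh: it assumes every Spark connector type ends in `_spark` (as the names in this section do) and that a test case is a dictionary with `source` and `expected` entries. The native connector name `csv` is used only for illustration.

```python
SIDES = ("source", "expected")


def engine_of(connector_type: str) -> str:
    """Classify a connector type as 'spark' or 'native' by its suffix (assumed convention)."""
    return "spark" if connector_type.endswith("_spark") else "native"


def check_case(name: str, case: dict) -> str:
    """Return the engine used by a test case, or raise if source and expected differ."""
    engines = {engine_of(case[side]["type"]) for side in SIDES}
    if len(engines) != 1:
        raise ValueError(f"{name}: source and expected must use the same engine")
    return engines.pop()


# Both sides use Spark connectors, so this case is accepted.
spark_case = {"source": {"type": "sql_spark"}, "expected": {"type": "csv_spark"}}

# 'csv' stands in for a native connector here (illustrative name), so this case is rejected.
mixed_case = {"source": {"type": "sql_spark"}, "expected": {"type": "csv"}}

spark_engine = check_case("spark case", spark_case)

try:
    check_case("mixed case", mixed_case)
    mixed_rejected = False
except ValueError:
    mixed_rejected = True
```

A check like this could run over parsed YAML files in a CI step to catch a mixed case before a Spark session is started.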

Spark comparison engine

The Spark comparison engine supports two comparison modes:

Mode	Description
order (default)	Rows are matched by position using a `row_number()` window function
join	Rows are matched by specified `join_keys` columns (Spark only)

The join mode is particularly useful when row ordering is not deterministic or when matching by business keys is more appropriate.

My test case:
  options:
    compare_mode: join
    join_keys:
      - employee_id
  source:
    type: sql_spark
    query: SELECT * FROM lakehouse.employees
  expected:
    type: csv_spark
    path: /lakehouse/default/Files/expected/employees.csv
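
To make the difference between the two modes concrete, here is a plain-Python sketch of the two matching strategies. It is an illustration only and not Ploosh's Spark implementation: rows are modeled as dictionaries, the function names are hypothetical, and the key columns are assumed to identify each row uniquely.

```python
def compare_by_order(source, expected):
    """Pair rows by position, as if each side were numbered 1..n and matched on that number."""
    if len(source) != len(expected):
        return False
    return all(s == e for s, e in zip(source, expected))


def compare_by_keys(source, expected, join_keys):
    """Pair rows that share the same values in the join_keys columns (keys assumed unique)."""
    if len(source) != len(expected):
        return False

    def index(rows):
        return {tuple(row[k] for k in join_keys): row for row in rows}

    return index(source) == index(expected)


# Same rows on both sides, but returned in a different order.
source = [
    {"employee_id": 2, "name": "Bo"},
    {"employee_id": 1, "name": "Ann"},
]
expected = [
    {"employee_id": 1, "name": "Ann"},
    {"employee_id": 2, "name": "Bo"},
]

order_match = compare_by_order(source, expected)
join_match = compare_by_keys(source, expected, ["employee_id"])
print(order_match, join_match)  # prints "False True"
```

Positional matching reports a mismatch because the first rows differ, while key-based matching pairs each `employee_id` with its counterpart and finds the data identical.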

Calling Ploosh from Python

In Spark mode, Ploosh is called programmatically from Python using the execute_cases() function:

from ploosh import execute_cases

execute_cases(
    cases="/path/to/cases",
    connections="/path/to/connections.yaml",
    spark_session=spark,
    filter="*.yaml",
    path_output="/path/to/output"
)

See the Python API reference for full details.
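
Notebooks on Fabric or Databricks usually provide a ready-made `spark` variable. Outside a notebook the session has to be created first. The following is a minimal sketch for a local run, assuming `pyspark` and `ploosh` are installed; the wrapper name `run_ploosh_locally`, the application name, and the `local[*]` master are illustrative choices, and the `execute_cases` arguments are the ones shown above.

```python
def run_ploosh_locally(cases_dir, connections_file, output_dir, pattern="*.yaml"):
    """Create a local SparkSession and hand it to Ploosh (hypothetical wrapper)."""
    # Imported here so that defining the function does not require Spark.
    from pyspark.sql import SparkSession
    from ploosh import execute_cases

    # local[*] runs Spark in-process using all available cores.
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("ploosh-local")
        .getOrCreate()
    )
    try:
        execute_cases(
            cases=cases_dir,
            connections=connections_file,
            spark_session=spark,
            filter=pattern,
            path_output=output_dir,
        )
    finally:
        # Release local Spark resources once the run ends.
        spark.stop()
```

It would be called as `run_ploosh_locally("/path/to/cases", "/path/to/connections.yaml", "/path/to/output")`. Reading Delta tables locally typically needs extra session configuration, which this sketch leaves out.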

Platform-specific guides

Microsoft Fabric setup — Complete guide for Fabric
Fabric notebook orchestration — Notebook implementation
Fabric shortcuts strategy — Cross-workspace data access
Fabric reporting — Power BI dashboards on test results
Databricks setup — Running Ploosh on Databricks
Local Spark — Running Ploosh with a local SparkSession
