Spark mode overview

Ploosh supports two execution modes: Native (Pandas) and Spark (PySpark). Spark mode is designed to run within distributed environments like Microsoft Fabric, Databricks, or a local Spark session, enabling validation of large-scale datasets.

Why Spark mode?

| Benefit | Description |
| --- | --- |
| Distributed processing | Leverage Spark clusters to validate large volumes of data |
| Native platform access | Query Lakehouse tables, KQL databases, and Delta files directly |
| No data movement | Data stays within the platform, avoiding costly exports |
| Integrated execution | Run tests directly from notebooks alongside your data pipelines |

When to use Spark mode?

| Scenario | Recommended mode |
| --- | --- |
| CI/CD pipeline on a build agent | Native |
| Local development with small datasets | Native |
| Microsoft Fabric notebooks | Spark |
| Databricks notebooks | Spark |
| Large datasets (millions of rows) | Spark |
| Querying Lakehouse/KQL/Delta files on a cluster | Spark |

Spark connectors

Spark mode uses dedicated connectors. You cannot mix Spark and native connectors in the same test case.

| Connector | Type | Description |
| --- | --- | --- |
| csv_spark | File | Read CSV files via Spark |
| json_spark | File | Read JSON files via Spark |
| parquet_spark | File | Read Parquet files via Spark |
| delta_spark | File | Read Delta tables via Spark |
| sql_spark | Query | Execute Spark SQL queries |
| fabric_kql_spark | Database | Query Fabric KQL databases |
| dremio_spark | Database | Query Dremio via Arrow Flight SQL |
| empty_spark | Utility | Return an empty DataFrame |
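
The no-mixing rule can be checked before a run. The following is a minimal sketch, not part of Ploosh: `find_mixed_cases` and `is_spark_type` are hypothetical helpers, and the check assumes that Spark connector types always end in `_spark` (as `sql_spark`, `csv_spark`, and `empty_spark` do) and that each side of a test case carries a `type` key. The native connector name `csv` is used only as a placeholder.

```python
# Hypothetical helpers, not part of Ploosh.
# Assumption: Spark connector types end in "_spark"; anything else is native.

def is_spark_type(connector_type):
    """Return True if the connector type name looks like a Spark connector."""
    return connector_type.endswith("_spark")


def find_mixed_cases(cases):
    """Return the names of cases whose source and expected sides differ in mode."""
    mixed = []
    for name, case in cases.items():
        source_is_spark = is_spark_type(case["source"]["type"])
        expected_is_spark = is_spark_type(case["expected"]["type"])
        if source_is_spark != expected_is_spark:
            mixed.append(name)
    return mixed


# Test cases as they might look once parsed from YAML into dictionaries.
cases = {
    "all spark": {
        "source": {"type": "sql_spark"},
        "expected": {"type": "csv_spark"},
    },
    "mixed": {
        "source": {"type": "sql_spark"},
        "expected": {"type": "csv"},  # placeholder for a native connector
    },
}

mixed_cases = find_mixed_cases(cases)
print(mixed_cases)  # ['mixed']
```

A check like this could run in a notebook cell before the test run so that a mixed case is caught early rather than at execution time.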

Spark comparison engine

The Spark compare engine supports two comparison modes:

| Mode | Description |
| --- | --- |
| order (default) | Rows are matched by position using a row_number() window function |
| join | Rows are matched on the columns listed in join_keys (Spark only) |

The join mode is particularly useful when row ordering is not deterministic or when matching by business keys is more appropriate.

My test case:
  options:
    compare_mode: join
    join_keys:
      - employee_id
  source:
    type: sql_spark
    query: SELECT * FROM lakehouse.employees
  expected:
    type: csv_spark
    path: /lakehouse/default/Files/expected/employees.csv
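
To see why the choice of mode matters, the two matching strategies can be sketched in plain Python. This is an illustration only, not Ploosh's implementation (which operates on Spark DataFrames); `match_by_order` and `match_by_keys` are hypothetical names, and the rows are made-up sample data.

```python
# Illustrative only: a plain-Python analogue of the two matching strategies.

def match_by_order(source, expected):
    """Pair rows by position and return the positions that differ."""
    mismatches = []
    for position in range(max(len(source), len(expected))):
        src = source[position] if position < len(source) else None
        exp = expected[position] if position < len(expected) else None
        if src != exp:
            mismatches.append(position)
    return mismatches


def match_by_keys(source, expected, join_keys):
    """Pair rows by their join key values and return the keys that differ."""
    def key_of(row):
        return tuple(row[k] for k in join_keys)

    src_by_key = {key_of(r): r for r in source}
    exp_by_key = {key_of(r): r for r in expected}
    # A key is a mismatch if it is missing on either side or the rows differ.
    all_keys = sorted(set(src_by_key) | set(exp_by_key))
    return [k for k in all_keys if src_by_key.get(k) != exp_by_key.get(k)]


# Same rows on both sides, but returned in a different order.
source = [
    {"employee_id": 2, "name": "Bo"},
    {"employee_id": 1, "name": "Ada"},
]
expected = [
    {"employee_id": 1, "name": "Ada"},
    {"employee_id": 2, "name": "Bo"},
]

order_mismatches = match_by_order(source, expected)
join_mismatches = match_by_keys(source, expected, ["employee_id"])
print(order_mismatches)  # [0, 1]
print(join_mismatches)   # []
```

With identical rows in a different order, positional matching reports every row as a mismatch while key-based matching reports none. This is the situation the join mode is meant for.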

Calling Ploosh from Python

In Spark mode, Ploosh is called programmatically from Python using the execute_cases() function:

from ploosh import execute_cases

# `spark` is the active SparkSession; Fabric and Databricks notebooks predefine it.
execute_cases(
    cases="/path/to/cases",
    connections="/path/to/connections.yaml",
    spark_session=spark,
    filter="*.yaml",
    path_output="/path/to/output",
)

See the Python API reference for full details.

Platform-specific guides