The DataFrame equality take a look at features have been launched in Apache Sparkā¢ 3.5 and Databricks Runtime 14.2 to simplify PySpark unit testing. The total set of capabilities described on this weblog submit shall be accessible beginning with the upcoming Apache Spark 4.0 and Databricks Runtime 14.3.
Write extra assured DataFrame transformations with DataFrame equality take a look at features
Working with information in PySpark entails making use of transformations, aggregations, and manipulations to DataFrames. As transformations accumulate, how are you going to be assured that your code works as anticipated? PySpark equality take a look at utility features present an environment friendly and efficient option to examine your information towards anticipated outcomes, serving to you determine surprising variations and catch errors early within the evaluation course of. What’s extra, they return intuitive data pinpointing exactly the variations so you’ll be able to take motion instantly with out spending a variety of time debugging.
Utilizing DataFrame equality take a look at features
Two equality take a look at features for PySpark DataFrames have been launched in Apache Spark 3.5: assertDataFrameEqual
and assertSchemaEqual
. Let’s check out use every of them.
assertDataFrameEqual
: This operate lets you evaluate two PySpark DataFrames for equality with a single line of code, checking whether or not the info and schemas match. It returns descriptive data when there are variations.
Let’s stroll by way of an instance. First, we’ll create two DataFrames, deliberately introducing a distinction within the first row:
df_expected = spark.createDataFrame(information=[("Alfred", 1500), ("Alfred", 2500), ("Anna",
500), ("Anna", 3000)], schema=["name", "amount"])
df_actual = spark.createDataFrame(information=[("Alfred", 1200), ("Alfred", 2500), ("Anna", 500),
("Anna", 3000)], schema=["name", "amount"])
Then we’ll name assertDataFrameEqual
with the 2 DataFrames:
from pyspark.testing import assertDataFrameEqual
assertDataFrameEqual(df_actual, df_expected)
The operate returns a descriptive message indicating that the primary row within the two DataFrames is totally different. On this instance, the primary quantities listed for Alfred on this row will not be the identical (anticipated: 1500, precise: 1200):
With this data, you instantly know the issue with the DataFrame your code generated and may goal your debugging primarily based on that.
The operate additionally has a number of choices to manage the strictness of the DataFrame comparability with the intention to alter it in accordance with your particular use circumstances.
assertSchemaEqual
: This operate compares solely the schemas of two DataFrames; it doesn’t evaluate row information. It enables you to validate whether or not the column names, information sorts, and nullable property are the identical for 2 totally different DataFrames.
Let us take a look at an instance. First, we’ll create two DataFrames with totally different schemas:
schema_actual = "title STRING, quantity DOUBLE"
data_expected = [["Alfred", 1500], ["Alfred", 2500], ["Anna", 500], ["Anna", 3000]]
data_actual = [["Alfred", 1500.0], ["Alfred", 2500.0], ["Anna", 500.0], ["Anna", 3000.0]]
df_expected = spark.createDataFrame(information = data_expected)
df_actual = spark.createDataFrame(information = data_actual, schema = schema_actual)
Now, let’s name assertSchemaEqual
with these two DataFrame schemas:
from pyspark.testing import assertSchemaEqual
assertSchemaEqual(df_actual.schema, df_expected.schema)
The operate determines that the schemas of the 2 DataFrames are totally different, and the output signifies the place they diverge:
On this instance, there are two variations: the info kind of the quantity
column is LONG
within the precise DataFrame however DOUBLE
within the anticipated DataFrame, and since we created the anticipated DataFrame with out specifying a schema, the column names are additionally totally different.
Each of those variations are highlighted within the operate output, as illustrated right here.
assertPandasOnSparkEqual
isn’t lined on this weblog submit since it’s deprecated from Apache Spark 3.5.1 and scheduled to be eliminated within the upcoming Apache Spark 4.0.0. For testing Pandas API on Spark, see Pandas API on Spark equality take a look at features.
Structured output for debugging variations in PySpark DataFrames
Whereas the assertDataFrameEqual
and assertSchemaEqual
features are primarily aimed toward unit testing, the place you sometimes use smaller datasets to check your PySpark features, you would possibly use them with DataFrames with greater than only a few rows and columns. In such situations, you’ll be able to simply retrieve the row information for rows which are totally different to make additional debugging simpler.
Let’s check out how to do this. We’ll use the identical information we used earlier to create two DataFrames:
df_expected = spark.createDataFrame(information=[("Alfred", 1500), ("Alfred", 2500),
("Anna", 500), ("Anna", 3000)], schema=["name", "amount"])
df_actual = spark.createDataFrame(information=[("Alfred", 1200), ("Alfred", 2500), ("Anna",
500), ("Anna", 3000)], schema=["name", "amount"])
And now we’ll seize the info that differs between the 2 DataFrames from the assertion error objects after calling assertDataFrameEqual
:
from pyspark.testing import assertDataFrameEqual
from pyspark.errors import PySparkAssertionError
attempt:
assertDataFrameEqual(df_actual, df_expected, includeDiffRows=True)
besides PySparkAssertionError as e:
# `e.information` right here seems to be like:
# [(Row(name='Alfred', amount=1200), Row(name='Alfred', amount=1500))]
spark.createDataFrame(e.information, schema=["Actual", "Expected"]).present()
Making a DataFrame primarily based on the rows which are totally different and exhibiting it, as we have performed on this instance, illustrates how straightforward it’s to entry this data:
As you’ll be able to see, data on the rows which are totally different is straight away accessible for additional evaluation. You not have to put in writing code to extract this data from the precise and anticipated DataFrames for debugging functions.
This function shall be accessible from the upcoming Apache Spark 4.0 and DBR 14.3.
Pandas API on Spark equality take a look at features
Along with the features for testing the equality of PySpark DataFrames, Pandas API on Spark customers may have entry to the next DataFrame equality take a look at features:
assert_frame_equal
assert_series_equal
assert_index_equal
The features present choices for controlling the strictness of comparisons and are nice for unit testing your Pandas API on Spark DataFrames. They supply the very same API because the pandas take a look at utility features, so you should use them with out altering current pandas take a look at code that you simply need to run utilizing Pandas API on Spark.
Listed here are a few examples demonstrating using assert_frame_equal
with totally different parameters, evaluating Pandas API on Spark DataFrames:
from pyspark.pandas.testing import assert_frame_equal
import pyspark.pandas as ps
# Create two barely totally different Pandas API on Spark DataFrames
df1 = ps.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
df2 = ps.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}) # 'b' column as integers
# Validate DataFrame equality with strict information kind checking
assert_frame_equal(df1, df2, check_dtype=True)
On this instance, the schemas of the 2 DataFrames are totally different. The operate output lists the variations, as proven right here:
We will specify that we would like the operate to check column information even when the columns shouldn’t have the identical information kind utilizing the check_dtype
argument, as on this instance:
# DataFrames are equal with check_dtype=False
assert_frame_equal(df1, df2, check_dtype=False)
Since we specified that assert_frame_equal
ought to ignore column information sorts, it now considers the 2 DataFrames equal.
These features additionally enable comparisons between Pandas API on Spark objects and pandas objects, facilitating compatibility checks between totally different DataFrame libraries, as illustrated on this instance:
import pandas as pd
# Pandas DataFrame
df_pandas = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
# Evaluating Pandas API on Spark DataFrame with the Pandas DataFrame
assert_frame_equal(df1, df_pandas)
# Evaluating Pandas API on Spark Collection with the Pandas Collection
assert_series_equal(df1.a, df_pandas.a)
# Evaluating Pandas API on Spark Index with the Pandas Index
assert_index_equal(df1.index, df_pandas.index)
Utilizing the brand new PySpark DataFrame and Pandas API on Spark equality take a look at features is an effective way to ensure your PySpark code works as anticipated. These features enable you to not solely catch errors but in addition perceive precisely what has gone incorrect, enabling you to rapidly and simply determine the place the issue is. Take a look at the Testing PySpark web page for extra data.
These features shall be accessible from the upcoming Apache Spark 4.0. DBR 14.2 already helps it.