Statistical Data Testing with Pandera
October 8, 2021
Overview
Pandera is a lightweight library for performing data validation on dataframes. Data validation is especially important where pre-conditions and assumptions must hold as data flows from input to output; validating them explicitly keeps data pipelines readable and robust.
Key Features
- Varying validation strictness
- Built-in and custom validation functions
- Informative error messages
- Integration with existing pipelines via validator decorators
Usage
To install Pandera:
pip install pandera
Pandera has an object-based API for data validation that revolves around creating validation schemas as objects of the DataFrameSchema class. The DataFrameSchema object specifies the columns that must be present and the checks that must pass for the data to be considered valid. For example:
- Define a dataframe that needs to be validated:
import pandas as pd
import pandera as pa

df = pd.DataFrame(
    {
        "Name": ["Bitcoin", "Ethereum", "Solana"],
        "Price": [54292.87, 3604.29, 165.31],
        "Market Cap": [1024978397427, 424826635487, 49515270730],
    }
)
- Define a schema with the required columns and checks:
schema = pa.DataFrameSchema(
    {
        "Name": pa.Column(str, checks=pa.Check(lambda s: s.str.istitle())),
        "Price": pa.Column(float, checks=pa.Check.ge(0)),
        "Market Cap": pa.Column(int, checks=pa.Check.ge(0)),
    }
)
In the schema above, Name is a required column whose values must be of type str and must follow title case (i.e., each word starts with a capital letter).
- Validate the dataframe using the schema:
validated_df = schema(df)
print(validated_df)
This validates and prints the dataframe (calling the schema object is shorthand for schema.validate(df)):
Name Price Market Cap
0 Bitcoin 54292.87 1024978397427
1 Ethereum 3604.29 424826635487
2 Solana 165.31 49515270730
If the input dataframe doesn't have the required columns, or the checks don't pass, a SchemaError is raised. For example:
df = pd.DataFrame(
    {
        "Name": ["bitcoin", "Ethereum", "SoLaNa"],
        "Price": [54292.87, 3604.29, "165.31"],
        "Market Cap": [1024978397427, 424826635487.99, 49515270730],
    }
)
Here, bitcoin and SoLaNa don't follow title case (the Price and Market Cap columns also contain wrongly typed values). When the dataframe is validated against the schema, a SchemaError is raised on the first failing check:
validated_df = schema(df)
print(validated_df)
Which returns:
SchemaError: <Schema Column(name=Name, type=DataType(str))> failed element-wise validator 0:
<Check <lambda>>
failure cases:
index failure_case
0 0 bitcoin
1 2 SoLaNa
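By default, validation stops at the first failing check. To collect every failure before raising, pandera also supports lazy validation; a minimal sketch:
# lazy validation: gather all failure cases instead of stopping at the first
try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print(err.failure_cases)  # a dataframe listing every failing check and value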
DataFrameSchema Object
The most basic DataFrameSchema object has a columns parameter, where the keys are the column names and the values are the Column objects that specify the data type and properties of each column. By default, columns in the dataframe that aren't specified in the schema are allowed (and ignored), column ordering doesn't matter, and joint uniqueness isn't enforced.
The complete set of parameters (and their defaults) for the DataFrameSchema object:
class pandera.schemas.DataFrameSchema(
    columns=None,
    checks=None,
    index=None,
    dtype=None,
    coerce=False,
    strict=False,
    name=None,
    ordered=False,
    unique=None
)
Parameters of interest:
- columns - specify the mapping of column names to column schema objects (see example in Column Object)
- checks - specify the checks used to verify validity of the dataframe (see example in Check Object)
- coerce - specify whether the data types of each column in the dataframe should be coerced to match the data types specified by each column in the schema
# coerce=True: ✅ coerces int values to float values
df = pd.DataFrame({"col_1": [1, 2, 3]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(float)
    },
    coerce=True
)
schema.validate(df)

# coerce=False: ❌ raises SchemaError (int column where float is expected)
df = pd.DataFrame({"col_1": [1, 2, 3]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(float)
    },
    coerce=False
)
schema.validate(df)
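To confirm the coercion actually happened, inspect the dtype of the validated dataframe; a quick sketch repeating the coerce=True case:
schema = pa.DataFrameSchema({"col_1": pa.Column(float)}, coerce=True)
validated = schema.validate(pd.DataFrame({"col_1": [1, 2, 3]}))
print(validated["col_1"].dtype)  # float64: the int values were coerced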
- strict - specify whether columns that exist in the dataframe but not in the schema should raise an error, be ignored, or be dropped
# strict=True: ❌ raises SchemaError (`col_2` isn't in the schema)
df = pd.DataFrame({"col_1": [1, 2, 3], "col_2": [4, 5, 6]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int)
    },
    strict=True
)
schema.validate(df)

# strict=False: ✅ ignores `col_2` on validation
df = pd.DataFrame({"col_1": [1, 2, 3], "col_2": [4, 5, 6]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int)
    },
    strict=False
)
schema.validate(df)

# strict="filter": ✅ drops `col_2` on validation
df = pd.DataFrame({"col_1": [1, 2, 3], "col_2": [4, 5, 6]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int)
    },
    strict="filter"
)
schema.validate(df)
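The effect of strict="filter" is visible in the columns of the validated dataframe; a quick sketch repeating the filter case:
schema = pa.DataFrameSchema({"col_1": pa.Column(int)}, strict="filter")
validated = schema.validate(pd.DataFrame({"col_1": [1, 2, 3], "col_2": [4, 5, 6]}))
print(list(validated.columns))  # ['col_1']: `col_2` was dropped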
- ordered - specify whether the columns must appear in the dataframe in the same order as in the schema
# ordered=True: ❌ raises SchemaError (columns appear as col_2, col_1)
df = pd.DataFrame({"col_2": [4, 5, 6], "col_1": [1, 2, 3]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int),
        "col_2": pa.Column(int)
    },
    ordered=True
)
schema.validate(df)

# ordered=False: ✅ ignores order on validation
df = pd.DataFrame({"col_2": [4, 5, 6], "col_1": [1, 2, 3]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int),
        "col_2": pa.Column(int)
    },
    ordered=False
)
schema.validate(df)
- unique - specify a list of columns whose values must be jointly unique across rows
# unique=["col_1", "col_2"]: ✅ rows are unique across the columns of interest
df = pd.DataFrame({"col_1": [1, 2, 3], "col_2": [4, 5, 6]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int),
        "col_2": pa.Column(int)
    },
    unique=["col_1", "col_2"]
)
schema.validate(df)

# unique=["col_1", "col_2"]: ✅ `col_1` has duplicates, but the (col_1, col_2) pairs are unique
df = pd.DataFrame({"col_1": [1, 1, 1], "col_2": [4, 5, 6]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int),
        "col_2": pa.Column(int)
    },
    unique=["col_1", "col_2"]
)
schema.validate(df)

# unique=["col_1", "col_2"]: ✅ identical columns, but the (col_1, col_2) pairs are unique
df = pd.DataFrame({"col_1": [1, 2, 3], "col_2": [1, 2, 3]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int),
        "col_2": pa.Column(int)
    },
    unique=["col_1", "col_2"]
)
schema.validate(df)

# unique=["col_1", "col_2"]: ❌ raises SchemaError (rows 0 and 2 are duplicate pairs)
df = pd.DataFrame({"col_1": [1, 2, 1], "col_2": [1, 2, 1]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int),
        "col_2": pa.Column(int)
    },
    unique=["col_1", "col_2"]
)
schema.validate(df)
Column Object
The most basic Column object takes a dtype parameter that specifies the required data type of the values in that column. By default, a column specified in the schema must appear in the dataframe, null values are not allowed, and uniqueness is not checked.
The complete set of parameters (and their defaults) for the Column object:
class pandera.schema_components.Column(
    dtype=None,
    checks=None,
    nullable=False,
    unique=False,
    coerce=False,
    required=True,
    name=None,
    regex=False
)
Parameters of interest:
- checks - specify the checks used to verify validity of the column (see example in Check Object)
- nullable - specify whether the column can contain null values
import numpy as np

# nullable=True: ✅ ignores null values on validation
df = pd.DataFrame({"col_1": [1, 2, np.nan]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(float, nullable=True)
    }
)
schema.validate(df)

# nullable=False: ❌ raises SchemaError (null value present)
df = pd.DataFrame({"col_1": [1, 2, np.nan]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(float, nullable=False)
    }
)
schema.validate(df)
Note the special case of integer columns, which can't hold nan values (pandas stores such a column as float). To avoid this, specify the data type as float.
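A minimal sketch of this edge case: a column declared as int fails validation once it contains a nan, because pandas stores the column as float64.
# dtype=int with nan: ❌ raises SchemaError (pandas stores the column as float64)
df = pd.DataFrame({"col_1": [1, 2, np.nan]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int, nullable=True)
    }
)
schema.validate(df)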
- unique - specify whether column values should be unique (see the sketch after the required examples below)
- coerce - specify whether the data type of the column should be coerced to match the data type specified by the schema
- required - specify whether the column is allowed to be missing
# required=True: ❌ raises SchemaError (`col_2` is missing from the dataframe)
df = pd.DataFrame({"col_1": [1, 2, 3]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int),
        "col_2": pa.Column(int, required=True)
    }
)
schema.validate(df)

# required=False: ✅ ignores missing columns that aren't required on validation
df = pd.DataFrame({"col_1": [1, 2, 3]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int),
        "col_2": pa.Column(int, required=False)
    }
)
schema.validate(df)
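The unique sketch mentioned above follows the same pattern:
# unique=True: ❌ raises SchemaError (`col_1` contains duplicate values)
df = pd.DataFrame({"col_1": [1, 1, 2]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int, unique=True)
    }
)
schema.validate(df)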
Check Object
Check objects may be used to define built-in or custom checks that apply to a given column. The most basic Check object takes a check_fn parameter that specifies the validation function for the column. The validation function takes a pandas series as input and returns a boolean or a series of booleans; for the check to pass, every returned value must be True. By default, checks are vectorized, null values are ignored, and a failing check raises an error.
The complete set of parameters (and their defaults) for the Check object:
class pandera.checks.Check(
    check_fn,
    groups=None,
    groupby=None,
    ignore_na=True,
    element_wise=False,
    name=None,
    error=None,
    raise_warning=False,
    n_failure_cases=10,
    **check_kwargs
)
Parameters of interest:
- check_fn - specify the function used to verify validity of the column
# built-in checks: ✅
df = pd.DataFrame({"col_1": [1, 2, 3], "col_2": ["_1", "_2", "_3"], "col_3": ["c", "b", "a"]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int, pa.Check.less_than(10)),
        "col_2": pa.Column(str, pa.Check.str_startswith("_")),
        "col_3": pa.Column(str, pa.Check.isin(["a", "b", "c"])),
    }
)
schema.validate(df)

# custom checks: ✅
df = pd.DataFrame({"col_1": [2, 4, 6]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(
            int,
            pa.Check(lambda num: (num % 2) == 0)
        )
    }
)
schema.validate(df)

# multiple checks: ✅
df = pd.DataFrame({"col_1": [1, 2, 3]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(
            int,
            checks=[
                pa.Check.less_than(10),
                pa.Check.isin([1, 2, 3, 4, 5])
            ]
        )
    }
)
schema.validate(df)
Pandera has a number of built-in checks that you can use out of the box.
- ignore_na - specify whether null values should be ignored when performing checks
- raise_warning - specify whether a failing check should raise a SchemaError exception or emit a UserWarning (see the sketch below)
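And a minimal sketch of raise_warning, reusing the pattern above:
# raise_warning=True: ⚠️ emits a UserWarning instead of raising a SchemaError
df = pd.DataFrame({"col_1": [1, 2, 3]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int, pa.Check(lambda s: s < 2, raise_warning=True))
    }
)
schema.validate(df)  # still returns the dataframe; the failure surfaces as a warning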
Integration in a Data Pipeline
We've seen basic usage examples for validating a dataframe, but in a real pipeline we want a standardised way to apply that validation wherever dataframes are passed around.
Validation Decorators
Pandera provides input and output schema validation decorators that can be used for pipeline integration.
The input schema decorator validates the dataframe being passed to the function against the schema:
in_schema = pa.DataFrameSchema({
    "col_1": pa.Column(int, pa.Check.less_than(10)),
})

@pa.check_input(in_schema)
def some_function(df):
    ...
    return df
The output schema decorator validates the dataframe being returned from the function against the schema:
out_schema = pa.DataFrameSchema({
    "col_1": pa.Column(int, pa.Check.less_than(10)),
})

@pa.check_output(out_schema)
def some_function(df):
    ...
    return df
Finally, Pandera allows specifying both an input schema and output schema in one decorator:
@pa.check_io(df=in_schema, out=out_schema)
def preprocessor(df):
    ...
    return df
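Calling the decorated function then triggers validation transparently; a quick sketch, assuming the preprocessor above simply returns its input:
df = pd.DataFrame({"col_1": [1, 2, 3]})
preprocessor(df)      # ✅ both the input and output satisfy the schemas

bad_df = pd.DataFrame({"col_1": [100, 200, 300]})
preprocessor(bad_df)  # ❌ raises SchemaError before the function body runs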
Custom Classes
We may want the flexibility to define custom classes that represent our data models. These models are initialised with a validated dataframe and contain any other attributes or methods that best characterize the data. Using our cryptocurrency example from the beginning:
# define a data model
class CryptoModel:
    _schema = pa.DataFrameSchema(
        {
            "Name": pa.Column(str, checks=pa.Check(lambda s: s.str.istitle())),
            "Price": pa.Column(float, checks=pa.Check.ge(0)),
            "Market Cap": pa.Column(int, checks=pa.Check.ge(0)),
        }
    )

    def __init__(self, df: pd.DataFrame):
        self._df = self._schema.validate(df)

    def get_data(self):
        return self._df
# create a dataframe
df = pd.DataFrame(
    {
        "Name": ["Bitcoin", "Ethereum", "Solana"],
        "Price": [54292.87, 3604.29, 165.31],
        "Market Cap": [1024978397427, 424826635487, 49515270730],
    }
)
# create an object of the data model and return its values
model = CryptoModel(df)
model.get_data()
Which returns:
Name Price Market Cap
0 Bitcoin 54292.87 1024978397427
1 Ethereum 3604.29 424826635487
2 Solana 165.31 49515270730