Statistical Data Testing with Pandera

October 8, 2021

Overview

Pandera is a lightweight library to perform data validation on dataframes. Data validation is especially important in situations where pre-conditions / assumptions must be met as data flows from input to output. This ensures that data pipelines are readable and robust.

Key Features

  • Varying validation strictness
  • Built-in and custom validation functions
  • Informative error messages
  • Integrate into existing pipelines with validator decorators

Usage

To install Pandera:

pip install pandera

Pandera has an object-based API for data validation that revolves around creating validation schemas as objects of the DataFrameSchema class. A DataFrameSchema specifies the columns that must be present and the checks that must pass for the data to be valid. For example:

  1. Define a dataframe that needs to be validated:
import pandas as pd
import pandera as pa

df = pd.DataFrame(
    {
        "Name": ["Bitcoin", "Ethereum", "Solana"],
        "Price": [54292.87, 3604.29, 165.31],
        "Market Cap": [1024978397427, 424826635487, 49515270730],
    }
)
  2. Define a schema with the required columns and checks:
schema = pa.DataFrameSchema(
    {
        "Name": pa.Column(str, checks=pa.Check(lambda s: s.str.istitle())),
        "Price": pa.Column(float, checks=pa.Check.ge(0)),
        "Market Cap": pa.Column(int, checks=pa.Check.ge(0)),
    }
)

In the schema above, Name is a required column where each value must be of type str and must follow title case syntax (i.e. each word starts with a capital letter).

  3. Validate the dataframe using the schema:
validated_df = schema(df)
print(validated_df)

This will validate and print the dataframe:

       Name     Price     Market Cap
0   Bitcoin  54292.87  1024978397427
1  Ethereum   3604.29   424826635487
2    Solana    165.31    49515270730

If the input dataframe doesn't have the required columns, or if any check fails, a SchemaError is raised. For example:

df = pd.DataFrame(
    {
        "Name": ["bitcoin", "Ethereum", "SoLaNa"],
        "Price": [54292.87, 3604.29, "165.31"],
        "Market Cap": [1024978397427, 424826635487.99, 49515270730],
    }
)

In the dataframe above, bitcoin and SoLaNa don't follow title case syntax (the Price and Market Cap columns also contain values of the wrong type). When the dataframe is validated against the schema, a SchemaError is raised:

validated_df = schema(df)
print(validated_df)

Which returns:

SchemaError: <Schema Column(name=Name, type=DataType(str))> failed element-wise validator 0:
<Check <lambda>>
failure cases:
   index failure_case
0      0      bitcoin
1      2       SoLaNa

DataFrameSchema Object

The most basic DataFrameSchema object has a columns parameter where the keys are the column names and the values are the Column objects that specify the data type and properties of that column. By default, columns in the dataframe that aren't specified in the schema are ignored, column ordering doesn't matter, and uniqueness isn't enforced.

The complete set of parameters (and their defaults) for the DataFrameSchema object includes:

class pandera.schemas.DataFrameSchema(
    columns=None,
    checks=None,
    index=None,
    dtype=None,
    coerce=False,
    strict=False,
    name=None,
    ordered=False,
    unique=None
)

Parameters of interest:

  • columns - specify the mapping of column names to column schema objects (see example in Column Object)
  • checks - specify the checks used to verify validity of the dataframe (see example in Check Object)
  • coerce - specify whether the data types of each column in a dataframe should be coerced to match the data types specified by each column in the schema
# coerce=True: ✅ coerces int values to float values
df = pd.DataFrame({"col_1": [1, 2, 3]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(float)
    },
    coerce=True
)
 
# coerce=False: ❌ raises SchemaError
df = pd.DataFrame({"col_1": [1, 2, 3]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(float)
    },
    coerce=False
)
  • strict - specify whether columns that exist in the dataframe but not in the schema should raise an error, be ignored, or be dropped
# strict=True: ❌ raises SchemaError
df = pd.DataFrame({"col_1": [1, 2, 3], "col_2": [4, 5, 6]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int)
    },
    strict=True
)
 
# strict=False: ✅ ignores `col_2` on validation
df = pd.DataFrame({"col_1": [1, 2, 3], "col_2": [4, 5, 6]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int)
    },
    strict=False
)
 
# strict="filter": ✅ drops `col_2` on validation
df = pd.DataFrame({"col_1": [1, 2, 3], "col_2": [4, 5, 6]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int)
    },
    strict="filter"
)
  • ordered - specify whether the columns should be validated based on order
# ordered=True: ❌ raises SchemaError
df = pd.DataFrame({"col_2": [4, 5, 6], "col_1": [1, 2, 3]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int),
        "col_2": pa.Column(int)
    },
    ordered=True
)
 
# ordered=False: ✅ ignores order on validation
df = pd.DataFrame({"col_2": [4, 5, 6], "col_1": [1, 2, 3]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int),
        "col_2": pa.Column(int)
    },
    ordered=False
)

  • unique - specify whether a list of columns should be jointly unique
# unique=["col_1", "col_2"]: ✅ rows are unique across columns of interest
df = pd.DataFrame({"col_1": [1, 2, 3], "col_2": [4, 5, 6]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int),
        "col_2": pa.Column(int)
    },
    unique=["col_1", "col_2"]
)
 
# unique=["col_1", "col_2"]: ✅ rows are unique across columns of interest
df = pd.DataFrame({"col_1": [1, 1, 1], "col_2": [4, 5, 6]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int),
        "col_2": pa.Column(int)
    },
    unique=["col_1", "col_2"]
)
 
# unique=["col_1", "col_2"]: ✅ rows are unique across columns of interest
df = pd.DataFrame({"col_1": [1, 2, 3], "col_2": [1, 2, 3]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int),
        "col_2": pa.Column(int)
    },
    unique=["col_1", "col_2"]
)
 
# unique=["col_1", "col_2"]: ❌ raises SchemaError
df = pd.DataFrame({"col_1": [1, 2, 1], "col_2": [1, 2, 1]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int),
        "col_2": pa.Column(int)
    },
    unique=["col_1", "col_2"]
)

Column Object

The most basic Column object takes in a dtype parameter that specifies the required data type of the values in that column. By default, the column specified in the schema must appear in the dataframe, null values are not allowed and uniqueness is not checked.

The complete set of parameters (and their defaults) for the Column object includes:

class pandera.schema_components.Column(
    dtype=None,
    checks=None,
    nullable=False,
    unique=False,
    coerce=False,
    required=True,
    name=None,
    regex=False
)

Parameters of interest:

  • checks - specify the checks used to verify validity of the column (see example in Check Object)
  • nullable - specify whether the column can contain null values
# nullable=True: ✅ ignores null value on validation
df = pd.DataFrame({"col_1": [1, 2, np.nan]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(float, nullable=True)
    }
)
 
# nullable=False: ❌ raises SchemaError
df = pd.DataFrame({"col_1": [1, 2, np.nan]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(float, nullable=False)
    }
)

Note the special case of plain integer columns not supporting NaN values. To avoid this, specify the data type as float.

  • unique - specify whether column values should be unique
  • coerce - specify whether the values of the column should be coerced to the specified data type
  • required - specify whether the column is allowed to be missing
# required=True: ❌ raises SchemaError
df = pd.DataFrame({"col_1": [1, 2, 3]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int),
        "col_2": pa.Column(int, required=True)
    }
)
 
# required=False: ✅ ignores missing columns that aren't required on validation
df = pd.DataFrame({"col_1": [1, 2, 3]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int),
        "col_2": pa.Column(int, required=False)
    }
)

Check Object

Check objects may be used to define built-in or custom checks that apply to a given column. The most basic Check object takes in a check_fn parameter that specifies the validation function for a given column. The validation function takes a pandas Series as input and returns either a boolean or a Series of booleans. For the check to pass, all returned boolean values must be True. By default, vectorized checks are used, null values are ignored, and a failing check raises an error.

The complete set of parameters (and their defaults) for the Check object includes:

class pandera.checks.Check(
    check_fn,
    groups=None,
    groupby=None,
    ignore_na=True,
    element_wise=False,
    name=None,
    error=None,
    raise_warning=False,
    n_failure_cases=10,
    **check_kwargs
)

Parameters of interest:

  • check_fn - specify the function used to verify validity of the column
# built-in checks: ✅
df = pd.DataFrame({"col_1": [1, 2, 3], "col_2": ["_1", "_2", "_3"], "col_3": ["c", "b", "a"]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(int, pa.Check.less_than(10)),
        "col_2": pa.Column(str, pa.Check.str_startswith("_")),
        "col_3": pa.Column(str, pa.Check.isin(["a", "b", "c"])),
    }
)
 
# custom checks: ✅
df = pd.DataFrame({"col_1": [2, 4, 6]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(
            int,
            pa.Check(lambda num: (num % 2) == 0)
        )
    }
)
 
# multiple checks: ✅
df = pd.DataFrame({"col_1": [1, 2, 3]})
schema = pa.DataFrameSchema(
    {
        "col_1": pa.Column(
            int,
            checks=[
                pa.Check.less_than(10),
                pa.Check.isin([1, 2, 3, 4, 5])
            ]
        )
    }
)

Pandera has a number of built-in checks that you can use out of the box.

  • ignore_na - specify whether null values should be ignored when performing checks
  • raise_warning - specify whether a failing check should raise a SchemaError exception or emit a UserWarning

Integration in a Data Pipeline

We've seen basic usage examples to validate a dataframe, but we need a standardised approach to work with dataframes in any pipeline.

Validation Decorators

Pandera provides input and output schema validation decorators that can be used for pipeline integration.

The check_input decorator validates the dataframe passed to the function against the input schema:

in_schema = pa.DataFrameSchema({
    "col_1": pa.Column(int, pa.Check.less_than(10)),
})
 
@pa.check_input(in_schema)
def some_function(df):
    ...
    return df

The check_output decorator validates the dataframe returned from the function against the output schema:

out_schema = pa.DataFrameSchema({
    "col_1": pa.Column(int, pa.Check.less_than(10)),
})
 
@pa.check_output(out_schema)
def some_function(df):
    ...
    return df

Finally, Pandera allows specifying both an input schema and output schema in one decorator:

@pa.check_io(df=in_schema, out=out_schema)
def preprocessor(df):
    ...
    return df

Custom Classes

We may want the flexibility to define custom classes that represent our data models. These models are initialised with a validated dataframe and contain any other attributes or methods that best characterize the data. Using our cryptocurrency example from the beginning:

# define a data model
class CryptoModel:
    _schema = pa.DataFrameSchema(
        {
            "Name": pa.Column(str, checks=pa.Check(lambda s: s.str.istitle())),
            "Price": pa.Column(float, checks=pa.Check.ge(0)),
            "Market Cap": pa.Column(int, checks=pa.Check.ge(0)),
        }
    )
 
    def __init__(self, df: pd.DataFrame):
        self._df = self._schema.validate(df)
 
    def get_data(self):
        return self._df
 
# create a dataframe
df = pd.DataFrame(
    {
        "Name": ["Bitcoin", "Ethereum", "Solana"],
        "Price": [54292.87, 3604.29, 165.31],
        "Market Cap": [1024978397427, 424826635487, 49515270730],
    }
)
 
# create an object of the data model and return its values
model = CryptoModel(df)
model.get_data()

Which returns:

       Name     Price     Market Cap
0   Bitcoin  54292.87  1024978397427
1  Ethereum   3604.29   424826635487
2    Solana    165.31    49515270730