Parsing Data with Pydantic
October 15, 2021
Overview
Pydantic is a data parsing and validation library built on top of Python type hints. It enforces those type hints at runtime, which makes it useful when working with data models whose fields must conform to predefined types.
Pydantic Models
Data models in Pydantic are defined as classes that inherit from BaseModel. After parsing and validation, the fields of the resulting model instance conform to the field types defined in the model (otherwise an error is raised).
For example, we can define a Person class that has two required fields, name and age:
from pydantic import BaseModel


class Person(BaseModel):
    name: str
    age: int
We can create an instance of this class, parsing and validating the specified fields:
person = Person(name="Luke", age=24)
Data types will be coerced, if possible, to match the specified type constraints; otherwise an error is raised.
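As a minimal sketch of that coercion behaviour, the snippet below passes a numeric string for age and then a string that cannot be read as an integer. It redefines the same Person model so that it runs on its own, and it assumes Pydantic's default (non-strict) parsing.

```python
from pydantic import BaseModel, ValidationError


class Person(BaseModel):
    name: str
    age: int


# A numeric string can be coerced to the declared int type
person = Person(name="Luke", age="24")
print(type(person.age), person.age)

# A string that cannot be read as an integer is rejected
try:
    Person(name="Luke", age="twenty-four")
except ValidationError as exc:
    print(exc)
```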
Pydantic Validators
Pydantic allows you to define additional validation checks that can be enforced on underlying data fields using validation decorators.
validator Decorator
The validator decorator is used to validate individual fields. Its positional arguments are the names of the fields to validate (you can specify multiple fields by passing several names, and you can target all fields by using the special value "*"). Keyword arguments may include:
- pre: bool - apply the validator prior to other validations (including data type validations)
- each_item: bool - apply the validator to each element of a collection (List, Dict, Set) rather than to the collection as a whole
- always: bool - apply the validator even if a field has not been supplied (useful for dynamic default values)
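The three options can be sketched together on one small model. The Event class and its fields are hypothetical and exist only for illustration; the decorator is used in the Pydantic v1 style described above.

```python
from datetime import datetime
from typing import List, Optional

from pydantic import BaseModel, validator


class Event(BaseModel):  # hypothetical model, for illustration only
    name: str
    tags: List[str]
    created_at: Optional[datetime] = None

    @validator("name", pre=True)
    def strip_name(cls, v):
        # pre=True: runs on the raw input, before the str type check
        return v.strip() if isinstance(v, str) else v

    @validator("tags", each_item=True)
    def normalise_tag(cls, v):
        # each_item=True: v is a single tag, not the whole list
        if not v:
            raise ValueError("Tag must not be empty")
        return v.lower()

    @validator("created_at", always=True)
    def default_created_at(cls, v):
        # always=True: runs even when created_at was not supplied
        return v or datetime.now()


event = Event(name="  launch ", tags=["Alpha", "BETA"])
print(event)
```

Here created_at is filled in at validation time, which a static default value could not do.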
root_validator Decorator
The root_validator decorator is used to validate the entire data model, giving you access to a dictionary that maps each field's name to its value.
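A minimal sketch of a cross-field check follows. The Signup model is hypothetical, and skip_on_failure=True is an added assumption: it tells Pydantic to skip the root validator when an individual field has already failed, so both keys are guaranteed to be present in the dictionary.

```python
from pydantic import BaseModel, ValidationError, root_validator


class Signup(BaseModel):  # hypothetical model, for illustration only
    password: str
    password_confirm: str

    @root_validator(skip_on_failure=True)
    def passwords_match(cls, values):
        # values maps every field name to its already-validated value
        if values["password"] != values["password_confirm"]:
            raise ValueError("Passwords do not match")
        return values


signup = Signup(password="hunter2", password_confirm="hunter2")

try:
    Signup(password="hunter2", password_confirm="hunter3")
except ValidationError as exc:
    print(exc)
```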
Extended Example
To illustrate this with an example, suppose we have an API request body that needs to be parsed and validated before it can be processed. The purpose of the request is to retrieve historical data for a set of metrics across a given list of securities. This model can be represented as follows:
from datetime import date
from typing import List

from pydantic import BaseModel, Field


class DataRequest(BaseModel):
    securities: List[str] = Field(..., title="ISIN values")
    metrics: List[str] = Field(..., title="Metric values")
    start_date: date = Field(..., title="Start date")
    end_date: date = Field(..., title="End date")
The data model above specifies the required fields (securities, metrics, start_date and end_date) with their respective types. However, even though we know that securities and metrics must be lists of strings and that start_date and end_date must be date objects, further validation is needed for the request to make sense.
Additional constraints we know about the data model include:
- each element in securities must be a valid ISIN (an International Securities Identification Number, which is a 12-character alphanumeric code that uniquely identifies a specific security)
- each element in metrics must belong to a set of valid metrics
- the start_date cannot occur after the end_date
We can therefore extend this model and define additional validation checks for each field using Pydantic validator decorators:
from datetime import date
from typing import List

from pydantic import BaseModel, Field, root_validator, validator
from stdnum import isin


class DataRequest(BaseModel):
    securities: List[str] = Field(..., title="ISIN values")
    metrics: List[str] = Field(..., title="Metric values")
    start_date: date = Field(..., title="Start date")
    end_date: date = Field(..., title="End date")

    @validator("securities", each_item=True)
    def validate_securities(cls, v):
        # Check each ISIN with stdnum, then normalise it to upper case
        try:
            isin.validate(v)
        except Exception:
            raise ValueError(f"Security invalid: {v}")
        return v.upper()

    @validator("metrics", each_item=True)
    def validate_metric(cls, v):
        # Each metric must be one of the supported (lower-case) names
        metrics = ["price", "market_cap", "dividend_yield", "eps"]
        if v not in metrics:
            raise ValueError(f"Metric invalid: {v}")
        return v.upper()

    @root_validator
    def validate_dates(cls, v):
        # A date that failed its own validation is missing from v,
        # so only compare when both dates are present
        start_date = v.get("start_date")
        end_date = v.get("end_date")
        if start_date and end_date and start_date > end_date:
            raise ValueError("Start date occurs after end date")
        return v
The data model above makes use of the validator and root_validator decorators. Using these validators, we ensure that securities are valid ISIN values (using the library stdnum), that metrics belong to the set of available metrics, and that start_date cannot occur after end_date.
The following applies to validation functions (defined underneath the validator decorators):
- validators are class methods
- the first argument is the class object
- the second argument is the value to validate (the parameter can be named anything); for a root validator it is the dictionary of all field values
- validation is done in the order that fields are defined
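The ordering rule matters when one field's validator needs another field's value. As a sketch, and assuming the Pydantic v1 convention that a validator may declare an extra parameter named values holding the fields validated so far (this parameter is not covered above), a hypothetical Range model can check high against low because low is defined first:

```python
from pydantic import BaseModel, validator


class Range(BaseModel):  # hypothetical model, for illustration only
    low: int
    high: int

    @validator("high")
    def high_not_below_low(cls, v, values):
        # low is defined before high, so it has already been validated;
        # it is absent from values only if its own validation failed
        low = values.get("low")
        if low is not None and v < low:
            raise ValueError("high must not be below low")
        return v


valid_range = Range(low=1, high=5)
print(valid_range)
```

Swapping the field order would leave low out of values at the time high is checked, so the comparison would silently never run.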
We can now create an instance of the DataRequest model, parse the input arguments and be certain that our fields are valid.
request = DataRequest(
    securities=["US0378331005", "US0231351067"],
    metrics=["market_cap", "price"],
    start_date=date(2020, 1, 1),
    end_date=date(2020, 12, 31),
)
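It is also worth seeing what happens when a request is invalid. The sketch below uses a trimmed-down, hypothetical MetricsRequest model: the ISIN check is left out so that the example does not depend on stdnum, and the error_messages helper is an illustrative name. It collects the messages from the ValidationError that Pydantic raises.

```python
from datetime import date
from typing import List

from pydantic import BaseModel, ValidationError, root_validator, validator

VALID_METRICS = {"price", "market_cap", "dividend_yield", "eps"}


class MetricsRequest(BaseModel):  # hypothetical, simplified DataRequest
    metrics: List[str]
    start_date: date
    end_date: date

    @validator("metrics", each_item=True)
    def validate_metric(cls, v):
        if v not in VALID_METRICS:
            raise ValueError(f"Metric invalid: {v}")
        return v

    # skip_on_failure=True (an assumption of this sketch): only compare
    # the dates when every individual field has passed validation
    @root_validator(skip_on_failure=True)
    def validate_dates(cls, values):
        if values["start_date"] > values["end_date"]:
            raise ValueError("Start date occurs after end date")
        return values


def error_messages(**fields):
    """Return the validation messages for the input, or [] if it is valid."""
    try:
        MetricsRequest(**fields)
    except ValidationError as exc:
        return [err["msg"] for err in exc.errors()]
    return []


bad_metric = error_messages(
    metrics=["price", "volume"], start_date="2020-01-01", end_date="2020-12-31"
)
bad_dates = error_messages(
    metrics=["price"], start_date="2020-12-31", end_date="2020-01-01"
)
print(bad_metric)
print(bad_dates)
```

Each entry returned by errors() also carries the location of the offending field, which is useful when reporting problems back to an API client.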