Parsing Data with Pydantic

October 15, 2021

Overview

Pydantic is a data parsing and validation library built on top of Python type hints. It enforces those type hints at runtime, which makes it useful when working with data models whose fields must conform to predefined data types.

Pydantic Models

Data models in Pydantic are defined as classes that inherit from BaseModel. After parsing and validation, the fields of the resulting model instance conform to the field types defined in the model; if the input cannot be made to conform, a ValidationError is raised.

For example, we can define a Person class that has two required fields, name and age:

from pydantic import BaseModel
 
 
class Person(BaseModel):
    name: str
    age: int

We can create an instance of this class, parsing and validating the specified fields:

person = Person(name="Luke", age=24)

Data types will be coerced, if possible, to match the specified type constraints; otherwise a ValidationError is raised.
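As a small illustration of the coercion behaviour described above, the sketch below passes age as a string. It reuses the Person model from this section; the sample values are only for illustration.

```python
from pydantic import BaseModel, ValidationError


class Person(BaseModel):
    name: str
    age: int


# The string "24" can be converted to an int, so parsing succeeds.
person = Person(name="Luke", age="24")
print(type(person.age), person.age)

# "twenty-four" cannot be converted to an int, so validation fails.
try:
    Person(name="Luke", age="twenty-four")
    coercion_failed = False
except ValidationError as exc:
    coercion_failed = True
    print(exc)
```

Catching ValidationError at the boundary where data enters the program keeps the rest of the code free to assume well-typed fields.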

Pydantic Validators

Pydantic allows you to define additional validation checks that can be enforced on underlying data fields using validation decorators.

validator Decorator

The validator decorator is used to validate individual fields. The first argument is the name of the field to validate (you can specify multiple fields by passing several names, and you can specify all fields by using the special value "*"). Subsequent keyword arguments may include:

  • pre: bool - apply the validator prior to other validations (including data type validations)
  • each_item: bool - apply the validator to each element of a collection (List, Dict, Set) rather than to the collection as a whole
  • always: bool - apply the validator even if a field has not been supplied (useful for dynamic default values)
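The three options above can be sketched on a small model. The Article model, its fields, and the validator names are hypothetical and exist only for this illustration; the code assumes the Pydantic 1.x decorators used in this article (Pydantic 2 deprecates them in favour of field_validator and model_validator).

```python
from typing import List

from pydantic import BaseModel, validator


class Article(BaseModel):  # hypothetical model, for illustration only
    title: str
    tags: List[str] = []
    slug: str = ""

    @validator("title", pre=True)
    def strip_title(cls, v):
        # pre=True: runs on the raw input, before the str type check.
        return v.strip() if isinstance(v, str) else v

    @validator("tags", each_item=True)
    def lowercase_tag(cls, v):
        # each_item=True: v is a single tag, not the whole list.
        return v.lower()

    @validator("slug", always=True)
    def default_slug(cls, v, values):
        # always=True: runs even when slug is omitted, so a default
        # can be derived from the already-validated title.
        return v or values.get("title", "").replace(" ", "-").lower()


article = Article(title="  Hello World ", tags=["Python", "DATA"])
print(article.title, article.tags, article.slug)
```

The values argument only contains fields that were defined (and validated) before the one being checked, which is why slug is declared after title here.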

root_validator Decorator

The root_validator decorator is used to validate the entire data model, giving you access to a dictionary of each field's name-to-value mapping.
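A minimal sketch of a root validator is shown below. The Signup model and its fields are hypothetical. The skip_on_failure=True argument is an addition not discussed in this article: it asks Pydantic to skip the root validator when an individual field has already failed, so both keys are guaranteed to be present in the dictionary.

```python
from pydantic import BaseModel, ValidationError, root_validator


class Signup(BaseModel):  # hypothetical model, for illustration only
    password: str
    password_confirm: str

    @root_validator(skip_on_failure=True)
    def passwords_match(cls, values):
        # values maps every field name to its validated value.
        if values["password"] != values["password_confirm"]:
            raise ValueError("passwords do not match")
        return values


ok = Signup(password="s3cret", password_confirm="s3cret")

try:
    Signup(password="s3cret", password_confirm="oops")
    mismatch_rejected = False
except ValidationError:
    mismatch_rejected = True

print(ok.password, mismatch_rejected)
```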

Extended Example

To illustrate this with an example, suppose we have an API request body that needs to be parsed and validated before it can be processed. The purpose of the request is to retrieve the historical data for the metrics of a given list of securities. This model can be represented as follows:

from datetime import date
from typing import List
 
from pydantic import BaseModel, Field
 
 
class DataRequest(BaseModel):
    securities: List[str] = Field(..., title="ISIN values")
    metrics: List[str] = Field(..., title="Metric values")
    start_date: date = Field(..., title="Start date")
    end_date: date = Field(..., title="End date")

The data model above specifies the required fields securities, metrics, start_date and end_date, along with their respective types. However, even though we know that securities and metrics must be lists of strings and that start_date and end_date must be date objects, there are further validations that could be done in order for the request to make sense.

Additional information we know about the data model includes:

  • each element in securities must be a valid ISIN (an International Securities Identification Number, which is a 12-character alphanumeric code that uniquely identifies a specific security)
  • each element in metrics must belong to a set of valid metrics
  • the start_date cannot occur after the end_date

We can therefore extend this model and define additional validation checks for each field using Pydantic validator decorators:

from datetime import date
from typing import List
 
from pydantic import BaseModel, Field, root_validator, validator
from stdnum import isin
 
class DataRequest(BaseModel):
    securities: List[str] = Field(..., title="ISIN values")
    metrics: List[str] = Field(..., title="Metric values")
    start_date: date = Field(..., title="Start date")
    end_date: date = Field(..., title="End date")
 
    @validator("securities", each_item=True)
    def validate_securities(cls, v):
        try:
            isin.validate(v)
        except Exception:
            raise ValueError(f"Security invalid: {v}")
 
        return v.upper()
 
    @validator("metrics", each_item=True)
    def validate_metric(cls, v):
        metrics = ["price", "market_cap", "dividend_yield", "eps"]
        if v not in metrics:
            raise ValueError(f"Metric invalid: {v}")
 
        return v.upper()
 
    @root_validator
    def validate_dates(cls, v):
        start_date = v.get("start_date")
        end_date = v.get("end_date")
 
        # Either value is missing if its own field validation already failed.
        if start_date and end_date and start_date > end_date:
            raise ValueError("Start date occurs after end date")
 
        return v

The data model above makes use of the validator and root_validator decorators. Using these validators, we ensure that securities are valid ISIN values (using the python-stdnum library, imported as stdnum), that metrics belong to the set of available metrics, and that start_date does not occur after end_date.
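To give a feel for what an ISIN check involves, the following stdlib-only sketch verifies the length, the character set, and the Luhn check digit. The function name isin_checksum_ok is hypothetical, and this is not how stdnum is implemented; stdnum may apply further checks that this sketch omits, so the model above should keep using isin.validate.

```python
import string


def isin_checksum_ok(code: str) -> bool:
    """Return True if code has the ISIN shape and a valid check digit."""
    code = code.strip().upper()
    allowed = string.ascii_uppercase + string.digits
    # 12 characters: 2-letter country code, 9 alphanumerics, 1 check digit.
    if len(code) != 12 or any(ch not in allowed for ch in code):
        return False
    if not code[:2].isalpha() or not code[-1].isdigit():
        return False

    # Expand letters to numbers (A=10 ... Z=35), then apply the Luhn check.
    digits = "".join(str(int(ch, 36)) for ch in code)
    total = 0
    for position, ch in enumerate(reversed(digits)):
        d = int(ch)
        if position % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(isin_checksum_ok("US0378331005"))
print(isin_checksum_ok("US0378331006"))
```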

The following applies to validation functions (defined underneath the validator decorators):

  • validators are class methods
  • the first argument is the class object
  • the second argument is the value to validate (the parameter can be named anything); for a root_validator it is the dictionary of field values
  • validation is done in the order that fields are defined

We can now create an instance of the DataRequest model, parse the input arguments and be certain that our fields are valid.

request = DataRequest(
    securities=["US0378331005", "US0231351067"],
    metrics=["market_cap", "price"],
    start_date=date(2020, 1, 1),
    end_date=date(2020, 12, 31),
)
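When the input is invalid, Pydantic raises a ValidationError whose errors() method lists each problem. The sketch below uses a reduced, hypothetical DateRangeRequest model containing only the date fields, so that it runs without stdnum. It passes ISO-formatted strings to show that they are parsed into date objects, and it uses skip_on_failure=True as one way to make sure both dates are present before they are compared.

```python
from datetime import date

from pydantic import BaseModel, ValidationError, root_validator


class DateRangeRequest(BaseModel):  # hypothetical reduced model
    start_date: date
    end_date: date

    # skip_on_failure=True: run only if both fields parsed successfully.
    @root_validator(skip_on_failure=True)
    def validate_dates(cls, values):
        if values["start_date"] > values["end_date"]:
            raise ValueError("Start date occurs after end date")
        return values


# ISO-formatted strings are parsed into date objects.
parsed = DateRangeRequest(start_date="2020-01-01", end_date="2020-12-31")
print(parsed.start_date, parsed.end_date)

errors = []
try:
    DateRangeRequest(start_date="2020-12-31", end_date="2020-01-01")
except ValidationError as exc:
    # Each entry describes one failure: its location, message and type.
    errors = exc.errors()
    for error in errors:
        print(error["loc"], error["msg"])
```

In an API handler, the list returned by errors() can be sent back to the caller so they can see which part of the request body was rejected.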