Data engineering in Python
September 16, 2024
General tools
- compute
- Databricks (which uses Apache Spark as its processing engine and allows for in-memory caching and optimised query execution) and Delta Lake (which provides the data lake / warehouse storage layer; see the PySpark sketch after this list)
- storage
- relational (MSSQL, PostgreSQL)
- non-relational (MongoDB, Redis)
- object (Azure Blob Storage / Amazon S3; see the boto3 sketch after this list)
- containerisation
- Docker (automate and manage the deployment of applications inside containers; see the SDK sketch after this list)
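
To make the Databricks / Delta Lake combination concrete, here is a minimal PySpark sketch that reads a Delta table, aggregates it and writes the result back. The paths and column names are hypothetical; on Databricks a `spark` session is provided for you, while running this locally would need the delta-spark package configured.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks `spark` already exists; building one locally requires
# Delta Lake support (the delta-spark package) to be configured.
spark = SparkSession.builder.appName("delta-example").getOrCreate()

# Read a Delta table from the lake (hypothetical path).
events = spark.read.format("delta").load("/mnt/lake/events")

# Transformations are lazy; Spark optimises the whole plan and only
# executes it when an action (like the write below) is triggered.
daily_purchases = (
    events
    .filter(F.col("event_type") == "purchase")
    .groupBy("event_date")
    .count()
)

daily_purchases.write.format("delta").mode("overwrite").save(
    "/mnt/lake/daily_purchases"
)
```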
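
Object storage is usually driven through a thin client SDK. A minimal sketch with boto3 against S3 (the bucket, key and local file are hypothetical; credentials are picked up from the environment or AWS config as usual):

```python
import boto3

s3 = boto3.client("s3")

# Upload a local file to object storage (hypothetical bucket and key).
s3.upload_file("report.csv", "my-data-bucket", "raw/report.csv")

# Read the object back as bytes.
response = s3.get_object(Bucket="my-data-bucket", Key="raw/report.csv")
data = response["Body"].read()
```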
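
Docker is normally driven by a Dockerfile and the CLI, but there is also an official Python SDK (the `docker` package). A minimal sketch, assuming a local Docker daemon is running and the image tag is illustrative:

```python
import docker

# Connect to the local Docker daemon via the standard environment
# variables (DOCKER_HOST etc.).
client = docker.from_env()

# Run a throwaway container and capture its stdout.
output = client.containers.run(
    "python:3.12-slim",
    ["python", "-c", "print('hello from a container')"],
    remove=True,
)
print(output.decode())
```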
Python tools
- Airflow (task-centric, with broad open-source support) and Dagster (asset-centric, with strong testing and debugging support) for orchestration; see the Dagster sketch after this list
- Polars and Pandas for data processing (Polars is becoming more mainstream; it uses Rust and Apache Arrow on the backend, and its lazy API is shown in a sketch after this list)
- dbt for SQL-based data transformations
- FastAPI for building APIs (a very popular library; it works on top of Pydantic, which handles data parsing and validation, as shown in the sketch after this list)
- SQLAlchemy for database connections (the 2.0 major release, which reworked the query API, landed in early 2023; a 2.0-style sketch follows this list)
- Poetry / uv for project, package and dependency management
- Ansible for infrastructure management and configuration
- OpenTelemetry for monitoring (open source, language agnostic, can capture rich metadata; see the tracing sketch after this list)
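
To illustrate the asset-centric model, here is a minimal Dagster sketch (the asset names and data are made up). Dependencies are declared simply by naming an upstream asset as a function parameter, and `materialize()` runs the graph in-process, which is part of what makes testing so pleasant; the Airflow equivalent would be a task-centric DAG of operators.

```python
from dagster import asset, materialize

@asset
def raw_orders():
    # In a real pipeline this would pull from a source system.
    return [{"id": 1, "total": 30}, {"id": 2, "total": 80}]

@asset
def large_orders(raw_orders):
    # Dagster wires the dependency from the parameter name.
    return [order for order in raw_orders if order["total"] > 50]

if __name__ == "__main__":
    # Runs the whole asset graph in-process; handy in unit tests.
    materialize([raw_orders, large_orders])
```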
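
A minimal sketch of the Polars lazy API (the file and column names are hypothetical). `scan_csv` builds a query plan without reading anything; the Rust engine optimises and executes it only on `collect()`, which is a large part of the speed advantage over Pandas:

```python
import polars as pl

result = (
    pl.scan_csv("sales.csv")  # lazy: nothing is read yet
    .filter(pl.col("amount") > 100)
    .group_by("region")
    .agg(pl.col("amount").sum().alias("total_amount"))
    .collect()  # the optimised plan executes here
)
print(result)
```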
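
A minimal FastAPI sketch showing the Pydantic integration (the `Item` model and route are made up). The request body is parsed and validated against the model automatically, and invalid input gets a 422 response without any hand-written checks:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Item(BaseModel):
    name: str
    price: float

@app.post("/items/")
def create_item(item: Item):
    # `item` has already been parsed and validated by Pydantic here.
    return {"name": item.name, "price_with_tax": item.price * 1.1}
```

Serve it with any ASGI server, e.g. `uvicorn main:app --reload` if the file is called main.py.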
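
A small sketch in the 2.0 style, which unifies Core and ORM querying around `select()` and adds typed declarative models (the `User` table and the in-memory SQLite engine are illustrative):

```python
from sqlalchemy import create_engine, select
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column

class Base(DeclarativeBase):
    pass

class User(Base):
    __tablename__ = "users"
    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str]

# In-memory SQLite keeps the sketch self-contained.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(User(name="ada"))
    session.commit()
    # 2.0-style querying: the same select() works for Core and ORM.
    users = session.scalars(select(User).where(User.name == "ada")).all()
```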
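
A minimal tracing sketch with the OpenTelemetry Python SDK, exporting spans to the console (in production you would usually export via OTLP to a collector instead; the span and attribute names are made up):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,
)

# Wire up a provider that prints finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("load-batch") as span:
    # Rich metadata is attached to the span as attributes.
    span.set_attribute("rows.processed", 1000)
```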