Data engineering in Python
September 16, 2024
General tools
- compute
- Databricks (which uses Apache Spark as its processing engine and allows for in-memory caching and optimised query execution) and Delta Lake (which provides the data lake / warehouse storage layer; see the PySpark sketch after this list)
- storage
- relational (MSSQL, PostgreSQL)
- non-relational (MongoDB, Redis)
- object (Azure Blob Storage / Amazon S3; see the boto3 sketch after this list)
- containerisation
- Docker (automate and manage the deployment of applications inside containers; see the SDK sketch after this list)
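
To make the Databricks / Delta Lake combination concrete, here is a minimal PySpark sketch that reads a Delta table, aggregates it and writes the result back. The paths and column names are hypothetical; on Databricks a `spark` session is provided for you, while running this locally would need the delta-spark package configured.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks `spark` already exists; building one locally requires
# Delta Lake support (the delta-spark package) to be configured.
spark = SparkSession.builder.appName("delta-example").getOrCreate()

# Read a Delta table from the lake (hypothetical path).
events = spark.read.format("delta").load("/mnt/lake/events")

# Transformations are lazy; Spark optimises the whole plan and only
# executes it when an action (like the write below) is triggered.
daily_purchases = (
    events
    .filter(F.col("event_type") == "purchase")
    .groupBy("event_date")
    .count()
)

daily_purchases.write.format("delta").mode("overwrite").save(
    "/mnt/lake/daily_purchases"
)
```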
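
Object storage is usually driven through a thin client SDK. A minimal sketch with boto3 against S3 (the bucket, key and local file are hypothetical; credentials are picked up from the environment or AWS config as usual):

```python
import boto3

s3 = boto3.client("s3")

# Upload a local file to object storage (hypothetical bucket and key).
s3.upload_file("report.csv", "my-data-bucket", "raw/report.csv")

# Read the object back as bytes.
response = s3.get_object(Bucket="my-data-bucket", Key="raw/report.csv")
data = response["Body"].read()
```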
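
Docker is normally driven by a Dockerfile and the CLI, but there is also an official Python SDK (the `docker` package). A minimal sketch, assuming a local Docker daemon is running and the image tag is illustrative:

```python
import docker

# Connect to the local Docker daemon via the standard environment
# variables (DOCKER_HOST etc.).
client = docker.from_env()

# Run a throwaway container and capture its stdout.
output = client.containers.run(
    "python:3.12-slim",
    ["python", "-c", "print('hello from a container')"],
    remove=True,
)
print(output.decode())
```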
Python tools
- Airflow (task-centric, with broad open-source support) and Dagster (asset-centric, with strong testing and debugging support) for orchestration; see the Dagster sketch after this list
- Polars and Pandas for data processing (Polars is becoming more mainstream; it uses Rust and Apache Arrow on the backend, and its lazy API is shown in a sketch after this list)
- dbt for SQL-based data transformations
- FastAPI for building APIs (a very popular library; it works on top of Pydantic, which handles data parsing and validation, as shown in the sketch after this list)
- SQLAlchemy for database connections (the 2.0 major release, which reworked the query API, landed in early 2023; a 2.0-style sketch follows this list)
- Poetry / uv for project, package and dependency management
- Ansible for infrastructure management and configuration
- OpenTelemetry for monitoring (open source, language agnostic, can capture rich metadata; see the tracing sketch after this list)
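
To illustrate the asset-centric model, here is a minimal Dagster sketch (the asset names and data are made up). Dependencies are declared simply by naming an upstream asset as a function parameter, and `materialize()` runs the graph in-process, which is part of what makes testing so pleasant; the Airflow equivalent would be a task-centric DAG of operators.

```python
from dagster import asset, materialize

@asset
def raw_orders():
    # In a real pipeline this would pull from a source system.
    return [{"id": 1, "total": 30}, {"id": 2, "total": 80}]

@asset
def large_orders(raw_orders):
    # Dagster wires the dependency from the parameter name.
    return [order for order in raw_orders if order["total"] > 50]

if __name__ == "__main__":
    # Runs the whole asset graph in-process; handy in unit tests.
    materialize([raw_orders, large_orders])
```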
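
A minimal sketch of the Polars lazy API (the file and column names are hypothetical). `scan_csv` builds a query plan without reading anything; the Rust engine optimises and executes it only on `collect()`, which is a large part of the speed advantage over Pandas:

```python
import polars as pl

result = (
    pl.scan_csv("sales.csv")  # lazy: nothing is read yet
    .filter(pl.col("amount") > 100)
    .group_by("region")
    .agg(pl.col("amount").sum().alias("total_amount"))
    .collect()  # the optimised plan executes here
)
print(result)
```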
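
A minimal FastAPI sketch showing the Pydantic integration (the `Item` model and route are made up). The request body is parsed and validated against the model automatically, and invalid input gets a 422 response without any hand-written checks:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Item(BaseModel):
    name: str
    price: float

@app.post("/items/")
def create_item(item: Item):
    # `item` has already been parsed and validated by Pydantic here.
    return {"name": item.name, "price_with_tax": item.price * 1.1}
```

Serve it with any ASGI server, e.g. `uvicorn main:app --reload` if the file is called main.py.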
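
A small sketch in the 2.0 style, which unifies Core and ORM querying around `select()` and adds typed declarative models (the `User` table and the in-memory SQLite engine are illustrative):

```python
from sqlalchemy import create_engine, select
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column

class Base(DeclarativeBase):
    pass

class User(Base):
    __tablename__ = "users"
    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str]

# In-memory SQLite keeps the sketch self-contained.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(User(name="ada"))
    session.commit()
    # 2.0-style querying: the same select() works for Core and ORM.
    users = session.scalars(select(User).where(User.name == "ada")).all()
```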
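
A minimal tracing sketch with the OpenTelemetry Python SDK, exporting spans to the console (in production you would usually export via OTLP to a collector instead; the span and attribute names are made up):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,
)

# Wire up a provider that prints finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("load-batch") as span:
    # Rich metadata is attached to the span as attributes.
    span.set_attribute("rows.processed", 1000)
```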