Data engineering in Python

September 16, 2024

General tools

  • compute
    • Databricks (which uses Apache Spark as its processing engine and allows for in-memory caching and optimised query execution) and Delta Lake (which provides the storage layer for the data lake / warehouse)
  • storage
    • relational (MSSQL, PostgreSQL)
    • non-relational (MongoDB, Redis)
    • object (Azure Blob Storage / Amazon S3)
  • containerisation
    • Docker (automate and manage the deployment of applications inside containers)
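
To make the three storage categories listed above a little more concrete, here is a minimal sketch that uses only the Python standard library. It makes no calls to the real services: sqlite3 stands in for a relational database such as MSSQL or PostgreSQL, and plain dicts stand in for a document / key-value store and an object store. The table name, keys and sample records are all made up for illustration.

```python
import json
import sqlite3

# Relational: fixed schema, queried with SQL.
# (Assumption: sqlite3 is only a stand-in for MSSQL / PostgreSQL.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
relational_row = conn.execute(
    "SELECT name FROM users WHERE id = ?", (2,)
).fetchone()
conn.close()

# Non-relational: schemaless records looked up by key.
# (Assumption: a dict is only a stand-in for MongoDB / Redis.)
documents = {
    "user:1": {"name": "Ada", "tags": ["admin"]},
    "user:2": {"name": "Grace"},  # records need not share the same fields
}
document = documents["user:2"]

# Object: opaque bytes stored under a path-like key.
# (Assumption: a dict of bytes is only a stand-in for Azure Blob Storage / Amazon S3.)
objects = {}
objects["exports/users.json"] = json.dumps(documents).encode("utf-8")
restored = json.loads(objects["exports/users.json"].decode("utf-8"))
```

The difference is in how the data is addressed: the relational store is queried through a schema, the non-relational store returns whatever record sits under a key, and the object store returns raw bytes that the caller has to decode.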

Python tools