Pandas and Polars with Apache Arrow backends

July 4, 2024

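  • overview
    • pandas 2.0 added an optional pyarrow backend alongside the default numpy backend (numpy is written mostly in C; pyarrow wraps the Arrow C++ library)
    • polars is written in rust with python bindings
    • both build on the same Apache Arrow columnar format specification (pandas 2.0 through pyarrow, polars through its own rust implementation, originally based on arrow2)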
  • data format and performance
    • in-memory, contiguous columnar format with a rich data type system (including nested and user-defined data types)
      • contiguous columns let computations be vectorized with SIMD (Single Instruction, Multiple Data) instructions on modern processors
      • supports zero-copy reads, so data can be accessed without serialization overhead
      • designed to maximize the efficiency of computational routines and execution engines
  • standardisation
    • without a standard columnar data format, every technology (languages and databases) has to implement its own internal format
      • moving data between them involves costly serialization and deserialization
    • with arrow, systems can share data directly in its in-memory representation, without converting it or persisting it to disk first
      • no need to write custom connectors between each pair of systems
      • makes libraries of algorithms reusable across languages
  • ecosystem
    • a language-agnostic framework with a vast ecosystem
      • official libraries implement the format in many programming languages (C++, Java, Go, Rust, Python, R, and others)
        • third-party projects can work with arrow data without implementing the format themselves
        • makes it easy to share data among different programs