Pandas and Polars using Arrow backends

July 4, 2024

overview
- pandas 2.0 added an optional pyarrow backend alongside the default numpy backend (numpy itself is written in c)
- polars is written in rust with python bindings
- both implement the same Apache Arrow data specification (Pandas 2.0 uses pyarrow, Polars uses its own rust implementation, originally based on arrow2)
data format and performance
- in-memory, contiguous columnar format with a rich data type system (including nested and user-defined data types)
  - values of a column sit next to each other in memory, so operations can be vectorized with SIMD (Single Instruction, Multiple Data) instructions on modern processors
  - supports zero-copy reads: data can be accessed in place, without serialization overhead
  - compute routines and execution engines can work directly on this layout, which helps them run efficiently
standardisation
- without a standard columnar data format, every technology (languages and databases) has to implement its own internal format
  - moving data between them involves costly serialization and deserialization
- with arrow, systems can share data in its in-memory representation, without persisting or converting it
  - don't need to create custom connectors
  - facilitates reuse of libraries of algorithms, across languages
ecosystem
- a language-agnostic framework with a vast ecosystem
  - official libraries across programming languages implement the format
    - third-party projects may work with arrow data without having to implement the format themselves
    - easy to share the data among different programs