Pandas and Polars using PyArrow backend
July 4, 2024
- overview
- pandas 2.0 added an optional PyArrow backend alongside the default NumPy backend (NumPy itself is written mostly in C)
- Polars is written in Rust with Python bindings
- both build on the same Apache Arrow columnar specification (pandas 2.0 through PyArrow, Polars through a Rust implementation that began as arrow2)
- data format and performance
- in-memory, contiguous columnar format with a rich data type system (including nested and user-defined data types)
- supports vectorization using the latest SIMD (Single Instruction, Multiple Data) operations in modern processors
- supports zero-copy reads for lightning-fast data access without serialization overhead
- the layout is designed so that compute routines and execution engines can process the data efficiently
- standardisation
- without a standard columnar data format, every technology (languages and databases) has to implement its own internal format
- moving data involves costly serialization and deserialization
- with Arrow, systems can share data in its in-memory representation, without converting it or persisting it to disk
- no need to write custom connectors between systems
- facilitates reuse of algorithm libraries across languages
- ecosystem
- a language-agnostic framework with a vast ecosystem
- official libraries across programming languages implement the format
- third-party projects may work with arrow data without having to implement the format themselves
- makes it easy to share data between different programs