PythonDB - Python as a Partitioned Database
PythonDB is an open-source experiment to expose Python instances as databases. In particular, it exposes DuckDB and Polars SQL via a MySQL interface.
Summary
- Polars / DuckDB are amazing for data analysis.
- We will be using Parquet + PyArrow in the future.
Component Pieces
- Snake = Python - Data analysis lingua franca
- Duck = DuckDB - Fast, free columnar database. 2024-06-03 - Version 1.0
- PolarBear = Polars - “Dataframes for the new era”
  - NumPy - 2005 - large, multi-dimensional arrays
  - Pandas - ~2009 - Dataframes + functions for analysis
  - Polars - 2024-07-01 - Version 1.0 + DF + SQL
- Dolphin = MySQL - Humongously popular database
Parquet File Structure
Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Apache Parquet is designed to be a common interchange format for both batch and interactive workloads.
Credit Michael Berk
Arrow Memory Structure
Apache Arrow contains a standardized column-oriented memory format that is able to represent flat and hierarchical data for efficient analytic operations on modern CPU and GPU hardware.