PythonDB is an open-source experiment that exposes Python instances as databases. In particular, it exposes DuckDB and Polars SQL through a MySQL interface.

[Figure: PythonDB architecture]

Summary

  1. Polars and DuckDB are amazing for data analysis.
  2. We will be using Parquet + PyArrow in the future.

Component Pieces

[Figure: cuddly toys for Python, Polars, and pandas]
  • Snake = Python - Data analysis lingua franca
  • Duck = DuckDB - Fast, free, column-oriented database. Version 1.0 released 2024-06-03
  • PolarBear = Polars - “Dataframes for the new era”
    • NumPy - 2005 - Large, multi-dimensional arrays
    • pandas - ~2009 - Dataframes + functions for analysis
    • Polars - 2024-07-01 - Version 1.0, dataframes + SQL
  • Dolphin = MySQL - Humongously popular database


Parquet File Structure

Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Apache Parquet is designed to be a common interchange format for both batch and interactive workloads.

[Figure: Parquet file format. Credit: Michael Berk]



Arrow Memory Structure

Apache Arrow contains a standardized column-oriented memory format that is able to represent flat and hierarchical data for efficient analytic operations on modern CPU and GPU hardware.

[Figure: Arrow interchange format]

Demo Code