Storing Market Data Efficiently
Below is a comprehensive listing of tick data storage options.
The assumption is that the market data is level 2 tick data and/or very large (multiple GB per day).
If your data set is smaller, you should pick whichever format is easiest for you. That may even just be in-memory in whichever language you are using.
To quickly provide some overall guidance, we suggest:
- Flat Files - For simple cases that read all of the data, e.g. backtesting.
- Standard Databases - If you need to query and slice the data and want to take advantage of standard SQL and tooling.
- Time-Series Databases - If you want to perform specialized queries and get great performance.
- Storing Market Data in Python - If you are working solely in Python.
For a small number of concurrent users, a shared file system can be used to scale a flat-file based solution.
Where you have many concurrent users, a database solution is recommended.
Flat Files
Flat files have the benefit of being extremely simple and can be processed in almost any language.
Beyond the simplest use-cases, we recommend Parquet file storage: it stores the data column-oriented, allows easy compression, and has APIs available for every language.
| Technology | Description |
|---|---|
| Apache Arrow | Standardized column-oriented format optimized for memory access that is able to represent flat and hierarchical data for efficient analytics. |
| Apache Avro | Serialization format for record data, offers excellent schema evolution. |
| Apache Iceberg | Open table format for analytic datasets. (Hive/Spark) |
| Apache ORC | High-performance columnar storage for Hadoop. |
| Apache Parquet | Column-oriented data file format designed for efficient data storage and retrieval including data compression. |
| Lance | Cloud-native columnar format designed for fast random access and high-throughput scans. Optimized for large binary objects (embeddings, images, audio) with O(1) lookup via repetition indexes and efficient SIMD-friendly mini-blocks. |
| Vortex | Extensible, state-of-the-art columnar format and compression framework. Supports SIMD-native decompression, cascading encodings (FSST, ALP, FastLanes), zero-copy Arrow interoperability, fast random access, and customizable layouts. |
| HDF5 | High performance file format to manage, process, and store your heterogeneous data. Built for fast I/O processing and storage. |
| CSV | Simple but not efficient. Sometimes zipped for data compression. |
| Language Serialization | For example, pickling in Python or Java serialization. Not recommended unless you are very sure you will only ever have that specific access method. |
| TeaFiles | TeaFiles provide fast read/write access to time series data from any software package on any platform. |
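As a minimal sketch of the Parquet recommendation above, the snippet below writes and reads a small tick DataFrame with pandas; the tick schema and file name are illustrative, not a standard:

```python
# Minimal Parquet round-trip with pandas (requires pyarrow or fastparquet).
import pandas as pd

ticks = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-11-28 09:30:00.001",
                                 "2023-11-28 09:30:00.042"]),
    "symbol": ["GOOG", "GOOG"],
    "bid": [137.51, 137.52],
    "ask": [137.53, 137.54],
    "bid_size": [300, 200],
    "ask_size": [100, 400],
})

# Column-oriented storage with compression built in.
ticks.to_parquet("GOOG-2023-11-28.parquet", compression="snappy")

# Reading back only the columns you need is cheap in a columnar format.
quotes = pd.read_parquet("GOOG-2023-11-28.parquet",
                         columns=["timestamp", "bid", "ask"])
```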
Folder / File Structure
Depending on your use-case, you will want to choose your folder structure carefully.
Will you be examining data for fewer than 10 stocks at once? Within what time-frame?
Will you want to look at data for hundreds of stocks within a small time-frame?
Typically, if you only look at a small number of symbols at a time, you can structure the files in a folder hierarchy of the form symbol/YYYY-MM-DD.csv
e.g. daily split: GOOG/2023-11-28.csv, AAPL/2023-11-28.csv
e.g. hourly split: GOOG/2023-11-28T08.csv, GOOG/2023-11-28T09.csv
This allows easy slicing of data by symbol and date/time.
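As a sketch of how this layout is consumed, the hypothetical reader below reassembles a date range for one symbol from the daily split; the root path and timestamp column are assumptions:

```python
# Read a date range for one symbol from a symbol/YYYY-MM-DD.csv layout.
from pathlib import Path
import pandas as pd

def load_ticks(root: str, symbol: str, start: str, end: str) -> pd.DataFrame:
    """Concatenate daily CSV files for `symbol` between `start` and `end` (inclusive)."""
    frames = []
    for day in pd.date_range(start, end, freq="D"):
        path = Path(root) / symbol / f"{day:%Y-%m-%d}.csv"
        if path.exists():  # skip weekends/holidays with no file
            frames.append(pd.read_csv(path, parse_dates=["timestamp"]))
    return pd.concat(frames, ignore_index=True)

goog = load_ticks("ticks", "GOOG", "2023-11-27", "2023-11-28")
```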
Standard Databases
If you happen to have a lot of existing in-house expertise and pre-configured workflows, you could consider using a standard database. PostgreSQL, MySQL, and MS SQL will NOT scale to the largest tick data sets, but perhaps you don't need every message stored. Storing aggregates or samples, while still being able to reuse all your existing tooling, may be a worthwhile trade-off.
One thing to keep in mind is that MySQL and PostgreSQL have customized extensions available that you could adopt later should you need more performance. For example, Citus extends PostgreSQL with distributed scaling and columnar storage, making it more suitable for market data.
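To make the aggregates trade-off concrete, here is a sketch that resamples raw ticks to 1-minute bars in pandas and stores only the bars in PostgreSQL; the table, schema, and connection string are hypothetical:

```python
# Store 1-minute OHLC bars instead of raw ticks in PostgreSQL.
import pandas as pd
import psycopg2

ticks = pd.read_csv("ticks/GOOG/2023-11-28.csv", parse_dates=["timestamp"])
bars = ticks.set_index("timestamp")["bid"].resample("1min").ohlc().dropna()

conn = psycopg2.connect("dbname=markets user=quant")  # hypothetical connection
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS bars_1m (
            symbol text, ts timestamptz,
            open numeric, high numeric, low numeric, close numeric,
            PRIMARY KEY (symbol, ts))""")
    for ts, row in bars.iterrows():
        cur.execute(
            "INSERT INTO bars_1m VALUES ('GOOG', %s, %s, %s, %s, %s)"
            " ON CONFLICT DO NOTHING",
            (ts, row["open"], row["high"], row["low"], row["close"]))
```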
Time-Series Databases
These databases come at an increased cost, as they are either non-standard or expensive commercial solutions. With that cost come benefits, however: they have been optimized for speed and for the specialized queries you may need.
Top time-series databases include our three recommendations, shown below.
| Product | Score | SQL | Description | License |
|---|---|---|---|---|
| ClickHouse | 8 | Some + Custom | Very fast OLAP database with a cloud version available. Originally developed at Yandex to store data for Yandex.Metrica, the Russian equivalent of Google Analytics. | Apache License 2.0 |
| QuestDB | 7 | High + Extensions | Fast database with a strong focus on time-series. Very similar ideas to kdb+, but open source. | Apache License 2.0 |
| kdb+ | 8 | Custom (q) | Very fast column-oriented database with the custom language q and custom time-series joins. Steep learning curve, and experts are difficult to find. | Commercial |
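As one example of a specialized time-series query, QuestDB supports ASOF joins and speaks the PostgreSQL wire protocol (port 8812 on a default local install), so a sketch like the following works from Python; the trades and quotes tables are hypothetical:

```python
# ASOF-join each trade to the most recent quote for the same symbol in QuestDB.
import psycopg2

conn = psycopg2.connect(host="localhost", port=8812,
                        user="admin", password="quest", dbname="qdb")
with conn.cursor() as cur:
    cur.execute("""
        SELECT t.timestamp, t.symbol, t.price, q.bid, q.ask
        FROM trades t
        ASOF JOIN quotes q ON (symbol)
        WHERE t.symbol = 'GOOG'
    """)
    for row in cur.fetchall():
        print(row)
```

kdb+ offers the same style of join natively via its built-in aj (asof join) in q.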
Storing Market Data in Python
We decided to add this category because (a) Python is hugely popular, and (b) as a result, custom solutions have recently been introduced that scale much better:
| Technology | Description |
|---|---|
| ArcticDB | High performance, serverless DataFrame database built for the Python Data Science ecosystem. |
| DuckDB | An in-process SQL OLAP database management system. |
| DuckLake | Lakehouse table format designed for DuckDB. Provides ACID transactions, schema evolution, and versioning on top of Parquet, enabling reproducible analytics and incremental updates. |
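As a sketch of why DuckDB pairs well with the flat-file layouts above, it can query CSV or Parquet files in place with plain SQL, with no server or import step; the file path is illustrative:

```python
# Query a CSV file in place with DuckDB; nothing to load or ingest first.
import duckdb

avg_spread = duckdb.sql("""
    SELECT symbol, avg(ask - bid) AS avg_spread
    FROM read_csv_auto('ticks/GOOG/2023-11-28.csv')
    GROUP BY symbol
""").df()
print(avg_spread)
```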
Next-Gen Columnar Formats
Parquet remains the default workhorse for analytical storage, but newer formats are emerging to address its limitations around random access, schema evolution, ML workloads, and extreme compression. These formats are particularly relevant for modern use-cases such as embeddings, vector search, interactive ML training, and cloud-native analytics.
| Technology | Strengths | Primary Use-Cases | Description |
|---|---|---|---|
| Apache Parquet | High scan throughput, mature ecosystem | Batch analytics, ETL, data lakes | The industry standard columnar format for analytical workloads. Excellent for large sequential scans but poor at random access, immutable, and awkward for schema evolution and ML-style access patterns. |
| Lance | O(1) random access, cloud-native, fast scans | Embeddings, images, audio, ML training, vector search | A cloud-native columnar format designed for fast random access and high-throughput scans. Uses repetition indexes to jump directly to any row and SIMD-friendly mini-blocks for dense data. Well-suited to modern ML and AI workloads. |
| Vortex | Extreme compression, SIMD-native decompression | Analytical storage, bandwidth-limited systems, experimental engines | An extensible, research-driven columnar format and compression framework. Supports cascading encodings (FSST, ALP, FastLanes), zero-copy Arrow interoperability, fast random access to compressed data, and fully customizable layouts. Optimized to make decompression effectively “free” via SIMD. |
Rule of thumb:
- Use Parquet for classic batch analytics and data lake workloads.
- Use Lance when you need fast random access (e.g. embeddings, images, ML training, vector search).
- Use Vortex when maximum compression and decompression speed matter more than ecosystem maturity.
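To make the random-access point concrete, here is a minimal sketch using Lance's Python package (installed as pylance); the dataset path and schema are illustrative:

```python
# Write a small table to Lance, then fetch arbitrary rows by index.
import lance
import pyarrow as pa

table = pa.table({
    "id": list(range(1_000)),
    "embedding": [[float(i)] * 4 for i in range(1_000)],
})
lance.write_dataset(table, "embeddings.lance", mode="overwrite")

ds = lance.dataset("embeddings.lance")
# take() is the fast random access Lance is designed around.
rows = ds.take([3, 777, 42])
print(rows.to_pandas())
```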