Time Series Database

Below is a comprehensive listing of tick data storage options.
The assumption is that the market data is level 2 tick data and/or very large (multiple GB per day). If your data set is smaller, pick whichever format is easiest for you. That may even mean keeping it in memory in whichever language you are using.

As quick overall guidance, we suggest:

For a smaller number of concurrent users, a shared file system can be used to scale a flat-file based solution.
Where you have many concurrent users, a database solution is recommended.

Flat Files

Flat files have the benefit of being extremely simple and can be processed in almost any language.

Beyond the simplest use-cases, we recommend Parquet file storage. It stores the data column-oriented and allows easy compression, with APIs available for most major languages.

Apache Arrow: Standardized column-oriented format optimized for memory access that is able to represent flat and hierarchical data for efficient analytics.
Apache Avro: Serialization format for record data, offers excellent schema evolution.
Apache Iceberg: Open table format for analytic datasets (Hive/Spark).
Apache ORC: High-performance columnar storage for Hadoop.
Apache Parquet: Column-oriented data file format designed for efficient data storage and retrieval, including data compression.
HDF5: High-performance file format to manage, process, and store your heterogeneous data. HDF5 is built for fast I/O processing and storage.
CSV: Simple but not efficient. Sometimes zipped for data compression.
Language Serialization: For example, pickling in Python or Java serialization. Not recommended unless you are very sure you will only ever have that specific access method.
TeaFiles: Provide fast read/write access to time series data from any software package on any platform.
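
As an illustration of the "zipped CSV" option in the list above, here is a minimal sketch using only the Python standard library. The column names, sample ticks, and file name are assumptions made up for the example, not a recommended schema.

```python
import csv
import gzip
import tempfile
from pathlib import Path

# Hypothetical sample ticks: (timestamp, symbol, price, size).
TICKS = [
    ("2023-11-28T08:00:00.000001", "GOOG", "136.41", "100"),
    ("2023-11-28T08:00:00.000450", "GOOG", "136.42", "50"),
    ("2023-11-28T08:00:01.120000", "GOOG", "136.40", "200"),
]
HEADER = ("timestamp", "symbol", "price", "size")


def write_ticks(path, rows):
    """Write rows to a gzip-compressed CSV file with a header line."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(rows)


def read_ticks(path):
    """Read a gzip-compressed CSV file back into a list of dicts."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "GOOG-2023-11-28.csv.gz"
    write_ticks(path, TICKS)
    loaded = read_ticks(path)
```

Every value comes back as a string, so each reader has to re-parse timestamps and numbers. That is part of why CSV is simple but not efficient compared with the typed, column-oriented formats above.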

Folder / File Structure

Depending on your use-case you will want to be careful what folder structure you choose.
Will you be examining the data for less than 10 stocks at once? Within what time-frame?
Will you be wanting to look at data for 100s of stocks within a small time-frame?

Typically, if you want to look at a small number of symbols, you may store the files in a folder structure such as symbol/YYYY-MM-DD.csv

e.g. daily split - GOOG/2023-11-28.csv AAPL/2023-11-28.csv
e.g. hourly - GOOG/2023-11-28T08.csv GOOG/2023-11-28T09.csv

This allows easy slicing of data by symbol and date/time.
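
A minimal sketch of how the symbol/YYYY-MM-DD.csv layout supports slicing by symbol and date. The helper names `tick_path` and `paths_for` are hypothetical, and the files written here are placeholders containing only a header row.

```python
import csv
import tempfile
from datetime import date, timedelta
from pathlib import Path


def tick_path(root, symbol, day):
    """Return root/SYMBOL/YYYY-MM-DD.csv for one symbol and one day."""
    return Path(root) / symbol / f"{day.isoformat()}.csv"


def paths_for(root, symbols, start, end):
    """Return the existing daily files for the symbols, start to end inclusive."""
    found = []
    for symbol in symbols:
        day = start
        while day <= end:
            path = tick_path(root, symbol, day)
            if path.exists():
                found.append(path)
            day += timedelta(days=1)
    return found


with tempfile.TemporaryDirectory() as root:
    # Create placeholder daily files to stand in for real data.
    for symbol in ("GOOG", "AAPL"):
        for day in (date(2023, 11, 27), date(2023, 11, 28)):
            path = tick_path(root, symbol, day)
            path.parent.mkdir(parents=True, exist_ok=True)
            with path.open("w", newline="") as f:
                csv.writer(f).writerow(["timestamp", "price", "size"])

    # Slice: one symbol, a three-day window, only files that exist.
    selected = paths_for(root, ["GOOG"], date(2023, 11, 28), date(2023, 11, 30))
    selected_names = [f"{p.parent.name}/{p.name}" for p in selected]
```

Because the symbol and date are encoded in the path, selecting a slice never requires opening a file that is not needed. The trade-off is that a query across hundreds of symbols opens hundreds of files.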

Standard Databases

If you happen to have a lot of existing in-house expertise and pre-configured workflows, you could consider using a standard database. PostgreSQL, MySQL, and MS SQL will NOT scale to the largest tick data sets, but perhaps you don't need every message stored. Storing aggregates or samples, while still being able to reuse all your existing tooling, may be a worthwhile trade-off.
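
To make the idea of storing aggregates rather than every message concrete, here is a minimal sketch that collapses ticks into one-minute bars. It uses SQLite from the Python standard library purely as a stand-in for a standard SQL database. The table names, column names, and sample ticks are assumptions for illustration.

```python
import sqlite3

# Hypothetical ticks: (ISO timestamp, symbol, price, size).
ticks = [
    ("2023-11-28T08:00:01", "GOOG", 136.41, 100),
    ("2023-11-28T08:00:30", "GOOG", 136.45, 50),
    ("2023-11-28T08:00:59", "GOOG", 136.40, 200),
    ("2023-11-28T08:01:10", "GOOG", 136.50, 75),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (ts TEXT, symbol TEXT, price REAL, size INTEGER)")
conn.executemany("INSERT INTO ticks VALUES (?, ?, ?, ?)", ticks)

# Collapse ticks into one row per symbol per minute.
# substr(ts, 1, 16) keeps 'YYYY-MM-DDTHH:MM', i.e. truncates to the minute.
conn.execute(
    """
    CREATE TABLE bars_1m AS
    SELECT symbol,
           substr(ts, 1, 16) AS minute,
           MIN(price) AS low,
           MAX(price) AS high,
           SUM(size) AS volume,
           COUNT(*) AS tick_count
    FROM ticks
    GROUP BY symbol, substr(ts, 1, 16)
    """
)

bars = conn.execute(
    "SELECT minute, low, high, volume, tick_count FROM bars_1m ORDER BY minute"
).fetchall()
conn.close()
```

Four ticks become two rows here. On real level 2 data the reduction is far larger, which is what can keep a standard database viable.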

One thing to keep in mind is that for MySQL and PostgreSQL there are customized extensions available that you could use later should you need more performance. For example, Citus is a PostgreSQL extension that adds columnar storage, which is more suitable for market data.

Time-Series Databases

These databases will have an increased cost, as they are non-standard or expensive commercial solutions. However, with the cost come benefits: they have been optimized for speed and for performing the advanced queries you may need.

Top time series databases include our three recommendations, shown below.

Product | Score | SQL | Time-Joins | Popularity | Description | License
ClickHouse | 8 | Some + Custom | Yes (asof) | Yes (popular) | Very fast OLAP database with cloud version available. Started 10 years ago at Yandex to store the Russian equivalent of Google Analytics. | Apache License 2.0
QuestDB | 7 | High + Extensions | Yes (asof+) | N/A (new) | Fast database with strong focus on time-series. Very similar ideas to kdb+ but open source. | Apache License 2.0
kdb+ | 8 | No (some + qSQL) | Yes (AJ/WJ) | Yes (finance) | Very fast column-oriented database with custom language q and custom time-series joins. Steep learning curve and difficult to find experts. |
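
The as-of time joins listed for these products match each row with the most recent row from another table at or before its timestamp, for example each trade with the prevailing quote. The sketch below shows the concept in plain Python using binary search. It is not how any of these databases implements the join, and the sample data and the `asof_join` helper are hypothetical.

```python
from bisect import bisect_right

# Hypothetical data, both sorted by time (seconds since the open).
quotes = [(1.0, 99.9, 100.1), (2.5, 100.0, 100.2), (4.0, 100.1, 100.3)]  # (time, bid, ask)
trades = [(0.5, 100.0), (2.5, 100.1), (3.9, 100.2), (5.0, 100.3)]  # (time, price)


def asof_join(trades, quotes):
    """Attach to each trade the latest quote at or before the trade time."""
    quote_times = [q[0] for q in quotes]
    joined = []
    for trade_time, price in trades:
        # bisect_right gives the index of the first quote strictly after
        # the trade, so the quote just before that index is the match.
        i = bisect_right(quote_times, trade_time) - 1
        quote = quotes[i] if i >= 0 else None  # None: no quote yet
        joined.append((trade_time, price, quote))
    return joined


result = asof_join(trades, quotes)
```

The first trade arrives before any quote and so has no match. The trade at 3.9 picks up the quote from 2.5 because the next quote only arrives at 4.0.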

Storing Market Data in Python

We decided to add this category because a) Python is hugely popular and b) this means custom solutions have recently been introduced that scale much better:

ArcticDB: High performance, serverless DataFrame database built for the Python Data Science ecosystem.
DuckDB: An in-process SQL OLAP database management system.