Storing Market Data Efficiently
Below is a comprehensive listing of tick data storage options.
The assumption is that the market data is level 2 tick data and/or very large (multiple GB per day).
If your data set is smaller, you should pick whichever format is easiest for you. That may even just be in-memory in whichever language you are using.
To quickly provide some overall guidance, we suggest:
- Flat Files - For simple cases that read all of the data, e.g. backtesting.
- Standard Databases - If you need to query and slice the data and want to take advantage of standard SQL and tooling.
- Time-Series Databases - If you want to perform specialized queries and get great performance.
- Storing Market Data in Python - If you are working solely in Python.
For a small number of concurrent users, a shared file system can be used to scale a flat-file based solution.
Where you have many concurrent users, a database solution is recommended.
Flat Files
Flat files have the benefit of being extremely simple and can be processed in almost any language.
Beyond the simplest use-cases, we recommend Parquet file storage: it stores the data column-oriented, allows easy compression, and has APIs available for every language.
| Technology | Description |
|---|---|
| Apache Arrow | Standardized column-oriented format optimized for memory access that is able to represent flat and hierarchical data for efficient analytics. |
| Apache Avro | Serialization format for record data, offers excellent schema evolution. |
| Apache Iceberg | Open table format for analytic datasets. (Hive/Spark) |
| Apache ORC | High-performance columnar storage for Hadoop. |
| Apache Parquet | Column-oriented data file format designed for efficient data storage and retrieval including data compression. |
| Lance | Cloud-native columnar format designed for fast random access and high-throughput scans. Optimized for large binary objects (embeddings, images, audio) with O(1) lookup via repetition indexes and efficient SIMD-friendly mini-blocks. |
| Vortex | Extensible, state-of-the-art columnar format and compression framework. Supports SIMD-native decompression, cascading encodings (FSST, ALP, FastLanes), zero-copy Arrow interoperability, fast random access, and customizable layouts. |
| HDF5 | High performance file format to manage, process, and store your heterogeneous data. Built for fast I/O processing and storage. |
| CSV | Simple but not efficient. Sometimes zipped for data compression. |
| Language Serialization | For example, pickling in Python or Java serialization. Not recommended unless you are very sure you will only ever have that specific access method. |
| TeaFiles | TeaFiles provide fast read/write access to time series data from any software package on any platform. |
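As a minimal sketch of the Parquet recommendation above, the snippet below writes and reads a small tick DataFrame with pandas; the tick schema and file name are illustrative, not a standard:

```python
# Minimal Parquet round-trip with pandas (requires pyarrow or fastparquet).
import pandas as pd

ticks = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-11-28 09:30:00.001",
                                 "2023-11-28 09:30:00.042"]),
    "symbol": ["GOOG", "GOOG"],
    "bid": [137.51, 137.52],
    "ask": [137.53, 137.54],
    "bid_size": [300, 200],
    "ask_size": [100, 400],
})

# Column-oriented storage with compression built in.
ticks.to_parquet("GOOG-2023-11-28.parquet", compression="snappy")

# Reading back only the columns you need is cheap in a columnar format.
quotes = pd.read_parquet("GOOG-2023-11-28.parquet",
                         columns=["timestamp", "bid", "ask"])
```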
Folder / File Structure
Depending on your use-case, you will want to choose your folder structure carefully.
Will you be examining data for fewer than 10 stocks at once? Within what time-frame?
Will you want to look at data for hundreds of stocks within a small time-frame?
Typically, if you only look at a small number of symbols at a time, you can structure the files in a folder hierarchy of the form symbol/YYYY-MM-DD.csv
e.g. daily split: GOOG/2023-11-28.csv, AAPL/2023-11-28.csv
e.g. hourly split: GOOG/2023-11-28T08.csv, GOOG/2023-11-28T09.csv
This allows easy slicing of data by symbol and date/time.
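As a sketch of how this layout is consumed, the hypothetical reader below reassembles a date range for one symbol from the daily split; the root path and timestamp column are assumptions:

```python
# Read a date range for one symbol from a symbol/YYYY-MM-DD.csv layout.
from pathlib import Path
import pandas as pd

def load_ticks(root: str, symbol: str, start: str, end: str) -> pd.DataFrame:
    """Concatenate daily CSV files for `symbol` between `start` and `end` (inclusive)."""
    frames = []
    for day in pd.date_range(start, end, freq="D"):
        path = Path(root) / symbol / f"{day:%Y-%m-%d}.csv"
        if path.exists():  # skip weekends/holidays with no file
            frames.append(pd.read_csv(path, parse_dates=["timestamp"]))
    return pd.concat(frames, ignore_index=True)

goog = load_ticks("ticks", "GOOG", "2023-11-27", "2023-11-28")
```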
Standard Databases
If you happen to have a lot of existing in-house expertise and pre-configured workflows, you could consider using a standard database. PostgreSQL, MySQL, and MS SQL will NOT scale to the largest tick data sets, but perhaps you don't need every message stored. Storing aggregates or samples, while still being able to reuse all your existing tooling, may be a worthwhile trade-off.
One thing to keep in mind is that MySQL and PostgreSQL have customized extensions available that you could adopt later should you need more performance. For example, Citus extends PostgreSQL with distributed scaling and columnar storage, making it more suitable for market data.
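To make the aggregates trade-off concrete, here is a sketch that resamples raw ticks to 1-minute bars in pandas and stores only the bars in PostgreSQL; the table, schema, and connection string are hypothetical:

```python
# Store 1-minute OHLC bars instead of raw ticks in PostgreSQL.
import pandas as pd
import psycopg2

ticks = pd.read_csv("ticks/GOOG/2023-11-28.csv", parse_dates=["timestamp"])
bars = ticks.set_index("timestamp")["bid"].resample("1min").ohlc().dropna()

conn = psycopg2.connect("dbname=markets user=quant")  # hypothetical connection
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS bars_1m (
            symbol text, ts timestamptz,
            open numeric, high numeric, low numeric, close numeric,
            PRIMARY KEY (symbol, ts))""")
    for ts, row in bars.iterrows():
        cur.execute(
            "INSERT INTO bars_1m VALUES ('GOOG', %s, %s, %s, %s, %s)"
            " ON CONFLICT DO NOTHING",
            (ts, row["open"], row["high"], row["low"], row["close"]))
```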
Time-Series Databases
These databases come at an increased cost, as they are either non-standard or expensive commercial solutions. With that cost come benefits, however: they have been optimized for speed and for the specialized queries you may need.
Top time-series databases include our three recommendations, shown below.
| Product | Score | SQL | Description | License |
|---|---|---|---|---|
| ClickHouse | 8 | Some + Custom | Very fast OLAP database with a cloud version available. Originally developed at Yandex to store data for Yandex.Metrica, the Russian equivalent of Google Analytics. | Apache License 2.0 |
| QuestDB | 7 | High + Extensions | Fast database with a strong focus on time-series. Very similar ideas to kdb+, but open source. | Apache License 2.0 |
| kdb+ | 8 | Custom (q) | Very fast column-oriented database with the custom language q and custom time-series joins. Steep learning curve, and experts are difficult to find. | Commercial |
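As one example of a specialized time-series query, QuestDB supports ASOF joins and speaks the PostgreSQL wire protocol (port 8812 on a default local install), so a sketch like the following works from Python; the trades and quotes tables are hypothetical:

```python
# ASOF-join each trade to the most recent quote for the same symbol in QuestDB.
import psycopg2

conn = psycopg2.connect(host="localhost", port=8812,
                        user="admin", password="quest", dbname="qdb")
with conn.cursor() as cur:
    cur.execute("""
        SELECT t.timestamp, t.symbol, t.price, q.bid, q.ask
        FROM trades t
        ASOF JOIN quotes q ON (symbol)
        WHERE t.symbol = 'GOOG'
    """)
    for row in cur.fetchall():
        print(row)
```

kdb+ offers the same style of join natively via its built-in aj (asof join) in q.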
Storing Market Data in Python
We decided to add this category because (a) Python is hugely popular, and (b) as a result, custom solutions have recently been introduced that scale much better:
| Technology | Description |
|---|---|
| ArcticDB | High performance, serverless DataFrame database built for the Python Data Science ecosystem. |
| DuckDB | An in-process SQL OLAP database management system. |
| DuckLake | Lakehouse table format designed for DuckDB. Provides ACID transactions, schema evolution, and versioning on top of Parquet, enabling reproducible analytics and incremental updates. |
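As a sketch of why DuckDB pairs well with the flat-file layouts above, it can query CSV or Parquet files in place with plain SQL, with no server or import step; the file path is illustrative:

```python
# Query a CSV file in place with DuckDB; nothing to load or ingest first.
import duckdb

avg_spread = duckdb.sql("""
    SELECT symbol, avg(ask - bid) AS avg_spread
    FROM read_csv_auto('ticks/GOOG/2023-11-28.csv')
    GROUP BY symbol
""").df()
print(avg_spread)
```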
Next-Gen Columnar Formats
Parquet remains the default workhorse for analytical storage, but newer formats are emerging to address its limitations around random access, schema evolution, ML workloads, and extreme compression. These formats are particularly relevant for modern use-cases such as embeddings, vector search, interactive ML training, and cloud-native analytics.
| Technology | Strengths | Primary Use-Cases | Description |
|---|---|---|---|
| Apache Parquet | High scan throughput, mature ecosystem | Batch analytics, ETL, data lakes | The industry standard columnar format for analytical workloads. Excellent for large sequential scans but poor at random access, immutable, and awkward for schema evolution and ML-style access patterns. |
| Lance | O(1) random access, cloud-native, fast scans | Embeddings, images, audio, ML training, vector search | A cloud-native columnar format designed for fast random access and high-throughput scans. Uses repetition indexes to jump directly to any row and SIMD-friendly mini-blocks for dense data. Well-suited to modern ML and AI workloads. |
| Vortex | Extreme compression, SIMD-native decompression | Analytical storage, bandwidth-limited systems, experimental engines | An extensible, research-driven columnar format and compression framework. Supports cascading encodings (FSST, ALP, FastLanes), zero-copy Arrow interoperability, fast random access to compressed data, and fully customizable layouts. Optimized to make decompression effectively “free” via SIMD. |
Rule of thumb:
- Use Parquet for classic batch analytics and data lake workloads.
- Use Lance when you need fast random access (e.g. embeddings, images, ML training, vector search).
- Use Vortex when maximum compression and decompression speed matter more than ecosystem maturity.
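To make the random-access point concrete, here is a minimal sketch using Lance's Python package (installed as pylance); the dataset path and schema are illustrative:

```python
# Write a small table to Lance, then fetch arbitrary rows by index.
import lance
import pyarrow as pa

table = pa.table({
    "id": list(range(1_000)),
    "embedding": [[float(i)] * 4 for i in range(1_000)],
})
lance.write_dataset(table, "embeddings.lance", mode="overwrite")

ds = lance.dataset("embeddings.lance")
# take() is the fast random access Lance is designed around.
rows = ds.take([3, 777, 42])
print(rows.to_pandas())
```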