Columnar Data Formats (Parquet, ORC) & Encodings
4 min readAug 18, 2024
In this article, we delve into the differences in data storage and encoding strategies, focusing on how they affect performance, especially for OLAP and OLTP workloads.
Breaking Down Data Storage in Columns
When storing a tuple in a columnar format, each attribute is stored separately. During query processing, these individual attributes are “stitched” back together to form the complete tuple.
- OLAP Workloads: Typically involve sequential scans of large, read-only data segments (e.g., column subsets). The DBMS only needs to reassemble tuples (“stitching”) at the final stages of the query — this is known as late materialization. The idea is to delay stitching as long as possible to avoid unnecessary I/O, especially when certain tuples might not be needed as we traverse the query plan.
- OLTP Workloads: Rely on indexes to find individual tuples directly, avoiding full sequential scans. These systems also require support for incremental updates. Tree-based indexes (e.g., B+ Trees) are well-suited for low-selectivity predicates common in OLTP queries.
Storage Models Overview
- N-ary Storage Model (NSM): Row-based storage (e.g., MySQL, PostgreSQL).
- Decomposition Storage Model (DSM): Fully columnar storage.
- Hybrid Storage Model (PAX): Used in formats like Parquet and ORC.
N-ary Storage Model (NSM)
- Stores all attributes of a tuple contiguously on a single page, making it ideal for OLTP workloads where transactions access individual entries and where there is frequent insertion.
- Page Sizes: Typically multiples of 4KB. If attributes are oversized and cannot fit within a 4KB page, auxiliary structures like
TOAST
(in PostgreSQL) handle this. - Page Size Rationale: Keeping page sizes small is beneficial since fetching a record involves loading the entire page into memory. Larger pages are less efficient for single-tuple processing, where we use the Iterator Volcano Processing Model.
Decomposition Storage Model (DSM)
- Stores data by attribute across multiple tuples, enabling better compression, particularly for read-only queries.
- Transactional Updates: For updates involving many attributes, DSM can be slow. For instance, updating one record with 100 attributes might require updating 100 pages. This inefficiency makes DSM unsuitable for OLTP-style updates.
- File Size Considerations: DSM typically deals with larger file sizes (e.g., 100MB or more), although they are divided into smaller chunks.
Two Key DSM Storage Strategies:
- Fixed-Length Offsets: Stores attributes and metadata in fixed-length arrays. All data must be fixed-length for consistent offsets.
- Embedded Tuple IDs: Stores values alongside a tuple ID, requiring auxiliary structures like hash tables to determine offsets.
- Use of Fixed-Length Offsets: This is more common, even in Parquet, to maintain consistent column storage. Embedded tuple IDs are mainly used in row stores that have added a column store component.
PAX Storage Model (Hybrid Approach)
- PAX partitions data vertically within each page, allowing the benefits of columnar storage while preserving spatial locality from row-based storage.
- Row Groups & Column Chunks: Data is horizontally partitioned into row groups (~10MB in size), with attributes stored in vertically partitioned column chunks.
- Metadata Placement: Metadata is stored at the file’s end because updates are append-only, requiring post-processing.
Open File Formats in OLAP: Parquet & ORC
In OLAP, portability is key. Proprietary file formats (e.g., Oracle or PostgreSQL) can hinder data movement across systems. Open formats like Parquet and ORC allow for seamless integration across different compute engines.
1. File Metadata
- In OLTP databases, file metadata is limited to data and schema, while additional info is kept in catalogs. OLAP files, in contrast, are self-contained, storing schema, row group offsets, tuple counts, and zone maps directly within the file.
2. Format Layout
- Parquet: Row groups are defined by the number of tuples (e.g., 1 million rows), which can lead to large row groups when there are many columns.
- ORC: Row groups are defined by physical storage size (e.g., 250MB), offering better control over group sizes.
3. Type Systems
- Both Parquet and ORC define types at both physical and logical levels. Parquet favors minimal physical types (e.g., INT32), while ORC supports a wider range.
4. Encoding Schemes
- Dictionary Encoding: Converts common values into codes stored in a dictionary.
- Run-Length Encoding (RLE): Efficient for compressing repeated values, though Parquet and ORC apply it differently.
- Other Schemes: Delta encoding, bitpacking, and Frame-of-Reference (FOR) encoding are used for specific scenarios.
5. Compression
- Both formats use block compression. While useful, block compression can slow down processing due to the extra decompression step. Parquet and ORC allow tuning for compression level and speed.
6. Filters
- Zone Maps: Track min-max values per column, helping filter out irrelevant data.
- Bloom Filters: Probabilistic data structures that efficiently determine if a value might exist.
7. Nested Data Models
- Both formats support semi-structured data (e.g., JSON), though they differ in how they handle repetition and optional fields.
Key Takeaways
- Dictionary encoding is effective for various data types, not just strings.
- Simpler encoding schemes perform better on modern hardware.
- Avoid excessive block compression as network speeds have improved, making CPU performance the primary bottleneck.