CookiePieWw/json-format-in-parquet-benchmark

Benchmark of JSON Formats in Memory and Parquet File

Purpose

This repo benchmarks several JSON encoding formats on two axes: encoding/decoding performance in memory, and storage/query efficiency in Parquet files. It covers:

  • JSON, stored as a plain string
  • JSONB
  • JSONC, a variant of the tape representation
  • Variant, which dynamically parses JSON into an Arrow struct

Six metrics are evaluated for each format:

  • Encode time, the time to read JSON strings and encode them into the in-memory representation
  • Decode time, the time to decode the in-memory representation back into JSON strings
  • Flush time, the time to flush the in-memory representation into a Parquet file
  • Load time, the time to load JSON values from a Parquet file into the in-memory representation
  • Parquet file size and compression ratio
  • Query time
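
As a rough illustration of how the encode/decode timings and the compression ratio could be measured, here is a minimal sketch. It is not the repo's harness: `time_call`, `encode`, `decode`, and `compression_ratio` are hypothetical helpers, plain Python objects stand in for the in-memory representation, zlib stands in for a Parquet codec, and the ratio is assumed to mean uncompressed size divided by compressed size.

```python
import json
import time
import zlib


def time_call(fn, *args):
    """Return (result, elapsed_seconds) for a single call of fn(*args)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def encode(lines):
    # Encode: JSON strings -> in-memory representation (here, plain Python objects).
    return [json.loads(line) for line in lines]


def decode(values):
    # Decode: in-memory representation -> compact JSON strings.
    return [json.dumps(v, separators=(",", ":")) for v in values]


def compression_ratio(raw: bytes, compressed: bytes) -> float:
    # Assumed definition: uncompressed size divided by compressed size.
    return len(raw) / len(compressed)


lines = ['{"name":"a","attributes":{"event_attributes":1.5}}'] * 1024
values, encode_s = time_call(encode, lines)
strings, decode_s = time_call(decode, values)

raw = "\n".join(lines).encode("utf-8")
# zlib is only a stand-in for a Parquet codec (an assumption, not taken from the repo).
ratio = compression_ratio(raw, zlib.compress(raw))
print(f"encode {encode_s:.6f}s, decode {decode_s:.6f}s, ratio {ratio:.1f}x")
```

A single timed call is noisy; a real benchmark would repeat each measurement and report a median or a minimum.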

The JSON payloads are characterized by the following parameters:

  • Data types
  • Max nested layer, the maximum nesting depth
  • NDV, the number of distinct values, which matters for the compression methods in Parquet
  • Num of lines, the number of JSON values encoded into a Parquet file, i.e. the rows in a row group

Results in Different Payloads

Payload 1: Events of OpenTelemetry traces

Data Types: Float, String, Object

Max Nested Layer: 2

NDV: [0.1, 0.3, 0.6, 1.0]

Num of Lines: [1024, 8192, 100000]

Metric: [Parquet file size and compression ratio]

Example of a line:

{
  "name": "jHl2oDDnPc1i2OzlP5Y",
  "timestamp": "2024-07-25T04:33:11.369386Z",
  "attributes": { "event_attributes": 48.28667666464348 }
}

The timestamp is distinct for all lines.
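
To make the payload parameters concrete, the following sketch generates lines shaped like the example above. It is an illustration, not the repo's generator: `make_lines`, `ndv_ratio`, and `seed` are hypothetical names, and it assumes that an NDV value such as 0.3 means roughly 30% of the lines carry a distinct `name` and `event_attributes` value, while the timestamp stays distinct on every line.

```python
import json
import random
import string
from datetime import datetime, timedelta, timezone


def make_lines(num_lines: int, ndv_ratio: float, seed: int = 0) -> list[str]:
    """Generate JSON lines shaped like the example event.

    Assumption: ndv_ratio is the fraction of lines carrying a distinct
    name / attribute value, so the value pool holds about
    num_lines * ndv_ratio entries.
    """
    rng = random.Random(seed)
    pool_size = max(1, round(num_lines * ndv_ratio))
    alphabet = string.ascii_letters + string.digits
    names = ["".join(rng.choices(alphabet, k=19)) for _ in range(pool_size)]
    floats = [rng.uniform(0.0, 100.0) for _ in range(pool_size)]
    base = datetime(2024, 7, 25, 4, 33, 11, tzinfo=timezone.utc)

    lines = []
    for i in range(num_lines):
        k = i % pool_size  # cycle through the pool so every entry is used
        ts = base + timedelta(microseconds=i)  # distinct for every line
        lines.append(json.dumps({
            "name": names[k],
            "timestamp": ts.strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
            "attributes": {"event_attributes": floats[k]},
        }))
    return lines


sample = make_lines(num_lines=1024, ndv_ratio=0.3)
print(sample[0])
```

Sweeping `ndv_ratio` over [0.1, 0.3, 0.6, 1.0] and `num_lines` over [1024, 8192, 100000] would reproduce the parameter grid listed for this payload, under the interpretation of NDV stated above.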

Results: File Size and Compression Ratio
