Benchmark catalog v2 for single and multi stream #1085

twitu · 2023-04-25T10:25:43Z

Pull Request

Adds benchmarks for single-stream and multi-stream loading. The benchmarks measure how long the catalog takes to load a single stream of data and multiple streams of data. The multi-stream case is a good stress test for the DataFusion-based catalog, which merges streams.
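To illustrate what stream merging means here, below is a minimal Python sketch of a k-way merge over streams that are each already sorted by timestamp. It is not the catalog's implementation (which is Rust, built on DataFusion and kmerge); the `Tick` type, its field names, and `merge_streams` are hypothetical names chosen for this example.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Tick(NamedTuple):
    ts: int        # timestamp; hypothetical field name
    stream: str    # label of the source stream, for illustration only
    price: float


def merge_streams(streams: Iterable[Iterable[Tick]]) -> Iterator[Tick]:
    """Lazily merge streams that are each already sorted by `ts`."""
    # heapq.merge keeps one head element per stream in a heap, so memory
    # grows with the number of streams, not the number of ticks.
    return heapq.merge(*streams, key=lambda tick: tick.ts)


a = [Tick(1, "A", 10.0), Tick(4, "A", 10.5), Tick(6, "A", 10.2)]
b = [Tick(2, "B", 20.0), Tick(3, "B", 20.1), Tick(7, "B", 19.9)]
c = [Tick(5, "C", 30.0)]

merged = list(merge_streams([a, b, c]))
print([(t.ts, t.stream) for t in merged])
```

The merged output is ordered by timestamp across all three inputs. A benchmark over 61 files exercises the same idea with far more streams and far more data per stream.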

The benchmarks were run on an AMD 5600U machine and produced the results below. To run them, place the bench_data directory in the root of the repository, then run cargo bench in the persistence crate for the Rust benchmarks and pytest -k stream_catalog_v2 for the Python catalog benchmarks.

Test	Rust	Python
single stream (10M)	2.4s	3.2s
multi stream (72M, 61 files)	23s	NA

The Python tests are expected to take longer because of the overhead of initializing Python objects and acquiring the GIL.

The multi-stream Python benchmark requires 8+ GB of RAM and could not be run, which is why the table shows NA for it.

An appropriate measure for the catalog is its throughput, which can be measured in ticks per second. Here we can compare with the previous iterations described in #705:

Rust catalog v1 - arrow2 (single stream, only Rust structs, no stream merge logic) ~ 6M/s
Rust catalog v2 - datafusion and kmerge ~ 3M/s
~~Rust catalog v1 - arrow2 and Python merge logic~~ dropped because of the significant implementation effort; expected to be less than 3M/s
Catalog - Pure Python - 100K/s
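As a rough cross-check, the throughput figures can be derived from the benchmark table. The sketch below assumes that "10M" and "72M" in the table are tick counts and that the reported times cover the full load; the helper name `ticks_per_second` is made up for this example.

```python
def ticks_per_second(ticks: int, seconds: float) -> float:
    """Throughput as ticks loaded per second of wall-clock time."""
    return ticks / seconds


# (tick count, seconds) taken from the benchmark table above.
# Assumption: "10M" and "72M" denote the number of ticks loaded.
results = {
    "rust single stream": (10_000_000, 2.4),
    "python single stream": (10_000_000, 3.2),
    "rust multi stream": (72_000_000, 23.0),
}

throughput = {
    name: ticks_per_second(ticks, seconds)
    for name, (ticks, seconds) in results.items()
}

for name, rate in throughput.items():
    print(f"{name}: {rate / 1e6:.2f}M ticks/s")
```

Under these assumptions the Rust single-stream run works out to roughly 4.2M ticks/s, and the Python single-stream and Rust multi-stream runs to roughly 3.1M ticks/s each. The multi-stream figure is in line with the ~ 3M/s quoted for catalog v2.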

Type of change

Performance tests

How has this change been tested?

The benchmarks were run and passed.

codecov · 2023-04-25T11:11:47Z

Codecov Report

Patch coverage has no change and project coverage change: -0.27 ⚠️

Comparison is base (efbef94) 90.26% compared to head (1c9afab) 90.00%.

❗ Current head 1c9afab differs from pull request most recent head c4d7ed2. Consider uploading reports for the commit c4d7ed2 to get more accurate results

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #1085      +/-   ##
===========================================
- Coverage    90.26%   90.00%   -0.27%     
===========================================
  Files          257      257              
  Lines        26936    26817     -119     
===========================================
- Hits         24314    24136     -178     
- Misses        2622     2681      +59

see 42 files with indirect coverage changes


twitu · 2023-05-07T11:20:29Z

Closes #705

Benchmark new catalog for single and multi stream

4e965cb

Fix precommit

18e5653

twitu marked this pull request as draft April 26, 2023 04:16

twitu added 2 commits April 26, 2023 12:23

Skip benchmark in CI

8cc919e

Merge branch 'develop' into catalog_benchmark

c4d7ed2

twitu marked this pull request as ready for review May 7, 2023 11:07

twitu linked an issue May 7, 2023 that may be closed by this pull request

Porting persistence layer to Rust #705

Closed

cjdsellers merged commit 91b3a2a into develop May 8, 2023

cjdsellers deleted the catalog_benchmark branch May 8, 2023 08:15
