[Epic]: Complete ROW Format (Missing features) #1861

Closed
5 of 37 tasks
yjshen opened this issue Feb 18, 2022 · 2 comments
Assignees
Labels
datafusion Changes in the datafusion crate development-process Related to development process of DataFusion enhancement New feature or request performance Make DataFusion faster

Comments

@yjshen
Member

yjshen commented Feb 18, 2022

Goal: a complete row implementation, fully used in pipeline breaker operators when possible.

Summary
TLDR: The key focus of this work is to speed up fundamentally row-oriented operations such as hash table lookups and comparisons (e.g. #2427).

Background

DataFusion, like many Arrow systems, is a classic "vectorized computation engine", which works quite well for many common operations. The following paper gives a good treatment of the tradeoffs between vectorized execution and JIT compilation of query plans: https://db.in.tum.de/~kersten/vectorization_vs_compilation.pdf?lang=de

As mentioned in the paper, there are some fundamentally "row oriented" operations in a database that are not typically amenable to vectorization. The "classics" are hash table updates in joins and hash aggregates, as well as tuple comparisons in sorts.

When operating with a row-based format, the per-tuple type dispatch overhead becomes quite important, so such operations are typically implemented using just-in-time (JIT) compilation or other unsafe mechanisms to minimize the overhead.
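To make the dispatch overhead concrete, here is a minimal sketch comparing two rows of fixed-width fields. `compare_dispatch` matches on the field type for every field of every row pair, while `build_comparator` resolves the types once per schema and reuses the result. Everything here (`FieldType`, `read8`, the 8-byte little-endian layout) is hypothetical and not the DataFusion row API, and boxed closures are only a stand-in: a real JIT would go further and emit straight-line code for the schema.

```rust
use std::cmp::Ordering;

/// Hypothetical field types; each value occupies 8 little-endian bytes.
#[derive(Clone, Copy)]
enum FieldType {
    Int64,
    Float64,
}

/// Read the 8 bytes of field `idx` from a row of fixed-width fields.
fn read8(row: &[u8], idx: usize) -> [u8; 8] {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&row[idx * 8..idx * 8 + 8]);
    buf
}

/// Per-tuple dispatch: the `match` runs for every field of every row pair.
fn compare_dispatch(schema: &[FieldType], a: &[u8], b: &[u8]) -> Ordering {
    for (i, ty) in schema.iter().enumerate() {
        let (x, y) = (read8(a, i), read8(b, i));
        let ord = match ty {
            FieldType::Int64 => i64::from_le_bytes(x).cmp(&i64::from_le_bytes(y)),
            FieldType::Float64 => f64::from_le_bytes(x).total_cmp(&f64::from_le_bytes(y)),
        };
        if ord != Ordering::Equal {
            return ord;
        }
    }
    Ordering::Equal
}

type FieldCmp = Box<dyn Fn(&[u8], &[u8]) -> Ordering>;

/// Resolve the field types once per schema; per-row work is then a list of calls.
fn build_comparator(schema: &[FieldType]) -> impl Fn(&[u8], &[u8]) -> Ordering {
    let cmps: Vec<FieldCmp> = schema
        .iter()
        .enumerate()
        .map(|(i, ty)| -> FieldCmp {
            match ty {
                FieldType::Int64 => Box::new(move |a: &[u8], b: &[u8]| {
                    i64::from_le_bytes(read8(a, i)).cmp(&i64::from_le_bytes(read8(b, i)))
                }),
                FieldType::Float64 => Box::new(move |a: &[u8], b: &[u8]| {
                    f64::from_le_bytes(read8(a, i)).total_cmp(&f64::from_le_bytes(read8(b, i)))
                }),
            }
        })
        .collect();
    move |a, b| {
        for c in &cmps {
            let ord = c(a, b);
            if ord != Ordering::Equal {
                return ord;
            }
        }
        Ordering::Equal
    }
}
```

Both functions return the same ordering; the difference is only where the type decision is made.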

@yjshen added initial support for JIT'ing in #1849, and it currently lives in https://github.com/apache/arrow-datafusion/tree/master/datafusion/jit. He also added partial support for aggregates in #2375.

This ticket tracks the remaining work to fully support row formats, including JIT'ing.

Getters and setters

  • Avoid unnecessary branching in row read/write if schema is null-free #1891
  • 1. Support all types that ScalarValue supports
    • 1.1 all basic types
      • 1.1.1 Decimal
      • 1.1.2 Timestamp
      • 1.1.3 Date
      • 1.1.4 Interval
      • 1.1.5 Null
    • 1.2 composite types: List / Struct
  • 2. Make varlena offset + length a type parameter for reader and writer, for space efficiency
  • 3. Assert against the schema before getting a value. Think of Date64 as an example.
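As a rough illustration of the first item (skipping null handling when the schema is null-free), here is a minimal sketch. The `Row` type, its layout (a validity bitmap followed by 8-byte little-endian slots), and the `NULL_FREE` const parameter are all hypothetical and not the actual DataFusion row API; only `i64` fields are handled.

```rust
/// Hypothetical fixed-width row: [validity bitmap][8 bytes per i64 field].
/// `NULL_FREE` is a compile-time flag, so each instantiation is compiled
/// separately and the `if NULL_FREE` checks below are constant conditions.
struct Row<const NULL_FREE: bool> {
    data: Vec<u8>,
    null_width: usize,
}

impl<const NULL_FREE: bool> Row<NULL_FREE> {
    fn new(field_count: usize) -> Self {
        // A null-free schema needs no validity bitmap at all.
        let null_width = if NULL_FREE { 0 } else { (field_count + 7) / 8 };
        Self {
            data: vec![0; null_width + field_count * 8],
            null_width,
        }
    }

    fn set_i64(&mut self, idx: usize, value: i64) {
        if !NULL_FREE {
            // Mark the field as valid (non-null).
            self.data[idx / 8] |= 1 << (idx % 8);
        }
        let off = self.null_width + idx * 8;
        self.data[off..off + 8].copy_from_slice(&value.to_le_bytes());
    }

    fn get_i64(&self, idx: usize) -> Option<i64> {
        if !NULL_FREE && self.data[idx / 8] & (1 << (idx % 8)) == 0 {
            return None;
        }
        let off = self.null_width + idx * 8;
        let mut buf = [0u8; 8];
        buf.copy_from_slice(&self.data[off..off + 8]);
        Some(i64::from_le_bytes(buf))
    }
}
```

With `Row::<true>` the bitmap is neither allocated nor consulted, so a field that was never set reads back as `Some(0)` rather than `None`; that is the tradeoff of promising a null-free schema up front.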

Formats

  • 1. basics: Support Multiple row layout #2188
  • 2. Compact: write once, never update, Eq comparable
    • 2.1 all type supports
  • 3. WordAligned: update heavy on cells
    • 3.1 all basic type supports
    • 3.2 Varlena out-of-place store in memory, and inline/de-inline while serializing/deserializing
  • 4. RawComparable: best effort comparable based on raw bytes
    • 4.1 null-inline
    • 4.2 float bytes comparable
    • 4.3 comparator using best-effort &[u8] comparison, interleaved with field-by-field varlena comparison
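For items 4.1 and 4.2, one common way to make floats byte-comparable is to flip the sign bit of non-negative values and every bit of negative values, then store the result big-endian; a leading validity byte inlines the null. The sketch below is only an illustration of that technique: the function names are made up, and the choice that nulls sort first is an assumption rather than something this ticket specifies.

```rust
/// Encode an f64 so that unsigned, byte-wise comparison of the output
/// matches the numeric order of the input (with -0.0 ordered before 0.0).
fn encode_f64_comparable(v: f64) -> [u8; 8] {
    let bits = v.to_bits();
    // Negative values: flip all bits so larger magnitudes sort lower.
    // Non-negative values: flip only the sign bit so they sort above negatives.
    let mask = if bits >> 63 == 1 { u64::MAX } else { 1u64 << 63 };
    (bits ^ mask).to_be_bytes()
}

/// Null-inline encoding: one leading byte (0 = null, 1 = valid) followed by
/// the comparable value bytes. Assumes nulls sort before all values.
fn encode_nullable_f64(v: Option<f64>) -> [u8; 9] {
    let mut out = [0u8; 9];
    if let Some(x) = v {
        out[0] = 1;
        out[1..].copy_from_slice(&encode_f64_comparable(x));
    }
    out
}
```

Rows built from such fields can then be compared as plain `&[u8]` slices for the fixed-width prefix, falling back to field-by-field comparison only where variable-length data is involved.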

Hook into execution (mainly the pipeline-breakers)

Cleanups

  • Getter / setter / accessor consolidation, DRY

JIT

@yjshen yjshen added the enhancement New feature or request label Feb 18, 2022
@yjshen yjshen changed the title Row format follow-ups Row format Apr 10, 2022
@yjshen yjshen mentioned this issue May 8, 2022
3 tasks
@yjshen yjshen changed the title Row format [ROW] Missing features May 8, 2022
@yjshen yjshen changed the title [ROW] Missing features ROW Missing features May 8, 2022
@yjshen yjshen self-assigned this May 8, 2022
@yjshen yjshen added datafusion Changes in the datafusion crate development-process Related to development process of DataFusion performance Make DataFusion faster labels May 8, 2022
@alamb alamb changed the title ROW Missing features ROW Format Missing features May 8, 2022
@alamb alamb changed the title ROW Format Missing features ROW Format / JIT Missing features Jun 6, 2022
@alamb alamb changed the title ROW Format / JIT Missing features ROW Format (+JIT) Missing features Jun 6, 2022
@alamb alamb changed the title ROW Format (+JIT) Missing features Epic: Complete ROW Format (Missing features) Jun 12, 2022
@alamb
Contributor

alamb commented Jul 21, 2022

@iajoiner -- here is the main ticket that is tracking the row format progress. I think there are many PRs and other docs linked from here.

@alamb alamb changed the title Epic: Complete ROW Format (Missing features) [Epic]: Complete ROW Format (Missing features) Mar 5, 2023
@alamb
Contributor

alamb commented Sep 5, 2023

I think we have chosen to focus on the Arrow row format instead, and we removed the DataFusion row format in #6968.

@alamb alamb closed this as completed Sep 5, 2023