[Epic]: Complete ROW Format (Missing features) #1861

yjshen · 2022-02-18T02:24:17Z

Goal: a complete row implementation, fully used in pipeline breaker operators when possible.

Summary
TLDR: The key focus of this work is to speed up fundamentally row-oriented operations such as hash table lookups and comparisons (e.g. #2427).

Background

DataFusion, like many Arrow systems, is a classic "vectorized computation engine", which works quite well for many common operations. The following paper gives a good treatment of the tradeoffs between vectorization and JIT compilation of query plans: https://db.in.tum.de/~kersten/vectorization_vs_compilation.pdf?lang=de

As mentioned in the paper, there are some fundamentally "row oriented" operations in a database that are not typically amenable to vectorization. The "classics" are: Hash table updates in Joins and Hash Aggregates, as well as comparing tuples in sort.
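To make the sort case concrete, here is a minimal sketch (not DataFusion code) that contrasts comparing tuples held column-wise with comparing them as row-encoded byte keys. The two-column `(i32, u32)` schema and the `encode_key` helper are assumptions made purely for illustration.

```rust
use std::cmp::Ordering;

/// Compare tuple `a` with tuple `b` when the data is stored column-wise:
/// one lookup (and one typed comparison) per column.
fn compare_columnar(c0: &[i32], c1: &[u32], a: usize, b: usize) -> Ordering {
    c0[a].cmp(&c0[b]).then_with(|| c1[a].cmp(&c1[b]))
}

/// Hypothetical row encoding: big-endian bytes, with the sign bit of the
/// i32 flipped so that unsigned byte order matches signed numeric order.
fn encode_key(k0: i32, k1: u32) -> [u8; 8] {
    let mut out = [0u8; 8];
    out[..4].copy_from_slice(&((k0 as u32) ^ 0x8000_0000).to_be_bytes());
    out[4..].copy_from_slice(&k1.to_be_bytes());
    out
}

/// Sort row indices using the per-column comparator.
fn sort_indices_columnar(c0: &[i32], c1: &[u32]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..c0.len()).collect();
    idx.sort_by(|&a, &b| compare_columnar(c0, c1, a, b));
    idx
}

/// Sort row indices by comparing the encoded rows as plain byte strings.
fn sort_indices_rows(c0: &[i32], c1: &[u32]) -> Vec<usize> {
    let keys: Vec<[u8; 8]> = c0
        .iter()
        .zip(c1)
        .map(|(&k0, &k1)| encode_key(k0, k1))
        .collect();
    let mut idx: Vec<usize> = (0..keys.len()).collect();
    idx.sort_by(|&a, &b| keys[a].cmp(&keys[b]));
    idx
}
```

Both functions produce the same ordering; the row version pays the encoding cost once per tuple and then compares bytes without looking at column types again.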

When operating on a row-based format, the per-tuple type dispatch overhead becomes quite important, so such operations are typically implemented using just-in-time (JIT) compilation or other unsafe mechanisms to minimize the overhead.
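A rough illustration of where that overhead comes from, assuming a simplified fixed-width row layout (the `FieldType` enum, `read_array` helper, and `sum_interpreted` function are hypothetical and are not the datafusion-row API): an interpreted reader has to branch on the field type for every field of every row.

```rust
/// Simplified field types; a real engine supports many more.
#[derive(Clone, Copy)]
enum FieldType {
    Int32,
    Int64,
    Float64,
}

impl FieldType {
    /// Width in bytes of one value of this type in the assumed layout.
    fn width(self) -> usize {
        match self {
            FieldType::Int32 => 4,
            FieldType::Int64 | FieldType::Float64 => 8,
        }
    }
}

/// Copy `N` bytes starting at `offset` out of a row.
fn read_array<const N: usize>(row: &[u8], offset: usize) -> [u8; N] {
    let mut buf = [0u8; N];
    buf.copy_from_slice(&row[offset..offset + N]);
    buf
}

/// Interpreted reader: sums every field of every row as f64.
/// The `match` on the field type runs once per field *per row*,
/// which is the per-tuple dispatch cost discussed above.
fn sum_interpreted(schema: &[FieldType], rows: &[u8], row_width: usize) -> f64 {
    let mut total = 0.0;
    for row in rows.chunks_exact(row_width) {
        let mut offset = 0;
        for ty in schema {
            total += match ty {
                FieldType::Int32 => i32::from_le_bytes(read_array(row, offset)) as f64,
                FieldType::Int64 => i64::from_le_bytes(read_array(row, offset)) as f64,
                FieldType::Float64 => f64::from_le_bytes(read_array(row, offset)),
            };
            offset += ty.width();
        }
    }
    total
}
```

With a columnar layout the same type check would happen once per column batch; here it happens rows × fields times, which is what JIT compilation aims to eliminate.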

@yjshen added initial support for JIT'ing in #1849 and it currently lives in https://github.com/apache/arrow-datafusion/tree/master/datafusion/jit. He also added partial support for aggregates in #2375.

This ticket tracks the remaining work to fully support row formats, including JIT'ing.

Getters and setters

Formats

Hook into execution (mainly the pipeline-breakers)

Cleanups

Getter / setter / accessor consolidation, DRY

JIT

Basics: JIT the tuple field get/set using the schema, to avoid branching for each field in each row. (Try to fix in Introduce JIT code generation #1849)
TBD
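As a sketch of the idea only: the snippet below resolves each field's type and offset once per schema and stores a specialized reader per field, so the per-row loop no longer matches on the type. It uses boxed closures as a stand-in; a real JIT would emit machine code instead of closures. `FieldType`, `FieldReader`, `compile_readers`, and `sum_rows` are hypothetical names, not part of the datafusion-jit or datafusion-row crates.

```rust
/// Simplified field types for the assumed fixed-width row layout.
#[derive(Clone, Copy)]
enum FieldType {
    Int32,
    Int64,
}

/// A reader specialized for one field: type and offset are already resolved.
type FieldReader = Box<dyn Fn(&[u8]) -> i64>;

/// Copy `N` bytes starting at `offset` out of a row.
fn read_array<const N: usize>(row: &[u8], offset: usize) -> [u8; N] {
    let mut buf = [0u8; N];
    buf.copy_from_slice(&row[offset..offset + N]);
    buf
}

/// Build one reader per field, once per schema.
/// Returns the readers together with the total row width in bytes.
fn compile_readers(schema: &[FieldType]) -> (Vec<FieldReader>, usize) {
    let mut readers: Vec<FieldReader> = Vec::with_capacity(schema.len());
    let mut offset = 0;
    for ty in schema {
        let at = offset; // captured by value in the closure below
        match ty {
            FieldType::Int32 => {
                readers.push(Box::new(move |row: &[u8]| {
                    i32::from_le_bytes(read_array(row, at)) as i64
                }));
                offset += 4;
            }
            FieldType::Int64 => {
                readers.push(Box::new(move |row: &[u8]| {
                    i64::from_le_bytes(read_array(row, at))
                }));
                offset += 8;
            }
        }
    }
    (readers, offset)
}

/// Sum all fields of all rows. The type dispatch happened in
/// `compile_readers`; the hot loop only calls the prebuilt readers.
/// Assumes a non-empty schema (a zero row width would panic in `chunks_exact`).
fn sum_rows(schema: &[FieldType], rows: &[u8]) -> i64 {
    let (readers, row_width) = compile_readers(schema);
    rows.chunks_exact(row_width)
        .map(|row| readers.iter().map(|read| read(row)).sum::<i64>())
        .sum()
}
```

The closures still cost an indirect call per field, so this only approximates the benefit; the point is that the schema is inspected once rather than once per row.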


alamb · 2022-07-21T15:39:10Z

@iajoiner -- here is the main ticket that is tracking the row format progress. I think there are many PRs and other docs linked from here.

alamb · 2023-09-05T17:33:54Z

I think we have chosen to focus on the arrow row format instead, and we removed the datafusion row format in #6968

yjshen added the enhancement (New feature or request) label Feb 18, 2022

yjshen mentioned this issue Feb 18, 2022

Introduce Row format backed by raw bytes #1782

Merged

yjshen mentioned this issue Feb 27, 2022

Avoid unnecessary branching in row read/write if schema is null-free #1891

Merged

yjshen changed the title ~~Row format follow-ups~~ Row format Apr 10, 2022

yjshen mentioned this issue May 8, 2022

Support Multiple row layout #2188

Closed

3 tasks

yjshen changed the title ~~Row format~~ [ROW] Missing features May 8, 2022

yjshen changed the title ~~[ROW] Missing features~~ ROW Missing features May 8, 2022

yjshen self-assigned this May 8, 2022

yjshen added the datafusion (Changes in the datafusion crate), development-process (Related to development process of DataFusion), and performance (Make DataFusion faster) labels May 8, 2022

yjshen mentioned this issue May 8, 2022

Grouped Aggregate in row format #2375

Merged

alamb changed the title ~~ROW Missing features~~ ROW Format Missing features May 8, 2022

alamb changed the title ~~ROW Format Missing features~~ ROW Format / JIT Missing features Jun 6, 2022

alamb mentioned this issue Jun 6, 2022

[EPIC] JIT support for DataFusion #2703

Closed

2 tasks

alamb changed the title ~~ROW Format / JIT Missing features~~ ROW Format (+JIT) Missing features Jun 6, 2022

alamb mentioned this issue Jun 6, 2022

[MINOR]: Add documentation to datafusion-row modules #2704

Merged

alamb changed the title ~~ROW Format (+JIT) Missing features~~ Epic: Complete ROW Format (Missing features) Jun 12, 2022

alamb mentioned this issue Jun 12, 2022

Consolidate GroupByHash implementations row_hash.rs and hash.rs (remove duplication) #2723

Closed

alamb changed the title ~~Epic: Complete ROW Format (Missing features)~~ [Epic]: Complete ROW Format (Missing features) Mar 5, 2023

alamb closed this as completed Sep 5, 2023
