Update quarterly roadmap for Q2 #2133

Merged 4 commits on Apr 4, 2022

matthewmturner
Contributor

Which issue does this PR close?

Closes #1971

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

Contributor

hntd187 commented Apr 1, 2022

Ideally, I'd like to finalize the implementation of the streaming API and make an experimental implementation available via datafusion-streams, which basically requires me to finalize the API contract.

Contributor

jychen7 commented Apr 1, 2022

LGTM

Member

andygrove left a comment

LGTM

- Add more operators for memory limited execution
- Performance
- Incorporate row-format into operators such as aggregate
- Add row-format benchmarks
Contributor

Suggested change
- Add row-format benchmarks
- Add row-format benchmarks
- Explore JIT-compiling complex expressions

Contributor

alamb left a comment

Thanks @matthewmturner

This is going to be an exciting Q2!


### DataFusion Core

- Publish official Arrow2 branch
- Implementation of memory manager (i.e. to enable spilling to disk as needed)
- IO Improvements
Contributor

FYI @tustvold

Contributor

tustvold, Apr 2, 2022

Not entirely sure what this specifically is referring to, but I definitely intend to focus on improving the IO and scheduling stories in arrow-rs and DataFusion. See apache/arrow-rs#1473 and #2079. Not sure if we want to explicitly call out the scheduling side of this.

I may also get to proper filter pushdown to parquet if I have time - apache/arrow-rs#1191

Edit: I've proposed a change with a very high-level statement of what I hope to achieve w.r.t. scheduling.

Contributor Author

Thanks @tustvold. I plan on finishing the work summarized in #1777, which is what that item refers to.

- Incorporate row-format into operators such as aggregate
- Add row-format benchmarks
- Explore LLVM for JIT, with inline Rust functions as the primary goal
- Documentation
Contributor

Suggested change
- Documentation
- Improve performance of Sort and Merge using Row Format / JIT expressions
- Documentation

Contributor

I hope to contribute improvements to the Sort performance (especially for multi-column sorts that include strings) this quarter as well. I don't have any writeup of that yet

- IO Improvements
- Reading, registering, and writing more file formats from both DataFrame API and SQL
- Additional options for IO including partitioning and metadata support
- Memory Management
Contributor

tustvold, Apr 2, 2022

Suggested change
- Memory Management
- Work Scheduling
- Improve predictability, observability and performance of IO and CPU-bound work
- Develop a more explicit story for managing parallelism during plan execution
- Memory Management

I've yet to create a ticket for this, as I'm still exploring the problem domain, but the precursor discussions can be found in apache/arrow-rs#1473 and #2079.

Contributor Author

matthewmturner commented Apr 2, 2022

Thank you @alamb, @Dandandan, and @tustvold for the suggestions. I will get them added shortly.

Contributor

alamb commented Apr 4, 2022

Merging; we can keep iterating and updating in follow-on PRs if needed.

Thanks again @matthewmturner

alamb merged commit f99c271 into apache:master Apr 4, 2022
Contributor

alamb commented May 3, 2022

> I hope to contribute improvements to the Sort performance (especially for multi-column sorts that include strings) this quarter as well. I don't have any writeup of that yet

I filed #2427 with some of my thoughts

Linked issue (may be closed by merging this pull request): Prepare to update roadmap for Q2