Update quarterly roadmap for Q2 (#2133)
* Update roadmap

* IO options comment

* Add streams

* Update with feedback
matthewmturner authored Apr 4, 2022
1 parent 5ae3434 commit f99c271
Showing 1 changed file with 51 additions and 33 deletions.
84 changes: 51 additions & 33 deletions docs/source/specification/quarterly_roadmap.md
@@ -21,52 +21,70 @@

A quarterly roadmap will be published to give the DataFusion community visibility into the priorities of the project's contributors. This roadmap is not binding.

## 2022 Q1
## 2022 Q2

### DataFusion Core

- Publish official Arrow2 branch
- Implementation of memory manager (i.e. to enable spilling to disk as needed)
- IO Improvements
- Reading, registering, and writing more file formats from both DataFrame API and SQL
- Additional options for IO including partitioning and metadata support
- Work Scheduling
- Improve predictability, observability and performance of IO and CPU-bound work
- Develop a more explicit story for managing parallelism during plan execution
- Memory Management
- Add more operators for memory limited execution
- Performance
- Incorporate row-format into operators such as aggregate
- Add row-format benchmarks
- Explore JIT-compiling complex expressions
- Explore LLVM for JIT, with inline Rust functions as the primary goal
- Improve performance of Sort and Merge using Row Format / JIT expressions
- Documentation
- General improvements to DataFusion website
- Publish design documents
- Streaming
- Create `StreamProvider` trait

### Benchmarking
### Ballista

- Inclusion in Db-Benchmark with all queries covered
- All TPCH queries covered
- Make production ready
- Shuffle file cleanup
- Fill functional gaps between DataFusion and Ballista
- Improve task scheduling and data exchange efficiency
- Better error handling
- Task failure
- Executor lost
- Scheduler restart
- Improve monitoring and logging
- Auto scaling support
- Support for multi-scheduler deployments. Initially for resiliency and fault tolerance but ultimately to support sharding for scalability and more efficient caching.
- Executor deployment grouping based on resource allocation

### Performance Improvements
### Extensions ([datafusion-contrib](https://github.com/datafusion-contrib))

- Predicate evaluation
- Improve multi-column comparisons (that can't be vectorized at the moment)
- Null constant support
#### [DataFusion-Python](https://github.com/datafusion-contrib/datafusion-python)

### New Features
- Add missing functionality to DataFrame and SessionContext
- Improve documentation

- Read JSON as table
- Simplify DDL with DataFusion-Cli
- Add Decimal128 data type and the attendant features such as Arrow Kernel and UDF support
- Add new experimental e-graph based optimizer
#### [DataFusion-S3](https://github.com/datafusion-contrib/datafusion-objectstore-s3)

### Ballista
- Create Python bindings to use with datafusion-python

- Begin work on design documents and plan / priorities for development

### Extensions ([datafusion-contrib](https://github.com/datafusion-contrib))
#### [DataFusion-Tui](https://github.com/datafusion-contrib/datafusion-tui)

- Stable S3 support
- Begin design discussions and prototyping of a stream provider
- Create multiple SQL editors
- Expose more Context and query metadata
- Support new data sources
- BigTable, HDFS, HTTP APIs

## Beyond 2022 Q1
#### [DataFusion-BigTable](https://github.com/datafusion-contrib/datafusion-bigtable)

There is no clear timeline for the below, but community members have expressed interest in working on these topics.
- Python binding to use with datafusion-python
- Timestamp range predicate pushdown
- Multi-threaded partition aware execution
- Production ready Rust SDK

### DataFusion Core

- Custom SQL support
- Split DataFusion into multiple crates
- Push based query execution and code generation

### Ballista
#### [DataFusion-Streams](https://github.com/datafusion-contrib/datafusion-streams)

- Evolve architecture so that it can be deployed in a multi-tenant cloud native environment
- Ensure Ballista is scalable, elastic, and stable for production usage
- Develop distributed ML capabilities
- Create experimental implementation of `StreamProvider` trait
