Open
Labels
EPIC, enhancement
Description
Is your feature request related to a problem or challenge?
Many DataFusion users are using DataFusion to execute workloads originally developed for Apache Spark. Examples include:
- DataFusion Comet (@andygrove @comphead , etc)
- LakeHQ / Sail (@shehabgamin )
- Various internal pipelines / engines (e.g. that @Omega359 and I think @Blizzara use)
They often do this for superior performance
- Part of running Spark workloads is emulating Spark semantics
- Emulating Spark semantics requires (among other things) functions compatible with Spark (which differ in semantics from the functions included in DataFusion)
Several projects are in the process of implementing Spark compatible function libraries using DataFusion's extension APIs. However, we concluded in #5600 that we could join forces and maintain a Spark compatible function library in the core DataFusion repo. @shehabgamin has implemented the first step in #15168 🙏
Describe the solution you'd like
This ticket tracks "completing" the spark function library started in #15168
Describe alternatives you've considered
datetime functions:
- [datafusion-spark] Implement Spark `datetime` function `last_day` #16774
- [datafusion-spark] Implement Spark `date` function `next_day` #16775
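For context on what such a function has to compute: Spark's `last_day(date)` returns the last day of the month that the given date falls in. Below is a minimal, std-only Rust sketch of that calendar rule. The helper names `is_leap_year` and `last_day` and the `(year, month, day)` tuple representation are assumptions made for illustration. This is not the `datafusion-spark` implementation or API.

```rust
/// Returns true if `year` is a leap year in the proleptic Gregorian calendar.
fn is_leap_year(year: i32) -> bool {
    (year % 4 == 0 && year % 100 != 0) || year % 400 == 0
}

/// Hypothetical helper (not part of DataFusion): given a calendar date,
/// return the last day of the month containing it as (year, month, day).
/// Returns `None` when the input is not a valid calendar date.
fn last_day(year: i32, month: u32, day: u32) -> Option<(i32, u32, u32)> {
    let days_in_month = match month {
        1 | 3 | 5 | 7 | 8 | 10 | 12 => 31,
        4 | 6 | 9 | 11 => 30,
        2 if is_leap_year(year) => 29,
        2 => 28,
        _ => return None, // month outside 1..=12
    };
    // Reject days that do not exist in this month (e.g. April 31).
    if day == 0 || day > days_in_month {
        return None;
    }
    Some((year, month, days_in_month))
}
```

Calling `last_day(2024, 2, 10)` yields `Some((2024, 2, 29))` because 2024 is a leap year, while an invalid input such as `last_day(2025, 4, 31)` yields `None`. How invalid or null inputs should behave in the real function is a Spark-compatibility question that this sketch does not settle.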
string functions:
math functions:
set functions:
- [datafusion-spark]: Implement collect_set #17924
- [datafusion-spark]: Implement collect_list/array_agg #17923
map/array functions:
Infrastructure and Testing:
- [DISCUSSION] Add separate crate to cover spark builtin functions #5600
- feat: Add `datafusion-spark` crate #15168
- [datafusion-spark] Example of using Spark compatible function library #15915
- feat: Support test spark runner in `datafusion-spark` for slt tests #17045
- SparkDateAdd does not check for overflow #17987
- Deduplicate Spark function code with native/default datafusion function code #17964
Related issues
- [datafusion-spark] Test integrating datafusion-spark code into comet datafusion-comet#1704
- [EPIC] Implement expressions as ScalarUDFImpl datafusion-comet#1819
- Spark-compatible CAST operation #11201
- SparkSha2 is not compliant with Spark and does not support Int32 type #16336
- Add xxhash algorithms in SQL and expression api #14367
- [datafusion-spark] [SQL] [TEST] IntervalMonthDayNano(0,0,0) give line blank #17455
Additional context
No response