Keynote presentation for SiMoD workshop at SIGMOD 2024 #10481

Closed
Tracked by #10779
alamb opened this issue May 13, 2024 · 3 comments

alamb commented May 13, 2024

I am giving an invited keynote talk at a workshop colocated with SIGMOD 2024 on Friday Jun 14, 2024 (after the main conference).

I need to prepare slides for this and figured people in the DataFusion community might be interested.

DataFusion: The Case for Building Data Systems using Open Standards:

Abstract: Andrew will discuss engineering tradeoffs made when building Apache DataFusion, an open source and extensible query engine used as the basis of many commercial and open source projects. These decisions (mostly) favored simplicity and worked better than initially expected. He will cover the rationale for which parts of DataFusion use pre-existing standards such as Arrow and Parquet, and which parts are built “from scratch” such as vectorized hashing and normalized sort keys. He will also discuss DataFusion’s design philosophy of extensible APIs paired with simple default implementations. Finally, he will offer lessons learned and enumerate some things that worked well and what could have been improved.

alamb added the documentation label May 13, 2024

alamb commented May 13, 2024

Here are some notes I have on what I want to talk about:

interfaces and then paradoxically allowed us to narrow the scope of potential optimizations (e.g. compute kernels) and have people focus on different areas.

Things we didn't implement:

  • Our own file formats (instead focused on Parquet, Avro, Arrow, JSON, CSV)
  • Our own memory format: Arrow is used not just externally but internally
  • Our own thread pool: used the standard one (tokio) instead
  • Morsel-driven parallelism: used pull / exchange instead
  • A buffer pool: used standard I/O instead
  • Latest / greatest window aggregates fanciness (TODO: get paper link)

Providing simple built-in defaults, but hooks for more specialized implementations
Keeps DF simple, allows

  • Catalog
  • memory / disk manager
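
As a rough illustration of the "simple built-in default plus hooks" idea, here is a minimal Python sketch. It is not DataFusion code: `MemoryManager`, `UnboundedMemoryManager`, `LimitedMemoryManager`, and `Engine` are hypothetical names invented for this example. The toy engine works out of the box with a trivial default, and a caller can swap in a stricter implementation.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class MemoryManager(ABC):
    """Hypothetical extension point: the engine asks before allocating."""

    @abstractmethod
    def try_reserve(self, nbytes: int) -> bool:
        """Return True if the engine may use `nbytes` more memory."""


class UnboundedMemoryManager(MemoryManager):
    """Simple built-in default: never refuses a reservation."""

    def try_reserve(self, nbytes: int) -> bool:
        return True


class LimitedMemoryManager(MemoryManager):
    """A more specialized implementation that a user could plug in."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    def try_reserve(self, nbytes: int) -> bool:
        if self.used + nbytes > self.limit:
            return False
        self.used += nbytes
        return True


class Engine:
    """Toy engine: usable with no configuration, but accepts a custom manager."""

    def __init__(self, memory: MemoryManager | None = None) -> None:
        self.memory = memory if memory is not None else UnboundedMemoryManager()

    def sort(self, rows: list[int], bytes_per_row: int = 8) -> list[int]:
        if not self.memory.try_reserve(len(rows) * bytes_per_row):
            # A real engine might spill to disk here instead of failing.
            raise MemoryError("reservation refused")
        return sorted(rows)
```

`Engine()` uses the default, while `Engine(LimitedMemoryManager(limit=16))` enforces a budget without any change to the engine itself.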

Things we did: places we spent time and complexity

  • normalized keys / row format
  • optimizing parquet reader
  • optimizing hashing
  • plan representation (logical plans, exprs, etc)
  • function library
  • ListingTable (maybe this should have been more
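
To make "normalized keys" concrete, here is a minimal Python sketch of the general idea: encode each multi-column row as a single byte string whose plain bytewise order matches the column-by-column order, so a sort can compare raw bytes. This is an illustration only, not DataFusion's actual row encoding. The helper names are hypothetical, and the sketch assumes 64-bit signed integers, no nulls, and strings without NUL bytes.

```python
from __future__ import annotations

import struct


def encode_i64(value: int) -> bytes:
    # Adding 2**63 flips the sign bit, so negatives sort before positives.
    # Big-endian puts the most significant byte first for bytewise compare.
    return struct.pack(">Q", value + (1 << 63))


def encode_str(value: str) -> bytes:
    raw = value.encode("utf-8")
    if b"\x00" in raw:
        raise ValueError("this sketch assumes strings without NUL bytes")
    # The terminator keeps a short string ordered before its extensions.
    return raw + b"\x00"


def normalized_key(row: tuple[int, str]) -> bytes:
    number, text = row
    return encode_i64(number) + encode_str(text)


rows = [(2, "b"), (-1, "z"), (2, "a"), (2, "ab"), (-7, "")]

# Sorting by the byte keys gives the same order as comparing the tuples.
by_key = sorted(rows, key=normalized_key)
```

The appeal of this kind of encoding is that one byte comparison replaces a per-column, per-type comparison in the inner loop of a sort.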

Things I would do differently next time:

  • Keep listing table out of the core
  • UDFs from the start

alamb commented Jun 10, 2024

Here is the presentation. I will post it more broadly once I have worked on it a bit more.

https://docs.google.com/presentation/d/1K3EdknzkqU2LhWi_eNKXdcvNk0OEvk9AqTLqhZkPxuI/edit#slide=id.p

alamb commented Jun 14, 2024

It's done! I'll try to record this talk at some point and post it on http://andrew.nerdnetworks.org/
