Vision for datafusion-cli #1096
@jimexist did a lot of work around datafusion-cli. From what I could tell, datafusion tries to be as close to postgres as possible, so it makes sense to me for us to support/match psql commands as well.
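To make the idea of matching psql commands concrete, here is a minimal sketch of how a REPL could separate psql-style backslash commands from plain SQL. Everything in it (`MetaCommand`, `Input`, `parse_line`, and the set of commands handled) is hypothetical and uses only the Rust standard library; it is not how datafusion-cli is actually implemented.

```rust
/// Hypothetical meta-commands, loosely modelled on psql's `\d`, `\copy` and `\q`.
#[derive(Debug, PartialEq)]
enum MetaCommand {
    ListTables,
    DescribeTable(String),
    Copy(String),
    Quit,
    Unknown(String),
}

/// One line of REPL input: either a backslash command or SQL text.
#[derive(Debug, PartialEq)]
enum Input {
    Meta(MetaCommand),
    Sql(String),
}

/// Classify a line: anything starting with a backslash is treated as a
/// meta-command, everything else is passed through as SQL.
fn parse_line(line: &str) -> Input {
    let trimmed = line.trim();
    let rest = match trimmed.strip_prefix('\\') {
        Some(rest) => rest,
        None => return Input::Sql(trimmed.to_string()),
    };
    // Split the command name from its (optional) arguments.
    let (name, args) = match rest.split_once(char::is_whitespace) {
        Some((name, args)) => (name, args.trim()),
        None => (rest, ""),
    };
    let cmd = match (name, args) {
        ("d", "") => MetaCommand::ListTables,
        ("d", table) => MetaCommand::DescribeTable(table.to_string()),
        ("copy", spec) if !spec.is_empty() => MetaCommand::Copy(spec.to_string()),
        ("q", "") => MetaCommand::Quit,
        _ => MetaCommand::Unknown(rest.to_string()),
    };
    Input::Meta(cmd)
}

fn main() {
    for line in ["\\d", "\\d orders", "\\copy orders to 'orders.csv'", "SELECT 1;"] {
        println!("{:?}", parse_line(line));
    }
}
```

A real shell would go on to execute each variant (for example, running a catalog query for `ListTables`); the sketch stops at classification, since that is the part a psql-compatible command surface would need to agree on first.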
Thanks for raising this issue and question @rupurt. First off, I don't believe there's any centralized view on what datafusion-cli is or isn't, and there's no process yet to determine that. However, I can share some of my thoughts here.
What datafusion-cli is not
Our ecosystem is full of different tools to manipulate data, each fitting its own niche purpose. In my opinion, datafusion-cli should not try to be yet another general-purpose tool to just manipulate data, especially since datafusion itself is intended to be an embeddable component for other tools (e.g. cube, ballista, roapi). This is to avoid confusion and reduce fragmented tech investment. Specifically, datafusion-cli is not:
Also for 3. please note that (AFAIK) datafusion and datafusion-cli do not themselves deal with distributed computing, i.e. data sharding is something built on top of them; they can only do in-memory, uniformly accessible data manipulation.
What datafusion-cli can be
Given the above assertions, I believe the place where datafusion-cli can shine is:
I think in many ETL cases you do encounter this type of need for a “glue” script that can run from a simple shell script, triggered manually, via Airflow, or crontab, etc. It runs to the end without any setup needed, takes in one or more parquet files and an SQL script, and spits out target data in JSON, CSV, or maybe Parquet.
It can also be a REPL, an interactive shell to quickly verify commands and look into data.
Considerations on API
I think there are two aspects of API to this CLI:
For 2. it's clear that Postgres-compatible SQL is the choice here, but for 1. I don't think we are necessarily bound to the peculiarity of
Having said that, I do think commands like
What comes next
Although I haven't been working on this area recently, the so-called "roadmap" that I have in mind would be:
FWIW #1102 is related to the larger question of "datafusion roadmap"
Added my translation in 65910be -- additional thoughts welcome
I think this one is now done, so closing ticket -- would love to have your help @rupurt!
related Homebrew/homebrew-core#88184
Howdy,
I'm so pumped I found this project! Very excited to use more of it to improve the speed of my ETL pipelines. Thank you to the maintainers :)
Are there any resources for the vision of each component? Specifically it would be helpful to understand the vision for the datafusion-cli. In my head it feels like it would be natural to replicate as much psql functionality as possible but would like to understand your frame of mind as maintainers.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Understand the vision of datafusion-cli so I can make useful contributions and improvements e.g. import from CSV and export to parquet via datafusion-cli
Describe the solution you'd like
Support for common psql commands like \copy, \d etc... instead of writing a custom rust program.
Describe alternatives you've considered
Additional context