Vision for datafusion-cli #1096

Closed
rupurt opened this issue Oct 9, 2021 · 7 comments
Labels
enhancement New feature or request question Further information is requested

Comments


rupurt commented Oct 9, 2021

Howdy,

I'm so pumped I found this project! Very excited to use more of it to improve the speed of my ETL pipelines. Thank you to the maintainers :)

Are there any resources for the vision of each component? Specifically it would be helpful to understand the vision for the datafusion-cli. In my head it feels like it would be natural to replicate as much psql functionality as possible but would like to understand your frame of mind as maintainers.

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Understand the vision of datafusion-cli so I can make useful contributions and improvements e.g. import from CSV and export to parquet via datafusion-cli

Describe the solution you'd like
Support for common psql commands like \copy, \d etc... instead of writing a custom rust program.

Describe alternatives you've considered

  • none

Additional context

  • none
@rupurt rupurt added the enhancement New feature or request label Oct 9, 2021
Member

houqp commented Oct 10, 2021

@jimexist did a lot of work around datafusion-cli. From what I can tell, datafusion tries to be as close to Postgres as possible, so it makes sense to me for us to support/match psql commands as well.

Member

jimexist commented Oct 10, 2021

Thanks for raising this issue and question @rupurt.

First off, I don't believe there's any centralized view on what datafusion-cli is or isn't, and there's no process yet to determine that. However, I can share some of my thoughts here.

What datafusion-cli is not

Our ecosystem is full of tools for manipulating data, each fitting its own niche. In my opinion, datafusion-cli should not try to be yet another general-purpose data manipulation tool, both to avoid confusion and to avoid fragmenting tech investment, especially since datafusion itself is intended to be an embeddable component for other tools (e.g. cube, ballista, roapi).

Specifically, datafusion-cli is not:

  1. a general-purpose, Python-enabled tool to query data; for that you have the Python bindings for datafusion itself, or Polars for its speed and pandas compatibility, both built on Arrow
  2. a command line tool to manipulate small tabular or structured data; for that you have xsv, jq, or rq, depending on the file format involved
  3. a client to an HTTP- or GraphQL-enabled server backend; for that you can use roapi or similar tools, or in many cases Spark or Presto is just fine (when the data size is large)

Also for 3., please note that (AFAIK) datafusion and datafusion-cli do not concern themselves with distributed computing, i.e. data sharding is something built on top of them; on their own they can only do in-memory manipulation of uniformly accessible data.

What datafusion-cli can be

Given the above assertions, I believe datafusion-cli can shine when:

  1. the data is large enough that simple tools like jq or xsv can't cut it (within a reasonable amount of time), but still small enough to fit into memory (EC2 machines offer up to 12TB), if you do care about the speed
  2. there's no need to keep a long-running server or adopt the full stack of a Spark or Presto cluster, given their high maintenance cost

I think in many ETL cases you do encounter this type of need for a “glue” script that can be run from a simple shell script, triggered manually, via Airflow, via crontab, etc. It runs to completion without any setup, takes in one or more Parquet files and an SQL script, and spits out the target data as JSON, CSV, or maybe Parquet.

It can also be a REPL, an interactive shell to quickly verify commands and look into data.

Considerations on API

I think there are two aspects of API to this CLI:

  1. the command line API (flags, options, naming, etc.)
  2. the query syntax itself

For 2. it's clear that Postgres-compatible SQL is the choice here, but for 1. I don't think we are necessarily bound to the peculiarity of psql itself because:

  1. it's in many cases specific to Postgres and the fact that it's a client-server architecture
  2. psql is mostly used interactively; for shell automation Postgres ships all sorts of separate programs (e.g. pg_ctl, pg_dumpall), whereas here all of those roles map to datafusion-cli

Having said that, I do think commands like \copy are already familiar to users and nice to have, but I admit that in many cases I am still confused by the different flags and behaviors when that command is used interactively versus as a CLI option. We need to be more consistent here, though not necessarily consistent with psql itself.

What comes next

Although I haven't been working on this area recently, the so-called "roadmap" that I have in mind would be:

  1. to build a fully abstracted layer of REPL parsing so that queries and commands are separated and handled correctly; currently it's kind of a hack (e.g. SIGINT isn't properly handled)
  2. hook up the stats subsystem and have the cli print out more stats for query debugging, etc.
  3. better error handling for interactive use and shell scripting usage
  4. to widen the usage by publishing to apt, brew, and possibly the NuGet registry so that people can start using it more
  5. maybe adopt a shorter name, like dfcli?
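
To make item 1 a bit more concrete, here is a minimal sketch of what separating commands from queries could look like. It is not the actual datafusion-cli implementation: `ReplInput` and `parse_line` are hypothetical names, and the backslash convention is simply borrowed from psql as an assumption.

```rust
/// One line of REPL input, classified before anything is executed.
/// (Hypothetical type for illustration; not part of datafusion-cli.)
#[derive(Debug, PartialEq)]
enum ReplInput {
    /// A backslash meta-command such as `\d` or `\copy ...`.
    Command { name: String, args: Vec<String> },
    /// Anything else is handed to the SQL engine unchanged.
    Query(String),
    /// Blank input; nothing to do.
    Empty,
}

/// Classify a raw input line as a meta-command, a SQL query, or empty.
fn parse_line(line: &str) -> ReplInput {
    let trimmed = line.trim();
    if trimmed.is_empty() {
        return ReplInput::Empty;
    }
    // Assumption: meta-commands start with a backslash, as in psql.
    if let Some(rest) = trimmed.strip_prefix('\\') {
        let mut parts = rest.split_whitespace();
        let name = parts.next().unwrap_or("").to_string();
        let args = parts.map(str::to_string).collect();
        return ReplInput::Command { name, args };
    }
    ReplInput::Query(trimmed.to_string())
}

fn main() {
    // Each line is classified once; dispatch can then match on the enum.
    for line in ["\\d", "\\copy t to out.csv", "SELECT 1;", "   "] {
        println!("{:?}", parse_line(line));
    }
}
```

With the classification isolated like this, the interactive shell and a script-driven mode could share one dispatcher, which would also help with the consistency concern raised above. Multi-line queries and signal handling are deliberately left out of the sketch.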

Contributor

alamb commented Oct 11, 2021

FWIW #1102 is related to the larger question of "datafusion roadmap"

@alamb alamb added the question Further information is requested label Oct 11, 2021
Member

houqp commented Oct 12, 2021

I think we should PR @jimexist's well-written comment into #1104, or add it to the roadmap.md file after #1104 is merged :)

Contributor

alamb commented Oct 12, 2021

Added my translation in 65910be -- additional thoughts welcome

Contributor

alamb commented Oct 26, 2021

I think this one is now done, so closing ticket -- would love to have your help @rupurt !

@alamb alamb closed this as completed Oct 26, 2021
@jimexist
Member

related Homebrew/homebrew-core#88184
