Vision for datafusion-cli #1096

rupurt · 2021-10-09T22:38:31Z

Howdy,

I'm so pumped I found this project! Very excited to use more of it to improve the speed of my ETL pipelines. Thank you to the maintainers :)

Are there any resources for the vision of each component? Specifically it would be helpful to understand the vision for the datafusion-cli. In my head it feels like it would be natural to replicate as much psql functionality as possible but would like to understand your frame of mind as maintainers.

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Understand the vision of datafusion-cli so I can make useful contributions and improvements e.g. import from CSV and export to parquet via datafusion-cli

Describe the solution you'd like
Support for common psql commands like \copy, \d, etc. instead of writing a custom Rust program.

Describe alternatives you've considered

none

Additional context

none

houqp · 2021-10-10T05:59:37Z

@jimexist did a lot of work around datafusion-cli. From what I could tell, datafusion tries to be as close to postgres as possible, so it makes sense to me for us to support/match psql commands as well.

jimexist · 2021-10-10T06:56:55Z

Thanks for raising this issue and question @rupurt.

First off, I don't believe there's any centralized view on what datafusion-cli is or isn't, and there's no process yet to determine that. However, I can share some of my thoughts here.

What datafusion-cli is not

Our ecosystem is full of different tools to manipulate data, each fitting its own niche purpose. In my opinion, to avoid confusion and fragmented tech investment, datafusion-cli should not try to be yet another general-purpose tool for manipulating data, especially since datafusion itself is intended to be an embeddable component for other tools (e.g. cube, ballista, roapi).

Specifically, datafusion-cli is not:

1. a general-purpose, Python-enabled tool to query data; for that you have the Python binding for datafusion itself, or polars for its speed and pandas compatibility, both leveraging Arrow and datafusion underneath
2. a command line tool to manipulate small, tabular or structured data; for that you have xsv, jq, or rq, depending on the file formats involved
3. a client to an HTTP or GraphQL enabled server backend; for that you can use roapi or similar tools, or in many cases Spark or Presto is just fine (when the data size is large)

Also, regarding 3., please note that (AFAIK) datafusion and datafusion-cli themselves are not concerned with distributed computing, i.e. data sharding is something built on top of them; they can only do in-memory, uniformly accessible data manipulation.

What datafusion-cli can be

Given the above assertions, I believe the places where datafusion-cli can shine are:

- when the data size is large enough that simple tools like jq or xsv can't cut it (within a reasonable amount of time), but still small enough to fit into memory (an EC2 machine can have up to 12TB), if you do care about speed
- when there's no need to keep a long-running server or adopt the full stack of a Spark or Presto cluster, given their high maintenance cost

I think in many ETL cases you do encounter this type of need for a “glue” script that can just run from a simple shell script, triggered manually, via Airflow, crontab, etc. It runs to the end without any setup needed, takes in one or more parquet files and an SQL script, and spits out target data in JSON, CSV, or maybe Parquet.

It can also be a REPL: an interactive shell to quickly verify commands and look into data.

Considerations on API

I think there are two aspects of API to this CLI:

1. the command line API (flags, options, naming, etc.)
2. the query syntax itself

For 2. it's clear that Postgres-compatible SQL is the choice here, but for 1. I don't think we are necessarily bound to the peculiarities of psql itself because:

- it's in many cases specific to Postgres and the fact that it's a client-server architecture
- psql is in many cases used interactively, but when it comes to shell automation, there are all sorts of other programs in use, e.g. pg_ctl, pg_dumpall, etc., whereas here they all map to datafusion-cli

Having said that, I do think commands like \copy are already familiar to users and nice to have, but I admit that in many cases I am still confused by the different flags and behaviors when that command is used interactively versus as a CLI option. We need to be more consistent here, just not necessarily consistent with psql itself.

What comes next

Although I haven't been working on this area recently, the so-called "roadmap" that I have in mind would be:

- build a fully abstracted layer of REPL parsing so that queries and commands are separated and handled correctly; currently it's kind of a hack (e.g. SIGINT isn't properly handled)
- hook up the stats subsystem and have the CLI print out more stats for query debugging, etc.
- better error handling for interactive use and shell scripting usage
- widen the usage by publishing to apt, brew, and possibly a NuGet registry so that people can start using it more
- maybe adopt a shorter name, like dfcli?
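
To make the first roadmap item concrete, here is a minimal sketch of what separating commands from queries could look like. It is written in Python for brevity and is not the actual datafusion-cli code: the names `Command`, `Query`, and `parse_repl_input` are hypothetical, and the rules (a leading backslash marks a command, a trailing semicolon ends a query) are assumptions borrowed loosely from psql.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Union


@dataclass
class Command:
    # A backslash meta-command, e.g. name="copy", args=["t", "to", "out.csv"].
    name: str
    args: List[str]


@dataclass
class Query:
    # One complete SQL statement, without the trailing semicolon.
    sql: str


def parse_repl_input(lines: Iterable[str]) -> Iterator[Union[Command, Query]]:
    # Assumed rules (hypothetical, not taken from datafusion-cli):
    #   - a line starting with a backslash is a command, unless we are
    #     in the middle of a multi-line SQL statement
    #   - any other line is accumulated until a line ends with ';'
    pending: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if not pending and line.startswith("\\"):
            parts = line[1:].split()
            if parts:
                yield Command(parts[0], parts[1:])
            continue
        pending.append(line)
        if line.endswith(";"):
            yield Query(" ".join(pending).rstrip(";").strip())
            pending = []
    # Unterminated SQL left in `pending` is dropped here; an interactive
    # shell would instead keep prompting for more input.


if __name__ == "__main__":
    for item in parse_repl_input(["\\d", "SELECT a", "FROM t;"]):
        print(item)
```

With a typed split like this, the dispatch code can treat commands and queries separately, and the same parser can serve both interactive input and script files. The sketch deliberately ignores semicolons inside string literals and SQL comments, which a real implementation would have to handle.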

alamb · 2021-10-11T10:56:47Z

FWIW #1102 is related to the larger question of "datafusion roadmap"

houqp · 2021-10-12T05:30:56Z

I think we should PR @jimexist's well-written comment into #1104, or add it to the roadmap.md file after #1104 is merged :)

alamb · 2021-10-12T17:08:21Z

Added my translation in 65910be -- additional thoughts welcome

alamb · 2021-10-26T12:46:58Z

I think this one is now done, so closing ticket -- would love to have your help @rupurt !

jimexist · 2021-10-28T05:05:28Z

related Homebrew/homebrew-core#88184

rupurt added the enhancement (New feature or request) label Oct 9, 2021

alamb added the question (Further information is requested) label Oct 11, 2021

alamb closed this as completed Oct 26, 2021
