Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Excel & CSV query runner #2478

Merged
merged 11 commits into from
Jul 27, 2021
Merged

Excel & CSV query runner #2478

merged 11 commits into from
Jul 27, 2021

Conversation

deecay
Copy link
Contributor

@deecay deecay commented Apr 20, 2018

This PR adds Excel and CSV as possible datasource.

  • Datasource Name is the only configuration for both datasource.
  • Query must be in YAML format.
  • Specify path of excel/csv file, either local path or URI, for url parameter.
  • You may pass other parameters to the pandas excel/csv parsing function.

Online Excel data file with parameters
image

  • Reference for parameters: CSV and Excel.

Same data without parameters (table is broken)

  • https://www.unicef.org/sowc2012/pdfs/U5MR-rank_FINAL.xls
    image

  • TODO:

  • db-logo

@deecay deecay changed the title Excel query runner Excel & CSV query runner Sep 3, 2018
@deecay
Copy link
Contributor Author

deecay commented Sep 3, 2018

Need db-logo for Excel...

Copy link
Member

@arikfr arikfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Supporting both CSV and Excel is great.

My main concern with this implementation is the use of Pandas, as I always felt it's a "heavy" dependency that might introduce issues with setting up Redash for some users. Although I did look into it again and I'm no longer sure about this.

@jezdez do you happen to have any insight on Pandas requirements?

(and, @jezdez , of course it's another case for #2921 ;-))

@deecay
Copy link
Contributor Author

deecay commented Nov 20, 2018

@jezdez, any opinions? Maybe resort to requirements_excel_ds.txt?

@jezdez
Copy link
Member

jezdez commented Nov 20, 2018

Yeah, installing Pandas and numpy is quite a lot for the purpose of reading and parsing Excel and CSV files alone. That's an additional ~23 MB (unpacked whl files on Linux for Python 2.7) for the Docker image and a higher maintenance burden given the high profile of Pandas and numpy for a relatively small, albeit useful feature.

Options forward:

A positive point is that both Pandas and numpy are available as precompiled whl files, so it's literally just a matter of downloading it.

@denisov-vlad
Copy link
Member

Pandas is must have for python query runner.

Maybe it's heavy but does not require additional libraries which should be installed via apt like other datasources like mssql.

@jezdez
Copy link
Member

jezdez commented Nov 20, 2018

Pandas is must have for python query runner.

Maybe it's heavy but does not require additional libraries which should be installed via apt like other datasources like mssql.

That may or may not be so, it's out of scope of this pull request review though.

@deecay
Copy link
Contributor Author

deecay commented Nov 20, 2018

Option 2 is acceptable for me too. I will wait until #2921 is done.

xlrd will be the candidate for Option 1, but this involves decent amount of 'reinventing the wheel' to get the nice features that Pandas have (skiprows, usecols, etc). Without these features, a "perfectly formatted" excel will be required, which can be very rare as I have displayed in the examples.

@arikfr
Copy link
Member

arikfr commented Nov 20, 2018

Option 2 is acceptable for me too. I will wait until #2921 is done.

We don't have to extract all the query runners for you to be able to do this. But we will probably need to extract the query runners base classes and helper methods to their own package, so they can be used "externally". @jezdez , am I correct here?

xlrd will be the candidate for Option 1, but this involves decent amount of 'reinventing the wheel' to get the nice features that Pandas have (skiprows, usecols, etc). Without these features, a "perfectly formatted" excel will be required, which can be very rare as I have displayed in the examples.

I missed the added functionality. That's actually nice :) I would suggest a different "query syntax" though, let's use YAML here, so the query becomes:

url: https://www.unicef.org/sowc2012/pdfs/U5MR-rank_FINAL.xls
names:
  - Country
  - Mortality
  - Rank
usecols: [0, 1, 2]
skiprows: 7
skipfooter: 2

@deecay
Copy link
Contributor Author

deecay commented Nov 20, 2018

@Arik, interesting idea. I'll take a look at yandex metrica query runner first, and see what I can do about yaml.

@deecay
Copy link
Contributor Author

deecay commented Nov 30, 2018

Done with the yaml part.
image

@deecay deecay changed the title Excel & CSV query runner [WIP] Excel & CSV query runner Nov 30, 2018
@deecay
Copy link
Contributor Author

deecay commented Oct 24, 2019

Is this okay to merge now (by pymapd), since we have both numpy and pandas in our standard install?

@machzqcq
Copy link

machzqcq commented Dec 15, 2019

Is this okay to merge now, since we have both numpy and pandas in our standard install?

Can someone confirm this is merged, reading excel and pandas df as data source will help a lot

@dersphere
Copy link

Would be great to have my initial code somehow integrated (thanks @deecay) :)

@deecay deecay changed the title [WIP] Excel & CSV query runner Excel & CSV query runner Dec 26, 2019
@deecay deecay requested a review from jezdez December 26, 2019 01:11
@xsvfat
Copy link

xsvfat commented Feb 24, 2020

Would be great to get this merged!

Copy link
Member

@jezdez jezdez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry this never went anywhere folks.

@deecay
Copy link
Contributor Author

deecay commented May 26, 2021

I'm sorry this never went anywhere folks.

Hi @jezdez, what is blocking this? Or should I close this for some reason?

Copy link
Member

@arikfr arikfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good news: we can merge this! @deecay can you please confirm this still works on latest master?

And sorry it took so long.

Copy link
Member

@arikfr arikfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually there are few things missing:

  1. Logos for each one of them.
  2. Register them so they loaded by default.
  3. Add the local urls filtering (let me know if you need a pointer for what I mean).

Thanks!

@deecay
Copy link
Contributor Author

deecay commented Jul 24, 2021

  1. Logo for Excel added. (csv logo was already there)
  2. Updated init file to be enabled by default

Could you point to the local urls filtering?

@susodapop
Copy link
Contributor

Could you point to the local urls filtering?

I think Arik is referring to this which was discussed when implementing the JSON data source.

ua = args['user-agent']
args.pop('user-agent', None)

if is_private_address(path) and settings.ENFORCE_PRIVATE_ADDRESS_BLOCK:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@susodapop , I guess you're right. Done.

@arikfr arikfr merged commit b9cb819 into getredash:master Jul 27, 2021
@arikfr
Copy link
Member

arikfr commented Jul 27, 2021

So it took over 3 years but it's merged now. Thank you for the the continued contribution and immense patience 😅

@dersphere
Copy link

😃

@deecay deecay deleted the datasource-excel branch July 28, 2021 10:10
@ainsofs
Copy link

ainsofs commented Sep 28, 2021

CSV and Excel do not appear on my list of data sources. Is there something I need to do to make that happen? I am using image redash/redash:8.0.0.b32245 . I also tried image version 10 beta with the same result

@susodapop
Copy link
Contributor

@ainsofs You need to run the tip of master since this PR isn't merged into the V10 branch (yet). It will be before we release.

susodapop pushed a commit that referenced this pull request Oct 1, 2021
* Excel query runner

* Param handling for read_excel

* CSV query runner

* Fix wrong module name

* Use yaml as query language

* Use yaml as query language for CSV

* Added icon and required modules

* Local address filtering

* Fix syntax error
susodapop pushed a commit that referenced this pull request Oct 1, 2021
The following PR's were cherry-picked:

* Excel & CSV query runner (#2478)
* Pin python3 image version (#5570)
* Fix: Edit Source button disappeared for users without CanEdit perms (#5568)
* Fix: Specify the protobuf version (#5608)

Plus one additional change exclusive to this branch:
* Replace reference to yarn with NPM

This happened because we cherry-picked #5570 but did not also incorporate
#5541 into V10.

Co-authored-by: deecay <deecay@users.noreply.github.com>
Co-authored-by: Levko Kravets <levko.ne@gmail.com>
Co-authored-by: zoomdot <gninggoon@gmail.com>
@nixftw
Copy link

nixftw commented Oct 18, 2021

@susodapop What do you mean by run the tip of master ? Got redash working , but CSV does not appear on the list.

@nixftw
Copy link

nixftw commented Oct 18, 2021

@ainsofs Did you manage do get it working ? If so , please share the solution :)

denisov-vlad pushed a commit to denisov-vlad/redash that referenced this pull request Oct 19, 2021
commit 9c928bd
Author: Jesse Whitehouse <jesse@whitehouse.dev>
Date:   Fri Oct 1 21:13:13 2021 -0500

    Bump version to 10.0.0

commit f312adf
Author: Jesse <jesse.whitehouse@databricks.com>
Date:   Fri Oct 1 18:02:27 2021 -0500

    Apply V10 beta period feedback / fixes (getredash#5611)

    The following PR's were cherry-picked:

    * Excel & CSV query runner (getredash#2478)
    * Pin python3 image version (getredash#5570)
    * Fix: Edit Source button disappeared for users without CanEdit perms (getredash#5568)
    * Fix: Specify the protobuf version (getredash#5608)

    Plus one additional change exclusive to this branch:
    * Replace reference to yarn with NPM

    This happened because we cherry-picked getredash#5570 but did not also incorporate
    getredash#5541 into V10.

    Co-authored-by: deecay <deecay@users.noreply.github.com>
    Co-authored-by: Levko Kravets <levko.ne@gmail.com>
    Co-authored-by: zoomdot <gninggoon@gmail.com>

commit 92e5d78
Author: Jesse <jesse.whitehouse@databricks.com>
Date:   Thu Jun 17 13:42:07 2021 -0500

    Update changelog details for snowflake (getredash#5519)

commit 0983e69
Author: Jesse <jesse.whitehouse@databricks.com>
Date:   Thu Jun 17 12:45:17 2021 -0500

    update changelog for v10-beta (getredash#5517)

commit dec8879
Author: Jesse <jesse.whitehouse@databricks.com>
Date:   Tue Jun 15 15:04:36 2021 -0500

    Fix: pagination is broken on the dashboard list page (getredash#5516)

    * Add test that reproduces issue getredash#5466

    * Fix: Duplicate dashboard rows were returned by Dashboard.all() (getredash#5466)

commit 64a1d7a
Author: Jesse Whitehouse <jesse@whitehouse.dev>
Date:   Tue Jun 1 11:21:49 2021 -0500

    Update version for CircleCI build.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants