
Loading multiple excel sheets #110

Closed · wants to merge 10 commits

Conversation
Conversation

@roll (Contributor) commented Sep 21, 2019

@coveralls commented Sep 21, 2019

Pull Request Test Coverage Report for Build 363

  • 49 of 49 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 84.444%

Totals
  Change from base Build 359: +0.2%
  Covered Lines: 1634
  Relevant Lines: 1935

💛 - Coveralls

@roll (Contributor, Author) commented Sep 21, 2019

@akariv
Please take a look (cc @cschloer)

@akariv (Member) commented Sep 24, 2019

@roll I have no special comment about the implementation - however I think that it's not the right location
for adding this sort of support.

Imagine the scenario of a big excel file with many sheets stored somewhere online. If the user uses this feature, this file will be downloaded n+1 times (n is the number of sheets), each sheet will be read and its schema will be inferred - all for getting one sheet only.

A tabulator-based solution could look like this: we create a Streams interface in tabulator, which exposes a bunch of Streams (based on index and/or name).
A Streams object could be created from a list of Stream creation options - it would create separate Stream objects, possibly sharing the Loader (thus avoiding the need to re-download the same file). Some changes to the existing loaders would have to be made to support use by more than one Stream.
Other creation options would be source-based (e.g. for multi-sheet Excel/GSheet/ODS, multi-table SQL, etc.).
Naturally, Streams would also expose a 'get all stream names' method, which would map to (for example) all sheet names when working with Excel.

wdyt?
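The sharing idea in that proposal can be sketched in plain Python. Everything below is illustrative - `SharedLoader`, `Streams`, and their methods are hypothetical names, not the real tabulator API; the point is only that one fetch can back many per-sheet streams:

```python
import io

# Hypothetical sketch of the proposed "Streams" interface.
# SharedLoader/Streams are invented names, not real tabulator classes.

class SharedLoader:
    """Fetches the source bytes once and hands out fresh readers."""
    def __init__(self, fetch):
        self._fetch = fetch      # callable returning the raw bytes
        self._cache = None

    def open(self):
        if self._cache is None:
            self._cache = self._fetch()   # the single download
        return io.BytesIO(self._cache)    # fresh reader over cached bytes

class Streams:
    """One logical stream per sheet/table, all sharing one loader."""
    def __init__(self, loader, names):
        self._loader = loader
        self._names = list(names)

    def names(self):
        """The 'get all stream names' operation from the proposal."""
        return list(self._names)

    def open(self, name):
        if name not in self._names:
            raise KeyError(name)
        return self._loader.open()

# Demo: the "file" is fetched exactly once, even for two sheets.
calls = []
def fetch():
    calls.append(1)
    return b"sheet-bytes"

streams = Streams(SharedLoader(fetch), ["Sheet1", "Sheet2"])
a = streams.open("Sheet1").read()
b = streams.open("Sheet2").read()
print(len(calls))  # fetched once, not n+1 times
```

A real implementation would also need per-sheet parsing and schema inference on top of the shared bytes, which is where the loader changes mentioned above come in.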

@roll (Contributor, Author) commented Sep 25, 2019

@akariv
I agree that it can be very inefficient.

tabulator.Streams is probably the best option. Although maybe the problem could be solved with caching at the tabulator level, without introducing new APIs - I'm not sure...

Anyway, @cschloer, for now I'd say the only option is to use this code in your custom BCO-DMO processor (a load or a separate load_multiple-like processor), because the required changes to tabulator are massive and probably can't be done in this iteration of work. WDYT?

@cschloer (Contributor)

Hmm, okay. I've been resisting overwriting the dataflows load in my custom load processor (so I can keep getting updates from new dataflows versions), but maybe I can figure out a way to get this working within my custom load processor. And maybe something that avoids streaming all of the data and just gets the sheet names? Thanks for the ideas @roll - I'll update you when I have something, and you can look it over if you're interested.

@cschloer (Contributor)

Just throwing this out here:

I was able to use the xlrd open_workbook function with on_demand=True to avoid loading the entire spreadsheet, then use sheet_names() to get the sheet names and run a regex over them.

    import re
    import xlrd

    if parameters.get('sheet_regex', False):
        # Handle a regular-expression sheet name
        xls = xlrd.open_workbook(url, on_demand=True)
        sheet_names = xls.sheet_names()
        sheet_regex = parameters.pop('sheet', '')
        for sheet_name in sheet_names:
            if re.match(sheet_regex, sheet_name):
                ...  # add the matching sheet to the load
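The regex part of that snippet can be exercised on its own. Here is a minimal, self-contained version with the workbook stubbed out as a plain list (in the real processor the names would come from xlrd's `sheet_names()`); note that `re.match` anchors at the start of the name, so `data` also matches `data_summary`:

```python
import re

def match_sheets(sheet_names, sheet_regex):
    """Return the sheet names matching the regex, preserving order.

    re.match anchors at the start of each name, mirroring the
    snippet above; use re.fullmatch for exact-name matching.
    """
    return [name for name in sheet_names if re.match(sheet_regex, name)]

# Stubbed sheet list standing in for xlrd's workbook.sheet_names()
names = ["data_2018", "data_2019", "notes", "data_summary"]
print(match_sheets(names, r"data_\d+"))   # ['data_2018', 'data_2019']
```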

@roll (Contributor, Author) commented Sep 30, 2019

@cschloer
The problem with tabulator is that it handles all data sources as abstract objects. For example, BCO-DMO software can know that all its files are local and treat them accordingly, but tabulator has to support remote sources, byte streams, etc., and do so in a generalized way.

BTW, is it possible to run this code on BCO-DMO side? And then just use the standard load processor?

@roll (Contributor, Author) commented Oct 4, 2019

I'm closing this for now as WONTFIX.

@roll closed this Oct 4, 2019

@cschloer (Contributor) commented Oct 4, 2019

Hey, sorry - I didn't understand that xlrd.open_workbook() only works on local files. You are totally correct! This isn't too much of a limitation for us, as most of our files come from local paths, as you said. I've already added this to my own load processor and am able to keep using the dataflows load, so I'm good to go 👍

@cschloer (Contributor)

@roll We are moving our infrastructure to use remote files (s3 urls) in place of local file paths, so the solution I originally made no longer works (or rather it loads the file n+1 times). Do you think we could revisit the solution suggested by @akariv of handling multiple Excel sheets within tabulator?
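One interim workaround, pending a tabulator-level solution, is to materialize the remote file into a local temp file once and point xlrd at that local copy for every sheet, collapsing the n+1 downloads to one. This is a sketch under assumptions: `materialize` is a hypothetical helper, and the remote body is simulated with an in-memory stream (in practice it would be an S3/HTTP response object):

```python
import io
import shutil
import tempfile

def materialize(fileobj):
    """Copy a remote file-like object to a local temp file once.

    Hypothetical helper: returns the local path, so every subsequent
    sheet read (e.g. xlrd.open_workbook(path, on_demand=True))
    hits the local copy instead of re-downloading.
    """
    with tempfile.NamedTemporaryFile(delete=False, suffix=".xlsx") as tmp:
        shutil.copyfileobj(fileobj, tmp)
        return tmp.name

# Simulated remote body (would be an S3/HTTP response in practice)
remote = io.BytesIO(b"pretend-xlsx-bytes")
local_path = materialize(remote)

with open(local_path, "rb") as f:
    data = f.read()
print(data == b"pretend-xlsx-bytes")  # True
```

The caller would be responsible for deleting the temp file when the load finishes.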
