
Loading multiple excel sheets #110

Closed · wants to merge 10 commits

Conversation
Conversation

@roll (Contributor) commented Sep 21, 2019

@coveralls commented Sep 21, 2019

Pull Request Test Coverage Report for Build 363

  • 49 of 49 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 84.444%

Totals
  Change from base Build 359: +0.2%
  Covered Lines: 1634
  Relevant Lines: 1935

💛 - Coveralls

@roll (Contributor, Author) commented Sep 21, 2019

@akariv
Please take a look (cc @cschloer)

@akariv (Member) commented Sep 24, 2019

@roll I have no special comment about the implementation - however I think that it's not the right location
for adding this sort of support.

Imagine the scenario of a big excel file with many sheets stored somewhere online. If the user uses this feature, this file will be downloaded n+1 times (n is the number of sheets), each sheet will be read and its schema will be inferred - all for getting one sheet only.

A tabulator-based solution could look like this: we create a Streams interface in tabulator, which exposes a bunch of Streams (based on index and/or name).
A Streams object could be created from a list of Stream creation options - it would create separate Stream objects, possibly sharing the Loader (thus avoiding the need to re-download the same file). Some changes to the existing loaders would have to be made to support use by more than one Stream.
Other creation options would be source-based (e.g. for multi-sheet Excel/GSheet/ODS, multi-table SQL, etc.).
Naturally, Streams would also expose a 'get all stream names' method, which would map to (for example) all sheet names when working with Excel.

wdyt?
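The sharing idea in that proposal can be sketched in plain Python. Everything below is illustrative - `SharedLoader`, `Streams`, and their methods are hypothetical names, not the real tabulator API; the point is only that one fetch can back many per-sheet streams:

```python
import io

# Hypothetical sketch of the proposed "Streams" interface.
# SharedLoader/Streams are invented names, not real tabulator classes.

class SharedLoader:
    """Fetches the source bytes once and hands out fresh readers."""
    def __init__(self, fetch):
        self._fetch = fetch      # callable returning the raw bytes
        self._cache = None

    def open(self):
        if self._cache is None:
            self._cache = self._fetch()   # the single download
        return io.BytesIO(self._cache)    # fresh reader over cached bytes

class Streams:
    """One logical stream per sheet/table, all sharing one loader."""
    def __init__(self, loader, names):
        self._loader = loader
        self._names = list(names)

    def names(self):
        """The 'get all stream names' operation from the proposal."""
        return list(self._names)

    def open(self, name):
        if name not in self._names:
            raise KeyError(name)
        return self._loader.open()

# Demo: the "file" is fetched exactly once, even for two sheets.
calls = []
def fetch():
    calls.append(1)
    return b"sheet-bytes"

streams = Streams(SharedLoader(fetch), ["Sheet1", "Sheet2"])
a = streams.open("Sheet1").read()
b = streams.open("Sheet2").read()
print(len(calls))  # fetched once, not n+1 times
```

A real implementation would also need per-sheet parsing and schema inference on top of the shared bytes, which is where the loader changes mentioned above come in.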

@roll (Contributor, Author) commented Sep 25, 2019

@akariv
I agree that it can be very inefficient.

tabulator.Streams is probably the best option. Although maybe the problem could be solved with caching at the tabulator level, without introducing new APIs - I'm not sure...

Anyway, @cschloer, for now I'd say the only option is to use this code in your custom BCO-DMO processor (a load or a separate load_multiple-like processor), because the required changes to tabulator are massive and probably can't be done in this iteration of work. WDYT?

@cschloer (Contributor)

Hmm, okay. I've been resisting overwriting the dataflows load in my custom load processor (so I can keep getting updates from new dataflows versions), but maybe I can figure out a way to get this working within my custom load processor. And maybe something that avoids streaming all of the data and just gets the sheet names? Thanks for the ideas @roll - I'll update you when I have something, and you can look it over if you're interested.

@cschloer (Contributor)

Just throwing this out here:

I was able to use the xlrd open_workbook function with on_demand=True to avoid loading the entire spreadsheet, then use sheet_names() to get the sheet names and run a regex over them.

    import re
    import xlrd

    if parameters.get('sheet_regex', False):
        # Handle a regular-expression sheet name
        xls = xlrd.open_workbook(url, on_demand=True)
        sheet_names = xls.sheet_names()
        sheet_regex = parameters.pop('sheet', '')
        for sheet_name in sheet_names:
            if re.match(sheet_regex, sheet_name):
                ...  # add the matching sheet to the load
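The regex part of that snippet can be exercised on its own. Here is a minimal, self-contained version with the workbook stubbed out as a plain list (in the real processor the names would come from xlrd's `sheet_names()`); note that `re.match` anchors at the start of the name, so `data` also matches `data_summary`:

```python
import re

def match_sheets(sheet_names, sheet_regex):
    """Return the sheet names matching the regex, preserving order.

    re.match anchors at the start of each name, mirroring the
    snippet above; use re.fullmatch for exact-name matching.
    """
    return [name for name in sheet_names if re.match(sheet_regex, name)]

# Stubbed sheet list standing in for xlrd's workbook.sheet_names()
names = ["data_2018", "data_2019", "notes", "data_summary"]
print(match_sheets(names, r"data_\d+"))   # ['data_2018', 'data_2019']
```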

@roll (Contributor, Author) commented Sep 30, 2019

@cschloer
The problem with tabulator is that it handles all data sources as abstract objects. For example, BCO-DMO software can know that all its files are local and treat them accordingly, but tabulator has to support remote sources, byte streams, etc., and do so in a generalized way.

BTW, is it possible to run this code on BCO-DMO side? And then just use the standard load processor?

@roll (Contributor, Author) commented Oct 4, 2019

I'm closing this for now as WONTFIX.

@roll closed this Oct 4, 2019

@cschloer (Contributor) commented Oct 4, 2019

Hey, sorry - I didn't understand that xlrd.open_workbook() only works on local files. You are totally correct! This isn't too much of a limitation for us, as most of our files come from local paths, as you said. I've already added this to my own load processor and am able to keep using the dataflows load, so I'm good to go 👍

@cschloer (Contributor)

@roll We are moving our infrastructure to use remote files (s3 urls) in place of local file paths, so the solution I originally made no longer works (or rather it loads the file n+1 times). Do you think we could revisit the solution suggested by @akariv of handling multiple Excel sheets within tabulator?
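One interim workaround, pending a tabulator-level solution, is to materialize the remote file into a local temp file once and point xlrd at that local copy for every sheet, collapsing the n+1 downloads to one. This is a sketch under assumptions: `materialize` is a hypothetical helper, and the remote body is simulated with an in-memory stream (in practice it would be an S3/HTTP response object):

```python
import io
import shutil
import tempfile

def materialize(fileobj):
    """Copy a remote file-like object to a local temp file once.

    Hypothetical helper: returns the local path, so every subsequent
    sheet read (e.g. xlrd.open_workbook(path, on_demand=True))
    hits the local copy instead of re-downloading.
    """
    with tempfile.NamedTemporaryFile(delete=False, suffix=".xlsx") as tmp:
        shutil.copyfileobj(fileobj, tmp)
        return tmp.name

# Simulated remote body (would be an S3/HTTP response in practice)
remote = io.BytesIO(b"pretend-xlsx-bytes")
local_path = materialize(remote)

with open(local_path, "rb") as f:
    data = f.read()
print(data == b"pretend-xlsx-bytes")  # True
```

The caller would be responsible for deleting the temp file when the load finishes.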
