Read Data from Excel Files #1994

Closed
4 tasks done
universalmind303 opened this issue Oct 27, 2023 · 7 comments
@universalmind303
Contributor

universalmind303 commented Oct 27, 2023

Description

It'd be nice to be able to natively read Excel files.

Preferred usage:

select * from read_excel('path/to/file.xlsx') -- Defaults to the first sheet

We'd likely need some options to specify the sheet:

select * from read_excel('path/to/file.xlsx', sheet => 'Sheet2') -- Uses the specified sheet
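
The sheet-selection behaviour proposed above (first sheet by default, a named sheet when `sheet => ...` is given) could be sketched roughly as follows. This is only an illustration using the Rust standard library: `select_sheet` is a hypothetical helper, and `sheet_names` stands in for whatever list of names the workbook reader would report.

```rust
/// Hypothetical helper (not part of any existing API): decide which sheet to read.
/// `sheet_names` is assumed to be the workbook's sheets in file order.
fn select_sheet<'a>(
    sheet_names: &'a [String],
    requested: Option<&str>,
) -> Result<&'a str, String> {
    match requested {
        // No `sheet => ...` argument: default to the first sheet.
        None => sheet_names
            .first()
            .map(String::as_str)
            .ok_or_else(|| "workbook has no sheets".to_string()),
        // Explicit sheet name: it must exist in the workbook.
        Some(name) => sheet_names
            .iter()
            .find(|s| s.as_str() == name)
            .map(String::as_str)
            .ok_or_else(|| format!("sheet not found: {name}")),
    }
}
```

Calling `select_sheet(&names, None)` picks the first sheet, while `select_sheet(&names, Some("Sheet2"))` either returns that sheet or an error, which seems preferable to silently falling back to the default.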

Components

@universalmind303 added the "feat" (New feature or request) label on Oct 27, 2023
@universalmind303
Contributor Author

We should be able to use calamine to read the sheets.

I'm not familiar with the Excel file format, but ideally we should parallelize reading across sheets to some degree. A first iteration could be as simple as just reading the file in; then we could look into further optimizations like parallelization and projection pushdown.

@jordandakota

There's also the office crate

https://docs.rs/office/latest/office/

@scsmithr
Member

scsmithr commented Nov 8, 2023

> There's also the office crate
>
> https://docs.rs/office/latest/office/

Any reason to switch off of calamine for this? Are there features we're missing (or bugs)?

@tychoish
Contributor

tychoish commented Nov 9, 2023

> Any reason to switch off of calamine for this? Are there features we're missing (or bugs)?

I believe that they're the same thing. Office was the old name, and it was renamed.

@greyscaled
Contributor

> > Any reason to switch off of calamine for this? Are there features we're missing (or bugs)?
>
> I believe that they're the same thing. Office was the old name, and it was renamed.

That checks out. If you click the Repository URL for that crate, it takes you to the Calamine Repository:

Repository

https://github.com/tafia/calamine

And that crate hasn't been updated for 7 years: https://crates.io/crates/office

@tychoish changed the title from "add read_excel function" to "Read Data from Excel Files" on Feb 14, 2024
@tychoish
Contributor

tychoish commented Feb 14, 2024

Just to update the status of this:

  • the read_excel table function is running in the product
  • there are some SLT tests.

I think to close this properly we should:

  • support reading from object store (using GenericObjectStoreAccess, as with some other file types).
  • add support for create external table with xlsx.
  • add xlsx to the extensions that we parse/direct on when handling paths.
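
As a rough illustration of the last item, routing a path to a reader based on its extension might look like the sketch below. It uses only the Rust standard library; the `FileFormat` enum and `format_from_path` function are hypothetical names for this example and are not taken from the codebase.

```rust
use std::path::Path;

/// Hypothetical set of formats a path can be routed to (illustrative only).
#[derive(Debug, PartialEq)]
enum FileFormat {
    Csv,
    Parquet,
    Excel,
}

/// Guess the format from the file extension, ignoring ASCII case.
/// Returns `None` when there is no extension or it is not recognised.
fn format_from_path(path: &str) -> Option<FileFormat> {
    let ext = Path::new(path).extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "csv" => Some(FileFormat::Csv),
        "parquet" => Some(FileFormat::Parquet),
        // The new case this task would add.
        "xlsx" => Some(FileFormat::Excel),
        _ => None,
    }
}
```

Returning `None` for unknown extensions leaves it to the caller to decide whether to raise an error or fall back to an explicit format option.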

I think it wouldn't be absurd if we added create external database support for an Excel workbook where each sheet became a table, but maybe that's a stretch goal. Most of these tasks are pretty small and straightforward, so I'll make new issues and add the good first issue label. I'm assigning myself for tracking purposes.

@jgranduel

Hi,
will you consider supporting Excel tables? I think more and more data is stored in this structure, and tables have many advantages over ranges (they are relatively close to a database table, with column titles and no merged columns). Tables are named and, AFAIK, their names are unique within a workbook. To give a few examples I came across: Apache POI (https://poi.apache.org/apidocs/dev/org/apache/poi/ss/usermodel/Table.html), ExcelJS (https://github.com/exceljs/exceljs?tab=readme-ov-file#tables), the ImportExcel PowerShell module (https://www.powershellgallery.com/packages/ImportExcel/6.5.0/Content/GetExcelTable.ps1), and the .NET Open XML SDK (https://learn.microsoft.com/en-us/dotnet/api/documentformat.openxml.spreadsheet.table?view=openxml-3.0.1). I don't know of any Rust implementation, though (does Calamine support it?). It's frustrating to have data in Excel tables, or data loaded into Excel through Power Query (which creates Excel tables), and not be able to load it by name.

Expected behaviour would be:

select * from read_excel('path/to/file.xlsx', table => 'table1') -- Uses the specified table

Hope it's doable without too much effort. Thanks!
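
For illustration only, resolving a table by name could amount to a lookup from table name to the sheet and cell range it covers. The sketch below uses only the Rust standard library; `TableRef` and `TableIndex` are hypothetical types, and it assumes (as suggested above) that table names are unique within a workbook. It does not reflect what calamine or any other crate actually exposes.

```rust
use std::collections::HashMap;

/// Hypothetical description of an Excel table: its sheet and cell range.
#[derive(Debug, Clone, PartialEq)]
struct TableRef {
    sheet: String,
    range: String, // e.g. "A1:C10"
}

/// Hypothetical per-workbook index keyed by table name.
struct TableIndex {
    tables: HashMap<String, TableRef>,
}

impl TableIndex {
    fn new() -> Self {
        Self { tables: HashMap::new() }
    }

    /// Register a table; duplicates are rejected because names are
    /// assumed to be unique within a workbook.
    fn insert(&mut self, name: &str, table: TableRef) -> Result<(), String> {
        if self.tables.contains_key(name) {
            return Err(format!("duplicate table name: {name}"));
        }
        self.tables.insert(name.to_string(), table);
        Ok(())
    }

    /// Look up the sheet and range for `table => 'name'`.
    fn resolve(&self, name: &str) -> Option<&TableRef> {
        self.tables.get(name)
    }
}
```

With such an index, `table => 'table1'` would reduce to reading the resolved range from the resolved sheet, so the existing sheet-reading path could be reused.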
