Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Column type detection fails in XlsxReader #751

Closed
Nadavbi87 opened this issue Feb 4, 2020 · 5 comments · Fixed by #931
Closed

Column type detection fails in XlsxReader #751

Nadavbi87 opened this issue Feb 4, 2020 · 5 comments · Fixed by #931
Labels

Comments

@Nadavbi87
Copy link

Hi,
First of all, thanks for this awesome package.

I'm using the XlsxReader to import excel file, in one of my tests I came across the following issue:
When the reader auto-detect the column types and the first column value is empty, it automatically detects it as a String Column and creates a String typed column, till here everything is fine.
When any other value in that column is a Numeric type(in my case all of them) it gets ignored and instead, it returns an empty value.
So, as a result, I get a table with a column that all of the values are empty.
I would except that I get all the values at least as a string and not empty value.

The logic is in the XlsxReader.private Column<?> appendValue(Column<?> column, Cell cell).

I try to solve it by using the "specifying the datatypes for each column" approach
I did the following :

Source source1 = new Source(new ByteArrayInputStream(byteArrayOutputStream.toByteArray()));
List<ColumnType> typesTotalData = Arrays.asList(new ColumnType[] {ColumnType.STRING, ColumnType.STRING, ColumnType.STRING, ColumnType.STRING, ColumnType.INTEGER, ColumnType.INTEGER});
Table table = Table.read().usingOptions(XlsxReadOptions.builder(source1).sheetIndex(0).columnTypesToDetect(typesTotalData));

Then I realize that there is no use in the columnTypedToDetect in the XlsxReader.

*Another suggestion, it will be good if I'm using multiple sheets and using public List<Table> readMultiple(XlsxReadOptions options) throws IOException
That I can pass a list of options that each one of them is bounded to a specific sheet, Because in the current situation if I want for example different name ( or different column type list) for each sheet/table I need to perform a multiple reads ( as the number of the sheets ) and set the options for each one even though under the hood for each call it goes over all the sheets and return the sheet that correspond the index the user pass in the options.

I 'm using the latest version 0.37.3.

Thanks,
Nadav

@lwhite1
Copy link
Collaborator

lwhite1 commented Feb 23, 2020

Hi, @Nadavbi87,
Sorry you're having trouble with the library.

Is it possible to see if you can recreate this by first exporting the Excel file to CSV and see if it has the same problem on import?

@ccleva
Copy link
Contributor

ccleva commented Apr 2, 2020

Hi all,
I stumbled on this issue a couple of days ago and took the opportunity to dig a bit further:

I could reproduce the problem always by using one of the empty string indicators (TypeUtils.MISSING_INDICATORS) as the first value in the column (after the header). In this case the column is identified as a StringColumn and the following numeric values are ignored (the column is filled with empty values).

The problem does not occur with csv as the column type identification is done in a different way than with the excel reader (which rely on the cellType).

I think there are 2 different issues there:

  1. Creating a StringColumn despite a missing value indicator. Creation could be delayed until a non-missing value is read
  2. Columns with mixed-content are not handled properly: StringColumn are not filled with numerical values as String, and similarly numeric columns will not be converted to StringColumn if non numeric values are found (I have not checked how this is handled with csv).

I can provide a couple of (failing) tests and try to implement a solution to the 1st issue if you want

@benmccann benmccann changed the title Using XlsxReader Ignore column types to detect option and causing undesirable behavior. Column type detection fails in XlsxReader Apr 10, 2020
@benmccann
Copy link
Collaborator

We have a ColumnTypeDetector that is used by CsvReader, FixedWidthReader, and HtmlReader and handles both issues that you mention. I think that perhaps XlsxReader was written before ColumnTypeDetector existed.

The difficulty of using ColumnTypeDetector is that it takes Iterator<String[]> rows and in XlsxReader we have a org.apache.poi.ss.usermodel.Row instead of String[] for each row. I wonder if we could make ColumnTypeDetector a little more generic and subclass it for XlsxReader. There's a lot of logic in ColumnTypeDetector that'd be nice to share so that things work consistently across all the readers (e.g. when to use a StringColumn vs TextColumn, sampling, handling of mixed-content columns, etc.)

@lujop
Copy link
Contributor

lujop commented May 7, 2021

@benmccann Current type detection isn't based on the values, it's based on cell type on excel.
I think that this is not bad, but would be needed to check all the values and select the column type that can accommodate all.
In that case, I think that using ColumnTypeDetector isn't needed, isn't it?

@benmccann
Copy link
Collaborator

Ah, I wasn't aware Excel provided a cell type

lujop added a commit to lujop/tablesaw that referenced this issue May 9, 2021
Column type detection uses all column cells to determine the columnType instead of only the first one.
lujop added a commit to lujop/tablesaw that referenced this issue May 9, 2021
Column type detection uses all column cells to determine the columnType instead of only the first one.
@lujop lujop mentioned this issue May 9, 2021
1 task
lwhite1 pushed a commit that referenced this issue May 11, 2021
Column type detection uses all column cells to determine the columnType instead of only the first one.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants