Column type detection fails in XlsxReader #751

Nadavbi87 · 2020-02-04T16:04:50Z

Hi,
First of all, thanks for this awesome package.

I'm using the XlsxReader to import excel file, in one of my tests I came across the following issue:
When the reader auto-detect the column types and the first column value is empty, it automatically detects it as a String Column and creates a String typed column, till here everything is fine.
When any other value in that column is a Numeric type(in my case all of them) it gets ignored and instead, it returns an empty value.
So, as a result, I get a table with a column that all of the values are empty.
I would except that I get all the values at least as a string and not empty value.

The logic is in the XlsxReader.private Column<?> appendValue(Column<?> column, Cell cell).

I try to solve it by using the "specifying the datatypes for each column" approach
I did the following :

Source source1 = new Source(new ByteArrayInputStream(byteArrayOutputStream.toByteArray()));
List<ColumnType> typesTotalData = Arrays.asList(new ColumnType[] {ColumnType.STRING, ColumnType.STRING, ColumnType.STRING, ColumnType.STRING, ColumnType.INTEGER, ColumnType.INTEGER});
Table table = Table.read().usingOptions(XlsxReadOptions.builder(source1).sheetIndex(0).columnTypesToDetect(typesTotalData));

Then I realize that there is no use in the columnTypedToDetect in the XlsxReader.

*Another suggestion, it will be good if I'm using multiple sheets and using public List<Table> readMultiple(XlsxReadOptions options) throws IOException
That I can pass a list of options that each one of them is bounded to a specific sheet, Because in the current situation if I want for example different name ( or different column type list) for each sheet/table I need to perform a multiple reads ( as the number of the sheets ) and set the options for each one even though under the hood for each call it goes over all the sheets and return the sheet that correspond the index the user pass in the options.

I 'm using the latest version 0.37.3.

Thanks,
Nadav

The text was updated successfully, but these errors were encountered:

lwhite1 · 2020-02-23T20:54:44Z

Hi, @Nadavbi87,
Sorry you're having trouble with the library.

Is it possible to see if you can recreate this by first exporting the Excel file to CSV and see if it has the same problem on import?

ccleva · 2020-04-02T10:41:59Z

Hi all,
I stumbled on this issue a couple of days ago and took the opportunity to dig a bit further:

I could reproduce the problem always by using one of the empty string indicators (TypeUtils.MISSING_INDICATORS) as the first value in the column (after the header). In this case the column is identified as a StringColumn and the following numeric values are ignored (the column is filled with empty values).

The problem does not occur with csv as the column type identification is done in a different way than with the excel reader (which rely on the cellType).

I think there are 2 different issues there:

Creating a StringColumn despite a missing value indicator. Creation could be delayed until a non-missing value is read
Columns with mixed-content are not handled properly: StringColumn are not filled with numerical values as String, and similarly numeric columns will not be converted to StringColumn if non numeric values are found (I have not checked how this is handled with csv).

I can provide a couple of (failing) tests and try to implement a solution to the 1st issue if you want

benmccann · 2020-04-10T15:39:28Z

We have a ColumnTypeDetector that is used by CsvReader, FixedWidthReader, and HtmlReader and handles both issues that you mention. I think that perhaps XlsxReader was written before ColumnTypeDetector existed.

The difficulty of using ColumnTypeDetector is that it takes Iterator<String[]> rows and in XlsxReader we have a org.apache.poi.ss.usermodel.Row instead of String[] for each row. I wonder if we could make ColumnTypeDetector a little more generic and subclass it for XlsxReader. There's a lot of logic in ColumnTypeDetector that'd be nice to share so that things work consistently across all the readers (e.g. when to use a StringColumn vs TextColumn, sampling, handling of mixed-content columns, etc.)

lujop · 2021-05-07T11:35:08Z

@benmccann Current type detection isn't based on the values, it's based on cell type on excel.
I think that this is not bad, but would be needed to check all the values and select the column type that can accommodate all.
In that case, I think that using ColumnTypeDetector isn't needed, isn't it?

benmccann · 2021-05-08T05:14:02Z

Ah, I wasn't aware Excel provided a cell type

Column type detection uses all column cells to determine the columnType instead of only the first one.

benmccann changed the title ~~Using XlsxReader Ignore column types to detect option and causing undesirable behavior.~~ Column type detection fails in XlsxReader Apr 10, 2020

lwhite1 added the excel label May 4, 2021

lujop mentioned this issue May 7, 2021

Excel reader column autodetect logic is very limited #930

Closed

lujop mentioned this issue May 7, 2021

XlsxReader casting integers cells to blanks #882

Open

lujop added a commit to lujop/tablesaw that referenced this issue May 9, 2021

Fix jtablesaw#751 and jtablesaw#882

b9451f9

Column type detection uses all column cells to determine the columnType instead of only the first one.

lujop added a commit to lujop/tablesaw that referenced this issue May 9, 2021

Fix jtablesaw#751 and jtablesaw#882

ae68fc7

Column type detection uses all column cells to determine the columnType instead of only the first one.

lujop mentioned this issue May 9, 2021

Fixes #751 and #882 #931

Merged

1 task

lwhite1 closed this as completed in #931 May 11, 2021

lwhite1 pushed a commit that referenced this issue May 11, 2021

Fix #751 and #882 (#931)

4d0b9ab

Column type detection uses all column cells to determine the columnType instead of only the first one.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Column type detection fails in XlsxReader #751

Column type detection fails in XlsxReader #751

Nadavbi87 commented Feb 4, 2020

lwhite1 commented Feb 23, 2020

ccleva commented Apr 2, 2020

benmccann commented Apr 10, 2020

lujop commented May 7, 2021

benmccann commented May 8, 2021

Column type detection fails in XlsxReader #751

Column type detection fails in XlsxReader #751

Comments

Nadavbi87 commented Feb 4, 2020

lwhite1 commented Feb 23, 2020

ccleva commented Apr 2, 2020

benmccann commented Apr 10, 2020

lujop commented May 7, 2021

benmccann commented May 8, 2021