[0.2.27] behavior change in EMPTY management #75

remicollet · 2020-07-06T08:41:59Z

Looks like the change in 0.2.27 creates a behavior change resulting in infinite loop.

Found trying to run https://pecl.php.net/package/xlswriter test suite against 0.2.27 (was ok against 0.2.26)

2 tests are failing with process timeout

@viest for information, if you can also have a look

brechtsanders · 2020-07-06T08:57:39Z

Can you tell me what these tests were exactly and how I can reproduce the problem?
On which OS an with which zip library was this?
Note: changes between 0.2.26 and 0.2.27 were only in the reader.

remicollet · 2020-07-06T09:09:25Z

Can you tell me what these tests were exactly and how I can reproduce the problem?

To reproduce, from https://github.com/viest/php-ext-xlswriter (with php 7 devel installed)

$ phpize
$ ./configure --with-libxlsxio=/path/to/libxlsio/install/dir    --enable-reader 
$ make
$ make test

Failing test are

tests/open_xlsx_get_data_skip_empty.phpt
tests/open_xlsx_next_row_skip_empty.phpt

On which OS an with which zip library was this?

On Fedora 31, using libzip 1.7.1

brechtsanders · 2020-07-06T09:38:53Z

I did not write the PHP library, so I need to reproduce this in C to get to the bottom of this.
Can you send me the tutorial.xlsx file used in the tests? I couldn't find it right away.

remicollet · 2020-07-06T09:42:48Z

tutorial.xlsx

brechtsanders · 2020-07-06T09:49:54Z

Thanks. Now I can reproduce the issue when using XLSXIOREAD_SKIP_EMPTY_CELLS.

remicollet · 2020-07-06T10:12:34Z

Great...

If you have a fix, I can try it before a new release.

brechtsanders · 2020-07-06T11:22:21Z

Okay, I will let you know when I committed new code.

brechtsanders · 2020-07-06T13:46:19Z

More work than I thought. Not finished yet, but if you like you can already test master if it no longer crashes/hangs.

remicollet · 2020-07-06T13:57:42Z

Using a6ab8e1
PHP extension test suite passes again.

brechtsanders · 2020-07-06T14:04:07Z

Final issues resolved. It took me a while to make sure all combinations of XLSXIOREAD_SKIP_EMPTY_* worked well.
If it still passes I will release the changes.

remicollet · 2020-07-06T14:16:58Z

Looks like f49a3d1 breaks something else

TEST 67/84 [tests/open_xlsx_next_row_with_data_type_date_array_index.phpt]
========DIFF========
007+ array(2) {
008+   [0]=>
009+   string(0) ""
010+   [1]=>
011+   string(0) ""
007- array(0) {
========DONE========

remicollet · 2020-07-06T14:21:09Z

Test file used by this test
tutorial.xlsx

brechtsanders · 2020-07-06T14:39:08Z

Can't reproduce in C. What exactly is being tested here?

remicollet · 2020-07-06T15:02:04Z

Can't reproduce in C. What exactly is being tested here?

Not clear to me

@viest can you please help on this ?

IIUC, second row have no value, but 2 empty rows are returned (xlsxioread_sheet_next_cell)

brechtsanders · 2020-07-06T16:33:03Z

To clarify, this is the correct behavior.
XLSX I/O assumes the Excel files are in a table format containing header rows.
So when nothing is skipped an empty line will contain as many empty cells as there are cells in the first row.

remicollet · 2020-07-07T05:51:49Z

To clarify, this is the correct behavior.

Ok, so fixing the tests of the ext side
viest/php-ext-xlswriter#283

viest · 2020-07-08T03:24:10Z

When nothing is skipped, the empty row will contain as many empty cells as the first row.

This seems unreasonable, there is nothing in worksheet.xml, so you should not add cells.

brechtsanders · 2020-07-08T05:43:55Z

The basic idea is to read tables from Excel files.
The end-user doesn't care much if something is there or not if the cell is part of the table.
Maybe another option can be added to ignore the number of columns defined on the first row, but so far nobody has ever resuested this.

remicollet · 2020-07-08T05:55:50Z

but so far nobody has ever resuested this.

Perhaps because it was the old behavior with version <= 0.2.27 ?
;)

viest · 2020-07-08T07:04:01Z

In the sheet1.xml file, the second row does not declare any cells, so it should not return the same number of empty cells as the first row.

If you ignore the sheet.xml file and insert some dummy empty cells, the expenditure in terms of traversal performance will increase.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
           xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
    <dimension ref="A1:E3"/>
    <sheetViews>
        <sheetView tabSelected="1" workbookViewId="0"/>
    </sheetViews>
    <sheetFormatPr defaultRowHeight="15"/>
    <sheetData>
        <row r="1" spans="1:5">
            <c r="B1" t="s">
                <v>0</v>
            </c>
        </row>
        <row r="3" spans="1:5">
            <c r="A3" t="s">
                <v>1</v>
            </c>
            <c r="E3" s="1">
                <v>43726.6259375</v>
            </c>
        </row>
    </sheetData>
    <pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3"/>
</worksheet>

brechtsanders · 2020-07-08T07:25:59Z

Let me clarify a bit more the library was designed to read tables, which means the have header rows and then data below them.

+------+------+
| colA | colB |
+------+------+
| data | data |
+------+------+
| data | data |
+------+------+

In this case the table would look like:

+---+---+
|   | 0 |
+---+---+
|   |   |
+---+---+
| 1 |   |  (with some extra data outside the table)
+---+---+

For the behaviour you're expecting an extra flag could be created to ignore the header line.

viest · 2020-07-08T07:37:33Z

If the original idea was to return an equal number of cells based on the table header, you might also need a configuration that allows users to customize the row where the table header is located. In the actual environment, you may encounter many problems. In many cases, the first row is not the table header.

And this modification will destroy the existing application, which needs to be evaluated.

brechtsanders · 2020-07-08T08:04:39Z

Okay, so in your opinion, would it be better to require a new flag to get the table behaviour and to default to not using the table behaviour?.
Adding a function to skip the first N lines has crossed my mind too

remicollet · 2020-07-08T08:14:58Z

Okay, so in your opinion, would it be better to require a new flag to get the table behaviour and to default to not using the table behaviour?.

IMHO: yes, as this will restore old behavior, and make the "table" one new and optional (avoid BC break)

viest · 2020-07-08T08:16:04Z

Yes, keeping the API single responsibility helps to achieve more valuable functions, such as the function of php-ext-xlswriter: skip N lines.

remicollet · 2020-07-08T13:13:52Z

To be clear about the change

Until version 0.2.26, xlsxioread_sheet_last_column_index return 0 for not existing cell
Since version 0.2.28, it returns 1, 2...

xlsxioread_sheet_next_cell behavior is the same

And the PHP extension was used to ignore cell with 0 index.

brechtsanders · 2020-07-10T14:29:39Z

For now the table behaviour (determining the number of columns based on header row) is only active when using the XLSXIOREAD_SKIP_EXTRA_CELLS flag. This is now in release 0.2.29.

remicollet · 2020-07-11T05:50:31Z

IMHO everything looks consistent

XLSXIOREAD_SKIP_NONE

	1
	  A: 
	  B: Cost
	2
	3
	  A: viest
	  B: 
	  C: 
	  D: 
	  E: 43726.6259375

XLSXIOREAD_SKIP_EMPTY_ROWS

	1
	  A: 
	  B: Cost
	2
	  A: viest
	  B: 
	  C: 
	  D: 
	  E: 43726.6259375

XLSXIOREAD_SKIP_EMPTY_CELLS

	1
	  B: Cost
	2
	3
	  A: viest
	  E: 43726.6259375

XLSXIOREAD_SKIP_ALL_EMPTY

	1
	  B: Cost
	2
	  A: viest
	  E: 43726.6259375

XLSXIOREAD_SKIP_EXTRA_CELLS

	1
	  A: 
	  B: Cost
	2
	  A: 
	  B: 
	3
	  A: viest
	  B:

Notice: xlsxioread_sheet_last_row_index when empty row 2 is skipped, shoudn't row 3 index be preserved (return 2 for now) as xlsxioread_sheet_last_column_index do for coloumn index ?

brechtsanders · 2020-07-11T08:34:32Z

xlsxioread_sheet_last_row_index is the last row seen by the parser, but the skipping of an empty row is done when reading the first cell of a row. So before reading that first cell xlsxioread_sheet_last_row_index will still show you the index of the last row that wasn't empty.

viest · 2020-07-11T09:23:31Z

It's really bad, skip does not mean that he does not exist, so the index also needs to be updated.

brechtsanders · 2020-07-11T09:43:11Z

As the design is to support streaming data the engine doesn't read ahead, so it doesn't know what it's going to skip.
The value xlsxioread_sheet_last_row_index will be correct if you check it when the first cell of the row has been read.

remicollet · 2020-07-12T06:09:48Z

The value xlsxioread_sheet_last_row_index will be correct if you check it when the first cell of the row has been read.

OK, But for now,

No skip:

	  1 A: 
	  1 B: Cost
	  3 A: viest
	  3 B: 
	  3 C: 
	  3 D: 
	  3 E: 43726.6259375

Skipping all

	  1 B: Cost
	  2 A: viest
	  2 E: 43726.6259375

When we expect

	  1 B: Cost
	  3 A: viest
	  3 E: 43726.6259375

The value xlsxioread_sheet_last_row_index will be correct if you check it when the first cell of the row has been read.

Using attached test code
readxls.txt

myk002 · 2020-07-13T03:40:10Z

I'm actually really happy to find this bug -- I'm hoping to use this library to read spreadsheets that don't have headers at all. The data will represent physical layouts, with codes in each cell to indicate what should be placed at that location. The first line will contain only one cell with a description string. It looks like the behavior I'm looking for will be supported with either SKIP_NONE or SKIP_EMPTY_CELLS.

The text in the README.md might need updating, though:

intended to process .xlsx files as a data table, which assumes the following:
  assumes the first row contains header names
  assumes the next rows contain values in the same columns as where the header names are supplied

remicollet · 2020-08-07T06:24:27Z

@brechtsanders any idea about to fix this ?

remicollet mentioned this issue Jul 7, 2020

fix test expectation with libxlsxio 0.2.28 viest/php-ext-xlswriter#283

Merged

remicollet closed this as completed Jul 7, 2020

brechtsanders reopened this Jul 12, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[0.2.27] behavior change in EMPTY management #75

[0.2.27] behavior change in EMPTY management #75

remicollet commented Jul 6, 2020 •

edited

Loading

brechtsanders commented Jul 6, 2020

remicollet commented Jul 6, 2020

brechtsanders commented Jul 6, 2020

remicollet commented Jul 6, 2020

brechtsanders commented Jul 6, 2020

remicollet commented Jul 6, 2020

brechtsanders commented Jul 6, 2020

brechtsanders commented Jul 6, 2020

remicollet commented Jul 6, 2020

brechtsanders commented Jul 6, 2020

remicollet commented Jul 6, 2020

remicollet commented Jul 6, 2020

brechtsanders commented Jul 6, 2020

remicollet commented Jul 6, 2020 •

edited

Loading

brechtsanders commented Jul 6, 2020

remicollet commented Jul 7, 2020

viest commented Jul 8, 2020

brechtsanders commented Jul 8, 2020

remicollet commented Jul 8, 2020

viest commented Jul 8, 2020 •

edited

Loading

brechtsanders commented Jul 8, 2020

viest commented Jul 8, 2020

brechtsanders commented Jul 8, 2020

remicollet commented Jul 8, 2020

viest commented Jul 8, 2020 •

edited

Loading

remicollet commented Jul 8, 2020

brechtsanders commented Jul 10, 2020

remicollet commented Jul 11, 2020

brechtsanders commented Jul 11, 2020

viest commented Jul 11, 2020

brechtsanders commented Jul 11, 2020

remicollet commented Jul 12, 2020 •

edited

Loading

myk002 commented Jul 13, 2020

remicollet commented Aug 7, 2020

[0.2.27] behavior change in EMPTY management #75

[0.2.27] behavior change in EMPTY management #75

Comments

remicollet commented Jul 6, 2020 • edited Loading

brechtsanders commented Jul 6, 2020

remicollet commented Jul 6, 2020

brechtsanders commented Jul 6, 2020

remicollet commented Jul 6, 2020

brechtsanders commented Jul 6, 2020

remicollet commented Jul 6, 2020

brechtsanders commented Jul 6, 2020

brechtsanders commented Jul 6, 2020

remicollet commented Jul 6, 2020

brechtsanders commented Jul 6, 2020

remicollet commented Jul 6, 2020

remicollet commented Jul 6, 2020

brechtsanders commented Jul 6, 2020

remicollet commented Jul 6, 2020 • edited Loading

brechtsanders commented Jul 6, 2020

remicollet commented Jul 7, 2020

viest commented Jul 8, 2020

brechtsanders commented Jul 8, 2020

remicollet commented Jul 8, 2020

viest commented Jul 8, 2020 • edited Loading

brechtsanders commented Jul 8, 2020

viest commented Jul 8, 2020

brechtsanders commented Jul 8, 2020

remicollet commented Jul 8, 2020

viest commented Jul 8, 2020 • edited Loading

remicollet commented Jul 8, 2020

brechtsanders commented Jul 10, 2020

remicollet commented Jul 11, 2020

brechtsanders commented Jul 11, 2020

viest commented Jul 11, 2020

brechtsanders commented Jul 11, 2020

remicollet commented Jul 12, 2020 • edited Loading

myk002 commented Jul 13, 2020

remicollet commented Aug 7, 2020

remicollet commented Jul 6, 2020 •

edited

Loading

remicollet commented Jul 6, 2020 •

edited

Loading

viest commented Jul 8, 2020 •

edited

Loading

viest commented Jul 8, 2020 •

edited

Loading

remicollet commented Jul 12, 2020 •

edited

Loading