Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[0.2.27] behavior change in EMPTY management #75

Open
remicollet opened this issue Jul 6, 2020 · 34 comments
Open

[0.2.27] behavior change in EMPTY management #75

remicollet opened this issue Jul 6, 2020 · 34 comments

Comments

@remicollet
Copy link
Contributor

remicollet commented Jul 6, 2020

Looks like the change in 0.2.27 creates a behavior change resulting in infinite loop.

Found trying to run https://pecl.php.net/package/xlswriter test suite against 0.2.27 (was ok against 0.2.26)

2 tests are failing with process timeout

@viest for information, if you can also have a look

@brechtsanders
Copy link
Owner

Can you tell me what these tests were exactly and how I can reproduce the problem?
On which OS an with which zip library was this?
Note: changes between 0.2.26 and 0.2.27 were only in the reader.

@remicollet
Copy link
Contributor Author

Can you tell me what these tests were exactly and how I can reproduce the problem?

To reproduce, from https://github.com/viest/php-ext-xlswriter (with php 7 devel installed)

$ phpize
$ ./configure --with-libxlsxio=/path/to/libxlsio/install/dir    --enable-reader 
$ make
$ make test

Failing test are

  • tests/open_xlsx_get_data_skip_empty.phpt
  • tests/open_xlsx_next_row_skip_empty.phpt

On which OS an with which zip library was this?

On Fedora 31, using libzip 1.7.1

@brechtsanders
Copy link
Owner

I did not write the PHP library, so I need to reproduce this in C to get to the bottom of this.
Can you send me the tutorial.xlsx file used in the tests? I couldn't find it right away.

@remicollet
Copy link
Contributor Author

tutorial.xlsx

@brechtsanders
Copy link
Owner

Thanks. Now I can reproduce the issue when using XLSXIOREAD_SKIP_EMPTY_CELLS.

@remicollet
Copy link
Contributor Author

Great...

If you have a fix, I can try it before a new release.

@brechtsanders
Copy link
Owner

Okay, I will let you know when I committed new code.

@brechtsanders
Copy link
Owner

More work than I thought. Not finished yet, but if you like you can already test master if it no longer crashes/hangs.

@remicollet
Copy link
Contributor Author

Using a6ab8e1
PHP extension test suite passes again.

@brechtsanders
Copy link
Owner

Final issues resolved. It took me a while to make sure all combinations of XLSXIOREAD_SKIP_EMPTY_* worked well.
If it still passes I will release the changes.

@remicollet
Copy link
Contributor Author

Looks like f49a3d1 breaks something else

TEST 67/84 [tests/open_xlsx_next_row_with_data_type_date_array_index.phpt]
========DIFF========
007+ array(2) {
008+   [0]=>
009+   string(0) ""
010+   [1]=>
011+   string(0) ""
007- array(0) {
========DONE========

@remicollet
Copy link
Contributor Author

Test file used by this test
tutorial.xlsx

@brechtsanders
Copy link
Owner

Can't reproduce in C. What exactly is being tested here?

@remicollet
Copy link
Contributor Author

remicollet commented Jul 6, 2020

Can't reproduce in C. What exactly is being tested here?

Not clear to me

@viest can you please help on this ?

IIUC, second row have no value, but 2 empty rows are returned (xlsxioread_sheet_next_cell)

@brechtsanders
Copy link
Owner

To clarify, this is the correct behavior.
XLSX I/O assumes the Excel files are in a table format containing header rows.
So when nothing is skipped an empty line will contain as many empty cells as there are cells in the first row.

@remicollet
Copy link
Contributor Author

To clarify, this is the correct behavior.

Ok, so fixing the tests of the ext side
viest/php-ext-xlswriter#283

@viest
Copy link
Contributor

viest commented Jul 8, 2020

When nothing is skipped, the empty row will contain as many empty cells as the first row.

This seems unreasonable, there is nothing in worksheet.xml, so you should not add cells.

@brechtsanders
Copy link
Owner

The basic idea is to read tables from Excel files.
The end-user doesn't care much if something is there or not if the cell is part of the table.
Maybe another option can be added to ignore the number of columns defined on the first row, but so far nobody has ever resuested this.

@remicollet
Copy link
Contributor Author

but so far nobody has ever resuested this.

Perhaps because it was the old behavior with version <= 0.2.27 ?
;)

@viest
Copy link
Contributor

viest commented Jul 8, 2020

In the sheet1.xml file, the second row does not declare any cells, so it should not return the same number of empty cells as the first row.

If you ignore the sheet.xml file and insert some dummy empty cells, the expenditure in terms of traversal performance will increase.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
           xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
    <dimension ref="A1:E3"/>
    <sheetViews>
        <sheetView tabSelected="1" workbookViewId="0"/>
    </sheetViews>
    <sheetFormatPr defaultRowHeight="15"/>
    <sheetData>
        <row r="1" spans="1:5">
            <c r="B1" t="s">
                <v>0</v>
            </c>
        </row>
        <row r="3" spans="1:5">
            <c r="A3" t="s">
                <v>1</v>
            </c>
            <c r="E3" s="1">
                <v>43726.6259375</v>
            </c>
        </row>
    </sheetData>
    <pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3"/>
</worksheet>

@brechtsanders
Copy link
Owner

Let me clarify a bit more the library was designed to read tables, which means the have header rows and then data below them.

+------+------+
| colA | colB |
+------+------+
| data | data |
+------+------+
| data | data |
+------+------+

In this case the table would look like:

+---+---+
|   | 0 |
+---+---+
|   |   |
+---+---+
| 1 |   |  (with some extra data outside the table)
+---+---+

For the behaviour you're expecting an extra flag could be created to ignore the header line.

@viest
Copy link
Contributor

viest commented Jul 8, 2020

If the original idea was to return an equal number of cells based on the table header, you might also need a configuration that allows users to customize the row where the table header is located. In the actual environment, you may encounter many problems. In many cases, the first row is not the table header.

And this modification will destroy the existing application, which needs to be evaluated.

@brechtsanders
Copy link
Owner

Okay, so in your opinion, would it be better to require a new flag to get the table behaviour and to default to not using the table behaviour?.
Adding a function to skip the first N lines has crossed my mind too

@remicollet
Copy link
Contributor Author

Okay, so in your opinion, would it be better to require a new flag to get the table behaviour and to default to not using the table behaviour?.

IMHO: yes, as this will restore old behavior, and make the "table" one new and optional (avoid BC break)

@viest
Copy link
Contributor

viest commented Jul 8, 2020

Yes, keeping the API single responsibility helps to achieve more valuable functions, such as the function of php-ext-xlswriter: skip N lines.

@remicollet
Copy link
Contributor Author

To be clear about the change

  • Until version 0.2.26, xlsxioread_sheet_last_column_index return 0 for not existing cell
  • Since version 0.2.28, it returns 1, 2...

xlsxioread_sheet_next_cell behavior is the same

And the PHP extension was used to ignore cell with 0 index.

@brechtsanders
Copy link
Owner

For now the table behaviour (determining the number of columns based on header row) is only active when using the XLSXIOREAD_SKIP_EXTRA_CELLS flag. This is now in release 0.2.29.

@remicollet
Copy link
Contributor Author

IMHO everything looks consistent

XLSXIOREAD_SKIP_NONE

	1
	  A: 
	  B: Cost
	2
	3
	  A: viest
	  B: 
	  C: 
	  D: 
	  E: 43726.6259375

XLSXIOREAD_SKIP_EMPTY_ROWS

	1
	  A: 
	  B: Cost
	2
	  A: viest
	  B: 
	  C: 
	  D: 
	  E: 43726.6259375

XLSXIOREAD_SKIP_EMPTY_CELLS

	1
	  B: Cost
	2
	3
	  A: viest
	  E: 43726.6259375

XLSXIOREAD_SKIP_ALL_EMPTY

	1
	  B: Cost
	2
	  A: viest
	  E: 43726.6259375

XLSXIOREAD_SKIP_EXTRA_CELLS

	1
	  A: 
	  B: Cost
	2
	  A: 
	  B: 
	3
	  A: viest
	  B: 

Notice: xlsxioread_sheet_last_row_index when empty row 2 is skipped, shoudn't row 3 index be preserved (return 2 for now) as xlsxioread_sheet_last_column_index do for coloumn index ?

@brechtsanders
Copy link
Owner

xlsxioread_sheet_last_row_index is the last row seen by the parser, but the skipping of an empty row is done when reading the first cell of a row. So before reading that first cell xlsxioread_sheet_last_row_index will still show you the index of the last row that wasn't empty.

@viest
Copy link
Contributor

viest commented Jul 11, 2020

It's really bad, skip does not mean that he does not exist, so the index also needs to be updated.

@brechtsanders
Copy link
Owner

As the design is to support streaming data the engine doesn't read ahead, so it doesn't know what it's going to skip.
The value xlsxioread_sheet_last_row_index will be correct if you check it when the first cell of the row has been read.

@remicollet
Copy link
Contributor Author

remicollet commented Jul 12, 2020

The value xlsxioread_sheet_last_row_index will be correct if you check it when the first cell of the row has been read.

OK, But for now,

No skip:

	  1 A: 
	  1 B: Cost
	  3 A: viest
	  3 B: 
	  3 C: 
	  3 D: 
	  3 E: 43726.6259375

Skipping all

	  1 B: Cost
	  2 A: viest
	  2 E: 43726.6259375

When we expect

	  1 B: Cost
	  3 A: viest
	  3 E: 43726.6259375

The value xlsxioread_sheet_last_row_index will be correct if you check it when the first cell of the row has been read.

Using attached test code
readxls.txt

@brechtsanders brechtsanders reopened this Jul 12, 2020
@myk002
Copy link

myk002 commented Jul 13, 2020

I'm actually really happy to find this bug -- I'm hoping to use this library to read spreadsheets that don't have headers at all. The data will represent physical layouts, with codes in each cell to indicate what should be placed at that location. The first line will contain only one cell with a description string. It looks like the behavior I'm looking for will be supported with either SKIP_NONE or SKIP_EMPTY_CELLS.

The text in the README.md might need updating, though:

intended to process .xlsx files as a data table, which assumes the following:
  assumes the first row contains header names
  assumes the next rows contain values in the same columns as where the header names are supplied

@remicollet
Copy link
Contributor Author

@brechtsanders any idea about to fix this ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants