Ingest: line breaks in Excel cause ingest to fail #6874

Open
rdemgenski opened this issue Apr 29, 2020 · 11 comments
Labels
Feature: File Upload & Handling Type: Bug a defect User Role: Depositor Creates datasets, uploads data, etc.

Comments

@rdemgenski

rdemgenski commented Apr 29, 2020

Hello, first time here - working with @adam3smith at QDR

Line breaks in Excel lead to a reading mismatch and cause ingest to fail:

image

Issue replicated on demo (https://demo.dataverse.org/dataset.xhtml?persistentId=doi%3A10.70122%2FFK2%2FOIWIG6) with two files that are identical except for the line break. The file with a line break at the end of a cell's text fails to ingest; the other ingests successfully.

image

Converting the xlsx file to csv and opening it in a text editor shows that line breaks inside xlsx cells become line breaks in the text output, where they do not belong (i.e. they start new rows). This is likely the root issue.

image
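The row-splitting effect described above can be reproduced outside Dataverse. The following is only a minimal sketch in Python, not the Dataverse ingest code: the sample rows and the helper names (`naive_tab_export`, `quoted_tab_export`, `field_counts_by_line`, `parse`) are made up for illustration. It assumes the failure comes from writing a cell's line break unquoted into delimited text, so that a reader sees an extra, shorter row.

```python
import csv
import io

# Hypothetical sample data: the "comment" cell of row 1 contains an in-cell
# line break, as Excel stores it when a user presses Alt+Enter.
rows = [["id", "comment"], ["1", "first line\nsecond line"], ["2", "plain"]]


def naive_tab_export(rows):
    """Join cells with tabs and rows with newlines, with no quoting at all."""
    return "\n".join("\t".join(r) for r in rows) + "\n"


def quoted_tab_export(rows):
    """Let the csv module quote any cell that contains a line break."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()


def field_counts_by_line(text):
    """Number of tab-separated fields on each physical line of the text."""
    return [len(line.split("\t")) for line in text.splitlines()]


def parse(text):
    """Read tab-delimited text back into a list of rows."""
    return list(csv.reader(io.StringIO(text, newline=""), delimiter="\t"))


if __name__ == "__main__":
    naive = naive_tab_export(rows)
    quoted = quoted_tab_export(rows)
    # The unquoted export has one physical line with a different field count.
    print(field_counts_by_line(naive))
    # The quoted export round-trips to the original rows.
    print(parse(quoted) == rows)
```

In the unquoted output the data row is split in two and the column counts no longer match the header, which would be consistent with the reading mismatch shown in the screenshots. Quoting the cell (or replacing the in-cell line break before writing) keeps the row intact.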

@djbrooke
Contributor

Welcome @rdemgenski and thanks for the detailed report. If we work on this, we'd want to consider it at the same time as #3383.

@BPeuch
Contributor

BPeuch commented Jul 7, 2020

Version: 4.20

Hello everybody,

Thank you for reporting this, @rdemgenski. We have the same problem here.

Interestingly, the MD5 checksum produced by Dataverse for my test file (which contains a line break, thus leading to the error message and the failure to convert the file to .tab format) is the same hash that I get with another MD5 tool, http://onlinemd5.com/. So the ingest seems to be actually successful, apart from the change documented by @rdemgenski.
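For anyone who wants to repeat the checksum comparison locally rather than with an online tool, here is a minimal Python sketch. The function name `md5_of_file` and the file name in the usage line are hypothetical. Note that a matching MD5 only shows that the stored bytes are identical to the uploaded bytes; it does not say anything about whether the conversion to .tab succeeded.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Usage would look like `md5_of_file("test_with_linebreak.xlsx")`, and the result can be compared with the MD5 value shown on the file's page in Dataverse.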

@adam3smith
Contributor

Re-reading this, is this not actually just a duplicate of #3383? I think Robert's description and MWE are a lot clearer than the original error report, but it's the same problem.

@BPeuch
Contributor

BPeuch commented Aug 9, 2021

And is it not the same as #7386, too?

@pdurbin
Member

pdurbin commented Aug 10, 2021

@BPeuch #7386 is more about a missing header.

As multiple people have pointed out above, this does seem to be a duplicate of #3383.

@BPeuch
Contributor

BPeuch commented Aug 11, 2021

Ah, my bad. Thanks for clarifying, @pdurbin.

@pdurbin
Member

pdurbin commented Oct 9, 2022

There's some recent discussion here:

@bencomp
Contributor

bencomp commented Oct 11, 2022

I wanted to suggest looking at Apache POI for parsing XLSX files, but I now see it is already in use.
However, it appears that XLSXFileReader only uses POI to get the "raw" XML and then parses the XML itself.
My impression from the spreadsheet quickstart is that it could be a lot easier to use POI's APIs for parsing the sheet (or sheets!) and cells.

(I'm not volunteering to refactor the XLSXFileReader, by the way...)

@cmbz

cmbz commented Aug 20, 2024

To focus on the most important features and bugs, we are closing issues created before 2020 (version 5.0) that are not new feature requests with the label 'Type: Feature'.

If you created this issue and you feel the team should revisit this decision, please reopen the issue and leave a comment.

@cmbz cmbz closed this as completed Aug 20, 2024
@github-project-automation github-project-automation bot moved this from 🔍 Interest to Done in Recherche Data Gouv Aug 20, 2024
@bencomp
Contributor

bencomp commented Aug 20, 2024

Just wondering: was this really completed? Because it seems you are saying it is not planned.

@cmbz

cmbz commented Aug 20, 2024

Hi @bencomp, we are closing issues created before 2020-08-18 that do not have the Type: Feature label. As a separate task, we are also reviewing bugs. I think my script caught some bugs by mistake. I'll reopen. Thanks.

Projects
Status: Papercuts (Smaller issues)
Status: Done
Development

No branches or pull requests

7 participants