Ingest: Provide more robust ingest for Excel and CSV #585
Comments
@posixeleni Any update on this?
@eaquigley Will need to review the documentation mentioned above to see if it can help. @landreev have you had any luck with this?
Will need to move this to another milestone.
@kcondon @landreev Found some best practices on how we can ingest Excel files (https://oit.utk.edu/research/documentation/Documents/HowToUseExcelForDataEntry.pdf), but I'm not sure whether these would be applied across the board by all researchers, or how exactly we would handle them in Dataverse:
@eaquigley Assigning to @scolapasta to see if this is something we can schedule for 4.0 or if it needs to be moved to 4.1.
Probably best to just move this to 4.1, but first checking with @landreev if there is any low-hanging fruit here.
@bencomp I'm not sure why we invented our own way of parsing CSV files within the Dataverse code base. I agree with you that if we can leverage an open source package that already does this job, it would probably be better. Maybe there's some Dataverse-specific stuff in our CSV parsing. I don't know. @landreev would. Yesterday @raprasad was showing us how Python's pandas can parse Excel files: https://github.com/IQSS/miniverse/blob/729a8d1cb0bededbcc9882e51ccf5b874b1a178e/dv_apps/datafiles/tabular_previewer.py#L107 There's also a suggestion to have ingest run on a separate service rather than as part of the Dataverse monolith: #2331.
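For reference, here is a minimal sketch of that pandas-based approach. The file name is a placeholder, and this is only an illustration of the idea, not the miniverse code:

```python
# Minimal sketch: read an Excel sheet with pandas and inspect the inferred
# column types, roughly the approach used in the tabular_previewer linked above.
# "example.xlsx" is just a placeholder file name.
import pandas as pd

df = pd.read_excel("example.xlsx", sheet_name=0)  # first sheet
print(df.shape)    # (rows, columns)
print(df.dtypes)   # pandas-inferred types: int64, float64, object, datetime64, ...
print(df.head())   # preview of the first few rows
```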
There was some historical reason why we insisted on the one-line-per-observation requirement, but it doesn't appear to be relevant anymore.
I'm opening a new issue for some concrete incremental improvements to the CSV ingest, to be estimated and included in a sprint in the short term. (This one is just too generic.)
#3767 is a child issue I created just for rewriting the CSV plugin to use more standard (and permissive) parsing.
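To illustrate what "standard (and permissive) parsing" buys: quoted fields can contain commas and even embedded newlines, which is why the strict one-line-per-observation rule mentioned above is no longer needed. A small sketch with Python's standard csv module (the actual plugin is Java, so this only illustrates the behavior, not the Dataverse implementation):

```python
# Sketch of RFC 4180-style parsing with Python's standard csv module.
# Quoted fields may contain delimiters and newlines and still parse as one record.
import csv
import io

sample = 'id,name,notes\n1,"Smith, Jane","first line\nsecond line"\n2,Lee,plain\n'

rows = list(csv.reader(io.StringIO(sample)))
print(rows[1])  # ['1', 'Smith, Jane', 'first line\nsecond line']
```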
Does #3767 close this?
I have two additions on this -- there are a couple of tickets on Excel ingest; this seemed the most appropriate one, but please redirect me if I missed something.
@adam3smith thanks, these are related:
Great. #2720 addresses what I want with point 1, but while I agree with Alex in #4000, my point about Excel is different: the original format is completely lost when uploading Excel. I have no way to get out the Excel file I uploaded after ingest (at least I'm not finding it). That may actually be a bug?
Hi @adam3smith. The second point sounds like a bug. Could you link to a dataset with an ingested Excel file where we're not able to download the original format? This quick search returns a few ingested Excel files in Harvard Dataverse (4.8.4), and I'm able to download the original format for them: https://dataverse.harvard.edu/dataverse/harvard?q=fileName%3Aexcel&fq0=fileTypeGroupFacet%3A%22tabulardata%22. I just uploaded an .xlsx file to Harvard Dataverse; it ingested, and I was able to download the original file format.
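For anyone who wants to double-check this outside the UI, here is a rough sketch using the Data Access API's `format=original` parameter; the server URL and file ID below are placeholders, and using Python here is just for illustration:

```python
# Sketch: fetch the original (pre-ingest) file for an ingested tabular file
# via the Data Access API. SERVER and FILE_ID are placeholders.
import requests

SERVER = "https://demo.dataverse.org"
FILE_ID = 12345  # database ID of the ingested file

r = requests.get(f"{SERVER}/api/access/datafile/{FILE_ID}", params={"format": "original"})
r.raise_for_status()
with open("original.xlsx", "wb") as f:
    f.write(r.content)
```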
Sorry for the false alarm -- this works correctly for me not just in demo.dataverse but also in our dev instance on 4.8. It does seem broken in the way I describe, though, in our beta environment running 4.6.2. So it may have gotten fixed at some point?
@adam3smith there were a lot of ingest improvements included in our 4.8 release, which could account for the improved behavior between demo (4.8.4) and your 4.6.2 instance.
Since this issue was opened, the CSV side of it has been addressed: the CSV ingest plugin was much improved.
@landreev yeah, I agree that this issue should probably be closed. It's one of our oldest open issues and doesn't fit into our modern "small chunks" style. #3382 ("Excel ingest can't accept more than 26 (Z) columns") is an example of a small chunk, and @qqmyers made pull request #5891 for it yesterday (thank you!! 🎉). Then @landreev opened #5896 about adding tests for Excel. This is perfect. Small chunks move across our (new) board at https://github.com/orgs/IQSS/projects/2 much faster than large projects. @adam3smith you seem to be fairly engaged in this issue. Are you willing or able to create a "small chunk" issue or two for some of your ideas above? Everyone else is very welcome to create new issues as well, of course! The idea is that the issues should be relatively small and targeted. Incremental improvement. Small chunks.
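For context on the 26-column limit in #3382: the usual pitfall is treating Excel column labels as plain base-26 instead of the bijective "A..Z, AA, AB, ..." scheme. A generic sketch of the correct conversion (not taken from PR #5891, just an illustration of the idea):

```python
# Generic sketch of Excel-style column labels past "Z" (AA, AB, ...),
# i.e. bijective base-26. Illustrates the pitfall behind issue #3382.
def column_label(index: int) -> str:
    """0 -> 'A', 25 -> 'Z', 26 -> 'AA', 27 -> 'AB', ..."""
    label = ""
    index += 1  # switch to 1-based for bijective base-26
    while index > 0:
        index, rem = divmod(index - 1, 26)
        label = chr(ord("A") + rem) + label
    return label

assert column_label(0) == "A"
assert column_label(26) == "AA"
assert column_label(701) == "ZZ"
```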
@pdurbin @landreev I opened this issue originally because the initial implementation was very basic; we did not have a good idea of what set of features we should support, given the wide variety of practices. It was, at least initially, a research spike to figure that out and then become something specific. We lucked out in that there was a CSV implementation, so I agree that part is resolved. I think Leonid's suggestion of opening a new Excel-specific ticket sounds reasonable, and I'm happy it has specifics, since when I opened this there was no agreed-upon set of specifics. So: 1. open a new ticket with specifics, then 2. close this ticket ;)
Author Name: Kevin Condon (@kcondon)
Original Redmine Issue: 4014, https://redmine.hmdc.harvard.edu/issues/4014
Original Date: 2014-05-21
Original Assignee: Eleni Castro
The initial Excel ingest implementation is a proof of concept: it supports rectangular columns and rows of string and numeric types.
More functionality may be required (for example dates, missing value support, and variable labels), but it is unclear what researchers who use Excel to store their data typically expect.
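As a rough illustration of what date and missing-value handling could look like on the ingest side, here is a hedged pandas sketch; the column name and the sentinel values are assumptions, not an agreed convention:

```python
# Sketch: let pandas infer dates and normalize common missing-value markers
# when reading an Excel sheet. "visit_date" and the na_values list are
# examples only, not an agreed Dataverse convention.
import pandas as pd

df = pd.read_excel(
    "survey.xlsx",
    na_values=["NA", "N/A", "-999", ""],   # treat these cells as missing
    parse_dates=["visit_date"],            # parse this column as dates
)
print(df["visit_date"].dtype)   # datetime64[ns] if parsing succeeded
print(df.isna().sum())          # per-column count of missing values
```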
This ticket is a placeholder to investigate a concrete set of requirements for a more robust Excel ingest.
As a side note, there are a number of "How to use Excel for Research Data" guides posted by various universities out there that may answer some questions and provide a starting point.