
Ingest: Provide more robust ingest for Excel and CSV #585

Closed
eaquigley opened this issue Jul 9, 2014 · 39 comments
Labels
Feature: File Upload & Handling · Type: Feature (a feature request) · User Role: Depositor (creates datasets, uploads data, etc.)

Comments

@eaquigley
Contributor


Author Name: Kevin Condon (@kcondon)
Original Redmine Issue: 4014, https://redmine.hmdc.harvard.edu/issues/4014
Original Date: 2014-05-21
Original Assignee: Eleni Castro


The initial Excel ingest implementation is a proof of concept: it supports rectangular columns and rows of string and numeric types.

More functionality may be required (for example, date handling, missing value support, and variable labels), but it is unclear what is typically expected from researchers who use Excel to store their data.

This ticket is a placeholder to investigate a concrete set of requirements for a more robust Excel ingest.

As a side note, there are a number of "How to use Excel for Research Data" guides posted by various universities out there that may answer some questions and provide a starting point.

@eaquigley eaquigley added this to the Dataverse 4.0: In Review milestone Jul 9, 2014
@scolapasta scolapasta modified the milestones: Beta 3 - Dataverse 4.0, In Review - Dataverse 4.0 Jul 15, 2014
@eaquigley
Contributor Author

@posixeleni Any update on this?

@posixeleni
Contributor

@eaquigley Will need to review the documentation mentioned above to see if it can help.

@landreev have you had any luck with this?

@posixeleni
Contributor

Will need to move this to another milestone.

@posixeleni posixeleni modified the milestones: Dataverse 4.0: Final, Beta 8 - Dataverse 4.0, Beta 10 - Dataverse 4.0 Oct 15, 2014
@posixeleni posixeleni assigned scolapasta and unassigned posixeleni Oct 15, 2014
@posixeleni
Contributor

@kcondon @landreev Found some best practices on how we can ingest Excel files (https://oit.utk.edu/research/documentation/Documents/HowToUseExcelForDataEntry.pdf), but I'm not sure whether all researchers would apply these across the board, or how exactly we would handle them in Dataverse (see the sketch after this list for one possible interpretation):

  • All your data should be in a single worksheet of a single file. [Eleni: what would we do if someone sent us an Excel workbook with multiple worksheets? Only load the first one and put in our documentation that this is what we do?]
  • Enter variable names in the first row of the spreadsheet.
  • Variable names cannot contain spaces, but may use the underscore character. [Eleni: What would we do if variable (header) names did contain spaces? Replace spaces with underscores?]
  • No blank rows should appear in the data. [Eleni: Would we programmatically remove all blank rows?]
  • For missing values, leave the cell blank. Although SPSS and SAS use a period to represent a missing value, if you actually type a period in Excel, stat packages will read the column as character data, so you won't, for example, be able to calculate even the mean of a column. [Eleni: can we make sure we record these as N/A values?]
  • You can enter dates with slashes (6/13/2003) and times with colons (12:15 AM). [Eleni: Is this something our system can look for and store as dates and times?]
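
For illustration, here is a rough pre-processing sketch of how the rules above could be applied with pandas. This is purely hypothetical: the Dataverse ingest code is Java, and the function name, file path, and threshold below are made up for the example.

```python
# Hypothetical pre-processing sketch (pandas); not the Dataverse implementation.
import pandas as pd

def normalize_excel(path, sheet=0):
    # Rule: all data in a single worksheet -- read only one (the first by default).
    df = pd.read_excel(path, sheet_name=sheet)

    # Rule: variable names come from the first row; no spaces, underscores allowed.
    df.columns = [str(c).strip().replace(" ", "_") for c in df.columns]

    # Rule: no blank rows -- drop rows where every cell is empty.
    df = df.dropna(how="all")

    # Rule: missing values are blank cells -- pandas already reads them as NaN,
    # which can be mapped to an explicit missing/N/A value downstream.

    # Rule: dates entered as 6/13/2003 or 12:15 AM -- try to detect and convert.
    for col in df.columns:
        if df[col].dtype == object:
            parsed = pd.to_datetime(df[col], errors="coerce")
            if parsed.notna().mean() > 0.9:  # keep only if most values parse
                df[col] = parsed
    return df
```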

@eaquigley Assigning to @scolapasta to see if this is something we can schedule for 4.0 or if it needs to be moved to 4.1.

@scolapasta scolapasta modified the milestones: In Review - Dataverse 4.0, Beta 10 - Dataverse 4.0 Dec 3, 2014
@scolapasta
Contributor

Probably best to just move this to 4.1, but first checking with @landreev if there is any low-hanging fruit here.

@scolapasta scolapasta assigned landreev and unassigned scolapasta Jan 6, 2015
@scolapasta scolapasta modified the milestones: Beta 11 - Dataverse 4.0, In Review - Dataverse 4.0 Jan 6, 2015
@scolapasta scolapasta modified the milestones: Beta 11 - Dataverse 4.0, Dataverse 4.0: Final Jan 23, 2015
@pdurbin
Member

pdurbin commented Apr 5, 2017

@bencomp I'm not sure why we invented our own way of parsing CSV files within the Dataverse code base. I agree with you that if we can leverage an open source package that already does this job, it would probably be better. Maybe there's some Dataverse-specific stuff in our CSV parsing. I don't know. @landreev would.

Yesterday @raprasad was showing us how Python's Pandas can parse Excel files: https://github.com/IQSS/miniverse/blob/729a8d1cb0bededbcc9882e51ccf5b874b1a178e/dv_apps/datafiles/tabular_previewer.py#L107
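
For reference, a minimal sketch of that general pandas approach (this is not the miniverse code itself, and "example.xlsx" is just a placeholder filename):

```python
# Parse a spreadsheet with pandas and inspect what it inferred.
import pandas as pd

df = pd.read_excel("example.xlsx")  # reads the first worksheet by default
print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # the type pandas inferred for each column
print(df.head())   # preview of the first few rows
```

Type inference, header handling, and missing values come for free here, which is presumably part of the appeal of the pandas approach linked above.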

There's also a suggestion to have ingest run on a separate service rather than as part of the Dataverse monolith: #2331.

@landreev
Contributor

There was some historical reason why we insisted on the one-line-per-observation requirement, but it doesn't appear to be relevant anymore.
Also, when that original implementation was put together, we did not anticipate that anyone would use CSV ingest for real, full text - with rich punctuation, extra tabs, commas, and newlines. Since this appears to be a requirement now, yes, switching to an existing third-party parser that takes care of all these things seems like the right thing to do.
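
To make the point concrete, here is what a standard, permissive CSV parser handles out of the box. Python's built-in csv module is used purely as an illustration; the actual Dataverse plugin is Java, and the real rewrite is tracked in #3767, mentioned below.

```python
# Quoted fields with embedded commas, escaped quotes, and newlines are
# exactly the cases a hand-rolled split-on-delimiter parser gets wrong.
import csv
import io

raw = 'id,comment\n1,"Text with a comma, a ""quote"", and\na newline"\n2,plain\n'

for row in csv.reader(io.StringIO(raw)):
    print(row)
# ['id', 'comment']
# ['1', 'Text with a comma, a "quote", and\na newline']
# ['2', 'plain']
```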

@landreev
Contributor

I'm opening a new issue for some concrete, incremental improvements to the CSV ingest, to be estimated and included in a sprint in the short term. (This one is just too generic.)

@landreev
Contributor

#3767 - a child issue I created just for rewriting the CSV plugin to use more standard (and permissive) parsing.

@oscardssmith
Contributor

Does #3767 close this?

@adam3smith
Contributor

I have two additions on this -- there are a couple of tickets on Excel ingest; this seemed the most appropriate one, but please redirect me if I missed something:

  1. .tab vs. .tsv: I like that Dataverse normalizes .xlsx files, and I think tab-separated values are a good, robust choice, but why did you choose the .tab file extension (which is ambiguous and very rarely used) over the much more common .tsv?
  2. The download dialog for ingested tabular files has an option to download the "original" file. I think that should be the original .xlsx file, as uploaded. Currently it's just the .tab file again.

@pdurbin
Member

pdurbin commented Mar 7, 2018

@adam3smith thanks, these are related: #2720 and #4000.

@adam3smith
Contributor

Great. #2720 addresses what I want with 1., but while I agree with Alex in #4000, my point about Excel is different: the original format is completely lost when uploading Excel. I have no way to retrieve the Excel file I uploaded after ingest (at least I'm not finding it). That may actually be a bug?

@jggautier
Contributor

jggautier commented Mar 7, 2018

Hi @adam3smith. The second point sounds like a bug. Could you link to a dataset with an ingested Excel file where we're not able to download the original format?

This quick search returns a few ingested Excel files in Harvard Dataverse (4.8.4), but I'm able to download the original format for them: https://dataverse.harvard.edu/dataverse/harvard?q=fileName%3Aexcel&fq0=fileTypeGroupFacet%3A%22tabulardata%22.

I just uploaded an .xlsx file to Harvard Dataverse; it ingested successfully, and I was able to download the original file format.

[Screenshot: screen shot 2018-03-07 at 12 14 42 pm]

@adam3smith
Contributor

Sorry for the false alarm -- this works correctly for me not just on demo.dataverse but also in our dev instance on 4.8. It does seem broken in the way I describe, though, in our beta environment running 4.6.2. So it may have gotten fixed at some point?

@mheppler
Contributor

mheppler commented Mar 7, 2018

@adam3smith there were a lot of ingest improvements included in our 4.8 release, which could account for improved functionality between demo (4.8.4) and your 4.6.2.

@landreev
Contributor

Since this issue was opened, the CSV ingest plugin has been much improved, so that part has been addressed.
I propose we close this issue but open a new, separate issue for the Excel plugin, and try to address it ASAP (there is a bunch of open Excel issues for various things the current plugin does wrong).

@pdurbin
Member

pdurbin commented May 31, 2019

@landreev yeah, I agree that this issue should probably be closed. It's one of our oldest open issues and doesn't fit into our modern "small chunks" style. #3382 ("Excel ingest can't accept more than 26 (Z) columns") is an example of a small chunk, and @qqmyers made pull request #5891 for it yesterday. (Thank you!! 🎉 ) Then @landreev opened #5896 about adding tests for Excel. This is perfect. Small chunks move across our (new) board at https://github.com/orgs/IQSS/projects/2 much faster than large projects.
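
As a side note on the 26-column limit mentioned in #3382: spreadsheet column labels continue past Z as AA, AB, and so on, so handling more than 26 columns needs a base-26-style labeling scheme rather than a single letter. A tiny illustrative generator (hypothetical, not Dataverse code):

```python
def excel_column_label(n):
    """Map a 1-based column index to a spreadsheet-style label (1 -> 'A', 27 -> 'AA')."""
    label = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        label = chr(ord("A") + rem) + label
    return label

print([excel_column_label(i) for i in (1, 26, 27, 28, 52, 53)])
# ['A', 'Z', 'AA', 'AB', 'AZ', 'BA']
```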

@adam3smith you seem to be fairly engaged in this issue. Are you willing or able to create a "small chunk" issue or two for some of your ideas above?

Everyone else is very welcome to create new issues as well, of course! The idea is that the issues should be relatively small and targeted. Incremental improvement. Small chunks.

@kcondon
Contributor

kcondon commented May 31, 2019

@pdurbin @landreev I opened this issue originally because the initial implementation was very basic: we did not have a good idea of what set of features we should support, given the wide variety of practices. It was, at least initially, a research spike to figure that out and then turn into something specific. We lucked out that there was a CSV implementation, so I agree that part is resolved. I think Leonid's suggestion of opening a new Excel-specific issue sounds reasonable, and I'm happy it now has specifics, since when I opened this there was not an agreed-upon set of them. So: 1. open a new ticket with specifics, then 2. close this ticket ;)

@adam3smith
Contributor

@pdurbin - I just had two issues here: one was resolved, and the other is fully captured by #2720, so I'm all good with closing this.
