Ingest: Provide more robust ingest for Excel and CSV #585
Comments
@posixeleni Any update on this?
@eaquigley Will need to review the documentation mentioned above to see if it can help. @landreev have you had any luck with this?
Will need to move this to another milestone.
@kcondon @landreev Found some best practices on how we can ingest Excel files (https://oit.utk.edu/research/documentation/Documents/HowToUseExcelForDataEntry.pdf), but I'm not sure whether these would be applied across the board by all researchers, or how exactly we would handle them in Dataverse:
@eaquigley Assigning to @scolapasta to see if this is something we can schedule for 4.0 or if it needs to be moved to 4.1.
Probably best to just move this to 4.1, but first checking with @landreev if there is any low-hanging fruit here.
@bencomp I'm not sure why we invented our own way of parsing CSV files within the Dataverse code base. I agree with you that if we can leverage an open source package that already does this job, it would probably be better. Maybe there's some Dataverse-specific stuff in our CSV parsing. I don't know. @landreev would. Yesterday @raprasad was showing us how Python's pandas can parse Excel files: https://github.com/IQSS/miniverse/blob/729a8d1cb0bededbcc9882e51ccf5b874b1a178e/dv_apps/datafiles/tabular_previewer.py#L107 There's also a suggestion to have ingest run on a separate service rather than as part of the Dataverse monolith: #2331.
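For reference, here is a minimal sketch of that pandas-based approach. The file name is a placeholder, and this is only an illustration of the idea, not the miniverse code:

```python
# Minimal sketch: read an Excel sheet with pandas and inspect the inferred
# column types, roughly the approach used in the tabular_previewer linked above.
# "example.xlsx" is just a placeholder file name.
import pandas as pd

df = pd.read_excel("example.xlsx", sheet_name=0)  # first sheet
print(df.shape)    # (rows, columns)
print(df.dtypes)   # pandas-inferred types: int64, float64, object, datetime64, ...
print(df.head())   # preview of the first few rows
```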
There was some historical reason why we insisted on the one-line-per-observation requirement, but it doesn't appear to be relevant anymore.
I'm opening a new issue for some concrete incremental improvements to the CSV ingest, to be estimated and included in a sprint in the short term. (This one is just too generic.)
#3767 is a child issue I created just for rewriting the CSV plugin to use more standard (and permissive) parsing.
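To illustrate what "standard (and permissive) parsing" buys: quoted fields can contain commas and even embedded newlines, which is why the strict one-line-per-observation rule mentioned above is no longer needed. A small sketch with Python's standard csv module (the actual plugin is Java, so this only illustrates the behavior, not the Dataverse implementation):

```python
# Sketch of RFC 4180-style parsing with Python's standard csv module.
# Quoted fields may contain delimiters and newlines and still parse as one record.
import csv
import io

sample = 'id,name,notes\n1,"Smith, Jane","first line\nsecond line"\n2,Lee,plain\n'

rows = list(csv.reader(io.StringIO(sample)))
print(rows[1])  # ['1', 'Smith, Jane', 'first line\nsecond line']
```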
Does #3767 close this?
I have two additions on this -- there are a couple of tickets on Excel ingest; this seemed the most appropriate one, but please redirect me if I missed something.
@adam3smith thanks, these are related:
Great. #2720 addresses what I want with point 1, but while I agree with Alex in #4000, my point about Excel is different: the original format is completely lost when uploading Excel. I have no way to get out the Excel file I uploaded after ingest (at least I'm not finding it). That may actually be a bug?
Hi @adam3smith. The second point sounds like a bug. Could you link to a dataset with an ingested Excel file where we're not able to download the original format? This quick search returns a few ingested Excel files in Harvard Dataverse (4.8.4), and I'm able to download the original format for them: https://dataverse.harvard.edu/dataverse/harvard?q=fileName%3Aexcel&fq0=fileTypeGroupFacet%3A%22tabulardata%22. I just uploaded an .xlsx file to Harvard Dataverse; it ingested, and I was able to download the original file format.
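For anyone who wants to double-check this outside the UI, here is a rough sketch using the Data Access API's `format=original` parameter; the server URL and file ID below are placeholders, and using Python here is just for illustration:

```python
# Sketch: fetch the original (pre-ingest) file for an ingested tabular file
# via the Data Access API. SERVER and FILE_ID are placeholders.
import requests

SERVER = "https://demo.dataverse.org"
FILE_ID = 12345  # database ID of the ingested file

r = requests.get(f"{SERVER}/api/access/datafile/{FILE_ID}", params={"format": "original"})
r.raise_for_status()
with open("original.xlsx", "wb") as f:
    f.write(r.content)
```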
Sorry for the false alarm -- this works correctly for me not just in demo.dataverse but also in our dev instance on 4.8. It does seem broken in the way I describe, though, in our beta environment running 4.6.2. So it may have gotten fixed at some point?
@adam3smith there were a lot of ingest improvements included in our 4.8 release, which could account for the improved behavior between demo (4.8.4) and your 4.6.2 instance.
Since this issue was opened, the CSV side of it has been addressed: the CSV ingest plugin was much improved.
@landreev yeah, I agree that this issue should probably be closed. It's one of our oldest open issues and doesn't fit into our modern "small chunks" style. #3382 ("Excel ingest can't accept more than 26 (Z) columns") is an example of a small chunk, and @qqmyers made pull request #5891 for it yesterday (thank you!! 🎉). Then @landreev opened #5896 about adding tests for Excel. This is perfect. Small chunks move across our (new) board at https://github.com/orgs/IQSS/projects/2 much faster than large projects. @adam3smith you seem to be fairly engaged in this issue. Are you willing or able to create a "small chunk" issue or two for some of your ideas above? Everyone else is very welcome to create new issues as well, of course! The idea is that the issues should be relatively small and targeted. Incremental improvement. Small chunks.
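For context on the 26-column limit in #3382: the usual pitfall is treating Excel column labels as plain base-26 instead of the bijective "A..Z, AA, AB, ..." scheme. A generic sketch of the correct conversion (not taken from PR #5891, just an illustration of the idea):

```python
# Generic sketch of Excel-style column labels past "Z" (AA, AB, ...),
# i.e. bijective base-26. Illustrates the pitfall behind issue #3382.
def column_label(index: int) -> str:
    """0 -> 'A', 25 -> 'Z', 26 -> 'AA', 27 -> 'AB', ..."""
    label = ""
    index += 1  # switch to 1-based for bijective base-26
    while index > 0:
        index, rem = divmod(index - 1, 26)
        label = chr(ord("A") + rem) + label
    return label

assert column_label(0) == "A"
assert column_label(26) == "AA"
assert column_label(701) == "ZZ"
```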
@pdurbin @landreev I opened this issue originally because the initial implementation was very basic; we did not have a good idea of what set of features we should support, given the wide variety of practices. It was, at least initially, a research spike to figure that out and then become something specific. We lucked out in that there was a CSV implementation, so I agree that part is resolved. I think Leonid's suggestion of opening a new Excel-specific ticket sounds reasonable, and I'm happy it has specifics, since when I opened this there was no agreed-upon set of specifics. So: 1. open a new ticket with specifics, then 2. close this ticket ;)
Author Name: Kevin Condon (@kcondon)
Original Redmine Issue: 4014, https://redmine.hmdc.harvard.edu/issues/4014
Original Date: 2014-05-21
Original Assignee: Eleni Castro
The initial Excel ingest implementation is a proof of concept: it supports rectangular columns and rows of string and numeric types.
More functionality may be required (for example dates, missing value support, and variable labels), but it is unclear what researchers who use Excel to store their data typically expect.
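As a rough illustration of what date and missing-value handling could look like on the ingest side, here is a hedged pandas sketch; the column name and the sentinel values are assumptions, not an agreed convention:

```python
# Sketch: let pandas infer dates and normalize common missing-value markers
# when reading an Excel sheet. "visit_date" and the na_values list are
# examples only, not an agreed Dataverse convention.
import pandas as pd

df = pd.read_excel(
    "survey.xlsx",
    na_values=["NA", "N/A", "-999", ""],   # treat these cells as missing
    parse_dates=["visit_date"],            # parse this column as dates
)
print(df["visit_date"].dtype)   # datetime64[ns] if parsing succeeded
print(df.isna().sum())          # per-column count of missing values
```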
This ticket is a placeholder to investigate a concrete set of requirements for a more robust Excel ingest.
As a side note, there are a number of "How to use Excel for Research Data" guides posted by various universities out there that may answer some questions and provide a starting point.