File Import Batch job in support of rsync #3353
During the 2016-09-08 SBGrid Sprint Planning meeting (https://docs.google.com/document/d/1wWSdKUOGA1L7UqFsgF3aOs8_9uyjnVpsPAxk7FObOOI/edit) this issue was given an effort level of "5". The code has already been written and can be found at https://github.com/bmckinney/dataverse-canonical/tree/dcm-beta/src/main/java/edu/harvard/iq/dataverse/batch . As @djbrooke mentioned in the description, this issue does NOT cover support for file versioning when importing files. For now it's more about review of the code mentioned above, so I'm assigning this issue to myself and @scolapasta to at least get the code executing on our laptops so we can play around with it and get a deeper understanding of how it works. |
@bmckinney @djbrooke and I discussed this issue. I indicated that I plan to get on this branch and play around with the import code, executing at least the unit tests and whatnot. @bmckinney will be scheduling a meeting with @landreev and ideally myself and @scolapasta to spend 10 or so minutes discussing the code, leaving plenty of time to discuss the implications. |
@bmckinney pushed a branch that I'd like to poke around in: https://github.com/IQSS/dataverse/tree/3353-batch-job-import Bill also created some great writeups here:
In https://waffle.io/IQSS/dataverse I'm dragging this to Code Review. |
I'm working my way through https://github.com/sbgrid/sbgrid-dataverse/wiki/HOWTO:-Batch-Job-Testing#3-manual-tests on my laptop running f2fde2e, and it says to review the dataset landing page at the end. Here's how it looks: These are the surprises from my perspective:
(Screenshots: a file whose MIME type Dataverse doesn't know, one whose MIME type it does know, the parent dataverse, and the files I ran the import on.)
|
I've been testing the code by following the instructions at https://github.com/sbgrid/sbgrid-dataverse/wiki/HOWTO:-Batch-Job-Testing#3-manual-tests and I highly recommend that anyone testing or reviewing the code read that document as well as https://github.com/sbgrid/sbgrid-dataverse/wiki/INFO:-Batch-Job-Implementation . Anyway, I thought I'd go ahead and explain concretely what I've been doing, which only makes sense if you read the "manual tests" wiki page above; for anyone who'd rather script the same check, a rough sketch follows.
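Here's roughly the shape of the request involved. To be clear, the endpoint path and PID below are placeholders I'm making up for illustration; the real path, parameters, and commands are in the "HOWTO: Batch Job Testing" wiki page, so check there before running anything:

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Scanner;

public class BatchImportSmokeTest {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and PID -- see the wiki page for the real values.
        URL url = new URL("http://localhost:8080/api/batch/jobs/import"
                + "?persistentId=doi:10.5072/FK2/EXAMPLE");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        // X-Dataverse-key is Dataverse's standard API-token header.
        conn.setRequestProperty("X-Dataverse-key", System.getenv("API_TOKEN"));
        conn.setDoOutput(true);
        conn.getOutputStream().close(); // empty POST body

        System.out.println("HTTP " + conn.getResponseCode());
        InputStream in = conn.getResponseCode() < 400
                ? conn.getInputStream() : conn.getErrorStream();
        if (in != null) {
            try (Scanner s = new Scanner(in).useDelimiter("\\A")) {
                System.out.println(s.hasNext() ? s.next() : "");
            }
        }
    }
}
```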
It's very important to understand that files that are "imported" via this code do not follow the normal Dataverse rules when it comes to file handling. Specifically,
There might be other rules that are not followed that I'm not aware of, but I think these are the big ones. This shouldn't be a problem for the datasets in question hosted by SBGrid, where the data is not tabular data, Excel docs, images, PDFs, etc. For another perspective on this, please see the "General Constraints" and "Big Data Constraints" at https://github.com/sbgrid/sbgrid-dataverse/wiki/INFO:-Batch-Job-Implementation
Bugs:
Now on to code review. I spent around a day looking at the code as of 02a3c30. Answers for todos:
Code style:
Questions:
There seem to be 5 new API endpoints:
Here's how a dataset looks on my laptop: |
Great code review meeting with @bmckinney @scolapasta @djbrooke @mheppler @sekmiller @landreev @kcondon @pameyer and myself. Notes and TODOs are in this "2016-11-16 Code Review/Discussion - SBGrid" doc: https://docs.google.com/a/harvard.edu/document/d/19ug5XOzIUusAu6vqZRqfaEI0MWV5J6kqY2EtJW9R9I4/edit?usp=sharing |
Talked in Standup today, @scolapasta is going to take a final look and then move to QA. |
We discussed this in the sprint planning today. The tasks that were called out were:
|
Hey @landreev - just making sure that this small change (from Kevin's comment above) is included in the #3497 PR: "Link to dataset in batch upload email notification is not a complete URL and not resolvable." Thanks!
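For reference, a resolvable link pairs the installation's site URL with the dataset's persistent ID. A minimal sketch of the difference; the site URL and PID here are made-up example values, not the actual notification code:

```java
public class DatasetLinkExample {
    public static void main(String[] args) {
        // A resolvable link needs the scheme and host, not just a path.
        String siteUrl = "https://dataverse.example.edu"; // assumed site URL
        String persistentId = "doi:10.5072/FK2/EXAMPLE";  // assumed dataset PID
        String datasetUrl = siteUrl + "/dataset.xhtml?persistentId=" + persistentId;
        System.out.println(datasetUrl);
    }
}
``` |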
(this is going into QA very shortly; I pulled it back into dev as I'm addressing the last remaining tasks: syncing the branch up to the develop branch; muting the failing RestAssured tests, sketched below; and fixing the broken link, per Kevin's QA feedback)
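In case it's useful for review, one conventional way to mute a JUnit 4 test is the @Ignore annotation; the class and method names in this sketch are illustrative, not the actual ones in the branch:

```java
import org.junit.Ignore;
import org.junit.Test;

public class BatchImportIT {
    // Muting keeps the test in the suite but skips it, with a reason recorded.
    @Ignore("Muted pending rework of the import process; see #3353")
    @Test
    public void testFileRecordJob() {
        // RestAssured assertions would go here.
    }
}
``` |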
I'm pulling this into QA. Notifications should be working, both on the user page and via email. |
OK, a record of the changes was made on the pull request, and I'm posting them here. We agreed to make a number of changes during the last code review, 5 total, all addressed in my latest commit: reverted to the logic of trusting the external file copier to create the top-level folder for the files in the destination dataset directory, so no extra folder is created now. Also, Phil just found something else that I missed: Bill created some RestAssured tests that are now failing, since the import process has been changed. Let's discuss how to proceed; should we move it to QA and treat the RA tests as a separate issue, or fix them now? Also, the branch needs to be synced up to develop before it can be QAed.
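To illustrate the reverted folder logic described above, a minimal sketch; the storage path and PID are assumptions for illustration, not the actual implementation:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ImportFolderCheck {
    public static void main(String[] args) {
        // The import job now trusts the external copier (e.g. rsync) to have
        // created the top-level folder inside the dataset directory, and
        // fails fast if it is missing instead of creating an extra folder.
        Path datasetDir = Paths.get("/usr/local/dvn/data", "10.5072/FK2/EXAMPLE");
        if (!Files.isDirectory(datasetDir)) {
            throw new IllegalStateException(
                    "Expected the file copier to have created " + datasetDir);
        }
        System.out.println("Top-level folder present: " + datasetDir);
    }
}
``` |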
Issues I'm seeing: |
Committed the code change;
|
Version support will come in a later iteration.