File Hierarchy: I want to be able to preserve my dataset's files' directory structure, for easy import, computation, and navigation. #2249
Comments
One highly related thing which is extremely prevalent in microscopy and I would guess other fields, is that in addition to encoding metadata into the directory hierarchy, they have also encoded it into the filenames, usually underscore separated. E.g. Some of this will be important metadata, some of it may not be. The ability to automatically or semi-automatically import some of this metadata (but hopefully not the junk) in the form of tag annotations sounds useful so that search/filtering can make use of them. |
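A minimal sketch of the semi-automatic extraction described above, assuming metadata is encoded as directory names plus underscore-separated filename pieces. The function name `candidate_tags` and the example path are hypothetical, and deciding which tokens are junk is left to the user.

```python
from pathlib import PurePosixPath


def candidate_tags(file_path: str, sep: str = "_") -> list[str]:
    """Split a relative file path into candidate tag strings.

    Directory names become tags as-is; the filename (without its
    extension) is split on `sep`. Duplicates are dropped, order is kept.
    """
    path = PurePosixPath(file_path)
    tokens = list(path.parent.parts)      # directory names, outermost first
    tokens += path.stem.split(sep)        # filename pieces, extension removed
    seen = set()
    tags = []
    for token in tokens:
        if token and token not in seen:
            seen.add(token)
            tags.append(token)
    return tags


# Hypothetical microscopy-style path, for illustration only.
print(candidate_tags("experiment1/plateA/wellB3_GFP_t001.tif"))
# → ['experiment1', 'plateA', 'wellB3', 'GFP', 't001']
```

The result is only a list of candidates; a review step where the depositor keeps or discards each token would sit on top of this.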
It might be useful to have a look at this tool, built for the Open Microscopy Environment. The UI is not beautiful, but it does this kind of metadata extraction into Tag Annotations: https://www.openmicroscopy.org/site/products/partner/omero.webtagging/ There is also a "search" tool which should really be called "navigation" because it allows the user to browse the graph of tags from any origin point. This resembled filesystem navigation somewhat and seemed to satisfy some users. Caveat emptor: because the tags are stored in a relational DB, the queries behind this navigation can get very slow when there are large numbers of tags and/or large amounts of data tagged with them. Ideally this functionality would be backed by a graph DB that is kept up to date, so that these queries stay performant. |
FRD for this feature (work in progress): https://docs.google.com/document/d/1PqL6EljP-N51rt3puy3HedStrnV5DOJ3Gf7H_zPHcA0/edit?usp=sharing |
Feedback from @pameyer: "preserving file naming and directory structure (with the exception of Mostly I just want to make it clear that download is a use case. (We probably need a separate issue to talk about running computation on files.) In the FRD above this is currently a question ("Do these carry over into a folder structure when downloaded as a zip?") and the answer for many users, I think, is that they want/expect to be able to upload a zip and later download a zip that has the same directory structure inside it. Some months ago @cchoirat was talking about the importance of this for her (though she may not have been talking about zip files specifically). It's a common expectation. Right now Dataverse flattens your files into a single namespace/directory on upload. |
I think this would be really valuable. It was how things worked with versions < 4.0, as I recall, and makes it somewhat unpredictable what will happen currently when uploading a project (e.g., via the API). One possibility might be to do what S3 does with object keys that can have slashes in them:
The examples they give of object keys are:
This would allow a "flat" Dataset to contain files that can be batch downloaded into a hierarchical structure. Of course, I don't know if that works on the backend. |
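To illustrate the idea of slash-containing keys in a flat namespace, here is a small sketch that turns a flat list of keys into a nested folder structure. The keys are made up for the example, and `build_tree` is a hypothetical helper, not anything from Dataverse or S3.

```python
def build_tree(keys):
    """Turn flat, slash-separated keys into a nested dict.

    Folders are dicts; files are leaves marked with None.
    """
    tree = {}
    for key in keys:
        *folders, name = key.split("/")
        node = tree
        for folder in folders:
            node = node.setdefault(folder, {})  # create the folder on first use
        node[name] = None
    return tree


# Hypothetical keys: the storage stays flat, the hierarchy lives in the names.
flat_keys = [
    "README.txt",
    "data/raw/subject01.csv",
    "data/raw/subject02.csv",
    "code/analysis.R",
]
tree = build_tree(flat_keys)
print(sorted(tree))              # top-level entries
print(sorted(tree["data"]["raw"]))  # files inside data/raw
```

Identically named files in different subdirectories stay distinct because the full key, not the bare filename, is the identifier.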
This issue was raised yesterday by @pameyer and others from @sbgrid. @bmckinney if you want, you could assign yourself to this issue, at least to think about it. I remember @dpwrussell of OMERO fame talking about it during the 2015 Dataverse Community Meeting. @leeper you're right. From what I've heard from @landreev, in the DVN 3.x days a zip download would sort of reconstruct the file system hierarchy. I'm obviously fuzzy on the details. |
I'm wondering whether there is a way to upload an entire directory (for example, by dragging a folder; currently only dragging a file is supported) so that the structure is maintained. The user could then browse the directories and files by simply clicking into them, as in Dropbox and GitHub, without explicitly downloading and unzipping the data. |
👍 on this request. The directory structure is often important. Sometimes there are even multiple subdirectories that contain identically named files, e.g., for different experimental subjects or different versions of an experiment. @pdurbin is correct that download is a use case. So is online browsing of the dataset to get a feel for what's there -- the directory structure provides very useful organization. |
In the short term, we are considering using the file hierarchy information as metadata, stored in the database, rather than having the files in a hierarchy on disk. This would allow users to view the hierarchy in a preview, with the file display in the table (adding filtering and sorting capabilities).
This doesn't address everything that is desired; however, we are interested in getting comments on this proposal. Here is a more detailed description:
Depositor drags a zip (not double zip) into the dataset. The file will unzip and preserve the directory structure (see #3448).
Individual files will be ingested (if necessary) and displayed just like any other file in a dataset: flat. Individual files can be downloaded.
If all or any files are downloaded, the hierarchy will be re-created in a zip, matching the structure of the file that was uploaded in the first place. A user wanting to access data selects “Download all” and downloads the original zip hierarchy.
The system behavior is transparent to depositors.
Add files
Move files
Versioning
View hierarchy
Replace/Unzip existing .zip
Download |
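A rough sketch of the download side of this proposal: files are stored flat, each record carries its folder as metadata, and the hierarchy is re-created only when a zip is built. The record layout `(directory_label, file_name, content)` and the function `build_zip` are assumptions made for illustration, not the actual Dataverse schema or code.

```python
import io
import zipfile


def build_zip(records):
    """Re-create a folder hierarchy in a zip from flat file records.

    `records` is an iterable of (directory_label, file_name, content_bytes);
    an empty directory_label means the file sits at the top level.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for directory_label, file_name, content in records:
            # The path inside the zip comes from metadata, not from disk layout.
            arcname = f"{directory_label}/{file_name}" if directory_label else file_name
            archive.writestr(arcname, content)
    return buffer.getvalue()


# Hypothetical flat records, as they might come out of a database query.
records = [
    ("", "README.txt", b"read me first"),
    ("data/subject01", "results.csv", b"a,b\n1,2\n"),
    ("data/subject02", "results.csv", b"a,b\n3,4\n"),
]
zip_bytes = build_zip(records)
print(zipfile.ZipFile(io.BytesIO(zip_bytes)).namelist())
```

Because the folder is just a field on each record, moving a file to another folder is a metadata update rather than a change on disk.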
@dpwrussell @pameyer @leeper @wddabc @jeisner @nmedeiro @christophergandrud @pdeffebach @setgree @mdehollander (and anyone else who is following this issue) good news! Dataverse 4.12 has support for organizing files into folders! Can you all please try it out at https://demo.dataverse.org and give us feedback? Here are some screenshots that show how to introduce a folder hierarchy to your dataset's files: This feature is documented as "File Path" at http://guides.dataverse.org/en/4.12/user/dataset-management.html#file-path and here's a screenshot of the docs: Please just leave a comment below! Thanks! |
@pdurbin Good that this works now with a zip file. Ideally I would like to see this also working with drag&drop, and to be able to browse through folders in the interface instead of having the folder name listed for each file. But hey, thanks for making this already possible! |
@mdehollander great suggestion! Please feel free to open a new issue for this. Everyone, while I'm writing I'll mention that I also wrote about the progress so far in this "Control over dataset file hierarchy + directory structure (new feature in Dataverse 4.12)" thread and feedback is welcome there as well: https://groups.google.com/d/msg/dataverse-community/8gn5pq0cVc0/MCMQAQHRAQAJ If anyone wants to reply via Twitter, I would suggest piling on to one of these tweets:
We're currently working on "Enable the display of file hierarchy metadata on the dataset page" in #5572. |
Phil, this is great! It worked perfectly with the test dataset I uploaded
to the demo site. Thanks very much to you and your team for getting this
much-needed functionality into Dataverse. It's critical to the
computational reproducibility we're teaching.
All the best,
Norm
…__________________________________________________________________________
*Norm Medeiros*
Associate Librarian of the College
Coordinator for Collection Management and Metadata Services
Haverford College
370 Lancaster Ave., Haverford, PA 19041
(610) 896-1173
|
@nmedeiro fantastic! If you have a public sample zip file with a folder hierarchy that you give to your students that we can also use in our own testing, please let us know where to download it. 😄 Yes, I've been thinking that this is an important step toward more automated reproducibility. Code Ocean, for example, wants a "data" folder and "code" folder, as I wrote about at #4714 (comment) . Here's a screenshot: |
I loaded one to the demo site
https://doi.org/10.5072/FK2/86JG25
Feel free to use for testing.
|
@nmedeiro thanks! It's only 6.5 MB. Can I make it public by attaching it to this issue? |
Sure.
|
@nmedeiro thanks! Here it is: dataverse_files.zip Inside the "Replication Documentation for Midlife Crisis Paper" directory are the following files:
|
Thanks all for the feedback as we evaluated and implemented this in Dataverse. Very exciting to see this feature added. #5572 (view hierarchy) has been merged and will be included in the next release. Retaining file hierarchy for zips and editing the hierarchy were added in previous releases, so I'm closing this issue. |
@nmedeiro here's how the files and folders look in the "tree" view we shipped in Dataverse 4.13: Thanks again! |
Beautiful! Thanks for your efforts with this. It's very important to our
work with students.
|
Hi, so the canonical solution to this problem is to upload a zip file? I was trying to upload some files and folders recently -- which I've organized carefully in order to ensure reproducibility -- and I was unable to figure out how to upload the files in a nested way. |
@setgree you can also use https://github.com/GlobalDataverseCommunityConsortium/dataverse-uploader or https://github.com/gdcc/python-dvuploader Or if you're in control of the installation, you can install https://github.com/gdcc/dvwebloader |
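For anyone scripting this themselves, here is a hedged sketch of the bookkeeping such tools need to do: walk a local folder and work out, for each file, the folder path to record alongside it. As far as I understand the Dataverse native API, the add-file call accepts a `jsonData` field that can carry a `directoryLabel`; treat that as an assumption and check the API guide for your version. `plan_uploads` is a hypothetical helper and does not contact any server.

```python
import json
from pathlib import Path


def plan_uploads(root):
    """List (local_path, json_data) pairs for every file under `root`.

    json_data is a JSON string holding the file's folder relative to
    `root` under the key "directoryLabel" (assumed field name); files
    directly in `root` get an empty JSON object.
    """
    root = Path(root)
    plan = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        folder = path.parent.relative_to(root).as_posix()
        metadata = {} if folder == "." else {"directoryLabel": folder}
        plan.append((path, json.dumps(metadata)))
    return plan
```

Each pair could then be sent as one upload request by whatever HTTP client you prefer; the point is only that the nesting travels as metadata next to each file.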
Thank you, I appreciate your quick response. This answer surprised me. IMHO:
Does the Dataverse team intend to integrate folder preservation into the default flow, or is the team happy with the way things stand? (Perhaps this has been discussed elsewhere, my apologies if I missed it) |
FWIW: There are potential security issues in allowing web apps to scan your disk for files. The DVWebloader uses a ~de facto standard that is supported by most browsers to allow you to upload a whole directory, after clicking OK in a browser-mandated popup. (Conversely, when the user specifies the exact files involved, as in our normal upload, no popup is required, but the app doesn't get to know the local path.) I'm sure that with the work on creating a new React front end for Dataverse, we'll be looking at supporting directories more cleanly, where possible. (Also note w.r.t. surfacing: when DVWebloader is installed, the upload page shows an 'Upload a Folder' option, so it is visible.) |
As @qqmyers says, DVWebloader already integrates folder preservation into the default flow, but it's an optional component that needs to be installed (see https://guides.dataverse.org/en/6.1/user/dataset-management.html#folder-upload ). If you're curious what it looks like, there are some screenshots in this pull request: And yes, I agree that when we get to implementing file upload in the new frontend ( https://github.com/IQSS/dataverse-frontend ), we should strongly consider folder upload. Better reproducibility without workarounds. 100%. @setgree all this is to say, yes, we are fully supportive of your ideas! 😄 As far as things being discussed elsewhere, a good place for discussion is https://groups.google.com/g/dataverse-community or https://chat.dataverse.org . You are very welcome to join and post! |
Just a remark that if Dataverse were being built today, it would undoubtedly be built on top of git. Obviously git already handles all of the concerns above, including directory structure and avoiding duplicate storage between similar versions, so reinventing all the functionality may be unnecessary. To use git-Dataverse, a project would need to host its own git repo anywhere else. It could be a private repo. That repo would tag a small number of revisions as releases. git-Dataverse would then host a public, read-only "sparse mirror" that contained only the release revisions (and only the public parts of them) but was guaranteed to be archival, which is the point of Dataverse, I think? So a user of the sparse mirror could download a snapshot -- or could download the whole sparse mirror and see the diffs between releases. I am not sure how to construct such a sparse mirror, which collapses the intermediate history between releases and removes private material from each release. However, https://github.com/newren/git-filter-repo looks like a possible starting point. BTW, this is a feature that I could imagine github providing -- a kind of compromise between public and private repos -- but maybe they don't do this because they want to encourage open-source development, with fully public repos. Even if they did provide it, Dataverse may support bigger datasets and may have other features I don't know about. |
@jeisner ha, this reminds me of my "a thought experiment: datasets as git repos" (doc, email) from 10 years ago. I even made a little logo: That is to say, we thought about it! OSF is actually built on top of git and they dissuaded us from doing the same when we did our big rewrite back then. Dataverse is still in the git game though! See https://github.com/datalad/datalad-dataverse for a recent integration. I'm planning on learning more in Germany in April at distribits. Anyway, the Google Group and the chat room are good places to talk about this. Please feel free to kick off a discussion! 😄 |
User community request to be able to organize files in a dataset in a hierarchical manner so when a user exports files from Google Drive, Dropbox, or OSF (or any other location), the file structure is maintained without causing the user extra work on the Dataverse side.