Multiple sheets in one file #3
Tables separated by blank lines
See http://johnkerl.org/miller/doc/file-formats.html#CSV/TSV/etc.
Example
Advantages
Disadvantages
|
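A minimal sketch of how a reader for "tables separated by blank lines" might work, assuming tab-delimited tables that each start with a header row. The function name `split_tables` and the sample text are hypothetical, not part of any existing tool.

```python
import csv
import io


def split_tables(text, delimiter="\t"):
    """Split text into tables wherever one or more blank lines occur."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes the table collected so far.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    # Parse each block as a delimited table whose first row is the header.
    return [
        list(csv.DictReader(io.StringIO("\n".join(block)), delimiter=delimiter))
        for block in blocks
    ]


# Hypothetical two-sheet example.
example = "A\tB\n1\tX\n2\tY\n\nC\tD\tE\n3\t4\t5\n"
sheets = split_tables(example)
print(len(sheets), "tables found")
```

One caveat of this sketch: a row consisting only of empty cells is treated as a blank line and would end the table early.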
Tables in Markdown
See https://help.github.com/articles/organizing-information-with-tables/
Example
Advantages
Disadvantages
|
Do you think it has to be one file? What if, instead, it were one directory? The way that Git repos or RStudio projects are just regular directories on your computer, with some other, mostly hidden stuff lying around to help Git and RStudio do their jobs. An R package is a special case that has even more specific structure about the files and directories. Maybe that's a model for a sanesheet? A sanesheet-anticipating tool serves the different files to you via conventional tabs. And tabs have an interact/edit mode appropriate to the file type: as a delimited file for the TSVs, as Markdown for the notes. So I guess that suggests there has to be a .sanesheets file, like foo.Rproj or the entire .git directory, to facilitate coordinated actions across the whole set of files. I think the point is that someone who wants to experience a sanesheet like a spreadsheet could. Well, if this mythical inspector/editor existed! But the code-based analyst could experience it like a subdirectory of delimited files and some notes. And all could enjoy the benefit of version control. |
MIME (Multipurpose Internet Mail Extensions)
See https://en.wikipedia.org/wiki/MIME
Example
Advantages
Disadvantages
|
Multi-part MIME is pretty intriguing. I've always wanted to interact with |
Directory of files
Example
Advantages
Disadvantages
|
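For the directory option, reading every sheet takes only a few lines. This is a sketch under assumptions: the file names `data.tsv`, `metadata.tsv` and `README.md` and the function `read_sheet_directory` are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path


def read_sheet_directory(directory):
    """Read every *.tsv file in a directory into a dict keyed by file stem."""
    sheets = {}
    for path in sorted(Path(directory).glob("*.tsv")):
        with path.open(newline="", encoding="utf-8") as handle:
            sheets[path.stem] = list(csv.reader(handle, delimiter="\t"))
    return sheets


# Build a throwaway directory with two sheets and a notes file (all hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "data.tsv").write_text("A\tB\n1\tX\n", encoding="utf-8")
    Path(tmp, "metadata.tsv").write_text("C\tD\n3\t4\n", encoding="utf-8")
    Path(tmp, "README.md").write_text("Notes about the sheets.\n", encoding="utf-8")
    sheets = read_sheet_directory(tmp)

print(sorted(sheets))
```

Non-tabular files such as the notes are simply skipped by the `*.tsv` pattern here; a real tool would need to decide how to present them.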
I like "Tables separated by blank lines" for its simplicity, but I believe it has the least support (of the options given) by programs in the wild. I like "Tables in Markdown" for its aesthetic qualities and existing adoption. Markdown editors already exist, and I believe some already have support for editing tables. This option is probably the most well implemented by existing tools. I like MIME because the format is simple, well defined and widely implemented. There's lots of existing code/libraries to read/write these files. It can contain binary files, like graphical plots. |
A directory of files certainly has the lowest barrier to entry. I'd like to bundle up multiple files into a single container file so that they're not easily separated, so that, for example, when someone e-mails you the data sheet it doesn't get separated from the metadata sheet. People are accustomed to a single spreadsheet file (XLS for example) that contains all the related sheets of one project. I'd like to give people an option that has this same behaviour, but with a plain text file format. |
Twitter poll! https://twitter.com/sjackman/status/776860608629055488 |
JSON (JavaScript Object Notation)
Example
{
"sheet1": {
"A": [1, 2],
"B": ["X", "Y"]
},
"sheet2": {
"C": [3],
"D": [4],
"E": [5]
},
"sheet3": {
"F": [0],
"G": [1]
}
}
Advantages
Disadvantages
|
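To illustrate how the JSON layout above maps back onto delimited sheets, here is a sketch that converts each column-oriented sheet to TSV text. The helper `sheet_to_tsv` is hypothetical, and the sketch assumes all columns in a sheet have equal length.

```python
import json

# The same three-sheet document as in the example above.
doc = json.loads("""
{
  "sheet1": {"A": [1, 2], "B": ["X", "Y"]},
  "sheet2": {"C": [3], "D": [4], "E": [5]},
  "sheet3": {"F": [0], "G": [1]}
}
""")


def sheet_to_tsv(columns):
    """Turn a {column name: list of values} mapping into TSV text."""
    header = list(columns)
    # zip(*...) transposes the columns into rows.
    rows = zip(*(columns[name] for name in header))
    lines = ["\t".join(header)]
    lines += ["\t".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines) + "\n"


tsv_by_sheet = {name: sheet_to_tsv(columns) for name, columns in doc.items()}
print(tsv_by_sheet["sheet1"])
```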
When you are ready for provenance, attribution, and extensible metadata then http://www.researchobject.org/ will be waiting for you :-) |
A Research Object Bundle is a zip file of arbitrary files (like TSV files) that also contains a JSON file describing the provenance/attribution of those files. |
No ... but maybe it could one day! |
CSVY (CSV with YAML frontmatter)
See http://csvy.org
Example
Advantages
Disadvantages
sed '/^---$/,/^---$/d' |
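The sed command above deletes the YAML frontmatter so that plain CSV tools can read the rest. A rough Python counterpart, which keeps the frontmatter text instead of discarding it, might look like the following. The function `split_frontmatter` and the sample metadata are assumptions for illustration, and the YAML is returned as raw text because the standard library has no YAML parser.

```python
import csv
import io


def split_frontmatter(text):
    """Return (frontmatter, body) for text that starts with a '---' block."""
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].rstrip("\r\n") != "---":
        return "", text  # no frontmatter at all
    for i in range(1, len(lines)):
        if lines[i].rstrip("\r\n") == "---":
            return "".join(lines[1:i]), "".join(lines[i + 1:])
    return "", text  # opening delimiter without a closing one


# Hypothetical CSVY content.
example = (
    "---\n"
    "name: my-dataset\n"
    "fields:\n"
    "  - name: var1\n"
    "  - name: var2\n"
    "---\n"
    "var1,var2\n"
    "1,2\n"
)
metadata, body = split_frontmatter(example)
rows = list(csv.reader(io.StringIO(body)))
print(rows)
```

Unlike the sed one-liner, this sketch only handles a single frontmatter block at the top of the file.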
@jennybc Any thoughts/preferences on these file formats? My favourites are
I really like the look of TSVY and its similarity to Markdown with YAML front matter. I quite like the idea of storing the code in RMarkdown with YAML frontmatter and storing the data in TSV with YAML frontmatter. I could even see concatenating the code and data files for the occasions where you may want to store both in a single file, and then I think you may have a real competitor for an Excel spreadsheet. |
I have JSON filed along with 😱 XML in my head. As in, if I don't have a nested or recursive structure, why would I go there? It's also not that human readable. For normal humans. So I like these tab and comma delimited formats with some meta-data in a header, yes. For the bits that are data. I think it is important to think about the motley assortment of things people park in a spreadsheet. I honestly think the neat packaging of disparate objects is part of what users like. I see a fair number of spreadsheets where the data worksheets really could be csv files. But then there's always the "README as worksheet" lurking at the front or the back. |
Yes, I agree with needing a format to wrap up prose, code, data and report in a single file. Currently that lives in 3+ files: the prose and code in one RMarkdown file, the data in multiple TSV files, and the rendered report in an HTML file. I'd like a format to stuff all that in a single text file. I prefer having multiple files when building a data analysis pipeline, but when sending a report to a collaborator, I prefer a single file. |
This is why I think your multipart MIME idea is not crazy. This is also why xlsx is a zip archive. Unless people can get comfortable with a directory, you have to shove it all into something 😕. |
I like the MIME idea for storing files of different types all in a single text file: the RMarkdown, the data and the HTML. For the related but simpler problem of how to stuff multiple sheets (TSV tables) in a single file, I prefer the TSV tables separated by YAML frontmatter blocks. |
Do you think it's necessary to stuff multiple sheets in a single file? Why? |
|
See this Twitter conversation thread: @BaCh_mira
@mike_schatz
@BaCh_mira
|
Oh yes I definitely think this whole bundle of files needs to be packaged in some way yes. I thought you meant that, within that main receptacle, you wanted to get all the TSV files into one file. I misunderstood. This is why the MIME idea is interesting, because it already anticipates very disparate things, with a pre-existing vocabulary for declaring what things are, signalling where they begin/end, etc. |
In the case when you're bundling all different types of files (TSV, RMarkdown and HTML), I like the MIME format. The TSV in that MIME file can be just simple TSV without YAML blocks. In the case when you're bundling just TSV files (no RMarkdown, no HTML) I prefer one file of TSV tables separated by YAML blocks (no MIME). If we only cared about bundling tables (sheets) into a single file, I'd prefer the TSV/YAML solution. If we want to tackle the whole enchilada, MIME is looking good. |
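As a sketch of the "TSV tables separated by YAML blocks" option, the reader below assumes one possible layout: each sheet opens with a `---` line, some YAML lines, and a closing `---` line, followed by TSV rows up to the next `---`. That layout, the function `read_sheets`, and the sample content are assumptions, since the thread does not pin down an exact syntax. The YAML is kept as raw text.

```python
def read_sheets(text):
    """Parse sheets of the assumed form: '---', YAML lines, '---', TSV rows."""
    sheets = []
    meta, rows = [], []
    in_yaml = False
    for line in text.splitlines():
        if line == "---":
            if in_yaml:
                in_yaml = False  # closing delimiter: the table follows
            else:
                # Opening delimiter: store the previous sheet, start a new one.
                if meta or rows:
                    sheets.append({"meta": "\n".join(meta), "rows": rows})
                meta, rows = [], []
                in_yaml = True
        elif in_yaml:
            meta.append(line)
        elif line:
            rows.append(line.split("\t"))
    if meta or rows:
        sheets.append({"meta": "\n".join(meta), "rows": rows})
    return sheets


# Hypothetical two-sheet file.
example = (
    "---\nname: data\n---\n"
    "A\tB\n1\tX\n2\tY\n"
    "---\nname: metadata\n---\n"
    "C\tD\n3\t4\n"
)
sheets = read_sheets(example)
print([sheet["meta"] for sheet in sheets])
```

A limitation of this layout is that a data row consisting of exactly `---` would be mistaken for a delimiter.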
Success reading a multipart MIME sanesheet! So excite! Are we on to something?
eg.sanesheet
read_sanesheet.r
library(purrr)
library(readr)
library(webutils)
# Split the multipart MIME file into its parts, using the boundary string "END"
multipart <- parse_multipart(read_file("sanesheet.tsv.multipart"), boundary = "END")
# Parse the content of each part as a TSV table
sheets <- map(multipart, function(x) read_tsv(x$value))
sheets
read_sanesheet.sh
This sanesheet can also be extracted to one-file-per-sheet at the command line using
|
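For comparison with the R reader above, here is a sketch of the same idea using Python's standard `email` package, which can parse multipart MIME. The sample document is hypothetical: the part file names, the `multipart/mixed` type, and the headers are assumptions. Only the boundary string `END` is taken from the R example.

```python
import csv
import io
from email import message_from_string

# A hypothetical two-sheet multipart document using the boundary "END".
document = "\n".join([
    "MIME-Version: 1.0",
    "Content-Type: multipart/mixed; boundary=END",
    "",
    "--END",
    "Content-Type: text/tab-separated-values",
    'Content-Disposition: attachment; filename="data.tsv"',
    "",
    "A\tB",
    "1\tX",
    "2\tY",
    "--END",
    "Content-Type: text/tab-separated-values",
    'Content-Disposition: attachment; filename="metadata.tsv"',
    "",
    "C\tD\tE",
    "3\t4\t5",
    "--END--",
    "",
])

message = message_from_string(document)

# For a multipart message, get_payload() returns the list of parts.
sheets = {}
for part in message.get_payload():
    table = csv.reader(io.StringIO(part.get_payload()), delimiter="\t")
    sheets[part.get_filename()] = list(table)

print(list(sheets))
```

Because each part declares its own content type, the same loop could dispatch Markdown or HTML parts to other handlers instead of the TSV reader.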
@jennybc What do you think of the name and file extension |
I really think that multiple files is the only sane way to go. It would be better to treat it like an RStudio project, with one metadata file that people click on to launch the whole business. Then you could also mingle in other files (like R scripts etc). |
It's difficult to e-mail a directory of files. I don't have any inside scoop, but I think that's why Apple migrated .pages, .numbers, etc. from their directories of files (bundles) to flat files, and that's even with special OS support to make the directory look to the user like a single file. A directory of files will eventually need to be zipped up to send to someone. A more likely outcome is that someone e-mails the data sheet and just leaves out the metadata file, and the two get separated. See this Twitter conversation thread: ODS and XLSX use a zip file of files as the file format. Michael @mr-c mentioned http://www.researchobject.org/, which is a zip of files. Zip as a container format is alright, but it's binary and can't easily be edited with a text editor or committed to version control. MIME is a nice container format because it is plain text, and it faithfully represents a directory of files. To woo/convert spreadsheet users, one key feature currently missing is the ability to store multiple sheets in a single file. I'd like the format of that file to be plain text. |
Yes, that's a downside, but I think the downsides of the other options (i.e. one massive opaque file that requires special software to read) are worse. |
Zip files are quite opaque, which is why I'm not a big fan of zip as a container format for plain text files. MIME is quite readable, as far as standard plain-text container formats go. What exactly do you mean by opaque? The contents of the MIME file example in #3 (comment) are pleasingly transparent in my opinion. |
They are pleasingly transparent to you as a programmer, but what existing tools can easily extract data out of a file of that nature? The structure of MIME also does not lend itself to good performance: if you have a 200 MB CSV file followed by a Markdown README, you'll have to scan all 200 MB of lines to find the Markdown file. People don't seem to have issues sharing RStudio projects? |
|
That's also true of a |
People with the technical ability to create RStudio projects don't have an issue sharing RStudio projects. I'm hoping to target spreadsheet users with two key features:
|
If you use a single file format, you also need to expose UI for everything you can do with your file browser: delete sheets, copy sheets from another project, ... Also, for anything other than trivial data, you'll need to compress the contents in order to email them, which means you'll need a layer on top of MIME. |
Gmail's file size attachment limit is 25 MB. That's good enough I would hazard for many typical Excel spreadsheets without being compressed (though XLSX is compressed). Larger files can be transferred via a link on Dropbox, same for any data file larger than the e-mail attachment limit. |
I don't feel that's onerous. Only two operations are needed: create a new blank sheet and delete a sheet. Good old copy-and-paste can be used to copy a sheet between two projects. |
Texts (http://www.texts.io) can edit Markdown tables. |
|
@sjackman Sorry to jump in so late, but Tabular Data Packages might be an option here. You can store multiple TSVs in a single directory and define a
|
You may be interested in this new project Data Curator. We're building on top of Data Packages, Comma Chameleon and other goodness. Development starting this week. |
That's very exciting! Thanks for the heads up, Stephen. I'll be a willing beta tester, if you're looking for early outside users. |
We hope to go beta at release 0.3.0; it should be of some value at that stage for single data files. Everyone is very welcome to test and report issues. |
Excellent. I don't know of a way to watch releases or completed milestones in GitHub. Could you suggest an issue, or create an issue to which I could subscribe, that would be updated/closed when 0.3.0 is released? |
Watch releases with https://github.com/ODIQueensland/data-curator/releases.atom |
I, and I imagine other users, don't use an RSS feed reader as part of our workflow. I'd prefer to use the GitHub web interface to watch notifications. Would you consider creating a low-volume locked issue that you comment on once per release, so that users like myself can subscribe to that issue? |
I'd like to have multiple sheets in one file to keep data and metadata together in one file. This suggests that we need a container format to keep multiple TSV files together in one file. We could use an off-the-shelf container format, such as tar or zip. I would, however, like the container format to be a plain text format, to enable editing the entire document in a text editor and checking it into version control. I'm going to list some requirements below, and then a random assortment of formats that came to mind.