
#9? Put big data where and how you can compute on it rather than moving complete raw datasets around. #39

Closed
dlebauer opened this issue Mar 31, 2015 · 14 comments

Comments

@dlebauer
Collaborator

Covers proximity / mounting and architecture. Don't hog active space if slow storage is sufficient; scan for untouched files to move to longer-term storage.
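One way the "scan for untouched files" idea could look, as a minimal Python sketch. The helper name `find_untouched` and the 180-day threshold are hypothetical, and the approach assumes the filesystem records access times; on volumes mounted with `noatime` or `relatime`, modification time may be a better proxy.

```python
import os
import time


def find_untouched(root, max_idle_days=180, now=None):
    """Return paths under `root` not accessed in the last `max_idle_days` days.

    Hypothetical helper: relies on st_atime, which is unreliable on
    filesystems mounted with noatime.
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_atime < cutoff:
                    stale.append(path)
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
    return sorted(stale)
```

The returned list is only a set of candidates; whether and where to archive them is a separate decision.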

Don't move it around if you can help it. If you must, use appropriate tools. Store local 'cached' copies (e.g. using a knitr argument) instead of writing scripts that always download archived data; download again only if there are changes.
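A rough sketch of the local-cache pattern in Python (not knitr). The function `cached_fetch` and its `refresh` flag are hypothetical names; deciding whether the remote data has actually changed (for example via a checksum or an HTTP header) is left to the caller here.

```python
import os
import urllib.request


def cached_fetch(url, cache_path, download=None, refresh=False):
    """Return `cache_path`, downloading `url` only when no cached copy
    exists or when the caller sets `refresh=True`."""
    if download is None:
        # Standard-library downloader: urlretrieve(url, filename)
        download = urllib.request.urlretrieve
    if refresh or not os.path.exists(cache_path):
        os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
        tmp = cache_path + ".part"
        download(url, tmp)
        # Rename only after a complete download, so a partial file
        # is never mistaken for a valid cached copy.
        os.replace(tmp, cache_path)
    return cache_path
```

An analysis script can then call `cached_fetch(url, "data/raw.csv")` on every run and only hit the network the first time.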

Do subsetting server-side, computing in the database (dplyr lazy evaluation), etc.
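To illustrate computing in the database rather than pulling raw rows, here is a small Python sketch using the standard-library `sqlite3` module as a stand-in for a remote server. The table `observations`, its columns, and the function `mean_by_site` are made up for the example; this is not dplyr, just the same idea of pushing the filter and aggregation to where the data lives.

```python
import sqlite3


def mean_by_site(conn, min_year):
    """Let the database filter and aggregate; only summary rows come back."""
    query = """
        SELECT site, AVG(value) AS mean_value
        FROM observations
        WHERE year >= ?
        GROUP BY site
        ORDER BY site
    """
    return conn.execute(query, (min_year,)).fetchall()


# In-memory stand-in for a remote database (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (site TEXT, year INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO observations VALUES (?, ?, ?)",
    [("A", 2013, 1.0), ("A", 2014, 3.0), ("A", 2015, 5.0), ("B", 2015, 2.0)],
)
print(mean_by_site(conn, 2014))  # [('A', 4.0), ('B', 2.0)]
```

Only one row per site crosses the connection, regardless of how many raw observations the table holds.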

@dlebauer dlebauer changed the title from "#9? Put big data where and how you can compute on it rather than icing it back and forth." to "#9? Put big data where and how you can compute on it rather than moving complete raw datasets around." Mar 31, 2015
@PBarmby
Collaborator

PBarmby commented Apr 2, 2015

Related: #16, #19, #25

@emhart
Owner

emhart commented Apr 10, 2015

Can I assign this paragraph to you, @dlebauer? I think it's a good issue.

@emhart
Owner

emhart commented Apr 10, 2015

I might also frame this as 'how important is the movement of data'. This is kind of touched on in #32, but this issue needs more detail. This might also be another place to talk about data access via an API.

@dlebauer
Collaborator Author

I'll take it


@jhollist
Collaborator

jhollist commented May 5, 2015

Wondering if we can come up with a pithy title for this rule that doesn't include the "size matters" stuff.

Some suggestions:

  1. Moving data isn't free
  2. The larger your data, the more you need to think about where to store it.
  3. Moving data has costs

Well, those aren't very good, but ...

@dlebauer
Collaborator Author

dlebauer commented May 6, 2015

I agree that "size matters" isn't the best for a title, but the section goes beyond just moving data, so I think the first and third suggestions are too narrow in scope. I'll think about it.

@PBarmby
Collaborator

PBarmby commented May 6, 2015

How about "data volume has consequences"?

The people who worked on the Sloan Digital Sky Survey used to say that (at least, as of 10 years ago) there wasn't much that could beat the bandwidth of a FedEx truck filled with tapes or hard drives. I always find that amusing and I wonder if we can work it in somehow.

@snim2
Collaborator

snim2 commented May 6, 2015

When I was a student we found that the bandwidth of sending a CD through the post was very favourable; these days USB sticks can hold huge volumes of data, but it's an interesting calculation.
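The calculation itself is simple enough to sketch in Python. The function name and the example figures below (ten 4 TB drives, 24 hours in transit) are assumptions chosen for illustration, not numbers from this thread, and the sketch ignores the time needed to write and read the media.

```python
def effective_bandwidth_mbps(payload_bytes, transit_seconds):
    """Average throughput, in megabits per second, of physically shipping data."""
    return payload_bytes * 8 / transit_seconds / 1e6


# Assumed example values: ten 4 TB drives delivered in 24 hours.
payload = 10 * 4e12        # bytes
transit = 24 * 3600        # seconds
shipped = effective_bandwidth_mbps(payload, transit)
print(f"{shipped:.0f} Mbit/s")  # 3704 Mbit/s
```

Under those assumptions the shipment averages several gigabits per second, though with a latency of a full day.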

@jhollist
Collaborator

jhollist commented May 6, 2015

@PBarmby I like that one, and I like the FedEx truck/post analogies. Maybe even bring up "sneakernet".

@dlebauer
Collaborator Author

dlebauer commented May 6, 2015

Any citations for referencing the sneakernet or FedEx stories?

@drj11

drj11 commented May 6, 2015

"station wagon full of tapes" is Tanenbaum according to Wikipedia: https://en.wikipedia.org/wiki/Sneakernet#Non-fiction

I've probably read that text, but I can't verify the quotation now.

@jhollist
Collaborator

jhollist commented May 6, 2015

And the authoritative source, xkcd!

https://what-if.xkcd.com/31/

There are also some refs in the reference section of the Wikipedia article that @drj11 linked.

@PBarmby
Collaborator

PBarmby commented Jul 14, 2015

@dlebauer do you need help with this one? I can start on it if need be.

@dlebauer
Collaborator Author

@PBarmby sorry, I forgot to flesh this one out. It would be great if you want to work on this; otherwise I can work on it next week.

@emhart emhart closed this as completed Oct 17, 2015