<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title><![CDATA[Nice R Code]]></title>
<link href="http://nicercode.github.io/atom.xml" rel="self"/>
<link href="http://nicercode.github.io/"/>
<updated>2016-09-16T12:05:39+10:00</updated>
<id>http://nicercode.github.io/</id>
<author>
<name><![CDATA[Rich FitzJohn & Daniel Falster]]></name>
</author>
<generator uri="http://octopress.org/">Octopress</generator>
<entry>
<title type="html"><![CDATA[Figure functions]]></title>
<link href="http://nicercode.github.io/blog/2013-07-09-figure-functions/"/>
<updated>2013-07-09T16:41:00+10:00</updated>
<id>http://nicercode.github.io/blog/figure-functions</id>
<content type="html"><![CDATA[<p>Transitioning from an interactive plot in R to a publication-ready
plot can create a messy script file with lots of statements and use of
global variables. This post outlines an approach that I have used to
simplify the process and keep the code readable.</p>
<!-- more -->
<p>The usual way of plotting to a file is to open a plotting device (such
as <code>pdf</code> or <code>png</code>), run a series of commands that generate plotting
output, and then close the device with <code>dev.off()</code>. However, most
plots are developed purely interactively. So you start
with:</p>
<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class=""><span class="line">set.seed(10)
</span><span class="line">x <- runif(100)
</span><span class="line">y <- rnorm(100, x)
</span><span class="line">par(mar=c(4.1, 4.1, .5, .5))
</span><span class="line">plot(y ~ x, las=1)
</span><span class="line">fit <- lm(y ~ x)
</span><span class="line">abline(fit, col="red")
</span><span class="line">legend("topleft", c("Data", "Trend"),
</span><span class="line"> pch=c(1, NA), lty=c(NA, 1), col=c("black", "red"), bty="n")</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>Then to convert this into a figure for publication we copy and paste
this between the device commands:</p>
<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class=""><span class="line">pdf("my-plot.pdf", width=6, height=4)
</span><span class="line"> # ...pasted commands from before
</span><span class="line">dev.off()</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>This leads to bits of code that often look like this:</p>
<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
</pre></td><td class="code"><pre><code class=""><span class="line"># pdf("my-plot.pdf", width=6, height=4) # uncomment to make plot
</span><span class="line">set.seed(10)
</span><span class="line">x <- runif(100)
</span><span class="line">y <- rnorm(100, x)
</span><span class="line">par(mar=c(4.1, 4.1, .5, .5))
</span><span class="line">plot(y ~ x, las=1)
</span><span class="line">fit <- lm(y ~ x)
</span><span class="line">abline(fit, col="red")
</span><span class="line">legend("topleft", c("Data", "Trend"),
</span><span class="line"> pch=c(1, NA), lty=c(NA, 1), col=c("black", "red"), bty="n")
</span><span class="line"># dev.off() # uncomment to make plot</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>which is all pretty ugly. On top of that, we’re often making a bunch
of variables that are global but are really only useful in the context
of the figure (in this case the <code>fit</code> object that contains the trend
line). An arguably worse solution would be simply to duplicate the
plotting bits of code.</p>
<h2 id="a-partial-solution">A partial solution</h2>
<p>The solution that I usually use is to make a function that generates
the figure.</p>
<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
</pre></td><td class="code"><pre><code class=""><span class="line">fig.trend <- function() {
</span><span class="line"> set.seed(10)
</span><span class="line"> x <- runif(100)
</span><span class="line"> y <- rnorm(100, x)
</span><span class="line"> par(mar=c(4.1, 4.1, .5, .5))
</span><span class="line"> plot(y ~ x, las=1)
</span><span class="line"> fit <- lm(y ~ x)
</span><span class="line"> abline(fit, col="red")
</span><span class="line"> legend("topleft", c("Data", "Trend"),
</span><span class="line"> pch=c(1, NA), lty=c(NA, 1), col=c("black", "red"), bty="n")
</span><span class="line">}</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>Then you can easily see the figure:</p>
<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class=""><span class="line">fig.trend() # generates figure</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>or</p>
<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class=""><span class="line">source("R/figures.R") # refresh file that defines fig.trend
</span><span class="line">fig.trend()</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>and you can easily generate plots:</p>
<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class=""><span class="line">pdf("figs/trend.pdf", width=6, height=8)
</span><span class="line">fig.trend()
</span><span class="line">dev.off()</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>However, this still gets a bit unwieldy when you have a large number
of figures to make (especially for talks, where you might make 20 or 30
figures).</p>
<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class=""><span class="line">pdf("figs/trend.pdf", width=6, height=4)
</span><span class="line">fig.trend()
</span><span class="line">dev.off()
</span><span class="line">
</span><span class="line">pdf("figs/other.pdf", width=6, height=4)
</span><span class="line">fig.other()
</span><span class="line">dev.off()</span></code></pre></td></tr></table></div></figure></notextile></div>
<h2 id="a-full-solution">A full solution</h2>
<p>The solution I use here is a little function called <code>to.pdf</code>:</p>
<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class=""><span class="line">to.pdf <- function(expr, filename, ..., verbose=TRUE) {
</span><span class="line"> if ( verbose )
</span><span class="line"> cat(sprintf("Creating %s\n", filename))
</span><span class="line"> pdf(filename, ...)
</span><span class="line"> on.exit(dev.off())
</span><span class="line"> eval.parent(substitute(expr))
</span><span class="line">}</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>Which can be used like so:</p>
<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class=""><span class="line">to.pdf(fig.trend(), "figs/trend.pdf", width=6, height=4)
</span><span class="line">to.pdf(fig.other(), "figs/other.pdf", width=6, height=4)</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>A couple of nice things about this approach:</p>
<ul>
<li>It becomes much easier to read and compare the parameters to the
plotting device (width, height, etc).</li>
<li>We’ve reduced things from 6 repetitive lines to 2 that capture our
intent better.</li>
<li>The <code>to.pdf</code> function demands that you put the code for your figure in a function.</li>
<li>Using functions, rather than statements in the global environment,
discourages dependency on global variables. This in turn helps
identify reusable chunks of code.</li>
<li>Arguments are all passed to <code>pdf</code> via <code>...</code>, so we don’t need to
duplicate <code>pdf</code>’s argument list in our function.</li>
<li>The <code>on.exit</code> call ensures that the device is always closed, even if
the figure function fails.</li>
</ul>
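<p>As a minimal sketch of the last point, the snippet below repeats the <code>to.pdf</code> definition from above and calls it with a figure function that fails part-way through. The function <code>fig.broken</code> and the temporary output file are hypothetical, invented only for this illustration. Counting the open devices before and after the call suggests that <code>on.exit</code> closes the device despite the error.</p>

```r
# to.pdf as defined earlier in the post
to.pdf <- function(expr, filename, ..., verbose=TRUE) {
  if ( verbose )
    cat(sprintf("Creating %s\n", filename))
  pdf(filename, ...)
  on.exit(dev.off())
  eval.parent(substitute(expr))
}

# Hypothetical figure function that fails after it has started plotting
fig.broken <- function() {
  plot(1:10)
  stop("something went wrong part-way through the figure")
}

# Count open graphics devices before and after the failing call
n.before <- length(dev.list())
out <- tempfile(fileext=".pdf")   # assumed scratch location for the example
result <- try(to.pdf(fig.broken(), out, width=6, height=4, verbose=FALSE),
              silent=TRUE)
n.after <- length(dev.list())

# The call failed, but no extra device was left open
inherits(result, "try-error")
n.after == n.before
```

<p>Without the <code>on.exit</code> call, the PDF device would stay open after the error, and later plotting commands would be sent to that file rather than to the screen.</p>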
<p>For talks, I often build up figures piece-by-piece. This can be done
like so (for a two-part figure):</p>
<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
</pre></td><td class="code"><pre><code class=""><span class="line">fig.progressive <- function(with.trend=FALSE) {
</span><span class="line"> set.seed(10)
</span><span class="line"> x <- runif(100)
</span><span class="line"> y <- rnorm(100, x)
</span><span class="line"> par(mar=c(4.1, 4.1, .5, .5))
</span><span class="line"> plot(y ~ x, las=1)
</span><span class="line"> if ( with.trend ) {
</span><span class="line"> fit <- lm(y ~ x)
</span><span class="line"> abline(fit, col="red")
</span><span class="line"> legend("topleft", c("Data", "Trend"),
</span><span class="line"> pch=c(1, NA), lty=c(NA, 1), col=c("black", "red"), bty="n")
</span><span class="line"> }
</span><span class="line">}</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>Now, if run as</p>
<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class=""><span class="line">fig.progressive(FALSE)</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>just the data are plotted, and if run as</p>
<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class=""><span class="line">fig.progressive(TRUE)</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>the trend line and legend are included. Then with the <code>to.pdf</code>
function, we can do:</p>
<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class=""><span class="line">to.pdf(fig.progressive(TRUE), "figs/progressive-1.pdf", width=6, height=4)
</span><span class="line">to.pdf(fig.progressive(FALSE), "figs/progressive-2.pdf", width=6, height=4)</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>which will generate the two figures.</p>
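<p>When a figure has more than two stages, the same calls can be generated in a loop. This is only a sketch: the vector <code>steps</code>, the file-name pattern and the use of a temporary directory are assumptions made for illustration, and the definitions of <code>to.pdf</code> and <code>fig.progressive</code> are repeated from above so that the block runs on its own.</p>

```r
# to.pdf and fig.progressive as defined earlier in the post
to.pdf <- function(expr, filename, ..., verbose=TRUE) {
  if ( verbose )
    cat(sprintf("Creating %s\n", filename))
  pdf(filename, ...)
  on.exit(dev.off())
  eval.parent(substitute(expr))
}

fig.progressive <- function(with.trend=FALSE) {
  set.seed(10)
  x <- runif(100)
  y <- rnorm(100, x)
  par(mar=c(4.1, 4.1, .5, .5))
  plot(y ~ x, las=1)
  if ( with.trend ) {
    fit <- lm(y ~ x)
    abline(fit, col="red")
    legend("topleft", c("Data", "Trend"),
           pch=c(1, NA), lty=c(NA, 1), col=c("black", "red"), bty="n")
  }
}

# Assumed ordering for this sketch: data only first, then data plus trend
steps <- c(FALSE, TRUE)
files <- file.path(tempdir(),
                   sprintf("progressive-%d.pdf", seq_along(steps)))

# One PDF per stage; the expression is evaluated in the calling environment,
# so steps[i] is looked up here at the time of each call
for (i in seq_along(steps))
  to.pdf(fig.progressive(steps[i]), files[i], width=6, height=4)
```

<p>Because <code>to.pdf</code> evaluates its first argument in the caller's environment, the loop variable is resolved at the time of each call.</p>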
<p>The general idea can be expanded to more devices:</p>
<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class=""><span class="line">to.dev <- function(expr, dev, filename, ..., verbose=TRUE) {
</span><span class="line"> if ( verbose )
</span><span class="line"> cat(sprintf("Creating %s\n", filename))
</span><span class="line"> dev(filename, ...)
</span><span class="line"> on.exit(dev.off())
</span><span class="line"> eval.parent(substitute(expr))
</span><span class="line">}</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>where we would do:</p>
<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class=""><span class="line">to.dev(fig.progressive(TRUE), pdf, "figs/progressive-1.pdf", width=6, height=4)
</span><span class="line">to.dev(fig.progressive(FALSE), pdf, "figs/progressive-2.pdf", width=6, height=4)</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>Note that with this <code>to.dev</code> function we can rewrite the <code>to.pdf</code>
function more compactly:</p>
<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class=""><span class="line">to.pdf <- function(expr, filename, ...)
</span><span class="line"> to.dev(expr, pdf, filename, ...)</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>Or write a similar function for the <code>png</code> device:</p>
<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class=""><span class="line">to.png <- function(expr, filename, ...)
</span><span class="line"> to.dev(expr, png, filename, ...)</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>(As an alternative, the <code>dev.copy2pdf</code> function can be useful for
copying the current contents of an interactive plotting window to a
PDF.)</p>
]]></content>
</entry>
<entry>
<title type="html"><![CDATA[Modifying data with lookup tables]]></title>
<link href="http://nicercode.github.io/blog/2013-07-09-modifying-data-with-lookup-tables/"/>
<updated>2013-07-09T08:20:00+10:00</updated>
<id>http://nicercode.github.io/blog/modifying-data-with-lookup-tables</id>
<content type="html"><![CDATA[<!-- The problem:
- importing new data
- amount of code to be written (opportunities for mistake)
- separating data from scripts
- maintaining record of where data came from
Common approach
- long sequence of data modifying code
Solution
- use lookup table, find and replace
-->
<p>In many analyses, data is read from a file but must be modified before it can be used. For example, you may want to add a new column of data, or do a “find” and “replace” on a site, treatment or species name. There are three ways one might add such information. The first involves editing the original data file – although you should <em>never</em> do this, I suspect this method is quite common. A second – and widely used – approach for adding information is to modify the values using code in your script. The third – and nicest – way of adding information is to use a lookup table.</p>
<!-- more -->
<p>One of the most common things we see in the code of researchers working with data is long slabs of code modifying a data frame based on some logical tests. Such code might correct, for example, a species name:</p>
<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="r"><span class="line">raw<span class="o">$</span>species<span class="p">[</span>raw<span class="o">$</span>id<span class="o">==</span><span class="s">"1"</span><span class="p">]</span> <span class="o"><-</span> <span class="s">"Banksia oblongifolia"</span>
</span><span class="line">raw<span class="o">$</span>species<span class="p">[</span>raw<span class="o">$</span>id<span class="o">==</span><span class="s">"2"</span><span class="p">]</span> <span class="o"><-</span> <span class="s">"Banksia ericifolia"</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>or add some details to the data set, such as location, latitude, longitude and mean annual precipitation:</p>
<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="r"><span class="line">raw<span class="o">$</span>location<span class="p">[</span>raw<span class="o">$</span>id<span class="o">==</span><span class="s">"1"</span><span class="p">]</span> <span class="o"><-</span> <span class="s">"NSW"</span>
</span><span class="line">raw<span class="o">$</span>latitude<span class="p">[</span>raw<span class="o">$</span>id<span class="o">==</span><span class="s">"1"</span><span class="p">]</span> <span class="o"><-</span> <span class="m">-37</span>
</span><span class="line">raw<span class="o">$</span>longitude<span class="p">[</span>raw<span class="o">$</span>id<span class="o">==</span><span class="s">"1"</span><span class="p">]</span> <span class="o"><-</span> <span class="m">40</span>
</span><span class="line">raw<span class="o">$</span>map<span class="p">[</span>raw<span class="o">$</span>id<span class="o">==</span><span class="s">"1"</span><span class="p">]</span> <span class="o"><-</span> <span class="m">1208</span>
</span><span class="line">raw<span class="o">$</span>map<span class="p">[</span>raw<span class="o">$</span>id<span class="o">==</span><span class="s">"2"</span><span class="p">]</span> <span class="o"><-</span> <span class="m">1226</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>In large analyses, this type of code may go for hundreds of lines.</p>
<p><img src="../../images/2013-07-09-modifying-data-with-lookup-tables/messy_script.png" /></p>
<p>Now, before we go on, let me say that this approach to adding data is <em>much</em> better than editing your data file directly, for the following two reasons:</p>
<ol>
<li>It maintains the integrity of your raw data file</li>
<li>You can see where the new value came from (it was added in a script), and modify it later if needed.</li>
</ol>
<p>There is also nothing <em>wrong</em> with adding data this way. However, it is what we would consider <em>messy</em> code, for these reasons:</p>
<ul>
<li>Long chunks of code modifying data are inherently difficult to read.</li>
<li>There’s a lot of typing involved, so lots of work and thus many opportunities for error.</li>
<li>It’s harder to change variable names when they are embedded in code all over the place.</li>
</ul>
<p>A far <em>nicer</em> way to add data to an existing data frame is to use a lookup table. Here is an example of such a table, achieving similar (but not identical) modifications to the code above:</p>
<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="r"><span class="line">read.csv<span class="p">(</span><span class="s">"dataNew.csv"</span><span class="p">)</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>
<div>
<pre><code class="text">## lookupVariable lookupValue newVariable newValue
## 1 id 1 species Banksia oblongifolia
## 2 id 2 species Banksia ericifolia
## 3 id 3 species Banksia serrata
## 4 id 4 species Banksia grandis
## 5 NA family Proteaceae
## 6 NA location NSW
## 7 id 4 location WA
## source
## 1 Daniel Falster
## 2 Daniel Falster
## 3 Daniel Falster
## 4 Daniel Falster
## 5 Daniel Falster
## 6 Daniel Falster
## 7 Daniel Falster</code></pre>
</div>
<p>The columns of this table are</p>
<ul>
<li><strong>lookupVariable</strong> is the name of the variable in the parent data we want to match against. If left blank, all rows are changed.</li>
<li><strong>lookupValue</strong> is the value of lookupVariable to match against</li>
<li><strong>newVariable</strong> is the variable to be changed</li>
<li><strong>newValue</strong> is the value of <code>newVariable</code> for matched rows</li>
<li><strong>source</strong> includes any notes about where the data came from (e.g., who made the change)</li>
</ul>
<p>So the table documents the changes we want to make to our data frame. The function <a href="https://gist.github.com/dfalster/5589956">addNewData.R</a> takes the file name for this table as an argument and applies it to the data frame. For example, let’s assume we have a data frame called <code>myData</code></p>
<div>
<pre><code class="r">myData</code></pre>
</div>
<div>
<pre><code class="text">## x y id
## 1 0.93160 5.433 1
## 2 0.24875 3.868 2
## 3 0.92273 5.944 2
## 4 0.85384 5.541 2
## 5 0.30378 3.985 2
## 6 0.41205 4.415 2
## 7 0.35158 4.440 2
## 8 0.13920 3.007 2
## 9 0.16579 2.976 2
## 10 0.66290 5.315 3
## 11 0.25720 3.755 3
## 12 0.88086 5.345 3
## 13 0.11784 3.183 3
## 14 0.01423 3.749 4
## 15 0.23359 4.264 4
## 16 0.33614 4.433 4
## 17 0.52122 4.393 4
## 18 0.11616 3.603 4
## 19 0.90871 6.379 4
## 20 0.75664 5.838 4</code></pre>
</div>
<p>and want to apply the table given above, we simply write</p>
<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="r"><span class="line">source<span class="p">(</span><span class="s">"addNewData.r"</span><span class="p">)</span>
</span><span class="line">allowedVars <span class="o"><-</span> c<span class="p">(</span><span class="s">"species"</span><span class="p">,</span> <span class="s">"family"</span><span class="p">,</span> <span class="s">"location"</span><span class="p">)</span>
</span><span class="line">addNewData<span class="p">(</span><span class="s">"dataNew.csv"</span><span class="p">,</span> myData<span class="p">,</span> allowedVars<span class="p">)</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>
<div>
<pre><code class="text">## x y id species family location
## 1 0.93160 5.433 1 Banksia oblongifolia Proteaceae NSW
## 2 0.24875 3.868 2 Banksia ericifolia Proteaceae NSW
## 3 0.92273 5.944 2 Banksia ericifolia Proteaceae NSW
## 4 0.85384 5.541 2 Banksia ericifolia Proteaceae NSW
## 5 0.30378 3.985 2 Banksia ericifolia Proteaceae NSW
## 6 0.41205 4.415 2 Banksia ericifolia Proteaceae NSW
## 7 0.35158 4.440 2 Banksia ericifolia Proteaceae NSW
## 8 0.13920 3.007 2 Banksia ericifolia Proteaceae NSW
## 9 0.16579 2.976 2 Banksia ericifolia Proteaceae NSW
## 10 0.66290 5.315 3 Banksia serrata Proteaceae NSW
## 11 0.25720 3.755 3 Banksia serrata Proteaceae NSW
## 12 0.88086 5.345 3 Banksia serrata Proteaceae NSW
## 13 0.11784 3.183 3 Banksia serrata Proteaceae NSW
## 14 0.01423 3.749 4 Banksia grandis Proteaceae WA
## 15 0.23359 4.264 4 Banksia grandis Proteaceae WA
## 16 0.33614 4.433 4 Banksia grandis Proteaceae WA
## 17 0.52122 4.393 4 Banksia grandis Proteaceae WA
## 18 0.11616 3.603 4 Banksia grandis Proteaceae WA
## 19 0.90871 6.379 4 Banksia grandis Proteaceae WA
## 20 0.75664 5.838 4 Banksia grandis Proteaceae WA</code></pre>
</div>
<p>The large block of code is now reduced to a single line that clearly expresses what we want to achieve. Moreover, the new values (data) are stored as a table of <em>data</em> in a file, which is preferable to having data mixed in with our code.</p>
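<p>To give a rough idea of how a lookup table can be applied, here is a minimal sketch. The function <code>applyLookup</code> is hypothetical and is not the <code>addNewData</code> function from the gist, which also takes a file name and a vector of allowed variables; the toy data below are invented for illustration.</p>

```r
# Hypothetical helper (NOT the addNewData function from the gist):
# apply each row of a lookup table to a data frame, in order.
applyLookup <- function(data, lookup) {
  for (i in seq_len(nrow(lookup))) {
    var <- lookup$newVariable[i]
    # Create the target column on first use
    if (!var %in% names(data))
      data[[var]] <- NA_character_
    lv <- lookup$lookupVariable[i]
    if (is.na(lv) || lv == "") {
      # Blank lookupVariable: change all rows
      rows <- rep(TRUE, nrow(data))
    } else {
      # Otherwise change only rows where lookupVariable equals lookupValue
      rows <- data[[lv]] == lookup$lookupValue[i]
    }
    data[[var]][rows] <- lookup$newValue[i]
  }
  data
}

# Toy lookup table with the same column layout as described above
lookup <- data.frame(
  lookupVariable = c("id", "id", "", "id"),
  lookupValue    = c(1, 2, NA, 2),
  newVariable    = c("species", "species", "location", "location"),
  newValue       = c("Banksia oblongifolia", "Banksia ericifolia", "NSW", "WA"),
  stringsAsFactors = FALSE)

# Toy data frame, invented for this example
toyData <- data.frame(x = c(0.9, 0.2, 0.5), id = c(1, 2, 2))
result <- applyLookup(toyData, lookup)
```

<p>Rows of the lookup table are applied in order, so a general rule (a blank <code>lookupVariable</code>) can be followed by more specific rows that override it, as with the location column here.</p>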
<p>You can find the example files used here as a <a href="https://gist.github.com/dfalster/5589956">GitHub gist</a>.</p>
<p><strong>Acknowledgements:</strong> Many thanks to Rich FitzJohn and Diego Barneche for valuable discussions.</p>
]]></content>
</entry>
<entry>
<title type="html"><![CDATA[Organizing the project directory]]></title>
<link href="http://nicercode.github.io/blog/2013-05-17-organising-my-project/"/>
<updated>2013-05-17T08:20:00+10:00</updated>
<id>http://nicercode.github.io/blog/organising-my-project</id>
<content type="html"><![CDATA[<p>This is a guest post by Marcela Diaz, a PhD student at Macquarie University. </p>
<p>Until recently, I hadn’t given much attention to organising files in my project. All the documents and files from my current project were spread out in two different folders, with very little subfolder division. All the files were together in the same place and I had multiple versions of the same file, with different dates. As you can see, things were getting a bit out of control.</p>
<!--more -->
<p><img src="../../images/2013-05-17-organising-my-project/directory1_before.png" /></p>
<p><img src="../../images/2013-05-17-organising-my-project/directory2_before.png" /></p>
<p>Following <a href="../2013-04-05-projects/">advice from Rich and Daniel</a>, I decided to spend a little time getting organised, adopting a directory layout with the following folders:</p>
<ul>
<li>Data: which contains both my base (raw) data and the processed data </li>
<li>Output: data and figures generated in R</li>
<li>R: R scripts with all the new functions I created while cleaning up the directory and in an attempt to write nicer code. </li>
<li>Analysis (R file): R script sourcing all the functions necessary for the analysis </li>
</ul>
<p><img src="../../images/2013-05-17-organising-my-project/directory_after.png" /></p>
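<p>For anyone starting from scratch, a layout like this can also be created from within R. The following is only a minimal sketch: the folder and file names (<code>data</code>, <code>output</code>, <code>R</code>, <code>analysis.R</code>) and the project name <code>my-project</code> are assumptions for illustration, and the project is created under <code>tempdir()</code> so that nothing real is touched.</p>
<pre><code class="r"># Hypothetical project root, placed in a temporary directory for this sketch.
project <- file.path(tempdir(), "my-project")

# One folder each for data, generated output and R functions.
for (d in c("data", "output", "R"))
  dir.create(file.path(project, d), recursive=TRUE, showWarnings=FALSE)

# An empty top-level script that will source the functions and run the analysis.
file.create(file.path(project, "analysis.R"))

list.files(project)
</code></pre>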
<p>At the same time I <a href="../../git">started using version control with git</a>. As a result, I no longer need to create a new file every time I make a change, and each of the files in the analysis directory is unique.</p>
<p>Setting up the new directory and sorting the existing files into the new folders didn’t take long and was relatively easy. Now it is really simple to find files and keep track of current and old figures. I no longer need to use Spotlight to find the latest version of each script. In my experience, this improved the organization and efficiency of my project; I highly recommend keeping a good project layout. </p>
]]></content>
</entry>
<entry>
<title type="html"><![CDATA[How long is a function?]]></title>
<link href="http://nicercode.github.io/blog/2013-05-07-how-long-is-a-function/"/>
<updated>2013-05-07T11:10:00+10:00</updated>
<id>http://nicercode.github.io/blog/how-long-is-a-function</id>
<content type="html"><![CDATA[<p>Within the R project and contributed packages, how long do functions
tend to be? In our experience, people seem to think that functions
are only needed when you need to use a piece of code multiple times,
or when you have a really large problem. However, many functions are
actually very small.</p>
<!-- more -->
<p>R allows a lot of “computation on the language”, simply meaning that
we can look inside objects easily. Here is a function that returns
the number of lines in a function.</p>
<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="r"><span class="line">function.length <span class="o"><-</span> <span class="kr">function</span><span class="p">(</span>f<span class="p">)</span> <span class="p">{</span>
</span><span class="line"> <span class="kr">if</span> <span class="p">(</span>is.character<span class="p">(</span>f<span class="p">))</span>
</span><span class="line"> f <span class="o"><-</span> match.fun<span class="p">(</span>f<span class="p">)</span>
</span><span class="line"> length<span class="p">(</span>deparse<span class="p">(</span>f<span class="p">))</span>
</span><span class="line"><span class="p">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>This works because <code>deparse</code> converts an object back into text (that
could in turn be parsed):</p>
<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="r"><span class="line">writeLines<span class="p">(</span>deparse<span class="p">(</span>function.length<span class="p">))</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>
<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class=""><span class="line">function (f)
</span><span class="line">{
</span><span class="line"> if (is.character(f))
</span><span class="line"> f <- match.fun(f)
</span><span class="line"> length(deparse(f))
</span><span class="line">}</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>so the <code>function.length</code> function is itself 6 lines long by this
measure. Note that the deparsed formatting differs a little from the
original: indentation, brace position and spacing all change,
following the likes of the R-core style guide.</p>
<p>Most packages consist mostly of functions: here is a function that
extracts all functions from a package:</p>
<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class=""><span class="line">package.functions <- function(package) {
</span><span class="line"> pkg <- sprintf("package:%s", package)
</span><span class="line"> object.names <- ls(name=pkg)
</span><span class="line"> objects <- lapply(object.names, get, pkg)
</span><span class="line"> names(objects) <- object.names
</span><span class="line"> objects[sapply(objects, is.function)]
</span><span class="line">}</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>Finally, we can get the lengths of all functions in a package:</p>
<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class=""><span class="line">package.function.lengths <- function(package)
</span><span class="line"> vapply(package.functions(package), function.length, integer(1))</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>Looking at the recommended package “boot”:</p>
<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="r"><span class="line">library<span class="p">(</span>boot<span class="p">)</span>
</span><span class="line">package.function.lengths<span class="p">(</span><span class="s">"boot"</span><span class="p">)</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>
<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
</pre></td><td class="code"><pre><code class=""><span class="line"> abc.ci boot boot.array boot.ci
</span><span class="line"> 54 126 56 80
</span><span class="line"> censboot control corr cum3
</span><span class="line"> 137 72 8 8
</span><span class="line"> cv.glm EEF.profile EL.profile empinf
</span><span class="line"> 42 16 27 79
</span><span class="line"> envelope exp.tilt freq.array glm.diag
</span><span class="line"> 56 49 7 19
</span><span class="line"> glm.diag.plots imp.moments imp.prob imp.quantile
</span><span class="line"> 69 37 34 39
</span><span class="line">imp.weights inv.logit jack.after.boot k3.linear
</span><span class="line"> 34 2 69 14
</span><span class="line"> lik.CI linear.approx logit nested.corr
</span><span class="line"> 36 34 2 28
</span><span class="line"> norm.ci saddle saddle.distn simplex
</span><span class="line"> 33 179 281 65
</span><span class="line"> smooth.f tilt.boot tsboot var.linear
</span><span class="line"> 36 57 97 14 </span></code></pre></td></tr></table></div></figure></notextile></div>
<p>I have 138 packages installed on my computer (mostly through
dependencies – small compared with the ~4000 on CRAN!). We need to
load them all before we can access the functions within:</p>
<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="r"><span class="line">library<span class="p">(</span>utils<span class="p">)</span>
</span><span class="line">packages <span class="o"><-</span> rownames<span class="p">(</span>installed.packages<span class="p">())</span>
</span><span class="line"><span class="kr">for</span> <span class="p">(</span>p <span class="kr">in</span> packages<span class="p">)</span>
</span><span class="line"> library<span class="p">(</span>p<span class="p">,</span> character.only<span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>Then we can apply <code>package.function.lengths</code> to each package.</p>
<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="r"><span class="line">lens <span class="o"><-</span> lapply<span class="p">(</span>packages<span class="p">,</span> package.function.lengths<span class="p">)</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>The median function length is only 12 lines (and remember that
includes things like the function arguments)!</p>
<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="r"><span class="line">median<span class="p">(</span>unlist<span class="p">(</span>lens<span class="p">))</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>
<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class=""><span class="line">[1] 12</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>The distribution of function lengths is strongly right skewed, with
most functions being very short. Ignoring the 1% of functions that
are longer than 200 lines, the distribution of function lengths
looks like this:</p>
<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class=""><span class="line">tmp <- unlist(lens)
</span><span class="line">hist(tmp[tmp <= 200], main="", xlab="Function length (lines)")</span></code></pre></td></tr></table></div></figure></notextile></div>
<p><img src="http://nicercode.github.io/images/2013-05-07-how-long-is-a-function/function-length-distribution.png" /></p>
<p>Then plot the distribution of the per-package median (that is, for
each package compute the median function length in terms of lines of
code and plot the distribution of those medians).</p>
<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class=""><span class="line">lens.median <- sapply(lens, median)
</span><span class="line">hist(lens.median, main="", xlab="Per-package median function length")</span></code></pre></td></tr></table></div></figure></notextile></div>
<p><img src="http://nicercode.github.io/images/2013-05-07-how-long-is-a-function/function-length-median.png" /></p>
<p>The median package has a median function length of 16 lines. There
are a handful of extremely long functions in most packages; across all
packages, the median “longest function” is 120 lines.</p>
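<p>Per-package summaries like these can be computed along the following lines. This is a sketch rather than the exact code behind the numbers above: so that it runs on its own, it repeats compact versions of the helpers (using <code>mget</code> and <code>Filter</code>, which is a substitution on my part) and assumes just two packages, <code>stats</code> and <code>utils</code>, which are attached by default. The results will therefore differ from those reported here.</p>
<pre><code class="r"># Compact versions of the helpers above, repeated so this block is self-contained.
function.length <- function(f) length(deparse(f))
package.function.lengths <- function(package) {
  pkg <- sprintf("package:%s", package)
  objects <- mget(ls(name=pkg), envir=as.environment(pkg))
  vapply(Filter(is.function, objects), function.length, integer(1))
}

# Assumption: only look at two packages that are always attached.
packages <- c("stats", "utils")
lens <- lapply(packages, package.function.lengths)
names(lens) <- packages

# Drop packages that export no functions, where a median or max is meaningless.
lens <- lens[sapply(lens, length) > 0]

lens.median <- sapply(lens, median)  # median function length per package
lens.max <- sapply(lens, max)        # longest function per package

median(lens.median)  # the "median package" median
median(lens.max)     # the median "longest function"
</code></pre>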
]]></content>
</entry>
<entry>
<title type="html"><![CDATA[Excel and line endings]]></title>
<link href="http://nicercode.github.io/blog/2013-04-30-excel-and-line-endings/"/>
<updated>2013-04-30T09:39:00+10:00</updated>
<id>http://nicercode.github.io/blog/excel-and-line-endings</id>
<content type="html"><![CDATA[<p>On a Mac, Excel produces csv files with the wrong line endings, which
causes problems for git (amongst other things).</p>
<p>This issue plagues at least
<a href="http://developmentality.wordpress.com/2010/12/06/excel-2008-for-macs-csv-bug/">Excel 2008</a>
and 2011, and possibly other versions.</p>
<p>Basically, saving a file as comma separated values (csv) uses a
carriage return <code>\r</code> rather than a line feed <code>\n</code> as a newline. Way
back before OS X, this was actually the correct Mac line ending, but
since the move to be more unix-y, the correct line ending is
<code>\n</code>.</p>
<!-- more -->
<p>Given that nothing has used this as its proper line ending for over a
decade, this is a bug. It’s a real pity that Microsoft does not see
fit to fix it.</p>
<h2 id="why-this-is-a-problem">Why this is a problem</h2>
<p>This breaks a number of scripts that require specific line endings.</p>
<p>This also causes problems when version controlling your data. In
particular, tools like <code>git diff</code> basically stop working as they work
line-by-line and see only one long line
(e.g. <a href="http://stackoverflow.com/questions/11531084/strange-git-line-ending-issue">here</a>).
Not having <code>diff</code> work properly makes it really hard to see where
changes have occurred in your data.</p>
<p>Git has really nice facilities for translating between different line
endings – in particular between Windows and Unix/(new) Mac endings.
However, they do basically nothing with old-style Mac endings because
<em>no sane application should create them</em>. See
<a href="https://github.com/git/git/blob/master/convert.c#L93">here</a>, for
example.</p>
<h2 id="a-solution">A solution</h2>
<p>There are at least two Stack Overflow questions that deal with this
(<a href="http://stackoverflow.com/questions/10491564/git-and-cr-vs-lf-but-not-crlf?rq=1">1</a>
and
<a href="http://stackoverflow.com/questions/11531084/strange-git-line-ending-issue">2</a>).</p>
<p>The solution is to edit <code>.git/config</code> (within your repository) to add
lines saying:</p>
<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class=""><span class="line">[filter "cr"]
</span><span class="line"> clean = LC_CTYPE=C awk '{printf(\"%s\\n\", $0)}' | LC_CTYPE=C tr '\\r' '\\n'
</span><span class="line"> smudge = tr '\\n' '\\r'</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>and then create a file <code>.gitattributes</code> that contains the line</p>
<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class=""><span class="line">*.csv filter=cr</span></code></pre></td></tr></table></div></figure></notextile></div>
<p>This translates the line endings on import and back again on export
(so you never change your working file). Things like <code>git diff</code> use
the “clean” version, and so magically start working again.</p>
<p>While the <code>.gitattributes</code> file can be (and should be) put under
version control, the <code>.git/config</code> file needs to be set up separately
on <em>every clone</em>. There are good reasons for this (see
<a href="http://stackoverflow.com/questions/6547933/is-it-possible-to-clone-git-config-from-remote-location">here</a>).
It would be possible to automate this to some degree with the
<code>--config</code> argument to <code>git clone</code>, but that’s still basically manual.</p>
<h2 id="issues">Issues</h2>
<p>This seems to generally work, but twice in use, large numbers of files
have been marked as changed when the filter got out of sync. We never
worked out what caused this, but one possible culprit seems to be
<a href="http://www.dropbox.com">Dropbox</a> (but you probably should not keep
repositories on Dropbox anyway).</p>
<h2 id="alternative-solutions">Alternative solutions</h2>
<p>The nice thing about the clean/smudge solution is that it leaves files
in the working directory unmodified. An alternative approach would be
to set up a pre-commit hook that runs csv files through a similar
filter. This will modify the contents of the working directory (and
may require reloading the files in Excel) but from that point on the
file will have proper line endings.</p>
<p>More manually, if files are saved as “Windows comma separated (.csv)”
you will get Windows-style line endings (<code>\r\n</code>) which are at least
treated properly by git and are in common usage this century.
However, this requires more remembering and makes saving csv files
from Excel even more tricky than normal.</p>
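<p>Another manual option is to check and convert the line endings from within R before committing. The sketch below is an illustration only: the helper names <code>line.endings</code> and <code>fix.line.endings</code> and the toy csv contents are made up for this example. It relies on <code>readLines</code> accepting <code>\r</code>, <code>\n</code> or <code>\r\n</code> as the end of a line, and like the pre-commit approach it rewrites the working file.</p>
<pre><code class="r"># Write a small csv with old-style Mac (CR only) line endings to a temp file.
path <- tempfile(fileext=".csv")
con <- file(path, "wb")
writeLines(c("species,height", "Banksia grandis,4"), con, sep="\r")
close(con)

# Hypothetical helper: count CR (0x0d) and LF (0x0a) bytes in a file.
line.endings <- function(path) {
  bytes <- readBin(path, what="raw", n=file.info(path)$size)
  c(cr=sum(bytes == as.raw(0x0d)), lf=sum(bytes == as.raw(0x0a)))
}
line.endings(path)  # 2 CR bytes, no LF

# Hypothetical helper: rewrite a file in place with unix (LF) line endings.
fix.line.endings <- function(path) {
  lines <- readLines(path, warn=FALSE)
  con <- file(path, "wb")
  on.exit(close(con))
  writeLines(lines, con, sep="\n")
}
fix.line.endings(path)
line.endings(path)  # no CR, 2 LF bytes
</code></pre>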
]]></content>
</entry>
<entry>
<title type="html"><![CDATA[git]]></title>
<link href="http://nicercode.github.io/blog/2013-04-23-git/"/>
<updated>2013-04-23T17:51:00+10:00</updated>
<id>http://nicercode.github.io/blog/git</id>
<content type="html"><![CDATA[<p>Thanks to everyone who came along and were such good sports with
learning git today. Hopefully you now have enough tools to help you use git in your own
projects. The notes are available (in fairly raw form)
<a href="http://nicercode.github.io/git">here</a>. Please let us know where they are unclear and we will
update them.</p>
<p>To re-emphasise our closing message – start using it on a
project, start thinking about what you want to track, and start
thinking about what constitutes a logical commit. Once you get into a
rhythm it will seem much easier. Bring your questions along to the
class in two weeks’ time.</p>
<p>Also, to re-emphasise: git is not a backup system. Make sure that
you have your work backed up, just in case something terrible happens.
I recommend <a href="http://www.crashplan.com/">CrashPlan</a>, which you can use
for free for backing up onto external hard drives (and for a fee).</p>
<h2 id="feedback">Feedback</h2>
<p>We welcome any and all feedback on the material and how we present it.
You can give <em>anonymous</em> feedback by emailing G2G admin (you should
have the address already – I’m only not putting it up here in a vain
effort to slow down spam bots). Alternatively, you are welcome to
email either or both of us, or leave a comment on a relevant page.</p>
]]></content>
</entry>
<entry>
<title type="html"><![CDATA[Why I want to write nice R code]]></title>
<link href="http://nicercode.github.io/blog/2013-04-05-why-nice-code/"/>
<updated>2013-04-05T14:46:00+11:00</updated>
<id>http://nicercode.github.io/blog/why-nice-code</id>
<content type="html"><![CDATA[<!--
Why are students here
Goals: performance, learning, affective, social
Value: attainment, intrinsic, instrumental
Instrumental - allows you to accomplish other important goals (extrinsic
rewards), i.e. learn about world, write papers, impress others
Intrinsic - value nice code for itself (craftsmanship)
Attainment - satisfaction in getting something to work
-->
<p>Writing code is fast becoming a key - if not the most important - skill for
doing research in the 21st century. As scientists, we live in extraordinary
times. The amount of data (information) available to us is increasing
exponentially, allowing for rapid advances in our understanding of the world
around us. The amount of information contained in a standard scientific paper
also seems to be on the rise. Researchers therefore need to be able to handle
ever larger amounts of data to ask novel questions and get papers published.
Yet, the standard tools used by many biologists - point and click programs for
manipulating data, doing stats and making plots - do not allow us to scale-up
our analyses to match data availability, at least not without many, many more
‘clicks’.</p>
<!-- more -->
<p><span class="caption-wrapper right"><img class="caption" src="http://nicercode.github.io/images/2013-04-05-why-nice-code/geeks-vs-nongeeks-repetitive-tasks.png" width="" height="" alt="Why writing code saves you time with repetitive tasks, by Bruno Oliveira" title="Why writing code saves you time with repetitive tasks, by Bruno Oliveira" /><span class="caption-text">Why writing code saves you time with repetitive tasks, by <a href="https://plus.google.com/+BrunoOliveira/posts/MGxauXypb1Y">Bruno Oliveira</a></span></span></p>
<p>The solution is to write scripts in programs like
<a href="http://www.r-project.org/">R</a>, <a href="http://www.python.org/">python</a> or
<a href="http://www.mathworks.com.au/products/matlab/">matlab</a>. Scripting allows you to
automate analyses, and therefore scale-up without a big increase in
effort.</p>
<p>Writing code also offers other benefits to research. When your
analyses are documented in a script, it is easier to pick up a project and
start working on it again. You have a record of what you did and why. Chunks
of code can also be reused in new projects, saving vast amounts of time. Writing
code also allows for effective collaboration with people from all over the
world. For all these reasons, many researchers are now learning how to write
code.</p>
<p>Yet, most researchers have no or limited formal training in computer science,
and thus struggle to write nice code (<a href="http://dx.doi.org/10.1038/467775a">Merali 2010</a>). Most of us are self-taught, having used a
mix of books, advice from other amateur coders, internet posts, and lots of
trial and error. Soon after we have written our first R script, our hard drives
explode with large bodies of barely readable code that we only half understand,
that also happens to be full of bugs and is generally difficult to use. Not
surprisingly, many researchers find writing code to be a relatively painful
process, involving lots of trial and error and, inevitably, frustration.</p>
<p>If this sounds familiar to you, don’t worry, you are not alone. There are many
<a href="http://nicercode.github.io/intro/resources.html">great R resources</a> available, but most show you how
to do some fancy trick, e.g. run some complicated statistical test or make a
fancy plot. Few people - outside of computer science departments - spend time
discussing the qualities of nice code and teaching you good coding habits.
Certainly no one is teaching you these skills in your standard biology research
department.</p>
<blockquote class="twitter-tweet"><p>Learn to code! I worry that most biologists leave uni lacking #1 skill for 21st cent biology. For inspiration <a href="http://t.co/7lzRutYuIw" title="http://code.org">code.org</a> <a href="https://twitter.com/search/%23CODE">#CODE</a></p>— Daniel Falster (@adaptive_plant) <a href="https://twitter.com/adaptive_plant/status/306854385076543488">February 27, 2013</a></blockquote>
<p>Observing how colleagues were struggling with their code, we
(<a href="http://nicercode.github.io/about#Team">Rich FitzJohn and Daniel Falster</a>) have teamed up to bring you
the <a href="http://nicercode.github.io/">nice R code</a> course and blog. We are
targeting researchers who are already using R and want to take their coding to
the next level. Our goal is to help you write nicer code.</p>
<blockquote>
<p>By ‘nicer’ we mean
code that is easy to read, easy to write, runs fast, gives reliable results, is
easy to reuse in new projects, and is easy to share with collaborators.</p>
</blockquote>
<p>We
will be focussing on elements of workflow, good coding habits and some tricks
that will help transform your code from messy to nice.</p>
<p>The inspiration for nice R code came in part from attending a boot camp run by
Greg Wilson from the <a href="http://software-carpentry.org/">Software Carpentry team</a>.
These boot camps aim to help researchers be more productive by teaching them
basic computing skills. Unlike other software courses we had attended, the
focus in the boot camps was on good programming habits and design. As
biologists, we saw a need for more material focussed on R, the language that
has come to dominate biological research. We are not experts, but have more
experience than many biologists. Hence the nice R code blog.</p>
<blockquote class="twitter-tweet"><p>@<a href="https://twitter.com/phylorich">phylorich</a> Being able to code (in any language) is most important skill for current biology. R is good choice: widely used, high level, free</p>— Daniel Falster (@adaptive_plant) <a href="https://twitter.com/adaptive_plant/status/312438921059520512">March 15, 2013</a></blockquote>
<h2 id="key-elements-of-nice-r-code">Key elements of nice R code</h2>
<p>We will now briefly consider some of the key principles of writing nice code.</p>
<h3 id="nice-code-is-easy-to-read">Nice code is easy to read</h3>
<blockquote><p>Programs should be written for people to read, and only incidentally for<br />machines to execute.</p><footer><strong>Abelson and Sussman</strong> <cite>Structure and Interpretation of Computer Programs</cite></footer></blockquote>
<p>Readability is by far the most important guiding principle for writing nicer
code. <strong>Anyone (especially you) should be able to pick up any of your
projects, understand what the code does and how to run it</strong>. Most code
written for research purposes is not easy to read.</p>
<p>In our opinion, there are no fixed rules for what nice code should look like.
There
is just a single test: is it easy to read? To check how nice your code
is, pass it to a collaborator, or pick up some code you haven’t used for
over a year. Do they (you) understand it?</p>
<p>Below are some general guidelines for making your code more readable. We
will explore each of these in more detail here on the blog:</p>
<ul>
<li>Use a sensible directory structure for organising project related
materials.</li>
<li>Abstract your code into many small functions with helpful descriptive
names</li>
<li>Use comments, design features, and meaningful variable or function names
to capture the intent of your code, i.e. describe what it is <em>meant</em> to do</li>
<li>Use version control. Of the many reasons for using version control, one is
that it archives older versions of your code, permitting you to ruthlessly
yet safely delete old files. This helps reduce clutter and improves readability.</li>
<li>Apply a consistent style, such as that described in the
<a href="http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html">Google R style guide</a>.</li>
</ul>
<h3 id="nice-code-is-reliable-ie-bug-free">Nice code is reliable, i.e. bug free</h3>
<blockquote class="twitter-tweet"><p>Occma’s raz0r: if your program isn’t working, it’s probably just a typo in the code, not an undiscovered bug or thing you’re doing wrong</p>— Alison Abreu-Garcia (@alisonag) <a href="https://twitter.com/alisonag/status/322374461212995584">April 11, 2013</a></blockquote>
<p>The computer does exactly what you tell it to. If there is a problem in your code, it’s most likely you put it there. How certain
are you that your code is error free? More than once I have reached a state
of near panic, looking over my code to ensure it is bug free before
submitting a final version of a paper for publication. What if I got it wrong?</p>
<p><a href="http://dx.doi.org/10.1109/MCSE.2005.54">It is almost impossible to ensure code is bug free</a>, but one can adopt healthy
habits that minimise the chance of this occurring:</p>
<ul>
<li>Don’t repeat yourself. The less you type, the fewer chances there are for
mistakes</li>
<li>Use test scripts, to compare your code against known cases</li>
<li>Avoid using global variables, the attach function and <a href="../intro/bad-habits.html">other nasties</a>
where ownership of data cannot be ensured</li>
<li>Use version control so that you see what has changed, and easily trace
mistakes</li>
<li>Wherever possible, open your code and project up for review, either by
colleagues, during the review process, or in repositories such as GitHub.</li>
<li>The more <em>readable</em> your code is, the less likely it is to contain
errors.</li>
</ul>
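<p>As a small illustration of the test script idea, the sketch below checks a function against cases worked out by hand. The function <code>standard.error</code> is a hypothetical example rather than something from a package; <code>stopifnot</code> and <code>all.equal</code> are standard base R functions.</p>
<pre><code class="r"># Hypothetical function under test: standard error of the mean.
standard.error <- function(x) {
  sd(x) / sqrt(length(x))
}

# Known cases worked out by hand; stopifnot() raises an error if any check fails.
stopifnot(
  isTRUE(all.equal(standard.error(c(1, 2, 3)), 1 / sqrt(3))),
  isTRUE(all.equal(standard.error(c(2, 2, 2, 2)), 0))
)
</code></pre>
<p>Keeping checks like these in their own script means they can be rerun every time the function changes.</p>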
<blockquote class="twitter-tweet"><p>“Every bug is two bugs: the bug in your code, and the test you didn’t write”@<a href="https://twitter.com/estherbester">estherbester</a> <a href="https://twitter.com/search/%23pycon">#pycon</a></p>— Ned Batchelder (@nedbat) <a href="https://twitter.com/nedbat/status/312628852558032896">March 15, 2013</a></blockquote>
<h3 id="nice-code-runs-quickly-and-is-therefore-a-pleasure-to-use">Nice code runs quickly and is therefore a pleasure to use</h3>
<blockquote>
<p>The faster you can make the plot, the more fun you will have.</p>
</blockquote>
<p>Code that is slow to run is less fun to use. By <em>slow</em> I mean anything
that takes more than a few seconds to run, and so impedes analysis.
Speed is particularly an issue for people analysing large datasets, or
running complex simulations, where code may run for many hours, days,
or weeks.</p>
<p>Some effective strategies for making code run faster:</p>
<ul>
<li>Abstract your code into functions, so that you can compare different
versions</li>
<li>Use code profiling to identify the main computational bottlenecks
and improve them</li>
<li>Think carefully about algorithm design</li>
<li>Understand why some operations are intrinsically slower
than others, e.g. why a <code>for</code> loop that grows its result one element at a time is slower than using <code>lapply</code></li>
<li>Use multiple processors to increase computing power, either in your
own machine or by running your code on a cluster.</li>
</ul>
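<p>To make the first of these strategies concrete, here is a minimal sketch that wraps two versions of the same calculation in functions, checks that they agree, and then times them with <code>system.time</code>. The function names and the value of <code>n</code> are assumptions chosen for illustration, and the timings will vary between machines.</p>
<pre><code class="r"># Two hypothetical versions of the same calculation: the squares of 1..n.
squares.grow <- function(n) {
  out <- c()
  for (i in seq_len(n))
    out <- c(out, i^2)  # grows (and copies) the vector on every iteration
  out
}
squares.vapply <- function(n) {
  vapply(seq_len(n), function(i) i^2, numeric(1))
}

# Check that the two versions agree before comparing their speed.
n <- 20000
stopifnot(identical(squares.grow(n), squares.vapply(n)))

system.time(squares.grow(n))
system.time(squares.vapply(n))
</code></pre>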