---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = ">",
fig.path = "man/figures/README-"
)
```
```{r library, echo = FALSE}
library(packageRank)
```
## packageRank: compute and visualize package download counts and percentile ranks
['packageRank'](https://CRAN.R-project.org/package=packageRank) is an R package that helps put package download counts into context. It does so via two core functions, `cranDownloads()` and `packageRank()`, a set of filters that reduce download count inflation, and a host of other assorted functions.
You can read more about the package in the sections below:
* [I Download Counts](#i---download-counts) describes how `cranDownloads()` gives [`cranlogs::cran_downloads()`](https://r-hub.github.io/cranlogs/reference/cran_downloads.html) a more user-friendly interface and makes visualizing those data easy via its generic R `plot()` method.
* [II Download Percentile Ranks](#ii---download-percentile-ranks) describes how `packageRank()` makes use of percentile ranks. This nonparametric statistic computes the percentage of packages with fewer downloads than yours: a package in the 74th percentile has more downloads than 74% of packages. This facilitates comparison and helps you locate your package in the overall distribution of [CRAN](https://CRAN.R-project.org/) package downloads.
* [III Inflation Filters](#iii---inflation-filters) describes four filter functions that remove software and behavioral artifacts that inflate _nominal_ download counts. This functionality is available in `packageRank()` and `packageLog()`.
* [IV Availability of Results](#iv---availability-of-results) discusses when results become available, how to use `logInfo()` to check the availability of results, and the effect of time zones.
* [V Reverse lookup of counts, ranks and percentiles](#v---reverse-lookup-of-counts-ranks-and-percentiles) discusses `queryCount()`, `queryRank()`, `queryPercentile()` and `cranDistribution()`.
* [VI Data Fix A](#vi---data-fix-a) discusses two problems with download counts. The first stems from problems with the logs from the end of 2012 through the beginning of 2013. These are fixed in `fixDate_2012()` and `fixCranlogs()`.
* [VII Data Fix B](#vii---data-fix-b) discusses a problem with ['cranlogs'](https://CRAN.R-project.org/package=cranlogs) that doubles or triples the number of R application downloads between 2023-09-13 and 2023-10-02. This is fixed in `fixRCranlogs()`.
* [VIII Data Note](#viii---data-note) discusses the spike in the download of the Windows version of the R application on Sundays and Wednesdays between 06 November 2022 and 19 March 2023.
* [IX et cetera](#ix---et-cetera) discusses country code top-level domains (e.g., `countryPackage()` and `packageCountry()`), the use of memoization and the internet connection time out problem.
### getting started
To install ['packageRank'](https://cran.r-project.org/package=packageRank) from [CRAN](https://cran.r-project.org/):
```{r cran_install, eval = FALSE}
install.packages("packageRank")
```
To install the development version from GitHub:
```{r gh_install, eval = FALSE}
# You may need to first install 'remotes' via install.packages("remotes").
remotes::install_github("lindbrook/packageRank", build_vignettes = TRUE)
```
### I - download counts
`cranDownloads()` uses all the same arguments as `cranlogs::cran_downloads()`:
```{r cran_downloads, eval = FALSE}
cranlogs::cran_downloads(packages = "HistData")
```
```{r cran_downloads_code, echo = FALSE}
cranlogs::cran_downloads(packages = "HistData", from = "2020-05-01", to = "2020-05-01")
```
The only difference is that `cranDownloads()` adds five features:
#### i) "spell check" for package names
```{r spell_check_fail, eval = FALSE}
cranDownloads(packages = "GGplot2")
```
```
## Error in cranDownloads(packages = "GGplot2") :
## GGplot2: misspelled or not on CRAN.
```
<br/>
```{r spell_check_pass, eval = FALSE}
cranDownloads(packages = "ggplot2")
```
```{r spell_check_pass_code, echo = FALSE}
cranDownloads(packages = "ggplot2", from = "2020-05-01", to = "2020-05-01")
```
<br/>
Note that this also works for inactive or "retired" packages in the [Archive](https://CRAN.R-project.org/src/contrib/Archive/):
```{r check_archive_fail, eval = FALSE}
cranDownloads(packages = "vr")
```
```
## Error in cranDownloads(packages = "vr") :
## vr: misspelled or not on CRAN/Archive.
```
<br/>
```{r check_archive_pass, eval = FALSE}
cranDownloads(packages = "VR")
```
```{r check_archive_pass_code, echo = FALSE}
cranDownloads(packages = "VR", from = "2020-05-01", to = "2020-05-01")
```
<br/>
#### ii) two additional date formats
With `cranlogs::cran_downloads()`, you specify a time frame using the `from` and `to` arguments. The downside of this is that you _must_ use "yyyy-mm-dd". For convenience's sake, `cranDownloads()` also allows you to use "yyyy-mm" or yyyy ("yyyy" also works).
##### "yyyy-mm"
Let's say you want the download counts for ['HistData'](https://CRAN.R-project.org/package=HistData) for February 2020. With `cranlogs::cran_downloads()`, you'd have to type out the whole date and remember that 2020 was a leap year:
```{r yyyy-mm_1, eval = FALSE}
cranlogs::cran_downloads(packages = "HistData", from = "2020-02-01",
to = "2020-02-29")
```
<br/>
With `cranDownloads()`, you can just specify the year and month:
```{r yyyy-mm_2, eval = FALSE}
cranDownloads(packages = "HistData", from = "2020-02", to = "2020-02")
```
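As a rough illustration of what the "yyyy-mm" shortcut saves you from doing by hand, the sketch below expands a year-month string into the first and last day of that month using base R. Note that `expand_year_month()` is a hypothetical helper written for this example: it is not a 'packageRank' function and not necessarily how `cranDownloads()` resolves dates internally.

```r
# Hypothetical helper (not part of 'packageRank'): expand "yyyy-mm" into the
# first and last calendar days of that month.
expand_year_month <- function(ym) {
  first <- as.Date(paste0(ym, "-01"))
  # seq() by "month" yields the first day of the next month; step back a day.
  last <- seq(first, by = "month", length.out = 2)[2] - 1
  c(from = format(first), to = format(last))
}

expand_year_month("2020-02")  # leap year: February ends on the 29th
expand_year_month("2019-02")  # common year: February ends on the 28th
```

Because the month length comes from date arithmetic, there is no need to remember leap years.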
##### yyyy or "yyyy"
Let's say you want the download counts for ['rstan'](https://CRAN.R-project.org/package=rstan) for 2020. With `cranlogs::cran_downloads()`, you'd type something like:
```{r yyyy_1, eval = FALSE}
cranlogs::cran_downloads(packages = "rstan", from = "2020-01-01",
  to = "2020-12-31")
```
<br/>
With `cranDownloads()`, you can use:
```{r yyyy_2, eval = FALSE}
cranDownloads(packages = "rstan", from = 2020, to = 2020)
```
or
```{r yyyy_3, eval = FALSE}
cranDownloads(packages = "rstan", from = "2020", to = "2020")
```
<br/>
#### iii) shortcuts with `from = ` and `to = ` in `cranDownloads()`
These additional date formats help to create convenient shortcuts. Let's say you want the year-to-date download counts for ['rstan'](https://CRAN.R-project.org/package=rstan). With `cranlogs::cran_downloads()`, you'd type something like:
```{r yyyy_4, eval = FALSE}
cranlogs::cran_downloads(packages = "rstan", from = "2023-01-01",
to = Sys.Date() - 1)
```
<br/>
With `cranDownloads()`, you can just pass the current year to `from =`:
```{r yyyy_5, eval = FALSE}
cranDownloads(packages = "rstan", from = 2023)
```
And if you wanted the entire download history, pass the current year to `to =`:
```{r yyyy_6, eval = FALSE}
cranDownloads(packages = "rstan", to = 2023)
```
Note that the Posit/RStudio logs begin on 01 October 2012.
#### iv) check date validity
```{r check_date, eval = FALSE}
cranDownloads(packages = "HistData", from = "2019-01-15",
to = "2019-01-35")
```
```
## Error in resolveDate(to, type = "to") : Not a valid date.
```
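For a sense of what a date validity check can look like, here is a minimal base R sketch. `is_valid_date()` is a hypothetical helper for illustration only. The error above comes from the package's own `resolveDate()`, whose internals are not reproduced here.

```r
# Hypothetical helper (not a 'packageRank' function): TRUE if `x` can be
# parsed as a "yyyy-mm-dd" calendar date, FALSE otherwise.
is_valid_date <- function(x) {
  # With an explicit format, as.Date() returns NA when parsing fails.
  !is.na(as.Date(x, format = "%Y-%m-%d"))
}

is_valid_date("2019-01-15")
is_valid_date("2019-01-35")  # there is no 35th of January
```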
#### v) cumulative count for selected time frame
```{r cranDownloads_cumulative, eval = FALSE}
cranDownloads(packages = "HistData", when = "last-week")
```
```{r cranDownloads_cumulative_code, echo = FALSE}
cranDownloads(packages = "HistData", from = "2020-05-01", to = "2020-05-07")
```
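The cumulative count is simply a running total of the daily counts. A minimal sketch with made-up numbers (these are not real CRAN data, and the column name `cumulative` is an assumption for this example):

```r
# Made-up daily counts for one week (illustrative values, not real CRAN data).
dat <- data.frame(
  date = as.Date("2020-05-01") + 0:6,
  count = c(10, 12, 8, 15, 20, 5, 9)
)

# Running total over the selected time frame.
dat$cumulative <- cumsum(dat$count)
dat
```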
#### pro.mode
The "spell check" or validation of packages described above requires some additional background downloads. While those data are cached via the 'memoise' package, this does add time the first time `cranDownloads()` is run. For faster results, which bypass those features, set `pro.mode = TRUE`. The downside is that you'll see zero downloads for packages on dates before they're published on CRAN, you'll see zero downloads for misspelled/non-existent packages and you can't use the `to =` argument alone.
For example, 'packageRank' was first published on CRAN on 2019-05-16 (you can verify this via `packageHistory("packageRank")`). If you use `cranlogs::cran_downloads()` or `cranDownloads(pro.mode = TRUE)` with a time frame that starts before that date, you'll see zero downloads for the days before publication:
```{r pro_mode_ex_ante}
cranDownloads("packageRank", from = "2019-05-10", to = "2019-05-16", pro.mode = TRUE)
```
You'll notice this particularly when you include newer packages in `cranDownloads()`.
If you misspell a package name, you'll see zero downloads:
```{r pro_mode_non_existent}
cranDownloads("vr", from = "2019-05-10", to = "2019-05-16", pro.mode = TRUE)
```
If you just use `to =` without a value for `from =`, you'll get an error:
```{r pro_mode_to, eval = FALSE}
cranDownloads(to = 2024, pro.mode = TRUE)
```
```
Error: You must also provide a date for "from".
```
<br/>
### visualizing package download counts
`cranDownloads()` makes visualizing package downloads easy by using `plot()`:
```{r cranDownloads_viz1, fig.align = "center"}
plot(cranDownloads(packages = "HistData", from = "2019", to = "2019"))
```
If you pass a vector of package names for a single day, `plot()` returns a dotchart:
```{r cranDownloads_viz2a, fig.align = "center"}
plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"),
from = "2020-03-01", to = "2020-03-01"))
```
If you pass a vector of package names for multiple days, `plot()` uses ['ggplot2'](https://CRAN.R-project.org/package=ggplot2) facets:
```{r cranDownloads_viz2, fig.align = "center"}
plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"),
from = "2020", to = "2020-03-20"))
```
<br/>
To plot those data in a single frame, set `multi.plot = TRUE`:
```{r cranDownloads_viz3}
plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"),
from = "2020", to = "2020-03-20"), multi.plot = TRUE)
```
<br/>
To plot those data in separate plots on the same scale, set `graphics = "base"` and you'll be prompted for each plot:
```{r cranDownloads_viz4, eval = FALSE}
plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"),
from = "2020", to = "2020-03-20"), graphics = "base")
```
To do the above on separate, independent scales, set `same.xy = FALSE`:
```{r cranDownloads_viz5, eval = FALSE}
plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"),
from = "2020", to = "2020-03-20"), graphics = "base", same.xy = FALSE)
```
#### logarithm of download counts
To use the base 10 logarithm of the download count in a plot, set `log.y = TRUE`:
```{r log_count}
plot(cranDownloads(packages = "HistData", from = "2019", to = "2019"),
log.y = TRUE)
```
Note that for the sake of the plot, zero counts are replaced by ones so that the logarithm can be computed (This does not affect the data returned by `cranDownloads()`).
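The zero-replacement step can be sketched as follows. This is an illustration of the idea with made-up counts, not the package's internal code:

```r
# Made-up daily counts that include zeroes (not real CRAN data).
counts <- c(0, 3, 10, 0, 100)

# Work on a copy so the original data are left untouched.
plot.counts <- counts
plot.counts[plot.counts == 0] <- 1

# log10(0) is -Inf; log10(1) is 0, so zero-count days plot at the baseline.
log10(plot.counts)
```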
#### `packages = NULL`
`cranlogs::cran_downloads(packages = NULL)` computes the total number of package downloads from [CRAN](https://cran.r-project.org/). You can plot these data by using:
```{r null_packages}
plot(cranDownloads(from = 2019, to = 2019))
```
Note that I sometimes get a "Gateway Timeout (HTTP 504)" error when using this function for long time periods. This may be due to traffic but alternatively could be related to ['cranlogs' issue #56](https://github.com/r-hub/cranlogs/issues/56). As a workaround, `annualDownloads()` downloads the data for each year individually and then re-assembles them into a single data frame. This, of course, takes more time but seems to be more reliable.
```{r annualDownloads}
plot(annualDownloads(start.yr = 2013, end.yr = 2023))
```
Note that in the plot above, three historical outlier days are highlighted: "2014-11-17", "2018-10-21" and "2020-02-29". The first was due to a disproportionate download of six packages: 'BayHaz', 'clhs', 'GPseq', 'OPI', 'YaleToolkit' and 'survsim'. The second date was due to downloads of 'tidyverse' (~700x the second place package 'Rcpp'). The third is possibly related to some kind of scripting error that overlooked the fact that it was a leap day. You can validate this using `packageLog()`.
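The year-by-year workaround behind `annualDownloads()` (fetch each year separately, then reassemble) can be sketched in base R. This is only an illustration of the pattern, not the function's actual source: `fetch_by_year()` and `fake_fetch()` are hypothetical names, and the stand-in download function returns made-up counts so the example runs offline.

```r
# Hypothetical helper: call `fetch(from, to)` once per calendar year and
# row-bind the pieces. `fetch` is any function that returns a data frame.
fetch_by_year <- function(years, fetch) {
  pieces <- lapply(years, function(yr) {
    fetch(from = paste0(yr, "-01-01"), to = paste0(yr, "-12-31"))
  })
  do.call(rbind, pieces)
}

# Offline stand-in for a real download function (made-up counts).
fake_fetch <- function(from, to) {
  days <- seq(as.Date(from), as.Date(to), by = "day")
  data.frame(date = days, count = rep(1, length(days)))
}

out <- fetch_by_year(2019:2020, fake_fetch)
nrow(out)
```

With a network connection, you could pass something like `function(from, to) cranlogs::cran_downloads(from = from, to = to)` in place of `fake_fetch`. Each request then covers only one year, which is the point of the workaround.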
#### `packages = "R"`
`cranlogs::cran_downloads(packages = "R")` computes the total number of downloads of the R application (note that you can only use "R" or a vector of package names, not both!). You can plot these data by using:
```{r r_downloads}
plot(cranDownloads(packages = "R", from = 2019, to = 2019))
```
If you want the total count of R downloads, set `r.total = TRUE`:
```{r r_total, eval = FALSE}
plot(cranDownloads(packages = "R", from = 2019, to = 2019), r.total = TRUE)
```
Note that starting on Sunday, 06 November 2022 and Wednesday, 18 January 2023, there have been spikes of downloads of the Windows version of R on Sundays and Wednesdays (details below in [R Windows Sunday and Wednesday downloads](#r-windows-sunday-and-wednesday-downloads)).
#### smoothers and confidence intervals
To add a smoother to your plot, use `smooth = TRUE`:
```{r lowess}
plot(cranDownloads(packages = "rstan", from = "2019", to = "2019"),
smooth = TRUE)
```
With graphs that use 'ggplot2', `se = TRUE` will add a confidence interval:
```{r ci}
plot(cranDownloads(packages = c("HistData", "rnaturalearth", "Zelig"),
from = "2020", to = "2020-03-20"), smooth = TRUE, se = TRUE)
```
In general, loess is the chosen smoother. Note that with base graphics, lowess is used when there are 7 or fewer observations. Thus, to control the degree of smoothness, you'll typically use the `span` argument (the default is span = 0.75). With base graphics with 7 or fewer observations, you control the degree of smoothness using the `f` argument (the default is f = 2/3):
```{r span, eval = FALSE}
plot(cranDownloads(packages = c("HistData", "rnaturalearth", "Zelig"),
from = "2020", to = "2020-03-20"), smooth = TRUE, span = 0.75)
plot(cranDownloads(packages = c("HistData", "rnaturalearth", "Zelig"),
from = "2020", to = "2020-03-20"), smooth = TRUE, graphics = "ggplot2",
span = 0.33)
```
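To see what `span` and `f` do outside of `plot()`, here is a small sketch that calls `stats::loess()` and `stats::lowess()` directly on simulated data. The counts are made up, and this is not how the plot method is implemented.

```r
# Simulated daily counts (illustrative only, not real CRAN data).
set.seed(1)
day <- 1:60
count <- 50 + 10 * sin(day / 8) + rnorm(60, sd = 4)

# loess: `span` is the fraction of points used in each local fit.
# Smaller values follow the data more closely; larger values smooth more.
fit.default <- loess(count ~ day, span = 0.75)
fit.wiggly <- loess(count ~ day, span = 0.33)

# lowess: the analogous argument is `f`.
fit.lowess <- lowess(day[1:7], count[1:7], f = 2 / 3)

head(fitted(fit.default))
head(fitted(fit.wiggly))
fit.lowess$y
```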
#### package and R release dates
To annotate a graph with a package's release dates (base graphics only):
```{r pkg_release_date}
plot(cranDownloads(packages = "rstan", from = "2019", to = "2019"),
package.version = TRUE)
```
To annotate a graph with R release dates:
```{r r_release_date}
plot(cranDownloads(packages = "rstan", from = "2019", to = "2019"),
r.version = TRUE)
```
#### plot growth curves (cumulative download counts)
To plot growth curves, set `statistic = "cumulative"`:
```{r cranDownloads_growth_curves}
plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"),
from = "2020", to = "2020-03-20"), statistic = "cumulative",
multi.plot = TRUE, points = FALSE)
```
#### population plot
To visualize a package's downloads relative to "all" other packages over time:
```{r pop_plot, eval = FALSE}
plot(cranDownloads(packages = "HistData", from = "2020", to = "2020-03-20"),
population.plot = TRUE)
```
```{r pop_plot_code, echo = FALSE}
plot(cranDownloads(packages = "HistData", from = "2020", to = "2020-03-20"),
population.plot = TRUE, population.seed = 1)
```
This longitudinal view plots the date (x-axis) against the base 10 logarithm of the selected package's download counts (y-axis). To get a sense of how the selected package's performance stacks up against "all" other packages, a set of smoothed curves representing a stratified random sample of packages is plotted in gray in the background (this is the "typical" pattern of downloads on [CRAN](https://cran.r-project.org/) for the selected time period).^[Specifically, within each 5% interval of percentile ranks (e.g., 0 to 5, 5 to 10, 95 to 100, etc.), a random sample of 5% of packages is selected and tracked.]
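The stratified sampling described in the footnote can be sketched with simulated data. This is one reading of that description under stated assumptions (simulated counts, hypothetical package names, half-open 5% bins), not the package's actual code.

```r
set.seed(1)

# Simulated one-day download counts for 2,000 hypothetical packages.
downloads <- rpois(2000, lambda = rexp(2000, rate = 1 / 50))
names(downloads) <- paste0("pkg", seq_along(downloads))

# Percentile rank: percentage of packages with strictly fewer downloads.
pct <- 100 * (rank(downloads, ties.method = "min") - 1) / length(downloads)

# Twenty strata: [0, 5), [5, 10), ..., [95, 100].
strata <- cut(pct, breaks = seq(0, 100, by = 5), right = FALSE,
  include.lowest = TRUE)

# Within each stratum, draw about 5% of its packages.
sampled <- unlist(lapply(split(names(downloads), strata), function(pkgs) {
  sample(pkgs, size = ceiling(0.05 * length(pkgs)))
}), use.names = FALSE)

length(sampled)
```

Sampling within strata, rather than from all packages at once, keeps the rarely downloaded and the heavily downloaded packages represented in the background curves.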
#### unit of observation
The default unit of observation for both `cranDownloads()` and `cranlogs::cran_downloads()` is the day. The graph below plots the daily downloads for ['cranlogs'](https://CRAN.R-project.org/package=cranlogs) from 01 January 2022 through 15 April 2022.
```{r day}
plot(cranDownloads(packages = "cranlogs", from = 2022, to = "2022-04-15"))
```
To view the data from a less granular perspective, change the `unit.observation` argument of `plot.cranDownloads()` from "day" to "week", "month", or "year".
##### `unit.observation = "month"`
The graph below plots the data aggregated by month (with an added smoother):
```{r month}
plot(cranDownloads(packages = "cranlogs", from = 2022, to = "2022-04-15"),
unit.observation = "month", smooth = TRUE, graphics = "ggplot2")
```
Three things to note. First, if the last/current month (far right) is still in-progress (it's not yet the end of the month), that observation will be split in two: one point for the in-progress total (empty black square), another for the estimated total (empty red circle). The estimate is based on the proportion of the month completed. In the example above, the 635 observed downloads from April 1 through April 15 translates into an estimate of 1,270 downloads for the entire month (30 / 15 * 635). Second, if a smoother is included, it will only use "complete" observations, not in-progress or estimated data. Third, all points are plotted along the x-axis on the first day of the month.
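The in-progress estimate is a simple proportional scaling. Here is a sketch of the arithmetic, using the numbers from the example above (`estimate_month_total()` is a hypothetical helper, not a 'packageRank' function):

```r
# Hypothetical helper: scale the observed in-progress total by the
# proportion of the month completed.
estimate_month_total <- function(observed, last.obs.date) {
  last.obs.date <- as.Date(last.obs.date)
  first <- as.Date(format(last.obs.date, "%Y-%m-01"))
  next.first <- seq(first, by = "month", length.out = 2)[2]
  days.in.month <- as.numeric(difftime(next.first, first, units = "days"))
  days.elapsed <- as.integer(format(last.obs.date, "%d"))
  observed * days.in.month / days.elapsed
}

# 635 downloads through April 15: 30 / 15 * 635
estimate_month_total(635, "2022-04-15")
```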
##### `unit.observation = "week"`
The graph below plots the data aggregated by week (weeks begin on Sunday).
```{r week}
plot(cranDownloads(packages = "cranlogs", from = 2022, to = "2022-06-15"),
unit.observation = "week", smooth = TRUE)
```
Four things to note. First, if the first week (far left) is incomplete (the 'from' date is not a Sunday), that observation will be split in two: one point for the observed total on the start date (gray empty square) and another point for the *backdated* total. Backdating involves completing the week by pushing the nominal start date back to include the previous Sunday (blue asterisk). In the example above, the nominal start date (01 January 2022) is moved back to include data through the previous Sunday (26 December 2021). This is useful because with a weekly unit of observation the first observation is likely to be truncated and would not give the most representative picture of the data. Second, if the last week (far right) is in-progress (the 'to' date is not a Saturday), that observation will be split in two: the observed total (gray empty square) and the estimated total based on the proportion of the week completed (red empty circle). Third, just like the monthly plot, smoothers only use complete observations, including backdated data but excluding in-progress and estimated data. Fourth, with the exception of the first week's observed count, which is plotted at its nominal date, points are plotted along the x-axis on Sundays, the first day of the week.
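Backdating and the in-progress estimate both come down to date arithmetic. A minimal sketch, assuming weeks that run Sunday through Saturday as described above (both helpers are hypothetical, and the count of 40 is made up):

```r
# Hypothetical helper: the Sunday on or before `date`.
# format(date, "%w") gives the weekday as 0-6, with Sunday = 0.
previous_sunday <- function(date) {
  date <- as.Date(date)
  date - as.integer(format(date, "%w"))
}

# Hypothetical helper: scale an in-progress weekly total by the
# proportion of the week completed.
estimate_week_total <- function(observed, last.obs.date) {
  days.elapsed <- as.integer(format(as.Date(last.obs.date), "%w")) + 1
  observed * 7 / days.elapsed
}

previous_sunday("2022-01-01")          # a Saturday, so back six days
estimate_week_total(40, "2022-06-15")  # a Wednesday: 4 of 7 days observed
```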
##### my default plots
For what it's worth, below are my go-to commands for graphs. They take advantage of RStudio IDE's plot history panel, which allows you to cycle through and compare graphs.
Typically, I'll look at the data for the last year or so at the three available units of observation: day, week and month. I use base graphics, via `graphics = "base"`, to take advantage of prompts and "nicer" axes annotation. This also allows me to easily add graphical elements afterwards as needed, e.g., `abline(h = 100, lty = "dotted")`.
```{r default, eval = FALSE}
plot(cranDownloads(packages = c("cholera", "packageRank"), from = 2023),
graphics = "base", package.version = TRUE, smooth = TRUE,
unit.observation = "day")
plot(cranDownloads(packages = c("cholera", "packageRank"), from = 2023),
graphics = "base", package.version = TRUE, smooth = TRUE,
unit.observation = "week")
# Note that I disable smoothing for monthly data
plot(cranDownloads(packages = c("cholera", "packageRank"), from = 2023),
graphics = "base", package.version = TRUE, smooth = FALSE,
unit.observation = "month")
```
#### pro.mode
Perhaps the biggest downside of using `cranDownloads(pro.mode = TRUE)` is that you might draw mistaken inferences from plotting the data because it adds false zeroes for the dates before a package was published.
Using the example of 'packageRank', which was published on 2019-05-16:
```{r pro_mode_plot}
plot(cranDownloads("packageRank", from = "2019-05", to = "2019-05",
pro.mode = TRUE), smooth = TRUE)
```
```{r non_pro_mode_plot}
plot(cranDownloads("packageRank", from = "2019-05", to = "2019-05",
pro.mode = FALSE), smooth = TRUE)
```
### II - download percentile ranks
After spending some time with nominal download counts, the "compared to what?" question will come to mind. For instance, consider the data for the 'cholera' package from the first week of March 2020:
```{r motivation, eval = FALSE}
plot(cranDownloads(packages = "cholera", from = "2020-03-01",
to = "2020-03-07"))
```
```{r motivation_code, echo = FALSE, fig.align = "center"}
par(mar = c(5, 4, 4, 4))
plot(cranDownloads(packages = "cholera", from = "2020-03-01",
to = "2020-03-07"))
par(mar = c(5, 4, 4, 2))
```
Do Wednesday and Saturday reflect surges of interest in the package or surges of traffic to [CRAN](https://CRAN.R-project.org/)? To put it differently, how can we know if a given download count is typical or unusual?
To answer these questions, we can start by looking at the total number of package downloads:
```{r motivation_cran}
plot(cranDownloads(from = "2020-03-01", to = "2020-03-07"))
```
Here we see that there's a big difference between the work week and the weekend. This suggests that the download activity for ['cholera'](https://CRAN.R-project.org/package=cholera) on the weekend is relatively high. Moreover, the Wednesday peak for ['cholera'](https://CRAN.R-project.org/package=cholera) downloads seems higher than the mid-week peak of total downloads.
One way to better address these observations is to locate your package's download counts in the overall frequency distribution of download counts. 'packageRank' allows you to do so via `packageDistribution()`. Below are the distributions of the logarithm of download counts for Wednesday and Saturday. Each vertical segment (along the x-axis) represents a download count. The height of a segment represents that download count's frequency. The location of ['cholera'](https://CRAN.R-project.org/package=cholera) in the distribution is highlighted in red.
```{r packageDistribution, echo = FALSE}
plot_package_distribution <- function(dat, xlim, ylim) {
  freq.dist <- dat$freq.dist
  freqtab <- dat$freqtab
  plot(freq.dist$count, freq.dist$frequency, type = "h", log = "x",
    xlab = "Count", ylab = "Frequency", xlim = xlim, ylim = ylim)
  axis(3, at = freqtab[1], cex.axis = 0.8, padj = 0.9, col.axis = "dodgerblue",
    col.ticks = "dodgerblue", labels = paste(names(freqtab[1]), "=",
    format(freqtab[1], big.mark = ",")))
  abline(v = freqtab[1], col = "dodgerblue", lty = "dotted")
  if (!is.null(dat$package)) {
    pkg.ct <- freqtab[names(freqtab) == dat$package]
    pkg.bin <- freqtab[freqtab == pkg.ct]
    axis(3, at = pkg.ct, labels = format(pkg.ct, big.mark = ","),
      cex.axis = 0.8, padj = 0.9, col.axis = "red", col.ticks = "red")
    abline(v = pkg.ct, col = grDevices::adjustcolor("red", alpha.f = 0.5))
    day <- weekdays(as.Date(dat$date), abbreviate = TRUE)
    title(paste0(dat$package, " @ ", dat$date, " (", day, ")"))
  } else title(paste("Distribution of Package Download Counts:", dat$date))
}
distn.data <- lapply(c("2020-03-04", "2020-03-07"), function(date) packageDistribution(package = "cholera", date = date))
xlim <- range(lapply(distn.data, function(x) x$freq.dist$count))
ylim <- range(lapply(distn.data, function(x) x$freq.dist$frequency))
```
```{r packageDistribution_wed, eval = FALSE}
plot(packageDistribution(package = "cholera", date = "2020-03-04"))
```
```{r packageDistribution_wed_code, echo = FALSE, fig.align = "center"}
plot_package_distribution(distn.data[[1]], xlim, ylim)
```
```{r packageDistribution_sat, eval = FALSE}
plot(packageDistribution(package = "cholera", date = "2020-03-07"))
```
```{r packageDistribution_sat_code, echo = FALSE, fig.align = "center"}
plot_package_distribution(distn.data[[2]], xlim, ylim)
```
While these plots give us a better picture of where ['cholera'](https://CRAN.R-project.org/package=cholera) is located, comparisons between Wednesday and Saturday are still impressionistic: all we can confidently say is that the download counts for both days were greater than the mode.
To facilitate interpretation and comparison, I use the _percentile rank_ of a download count instead of the simple nominal download count. This nonparametric statistic tells you the percentage of packages that had fewer downloads. In other words, it gives you the location of your package relative to the locations of all other packages. More importantly, by rescaling download counts to lie on the bounded interval between 0 and 100, percentile ranks make it easier to compare packages within and across distributions.
For example, we can compare Wednesday ("2020-03-04") to Saturday ("2020-03-07"):
```{r packageRank1}
packageRank(package = "cholera", date = "2020-03-04")
```
On Wednesday, we can see that ['cholera'](https://CRAN.R-project.org/package=cholera) had 38 downloads, came in 5,788th place out of the 18,038 different packages downloaded, and earned a spot in the 68th percentile.
```{r packageRank2}
packageRank(package = "cholera", date = "2020-03-07")
```
On Saturday, we can see that ['cholera'](https://CRAN.R-project.org/package=cholera) had 29 downloads, came in 3,189th place out of the 15,950 different packages downloaded, and earned a spot in the 80th percentile.
So contrary to what the nominal counts tell us, one could say that the interest in ['cholera'](https://CRAN.R-project.org/package=cholera) was actually greater on Saturday than on Wednesday.
#### computing percentile rank
To compute percentile ranks, I do the following. For each package, I tabulate the number of downloads and then compute the percentage of packages with fewer downloads. Here are the details using ['cholera'](https://CRAN.R-project.org/package=cholera) from Wednesday as an example:
```{r percentile}
pkg.rank <- packageRank(packages = "cholera", date = "2020-03-04")
downloads <- pkg.rank$cran.data$count
names(downloads) <- pkg.rank$cran.data$package
round(100 * mean(downloads < downloads["cholera"]), 1)
```
To put it differently:
```{r percentile2}
(pkgs.with.fewer.downloads <- sum(downloads < downloads["cholera"]))
(tot.pkgs <- length(downloads))
round(100 * pkgs.with.fewer.downloads / tot.pkgs, 1)
```
#### competition v. nominal ranks
In the example above, 38 downloads puts 'cholera' in 5,788th place if we allow for ties using [competition](https://en.wikipedia.org/wiki/Ranking#Standard_competition_ranking_(%221224%22_ranking)) (i.e., "1224" ranking) and 5,556th place if we don't by using [nominal/ordinal](https://en.wikipedia.org/wiki/Ranking#Ordinal_ranking_(%221234%22_ranking)) (i.e., "1234" ranking).
Prior to v0.9.2.9008, only nominal/ordinal ranking was available. Competition ranking is now the default via `packageRank(rank.ties = TRUE)`. If you want ordinal ranking, use `packageRank(rank.ties = FALSE)`.
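The difference between the two ranking schemes can be seen with base R's `rank()` on a toy example. The counts and package names below are made up, and the sketch follows the "1224" and "1234" definitions linked above. It does not claim to reproduce how `packageRank()` handles ties internally.

```r
# Made-up download counts; 'b' and 'c' are tied.
downloads <- c(a = 50, b = 38, c = 38, d = 20)

# Rank from most to least downloaded by negating the counts.
competition <- rank(-downloads, ties.method = "min")    # "1224": ties share a rank
ordinal <- rank(-downloads, ties.method = "first")      # "1234": ties broken by order

competition
ordinal
```

With ordinal ranking, tied packages receive different ranks depending on their position in the data, which is why allowing for ties can be the more defensible default.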
### visualizing package download percentile ranks
To visualize `packageRank()`, use `plot()`.
```{r packageRank_plot_wed, eval = FALSE}
plot(packageRank(packages = "cholera", date = "2020-03-04"))
```
```{r packageRank_data, echo = FALSE}
dat <- lapply(c("2020-03-04", "2020-03-07"), function(x) {
packageRank("cholera", date = x)
})
freqtab1 <- dat[[1]]$cran.data$count
freqtab2 <- dat[[2]]$cran.data$count
xlim <- range(seq_along(freqtab1), seq_along(freqtab2))
ylim <- range(c(freqtab1), c(freqtab2))
```
```{r packageRank_plot_code_wed, echo = FALSE, fig.align = "center"}
freqtab <- dat[[1]]$cran.data$count
names(freqtab) <- dat[[1]]$cran.data$package
package.data <- dat[[1]]$package.data
pkg <- dat[[1]]$packages
date <- dat[[1]]$date
y.max <- freqtab[1]
q <- stats::quantile(freqtab)[2:4]
iqr <- vapply(c("75%", "50%", "25%"), function(id) {
dat <- which(freqtab > q[[id]])
dat[length(dat)]
}, numeric(1L))
plot(seq_along(freqtab), c(freqtab), type = "l", xlab = "Rank",
ylab = "log10(Count)", log = "y", xlim = xlim, ylim = ylim)
abline(v = iqr, col = "black", lty = "dotted")
iqr.labels <- c("75th", "50th", "25th")
invisible(lapply(seq_along(iqr), function(i) {
text(iqr[[i]], y.max / 2, labels = iqr.labels[i], cex = 0.75)
}))
abline(v = which(names(freqtab) == pkg), col = "red")
abline(h = freqtab[pkg], col = "red")
pct <- package.data[package.data$package == pkg, "percentile"]
pct.label <- paste0(round(pct, 2), "%")
axis(3, at = which(names(freqtab) == pkg), padj = 0.9, col.axis = "red",
col.ticks = "red", labels = pct.label, cex.axis = 0.8)
axis(4, at = freqtab[pkg], col.axis = "red", col.ticks = "red",
cex.axis = 0.8, labels = format(freqtab[pkg], big.mark = ","))
points(which(names(freqtab) == pkg), freqtab[pkg], col = "red")
points(which(names(freqtab) == names(freqtab[1])), y.max,
col = "dodgerblue")
text(which(names(freqtab) == names(freqtab[1])), y.max, pos = 4,
labels = paste(names(freqtab[1]), "=", format(freqtab[1],
big.mark = ",")), cex = 0.8, col = "dodgerblue")
text(max(xlim), max(ylim),
labels = paste("Tot = ", format(sum(freqtab), big.mark = ",")), cex = 0.8,
col = "dodgerblue", pos = 2)
day <- weekdays(as.Date(date), abbreviate = TRUE)
title(main = paste0(pkg, " @ ", date, " (", day, ")"))
```
<br/>
```{r packageRank_plot_sat, eval = FALSE}
plot(packageRank(packages = "cholera", date = "2020-03-07"))
```
```{r packageRank_plot_code_sat, echo = FALSE, fig.align = "center"}
freqtab <- dat[[2]]$cran.data$count
names(freqtab) <- dat[[2]]$cran.data$package
package.data <- dat[[2]]$package.data
pkg <- dat[[2]]$packages
date <- dat[[2]]$date
y.max <- freqtab[1]
q <- stats::quantile(freqtab)[2:4]
iqr <- vapply(c("75%", "50%", "25%"), function(id) {
dat <- which(freqtab > q[[id]])
dat[length(dat)]
}, numeric(1L))
plot(seq_along(freqtab), c(freqtab), type = "l", xlab = "Rank",
ylab = "log10(Count)", log = "y", xlim = xlim, ylim = ylim)
abline(v = iqr, col = "black", lty = "dotted")
iqr.labels <- c("75th", "50th", "25th")
invisible(lapply(seq_along(iqr), function(i) {
text(iqr[[i]], y.max / 2, labels = iqr.labels[i], cex = 0.75)
}))
abline(v = which(names(freqtab) == pkg), col = "red")
abline(h = freqtab[pkg], col = "red")
pct <- package.data[package.data$package == pkg, "percentile"]
pct.label <- paste0(round(pct, 2), "%")
axis(3, at = which(names(freqtab) == pkg), padj = 0.9, col.axis = "red",
col.ticks = "red", labels = pct.label, cex.axis = 0.8)
axis(4, at = freqtab[pkg], col.axis = "red", col.ticks = "red",
cex.axis = 0.8, labels = format(freqtab[pkg], big.mark = ","))
points(which(names(freqtab) == pkg), freqtab[pkg], col = "red")
points(which(names(freqtab) == names(freqtab[1])), y.max,
col = "dodgerblue")
text(which(names(freqtab) == names(freqtab[1])), y.max, pos = 4,
labels = paste(names(freqtab[1]), "=", format(freqtab[1],
big.mark = ",")), cex = 0.8, col = "dodgerblue")
text(max(xlim), max(ylim),
labels = paste("Tot = ", format(sum(freqtab), big.mark = ",")), cex = 0.8,
col = "dodgerblue", pos = 2)
day <- weekdays(as.Date(date), abbreviate = TRUE)
title(main = paste0(pkg, " @ ", date, " (", day, ")"))
```
The graphs above, which are customized here to be on the same scale, plot the _rank order_ of packages' download counts (x-axis) against the logarithm of those counts (y-axis). Each graph then highlights (in red) a package's position in the distribution along with its percentile rank and download count. In the background, the 75th, 50th and 25th percentiles are plotted as dotted vertical lines. The package with the most downloads, ['magrittr'](https://CRAN.R-project.org/package=magrittr) in both cases, is at top left (in blue). The total number of downloads is at the top right (in blue).
### III - inflation filters
['cranlogs'](https://CRAN.R-project.org/package=cranlogs) computes the number of package downloads by simply counting log entries. While straightforward, this approach can run into problems. Putting aside the question of whether package dependencies should be counted, what I have in mind here is what I believe to be two types of "invalid" log entries. The first, a software artifact, stems from entries that are smaller, often orders of magnitude smaller, than a package's actual binary or source file. The second, a behavioral artifact, emerges from efforts to download all of [CRAN](https://cran.r-project.org/). In both cases, a reliance on nominal counts will give you an inflated sense of the degree of interest in your package. For those interested, an early but detailed analysis and discussion of both types of inflation is included as part of this [R-hub blog post](https://blog.r-hub.io/2020/05/11/packagerank-intro/#inflationary-bias-of-download-counts).
#### software artifacts
When looking at package download logs, the first thing you'll notice are wrongly sized log entries. They come in two sizes. The "small" entries are approximately 500 bytes in size. The "medium" entries vary in size, falling somewhere between a "small" entry and a full download (i.e., "small" <= "medium" <= full download). "Small" entries manifest themselves as standalone entries, paired with a full download, or as part of a triplet alongside a "medium" and a full download. "Medium" entries manifest themselves as either standalone entries or as part of a triplet.
The example below illustrates a triplet:
```{r triplet}
packageLog(date = "2020-07-01")[4:6, -(4:6)]
```
The "medium" entry is the first observation (99,622 bytes). The full download is the second entry (4,161,948 bytes). The "small" entry is the last observation (536 bytes). At a minimum, what makes a triplet a triplet (or a pair a pair) is that all members share system configuration (e.g. IP address, etc.) and have identical or adjacent time stamps.
To deal with the inflationary effect of "small" entries, I filter out observations smaller than 1,000 bytes (the smallest package on [CRAN](https://cran.r-project.org/) appears to be ['LifeInsuranceContracts'](https://cran.r-project.org/package=LifeInsuranceContracts), whose source file weighs in at 1,100 bytes). "Medium" entries are harder to handle. I remove them using a filter function that looks up a package's actual size.
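The logic of the two size-based filters can be sketched on a toy log. This is only an illustration under stated assumptions: the package name 'pkgA' is hypothetical, the sizes echo the triplet described above, and the package's actual file size is assumed to be known and equal to the full download. The real filters may allow for some tolerance when comparing sizes.

```r
# Toy log (hypothetical package; sizes echo the triplet described above).
log <- data.frame(
  package = c("pkgA", "pkgA", "pkgA"),
  size = c(99622, 4161948, 536),
  stringsAsFactors = FALSE
)

# "Small" entries: drop anything under 1,000 bytes.
small.threshold <- 1000
log.small <- log[log$size >= small.threshold, ]

# "Medium" entries: drop anything smaller than the package's actual file.
# The actual size is an assumption here (taken to be the full download).
actual.size <- c(pkgA = 4161948)
log.full <- log.small[log.small$size >= actual.size[log.small$package], ]

nrow(log)       # 3 entries before filtering
nrow(log.small) # 2 entries after removing the "small" entry
nrow(log.full)  # 1 entry after also removing the "medium" entry
```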
#### behavioral artifacts
While wrongly sized entries are fairly easy to spot, seeing the effect of efforts to download all of [CRAN](https://cran.r-project.org/) requires a change of perspective. While details and further evidence can be found in the [R-hub blog post](https://blog.r-hub.io/2020/05/11/packagerank-intro/#inflationary-bias-of-download-counts) mentioned above, I'll illustrate the problem with the following example:
```{r, sequence_ex, eval = FALSE}
packageLog(packages = "cholera", date = "2020-07-31")[8:14, -(4:6)]
```
```
> date time size package version country ip_id
> 132509 2020-07-31 21:03:06 3797776 cholera 0.2.1 US 14
> 132106 2020-07-31 21:03:07 4285678 cholera 0.4.0 US 14
> 132347 2020-07-31 21:03:07 4109051 cholera 0.3.0 US 14
> 133198 2020-07-31 21:03:08 3766514 cholera 0.5.0 US 14
> 132630 2020-07-31 21:03:09 3764848 cholera 0.5.1 US 14
> 133078 2020-07-31 21:03:11 4275831 cholera 0.6.0 US 14
> 132644 2020-07-31 21:03:12 4284609 cholera 0.6.5 US 14
```
Here, we see that seven different versions of the package were downloaded as a sequential bloc. A little digging shows that these seven versions represent _all_ versions of 'cholera' available on that date:
```{r, cholera_history, eval = FALSE}
packageHistory(package = "cholera")
```
```
> Package Version Date Repository
> 1 cholera 0.2.1 2017-08-10 Archive
> 2 cholera 0.3.0 2018-01-26 Archive
> 3 cholera 0.4.0 2018-04-01 Archive
> 4 cholera 0.5.0 2018-07-16 Archive
> 5 cholera 0.5.1 2018-08-15 Archive
> 6 cholera 0.6.0 2019-03-08 Archive
> 7 cholera 0.6.5 2019-06-11 Archive
> 8 cholera 0.7.0 2019-08-28 CRAN
```
While there are "legitimate" reasons for downloading past versions (e.g., research, container-based software distribution, etc.), I'd argue that examples like the above are "fingerprints" of efforts to download [CRAN](https://cran.r-project.org/). While this is not necessarily problematic, it does mean that when your package is downloaded as part of such efforts, that download is more a reflection of an interest in [CRAN](https://cran.r-project.org/) itself (a collection of packages) than of an interest in your package _per se_. And since one of the uses of counting package downloads is to assess interest in _your_ package, it may be useful to exclude such entries.
To do so, I try to filter out these entries in two ways. The first identifies IP addresses that download "too many" packages and then filters out _campaigns_, large blocs of downloads that occur in (nearly) alphabetical order. The second looks for campaigns not associated with "greedy" IP addresses and filters out sequences of past versions downloaded in a narrowly defined time window.
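To make the second approach more concrete, here is a minimal sketch that flags the 'cholera' bloc shown above. The thresholds (`min.versions`, `window.secs`) are hypothetical choices made for illustration; they are not the values or the logic used by the package's own sequence filter.

```r
# Entries copied from the 'cholera' example above (ip_id 14, 31 July 2020).
log <- data.frame(
  time = c("21:03:06", "21:03:07", "21:03:07", "21:03:08", "21:03:09",
    "21:03:11", "21:03:12"),
  version = c("0.2.1", "0.4.0", "0.3.0", "0.5.0", "0.5.1", "0.6.0", "0.6.5"),
  ip_id = 14,
  stringsAsFactors = FALSE
)

# Version on CRAN at that date, per the packageHistory() output above.
current.version <- "0.7.0"

# Hypothetical thresholds: 3 or more distinct past versions within 60 seconds.
min.versions <- 3
window.secs <- 60

timestamp <- as.POSIXct(paste("2020-07-31", log$time), tz = "UTC")
past <- log$version != current.version
span <- as.numeric(difftime(max(timestamp[past]), min(timestamp[past]),
  units = "secs"))

is.sequence <- length(unique(log$version[past])) >= min.versions &&
  span <= window.secs

# Drop the past versions if they look like a sequence.
filtered <- if (is.sequence) log[!past, ] else log
nrow(filtered)
```

Here all seven past versions arrive within six seconds from the same IP address, so the whole bloc is flagged and removed.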
#### example usage
To get an idea of how inflated your package's download count may be, use `filteredDownloads()`. Below are the results for 'ggplot2' for 15 September 2021.
```{r, filteredDownloads}
filteredDownloads(package = "ggplot2", date = "2021-09-15")
```
While there were 113,842 nominal downloads, applying all the filters reduced that number to 111,662, an inflation of 1.95%.
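The 1.95% figure is consistent with expressing the excess downloads as a percentage of the filtered count. A quick check of the arithmetic, using the two counts quoted above:

```r
nominal <- 113842   # nominal downloads
filtered <- 111662  # downloads after applying all filters

# Excess downloads relative to the filtered count, in percent.
inflation <- 100 * (nominal - filtered) / filtered
round(inflation, 2)
```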
Excluding the time it takes to download the log file (typically the bulk of the computation time), the above example takes approximately 15 additional seconds to run on a single core on a 3.1 GHz Dual-Core Intel Core i5 processor.
There are 4 filters. You can control them using the following arguments (listed in order of application):
* `ip.filter`: removes campaigns of "greedy" IP addresses.
* `small.filter`: removes entries smaller than 1,000 bytes.
* `sequence.filter`: removes blocs of past versions.
* `size.filter`: removes entries smaller than a package's binary or source file.
For `filteredDownloads()`, they are all on by default. For `packageLog()` and `packageRank()`, they are off by default. To apply them, simply set the argument for the filter you want to TRUE:
```{r, small_filter, eval = FALSE}
packageRank(package = "cholera", small.filter = TRUE)
```
Alternatively, for `packageLog()` and `packageRank()` you can simply set `all.filters = TRUE`.
```{r, all_filters, eval = FALSE}
packageRank(package = "cholera", all.filters = TRUE)
```
Note that `all.filters = TRUE` is contextual. Depending on the function used, you'll either get the CRAN-specific or the package-specific set of filters. The former sets `ip.filter = TRUE` and `small.filter = TRUE`; it works independently of packages at the level of the entire log. The latter sets `sequence.filter = TRUE` and `size.filter = TRUE`; it relies on package-specific information (e.g., size of source or binary file).
Ideally, we'd like to use both sets. However, the package-specific set is computationally expensive because its filters need to be applied individually to all packages in the log, which can involve tens of thousands of packages. While not infeasible, this currently takes a long time. For this reason, when `all.filters = TRUE`, `packageRank()`, `ipPackage()`, `countryPackage()`, `countryDistribution()` and `packageDistribution()` use only CRAN-specific filters while `packageLog()`, `packageCountry()`, and `filteredDownloads()` use both [CRAN](https://cran.r-project.org/) and package-specific filters.
### IV - availability of results
To understand when results become available, you need to be aware that ['packageRank'](https://CRAN.R-project.org/package=packageRank) has two upstream, online dependencies. The first is Posit/RStudio's [CRAN package download logs](http://cran-logs.rstudio.com/), which record traffic to the "0-Cloud" mirror at cloud.r-project.org (formerly Posit/RStudio's CRAN mirror). The second is Gábor Csárdi's ['cranlogs'](https://CRAN.R-project.org/package=cranlogs) R package, which uses those logs to compute the download counts of both the R application and R packages.
The [CRAN package download logs](http://cran-logs.rstudio.com/) for the _previous_ day are typically posted by 17:00 UTC. The results for ['cranlogs'](https://CRAN.R-project.org/package=cranlogs) usually become available soon thereafter (sometimes as much as a day later).
#### why aren't today's logs and results available?
Occasionally problems with "today's" data can emerge due to the upstream dependencies (illustrated below).
```
CRAN Download Logs --> 'cranlogs' --> 'packageRank'
```
If there's a problem with the [logs](http://cran-logs.rstudio.com/) (e.g., they're not posted on time), both ['cranlogs'](https://CRAN.R-project.org/package=cranlogs) and ['packageRank'](https://CRAN.R-project.org/package=packageRank) will be affected. If this happens, you'll see things like unexpected zero counts for your package(s) (actually, you'll see a zero download count for both your package and for all of [CRAN](https://cran.r-project.org/)), data from "yesterday", or a "Log is not (yet) on the server" error message.
```
'cranlogs' --> packageRank::cranDownloads()
```
If there's a problem with ['cranlogs'](https://CRAN.R-project.org/package=cranlogs) but not with the [logs](http://cran-logs.rstudio.com/), only `packageRank::cranDownloads()` will be affected. In that case, you might get a warning that only "previous" results will be used. All other ['packageRank'](https://CRAN.R-project.org/package=packageRank) functions should work since they either directly access the logs or use some other source. Usually, these errors resolve themselves the next time the underlying scripts are run ("tomorrow", if not sooner).
#### `logInfo()`
To check the status of the download logs and 'cranlogs', use `logInfo()`. This function checks whether 1) "today's" log is posted on Posit/RStudio's server and 2) "today's" results have been computed by 'cranlogs'.
```{r logInfo0, eval = FALSE}
logInfo()
```
```
$`Today's log/result`
[1] "2023-02-01"
$`Today's log posted?`
[1] "Yes"
$`Today's results on 'cranlogs'?`
[1] "No"
$status
[1] "Today's log is typically posted by 01 Feb 09:00 PST | 01 Feb 17:00 UTC."
```
#### time zones
Because you're typically interested in _today's_ log file, another thing that affects availability is your time zone. For example, let's say that it's 09:01 on 01 January 2021 and you want to compute the percentile rank for ['ergm'](https://CRAN.R-project.org/package=ergm) for the last day of 2020. You might be tempted to use the following:
```{r timezone, eval = FALSE}
packageRank(packages = "ergm")
```
However, depending on _where_ you make this request, you may not get the data you expect. In Honolulu, USA, you will. In Sydney, Australia you won't. The reason is that you've somehow forgotten a key piece of trivia: Posit/RStudio typically posts _yesterday's_ log around 17:00 UTC the following day.
The expression works in Honolulu because 09:01 HST on 01 January 2021 is 19:01 UTC 01 January 2021. So the log you want has been available for 2 hours. The expression fails in Sydney because 09:01 AEDT on 01 January 2021 is 31 December 2020 22:01 UTC. The log you want won't actually be available for another 19 hours.
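The time zone arithmetic can be checked with base R. This is a small sketch that assumes the Olson time zone names "Pacific/Honolulu" and "Australia/Sydney" are available on your system and that the log is posted at exactly 17:00 UTC:

```r
local.time <- "2021-01-01 09:01"

# The same wall-clock time in two different time zones.
honolulu <- as.POSIXct(local.time, tz = "Pacific/Honolulu")
sydney <- as.POSIXct(local.time, tz = "Australia/Sydney")

# Convert each to UTC.
format(honolulu, "%Y-%m-%d %H:%M", tz = "UTC")
format(sydney, "%Y-%m-%d %H:%M", tz = "UTC")

# Hours until the 2020-12-31 log is posted (negative means already posted).
posting <- as.POSIXct("2021-01-01 17:00", tz = "UTC")
difftime(posting, honolulu, units = "hours")
difftime(posting, sydney, units = "hours")
```

The Honolulu difference is about -2 hours (the log is already up) while the Sydney difference is about +19 hours.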
To make life a little easier, ['packageRank'](https://CRAN.R-project.org/package=packageRank) does two things. First, when the log for the date you want is not available (due to time zone rather than server issues), you'll just get the last available log. Second, if you specified a date in the future, you'll either get an error message or a warning with an estimate of when the log you want should be available.
Using the Sydney example and the expression above, you'd get the results for 30 December 2020:
```{r sydney, eval = FALSE}
packageRank(packages = "ergm")
```
```{r sydney_code, echo = FALSE}
packageRank(packages = "ergm", date = "2020-12-30")
```
If you had specified the date, you'd get an additional warning:
```{r sydneyB, eval = FALSE}
packageRank(packages = "ergm", date = "2021-01-01")
```
```{r sydney_codeB, echo = FALSE}
packageRank(packages = "ergm", date = "2020-12-30")
```
```
Warning message:
2020-12-31 log arrives in ~19 hours at 02 Jan 04:00 AEDT. Using previous!
```
Keep in mind that 17:00 UTC is not a hard deadline. Barring server issues, the logs are usually posted a little _before_ that time. I don't know when the script starts but the posting time seems to be a function of the number of entries: closer to 17:00 UTC when there are more entries (e.g., weekdays); earlier than 17:00 UTC when there are fewer entries (e.g., weekends). Again, barring server issues, the 'cranlogs' results are usually available _before_ 18:00 UTC.
Here's what you'd see using the Honolulu example:
```{r logInfo, eval = FALSE}
logInfo(details = TRUE)
```
```
$`Today's log/result`
[1] "2020-12-31"
$`Today's log posted?`
[1] "Yes"
$`Today's results on 'cranlogs'?`
[1] "Yes"
$`Available log/result`
[1] "Posit/RStudio (2020-12-31); 'cranlogs' (2020-12-31)."
$status
[1] "Everything OK."
```
The function uses your local time zone, which depends on R's ability to compute your local time and time zone (e.g., `Sys.time()` and `Sys.timezone()`). My understanding is that there may be operating system or platform specific issues that could undermine this.
### V - Reverse lookup of counts, ranks and percentiles
To query the log for a specific count, rank or percentile rank, use the functions below:
#### queryCount()
To find the packages that had 100 downloads (the default is 1, the lowest number of observable downloads):
```{R, count, eval = FALSE}
queryCount(100)
```
```{R, count_code, echo = FALSE}
queryCount(100, date = "2024-08-01")
```
#### queryRank()
To find the package that was ranked 20th in downloads (the default is 1st, the most downloaded package):
```{R, rank, eval = FALSE}
queryRank(20)
```
```{R, rank_code, echo = FALSE}
queryRank(20, date = "2024-08-01")
```
#### queryPercentile()
If you want the packages with a particular percentile rank, use `queryPercentile()`. Note that due to the discrete nature of counts, your choice of percentile may not be available because it may fall in one of the vertical gaps in the observed data:
```{R, discrete_counts, echo = FALSE}
tmp <- cranDistribution(date = "2024-08-01")
plot(tmp$data$percentile, pch = 46, xlab = "Nominal Rank", ylab = "Percentile Rank", main = "Observed Percentile Ranks")
```
For this reason, `queryPercentile()` rounds your selection to whole numbers. Also, the default value, which is set to 50, uses `median()` to guarantee a result.
```{R, rank_percentile, eval = FALSE}
# head() is used because there will be many observations with median count.
head(queryPercentile())
```
```{R, rank_percentile_code, echo = FALSE}
# head() is used because there will be many observations with median count.
head(queryPercentile(date = "2024-08-01"))
```
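To see why some percentile ranks are simply not observed, consider this toy sketch. The counts are hypothetical, and the percentile rank is computed as the percentage of packages with strictly fewer downloads, as in the 'cholera' example earlier:

```r
# Toy download counts for ten packages (hypothetical values).
counts <- c(1, 1, 1, 2, 2, 5, 5, 5, 20, 100)

# Percentile rank: percentage of packages with strictly fewer downloads.
pct <- vapply(counts, function(x) 100 * mean(counts < x), numeric(1L))

# Only a handful of distinct percentile ranks are observed.
sort(unique(pct))
```

Because ties share the same percentile rank, only 0, 30, 50, 80 and 90 occur here; a query for, say, the 40th percentile would find no package.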
You can also set a range of percentile ranks using the 'lo' and/or 'hi' arguments. If you get an error message, you may need to widen your interval:
```{R, lo_hi, eval = FALSE}
head(queryPercentile(lo = 95, hi = 96), 3)
tail(queryPercentile(lo = 95, hi = 96), 3)
```
```{R, lo_hi_code, echo = FALSE}
head(queryPercentile(lo = 95, hi = 96, date = "2024-08-01"), 3)
tail(queryPercentile(lo = 95, hi = 96, date = "2024-08-01"), 3)
```
#### cranDistribution()
The above functions leverage `cranDistribution()`, which computes the ranks and the distribution of download counts for a given day's log.
Its print method provides the date, the number of unique packages downloaded, the total number of downloads (the total number of rows/observations in the log) and the count and rank data for the top 20 packages:
```{R, print_distribution, eval = FALSE}
cranDistribution()
```
```{R, print_distribution_code, echo = FALSE}
cranDistribution(date = "2024-08-01")
```
Note that if you want to specify the number of top N packages, you'll have to explicitly call `print()` with the 'top.n' argument:
```{R, print_to.n_distribution_code, eval = FALSE}
print(cranDistribution(), top.n = 7)
```
Alternatively, you can use `queryRank()`:
```{R, print_n_queryRank, eval = FALSE}
queryRank(1:7)
```
The summary method provides the number of unique packages downloaded, the total number of downloads and the five number summary (plus the arithmetic mean):
```{R, summary_distribution, eval = FALSE}
summary(cranDistribution())
```
```{R, summary_distribution_code, echo = FALSE}
summary(cranDistribution(date = "2024-08-01"))
```
The plot method graphs the distribution of base 10 logarithm of download counts. Each plot is annotated with the median, mean and maximum download counts, as well as the total number of downloads and the total number of unique packages observed.
```{R, plot_distribution, eval = FALSE}
plot(cranDistribution())
```
```{R, plot_distribution_code, echo = FALSE}
plot(cranDistribution(date = "2024-08-01"))
```
### VI - data fix A
['packageRank'](https://CRAN.R-project.org/package=packageRank) fixes two data problems. The first addresses a problem that affects logs when the data were first collected (late 2012 through the beginning of 2013). To understand the problem, we need to know that the Posit/RStudio download logs, which begin on 01 October 2012, are stored as separate files with a name/URL that embeds the date:
```
http://cran-logs.rstudio.com/2022/2022-01-01.csv.gz
```
For the logs in question, this convention was broken in three ways: 1) some logs are effectively duplicated (same log, multiple names), 2) at least one is mislabeled and 3) the logs from 13 October through 28 December are offset by +3 days (e.g., the file with the name/URL "2012-12-01" contains the log for "2012-11-28"). As a result, we get erroneous download counts and we actually lose the last three logs of 2012. Details are available [here](https://github.com/lindbrook/packageRank/blob/master/docs/logs.md).
Unsurprisingly, all this leads to erroneous download counts. What is surprising is that these errors are compounded by how ['cranlogs'](https://CRAN.R-project.org/package=cranlogs) computes package downloads.
#### `fixDate_2012()`
['packageRank'](https://CRAN.R-project.org/package=packageRank) functions like `packageRank()` and `packageLog()` are affected by the second and third defects (mislabeled and offset logs) because they access logs via their filename/URL. [`fixDate_2012()`](https://github.com/lindbrook/packageRank/blob/master/R/fixDate_2012.R) addresses the problem by re-mapping problematic logs so that you get the log you expect.
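The idea of re-mapping can be sketched as follows. Both helpers (`logURL()` and `offsetFileDate()`) are hypothetical names introduced for illustration only; this is not the code in `fixDate_2012()`, and it ignores the question of which dates the offset applies to.

```r
# Hypothetical helper: build a log's URL from its date, following the
# naming convention described above.
logURL <- function(date) {
  date <- as.Date(date)
  paste0("http://cran-logs.rstudio.com/", format(date, "%Y"), "/", date,
    ".csv.gz")
}

# Hypothetical helper: the date in the name of the file that actually holds
# the log for `date`, when file names are offset by +3 days.
offsetFileDate <- function(date, offset = 3) as.Date(date) + offset

# A log that follows the convention.
logURL("2022-01-01")

# The log for 2012-11-28 sits in the file named for 2012-12-01.
logURL(offsetFileDate("2012-11-28"))
```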
#### `fixCranlogs()`
While unaffected by the second and third defects, functions that rely on `cranlogs::cran_downloads()` (e.g., [`packageRank::cranDownloads()`](https://github.com/lindbrook/packageRank/blob/master/R/cranDownloads.R), ['adjustedcranlogs'](https://CRAN.R-project.org/package=adjustedcranlogs) and ['dlstats'](https://CRAN.R-project.org/package=dlstats)) are susceptible to the first defect (duplicate names). My understanding is that this is because ['cranlogs'](https://CRAN.R-project.org/package=cranlogs) uses the date in a log rather than the filename/URL to retrieve logs.
To put it differently, ['cranlogs'](https://CRAN.R-project.org/package=cranlogs) can't detect multiple instances of logs with the same date. I found 3 logs with duplicate filename/URLs, and 5 additional instances of overcounting (including one of tripling).
[`fixCranlogs()`](https://github.com/lindbrook/packageRank/blob/master/R/fixCranlogs.R) addresses this overcounting problem by recomputing the download counts using the actual log(s) when any of the eight problematic dates are requested. Details about the 8 days and `fixCranlogs()` can be found [here](https://github.com/lindbrook/packageRank/blob/master/docs/logs.md).
### VII - data fix B
The second data problem is of more recent vintage. From 2023-09-13 through 2023-10-02, the download counts for the R application returned by `cranlogs::cran_downloads(packages = "R")` are, with two exceptions, twice what one would expect when looking at the actual log(s). The two exceptions are: 1) 2023-09-28, where the counts are identical but for a "rounding error" possibly due to an NA, and 2) 2023-09-30, where there is actually a three-fold difference.
Here are the relevant ratios of counts comparing ['cranlogs'](https://CRAN.R-project.org/package=cranlogs) results with counts based on the underlying logs: