# Position scales and axes {#sec-scale-position}
```{r}
#| echo: false
#| message: false
#| results: asis
source("common.R")
status("polishing")
```
Position scales are used to control the locations of visual entities in a plot, and how those locations are mapped to data values.
Every plot has two position scales, corresponding to the x and y aesthetics.
In most cases this is clear in the plot specification, because the user explicitly specifies the variables mapped to x and y.
However, this is not always the case.
Consider this plot specification:
```{r}
#| fig.show: hide
#| message: false
ggplot(mpg, aes(x = displ)) + geom_histogram()
```
In this example the y aesthetic is not specified by the user.
Rather, the aesthetic is mapped to a computed variable: `geom_histogram()` computes a `count` variable that gets mapped to the y aesthetic.
The default behaviour of `geom_histogram()` is equivalent to the following:
```{r}
#| fig.show: hide
#| message: false
ggplot(mpg, aes(x = displ, y = after_stat(count))) + geom_histogram()
```
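One way to check which computed variables a stat provides is to inspect the data that ggplot2 builds for the layer. The sketch below uses `layer_data()` for this; the object names `p` and `built` are illustrative, and `bins = 30` is set explicitly only to avoid the default binwidth message.

```{r}
library(ggplot2)

# Build the histogram layer and extract the data frame computed by the stat
p <- ggplot(mpg, aes(x = displ)) + geom_histogram(bins = 30)
built <- layer_data(p)

# The y position of each bar is the computed `count` for that bin
head(built[, c("x", "count", "y")])
all(built$y == built$count)
```

The `count` column is what `after_stat(count)` refers to in the explicit specification.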
Because position scales are used in every plot, it is useful to understand how they work and how they can be modified.
In this chapter we'll discuss this in detail.
The chapter is organised into four main sections:
- @sec-numeric-position-scales discusses continuous position scales. In addition to covering core topics like controlling scale limits (@sec-position-continuous-limits), breaks (@sec-position-continuous-breaks), and labels (@sec-position-continuous-labels), there are sections providing a detailed coverage of scale transformations (@sec-scale-transformation) as well as the subtle issues that arise when you need to zoom in or zoom out on a plot (@sec-zooming-in and @sec-zooming-out).
- @sec-date-scales discusses date/time scales, a special type of continuous scale. Because dates and times are a little more complicated than a standard continuous variable, ggplot2 provides special scales to help you control the major and minor breaks (@sec-date-breaks and @sec-date-minor-breaks) and the labels (@sec-date-labels) for date/time data.
- @sec-discrete-position discusses discrete position scales. It covers limits, breaks, and labels in @sec-scale-labels and axis label customisation in @sec-guide-axis.
- @sec-binned-position discusses binned position scales.
\index{Scales!position} \index{Positioning!scales}
## Numeric position scales {#sec-numeric-position-scales}
The most common continuous position scales are the default `scale_x_continuous()` and `scale_y_continuous()` functions.
In the simplest case they map linearly from the data value to a location on the plot.
There are several other position scales for continuous variables---`scale_x_log10()`, `scale_x_reverse()`, etc---most of which are convenience functions used to provide easy access to common transformations, discussed in @sec-scale-transformation.
\indexf{scale\_x\_continuous}
### Limits {#sec-position-continuous-limits}
\index{Axis!limits} \index{Scales!limits}
All scales have limits that specify the values of the aesthetic over which the scale is defined.
It's very natural to think about these limits for numeric position scales, as they map directly to the ranges of the axes.
By default, the limits are calculated from the range of the data variable, but sometimes you will need to set the limits manually using the `limits` argument to the scale function.
Whenever the scale is continuous, as is the case for numeric position scales, this should be a numeric vector of length two.
If you only want to set the upper or lower limit, you can set the other value to `NA`.
Manually setting scale limits is a common task when you need to ensure that scales in different plots are consistent with one another.
To illustrate why this is necessary consider this faceted plot:
```{r}
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
facet_wrap(vars(year))
```
In this plot, ggplot2 has automatically ensured that both facets have the same axis limits, making visual comparison of the two scatter plots easy.
However, when creating the plots individually the scale limits in different plots will often be inconsistent:
```{r}
#| layout-ncol: 2
#| fig-width: 4
mpg_99 <- mpg %>% filter(year == 1999)
mpg_08 <- mpg %>% filter(year == 2008)
base_99 <- ggplot(mpg_99, aes(displ, hwy)) + geom_point()
base_08 <- ggplot(mpg_08, aes(displ, hwy)) + geom_point()
base_99
base_08
```
Each plot makes sense on its own, but visual comparison between the two is difficult due to the inconsistent axis scaling.
To ensure consistent axis scaling, we can set the `limits` argument to each scale separately:
```{r}
#| layout-ncol: 2
#| fig-width: 4
base_99 +
scale_x_continuous(limits = c(1, 7)) +
scale_y_continuous(limits = c(10, 45))
base_08 +
scale_x_continuous(limits = c(1, 7)) +
scale_y_continuous(limits = c(10, 45))
```
However, this code is a little unwieldy.
Because modifying scale limits is such a common task, ggplot2 provides the `lims()` convenience function to simplify the code.
Analogous to the `labs()` function used to specify axis labels (@sec-titles), `lims()` takes name-value pairs as inputs: the argument name is used to specify the aesthetic, and the value is used to specify the scale limits.
\indexf{xlim} \indexf{ylim} \indexf{lims}
```{r}
#| layout-ncol: 2
#| fig-width: 4
base_99 + lims(x = c(1, 7), y = c(10, 45))
base_08 + lims(x = c(1, 7), y = c(10, 45))
```
In the special case where only one axis limit needs to be specified, ggplot2 also provides `xlim()` and `ylim()` helper functions, which can save you a few keystrokes.
In practice `lims()` tends to be more useful, because it can be used to set limits for several aesthetics at once.
You'll see an example of `lims()` applied to non-position aesthetics in @sec-colour-discrete-limits.
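Rather than typing the limits by hand, one possible approach is to compute them once from the full data set and reuse the result. This is only a sketch: the name `shared_lims` is hypothetical, and base `subset()` stands in for the dplyr pipeline used earlier so that the chunk needs only ggplot2.

```{r}
#| layout-ncol: 2
#| fig-width: 4
library(ggplot2)

# Limits computed once from the complete data, so every subset shares them
x_range <- range(mpg$displ)
y_range <- range(mpg$hwy)
shared_lims <- lims(x = x_range, y = y_range)

# Add the same limits to each individual plot
ggplot(subset(mpg, year == 1999), aes(displ, hwy)) + geom_point() + shared_lims
ggplot(subset(mpg, year == 2008), aes(displ, hwy)) + geom_point() + shared_lims
```

Unlike the hand-picked values `c(1, 7)` and `c(10, 45)`, these limits hug the data exactly, so whether this is preferable depends on how much padding you want around the points.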
### Zooming in {#sec-zooming-in}
The examples in the previous section expand the scale limits beyond the range spanned by the data.
It is also possible to narrow the default scale limits, but care is required: when you truncate the scale limits, some data points will fall outside the boundaries you set, and ggplot2 has to make a decision about what to do with these data points.
The default behaviour in ggplot2 is to convert any data values outside the scale limits to `NA`.
This means that changing the limits of a scale is not always the same as visually zooming in to a region of the plot.
If your goal is to zoom in on part of the plot, it is usually better to use the `xlim` and `ylim` arguments of `coord_cartesian()`:
```{r}
#| layout-ncol: 3
#| fig-width: 3
base <- ggplot(mpg, aes(drv, hwy)) +
geom_hline(yintercept = 28, colour = "red") +
geom_boxplot()
base
base + coord_cartesian(ylim = c(10, 35)) # works as expected
base + ylim(10, 35) # distorts the boxplot
```
The only difference between the left and middle plots is that the latter is zoomed in.
Some of the outlier points are not shown due to the restriction of the range, but the boxplots themselves remain identical.
In contrast, in the plot on the right one of the boxplots has changed.
When modifying the scale limits, all observations with highway mileage greater than 35 are converted to `NA` before the stat (in this case the boxplot) is computed.
Because these "out of bounds" values are no longer available, the end result is that the sample median is shifted downward, which is almost never desirable behaviour.
With the benefit of hindsight it's clear this wasn't a good design choice, because it is a common source of confusion for users.
Unfortunately, it would be very hard to change this default without breaking a lot of existing code.
You can learn more about coordinate systems in @sec-cartesian.
To learn more about how "out of bounds" values are handled for continuous and binned scales, see @sec-oob.
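The effect of censoring on a summary statistic can also be examined outside of a plot. The sketch below applies `scales::oob_censor()` to the highway mileage of one group of cars and compares summaries before and after. The choice of the `"f"` group and the variable names are illustrative assumptions, not part of the original example.

```{r}
library(ggplot2)

# Highway mileage for front-wheel-drive cars
hwy_f <- mpg$hwy[mpg$drv == "f"]

# oob_censor() replaces values outside the range with NA
censored <- scales::oob_censor(hwy_f, range = c(10, 35))
sum(is.na(censored))

# Summaries computed after censoring no longer see the dropped values
median(hwy_f)
median(censored, na.rm = TRUE)
quantile(hwy_f, 0.75)
quantile(censored, 0.75, na.rm = TRUE)
```

Because only large values are removed, any quantile computed from the censored vector can stay the same or move down, but never up.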
### Visual range expansion {#sec-zooming-out}
If you have eagle eyes, you'll have noticed that the visual range of the axes actually extends a little bit past the numeric limits that we have specified in the various examples.
This ensures that the data does not overlap the axes, which is usually (but not always) desirable.
You can override the defaults by setting the `expand` argument, which expects a numeric vector created by `expansion()`.
For example, one case where it's usually preferable to remove this space is when using `geom_raster()`, which we can achieve by setting `expand = expansion(0)`: \index{Axis!expansion}
```{r}
#| layout-ncol: 2
#| fig-width: 4
base <- ggplot(faithfuld, aes(waiting, eruptions)) +
geom_raster(aes(fill = density)) +
theme(legend.position = "none") +
labs(x = NULL, y = NULL)
base
base +
scale_x_continuous(expand = expansion(0)) +
scale_y_continuous(expand = expansion(0))
```
Axis expansions are described in terms of an "additive" factor, which specifies a constant amount of space added outside the nominal axis limits, and a "multiplicative" factor, which adds space defined as a proportion of the size of the axis range.
These correspond to the `add` and `mult` arguments to `expansion()`, which can be length one (if the expansion is the same on both sides) or length two (to set different expansions on each side):
```{r}
#| layout-ncol: 3
#| fig-width: 3
# Additive expansion of three units on both axes
base +
scale_x_continuous(expand = expansion(add = 3)) +
scale_y_continuous(expand = expansion(add = 3))
# Multiplicative expansion of 20% on both axes
base +
scale_x_continuous(expand = expansion(mult = .2)) +
scale_y_continuous(expand = expansion(mult = .2))
# Multiplicative expansion of 5% at the lower end of each axis,
# and 20% at the upper end; for the y-axis the expansion is
# set directly instead of using expansion()
base +
scale_x_continuous(expand = expansion(mult = c(.05, .2))) +
scale_y_continuous(expand = c(.05, 0, .2, 0))
```
Note the different behaviour in the left and middle plots: the `add` argument is specified on the same scale as the data variable, whereas the `mult` argument is specified relative to the axis range.
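To make the arithmetic concrete, the sketch below prints the vector that `expansion()` returns and applies it to a pair of nominal limits by hand. The helper `expand_by_hand()` is hypothetical and written only for illustration; it mirrors the additive and multiplicative rules described above rather than calling ggplot2 internals.

```{r}
library(ggplot2)

# expansion() returns c(lower mult, lower add, upper mult, upper add)
expansion(mult = c(.05, .2))
expansion(add = 3)

# Hypothetical helper: apply an expansion vector to nominal limits by hand
expand_by_hand <- function(limits, expand) {
  width <- diff(limits)
  c(
    limits[1] - expand[1] * width - expand[2],
    limits[2] + expand[3] * width + expand[4]
  )
}

expand_by_hand(c(0, 100), expansion(mult = c(.05, .2)))
expand_by_hand(c(0, 100), expansion(add = 3))
```

With limits of 0 and 100, the multiplicative expansion gives a visual range of -5 to 120, while the additive expansion gives -3 to 103.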
<!-- ### Exercises -->
<!-- 1. The following code creates two plots of the mpg dataset. Modify the code -->
<!-- so that the legend and axes match, without using faceting! -->
<!-- ```{r} -->
<!-- fwd <- subset(mpg, drv == "f") -->
<!-- rwd <- subset(mpg, drv == "r") -->
<!-- ggplot(fwd, aes(displ, hwy, colour = class)) + geom_point() -->
<!-- ggplot(rwd, aes(displ, hwy, colour = class)) + geom_point() -->
<!-- ``` -->
<!-- 1. What happens if you add two `xlim()` calls to the same plot? Why? -->
<!-- 1. What does `scale_x_continuous(limits = c(NA, NA))` do? -->
<!-- 1. What does `expand_limits()` do and how does it work? Read the source code. -->
### Breaks {#sec-position-continuous-breaks}
Setting the locations of the axis tick marks is a common data visualisation task.
In ggplot2, axis tick marks and legend tick marks are both special cases of "scale breaks", and can be modified using the `breaks` argument to the scale function.
We'll illustrate this using a toy data set that will reappear in several places throughout this part of the book:
```{r}
toy <- data.frame(
const = 1,
up = 1:4,
txt = letters[1:4],
big = (1:4)*1000,
log = c(2, 5, 10, 2000)
)
toy
```
To set breaks manually, pass a vector of data values to `breaks`, or set `breaks = NULL` to remove the breaks and the corresponding tick marks entirely.
In the plot below, removing the y-axis breaks also removes the corresponding grid lines:
```{r}
#| fig-height: 2
base <- ggplot(toy, aes(big, const)) +
geom_point() +
labs(x = NULL, y = NULL) +
scale_y_continuous(breaks = NULL)
base
```
Alternatively, when the breaks are set manually, both the major grid lines and the minor grid lines between them move:
```{r}
#| layout-ncol: 2
#| fig-width: 4
#| fig-height: 2
base + scale_x_continuous(breaks = c(1000, 2000, 4000))
base + scale_x_continuous(breaks = c(1000, 1500, 2000, 4000))
```
It is also possible to pass a function to `breaks`.
This function should have one argument that specifies the limits of the scale (a numeric vector of length two), and it should return a numeric vector of breaks.
You can write your own break function, but in many cases there is no need, thanks to the scales package [@scales].
It provides several tools that are useful for this purpose:
- `scales::breaks_extended()` creates automatic breaks for numeric axes.
- `scales::breaks_log()` creates breaks appropriate for log axes.
- `scales::breaks_pretty()` creates "pretty" breaks for date/times.
- `scales::breaks_width()` creates equally spaced breaks.
The `breaks_extended()` function is the standard method used in ggplot2, and accordingly the first two plots below are the same.
We can alter the desired number of breaks by setting `n = 2`, as illustrated in the third plot.
Note that `breaks_extended()` treats `n` as a suggestion rather than a strict constraint.
If you need to specify exact breaks it is better to do so manually.
```{r}
#| layout-ncol: 3
#| fig-width: 3
#| fig-height: 2
base
base + scale_x_continuous(breaks = scales::breaks_extended())
base + scale_x_continuous(breaks = scales::breaks_extended(n = 2))
```
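If none of the scales helpers fit, a hand-written break function follows the same contract: it receives the limits and returns the breaks. The function `five_breaks()` below is a hypothetical example, and the small data frame `df` simply recreates the `big` and `const` columns of the toy data so that the chunk is self-contained.

```{r}
#| fig-height: 2
library(ggplot2)

# Hypothetical break function: five evenly spaced breaks spanning the limits
five_breaks <- function(limits) {
  seq(limits[1], limits[2], length.out = 5)
}

# Calling the function directly shows the breaks it would produce
five_breaks(c(1000, 4000))

# Passing the function (not its result) lets the scale supply the limits
df <- data.frame(big = (1:4) * 1000, const = 1)
ggplot(df, aes(big, const)) +
  geom_point() +
  scale_x_continuous(breaks = five_breaks)
```

For limits of 1000 and 4000 this yields breaks at 1000, 1750, 2500, 3250 and 4000.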
Another approach that is sometimes useful is specifying a fixed `width` that defines the spacing between breaks.
The `breaks_width()` function is used for this.
The first example below shows how to fix the width at a specific value; the second and third examples illustrate the use of the `offset` argument, which shifts all the breaks by a specified amount:
```{r}
#| layout-ncol: 3
#| fig-width: 3
#| fig-height: 2
base +
scale_x_continuous(breaks = scales::breaks_width(800))
base +
scale_x_continuous(breaks = scales::breaks_width(800, offset = 200))
base +
scale_x_continuous(breaks = scales::breaks_width(800, offset = -200))
```
Notice the difference between setting an offset of 200 and -200.
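Because `breaks_width()` returns an ordinary function, one way to see exactly what the offset does is to call that function on a pair of limits. This is a small sketch; the name `x_limits` is illustrative and the values match the range of `toy$big`.

```{r}
x_limits <- c(1000, 4000)

# Each call returns a function, which is then applied to the limits
b_none  <- scales::breaks_width(800)(x_limits)
b_plus  <- scales::breaks_width(800, offset = 200)(x_limits)
b_minus <- scales::breaks_width(800, offset = -200)(x_limits)

b_none
b_plus
b_minus
```

In each case the spacing between consecutive breaks stays at 800; only their starting point changes.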
### Minor breaks {#sec-minor-breaks}
\index{Minor breaks}\index{Log!ticks}
You can adjust the minor breaks (the unlabelled faint grid lines that appear between the major grid lines) by supplying a numeric vector of positions to the `minor_breaks` argument.
Minor breaks are particularly useful for log scales because they give a clear visual indicator that the scale is non-linear.
To show them off, we'll first create a vector of minor break values (specified in the original data units rather than in log units), using `%o%` to quickly generate a multiplication table and `as.numeric()` to flatten the table to a vector.
```{r}
mb <- unique(as.numeric(1:10 %o% 10 ^ (0:3)))
mb
```
The following plots illustrate the effect of setting the minor breaks:
```{r}
#| layout-ncol: 2
#| fig-width: 4
#| fig-height: 2
base <- ggplot(toy, aes(log, const)) +
geom_point() +
labs(x = NULL, y = NULL) +
scale_y_continuous(breaks = NULL)
base + scale_x_log10()
base + scale_x_log10(minor_breaks = mb)
```
As with `breaks`, you can also supply a function to `minor_breaks`: helpers such as `scales::minor_breaks_n()` and `scales::minor_breaks_width()` can be useful for controlling the minor breaks.
### Labels {#sec-position-continuous-labels}
Every break is associated with a label and these can be changed by setting the `labels` argument to the scale function:
```{r}
#| layout-ncol: 2
#| fig-width: 4
#| fig-height: 2
base <- ggplot(toy, aes(big, const)) +
geom_point() +
labs(x = NULL, y = NULL) +
scale_y_continuous(breaks = NULL)
base
base +
scale_x_continuous(
breaks = c(2000, 4000),
labels = c("2k", "4k")
)
```
Often you don't need to set the `labels` manually, and can instead specify a labelling function in the same way you can for `breaks`.
A function passed to `labels` should accept a numeric vector of breaks as input and return a character vector of labels (the same length as the input).
Again, the scales package provides a number of tools that will automatically construct label functions for you.
Some of the more useful examples for numeric data include:
- `scales::label_bytes()` formats numbers as kilobytes, megabytes etc.
- `scales::label_comma()` formats numbers as decimals with commas added.
- `scales::label_dollar()` formats numbers as currency.
- `scales::label_ordinal()` formats numbers in rank order: 1st, 2nd, 3rd etc.
- `scales::label_percent()` formats numbers as percentages.
- `scales::label_pvalue()` formats numbers as p-values: \<.05, \<.01, .34, etc.
A few examples are shown below to illustrate how these functions are used:
```{r}
#| label: breaks-functions
#| layout-ncol: 3
#| fig-width: 3
#| fig-height: 3
base <- ggplot(toy, aes(big, const)) +
geom_point() +
labs(x = NULL, y = NULL) +
scale_x_continuous(breaks = NULL)
base
base + scale_y_continuous(labels = scales::label_percent())
base + scale_y_continuous(
labels = scales::label_dollar(prefix = "", suffix = "€")
)
```
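A hand-written labelling function obeys the same contract as the scales helpers: breaks in, character labels out. In the sketch below `label_k()` is a hypothetical function written for illustration, and the helper calls show that the `label_*()` functions return functions that can be tested directly on a vector.

```{r}
# Hypothetical labeller: express values in thousands with a "k" suffix
label_k <- function(breaks) paste0(breaks / 1000, "k")
label_k(c(2000, 4000))

# The label_*() helpers return functions that can be called the same way
scales::label_percent()(c(0.25, 0.5))
scales::label_comma()(c(1000, 2000))
```

Supplying `labels = label_k` to `scale_x_continuous()` would apply the same formatting to whatever breaks the scale chooses, which avoids keeping a manual `labels` vector in sync with `breaks`.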
You can suppress labels with `labels = NULL`.
This will remove the labels from the axis or legend while leaving its other properties unchanged.
Notice the difference between setting `breaks = NULL` and `labels = NULL`:
```{r}
#| layout-ncol: 2
#| fig-width: 4
#| fig-height: 2
base + scale_y_continuous(breaks = NULL)
base + scale_y_continuous(labels = NULL)
```
<!-- ### Exercises -->
<!-- 1. Recreate the following graphic: -->
<!-- ```{r, echo = FALSE} -->
<!-- ggplot(mpg, aes(displ, hwy)) + -->
<!-- geom_point() + -->
<!-- scale_x_continuous("Displacement", labels = scales::unit_format(suffix = "L")) + -->
<!-- scale_y_continuous(quote(paste("Highway ", (frac(miles, gallon))))) -->
<!-- ``` -->
<!-- Adjust the y axis label so that the parentheses are the right size. -->
<!-- 1. List the three different types of object you can supply to the -->
<!-- `breaks` argument. How do `breaks` and `labels` differ? -->
<!-- 1. What label function allows you to create mathematical expressions? -->
<!-- What label function converts 1 to 1st, 2 to 2nd, and so on? -->
### Transformations {#sec-scale-transformation}
When working with continuous data, the default is to map linearly from the data space onto the aesthetic space.
It is possible to override this default using scale transformations, which alter the way in which this mapping takes place.
In some cases you don't need to dive into the details, because there are convenience functions like `scale_x_log10()` and `scale_x_reverse()` that can do the work for you:
```{r}
#| layout-ncol: 3
#| fig-width: 3
base <- ggplot(mpg, aes(displ, hwy)) + geom_point()
base
base + scale_x_reverse()
base + scale_y_reverse()
```
However, even in these cases a deeper understanding can be valuable.
Every continuous scale takes a `trans` argument, allowing the use of a variety of transformations: \index{Scales!position} \index{Transformation!scales} \index{Log!scale} \indexf{scale\_x\_log10}
```{r}
#| layout-ncol: 2
#| fig-width: 4
# convert from fuel economy to fuel consumption
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_y_continuous(trans = "reciprocal")
# log transform x and y axes
ggplot(diamonds, aes(price, carat)) +
geom_bin2d() +
scale_x_continuous(trans = "log10") +
scale_y_continuous(trans = "log10")
```
The transformation is carried out by a "transformer", which describes the transformation, its inverse, and how to draw the labels.
You can construct your own transformer using `scales::trans_new()`, but as the plots above illustrate, ggplot2 understands many common transformations supplied by the scales package.
The following table lists some of the more common variants:
| Name | Transformer | Function $f(x)$ | Inverse $f^{-1}(y)$ |
|----------------|------------------------------|-------------------------|----------------------|
| `"asn"` | `scales::asn_trans()` | $2 \sin^{-1}(\sqrt{x})$ | $\sin^2(y / 2)$ |
| `"exp"` | `scales::exp_trans()` | $e ^ x$ | $\log(y)$ |
| `"identity"` | `scales::identity_trans()` | $x$ | $y$ |
| `"log"` | `scales::log_trans()` | $\log(x)$ | $e ^ y$ |
| `"log10"` | `scales::log10_trans()` | $\log_{10}(x)$ | $10 ^ y$ |
| `"log2"` | `scales::log2_trans()` | $\log_2(x)$ | $2 ^ y$ |
| `"logit"` | `scales::logit_trans()` | $\log(\frac{x}{1 - x})$ | $\frac{1}{1 + e^{-y}}$ |
| `"probit"` | `scales::probit_trans()` | $\Phi^{-1}(x)$ | $\Phi(y)$ |
| `"reciprocal"` | `scales::reciprocal_trans()` | $x^{-1}$ | $y^{-1}$ |
| `"reverse"` | `scales::reverse_trans()` | $-x$ | $-y$ |
| `"sqrt"` | `scales::sqrt_trans()` | $x^{1/2}$ | $y ^ 2$ |
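A transformer is an ordinary R object, so its forward and inverse functions can be inspected directly. The sketch below does this for `scales::log10_trans()` and then builds a custom transformer with `scales::trans_new()`. The signed cube root and the name `cuberoot_trans` are hypothetical choices made for illustration.

```{r}
# A transformer stores the forward function and its inverse
tr <- scales::log10_trans()
tr$transform(c(1, 10, 100))
tr$inverse(c(0, 1, 2))

# Hypothetical custom transformer: a signed cube root, defined for all reals
cuberoot_trans <- scales::trans_new(
  name = "cuberoot",
  transform = function(x) sign(x) * abs(x)^(1 / 3),
  inverse = function(x) x^3
)

# Applying the transform and then the inverse should recover the input
round_trip <- cuberoot_trans$inverse(cuberoot_trans$transform(c(-8, 0, 27)))
round_trip
```

An object built this way can be passed to a scale in the same manner as the built-in transformers, for example `scale_y_continuous(trans = cuberoot_trans)`.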
You can specify the `trans` argument as a string containing the name of the transformation, or by calling the transformer directly.
The following are equivalent:
```{r}
#| fig.show: hide
#| layout-ncol: 2
#| fig-width: 4
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_y_continuous(trans = "reciprocal")
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_y_continuous(trans = scales::reciprocal_trans())
```
In a few cases ggplot2 simplifies this even further, and provides convenience functions for the most common transformations: `scale_x_log10()`, `scale_x_sqrt()` and `scale_x_reverse()` provide the relevant transformation on the x axis, with similar functions provided for the y axis.
Thus, these two plot specifications are also equivalent:
```{r}
#| fig.show: hide
#| layout-ncol: 2
#| fig-width: 4
ggplot(diamonds, aes(price, carat)) +
geom_bin2d() +
scale_x_continuous(trans = "log10") +
scale_y_continuous(trans = "log10")
ggplot(diamonds, aes(price, carat)) +
geom_bin2d() +
scale_x_log10() +
scale_y_log10()
```
Note that there is nothing preventing you from performing these transformations manually.
For example, instead of using `scale_x_log10()` to transform the scale, you could transform the data instead and plot `log10(x)`.
The appearance of the geom will be the same, but the tick labels will be different.
Specifically, if you use a transformed scale, the axes will be labelled in the original data space; if you transform the data, the axes will be labelled in the transformed space.
As a consequence, these plot specifications are slightly different:
```{r}
#| layout-ncol: 2
#| fig-width: 4
# manual transformation
ggplot(mpg, aes(log10(displ), hwy)) +
geom_point()
# transform using scales
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_x_log10()
```
Regardless of which method you use, the transformation occurs before any statistical summaries.
To transform *after* statistical computation use `coord_trans()`.
See @sec-cartesian for more details on coordinate systems, and @sec-scale-transformation-extras if you need to transform something other than a numeric position scale.
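The difference between the two approaches is easiest to see with a layer that computes a statistical summary. The sketch below fits a linear model with `geom_smooth()` and compares a scale transformation against `coord_trans()`. The subset `carat > 0.5` is an assumption made here so that the fitted values stay positive on the log axis; `p_scale` and `p_coord` are illustrative names.

```{r}
#| layout-ncol: 2
#| fig-width: 4
library(ggplot2)

d <- subset(diamonds, carat > 0.5)
base <- ggplot(d, aes(carat, price)) +
  geom_smooth(method = "lm", formula = y ~ x)

# Scale transformation: the model is fitted to log10-transformed data
p_scale <- base + scale_x_log10() + scale_y_log10()
p_scale

# Coordinate transformation: the model is fitted to the raw data, and the
# fitted line is bent afterwards by the log10 axes
p_coord <- base + coord_trans(x = "log10", y = "log10")
p_coord
```

In the first plot the fitted line is straight, because the model is linear in the transformed values; in the second it is curved, because a line fitted on the original scale is drawn on transformed axes.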
## Date-time position scales {#sec-date-scales}
\index{Date/times} \index{Data!date/time} \index{Time} \index{Scales!date/time} \indexf{scale\_x\_datetime}
A special case of numeric position arises when an aesthetic is mapped to a date/time type.
Examples of date/time types include the base `Date` (for dates) and `POSIXct` (for date-times) classes, as well as the `hms` class for "time of day" values provided by the hms package [@hms].
If your dates are in a different format you will need to convert them using `as.Date()`, `as.POSIXct()` or `hms::as_hms()`.
You may also find the lubridate package helpful to manipulate date/time data [@lubridate].
Assuming you have appropriately formatted data mapped to the x aesthetic, ggplot2 will use `scale_x_date()` as the default scale for dates and `scale_x_datetime()` as the default scale for date-time data.
The corresponding scales for other aesthetics follow the usual naming rules.
Date scales behave similarly to other continuous scales, but contain additional arguments that allow you to work in date-friendly units.
This section discusses date/time scales for position aesthetics: see @sec-date-colour-scales for colour and fill aesthetics.
### Breaks {#sec-date-breaks}
The `date_breaks` argument allows you to position breaks by date units (years, months, weeks, days, hours, minutes, and seconds).
For example, `date_breaks = "2 weeks"` will place a major tick mark every two weeks and `date_breaks = "15 years"` will place them every 15 years:
```{r}
#| label: date-scale
#| layout-ncol: 2
#| fig-width: 4
date_base <- ggplot(economics, aes(date, psavert)) +
geom_line(na.rm = TRUE) +
labs(x = NULL, y = NULL)
date_base
date_base + scale_x_date(date_breaks = "15 years")
```
Compared to the plot on the left, two things have changed in the plot on the right: the tick marks are placed at 15 year intervals, and the label format has changed.
We'll discuss date labelling in @sec-date-labels, but for now our focus is on the breaks.
To understand how ggplot2 interprets `date_breaks = "15 years"`, it is helpful to note that it is merely a convenient shorthand for setting `breaks = scales::breaks_width("15 years")`.
The longer form is typically unnecessary, but it can be useful if---as discussed in @sec-position-continuous-breaks---you wish to specify an `offset`.
For example, suppose the goal is to plot data spanning a calendar year, with monthly breaks.
Specifying `date_breaks = "1 month"` is equivalent to setting `breaks = scales::breaks_width("1 month")`, which produces these breaks:
```{r}
the_year <- as.Date(c("2021-01-01", "2021-12-31"))
set_breaks <- scales::breaks_width("1 month")
set_breaks(the_year)
```
In this example, the `set_breaks()` function returned by `scales::breaks_width()` produces breaks spaced one month apart, where the date for each break falls on the first day of the month.
Placing each break at the start of the calendar month is usually sensible, but there are exceptions.
Perhaps the data track income and expenses for a household in which a monthly salary is paid on the ninth day of each month.
In this situation it may be sensible to have the breaks aligned with the salary deposits.
To do this, we can set `offset = 8` when we define the `set_breaks()` function:
```{r}
set_breaks <- scales::breaks_width("1 month", offset = 8)
set_breaks(the_year)
```
### Minor breaks {#sec-date-minor-breaks}
Date/time scales also have a `date_minor_breaks` argument that allows you to specify the minor breaks using date units, in exactly the same fashion that `date_breaks` does for major breaks.
To illustrate this, we'll define an empty plot with a date scale on the y-axis, and tweak the theme (@sec-polishing) to make the grid lines more visually prominent:
```{r}
#| label: date-scale-2
#| layout-ncol: 2
#| fig-width: 4
#| fig-height: 3
df <- data.frame(y = as.Date(c("2022-01-01", "2022-04-01")))
base <- ggplot(df, aes(y = y)) +
labs(y = NULL) +
theme_minimal() +
theme(
panel.grid.major = element_line(colour = "black"),
panel.grid.minor = element_line(colour = "grey50")
)
base + scale_y_date(date_breaks = "1 month")
base +
scale_y_date(date_breaks = "1 month", date_minor_breaks = "1 week")
```
Note that in the first plot, the minor breaks are spaced evenly between the monthly major breaks.
In the second plot, the major and minor breaks follow slightly different patterns: the minor breaks are always spaced 7 days apart but the major breaks are 1 month apart.
Because the months vary in length, this leads to slightly uneven spacing.
### Labels {#sec-date-labels}
Date scales contain a `labels` argument that behaves similarly to the corresponding argument for numeric scales, but it is often more convenient to use the `date_labels` argument.
It controls the display of the labels using the same formatting strings as in `strptime()` and `format()`.
To display dates like 14/10/1979, for example, you would use the string `"%d/%m/%Y"`: in this expression `%d` produces a numeric day of month, `%m` produces a numeric month, and `%Y` produces a four-digit year.
The table below provides a list of formatting strings:
| String | Meaning |
|:-------|:-----------------------------------|
| `%S` | second (00-59) |
| `%M` | minute (00-59) |
| `%l` | hour, in 12-hour clock (1-12) |
| `%I` | hour, in 12-hour clock (01-12) |
| `%p` | am/pm |
| `%H` | hour, in 24-hour clock (00-23) |
| `%a` | day of week, abbreviated (Mon-Sun) |
| `%A` | day of week, full (Monday-Sunday) |
| `%e` | day of month (1-31) |
| `%d` | day of month (01-31) |
| `%m` | month, numeric (01-12) |
| `%b` | month, abbreviated (Jan-Dec) |
| `%B` | month, full (January-December) |
| `%y` | year, without century (00-99) |
| `%Y` | year, with century (0000-9999) |
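Formatting strings can be tried out on a single date before they are used in a scale. The sketch below formats the example date with base `format()` and with the labelling function returned by `scales::label_date()`; the variable `d` is illustrative.

```{r}
d <- as.Date("1979-10-14")

# Base R formatting with strings from the table
format(d, "%d/%m/%Y")
format(d, "%y")

# label_date() wraps the same formatting into a labelling function
scales::label_date("%d/%m/%Y")(d)
```

Strings that produce names, such as `%b` or `%A`, depend on the locale of the R session, so their output may differ between machines.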
One useful scenario for date label formatting is when there's insufficient room to specify a four-digit year.
Using `%y` ensures that only the last two digits are displayed:
```{r}
#| label: date-scale-3
#| layout-ncol: 2
#| fig-width: 4
base <- ggplot(economics, aes(date, psavert)) +
geom_line(na.rm = TRUE) +
labs(x = NULL, y = NULL)
base + scale_x_date(date_breaks = "5 years")
base + scale_x_date(date_breaks = "5 years", date_labels = "%y")
```
It can be useful to include the line break character `\n` in a formatting string, particularly when full-length month names are included:
```{r}
#| label: date-scale-4
#| layout-ncol: 2
#| fig-width: 4
#| fig-height: 3
lim <- as.Date(c("2004-01-01", "2005-01-01"))
base + scale_x_date(limits = lim, date_labels = "%b %y")
base + scale_x_date(limits = lim, date_labels = "%B\n%Y")
```
In these examples we have specified the labels manually via the `date_labels` argument.
An alternative approach is to pass a labelling function to the `labels` argument, in the same way we described in @sec-position-continuous-labels.
You can write your own custom labelling function, but this is often unnecessary.
The scales package provides convenient functions that can generate labellers for you, notably `scales::label_date()` and `scales::label_date_short()`.
You rarely need to call `scales::label_date()` directly, because that's the function that `date_labels` uses.
However, if you want to use `scales::label_date_short()` you'll need to do so explicitly.
The goal of `label_date_short()` is to automatically construct short labels that are sufficient to uniquely identify the dates:
```{r}
#| label: date-scale-5
#| fig-height: 3
base +
  scale_x_date(
    limits = lim,
    labels = scales::label_date_short()
  )
```
This can often produce clearer plots: in the example above each year is labelled only once rather than appearing in every label, reducing the amount of visual clutter and making it easier for the viewer to see where each year begins and ends.
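Labelling functions are ordinary functions that take a vector of break values and return a character vector, so it can be helpful to call them directly to see what they produce. The following is a minimal sketch, assuming the scales package is installed. The `dates` vector is made up for illustration, and the month names depend on your locale:

```{r}
# Hypothetical break positions spanning two calendar years
dates <- as.Date(c("2004-01-01", "2004-07-01", "2005-01-01", "2005-07-01"))

# label_date() applies one format string to every break
full <- scales::label_date("%b %Y")(dates)
full

# label_date_short() returns one label per break, but only includes
# the components needed to tell neighbouring breaks apart
short <- scales::label_date_short()(dates)
short
```

Comparing `full` and `short` side by side is a simple way to decide which labeller suits a particular axis.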
## Discrete position scales {#sec-discrete-position}
It is also possible to map discrete variables to position scales, with the default scales being `scale_x_discrete()` and `scale_y_discrete()`.
For example, the following two plot specifications are equivalent:
```{r}
#| label: default-scales-discrete
#| fig.show: hide
ggplot(mpg, aes(x = hwy, y = class)) +
  geom_point()
ggplot(mpg, aes(x = hwy, y = class)) +
  geom_point() +
  scale_x_continuous() +
  scale_y_discrete()
```
Internally, ggplot2 handles discrete scales by mapping each category to an integer value and then drawing the geom at the corresponding coordinate location.
To illustrate this, we can add a custom annotation (see @sec-custom-annotations) to the plot:
```{r}
ggplot(mpg, aes(x = hwy, y = class)) +
  geom_point() +
  annotate("text", x = 5, y = 1:7, label = 1:7)
```
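Another way to see the integer mapping is to inspect the data that ggplot2 computes for the layer. The sketch below uses `layer_data()` to extract it. The variable names are our own, and pairing the positions with the sorted category names assumes the default ordering of the scale:

```{r}
library(ggplot2)

p <- ggplot(mpg, aes(x = hwy, y = class)) +
  geom_point()

# layer_data() returns the layer's data after the scales have been applied
built <- layer_data(p)

# The categories in sorted order, which is the default order of the scale
categories <- sort(unique(mpg$class))

# The distinct positions that the categories were mapped to
positions <- sort(unique(as.numeric(built$y)))

data.frame(category = categories, position = positions)
```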
Mapping each category to an integer value is useful because it means that other quantities, such as widths and heights, can be specified as a proportion of the category range.
For instance, in the preceding plot, we could specify a vertical jitter for each point spanning half the width of the implied category bin:
```{r}
ggplot(mpg, aes(x = hwy, y = class)) +
  geom_jitter(width = 0, height = .25) +
  annotate("text", x = 5, y = 1:7, label = 1:7)
```
The same mechanism underpins the widths of bars and boxplots.
Because each category has width 1 in a discrete scale, setting `width = .4` when using `geom_boxplot()` ensures that the box occupies 40% of the width allocated to the category:
```{r}
#| layout-ncol: 2
#| fig-width: 4
ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot()
ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(width = .4)
```
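If you want to check the width numerically, one option is to look at the edges of each box in the computed layer data. This is a rough sketch rather than a documented recipe: it assumes that the data returned by `layer_data()` for a boxplot layer contains `xmin` and `xmax` columns, and we would expect the difference between them to be close to the requested width of 0.4:

```{r}
library(ggplot2)

p <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot(width = .4)

# One row per box; xmin and xmax are assumed to hold the box edges
# in the units of the position scale
boxes <- layer_data(p)

as.numeric(boxes$xmax) - as.numeric(boxes$xmin)
```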
### Limits, breaks, and labels {#sec-scale-labels}
\index{Axis!labels} \index{Legend!keys}
The limits, breaks, and labels for a discrete position scale can be set using the `limits`, `breaks`, and `labels` arguments.
For the most part these behave identically to the corresponding arguments for numeric scales (@sec-numeric-position-scales), though there are some differences.
For example, the limits of a discrete scale are not defined in terms of endpoints, but instead correspond to the set of allowable values for that variable.
Accordingly, ggplot2 expects that the `limits` of a discrete scale should be a character vector that enumerates all possible values in the order they should appear:
```{r}
#| layout-ncol: 3
#| fig-width: 3
base <- ggplot(toy, aes(const, txt)) +
  geom_label(aes(label = txt)) +
  scale_x_continuous(breaks = NULL) +
  labs(x = NULL, y = NULL)
base
base + scale_y_discrete(limits = c("a", "b", "c", "d", "e"))
base + scale_y_discrete(limits = c("d", "c", "a", "b"))
```
The `breaks` argument is largely unchanged, enumerating the set of values that should be labelled on the axis.
The `labels` argument for discrete scales has some additional functionality: you also have the option of using a named vector to set the labels associated with particular values.
This allows you to change some labels and not others, without altering the ordering or the breaks:
```{r}
#| layout-ncol: 2
#| fig-width: 4
base + scale_y_discrete(breaks = c("b", "c"))
base + scale_y_discrete(labels = c(c = "carrot", b = "banana"))
```
As with other scales, discrete position scales allow you to pass a function to the `labels` argument.
The `scales::label_wrap()` function can be particularly valuable for categorical data, as it allows you to wrap long strings across multiple lines.
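As a minimal sketch of how this might look, the example below wraps long category names at roughly 15 characters. The `survey` data frame and its values are invented for illustration, and the width of 15 is an arbitrary choice:

```{r}
library(ggplot2)

# A made-up data frame with deliberately long category names
survey <- data.frame(
  response = c(
    "Strongly agree with the statement",
    "Neither agree nor disagree",
    "Strongly disagree with the statement"
  ),
  n = c(12, 30, 8)
)

# label_wrap() returns a function; calling it directly shows
# where the line breaks are inserted
wrapped <- scales::label_wrap(15)(survey$response)
wrapped

# The same labeller can be passed to the labels argument of the scale
ggplot(survey, aes(response, n)) +
  geom_col() +
  scale_x_discrete(labels = scales::label_wrap(15))
```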
### Label positions {#sec-guide-axis}
When plotting categorical data it is often necessary to move the axis labels in some way to prevent them from overlapping:
```{r}
base <- ggplot(mpg, aes(manufacturer, hwy)) + geom_boxplot()
base
```
Even when allocated a lot of horizontal space, the axis labels overlap considerably on this plot.
We can control this with the help of the `guides()` function, which works in a similar way to the `labs()` helper function described in @sec-titles.
Both take the names of aesthetics (e.g., color, x, fill) as arguments and allow you to specify your own value for each.
For a position aesthetic, we use `guide_axis()` to tell ggplot2 how we want to modify the axis labels.
For example, we could tell ggplot2 to "dodge" the position of the labels by setting `guide_axis(n.dodge = 3)`, or to rotate them by setting `guide_axis(angle = 90)`:
```{r}
#| layout-ncol: 2
#| fig-width: 4
base + guides(x = guide_axis(n.dodge = 3))
base + guides(x = guide_axis(angle = 90))
```
Note that, in the same way that `labs()` is a shorthand way to specify the `name` argument to one or more scales, the `guides()` function is a shorthand way to set the `guide` argument to one or more scales.
So the code below achieves the same result:
```{r}
#| fig.show: hide
#| layout-ncol: 2
#| fig-width: 4
base + scale_x_discrete(guide = guide_axis(n.dodge = 3))
base + scale_x_discrete(guide = guide_axis(angle = 90))
```
To learn more about guide functions see @sec-scale-guide.
## Binned position scales {#sec-binned-position}
Binned scales are a variation on discrete position scales, where a continuous variable is sliced into multiple bins and the discretised variable is plotted.
For position aesthetics, binned scales are mostly used to create histograms and related plots.
The example below shows how to approximate the behaviour of `geom_histogram()` using `geom_bar()` in combination with a binned position scale:
```{r}
#| layout-ncol: 2
#| fig-width: 4
ggplot(mpg, aes(hwy)) + geom_histogram(bins = 8)
ggplot(mpg, aes(hwy)) +
  geom_bar() +
  scale_x_binned()
```
In practice this is not the most useful example, since `geom_histogram()` already exists and supplies defaults that are generally more appropriate for histograms, but the technique can be extended.
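To make the idea of slicing a continuous variable into bins more concrete, it can help to do the discretisation by hand in base R. The sketch below uses `cut()` with bin edges that we picked ourselves. It is only an analogy for what a binned scale does, not a description of how `scale_x_binned()` chooses or applies its breaks:

```{r}
library(ggplot2)

# Illustrative bin edges chosen by hand; scale_x_binned() picks its own
edges <- seq(10, 45, by = 5)

# cut() replaces each continuous value with the interval that contains it
hwy_binned <- cut(mpg$hwy, breaks = edges)

# Counting the observations in each interval gives the kind of
# summary that a bar chart of the binned variable displays
table(hwy_binned)
```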
Suppose we want to use `geom_count()` in place of `geom_point()` in order to show the number of observations at each location.
The advantage of `geom_count()` is that the size of each dot scales with the number of observations at each location, but as the figure below illustrates, this method doesn't work very well when data vary continuously:
```{r}
base <- ggplot(mpg, aes(displ, hwy)) +
  geom_count()
base
```
This plot is rather cluttered, and not particularly easy to read.
To improve this, we can use `scale_x_binned()` to cut the values into bins before passing them to the geom:
```{r}
base +
  scale_x_binned(n.breaks = 15) +
  scale_y_binned(n.breaks = 15)
```
You can read more about how binned scales are used for non-position aesthetics in @sec-binned-colour and @sec-guide-bins.