14-strings.Rmd

# Strings

**Learning objectives:**

After this chapter, we'll be able to:

-   Create strings.
-   Combine strings.
-   Separate strings into rows and columns.
-   Measure the lengths of strings.
-   Extract substrings from strings.
-   Work with strings in different alphabets.

```{r strings-pkgs, message=FALSE, warning=FALSE, include=FALSE}
library(tidyverse)
library(babynames)
```

## Introduction {-}

-   Strings = text, eg the stuff you're reading here
-   {[stringr](https://stringr.tidyverse.org/)} = {tidyverse} string package ([stringr cheatsheet](https://github.com/rstudio/cheatsheets/blob/main/strings.pdf))
-   {stringr} functions start with `str_`

![Stringr Autocomplete](images/stringr-autocomplete.png)

## Creating strings: " and ' {-}

```{r strings-create}
string1 <- "This is a string"
string2 <- 'Include "quote characters" inside a string'
```

- `+` in console can mean you forgot to close a string:

![](images/string_stuck.png)

## Escapes {-}

-   "Escape" = tell R not to process special character

```{r strings-escapes}
double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
backslash <- "\\"
```

-   [`?Quotes`](https://rdrr.io/r/base/Quotes.html)

## Printing strings {-}

```{r strings-print-view}
x <- c(single_quote, double_quote, backslash)
x
str_view(x)
```

-   `print()` output = the format needed to define the string!

## Multiple escapes = confusion {-}

```{r strings-escapes-confusion}
tricky <- "double_quote <- \"\\\"\" # or '\"' single_quote <- '\\'' # or \"'\""
str_view(tricky)
```

## Raw strings {-}

-   Raw strings in R 4.0+ = `r"()"`
    -   Or `r"[]"`, `r"{}"`, `r"--()--"`, etc.

```{r strings-raw}
tricky2 <- r"(double_quote <- "\"" # or '"' single_quote <- '\'' # or "'")"
identical(tricky, tricky2)
```


## Other special characters {-}

-   `\n` = newline
-   `\t` = tab
-   `\u` or `\U` = unicode escapes (other special characters)

```{r strings-special}
str_view(c("one\ntwo", "one\ttwo", "\u00b5", "\U0001f604"))
```

## str_c() {-}

-   `str_c()` = combine multiple strings or multiple character vectors
-   Vectorized just like `base::paste0()` but safer recycling rules

```{r strings-str_c}
str_c("x", "y")
str_c("x", "y", "z")
str_c("Hello ", c("John", "Susan"))
```

## str_c() and missing values {-}

```{r strings-str_c-missing}
df <- tibble(name = c("Flora", "David", NA))
df |> 
  mutate(
    greeting0 = str_c("Hi ", name, "!"),
    greeting1 = str_c("Hi ", coalesce(name, "you"), "!"),
    greeting2 = coalesce(str_c("Hi ", name, "!"), "Hi!")
  )
```

## str_glue() {-}

-   `str_glue()` = combine strings with `{glue}` syntax
-   vectorized

```{r strings-str_glue}
df |> mutate(greeting = str_glue("Hi {name}!"))
```

## str_glue() escapes {-}

-   `{{` to escape `{`

```{r}
df |> mutate(greeting = str_glue("{{Hi {name}!}}"))
```

## str_flatten() {-}

-   `str_flatten()` = collapse a single character vector into a single string
-   Similar to `base::paste(collapse = X)`
-   Useful with `summarize()`

```{r strings-str_flatten}
tribble(
  ~ name, ~ fruit,
  "Carmen", "banana",
  "Carmen", "apple",
  "Marvin", "nectarine",
  "Terence", "cantaloupe",
  "Terence", "papaya",
  "Terence", "madarine"
) |>
  group_by(name) |> 
  summarize(fruits = str_flatten(fruit, ", ", last = ", and "))
```

## Separating strings {-}

{tidyr} has multiple `separate_*()` functions:

- in case the combined strings refer to the same variable:
  -   `separate_longer_delim()`
  -   `separate_longer_position()`
- in case the combined strings refer to different variables:
  -   `separate_wider_delim()`
  -   `separate_wider_position()`
  -   (`separate_wider_regex()` in next chapter)

## separate_longer_delim() {-}

```{r strings-separate_longer_delim}
tibble(x = c("a,b,c", "d,e", "f")) |> 
  separate_longer_delim(x, delim = ",")
```

## separate_longer_position {-}

Rarer, but sometimes data is weird

```{r strings-separate_longer_position}
tibble(x = c("1211", "131", "21")) |> 
  separate_longer_position(x, width = 1)
```

## separate_wider_delim() {-}

```{r strings-separate_wider_delim}
df3 <- tibble(x = c("a10.1.2022", "b10.2.2011", "e15.1.2015")) 
df3 |> 
  separate_wider_delim(
    x,
    delim = ".",
    names = c("code", "edition", "year")
  )
```

## Delim: Omit columns {-}

```{r strings-separate_wider_delim-omit}
df3 |> 
  separate_wider_delim(
    x, ".",
    names = c("code", NA, "year")
  )
```

## separate_wider_position() {-}

```{r strings-separate_wider_position}
df4 <- tibble(x = c("202215TX", "202122LA", "202325CA")) 
df4 |> 
  separate_wider_position(
    x,
    widths = c(year = 4, age = 2, state = 2)
  )
```

## Position: omit columns {-}

```{r strings-separate_wider_position-omit}
df4 |> 
  separate_wider_position(
    x, 
    widths = c(year = 4, 2, state = 2)
  )
```

## Diagnosing widening problems {-}

-   Default = error for too many/too few columns
-   `"debug"` to diagnose

```{r strings-separate_wider_delim-debug}
df <- tibble(x = c("1-1-1", "1-3", "1"))
df |> 
  separate_wider_delim(
    x, "-", names = c("x", "y", "z"),
    too_few = "debug"
  )
```

## too_few {-}

```{r strings-too_few}
df |> 
  separate_wider_delim(
    x, "-", names = c("x", "y", "z"),
    too_few = "align_end"
  )
```

## too_many {-}

```{r strings-too_many}
df |> 
  separate_wider_delim(
    x, "-", names = c("x", "y"),
    too_few = "align_start",
    too_many = "merge"
  )
```


## str_length() {-}

```{r strings-str_length}
str_length(c("a", "R for data science", NA))
```

## str_length() and babynames {-}

```{r strings-babynames-length}
babynames |>
  count(length = str_length(name), wt = n)
```

## Substrings {-}

```{r strings-sub_str}
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
str_sub(x, -3, -1)
str_sub("a", 1, 5)
```

##  Encoding {-}

-   Encoding = how each character is represented by one or more bytes
    -   1 byte is 8 bits (base-2 octet) = 2 hex digits (base-16)
-   Old way = multiple standards
    -   Byte `b1` in "Latin1" = "±"
    -   Byte `b1` in "Latin2" = "ą"
-   Modern = UTF-8 (all human characters)
    -   Seriously, ALL: `"\uA66E"` = `r "\uA66E"` (["multiocular O"](https://en.wikipedia.org/wiki/Cyrillic_o_variants#Multiocular_O))
        -   From a single 15th-century manuscript
        -   As "o" in a word meaning "many-eyed"

## Specifying encoding {-}

-   {readr} uses UTF-8 unless you tell it otherwise

```{r strings-encoding-specify, message=FALSE, warning=FALSE}
# \xf1 is the Latin1 encoding of ñ
x <- "text\nEl Ni\xf1o was particularly bad this year"
read_csv(charToRaw(x), locale = locale(encoding = "Latin1"))$text
```

## Locales (1) {-}

-   `locale` = specifier of language and optionally the region
    -   e.g. `en`, `en_US`, `en_UK`, `nl_BE`, etc.
    -   wrt strings, it affects **changing case** and **sorting**
-   base R defaults to your personal system's locale = different behaviour between machines!
-   `{stringr}` defaults to `en`; can take explicit locale = consistent behaviour!
    -   many locales supported: [`stringi::stri_locale_list()`](https://stringi.gagolewski.com/rapi/stri_locale_list.html#examples)

## Locales (2) {-}

```{r strings-locale-upper}
str_to_upper(c("i", "ı"))
str_to_upper(c("i", "ı"), locale = "tr")
```

```{r strings-locale-sort}
str_sort(c("c", "ch", "h"))
str_sort(c("c", "ch", "h"), locale = "cs")
```

## Meeting Videos {-}

### Cohort 5 {-}

`r knitr::include_url("https://www.youtube.com/embed/RpZoQR2xwME")`

<details>
  <summary> Meeting chat log </summary>
  
```
00:10:10	Njoki Njuki Lucy:	Hi everyone!
00:10:24	Federica Gazzelloni:	Hello!
00:10:31	Sandra Muroy:	hi!
00:11:09	Federica Gazzelloni:	Hi Sandra!
00:11:27	Sandra Muroy:	Hi Federica :D
00:11:43	Sandra Muroy:	Hi Lucy!
00:19:20	Jon Harmon (jonthegeek):	https://xkcd.com/208/
00:20:10	Jon Harmon (jonthegeek):	https://smltar.com/ is the new book by Julia Silge and Emil Hvitfeldt.
00:20:42	Jon Harmon (jonthegeek):	https://www.tidytextmining.com/ is the older Tidytext book by Julia Silge and David Robinson.
00:28:15	Njoki Njuki Lucy:	do you also load the packages or that's sufficient for a readable code?
00:28:59	Jon Harmon (jonthegeek):	Within an Rmd I tend to put a library() call per package at the top so it's easy to find, but in my exploration work I rarely library anything other than magrittr.
00:30:11	Njoki Njuki Lucy:	okay, thank you
00:39:53	Jon Harmon (jonthegeek):	> x <- r"(this treats all characters as ecaped. For example: \ " ')"
> writeLines(x)
this treats all characters as ecaped. For example: \ " '
00:45:30	Jon Harmon (jonthegeek):	\"
00:45:31	Federica Gazzelloni:	textutils: https://github.com/enricoschumann/textutils
00:45:42	Jon Harmon (jonthegeek):	"\""
00:46:00	Jon Harmon (jonthegeek):	"""
00:46:19	Jon Harmon (jonthegeek):	"\\"
00:49:07	Jon Harmon (jonthegeek):	> "\"
+
00:50:06	Jon Harmon (jonthegeek):	> cat("this\ttab")
this	tab
00:50:21	Jon Harmon (jonthegeek):	> cat("this\nnewline")
this
newline
00:50:40	Jon Harmon (jonthegeek):	> cat("this\something"
Error: '\s' is an unrecognized escape in character string starting ""this\s"
00:51:20	Njoki Njuki Lucy:	did I get it right that the character after \ is escaped not printed?
00:51:30	Jon Harmon (jonthegeek):	correct
00:51:41	Jon Harmon (jonthegeek):	https://regexr.com/
00:59:49	Jon Harmon (jonthegeek):	> str_c(rep("prefix", 3), c("a", "b", "c"), rep("suffix", 3), sep = "-")
[1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"
01:00:25	Jon Harmon (jonthegeek):	> str_c("prefix", c("a", "b", "c"), "suffix", sep = "-")
[1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"
01:01:26	Jon Harmon (jonthegeek):	> str_c(c("prefix", "other", "another"), c("a", "b", "c"), "suffix", sep = "-")
[1] "prefix-a-suffix"  "other-b-suffix"   "another-c-suffix"
01:02:02	Jon Harmon (jonthegeek):	> str_c(c("prefix", "other"), c("a", "b", "c"), "suffix", sep = "-")
[1] "prefix-a-suffix" "other-b-suffix"  "prefix-c-suffix"
Warning message:
In stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE) :
  longer object length is not a multiple of shorter object length
01:04:54	Jon Harmon (jonthegeek):	> str_c("prefix", c("a", "b", "c"), "suffix", sep = "-", collapse = "|")
[1] "prefix-a-suffix|prefix-b-suffix|prefix-c-suffix"
01:09:51	Njoki Njuki Lucy:	thank you so much Ryan and Jon, strings are starting to make sense now!
01:09:53	Sandra Muroy:	thanks so much Ryna!
01:09:56	Sandra Muroy:	Ryan!
```
</details>

`r knitr::include_url("https://www.youtube.com/embed/A-_URHkvxNA")`

<details>
  <summary> Meeting chat log </summary>
  
```
00:07:40	Jon Harmon (jonthegeek):	statlearning.com
00:23:39	Jon Harmon (jonthegeek):	^ = start of text
$ = end of text
00:31:17	Jon Harmon (jonthegeek):	Notes are finally updated on our site if you want to follow along or review: https://r4ds.github.io/bookclub-r4ds/strings.html
00:31:37	Sandra Muroy:	thanks Jon!
00:37:56	Jon Harmon (jonthegeek):	https://en.wikipedia.org/wiki/Harvard_sentences
00:42:35	Jon Harmon (jonthegeek):	sum(stringr::str_detect(sentences, str_c(words, collapse = "|"), negate = TRUE))
00:45:50	Jon Harmon (jonthegeek):	1 %in% 1:10
00:46:02	Jon Harmon (jonthegeek):	"a" %in% "ab"
00:46:08	Jon Harmon (jonthegeek):	"a"
00:46:10	Jon Harmon (jonthegeek):	"ab"
00:46:18	Jon Harmon (jonthegeek):	"a" %in% c("a", "b")
01:02:39	Jon Harmon (jonthegeek):	regexr.com
```
</details>

### Cohort 6 {-}

`r knitr::include_url("https://www.youtube.com/embed/l07Vo1wMW5s")`

<details>
  <summary> Meeting chat log </summary>
  
```
00:09:53	Daniel Adereti:	We will kick off in 1 min
00:14:55	Shannon:	https://r4ds.github.io/bookclub-r4ds/string-basics.html
00:18:48	Shannon:	https://stringr.tidyverse.org/
00:19:06	Shannon:	https://github.com/rstudio/cheatsheets/blob/main/strings.pdf
00:21:02	Daniel Adereti:	For new members, we have this sheet where we volunteer to take classes each week, you are welcome to fill in your names. Thanks! https://docs.google.com/spreadsheets/d/1zy2nXNkvcdqWuF8rQ5ApWRkVQG_UJt0azu3h_mEnY2E/edit?usp=drive_web&ouid=115626840160043101981
00:26:49	Shannon:	https://en.wikipedia.org/wiki/List_of_Unicode_characters
01:05:02	Daniel Adereti:	That's fine!
01:05:13	Daniel Adereti:	We can pick up next week
01:05:37	Adeyemi Olusola:	Thank you!
01:05:45	Folashade Bayode:	Thank you 😊
```
</details>

`r knitr::include_url("https://www.youtube.com/embed/k7poFr556To")`

<details>
  <summary> Meeting chat log </summary>
  
```
00:15:15	Daniel Adereti:	We have this nice book that solves the exercises, if you are interested: https://jrnold.github.io/r4ds-exercise-solutions/strings.html
00:15:35	Marielena Soilemezidi:	awesome, thanks Daniel! :)
00:22:13	Marielena Soilemezidi:	the book has the same solution for this, so you did great actually! :)
00:36:42	Shannon:	https://regexcrossword.com/challenges/beginner
00:37:04	Marielena Soilemezidi:	ooh cool!
00:37:10	Marielena Soilemezidi:	thanks Shannon!
00:43:17	Marielena Soilemezidi:	0.0
00:43:26	Marielena Soilemezidi:	who thought of this example?!
00:52:47	Daniel Adereti:	Hey Marielena, if you want to volunteer for a class, you can fill in your name here: https://docs.google.com/spreadsheets/d/1zy2nXNkvcdqWuF8rQ5ApWRkVQG_UJt0azu3h_mEnY2E/edit#gid=0
00:55:10	Marielena Soilemezidi:	Thanks, Daniel! I'll check the next available chapters :)
```
</details>

`r knitr::include_url("https://www.youtube.com/embed/9UuMNJB3KFA")`

<details>
  <summary> Meeting chat log </summary>
  
```
00:03:15	Adeyemi Olusola:	Good day!
00:04:23	Adeyemi Olusola:	That sounds perfect
00:48:27	Adeyemi Olusola:	I have to drop off now.
Thanks Shannon for the wonderful presentation!
00:49:28	Marielena Soilemezidi:	oh, cool! :)
00:50:24	Marielena Soilemezidi:	haha yaay!
00:51:47	Marielena Soilemezidi:	weird!
```
</details>


### Cohort 7 {-}

`r knitr::include_url("https://www.youtube.com/embed/NQq6EoCCsR0")`

<details>
<summary> Meeting chat log </summary>

```
00:16:35	Oluwafemi Oyedele:	https://github.com/rstudio/cheatsheets/blob/main/strings.pdf
00:16:54	Oluwafemi Oyedele:	https://stringr.tidyverse.org/
00:19:37	Olukunle Tunde:	no sir
00:20:33	Oluwafemi Oyedele:	https://style.tidyverse.org/syntax.html#character-vectors
00:33:16	Oluwafemi Oyedele:	https://glue.tidyverse.org/
00:40:47	Oluwafemi Oyedele:	https://tidyr.tidyverse.org/reference/separate_longer_delim.html
00:42:51	Oluwafemi Oyedele:	https://tidyr.tidyverse.org/reference/separate_longer_delim.html
00:51:51	Oluwafemi Oyedele:	https://stringr.tidyverse.org/reference/str_length.html
00:59:39	Oluwafemi Oyedele:	https://rdrr.io/r/base/rawConversion.html
01:02:17	Oluwafemi Oyedele:	http://kunststube.net/encoding/
01:06:49	Olukunle Bashir:	thank you very much oluwafemi. it's well understood
```
</details>


### Cohort 8 {-}

`r knitr::include_url("https://www.youtube.com/embed/AiIZ0U6zMvU")`

<details>
<summary> Meeting chat log </summary>

```
00:27:26	shamsuddeen: https://github.com/rstudio/cheatsheets/blob/main/strings.pdf
00:27:37	Abdou: Thanks!
```
</details>