Unexpected behavior using `str_sub` when having bigger or smaller start/end values than the minimum/maximum length of the 'subsetted' string #547

MiguelCos · 2024-04-17T09:13:25Z

Dear tidyverse team,

I think I have found an unexpected behavior in str_sub that I want to report, because I didn't find anything like this in the issue section.

Imagine we have the following string:

string_test <- "MEGUSTAJUGARBEISBOL"

I want to be able to define a truncation site based on a substring (i.e., "JUGAR", in my example), and use that information to get the 5 letters before and after the truncation site. In this case, the truncation site would be before the first "J", so I would expect the 5 letters after the truncation to be "JUGAR" and the 5 letters before the truncation to be "GUSTA". This works properly in the 1st example, but it doesn't when the truncation site is closer to the beginning of string_test.

Hopefully I can illustrate this better with the two examples below.

Example 1: shows expected behavior (5 letters before and after properly extracted)

# example 1: truncation at the end of 'MEGUSTA' 
peptide_test_1 <- "JUGAR"

str_locate(string_test, peptide_test_1)
     start end
[1,]     8  12

start_position <- str_locate(string_test, peptide_test_1)[, 1]
end_position <- str_locate(string_test, peptide_test_1)[, 2]


# 5 letters before truncation site
str_sub(string_test, start_position - 5, start_position - 1)
[1] "GUSTA"

# 5 letters after truncation site
str_sub(string_test, start_position, start_position + 4)
[1] "JUGAR"

Nevertheless, when the 'truncation site' is at start == 2 of string_test, I get an empty result instead of the expected letter at position 1. See the example code:

Example 2: truncation after first "M", shows unexpected behavior

# example 2: truncation after first "M"
peptide_test_2 <- "EGUSTA"

str_locate(string_test, peptide_test_2)
     start end
[1,]     2   7

start_position <- str_locate(string_test, peptide_test_2)[, 1]
end_position <- str_locate(string_test, peptide_test_2)[, 2]

# 5 AAs before truncation site
str_sub(string_test, start_position - 5, start_position - 1)
[1] ""

As you can see, I get "" instead of "M", which is the only letter before the 'truncation site'. I would expect to get "M" if it is the only letter before my 'truncation site'.

I would define this as unexpected behavior, but please let me know if I am missing something.

Thank you very much in advance for taking the time to check this. I will be very happy to receive your feedback on this.

Best wishes,
Miguel

Session info:

R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] tibble_3.2.1  tidyr_1.3.0   stringr_1.5.1 dplyr_1.1.2   purrr_1.0.1  
[6] readr_2.1.4   here_1.0.1

loaded via a namespace (and not attached):
 [1] crayon_1.5.2     vctrs_0.6.3      cli_3.6.1        rlang_1.1.1
 [5] stringi_1.7.12   generics_0.1.3   seqinr_4.2-30    jsonlite_1.8.5
 [9] glue_1.6.2       bit_4.0.5        rprojroot_2.0.4  hms_1.1.3
[13] fansi_1.0.4      MASS_7.3-60      tzdb_0.4.0       lifecycle_1.0.4
[17] compiler_4.3.2   Rcpp_1.0.11      pkgconfig_2.0.3  R6_2.5.1
[21] tidyselect_1.2.0 utf8_1.2.3       parallel_4.3.2   vroom_1.6.4
[25] pillar_1.9.0     magrittr_2.0.3   withr_2.5.2      tools_4.3.2
[29] bit64_4.0.5      ade4_1.7-22


hadley · 2024-07-15T21:21:46Z

Could you please rework your reproducible example to use the reprex package? That makes it easier to see both the input and the output, formatted in such a way that I can easily re-run in a local session.

MiguelCos · 2024-08-07T18:32:39Z

Hello Hadley,

thanks for checking this.

I rewrote the report using reprex.

Hopefully it makes sense.

Required package and test string:

library(stringr)

string_test <- "MEGUSTAJUGARBEISBOL"

Example 1: shows expected behavior (5 letters before and after properly extracted)

# example 1: truncation at the end of 'MEGUSTA' 
peptide_test_1 <- "JUGAR"

str_locate(string_test, peptide_test_1)
#>      start end
#> [1,]     8  12

start_position <- str_locate(string_test, peptide_test_1)[, 1]
end_position <- str_locate(string_test, peptide_test_1)[, 2]

# 5 letters before truncation site
str_sub(string_test, start_position - 5, start_position - 1)
#> [1] "GUSTA"

# 5 letters after truncation site
str_sub(string_test, start_position, start_position + 4)
#> [1] "JUGAR"

Example 2: truncation after first "M", shows unexpected behavior

# example 2: truncation after first "M"
peptide_test_2 <- "EGUSTA"

str_locate(string_test, peptide_test_2)
#>      start end
#> [1,]     2   7

start_position <- str_locate(string_test, peptide_test_2)[, 1]
end_position <- str_locate(string_test, peptide_test_2)[, 2]

# 5 AAs before truncation site
str_sub(string_test, start_position - 5, start_position - 1) 
#> [1] ""

I would consider the "" output to be unexpected: the string "EGUSTA" starts at the second position of "MEGUSTAJUGARBEISBOL", so I would expect to get at least the only letter before it, "M", not an empty string.

Examples 3 and 4: same unexpected behavior but with truncation at the 3rd or 4th element of the string

# example 3: truncation after the second element of the string
peptide_test_3 <- "GUSTAJU"

str_locate(string_test, peptide_test_3)
#>      start end
#> [1,]     3   9

start_position <- str_locate(string_test, peptide_test_3)[, 1]
end_position <- str_locate(string_test, peptide_test_3)[, 2]

# 5 AAs before truncation site
str_sub(string_test, start_position - 5, start_position - 1) 
#> [1] ""

# example 4: truncation after the third element of the string
peptide_test_4 <- "USTAJUG"

str_locate(string_test, peptide_test_4)
#>      start end
#> [1,]     4  10

start_position <- str_locate(string_test, peptide_test_4)[, 1]
end_position <- str_locate(string_test, peptide_test_4)[, 2]


# 5 AAs before truncation site
str_sub(string_test, start_position - 5, start_position - 1)
#> [1] ""

The results of the examples above seem to be related to the start argument being negative.

Example 5: shows expected behavior when truncation happens after 4th element of the string

This seems to be because, in this case, the start argument of str_sub is 0.

# example 5: truncation after the fourth element of the string

peptide_test_5 <- "STAJUGA"

str_locate(string_test, peptide_test_5)
#>      start end
#> [1,]     5  11

start_position <- str_locate(string_test, peptide_test_5)[, 1]

# 5 AAs before truncation site
str_sub(string_test, start_position - 5, start_position - 1)
#> [1] "MEGU"

Simplification of what I define as unexpected

Based on these tests, it seems that a negative start argument combined with a positive end argument in str_sub yields an empty result, even if the input string contains 'subsettable' elements in that range.

Therefore for the following code examples

str_sub("MEGUSTAJUGARBEISBOL", -2, 7)
#> [1] ""

str_sub("MEGUSTAJUGARBEISBOL", -1, 7)
#> [1] ""

I would expect the same result as:

str_sub("MEGUSTAJUGARBEISBOL", 0, 7)
#> [1] "MEGUSTA"

Session info

sessionInfo()
#> R version 4.4.0 (2024-04-24 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 22631)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.utf8 
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: Europe/Berlin
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] stringr_1.5.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] styler_1.10.2     digest_0.6.33     fastmap_1.1.1     xfun_0.40        
#>  [5] magrittr_2.0.3    glue_1.7.0        R.utils_2.12.3    knitr_1.45       
#>  [9] htmltools_0.5.7   rmarkdown_2.26    lifecycle_1.0.4   cli_3.6.2        
#> [13] R.methodsS3_1.8.2 vctrs_0.6.5       reprex_2.1.0      withr_3.0.0      
#> [17] compiler_4.4.0    R.oo_1.26.0       R.cache_0.16.0    purrr_1.0.2      
#> [21] rstudioapi_0.15.0 tools_4.4.0       evaluate_0.23     yaml_2.3.8       
#> [25] rlang_1.1.3       fs_1.6.3          stringi_1.8.3

hadley · 2024-08-20T19:51:50Z

Looks like the problem is that I've failed to document what happens with negative integers — they count back from the right-hand side of the string. This might not be the most intuitive behaviour for your use case, but it's useful in general, and anyway is too late to change now.
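For readers who land here with the same use case, the following is a minimal sketch of how the results above can be read and worked around. It assumes only the behaviour described in this thread (negative positions count from the right). The helper `flank_before()` and its argument names are hypothetical, not part of stringr.

```r
library(stringr)

string_test <- "MEGUSTAJUGARBEISBOL"  # 19 characters

# Negative positions count from the right: -5 is the 15th character, -1 the last.
str_sub(string_test, -5, -1)
#> [1] "ISBOL"

# Example 2 from the report: start = 2 - 5 = -3 resolves to position 17,
# which lies after end = 1, so the extracted range is empty.
str_sub(string_test, -3, 1)
#> [1] ""

# Hypothetical helper (not part of stringr): up to `n` characters before a site.
# pmax() clamps the start at 1 so it never becomes zero or negative.
flank_before <- function(string, site_start, n = 5) {
  ifelse(
    site_start > 1,
    str_sub(string, pmax(site_start - n, 1), site_start - 1),
    ""  # nothing precedes a site at position 1
  )
}

# Site starts from examples 1, 2, 3 and 5 of the report
flank_before(string_test, c(8, 2, 3, 5))
#> [1] "GUSTA" "M"     "ME"    "MEGU"
```

Clamping the start position before calling str_sub keeps the documented right-hand counting intact for callers who rely on it, while giving the truncated flank that the report expected near the beginning of the string.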


hadley added the reprex needs a minimal reproducible example label Jul 15, 2024

hadley added documentation and removed reprex needs a minimal reproducible example labels Aug 20, 2024

hadley added a commit that referenced this issue Aug 20, 2024

Polishing str_sub()

925113f

* Better documentation for `start` and `end`. Fixes #547 * Add test for empty strings * Check `value` length and add test

hadley mentioned this issue Aug 20, 2024

Polishing str_sub() #571

Merged

hadley closed this as completed in #571 Aug 20, 2024

hadley added a commit that referenced this issue Aug 20, 2024

Polishing str_sub() (#571)

9304301

* Better documentation for `start` and `end`. Fixes #547 * Add test for empty strings * Check `value` length and add test
