Unexpected behavior using str_sub when start/end values are smaller or larger than the minimum/maximum positions of the 'subsetted' string #547

Closed
MiguelCos opened this issue Apr 17, 2024 · 3 comments · Fixed by #571


MiguelCos commented Apr 17, 2024

Dear tidyverse team,

I think I have found an unexpected behavior in str_sub that I want to report, because I didn't find anything like this in the issue section.

Imagine we have the following string:

string_test <- "MEGUSTAJUGARBEISBOL"

I want to be able to define a truncation site based on a substring (i.e., "JUGAR" in my example) and use that information to get the 5 letters before and after the truncation site. In this case, the truncation site would be just before the first "J", so I would expect the 5 letters after the truncation to be "JUGAR" and the 5 letters before the truncation to be "GUSTA". This works properly in the 1st example, but it doesn't when the truncation site is closer to the beginning of string_test.

Hopefully I can illustrate this better with the two examples below.

Example 1: shows expected behavior (5 letters before and after properly extracted)

# example 1: truncation at the end of 'MEGUSTA' 
peptide_test_1 <- "JUGAR"

str_locate(string_test, peptide_test_1)
     start end
[1,]     8  12

start_position <- str_locate(string_test, peptide_test_1)[, 1]
end_position <- str_locate(string_test, peptide_test_1)[, 2]


# 5 letters before truncation site
str_sub(string_test, start_position - 5, start_position - 1)
[1] "GUSTA"

# 5 letters after truncation site
str_sub(string_test, start_position, start_position + 4)
[1] "JUGAR"

However, when the 'truncation site' is at start == 2 of string_test, I get an empty result instead of the letter at position 1, which is what I would expect. See the example code:

Example 2: truncation after first "M", shows unexpected behavior

# example 2: truncation after first "M"
peptide_test_2 <- "EGUSTA"

str_locate(string_test, peptide_test_2)
     start end
[1,]     2   7

start_position <- str_locate(string_test, peptide_test_2)[, 1]
end_position <- str_locate(string_test, peptide_test_2)[, 2]

# 5 AAs before truncation site
str_sub(string_test, start_position - 5, start_position - 1)
[1] ""

As you can see, I get "" instead of "M", which is the only letter before the 'truncation site'. I would expect to get "M" if it is the only letter before my 'truncation site'.

I would define this as unexpected behavior, but please let me know if I am missing something.

Thank you very much in advance for taking the time to check this. I will be very happy to receive your feedback on this.

Best wishes,
Miguel

Session info:

R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] tibble_3.2.1  tidyr_1.3.0   stringr_1.5.1 dplyr_1.1.2   purrr_1.0.1  
[6] readr_2.1.4   here_1.0.1

loaded via a namespace (and not attached):
 [1] crayon_1.5.2     vctrs_0.6.3      cli_3.6.1        rlang_1.1.1
 [5] stringi_1.7.12   generics_0.1.3   seqinr_4.2-30    jsonlite_1.8.5
 [9] glue_1.6.2       bit_4.0.5        rprojroot_2.0.4  hms_1.1.3
[13] fansi_1.0.4      MASS_7.3-60      tzdb_0.4.0       lifecycle_1.0.4
[17] compiler_4.3.2   Rcpp_1.0.11      pkgconfig_2.0.3  R6_2.5.1
[21] tidyselect_1.2.0 utf8_1.2.3       parallel_4.3.2   vroom_1.6.4
[25] pillar_1.9.0     magrittr_2.0.3   withr_2.5.2      tools_4.3.2
[29] bit64_4.0.5      ade4_1.7-22

hadley (Member) commented Jul 15, 2024

Could you please rework your reproducible example to use the reprex package? That makes it easier to see both the input and the output, formatted in such a way that I can easily re-run it in a local session.

hadley added the "reprex" (needs a minimal reproducible example) label Jul 15, 2024

MiguelCos (Author) commented

Hello Hadley,

thanks for checking this.

I rewrote the report using reprex.

Hopefully it makes sense.

Required package and test string:

library(stringr)

string_test <- "MEGUSTAJUGARBEISBOL"

Example 1: shows expected behavior (5 letters before and after properly extracted)

# example 1: truncation at the end of 'MEGUSTA' 
peptide_test_1 <- "JUGAR"

str_locate(string_test, peptide_test_1)
#>      start end
#> [1,]     8  12

start_position <- str_locate(string_test, peptide_test_1)[, 1]
end_position <- str_locate(string_test, peptide_test_1)[, 2]

# 5 letters before truncation site
str_sub(string_test, start_position - 5, start_position - 1)
#> [1] "GUSTA"

# 5 letters after truncation site
str_sub(string_test, start_position, start_position + 4)
#> [1] "JUGAR"

Example 2: truncation after first "M", shows unexpected behavior

# example 2: truncation after first "M"
peptide_test_2 <- "EGUSTA"

str_locate(string_test, peptide_test_2)
#>      start end
#> [1,]     2   7

start_position <- str_locate(string_test, peptide_test_2)[, 1]
end_position <- str_locate(string_test, peptide_test_2)[, 2]

# 5 AAs before truncation site
str_sub(string_test, start_position - 5, start_position - 1) 
#> [1] ""

I would consider the "" output to be unexpected: the string "EGUSTA" starts at the second position of "MEGUSTAJUGARBEISBOL", so I would expect to get at least the only letter before it, "M", not an empty string.

Example 3 and 4: same unexpected behavior but with truncation at the 3rd or 4th element of the string

# example 3: truncation after the second element of the string
peptide_test_3 <- "GUSTAJU"

str_locate(string_test, peptide_test_3)
#>      start end
#> [1,]     3   9

start_position <- str_locate(string_test, peptide_test_3)[, 1]
end_position <- str_locate(string_test, peptide_test_3)[, 2]

# 5 AAs before truncation site
str_sub(string_test, start_position - 5, start_position - 1) 
#> [1] ""

# example 4: truncation after the third element of the string
peptide_test_4 <- "USTAJUG"

str_locate(string_test, peptide_test_4)
#>      start end
#> [1,]     4  10

start_position <- str_locate(string_test, peptide_test_4)[, 1]
end_position <- str_locate(string_test, peptide_test_4)[, 2]


# 5 AAs before truncation site
str_sub(string_test, start_position - 5, start_position - 1)
#> [1] ""

The results of the examples above seem to be related to the start argument having a negative value.

Example 5: shows expected behavior when truncation happens after 4th element of the string

This seems to be because, in this case, the start argument of str_sub is 0.

# example 5: truncation after the fourth element of the string

peptide_test_5 <- "STAJUGA"

str_locate(string_test, peptide_test_5)
#>      start end
#> [1,]     5  11

start_position <- str_locate(string_test, peptide_test_5)[, 1]

# 5 AAs before truncation site
str_sub(string_test, start_position - 5, start_position - 1)
#> [1] "MEGU"

Simplification of what I define as unexpected

Based on these code tests, it seems like a negative input in the start argument of str_sub with a positive value in the end argument would yield an empty result even if the input string contains 'subsettable' elements in that range.

Therefore for the following code examples

str_sub("MEGUSTAJUGARBEISBOL", -2, 7)
#> [1] ""

str_sub("MEGUSTAJUGARBEISBOL", -1, 7)
#> [1] ""

I would expect the same result as:

str_sub("MEGUSTAJUGARBEISBOL", 0, 7)
#> [1] "MEGUSTA"
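
For reference, these results are consistent with negative positions being counted from the end of the string rather than being treated as "before the first character". A minimal sketch that resolves the negative start by hand, assuming that convention (the names `x`, `n`, and `resolved_start` are illustrative only):

```r
library(stringr)

x <- "MEGUSTAJUGARBEISBOL"
n <- str_length(x)  # 19 characters

# Assumption: a negative position counts back from the end of the string,
# so -1 is the last character and -2 the second to last.
resolved_start <- n + (-2) + 1  # position 18

# With start resolved to 18 and end = 7, start lies past end,
# so both calls return an empty string.
identical(str_sub(x, -2, 7), str_sub(x, resolved_start, 7))

# Negative positions on both sides select from the right-hand end.
str_sub(x, -2, -1)
#> [1] "OL"
```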

Session info

sessionInfo()
#> R version 4.4.0 (2024-04-24 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 22631)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.utf8 
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: Europe/Berlin
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] stringr_1.5.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] styler_1.10.2     digest_0.6.33     fastmap_1.1.1     xfun_0.40        
#>  [5] magrittr_2.0.3    glue_1.7.0        R.utils_2.12.3    knitr_1.45       
#>  [9] htmltools_0.5.7   rmarkdown_2.26    lifecycle_1.0.4   cli_3.6.2        
#> [13] R.methodsS3_1.8.2 vctrs_0.6.5       reprex_2.1.0      withr_3.0.0      
#> [17] compiler_4.4.0    R.oo_1.26.0       R.cache_0.16.0    purrr_1.0.2      
#> [21] rstudioapi_0.15.0 tools_4.4.0       evaluate_0.23     yaml_2.3.8       
#> [25] rlang_1.1.3       fs_1.6.3          stringi_1.8.3

hadley (Member) commented Aug 20, 2024

Looks like the problem is that I've failed to document what happens with negative integers: they count back from the right-hand side of the string. This might not be the most intuitive behaviour for your use case, but it's useful in general, and in any case it is too late to change now.
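
For the use case in this issue, one possible workaround (a sketch, not something proposed in this thread) is to clamp the window start at 1 before calling `str_sub()`, so that it never becomes zero or negative. The names `peptides`, `starts`, and `window_start` are illustrative only:

```r
library(stringr)

string_test <- "MEGUSTAJUGARBEISBOL"

# The substrings used in the examples above, in order of their start position.
peptides <- c("EGUSTA", "GUSTAJU", "USTAJUG", "STAJUGA", "JUGAR")
starts <- str_locate(string_test, peptides)[, "start"]  # 2, 3, 4, 5, 8

# Clamp the start of the 5-letter window at the first character.
window_start <- pmax(starts - 5, 1)

# Up to 5 letters before each truncation site.
before <- str_sub(string_test, window_start, starts - 1)
before
#> [1] "M"     "ME"    "MEG"   "MEGU"  "GUSTA"
```

Note that a match starting at position 1 would give `end = 0` here; that case has no letters before the truncation site and may need separate handling.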

hadley added the "documentation" label and removed the "reprex" (needs a minimal reproducible example) label Aug 20, 2024
hadley added a commit that referenced this issue Aug 20, 2024
* Better documentation for `start` and `end`. Fixes #547
* Add test for empty strings
* Check `value` length and add test