Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R] stringr binding for str_sub() silently mishandles negative start/stop values #43960

Closed
coussens opened this issue Sep 5, 2024 · 5 comments
Assignees
Labels
Component: R Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. Priority: Critical Type: bug
Milestone

Comments

@coussens
Copy link
Contributor

coussens commented Sep 5, 2024

Describe the bug, including details regarding any error messages, version, and platform.

I noticed some unusual behavior behavior when attempting to use negative start/end values (i.e. counting from the end of the string) when using str_sub() in arrow. I've included a few examples below, contrasting how str_sub behaves with tibbles in R and arrow tables:

library(arrow)
library(tidyverse)
library(reprex)

str_tbl <- tibble(my_string = 'abcde')
str_tbl_a <- as_arrow_table(str_tbl)

# example #1: extract first through second-to-last characters from string

# works fine with a tibble
str_tbl %>% mutate(my_substring = str_sub(my_string, start = 1, end = -2))
#> # A tibble: 1 × 2
#>   my_string my_substring
#>   <chr>     <chr>       
#> 1 abcde     abcd

# but with arrow: ruh-roh -- missing all characters
str_tbl_a %>%  mutate(my_substring = str_sub(my_string, start = 1, end = -2)) %>% collect()
#> # A tibble: 1 × 2
#>   my_string my_substring
#>   <chr>     <chr>       
#> 1 abcde     ""


# example #2: extract third-to-last through second-to-last characters from string

# works fine with a tibble
str_tbl %>% mutate(my_substring = str_sub(my_string, start = -3, end = -2))
#> # A tibble: 1 × 2
#>   my_string my_substring
#>   <chr>     <chr>       
#> 1 abcde     cd

# but with arrow: ruh-roh -- missing a character
str_tbl_a %>% mutate(my_substring = str_sub(my_string, -3, -2)) %>% collect()
#> # A tibble: 1 × 2
#>   my_string my_substring
#>   <chr>     <chr>       
#> 1 abcde     c


# example #3: extract third-to-last through last characters from string

# works fine with a tibble
str_tbl %>% mutate(my_substring = str_sub(my_string, start = -3, end = -1))
#> # A tibble: 1 × 2
#>   my_string my_substring
#>   <chr>     <chr>       
#> 1 abcde     cde

# but with arrow: bizarrely, this is also fine
str_tbl_a %>% mutate(my_substring = str_sub(my_string, -3, -1)) %>% collect()
#> # A tibble: 1 × 2
#>   my_string my_substring
#>   <chr>     <chr>       
#> 1 abcde     cde

Created on 2024-09-05 with reprex v2.1.1

Note: the above reprex was created on an Ubuntu 22.04 system running R 4.4.1 and Arrow 16.1.0

Component(s)

R

@coussens coussens changed the title stringr binding for str_sub() silently mishandles 'negative' index values [R] stringr binding for str_sub() silently mishandles 'negative' index values Sep 5, 2024
@coussens coussens changed the title [R] stringr binding for str_sub() silently mishandles 'negative' index values [R] stringr binding for str_sub() silently mishandles negative start/stop values Sep 5, 2024
@thisisnic
Copy link
Member

Can confirm that I can replicate this with the dev version of Arrow too.

In our implementation of str_sub() there's a bit that looks like this:

    # An end value lower than a start value returns an empty string in
    # stringr::str_sub so set end to 0 here to match this behavior
    if (end < start) {
      end <- 0
    }

That might not be quite right, so we'll need to add some more tests and revisit the logic here.

@coussens
Copy link
Contributor Author

coussens commented Sep 11, 2024

@thisisnic totally agree that the snippet you point out is the culprit for the behavior in Example 1 above. I think the conditional logic could instead be changed to the following in order to avoid returning empty strings for negative values of end:

    if(end < start & end > 0) {
      end <- 0
      start <- 0
    }

Upon looking at the source code (sorry I hadn't taken the time to examine before), I think there are more problems in addition to this (but please correct me if I'm wrong!).

I believe the end argument in the C++ function utf8_slice_codeunits is exclusive, so won't include my_string[end] in the returned substring. For positive end values, this poses no issue, since in C++ is 0-based and R is 1-based, cancelling each other out. Note that for positive start values, the below snippet in the source code addresses this:

    # subtract 1 from `start` because C++ is 0-based and R is 1-based
    # str_sub treats a `start` value of 0 or 1 as the same thing so don't subtract 1 when `start` == 0
    # when `start` < 0, both str_sub and utf8_slice_codeunits count backwards from the end
    if (start > 0) {
      start <- start - 1L
    }

For negative start values, this isn't necessary, as noted in source comments above. But that leaves the issue of negative end values unsolved. This explains why in my Example 2 above arrow::str_sub fails to include my_string[-2] (i.e. only returns 'c' instead of the correct 'cd').

However, for just the [default] case when end == -1, this problem is incidentally fixed by this code:

    # In stringr::str_sub, an `end` value of -1 means the end of the string, so
    # set it to the maximum integer to match this behavior
    if (end == -1) {
      end <- .Machine$integer.max
    }

which explains why arrow::str_sub works fine (as in my Example 3 above) when end == -1, but not when end < -1 (which confused me at first).

I'm happy to create a MR that should fix this, with the warning that while I've been working with R for quite a while, I'm a total newb when it comes to the collaborative development side of things (@thisisnic thanks so much for your https://github.com/forwards/first-contributions tutorial!).

@thisisnic
Copy link
Member

I'd love if you'd be happy to contribute a PR! And don't worry, we're friendly around here - we don't expect everything t be perfect first time round (my PRs still aren't ;) ) and will help you through the process of become an arrow contributor!

Thanks for investigating this so thoroughly, and look forward to reviewing a PR! :D

@coussens
Copy link
Contributor Author

OK, sounds good -- I'll write a PR for this!

thisisnic added a commit that referenced this issue Sep 21, 2024
… values (#44141)

First-time contributor here, so let me know where I can improve!

### Rationale for this change

The `str_sub` binding in arrow was not handling negative `end` values properly. The problem was two-fold: 
 1.  When `end` values were negative (and less than the `start` value, which might be positive), `str_sub` would improperly return an empty string. 
2.  When `end` values were < -1 but the `end` position was still to the right of the `start` position, `str_sub` failed to return the final character in the substring, since it did not account for the fact that `end` is counted exclusively in the underlying C++ function (`utf8_slice_codeunits`), but inclusively in R. 

See discussion/examples at #43960 for details.

### What changes are included in this PR?
1. The removal of lines from `r/R/dplyr-funcs-string.R` that previously set `end`= 0 when `start < end`, which meant if the user was counting backwards from the end of the string (with a negative `end` value), an empty string would [wrongly] be returned. It appears that the case that the previous code was trying to address is already handled properly by the underlying C++ function (`utf8_slice_codeunits`).
2. Addition of lines to `r/R/dplyr-funcs-string.R` in order to account the difference in between R's inclusive `end` and C++'s exclusive `end` when `end` is negative.
3. The addition of a test (described below) to `r/tests/testthat/test-dplyr-funcs-string.R` to test for these cases.

### Are these changes tested?

Yes, I ran all tests in `r/tests/testthat/test-dplyr-funcs-string.R`, including one which I added (see attached commit), which explicitly tests the case where `end` is negative (-3) and less than the `start` value (1). This also tests the case where `end`  < -1 and to the right of the `start` position.

### Are there any user-facing changes?

No.

**This PR contains a "Critical Fix".** Previously:
- When `end` values were negative (and less than the `start` value, which might be positive), `str_sub` would improperly return an empty string. 
- When `end` values were < -1 but the `end` position was still to the right of the `start` position, `str_sub` failed to return the final character in the substring, since it did not account for the fact that `end` is counted exclusively in the underlying C++ function (`utf8_slice_codeunits`), but inclusively in R. 
* GitHub Issue: #43960

Lead-authored-by: Stephen Coussens <coussens@users.noreply.github.com>
Co-authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
@thisisnic thisisnic added this to the 18.0.0 milestone Sep 21, 2024
@thisisnic
Copy link
Member

Issue resolved by pull request 44141
#44141

@amoeba amoeba added the Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. label Oct 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: R Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. Priority: Critical Type: bug
Projects
None yet
Development

No branches or pull requests

3 participants