[R] stringr binding for str_sub() silently mishandles negative start/stop values #43960
Can confirm that I can replicate this with the dev version of Arrow too. In our implementation of
That might not be quite right, so we'll need to add some more tests and revisit the logic here.
@thisisnic totally agree that the snippet you point out is the culprit for the behavior in Example 1 above. I think the conditional logic could instead be changed to the following in order to avoid returning empty strings for negative values of
Upon looking at the source code (sorry I hadn't taken the time to examine before), I think there are more problems in addition to this (but please correct me if I'm wrong!). I believe the
For negative However, for just the [default] case when
which explains why I'm happy to create a MR that should fix this, with the warning that while I've been working with R for quite a while, I'm a total newb when it comes to the collaborative development side of things (@thisisnic thanks so much for your https://github.com/forwards/first-contributions tutorial!).
I'd love it if you'd be happy to contribute a PR! And don't worry, we're friendly around here - we don't expect everything to be perfect first time round (my PRs still aren't ;) ) and will help you through the process of becoming an arrow contributor! Thanks for investigating this so thoroughly, and I look forward to reviewing a PR! :D
OK, sounds good -- I'll write a PR for this!
… values (#44141)

First-time contributor here, so let me know where I can improve!

### Rationale for this change

The `str_sub` binding in arrow was not handling negative `end` values properly. The problem was two-fold:

1. When `end` values were negative (and less than the `start` value, which might be positive), `str_sub` would improperly return an empty string.
2. When `end` values were < -1 but the `end` position was still to the right of the `start` position, `str_sub` failed to return the final character in the substring, since it did not account for the fact that `end` is counted exclusively in the underlying C++ function (`utf8_slice_codeunits`), but inclusively in R.

See discussion/examples at #43960 for details.

### What changes are included in this PR?

1. The removal of lines from `r/R/dplyr-funcs-string.R` that previously set `end = 0` when `end < start`, which meant that if the user was counting backwards from the end of the string (with a negative `end` value), an empty string would [wrongly] be returned. It appears that the case the previous code was trying to address is already handled properly by the underlying C++ function (`utf8_slice_codeunits`).
2. Addition of lines to `r/R/dplyr-funcs-string.R` to account for the difference between R's inclusive `end` and C++'s exclusive `end` when `end` is negative.
3. The addition of a test (described below) to `r/tests/testthat/test-dplyr-funcs-string.R` to test for these cases.

### Are these changes tested?

Yes, I ran all tests in `r/tests/testthat/test-dplyr-funcs-string.R`, including one which I added (see attached commit), which explicitly tests the case where `end` is negative (-3) and less than the `start` value (1). This also tests the case where `end` < -1 and to the right of the `start` position.

### Are there any user-facing changes?

No.

**This PR contains a "Critical Fix".** Previously:

- When `end` values were negative (and less than the `start` value, which might be positive), `str_sub` would improperly return an empty string.
- When `end` values were < -1 but the `end` position was still to the right of the `start` position, `str_sub` failed to return the final character in the substring, since it did not account for the fact that `end` is counted exclusively in the underlying C++ function (`utf8_slice_codeunits`), but inclusively in R.

* GitHub Issue: #43960

Lead-authored-by: Stephen Coussens <coussens@users.noreply.github.com>
Co-authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
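
The index translation described in the PR can be sketched in base R. This is only an illustration under stated assumptions, not the code in `r/R/dplyr-funcs-string.R`: `to_slice_bounds()`, `slice_chars()` and `str_sub_sketch()` are hypothetical names, and `slice_chars()` is a stand-in that assumes the slicing kernel behaves like a 0-based, end-exclusive slice with negative indices counted from the end of the string.

```r
# Hypothetical helper: translate stringr-style bounds (1-based, inclusive `end`)
# into slice bounds (0-based, exclusive stop). Scalar inputs only.
to_slice_bounds <- function(start, end) {
  # Positive starts shift down by one; 0 and negative starts are left unchanged.
  if (start > 0) start <- start - 1
  if (end == -1) {
    # "Through the last character": use a stop beyond any string length.
    stop_excl <- .Machine$integer.max
  } else if (end < -1) {
    # An exclusive stop sits one position to the right of an inclusive end.
    stop_excl <- end + 1
  } else {
    # For end >= 0, the inclusive 1-based end equals the exclusive 0-based stop.
    stop_excl <- end
  }
  list(start = start, stop_excl = stop_excl)
}

# Assumed stand-in for the slicing kernel: 0-based, end-exclusive,
# negative indices resolved relative to the end of the string.
slice_chars <- function(x, start, stop_excl) {
  n <- nchar(x)
  resolve <- function(i) {
    if (i < 0) i <- i + n
    min(max(i, 0), n) # clamp to [0, n]
  }
  lo <- resolve(start)
  hi <- resolve(stop_excl)
  if (hi <= lo) return("")
  substr(x, lo + 1, hi)
}

# Hypothetical wrapper combining the two steps.
str_sub_sketch <- function(x, start, end) {
  b <- to_slice_bounds(start, end)
  slice_chars(x, b$start, b$stop_excl)
}

str_sub_sketch("abcdefgh", 1, -3)  # "abcdef"
str_sub_sketch("abcdefgh", -5, -2) # "defg"
str_sub_sketch("abcdefgh", 3, -1)  # "cdefgh"
```

In this sketch, dropping the `end + 1` adjustment would yield `"abcde"` for the first call (the missing final character), and forcing the stop to 0 whenever `end < start` would yield `""`, which mirrors the two symptoms listed above.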
Issue resolved by pull request 44141
Describe the bug, including details regarding any error messages, version, and platform.
I noticed some unusual behavior when attempting to use negative start/end values (i.e. counting from the end of the string) with `str_sub()` in arrow. I've included a few examples below, contrasting how `str_sub` behaves with tibbles in R and arrow tables:

Created on 2024-09-05 with reprex v2.1.1
Note: the above reprex was created on an Ubuntu 22.04 system running R 4.4.1 and Arrow 16.1.0
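
For reference, the negative-index convention at issue (both bounds inclusive, with -1 denoting the last character) can be sketched in base R without stringr or arrow. `ref_str_sub()` is a hypothetical name used only for illustration; it handles scalar inputs and is not the reporter's reprex.

```r
# Hypothetical reference: stringr-style substring with inclusive bounds,
# where negative positions count back from the end of the string.
ref_str_sub <- function(x, start, end) {
  n <- nchar(x)
  # Map a negative position to its 1-based equivalent (-1 is position n).
  if (start < 0) start <- n + start + 1
  if (end < 0) end <- n + end + 1
  # substr() is 1-based and inclusive, and returns "" when end < start.
  substr(x, start, end)
}

ref_str_sub("abcdefgh", -3, -1) # "fgh"
ref_str_sub("abcdefgh", 1, -3)  # "abcdef"
```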
Component(s)
R