Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 8 additions & 2 deletions R/pkg/R/column.R
Original file line number Diff line number Diff line change
Expand Up @@ -164,12 +164,18 @@ setMethod("alias",
#' @aliases substr,Column-method
#'
#' @param x a Column.
#' @param start starting position.
#' @param start starting position. It should be 1-base.
#' @param stop ending position.
#' @examples
#' \dontrun{
#' df <- createDataFrame(list(list(a="abcdef")))
#' collect(select(df, substr(df$a, 1, 4))) # the result is `abcd`.
#' collect(select(df, substr(df$a, 2, 4))) # the result is `bcd`.
#' }
#' @note substr since 1.4.0
setMethod("substr", signature(x = "Column"),
function(x, start, stop) {
jc <- callJMethod(x@jc, "substr", as.integer(start - 1), as.integer(stop - start + 1))
jc <- callJMethod(x@jc, "substr", as.integer(start), as.integer(stop - start + 1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit concern with changing this. As you can see it's been like this from the very beginning...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API behavior should be considered as wrong and performs inconsistently. Because for starting position 1, we get substring from 1st element, but for position 2, we still get the substring from 1. So we will get the following inconsistent results:

> collect(select(df, substr(df$a, 1, 5)))
  substring(a, 0, 5)
1              abcde
> collect(select(df, substr(df$a, 2, 5)))
  substring(a, 1, 4)
1               abcd

For such change, we might need to add a note in the doc as @HyukjinKwon suggested.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just because I think it is another issue regarding 0/negative indices. I can deal it here if you strongly feel it is better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a way to make the behavior the same before this change for any caller calling substr with common index like 0

Should we keep the behavior when calling substr with 0 as start index?

> df <- createDataFrame(list(list(a="abcdef")))
> collect(select(df, substr(df$a, 0, 5)))
  substring(a, -1, 6)
1                   f
> substr("abcdef", 0, 5)
[1] "abcde"

I think the previous behavior is pretty unreasonable..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should do two things:

  1. add to the func doc that the start param should be 0-base and to add to the example with the result
    collect(select(df, substr(df$a, 0, 5))) # this should give you...

Copy link
Member Author

@viirya viirya Mar 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mean 1-base.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to the func doc.

column(jc)
})

Expand Down
1 change: 1 addition & 0 deletions R/pkg/tests/fulltests/test_sparkSQL.R
Original file line number Diff line number Diff line change
Expand Up @@ -1649,6 +1649,7 @@ test_that("string operators", {
expect_false(first(select(df, startsWith(df$name, "m")))[[1]])
expect_true(first(select(df, endsWith(df$name, "el")))[[1]])
expect_equal(first(select(df, substr(df$name, 1, 2)))[[1]], "Mi")
expect_equal(first(select(df, substr(df$name, 4, 6)))[[1]], "hae")
if (as.numeric(R.version$major) >= 3 && as.numeric(R.version$minor) >= 3) {
expect_true(startsWith("Hello World", "Hello"))
expect_false(endsWith("Hello World", "a"))
Expand Down
4 changes: 4 additions & 0 deletions docs/sparkr.md
Original file line number Diff line number Diff line change
Expand Up @@ -663,3 +663,7 @@ You can inspect the search path in R with [`search()`](https://stat.ethz.ch/R-ma
- The `stringsAsFactors` parameter was previously ignored with `collect`, for example, in `collect(createDataFrame(iris), stringsAsFactors = TRUE))`. It has been corrected.
- For `summary`, option for statistics to compute has been added. Its output is changed from that from `describe`.
- A warning can be raised if versions of SparkR package and the Spark JVM do not match.

## Upgrading to Spark 2.4.0

- The `start` parameter of `substr` method was wrongly subtracted by one, previously. In other words, the index specified by `start` parameter was considered as 0-base. This can lead to inconsistent substring results and also does not match with the behaviour with `substr` in R. It has been fixed so the `start` parameter of `substr` method is now 1-base, e.g., therefore to get the same result as `substr(df$a, 2, 5)`, it should be changed to `substr(df$a, 1, 4)`.