[SPARK-35573][R][TESTS] Make SparkR tests pass with R 4.1+ #32709
Conversation
I'll take a look in CI. If it's simple, I will just fix it here.

Thanks, @HyukjinKwon!
Kubernetes integration test starting

Kubernetes integration test status success

Test build #139108 has finished for PR 32709 at commit

Test build #139099 has finished for PR 32709 at commit

Kubernetes integration test starting

Kubernetes integration test status success
```
@@ -1451,7 +1451,7 @@ test_that("column functions", {
  expect_equal(collect(df2)[[3, 2]], TRUE)

  # Test that input_file_name()
  actual_names <- sort(collect(distinct(select(df, input_file_name()))))
```
It was sorting a DataFrame in R, which isn't legitimate. In addition, there's no point in sorting: it's a single-value, single-column DataFrame.

```
  input_file_name()
1 file:///private/var/folders/0c/q8y15ybd3tn7sr2_jmbmftr80000gp/T/RtmpwPkEnu/sparkr-testf2872350813d.tmp
```
Not sure if it is intentional for this to be a single value? Because it is `actual_names`. :)
Oh yeah, it is verified that the length is 1, so removing the sort looks good.
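For reference, a minimal sketch (not the exact patch) of what the assertion can look like without the sort, assuming a running SparkR session, a testthat context, and a `df` backed by a single source file:

```r
library(SparkR)
library(testthat)

# Collect without sort(): on R 4.1+, sort() on a data.frame warns
# "cannot xtfrm data frames", and the test harness turns warnings into errors.
actual_names <- collect(distinct(select(df, input_file_name())))

expect_equal(nrow(actual_names), 1)  # a single distinct file name
expect_true(startsWith(actual_names[[1, 1]], "file:"))
```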
```diff
@@ -68,7 +68,7 @@ test_that("createDataFrame/collect Arrow optimization - type specification", {
   callJMethod(conf, "set", "spark.sql.execution.arrow.sparkr.enabled", arrowEnabled)
 })

-  expect_equal(collect(createDataFrame(rdf)), expected)
+  expect_true(all(collect(createDataFrame(rdf)) == expected))
```
Here I work around it to make the tests pass with any R version. The problem is that R 4.1 introduced `check.tzone` in `all.equal`, which `testthat` apparently uses.
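A small sketch of that behavior change in plain R, with no Spark involved; the two values below denote the same instant and differ only in the `tzone` attribute:

```r
x <- as.POSIXct("1990-02-24 12:34:56", tz = "UTC")
y <- x
attr(y, "tzone") <- ""   # same underlying epoch seconds, different attribute

all.equal(x, y)
# R 4.1+:  "'tzone' attributes are inconsistent ('UTC' and '')"
# R < 4.1: TRUE

all.equal(x, y, check.tzone = FALSE)  # argument introduced in R 4.1
# [1] TRUE
```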
When you collect a POSIXct with an empty `tzone` (the default), the Arrow conversion fills in the local timezone instead of leaving it empty:

```r
rdf <- data.frame(list(list(t = as.POSIXct("1990-02-24 12:34:56", tz = "UTC"))))

SparkR:::callJMethod(SparkR:::callJMethod(spark, "conf"), "set", "spark.sql.execution.arrow.sparkr.enabled", "false")
withoutArrow <- collect(createDataFrame(rdf))

SparkR:::callJMethod(SparkR:::callJMethod(spark, "conf"), "set", "spark.sql.execution.arrow.sparkr.enabled", "true")
withArrow <- collect(createDataFrame(rdf))

attr(withoutArrow$t, "tzone")
# [1] ""
attr(withArrow$t, "tzone")
# [1] "Asia/Seoul"
```
Spark returns local time instances in Scala, Python and R. Therefore, I think either an empty timezone or the local timezone can be correct in the Spark context, and it's not an issue IMO.

FWIW, we're the ones who set the timezone on the JVM side, if I remember correctly.
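This is also why the `==` workaround above is version-agnostic: `==` on POSIXct values compares the underlying instants and ignores attributes, while `all.equal` (which `expect_equal` relies on) checks the `tzone` attribute on R 4.1+. A sketch with made-up data mirroring the Arrow case:

```r
expected <- data.frame(t = as.POSIXct("1990-02-24 12:34:56", tz = "UTC"))
got <- expected
attr(got$t, "tzone") <- "Asia/Seoul"  # what the Arrow path may fill in

all(got == expected)              # TRUE: element-wise comparison of instants
isTRUE(all.equal(got, expected))  # FALSE on R 4.1+: 'tzone' attributes differ
```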
cc @MaxGekk @BryanCutler FYI.
cc @shivaram, @felixcheung, @falaki FYI

Test build #139116 has finished for PR 32709 at commit

Kubernetes integration test starting
```diff
-  expect_equal(actual, g)
+  if (as.numeric(R.Version()$major) >= 4 && !startsWith(R.Version()$minor, "0")) {
+    # 4.1+ checks environment in the function
+    expect_true(all.equal(actual, g, check.environment = FALSE))
```
This also seems to be because `testthat` uses `all.equal` internally, which now compares the environments of functions (previously it didn't). From the R 4.1 NEWS:

> all.equal(f, g) for functions now by default also compares their environment(.)s, notably via new all.equal method for class function. Comparison of nls() fits, e.g., may now need all.equal(m1, m2, check.environment = FALSE).
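A small sketch of that NEWS item in plain R; `f` and `g` below have identical formals and bodies but distinct enclosing environments:

```r
make_adder <- function() {
  offset <- 1
  function(x) x + offset
}
f <- make_adder()
g <- make_adder()  # a fresh environment on every call

isTRUE(all.equal(f, g))
# FALSE on R 4.1+ (environments differ); TRUE on older R versions

isTRUE(all.equal(f, g, check.environment = FALSE))
# [1] TRUE
```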
Kubernetes integration test status success

Test build #139117 has finished for PR 32709 at commit

Kubernetes integration test unable to build dist. exiting with code: 1
Took a curious look, and it looks okay.

CRAN still fails locally, but I am not sure if this is my environment issue or not. I will merge this one first and take a separate look. Thanks for the review, @viirya. Please feel free to review post-hoc if you find some time.
Merged to master.
I have backported it to branch-3.1 and branch-3.0 too, because this is test-only and in case other people run the tests with higher R versions.

CRAN was my environment issue. Now the tests and the CRAN check should work with R 4.1+ too.

Cool! Thanks @HyukjinKwon!

Nice! Thank you so much!
What changes were proposed in this pull request?
This PR proposes to support R 4.1.0+ in SparkR. Currently the tests fail as below:

```
══ Failed ══════════════════════════════════════════════════════════════════════
── 1. Failure (test_sparkSQL_arrow.R:71:3): createDataFrame/collect Arrow optimi
collect(createDataFrame(rdf)) not equal to `expected`.
Component “g”: 'tzone' attributes are inconsistent ('UTC' and '')

── 2. Failure (test_sparkSQL_arrow.R:143:3): dapply() Arrow optimization - type
collect(ret) not equal to `rdf`.
Component “b”: 'tzone' attributes are inconsistent ('UTC' and '')

── 3. Failure (test_sparkSQL_arrow.R:229:3): gapply() Arrow optimization - type
collect(ret) not equal to `rdf`.
Component “b”: 'tzone' attributes are inconsistent ('UTC' and '')

── 4. Error (test_sparkSQL.R:1454:3): column functions ─────────────────────────
Error: (converted from warning) cannot xtfrm data frames
Backtrace:
  1. base::sort(collect(distinct(select(df, input_file_name())))) test_sparkSQL.R:1454:2
  2. base::sort.default(collect(distinct(select(df, input_file_name()))))
  5. base::order(x, na.last = na.last, decreasing = decreasing)
  6. base::lapply(z, function(x) if (is.object(x)) as.vector(xtfrm(x)) else x)
  7. base:::FUN(X[[i]], ...)
 10. base::xtfrm.data.frame(x)

── 5. Failure (test_utils.R:67:3): cleanClosure on R functions ─────────────────
`actual` not equal to `g`.
names for current but not for target
Length mismatch: comparison on first 0 components

── 6. Failure (test_utils.R:80:3): cleanClosure on R functions ─────────────────
`actual` not equal to `g`.
names for current but not for target
Length mismatch: comparison on first 0 components
```

It fixes three things:

- Avoid a sort on a DataFrame, which isn't legitimate: #32709 (comment)
- Treat the empty timezone and the local timezone as equivalent in SparkR: #32709 (comment)
- Disable `check.environment` in the cleaned closure comparison (enabled by default from R 4.1+, https://cran.r-project.org/doc/manuals/r-release/NEWS.html), and keep the test as is: #32709 (comment)

Why are the changes needed?
Higher R versions have bug fixes and improvements. More importantly, R users tend to use the latest R versions.
Does this PR introduce any user-facing change?
Yes, SparkR will work with R 4.1.0+.
How was this patch tested?

Manually ran the tests:

```bash
./R/run-tests.sh
```

```
sparkSQL_arrow: SparkSQL Arrow optimization: .................
...
sparkSQL: SparkSQL functions: ........................................................................................................................................................................................................
...
utils: functions in utils.R: ..............................................
```