
[SPARK-35573][R][TESTS] Make SparkR tests pass with R 4.1+ #32709

Closed
wants to merge 4 commits into apache:master from HyukjinKwon:SPARK-35573

Conversation

HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented May 31, 2021

What changes were proposed in this pull request?

This PR proposes to support R 4.1.0+ in SparkR. Currently, the tests fail as below:

```
══ Failed ══════════════════════════════════════════════════════════════════════
── 1. Failure (test_sparkSQL_arrow.R:71:3): createDataFrame/collect Arrow optimi
collect(createDataFrame(rdf)) not equal to `expected`.
Component “g”: 'tzone' attributes are inconsistent ('UTC' and '')

── 2. Failure (test_sparkSQL_arrow.R:143:3): dapply() Arrow optimization - type
collect(ret) not equal to `rdf`.
Component “b”: 'tzone' attributes are inconsistent ('UTC' and '')

── 3. Failure (test_sparkSQL_arrow.R:229:3): gapply() Arrow optimization - type
collect(ret) not equal to `rdf`.
Component “b”: 'tzone' attributes are inconsistent ('UTC' and '')

── 4. Error (test_sparkSQL.R:1454:3): column functions ─────────────────────────
Error: (converted from warning) cannot xtfrm data frames
Backtrace:
  1. base::sort(collect(distinct(select(df, input_file_name())))) test_sparkSQL.R:1454:2
  2. base::sort.default(collect(distinct(select(df, input_file_name()))))
  5. base::order(x, na.last = na.last, decreasing = decreasing)
  6. base::lapply(z, function(x) if (is.object(x)) as.vector(xtfrm(x)) else x)
  7. base:::FUN(X[[i]], ...)
 10. base::xtfrm.data.frame(x)

── 5. Failure (test_utils.R:67:3): cleanClosure on R functions ─────────────────
`actual` not equal to `g`.
names for current but not for target
Length mismatch: comparison on first 0 components

── 6. Failure (test_utils.R:80:3): cleanClosure on R functions ─────────────────
`actual` not equal to `g`.
names for current but not for target
Length mismatch: comparison on first 0 components
```

It fixes three issues as below:

- Avoid a sort on a DataFrame, which isn't legitimate: #32709 (comment)
- Treat the empty timezone and the local timezone as equivalent in SparkR: #32709 (comment)
- Disable `check.environment` in the cleaned closure comparison (enabled by default from R 4.1+, https://cran.r-project.org/doc/manuals/r-release/NEWS.html), and keep the test as is: #32709 (comment)

Why are the changes needed?

Higher R versions include bug fixes and improvements. More importantly, R users tend to use the latest R versions.

Does this PR introduce any user-facing change?

Yes, SparkR will work with R 4.1.0+.

How was this patch tested?

```bash
./R/run-tests.sh
```

```
sparkSQL_arrow:
SparkSQL Arrow optimization: .................

...

sparkSQL:
SparkSQL functions: ........................................................................................................................................................................................................
........................................................................................................................................................................................................
........................................................................................................................................................................................................
........................................................................................................................................................................................................
........................................................................................................................................................................................................
........................................................................................................................................................................................................

...


utils:
functions in utils.R: ..............................................
```

@HyukjinKwon HyukjinKwon marked this pull request as draft May 31, 2021 07:55
@HyukjinKwon
Member Author

I'll take a look in CI. If it's simple, I will just fix it here.

@github-actions github-actions bot added the INFRA label May 31, 2021
@dongjoon-hyun
Member

Thanks, @HyukjinKwon !

@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43620/

@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43620/

@SparkQA

SparkQA commented May 31, 2021

Test build #139108 has finished for PR 32709 at commit c573435.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 31, 2021

Test build #139099 has finished for PR 32709 at commit ec08067.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43628/

@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43628/

@HyukjinKwon HyukjinKwon marked this pull request as ready for review May 31, 2021 12:37
@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-35573][R] Support R 4.1.0 in SparkR [SPARK-35573][R] Support R 4.1.0 in SparkR May 31, 2021
@HyukjinKwon HyukjinKwon changed the title [SPARK-35573][R] Support R 4.1.0 in SparkR [SPARK-35573][R][TESTS] Make SparkR tests pass with R 4.1+ May 31, 2021
@@ -1451,7 +1451,7 @@ test_that("column functions", {
expect_equal(collect(df2)[[3, 2]], TRUE)

# Test that input_file_name()
-  actual_names <- sort(collect(distinct(select(df, input_file_name()))))
+  actual_names <- collect(distinct(select(df, input_file_name())))
Member Author


It was sorting a DataFrame in R, which isn't legitimate. In addition, there's no point in sorting: it's a single-value, single-column DataFrame.

```
                                                                                       input_file_name()
1 file:///private/var/folders/0c/q8y15ybd3tn7sr2_jmbmftr80000gp/T/RtmpwPkEnu/sparkr-testf2872350813d.tmp
```
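
For reference, a minimal sketch of the underlying R 4.1 behavior, outside Spark and with a hypothetical data frame: `sort()` on a data frame goes through `xtfrm()`, which now warns, and testthat promotes warnings to errors.

```r
df <- data.frame(x = c("b", "a"))

# R 4.1+: sort.default() -> order() -> xtfrm() warns
# "cannot xtfrm data frames"; under testthat the warning
# becomes the error shown in failure 4 above.
sort(df)

# Sorting the column vector (or skipping the sort entirely,
# as this PR does) avoids the warning:
sort(df$x)
#> [1] "a" "b"
```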

Member


Not sure if it is intentional for it to be a single value? Because it is `actual_names`. :)

Member


Oh yeah, it is verified that the length is 1. So removing the sort looks good.

@@ -68,7 +68,7 @@ test_that("createDataFrame/collect Arrow optimization - type specification", {
callJMethod(conf, "set", "spark.sql.execution.arrow.sparkr.enabled", arrowEnabled)
})

-  expect_equal(collect(createDataFrame(rdf)), expected)
+  expect_true(all(collect(createDataFrame(rdf)) == expected))
Member Author

@HyukjinKwon HyukjinKwon May 31, 2021


Here I work around it to make the tests pass with any R version. The problem is that R 4.1 introduced `check.tzone` in `all.equal`, which testthat apparently uses.

When you collect a POSIXct with an empty tzone (the default), the Arrow conversion fills in the local timezone instead of leaving it empty:

```r
rdf <- data.frame(list(list(t = as.POSIXct("1990-02-24 12:34:56", tz = "UTC"))))

SparkR:::callJMethod(SparkR:::callJMethod(spark, "conf"), "set", "spark.sql.execution.arrow.sparkr.enabled", "false")
withoutArrow <- collect(createDataFrame(rdf))

SparkR:::callJMethod(SparkR:::callJMethod(spark, "conf"), "set", "spark.sql.execution.arrow.sparkr.enabled", "true")
withArrow <- collect(createDataFrame(rdf))

attr(withoutArrow$t, "tzone")
#> [1] ""
attr(withArrow$t, "tzone")
#> [1] "Asia/Seoul"
```

Spark returns local time instances in Scala, Python and R. Therefore, I think either an empty timezone or the local timezone can be correct in the Spark context, and it's not an issue IMO.

FWIW, we're the ones who set the timezone on the JVM side, if I remember correctly.


@HyukjinKwon
Member Author

cc @shivaram, @felixcheung, @falaki FYI

@SparkQA

SparkQA commented May 31, 2021

Test build #139116 has finished for PR 32709 at commit 2f3cec2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43636/

-  expect_equal(actual, g)
+  if (as.numeric(R.Version()$major) >= 4 && !startsWith(R.Version()$minor, "0")) {
+    # 4.1+ checks environment in the function
+    expect_true(all.equal(actual, g, check.environment = FALSE))
Member Author


This also seems to be because testthat uses all.equal internally, which now compares the environments of the functions (previously it didn't):

all.equal(f, g) for functions now by default also compares their environment(.)s, notably via new all.equal method for class function. Comparison of nls() fits, e.g., may now need all.equal(m1, m2, check.environment = FALSE).

https://cran.r-project.org/doc/manuals/r-release/NEWS.html
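
A self-contained sketch of this change, using two hypothetical functions: identical bodies defined in different environments now compare unequal by default.

```r
f <- function(x) x + 1
g <- local(function(x) x + 1)  # same body, different enclosing environment

# R 4.1+: all.equal() for functions compares environments by default.
isTRUE(all.equal(f, g))
#> [1] FALSE

# Disabling the check restores the pre-4.1 comparison, which is what
# the updated cleanClosure test above does.
isTRUE(all.equal(f, g, check.environment = FALSE))
#> [1] TRUE
```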

@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43636/

@SparkQA

SparkQA commented May 31, 2021

Test build #139117 has finished for PR 32709 at commit 97f3c0e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43637/

Member

@viirya viirya left a comment


Took a curious look, and it looks okay.

@HyukjinKwon
Member Author

CRAN still fails in my local environment, but I am not sure if this is an issue with my env or not. I will merge this one first and take a separate look.

Thanks for the review, @viirya. Please feel free to review post-hoc if you find some time.

@HyukjinKwon
Member Author

Merged to master.

HyukjinKwon added a commit that referenced this pull request Jun 1, 2021
HyukjinKwon added a commit that referenced this pull request Jun 1, 2021
@HyukjinKwon
Member Author

I have backported it to branch-3.1 and branch-3.0 too because this is a test-only change, and in case other people run the tests with higher R versions.

@HyukjinKwon
Member Author

The CRAN failure was my env issue. Now the tests and the CRAN check should work with R 4.1+ too.

@viirya
Member

viirya commented Jun 1, 2021

Cool! Thanks @HyukjinKwon!

@dongjoon-hyun
Member

Nice! Thank you so much!

flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
@HyukjinKwon HyukjinKwon deleted the SPARK-35573 branch January 4, 2022 00:53
sumwale pushed a commit to TIBCOSoftware/snappy-spark that referenced this pull request Jun 29, 2022