
[SPARK-35573][R][TESTS] Make SparkR tests pass with R 4.1+ #32709

Closed
wants to merge 4 commits into apache:master from HyukjinKwon:SPARK-35573

Conversation

HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented May 31, 2021

What changes were proposed in this pull request?

This PR proposes to support R 4.1.0+ in SparkR. Currently, the tests fail as below:

```
══ Failed ══════════════════════════════════════════════════════════════════════
── 1. Failure (test_sparkSQL_arrow.R:71:3): createDataFrame/collect Arrow optimi
collect(createDataFrame(rdf)) not equal to `expected`.
Component “g”: 'tzone' attributes are inconsistent ('UTC' and '')

── 2. Failure (test_sparkSQL_arrow.R:143:3): dapply() Arrow optimization - type
collect(ret) not equal to `rdf`.
Component “b”: 'tzone' attributes are inconsistent ('UTC' and '')

── 3. Failure (test_sparkSQL_arrow.R:229:3): gapply() Arrow optimization - type
collect(ret) not equal to `rdf`.
Component “b”: 'tzone' attributes are inconsistent ('UTC' and '')

── 4. Error (test_sparkSQL.R:1454:3): column functions ─────────────────────────
Error: (converted from warning) cannot xtfrm data frames
Backtrace:
  1. base::sort(collect(distinct(select(df, input_file_name())))) test_sparkSQL.R:1454:2
  2. base::sort.default(collect(distinct(select(df, input_file_name()))))
  5. base::order(x, na.last = na.last, decreasing = decreasing)
  6. base::lapply(z, function(x) if (is.object(x)) as.vector(xtfrm(x)) else x)
  7. base:::FUN(X[[i]], ...)
 10. base::xtfrm.data.frame(x)

── 5. Failure (test_utils.R:67:3): cleanClosure on R functions ─────────────────
`actual` not equal to `g`.
names for current but not for target
Length mismatch: comparison on first 0 components

── 6. Failure (test_utils.R:80:3): cleanClosure on R functions ─────────────────
`actual` not equal to `g`.
names for current but not for target
Length mismatch: comparison on first 0 components
```

It fixes three issues as below:

- Avoid a sort on a DataFrame, which isn't legitimate: #32709 (comment)
- Treat the empty timezone and the local timezone as equivalent in SparkR: #32709 (comment)
- Disable `check.environment` in the cleaned closure comparison (enabled by default from R 4.1+, https://cran.r-project.org/doc/manuals/r-release/NEWS.html), and keep the test as is: #32709 (comment)

Why are the changes needed?

Higher R versions include bug fixes and improvements. More importantly, R users tend to use the latest R versions.

Does this PR introduce any user-facing change?

Yes, SparkR will work with R 4.1.0+.

How was this patch tested?

```bash
./R/run-tests.sh
```

```
sparkSQL_arrow:
SparkSQL Arrow optimization: .................

...

sparkSQL:
SparkSQL functions: ........................................................................................................................................................................................................
........................................................................................................................................................................................................
........................................................................................................................................................................................................
........................................................................................................................................................................................................
........................................................................................................................................................................................................
........................................................................................................................................................................................................

...


utils:
functions in utils.R: ..............................................
```

@HyukjinKwon HyukjinKwon marked this pull request as draft May 31, 2021 07:55
@HyukjinKwon
Member Author

I'll take a look in CI. If it's simple, I will just fix it here.

@github-actions github-actions bot added the INFRA label May 31, 2021
@dongjoon-hyun
Member

Thanks, @HyukjinKwon !

@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43620/

@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43620/

@SparkQA

SparkQA commented May 31, 2021

Test build #139108 has finished for PR 32709 at commit c573435.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 31, 2021

Test build #139099 has finished for PR 32709 at commit ec08067.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43628/

@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43628/

@HyukjinKwon HyukjinKwon marked this pull request as ready for review May 31, 2021 12:37
@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-35573][R] Support R 4.1.0 in SparkR [SPARK-35573][R] Support R 4.1.0 in SparkR May 31, 2021
@HyukjinKwon HyukjinKwon changed the title [SPARK-35573][R] Support R 4.1.0 in SparkR [SPARK-35573][R][TESTS] Make SparkR tests pass with R 4.1+ May 31, 2021
@@ -1451,7 +1451,7 @@ test_that("column functions", {
expect_equal(collect(df2)[[3, 2]], TRUE)

# Test that input_file_name()
-  actual_names <- sort(collect(distinct(select(df, input_file_name()))))
+  actual_names <- collect(distinct(select(df, input_file_name())))
Member Author


It was sorting a DataFrame in R, which isn't legitimate. In addition, there's no point in sorting: it's a single-value, single-column DataFrame.

```
                                                                                       input_file_name()
1 file:///private/var/folders/0c/q8y15ybd3tn7sr2_jmbmftr80000gp/T/RtmpwPkEnu/sparkr-testf2872350813d.tmp
```
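
For reference, a minimal sketch of the underlying R 4.1 behavior, outside Spark and with a hypothetical data frame: `sort()` on a data frame goes through `xtfrm()`, which now warns, and testthat promotes warnings to errors.

```r
df <- data.frame(x = c("b", "a"))

# R 4.1+: sort.default() -> order() -> xtfrm() warns
# "cannot xtfrm data frames"; under testthat the warning
# becomes the error shown in failure 4 above.
sort(df)

# Sorting the column vector (or skipping the sort entirely,
# as this PR does) avoids the warning:
sort(df$x)
#> [1] "a" "b"
```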

Member


Not sure if it is intentional for it to be a single value? Because it is `actual_names`. :)

Member


Oh yeah, it is verified that the length is 1. So removing the sort looks good.

@@ -68,7 +68,7 @@ test_that("createDataFrame/collect Arrow optimization - type specification", {
callJMethod(conf, "set", "spark.sql.execution.arrow.sparkr.enabled", arrowEnabled)
})

-  expect_equal(collect(createDataFrame(rdf)), expected)
+  expect_true(all(collect(createDataFrame(rdf)) == expected))
Member Author

@HyukjinKwon HyukjinKwon May 31, 2021


Here I work around it to make the tests pass with any R version. The problem is that R 4.1 introduced `check.tzone` in `all.equal`, which testthat apparently uses.

When you collect a POSIXct with an empty tzone (the default), the Arrow conversion fills in the local timezone instead of leaving it empty:

```r
rdf <- data.frame(list(list(t = as.POSIXct("1990-02-24 12:34:56", tz = "UTC"))))

SparkR:::callJMethod(SparkR:::callJMethod(spark, "conf"), "set", "spark.sql.execution.arrow.sparkr.enabled", "false")
withoutArrow <- collect(createDataFrame(rdf))

SparkR:::callJMethod(SparkR:::callJMethod(spark, "conf"), "set", "spark.sql.execution.arrow.sparkr.enabled", "true")
withArrow <- collect(createDataFrame(rdf))

attr(withoutArrow$t, "tzone")
#> [1] ""
attr(withArrow$t, "tzone")
#> [1] "Asia/Seoul"
```

Spark returns local time instances in Scala, Python and R. Therefore, I think either an empty timezone or the local timezone can be correct in the Spark context, and it's not an issue IMO.

FWIW, we're the ones who set the timezone on the JVM side, if I remember correctly.


@HyukjinKwon
Member Author

cc @shivaram, @felixcheung, @falaki FYI

@SparkQA

SparkQA commented May 31, 2021

Test build #139116 has finished for PR 32709 at commit 2f3cec2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43636/

-  expect_equal(actual, g)
+  if (as.numeric(R.Version()$major) >= 4 && !startsWith(R.Version()$minor, "0")) {
+    # 4.1+ checks environment in the function
+    expect_true(all.equal(actual, g, check.environment = FALSE))
Member Author


This also seems to be because testthat uses all.equal internally, which now compares the environments of the functions (previously it didn't):

all.equal(f, g) for functions now by default also compares their environment(.)s, notably via new all.equal method for class function. Comparison of nls() fits, e.g., may now need all.equal(m1, m2, check.environment = FALSE).

https://cran.r-project.org/doc/manuals/r-release/NEWS.html
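
A self-contained sketch of this change, using two hypothetical functions: identical bodies defined in different environments now compare unequal by default.

```r
f <- function(x) x + 1
g <- local(function(x) x + 1)  # same body, different enclosing environment

# R 4.1+: all.equal() for functions compares environments by default.
isTRUE(all.equal(f, g))
#> [1] FALSE

# Disabling the check restores the pre-4.1 comparison, which is what
# the updated cleanClosure test above does.
isTRUE(all.equal(f, g, check.environment = FALSE))
#> [1] TRUE
```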

@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43636/

@SparkQA

SparkQA commented May 31, 2021

Test build #139117 has finished for PR 32709 at commit 97f3c0e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 31, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43637/

Member

@viirya viirya left a comment


Took a curious look, and it looks okay.

@HyukjinKwon
Member Author

CRAN still fails in my local environment, but I am not sure if this is an issue with my env or not. I will merge this one first and take a separate look.

Thanks for the review, @viirya. Please feel free to review post-hoc if you find some time.

@HyukjinKwon
Member Author

Merged to master.

HyukjinKwon added a commit that referenced this pull request Jun 1, 2021
HyukjinKwon added a commit that referenced this pull request Jun 1, 2021
@HyukjinKwon
Member Author

I have backported it to branch-3.1 and branch-3.0 too because this is a test-only change, and in case other people run the tests with higher R versions.

@HyukjinKwon
Member Author

The CRAN failure was my env issue. Now the tests and the CRAN check should work with R 4.1+ too.

@viirya
Member

viirya commented Jun 1, 2021

Cool! Thanks @HyukjinKwon!

@dongjoon-hyun
Member

Nice! Thank you so much!

flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
@HyukjinKwon HyukjinKwon deleted the SPARK-35573 branch January 4, 2022 00:53
sumwale pushed a commit to TIBCOSoftware/snappy-spark that referenced this pull request Jun 29, 2022