Closed
15 changes: 15 additions & 0 deletions R/pkg/R/functions.R
@@ -259,6 +259,21 @@ setMethod("column",
function(x) {
col(x)
})
#' corr
#'
#' Computes the Pearson Correlation Coefficient for two Columns.
#'
#' @rdname corr
#' @name corr
#' @family math_funcs
#' @export
#' @examples \dontrun{corr(df$c, df$d)}
setMethod("corr", signature(x = "Column"),
Contributor

There are two versions of corr():

def corr(column1: Column, column2: Column): Column
def corr(columnName1: String, columnName2: String): Column

We'd better support both. Something like:

setMethod("corr", signature(x = "characterOrColumn"),
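
A minimal sketch of how one method could accept either a column name or a Column through a class union. It runs in plain R without Spark, so `Column` here is a stand-in class, `toExpr` is a hypothetical helper, and the method builds an expression string instead of calling the JVM. None of this is the code from this pull request.

```r
library(methods)

# Stand-in for SparkR's Column class (assumption: the real one wraps a JVM object).
setClass("Column", representation(expr = "character"))

# Virtual class that matches either a column name or a Column object.
setClassUnion("characterOrColumn", c("character", "Column"))

# Hypothetical helper: reduce a name or a Column to its expression string.
toExpr <- function(x) if (is.character(x)) x else x@expr

setGeneric("corr", function(x, ...) { standardGeneric("corr") })

# One method serves both call styles because dispatch is on the union.
setMethod("corr", signature(x = "characterOrColumn"),
          function(x, col2) {
            stopifnot(is(col2, "characterOrColumn"))
            new("Column",
                expr = paste0("corr(", toExpr(x), ", ", toExpr(col2), ")"))
          })

byName <- corr("c", "d")
byColumn <- corr(new("Column", expr = "c"), new("Column", expr = "d"))
```

Both calls dispatch to the same method, so the name-based and Column-based forms stay in one place.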

Member Author

That's the same for count, max, mean and so on, so we would need to change every function here. Should we do that?

function(x, col2) {
stopifnot(class(col2) == "Column")
jc <- callJStatic("org.apache.spark.sql.functions", "corr", x@jc, col2@jc)
column(jc)
})

#' cos
#'
2 changes: 1 addition & 1 deletion R/pkg/R/generics.R
@@ -411,7 +411,7 @@ setGeneric("cov", function(x, col1, col2) {standardGeneric("cov") })

#' @rdname statfunctions
#' @export
setGeneric("corr", function(x, col1, col2, method = "pearson") {standardGeneric("corr") })
setGeneric("corr", function(x, ...) {standardGeneric("corr") })

#' @rdname summary
#' @export
9 changes: 5 additions & 4 deletions R/pkg/R/stats.R
@@ -77,7 +77,7 @@ setMethod("cov",
#' Calculates the correlation of two columns of a DataFrame.
#' Currently only supports the Pearson Correlation Coefficient.
#' For Spearman Correlation, consider using RDD methods found in MLlib's Statistics.
#'
#'
#' @param x A SparkSQL DataFrame
#' @param col1 the name of the first column
#' @param col2 the name of the second column
@@ -95,8 +95,9 @@ setMethod("cov",
#' corr <- corr(df, "title", "gender", method = "pearson")
#' }
setMethod("corr",
signature(x = "DataFrame", col1 = "character", col2 = "character"),
signature(x = "DataFrame"),
function(x, col1, col2, method = "pearson") {
stopifnot(class(col1) == "character" && class(col2) == "character")
statFunctions <- callJMethod(x@sdf, "stat")
callJMethod(statFunctions, "corr", col1, col2, method)
})
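
The change above relies on the generic being widened to `function(x, ...)`, so that methods for different classes can take different extra arguments. A rough sketch of that dispatch pattern in plain R follows. `DataFrame` and `Column` are stand-in classes, and `stats::cor` substitutes for the JVM call purely for illustration.

```r
library(methods)

# Stand-in classes (assumptions): the real SparkR classes wrap JVM references.
setClass("DataFrame", representation(data = "data.frame"))
setClass("Column", representation(expr = "character"))

# Only `x` takes part in dispatch; everything else passes through `...`.
setGeneric("corr", function(x, ...) { standardGeneric("corr") })

# DataFrame method: takes two column names and returns a number.
setMethod("corr", signature(x = "DataFrame"),
          function(x, col1, col2, method = "pearson") {
            # Argument types are checked by hand because they are no longer
            # part of the method signature.
            stopifnot(is.character(col1), is.character(col2))
            stats::cor(x@data[[col1]], x@data[[col2]], method = method)
          })

# Column method: takes a second Column and returns a Column expression.
setMethod("corr", signature(x = "Column"),
          function(x, col2) {
            stopifnot(is(col2, "Column"))
            new("Column", expr = sprintf("corr(%s, %s)", x@expr, col2@expr))
          })

df <- new("DataFrame", data = data.frame(a = c(1, 2, 3), b = c(2, 4, 6)))
eager <- corr(df, "a", "b")
lazy <- corr(new("Column", expr = "a"), new("Column", expr = "b"))
```

The trade-off is that type errors in `col1`, `col2`, or `col2` as a Column surface from `stopifnot` at call time, not from S4 dispatch.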
@@ -109,7 +110,7 @@ setMethod("corr",
#'
#' @param x A SparkSQL DataFrame.
#' @param cols A vector column names to search frequent items in.
#' @param support (Optional) The minimum frequency for an item to be considered `frequent`.
#' @param support (Optional) The minimum frequency for an item to be considered `frequent`.
#' Should be greater than 1e-4. Default support = 0.01.
#' @return a local R data.frame with the frequent items in each column
#'
@@ -131,7 +132,7 @@ setMethod("freqItems", signature(x = "DataFrame", cols = "character"),
#' sampleBy
#'
#' Returns a stratified sample without replacement based on the fraction given on each stratum.
#'
#'
#' @param x A SparkSQL DataFrame
#' @param col column that defines strata
#' @param fractions A named list giving sampling fraction for each stratum. If a stratum is
2 changes: 1 addition & 1 deletion R/pkg/inst/tests/test_sparkSQL.R
@@ -887,7 +887,7 @@ test_that("column functions", {
c11 <- to_date(c) + trim(c) + unbase64(c) + unhex(c) + upper(c)
c12 <- variance(c)
c13 <- lead("col", 1) + lead(c, 1) + lag("col", 1) + lag(c, 1)
c14 <- cume_dist() + ntile(1)
c14 <- cume_dist() + ntile(1) + corr(c, c1)
c15 <- dense_rank() + percent_rank() + rank() + row_number()

# Test if base::rank() is exposed