@eshilts eshilts commented Aug 31, 2015

Added statsByKey() method for computing summary statistics of each key in an RDD.

x = sc.parallelize([("key_a", 1.0), ("key_a", 2.0), ("key_b", 2.0), ("key_b", 3.0)])
s = sorted(x.statsByKey().collect())
s[0]
#('key_a', (count: 2, mean: 1.5, stdev: 0.5, max: 2.0, min: 1.0))
s[1]
#('key_b', (count: 2, mean: 2.5, stdev: 0.5, max: 3.0, min: 2.0))
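
For reference, the numbers in the example output can be reproduced without Spark. The sketch below is plain Python using only the standard library; `stats_by_key_local` is a hypothetical helper name and is not part of this patch. It assumes the reported `stdev` is the population standard deviation (divide by n), which is what a value of 0.5 for `[1.0, 2.0]` implies.

```python
from collections import defaultdict
from statistics import mean, pstdev


def stats_by_key_local(pairs):
    """Group (key, value) pairs in memory and summarize each key.

    Hypothetical local stand-in for the proposed statsByKey(); no Spark involved.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {
        key: {
            "count": len(values),
            "mean": mean(values),
            "stdev": pstdev(values),  # population stdev, an assumption (see above)
            "max": max(values),
            "min": min(values),
        }
        for key, values in grouped.items()
    }


pairs = [("key_a", 1.0), ("key_a", 2.0), ("key_b", 2.0), ("key_b", 3.0)]
local_stats = stats_by_key_local(pairs)
```

Unlike an RDD method, this holds every value for a key in memory at once, so it is only a check on what the summary means, not a substitute for a distributed implementation.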

https://issues.apache.org/jira/browse/SPARK-10291

@eshilts eshilts force-pushed the SPARK-10291-statsByKey branch from 43d2662 to fdc858f Compare October 28, 2015 23:22
Author

eshilts commented Oct 28, 2015

This is ready to test.

I often compute the mean, standard deviation, etc. per key by hand, and this RDD method would make that a lot easier.

@andrewor14
Contributor

ok to test. What do you think @srowen @JoshRosen?

@andrewor14
Contributor

If we want something like this it would be good to add the Scala API first though.

SparkQA commented Dec 15, 2015

Test build #47706 has finished for PR 8539 at commit fdc858f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

srowen commented Dec 15, 2015

Yeah, this can't just exist in the Python API. It's not simplifying much, since there's already a whole class to do the accumulation of sufficient statistics; it's just a call to combineByKey. I appreciate the value of utility methods, but I have to weigh it against adding another item to a core API and how often it'd be used. This is also straightforward to express in Spark SQL on a DataFrame, no?
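
To make the "accumulation of sufficient statistics" point concrete, here is a minimal pure-Python sketch of the three functions a combineByKey-style aggregation needs. It is an illustration under stated assumptions, not the code from this patch: the names `create_combiner`, `merge_value`, `merge_combiners`, and `combine_by_key_local` are hypothetical, the two lists stand in for RDD partitions, and the accumulator is a plain tuple rather than Spark's existing statistics class.

```python
import math


def create_combiner(value):
    # Accumulator layout: (count, mean, m2, max, min),
    # where m2 is the sum of squared deviations from the mean.
    return (1, value, 0.0, value, value)


def merge_combiners(a, b):
    # Merge two partial summaries without revisiting the raw values.
    n_a, mean_a, m2_a, max_a, min_a = a
    n_b, mean_b, m2_b, max_b, min_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mean, m2, max(max_a, max_b), min(min_a, min_b))


def merge_value(acc, value):
    # Folding in one value is the same as merging a one-element summary.
    return merge_combiners(acc, create_combiner(value))


def combine_by_key_local(partitions):
    """Combine within each partition, then merge the partial results."""
    merged = {}
    for partition in partitions:
        partial = {}
        for key, value in partition:
            if key in partial:
                partial[key] = merge_value(partial[key], value)
            else:
                partial[key] = create_combiner(value)
        for key, acc in partial.items():
            merged[key] = merge_combiners(merged[key], acc) if key in merged else acc
    return merged


def population_stdev(acc):
    count, _, m2, _, _ = acc
    return math.sqrt(m2 / count)


# Hypothetical split of the example data into two partitions.
partitions = [
    [("key_a", 1.0), ("key_b", 2.0)],
    [("key_a", 2.0), ("key_b", 3.0)],
]
combined = combine_by_key_local(partitions)
```

In PySpark the same three functions would be passed to `RDD.combineByKey(createCombiner, mergeValue, mergeCombiners)`; only fixed-size summaries, never the raw values, cross partition boundaries.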

Member

srowen commented Jan 1, 2016

Do you mind closing this PR?

@asfgit asfgit closed this in 085f510 Feb 4, 2016