totalDigest add spark dataframe column / array #65
Comments
Yes, this is a real problem. There is a pull request that I have accepted which should help with this. In the meantime, check out https://gist.github.com/RobColeman/7a5ebcb7c155c94b0a62.
See my comment on #56 (comment) for a significant warning/caveat. Here's a different approach that we're currently using in our Spark code:

import java.io.{IOException, ObjectInputStream, ObjectOutputStream}
import java.nio.ByteBuffer

import com.tdunning.math.stats.TreeDigest

/** Wrap TreeDigest inside a class that properly handles serialization so that it can be used inside Spark.
 *
 * @param digest The TreeDigest to wrap.
 */
class TreeDigestWrapper(private var digest: TreeDigest) extends Serializable {

  def size(): Long = digest.size()

  def add(x: Double): Unit = digest.add(x)

  // ... expose other wrapper methods as necessary

  // Custom Java serialization hook: write the digest as its compact byte encoding.
  @throws(classOf[IOException])
  private def writeObject(out: ObjectOutputStream): Unit = {
    val bytes = new Array[Byte](digest.byteSize)
    out.writeInt(bytes.length)
    digest.asBytes(ByteBuffer.wrap(bytes))
    out.write(bytes)
  }

  // Custom Java deserialization hook: rebuild the digest from its byte encoding.
  @throws(classOf[IOException])
  private def readObject(in: ObjectInputStream): Unit = {
    val length = in.readInt()
    val bytes = new Array[Byte](length)
    in.readFully(bytes)
    digest = TreeDigest.fromBytes(ByteBuffer.wrap(bytes))
  }
}

Then you can perform operations directly on the wrapper, and you'll only pay serialization/deserialization costs when data is moved between JVMs (which Spark should be good at minimizing).
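For what it's worth, here is a hedged sketch of how such a wrapper might be fed from a DataFrame column with treeAggregate. It assumes the wrapper also exposes a merge method over the underlying TreeDigest (one of the "other wrapper methods" noted above), and it reuses the df, "col", and compression 100 from the question at the bottom of this thread; none of this is code from the repository itself.

import com.tdunning.math.stats.TreeDigest

// Hypothetical: assumes TreeDigestWrapper also exposes
//   def merge(other: TreeDigestWrapper): Unit
// which adds the other digest's centroids into this one.
val digest = df.select("col").rdd
  .map(_.getDouble(0))
  .treeAggregate(new TreeDigestWrapper(new TreeDigest(100)))(
    seqOp = (d, x) => { d.add(x); d },    // fold values into the partition-local digest
    combOp = (a, b) => { a.merge(b); a }  // merge digests across partitions
  )

println(digest.size())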
Kevin, don't you think it would be better to do without the wrapper entirely?
Oh, yeah, just making things natively Serializable would be even better.
The digests are all natively Serializable now. This will be improved in the future to allow cross-serialization and more economical formats, but for now it works.
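Given that, the question below can be answered without any wrapper. A minimal sketch, assuming a t-digest version in which the digest classes implement Serializable (as stated above) and the same df, "col", and compression 100 as in the question:

import scala.collection.JavaConverters._
import com.tdunning.math.stats.TDigest

// Build one digest per partition, then merge them pairwise; only the small
// digest objects are serialized and shipped, never the raw column values.
val totalDigest = df.select("col").rdd
  .map(_.getDouble(0))
  .treeAggregate(TDigest.createDigest(100))(
    seqOp = (digest, x) => { digest.add(x); digest },
    combOp = (left, right) => {
      // Merge by re-adding the right digest's centroids into the left one.
      right.centroids().asScala.foreach(c => left.add(c.mean(), c.count()))
      left
    }
  )

println(totalDigest.quantile(0.99))

This avoids pulling the whole column to the driver with .take(), so it scales to columns that don't fit in driver memory.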
Currently, we are loading a Spark DataFrame column into totalDigest in a hackish way using .take() and foreach, as t-digests keep throwing an "object not serializable" error without .take(). Is there a more native way of loading large arrays / Spark DataFrame columns directly into totalDigest?
Sample code we are using:
val totalDigest = TDigest.createDigest(100)
val data = df.select("col").rdd.map(r => r.getDouble(0)).take(numberOfRows)
data.foreach(value => totalDigest.add(value))
Alternatively, using an array of bytes:
http://apache-spark-user-list.1001560.n3.nabble.com/Percentile-td19978.html#a20032
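In case that byte-based route is preferred, the idea in the linked thread is to keep the digest itself off the wire and move only its byte encoding. A rough sketch of that pattern (not the exact code from the link), reusing df, "col", and compression 100 from above:

import java.nio.ByteBuffer
import scala.collection.JavaConverters._
import com.tdunning.math.stats.TreeDigest

// Digest each partition locally and return only the encoded bytes, so no
// TreeDigest instance ever has to go through Java serialization.
val partitionDigests: Array[Array[Byte]] = df.select("col").rdd
  .map(_.getDouble(0))
  .mapPartitions { values =>
    val digest = new TreeDigest(100)
    values.foreach(v => digest.add(v))
    val buf = ByteBuffer.allocate(digest.byteSize)
    digest.asBytes(buf)
    Iterator.single(buf.array())
  }
  .collect()

// Decode the per-partition digests on the driver and merge them.
val totalDigest = new TreeDigest(100)
for (bytes <- partitionDigests) {
  val part = TreeDigest.fromBytes(ByteBuffer.wrap(bytes))
  part.centroids().asScala.foreach(c => totalDigest.add(c.mean(), c.count()))
}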