
[WIP]: Spark 27463: Cogrouped Pandas UDF POC #1

Closed
d80tb7 wants to merge 6 commits into master from SPARK-27463-poc

Conversation


@d80tb7 d80tb7 commented Jun 21, 2019

  • This includes:

  • JVM serialisation for interleaved dataframes.

  • Python deserialisation for interleaved dataframes.

  • A skeleton cogroup implementation

As this is a PoC, there are a couple of caveats:

  • This code is very rough!
  • The data passing is pretty minimal (e.g. it only supports exactly two dataframes, there's no ability to distinguish on the Python side between key and value columns, etc.)
  • The cogroup implementation I have doesn't actually work properly (the column pruning is removing all the non-key columns of the right dataframe, so the rhs passed to pandas is missing the value cols).

At this point I think I'd like to focus on:

  • Does the data passing mechanism (i.e. the deviation from Arrow streaming) make sense?
  • If we are going to introduce such a data passing mechanism how complex should it be?
  • Does the high-level implementation of the cogroup here make sense?
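For readers new to the idea, the cogroup semantics under discussion can be sketched in plain Python. This is a hypothetical, stdlib-only illustration of the data shape a cogrouped pandas UDF would receive (rows as dicts rather than Arrow batches), not the actual Spark implementation:

```python
from collections import defaultdict

def cogroup(left, right, key):
    """Group two lists of dict-rows by a key column and yield
    (key, left_rows, right_rows) triples -- the shape a cogrouped
    UDF would see one group at a time."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[key]][0].append(row)
    for row in right:
        groups[row[key]][1].append(row)
    for k in sorted(groups):
        yield k, groups[k][0], groups[k][1]

left = [{"id": 1, "v": 10}, {"id": 2, "v": 20}]
right = [{"id": 1, "w": 100}]
result = list(cogroup(left, right, "id"))
# result[0] == (1, [{"id": 1, "v": 10}], [{"id": 1, "w": 100}])
# result[1] == (2, [{"id": 2, "v": 20}], [])
```

Note that, unlike a join, each side's rows arrive intact per key, and a key present on only one side still produces a group (with the other side empty).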



abstract class BaseArrowPythonRunner[T](
funcs: Seq[ChainedPythonFunctions],
Owner Author

This is just some common stuff that I needed for both the new data passing mechanism and the existing Arrow streaming mechanism. I've broken it out here mainly because it made it easier for me to track what new functionality I'd actually added. I don't think a proper solution would really have this class hierarchy.

import org.apache.arrow.vector.ipc.message.{ArrowRecordBatch, MessageSerializer}


class InterleavedArrowWriter(leftRoot: VectorSchemaRoot,
Owner Author


This is analogous to org.apache.arrow.vector.ipc.ArrowWriter but allows interleaved dataframes to be sent. I suspect it could all be more memory efficient if we had a different interface that allowed the left batch to be sent before the right batch is loaded.
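The interleaving idea itself can be shown with a small stdlib-only sketch. The function name and list-based "batches" are hypothetical stand-ins; the real writer works on Arrow VectorSchemaRoots and record batches:

```python
def interleave_batches(left_batches, right_batches):
    """Emit batches alternately (left, right, left, right, ...),
    tagging each with its side so the reader can reassemble the
    two streams -- the same idea the interleaved writer applies
    to Arrow record batches."""
    if len(left_batches) != len(right_batches):
        raise ValueError("cogrouped sides must produce paired batches")
    out = []
    for lb, rb in zip(left_batches, right_batches):
        out.append(("left", lb))
        out.append(("right", rb))
    return out

stream = interleave_batches([[1, 2], [3]], [[10], [20, 30]])
# stream == [("left", [1, 2]), ("right", [10]),
#            ("left", [3]), ("right", [20, 30])]
```

Under this scheme the reader must buffer a full left batch before the paired right batch arrives, which is the memory concern mentioned above.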

@@ -0,0 +1,33 @@
/*
Owner Author

ignore this!


def __init__(self, stream):
import pyarrow as pa
self._schema1 = pa.read_schema(stream)
Owner Author

I wanted to read these using the message reader as well, but for some reason pa.read_schema(self_reader.read_next_message()) didn't work.

@icexelloss

@d80tb7 This is a great start! I think this could be submitted against apache/spark master because:
(1) The code looks reasonable
(2) More people will watch that
(3) It shows progress

WDYT? If you are going to do that, then I will wait for that one and comment.


d80tb7 commented Jun 25, 2019

@icexelloss fair enough I've raised against apache/spark master here: apache#24965

@d80tb7 d80tb7 closed this Jun 25, 2019
d80tb7 pushed a commit that referenced this pull request Jul 23, 2019
…comparison assertions

## What changes were proposed in this pull request?

This PR removes a few hardware-dependent assertions which can cause a failure in `aarch64`.

**x86_64**
```
root@donotdel-openlab-allinone-l00242678:/home/ubuntu# uname -a
Linux donotdel-openlab-allinone-l00242678 4.4.0-154-generic #181-Ubuntu SMP Tue Jun 25 05:29:03 UTC
2019 x86_64 x86_64 x86_64 GNU/Linux

scala> import java.lang.Float.floatToRawIntBits
import java.lang.Float.floatToRawIntBits
scala> floatToRawIntBits(0.0f/0.0f)
res0: Int = -4194304
scala> floatToRawIntBits(Float.NaN)
res1: Int = 2143289344
```

**aarch64**
```
[root@arm-huangtianhua spark]# uname -a
Linux arm-huangtianhua 4.14.0-49.el7a.aarch64 #1 SMP Tue Apr 10 17:22:26 UTC 2018 aarch64 aarch64 aarch64 GNU/Linux

scala> import java.lang.Float.floatToRawIntBits
import java.lang.Float.floatToRawIntBits
scala> floatToRawIntBits(0.0f/0.0f)
res1: Int = 2143289344
scala> floatToRawIntBits(Float.NaN)
res2: Int = 2143289344
```
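The divergence above is exactly why asserting an exact NaN bit pattern is fragile. A small Python illustration (my own sketch, not part of the patch) of reading the raw bits and of the portable alternative, which tests NaN-ness rather than comparing bit patterns:

```python
import math
import struct

def float_bits(x):
    """Raw IEEE-754 single-precision bits of x, as a signed int
    (the Python analogue of java.lang.Float.floatToRawIntBits)."""
    return struct.unpack("<i", struct.pack("<f", x))[0]

canonical = float_bits(float("nan"))
# Typically 2143289344 (0x7FC00000) on x86, but a NaN produced by an
# actual operation such as 0.0/0.0 may carry different bits per platform.

# Portable check: any NaN has an all-ones exponent and a nonzero mantissa,
# so test is-NaN instead of comparing raw bits.
assert math.isnan(struct.unpack("<f", struct.pack("<i", canonical))[0])
```

This mirrors the fix in the patch: drop the hardware-dependent raw-bit assertions rather than special-case each architecture.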

## How was this patch tested?

Pass the Jenkins (This removes the test coverage).

Closes apache#25186 from huangtianhua/special-test-case-for-aarch64.

Authored-by: huangtianhua <huangtianhua@huawei.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
d80tb7 pushed a commit that referenced this pull request Oct 30, 2019
### What changes were proposed in this pull request?
`org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite` has been failing lately. After having a look at the logs, it just shows the following fact without any details:
```
Caused by: sbt.ForkMain$ForkError: sun.security.krb5.KrbException: Server not found in Kerberos database (7) - Server not found in Kerberos database
```
Since the issue is intermittent and we are not able to reproduce it, we should add more debug information and wait for reproduction with the extended logs.

### Why are the changes needed?
The failing test doesn't give enough debug information.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
I've started the test manually and checked that such additional debug messages show up:
```
>>> KrbApReq: APOptions are 00000000 00000000 00000000 00000000
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
Looking for keys for: kafka/localhost@EXAMPLE.COM
Added key: 17version: 0
Added key: 23version: 0
Added key: 16version: 0
Found unsupported keytype (3) for kafka/localhost@EXAMPLE.COM
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
Using builtin default etypes for permitted_enctypes
default etypes for permitted_enctypes: 17 16 23.
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
MemoryCache: add 1571936500/174770/16C565221B70AAB2BEFE31A83D13A2F4/client/localhost@EXAMPLE.COM to client/localhost@EXAMPLE.COM|kafka/localhost@EXAMPLE.COM
MemoryCache: Existing AuthList:
#3: 1571936493/200803/8CD70D280B0862C5DA1FF901ECAD39FE/client/localhost@EXAMPLE.COM
#2: 1571936499/985009/BAD33290D079DD4E3579A8686EC326B7/client/localhost@EXAMPLE.COM
#1: 1571936499/995208/B76B9D78A9BE283AC78340157107FD40/client/localhost@EXAMPLE.COM
```

Closes apache#26252 from gaborgsomogyi/SPARK-29580.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>