[SNAP-1136] Pooled version of Kryo serializer which works for closures #426

Merged
merged 11 commits into from
Nov 28, 2016

Conversation

sumwale
Contributor

@sumwale sumwale commented Nov 25, 2016

Changes proposed in this pull request

  • new PooledKryoSerializer that pools Kryo instances (performance suffers badly if
    a new instance, which must register and walk a large number of classes, is created for every call)
  • has an overridden version for ASCII strings to work around "readClassAndObject throws KryoException: Buffer underflow" (EsotericSoftware/kryo#128);
    currently makes a copy, but will be changed to use one extra byte to mark the end of the string
  • optimized external serializers for StructType, and for Externalizable classes having a readResolve() method;
    the latter is used for StorageLevel and BlockManagerId
  • added optimized serialization for the closure used by SparkSQLExecuteImpl (now a proper class instead)
  • fixed index column determination in RowFormatRelation (was off by one due to 0-based vs 1-based indexing)
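The pooling described in the first point can be sketched roughly as follows. This is a minimal, self-contained illustration of the borrow/release pattern only; the real PooledKryoSerializer wraps actual com.esotericsoftware.kryo.Kryo instances behind Spark's SerializerInstance API, which is not reproduced here, and StringBuilder stands in for the expensive-to-construct Kryo object:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Supplier;

// Minimal object pool illustrating the idea: constructing a Kryo instance is
// expensive (it has to register and walk many classes), so instances are
// cached and reused instead of being created for every serialization call.
final class SimplePool<T> {
    private final ConcurrentLinkedQueue<T> pool = new ConcurrentLinkedQueue<>();
    private final Supplier<T> factory;

    SimplePool(Supplier<T> factory) { this.factory = factory; }

    T borrow() {
        T instance = pool.poll();          // reuse a cached instance if available
        return instance != null ? instance : factory.get();
    }

    void release(T instance) {
        pool.offer(instance);              // return the instance for later reuse
    }
}

public class PoolDemo {
    static int created = 0;

    public static void main(String[] args) {
        SimplePool<StringBuilder> pool = new SimplePool<>(() -> {
            created++;                      // stand-in for expensive Kryo setup
            return new StringBuilder();
        });
        StringBuilder a = pool.borrow();    // first borrow creates an instance
        pool.release(a);
        StringBuilder b = pool.borrow();    // second borrow reuses the same one
        System.out.println(created == 1 && a == b);
    }
}
```

Callers must release instances back to the pool (typically in a finally block) or the caching benefit is lost.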

Patch testing

precheckin done; precheckin with Spark is in progress

ReleaseNotes.txt changes

NA

Other PRs

TIBCOSoftware/snappy-spark#27

@sumwale
Contributor Author

sumwale commented Nov 25, 2016

run precheckin

Sumedh Wale added 8 commits November 26, 2016 03:17
…ce it does not use Lead API

formatting changes and fixed some compiler warnings
- Kryo serialization for RowFormatScanRDD, SparkShellRowRDD, ColumnarStorePartitionedRDD, SparkShellCachedBatchRDD and MultiBucketExecutorPartition
- added base RDDKryo to encapsulate serialization of bare minimum fields in RDD (using reflection where required)
- removed unused SparkShellRDDHelper.mapBucketsToPartitions
correcting an issue in SparkSQLExecuteImpl

avoid sending ConnectionProperties in RowFormatScanRDD for cases where it is not required in execution

updated log4j.properties for core/cluster tests

removed LaunchTasks, which is not yet in the code base, from PooledKryoSerializer

change Attribute to StructField in column decoders since StructType has an efficient serializer;
this is also cleaner since it no longer depends on Attribute (whose ExprId can be invalid on a
    remote node, though those fields are not used)

removing "a." alias in ORDER BY for Query36 in NorthWind tests (due to fix for SPARK-17863 as reported in SNAP-1183)
- renamed passed partitioning to "parts" to avoid conflict with RDD.partitions that caused infinite loop in getPartitions
- renamed TypeUtils to TypeUtilities to fix docs error (due to conflict with Spark TypeUtils)
- updating spark link to fix AQP dunits with the new kryo serializer
- skip DUnitSingleTest from the aqp test target since those really are dunits which should not be run like normal junit tests
- re-create snappy catalog connection for MetaException failures too (message says "... we don't support retries ...")
- clear the serializer/codec system properties when stopping Spark so that these are not carried through to subsequent tests in same JVMs
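The last commit point, clearing the serializer/codec system properties when stopping Spark, follows the usual pattern of cleaning up JVM-global state in teardown so it does not leak into later tests running in the same JVM. A rough sketch, assuming "spark.serializer" and "spark.io.compression.codec" are the keys involved (the actual set of properties cleared may differ):

```java
// Sketch of clearing JVM-global system properties during teardown so they do
// not carry over to subsequent tests running in the same JVM. The property
// keys mirror Spark's, but the cleanup pattern itself is generic.
public class TeardownDemo {
    static void stopAndCleanup() {
        // ... stop the SparkContext here in real code ...
        System.clearProperty("spark.serializer");
        System.clearProperty("spark.io.compression.codec");
    }

    public static void main(String[] args) {
        System.setProperty("spark.serializer", "some.custom.Serializer");
        stopAndCleanup();
        // After cleanup, the next test in this JVM sees no stale setting.
        System.out.println(System.getProperty("spark.serializer") == null);
    }
}
```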
@sumwale
Contributor Author

sumwale commented Nov 27, 2016

run precheckin

@sumwale sumwale merged commit 9e75292 into master Nov 28, 2016
@sumwale sumwale deleted the SNAP-1136 branch November 28, 2016 11:01
@soubhik-c

For my understanding, could you please explain how Kryo serialization in core is compatible with stock Spark, where none of the Kryo serialization changes in our fork are present yet?

@sumwale
Contributor Author

sumwale commented Nov 30, 2016

In upstream Spark core, Kryo serialization has been disabled for closures (SPARK-12414, apache/spark#11150) and is used only for data. Our changes to snappy-spark make it usable for closures, as well as for Spark Netty messages, which was not configurable earlier. Secondly, PooledKryoSerializer has the fix for the string overwrite issue (EsotericSoftware/kryo#128) and has been updated to Kryo 4.0.0 to pick up the fix for EsotericSoftware/kryo#342, so it corrects those issues where they can be hit. With data-only serialization these issues are not hit in any case, so this works with upstream Spark.

In addition, this is not enabled by default when working with upstream Spark; it is set by default only in LeadImpl and ExecutorInitiator. Users can enable it for split mode too using the "spark.serializer" property, but that has not been tested (though I believe it should work fine, as per the comments above).
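For reference, enabling a custom serializer in split mode would use Spark's standard configuration mechanism, as sketched below. Note that the fully qualified class name shown here is an assumption for illustration; check the actual package of PooledKryoSerializer in the snappydata sources before using it:

```
# spark-defaults.conf (illustrative; verify the serializer class's actual package)
spark.serializer    org.apache.spark.serializer.PooledKryoSerializer
```

The same key can equivalently be passed via `--conf spark.serializer=...` on spark-submit.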
