[SNAP-1136] Pooled version of Kryo serializer which works for closures #426
Conversation
- new PooledKryoSerializer that pools Kryo objects (performance is bad if a new instance is created for every call, since each must register and walk tons of classes)
- has an overridden version for ASCII strings to fix EsotericSoftware/kryo#128; currently makes a copy but will be modified to use one extra byte to indicate end of string
- optimized external serializers for StructType, and for Externalizable classes having a readResolve() method; the latter is used for StorageLevel and BlockManagerId
- added optimized serialization for the closure used by SparkSQLExecuteImpl (now a proper class instead)
- fixed index column determination in RowFormatRelation (was off by one due to 0-based vs 1-based indexing)
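The pooling approach in the first bullet can be sketched generically. This is not the actual PooledKryoSerializer from the PR, just a minimal illustration of the idea: serializer construction is expensive (heavy class registration), so instances are borrowed from a bounded pool and returned after use rather than created per call. The `ExpensiveSerializer` class is a stand-in for a real `com.esotericsoftware.kryo.Kryo` instance.

```java
import java.util.concurrent.ArrayBlockingQueue;

public class SerializerPool {
    // Stand-in for a Kryo instance; real code would pool Kryo objects,
    // whose construction must register and walk many classes.
    public static class ExpensiveSerializer {
        ExpensiveSerializer() { /* imagine heavy class registration here */ }
    }

    private final ArrayBlockingQueue<ExpensiveSerializer> pool;

    public SerializerPool(int capacity) {
        pool = new ArrayBlockingQueue<>(capacity);
    }

    /** Borrow a pooled instance, creating a new one only when the pool is empty. */
    public ExpensiveSerializer borrow() {
        ExpensiveSerializer s = pool.poll();
        return (s != null) ? s : new ExpensiveSerializer();
    }

    /** Return an instance to the pool; silently drop it if the pool is full. */
    public void release(ExpensiveSerializer s) {
        pool.offer(s);
    }
}
```

Pooling (rather than a single shared instance) also matters because Kryo instances are not thread-safe, so each caller needs exclusive use of the instance it borrows.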
run precheckin
…ce it does not use Lead API formatting changes and fixed some compiler warnings
- Kryo serialization for RowFormatScanRDD, SparkShellRowRDD, ColumnarStorePartitionedRDD, SparkShellCachedBatchRDD and MultiBucketExecutorPartition
- added base RDDKryo to encapsulate serialization of the bare minimum RDD fields (using reflection where required)
- removed unused SparkShellRDDHelper.mapBucketsToPartitions
- corrected an issue in SparkSQLExecuteImpl
- avoid sending ConnectionProperties in RowFormatScanRDD for cases where it is not required in execution
- updated log4j.properties for core/cluster tests
- removed LaunchTasks from PooledKryoSerializer since it is not yet in the code base
- changed Attribute to StructField in column decoders, since StructType has an efficient serializer and this is also cleaner in not depending on Attribute (which carries a potentially invalid ExprId for a remote node, though those fields are not used)
- removed the "a." alias in ORDER BY for Query36 in NorthWind tests (due to the fix for SPARK-17863, as reported in SNAP-1183)
- renamed the passed partitioning to "parts" to avoid a conflict with RDD.partitions that caused an infinite loop in getPartitions
- renamed TypeUtils to TypeUtilities to fix a docs error (due to a conflict with Spark's TypeUtils)
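The infinite-loop bug behind the rename can be illustrated with a hypothetical sketch (names and structure are illustrative, not the actual snappydata/Spark code): when the member holding the passed partitions resolves back to the inherited `partitions()` accessor instead of the intended value, the accessor and `getPartitions()` call each other forever.

```java
// Mirrors the RDD shape: a caching public accessor delegating to a subclass hook.
abstract class BaseRDD {
    private Object[] cached;

    // Public accessor (mirrors RDD.partitions); lazily fills the cache.
    public final Object[] partitions() {
        if (cached == null) cached = getPartitions();
        return cached;
    }

    protected abstract Object[] getPartitions();
}

class LoopingRDD extends BaseRDD {
    @Override
    protected Object[] getPartitions() {
        // BUG: intended to return the partitions passed to the constructor,
        // but this resolves to the inherited accessor, which calls back here.
        return partitions();
    }
}

class FixedRDD extends BaseRDD {
    private final Object[] parts; // renamed to avoid the name clash

    FixedRDD(Object[] parts) { this.parts = parts; }

    @Override
    protected Object[] getPartitions() { return parts; }
}
```

Calling `new LoopingRDD().partitions()` recurses until the stack overflows, while `FixedRDD` returns the stored array directly.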
- updated the spark link to fix AQP dunits with the new kryo serializer
- skip DUnitSingleTest in the aqp test target, since those really are dunits which should not be run like normal junit tests
- re-create the snappy catalog connection for MetaException failures too (the message says "... we don't support retries ...")
- clear the serializer/codec system properties when stopping Spark so that they are not carried over to subsequent tests in the same JVMs
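The second bullet is a test-hygiene fix: JVM-wide system properties set by one test would otherwise leak into later tests running in the same JVM. A minimal sketch of the idea, with illustrative property names (the exact keys cleared in the PR may differ):

```java
// Clear JVM-wide Spark overrides on shutdown so they do not leak into
// subsequent tests in the same JVM. Property names here are stand-ins.
public class SparkStopCleanup {
    static final String[] LEAKY_PROPS = {
        "spark.serializer",          // e.g. set to the pooled Kryo serializer
        "spark.io.compression.codec" // e.g. a codec chosen for one test
    };

    /** Remove properties that an earlier test may have set process-wide. */
    public static void clearSparkOverrides() {
        for (String key : LEAKY_PROPS) {
            System.clearProperty(key);
        }
    }
}
```

Hooking this into the Spark stop path means later tests start from the JVM's default configuration instead of inheriting a previous test's serializer or codec.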
run precheckin
For my understanding, could you please explain how the kryo serialization in core is compatible with stock Spark, where none of the kryo serialization in our fork exists yet?
In upstream Spark core, kryo serialization has been disabled for closures (SPARK-12414, apache/spark#11150) and is used only for data. Our changes to snappy-spark make it usable for closures, as well as for Spark Netty messages, which was not configurable earlier. Secondly, the PooledKryoSerializer has the fix for the string-overwrite issue (EsotericSoftware/kryo#128) and has been updated to kryo 4.0.0 to get the fix for EsotericSoftware/kryo#342, so it corrects those issues where they can be hit. With only data serialization these issues are not hit either, so this works with upstream Spark. In addition, this is not enabled by default when working with upstream Spark: it gets set only in LeadImpl and ExecutorInitiator by default. Users can enable it for split mode too using the "spark.serializer" property, but that has not been tested (though I believe it should work fine, as per the comments above).
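Enabling it for split mode as described would be a one-line configuration change along these lines. The fully qualified class name below is an assumption for illustration, and as the comment above notes, this path is untested:

```properties
# Hypothetical spark-defaults.conf fragment for split mode; the package
# of PooledKryoSerializer is assumed, not confirmed by this PR.
spark.serializer=org.apache.spark.serializer.PooledKryoSerializer
```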
Changes proposed in this pull request
- new PooledKryoSerializer that pools Kryo objects (performance is bad if a new instance is created for every call, which needs to register and walk tons of classes)
- overridden version for ASCII strings to fix EsotericSoftware/kryo#128; currently makes a copy but will be modified to use one extra byte to indicate end of string
- optimized external serializers for StructType, and for Externalizable classes having a readResolve() method; using the latter for StorageLevel and BlockManagerId
- added optimized serialization for the closure used by SparkSQLExecuteImpl (now a proper class instead)
- fixed index column determination in RowFormatRelation (was off by one due to 0-based vs 1-based indexing)
Patch testing
precheckin with spark; precheckin in progress
ReleaseNotes.txt changes
NA
Other PRs
TIBCOSoftware/snappy-spark#27