-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Pickling performance #67
Conversation
…when converting string to Utf8
…k the existing APIs
…t explaining why it's generic
@adamsitnik random question (excuse my lack of understanding of how Spark and .NET Spark work): are you only using Pyrolite's pickle protocol implementation to talk to Spark? I mean; Pickle is a Python serialization protocol and Spark is primarily a Java system. |
Yes. dotnet/spark is piggybacking on how PySpark works. So, it needs to follow the same protocol as the Python worker, which uses pickling for Ser/De to transfer data between the Python worker process and Spark JVM executor process. |
dotnet/Razorvine.Pyrolite/Pyrolite/Pickle/Internals/IPicklerImplementation.cs
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Awesome work @adamsitnik!
Amazed by your improvements. I wonder if similar improvements are possible for the Java implementation. |
@adamsitnik @stephentoub I have just pushed a new release (4.26) to nuget.org |
@irmen awesome, thank you!! |
This PR is similar to #64 so again we have two structs that implement writing to Array and Stream. By using them as generics all virtual method calls are eliminated which allows for inlining and other optimizations.
Difference:
stream.WriteByte(v)
is justarray[position++] = v
.Benchmark Results:
Following benchmark results are for pickler with
useMemo=false
(dotnet/spark use case) and1000
elements in each array.Serializing an array of given type to array of bytes:
Serializing an array of objects with boxed values inside (dotnet/spark scenario, every array can contain values of different types)
Serializing an array of given type to provided array of bytes:
I have added a new method that accepts a byte buffer by reference and can resize it if needed. This will be helpfull in
dotnet/spark
scenario where input comes in very similar batches and output is almost always of the same size (for example: an array of 100 doubles, an array of 100 bools) .The goal is to allocate an array of bytes once and then write to it as many times as we want with almost no managed memory allocations (72 bytes are going to be allocated, independent from input size).
Serializing an object[] to provided array of bytes
Serializing to Stream
No regressions, minor improvements:
Boxed input (object[])
/cc @danmosemsft @stephentoub, @imback82, @eerhardt, @rapoth