Improve Pickling performance #67

Merged: 14 commits merged into irmen:master on May 15, 2019

Conversation

@adamsitnik (Contributor) commented May 9, 2019:

This PR is similar to #64: again, two structs implement writing to an array and to a Stream. Using them as generic type arguments eliminates all virtual method calls, which enables inlining and other optimizations.
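For context, here is a minimal sketch of that struct-as-generic-argument trick (the interface and type names below are illustrative, not the PR's actual types): when the generic argument is a struct, the JIT compiles a separate, specialized body per writer type, so the interface calls are devirtualized and become inlineable.

```csharp
using System.IO;

interface IOutputWriter
{
    void WriteByte(byte value);
}

struct ArrayOutput : IOutputWriter
{
    private readonly byte[] _buffer;
    private int _position;

    public ArrayOutput(byte[] buffer) { _buffer = buffer; _position = 0; }

    // Writing a byte is a plain indexed store; no virtual Stream.WriteByte call.
    public void WriteByte(byte value) => _buffer[_position++] = value;
}

struct StreamOutput : IOutputWriter
{
    private readonly Stream _stream;

    public StreamOutput(Stream stream) => _stream = stream;

    public void WriteByte(byte value) => _stream.WriteByte(value);
}

static class PicklerCore
{
    // The struct constraint gives each writer its own specialized, devirtualized body.
    public static void WriteInt32<TWriter>(ref TWriter writer, int value)
        where TWriter : struct, IOutputWriter
    {
        writer.WriteByte((byte)value);
        writer.WriteByte((byte)(value >> 8));
        writer.WriteByte((byte)(value >> 16));
        writer.WriteByte((byte)(value >> 24));
    }
}
```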

Differences:

  • stream.WriteByte(v) becomes a plain array[position++] = v.
  • Direct writes to the underlying memory buffer, with no extra copy step:
    WAS:
        BinaryPrimitives.Write(tempBuffer, value);
        stream.Write(tempBuffer, 0, tempBuffer.Length);
    IS:
        BinaryPrimitives.Write(outputBuffer, value);
  • Strings are serialized as UTF-8 with no intermediate memory allocation (this is going to help a LOT in the Q13 TPC-H benchmark); a fuller, self-contained sketch follows after this list:
    WAS:
        byte[] encoded = Encoding.UTF8.GetBytes(str); // array allocation
        stream.Write(encoded, 0, encoded.Length);
    IS:
        Encoding.UTF8.GetBytes(source, str.Length, destination + position, byteCount);
  • More inlining.
  • Faster type checks.
  • A hot path for strings (one extra method).
  • The possibility to serialize into a provided byte buffer with no managed allocations.
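Here is a self-contained sketch of the no-allocation UTF-8 write referenced in the list above. The method and variable names are assumptions for illustration, and the code requires unsafe blocks to be enabled; it is not the PR's verbatim implementation.

```csharp
using System.Text;

static class Utf8WriteSketch
{
    // Encodes `str` as UTF-8 directly into `output` starting at `position` and
    // returns the new position. No intermediate byte[] is allocated.
    public static unsafe int WriteUtf8(string str, byte[] output, int position)
    {
        int byteCount = Encoding.UTF8.GetByteCount(str);
        // The caller is assumed to have ensured output.Length >= position + byteCount.
        fixed (char* source = str)
        fixed (byte* destination = output)
        {
            Encoding.UTF8.GetBytes(source, str.Length, destination + position, byteCount);
        }
        return position + byteCount;
    }
}
```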

Benchmark Results:

The following benchmark results are for the pickler with useMemo=false (the dotnet/spark use case) and 1000 elements in each array.

Serializing an array of a given type to an array of bytes:

| Type | Mean WAS | Mean IS | WAS/IS |
| --- | --- | --- | --- |
| bool[] | 51.83 us | 45.491 us | 1.14 |
| double[] | 27.41 us | 9.192 us | 2.98 |
| int[] | 14.44 us | 6.362 us | 2.27 |
| string[] | 77.55 us | 27.774 us | 2.79 |

Serializing an array of objects with boxed values inside (the dotnet/spark scenario, where every array can contain values of different types):

| Type | Mean WAS | Mean IS | WAS/IS |
| --- | --- | --- | --- |
| bool[] | 22.96 us | 16.951 us | 1.35 |
| double[] | 43.37 us | 22.033 us | 1.97 |
| int[] | 31.94 us | 20.530 us | 1.56 |
| string[] | 81.45 us | 43.355 us | 1.88 |

Serializing an array of a given type to a provided array of bytes:

I have added a new method that accepts a byte buffer by reference and can resize it if needed. This will be helpful in the dotnet/spark scenario, where input comes in very similar batches and the output is almost always of the same size (for example: an array of 100 doubles, an array of 100 bools).
The goal is to allocate an array of bytes once and then write to it as many times as we want with almost no managed memory allocations (72 bytes are allocated, independent of the input size). A minimal usage sketch follows after the table below.

| Type | Mean WAS | Mean IS | WAS/IS | Allocated WAS | Allocated IS |
| --- | --- | --- | --- | --- | --- |
| bool[] | 51.83 us | 44.662 us | 1.16 | 34.17 KB | 32097 B |
| double[] | 27.41 us | 5.862 us | 4.68 | 40.82 KB | 72 B |
| int[] | 14.44 us | 5.290 us | 2.73 | 10.66 KB | 72 B |
| string[] | 77.55 us | 25.636 us | 3.03 | 54.95 KB | 72 B |
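A minimal usage sketch of the pattern measured above, assuming a by-ref dumps overload that roughly matches the description. The exact method name, signature, and return value here are assumptions for illustration, not the verbatim API added by the PR.

```csharp
using System;
using Razorvine.Pickle;

static class ReusableBufferSketch
{
    public static void PickleBatches(object[][] batches)
    {
        var pickler = new Pickler(false);       // useMemo: false, the dotnet/spark use case
        byte[] reusable = new byte[64 * 1024];  // allocated once, up front

        foreach (object[] batch in batches)
        {
            // Assumed by-ref overload: writes into `reusable` and resizes it only when the
            // pickled payload does not fit, so steady-state calls allocate almost nothing.
            int bytesWritten = pickler.dumps(batch, ref reusable);

            // Hand reusable[0..bytesWritten) to whatever consumes the pickled bytes.
            Console.WriteLine($"pickled {bytesWritten} bytes");
        }
    }
}
```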

Serializing an object[] to a provided array of bytes:

| Type | Mean WAS | Mean IS | WAS/IS | Allocated WAS | Allocated IS |
| --- | --- | --- | --- | --- | --- |
| bool[] | 22.96 us | 17.184 us | 1.34 | 2.9 KB | 72 B |
| double[] | 43.37 us | 18.474 us | 2.35 | 40.81 KB | 72 B |
| int[] | 31.94 us | 18.824 us | 1.70 | 10.65 KB | 72 B |
| string[] | 81.45 us | 37.711 us | 2.16 | 54.95 KB | 72 B |

Serializing to Stream

No regressions, minor improvements:

| Type | Mean WAS | Mean IS |
| --- | --- | --- |
| bool[] | 51.45 us | 49.901 us |
| double[] | 27.74 us | 27.089 us |
| int[] | 15.36 us | 15.150 us |
| string[] | 77.18 us | 60.875 us |

Boxed input (object[])

| Type | Mean WAS | Mean IS |
| --- | --- | --- |
| bool[] | 22.14 us | 19.775 us |
| double[] | 41.35 us | 39.768 us |
| int[] | 31.21 us | 28.752 us |
| string[] | 79.40 us | 72.461 us |

/cc @danmosemsft, @stephentoub, @imback82, @eerhardt, @rapoth

@irmen (Owner) commented May 9, 2019:

@adamsitnik random question (excuse my lack of understanding of how Spark and .NET Spark work): are you only using Pyrolite's pickle protocol implementation to talk to Spark? I mean, Pickle is a Python serialization protocol and Spark is primarily a Java system.

@imback82 commented May 9, 2019:

> @adamsitnik random question (excuse my lack of understanding of how Spark and .NET Spark work): are you only using Pyrolite's pickle protocol implementation to talk to Spark?

Yes. dotnet/spark is piggybacking on how PySpark works. So it needs to follow the same protocol as the Python worker, which uses pickling for Ser/De to transfer data between the Python worker process and the Spark JVM executor process.

@eerhardt left a comment:

Looks good. Awesome work @adamsitnik!

@irmen (Owner) commented May 14, 2019:

Amazed by your improvements.

I wonder if similar improvements are possible for the Java implementation.

@irmen merged commit da27ae7 into irmen:master on May 15, 2019
@irmen (Owner) commented May 16, 2019:

@adamsitnik @stephentoub I have just pushed a new release (4.26) to nuget.org

@adamsitnik (Contributor, Author) commented:
> I have just pushed a new release (4.26) to nuget.org

@irmen awesome, thank you!!
