-
Notifications
You must be signed in to change notification settings - Fork 125
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changing serialization for knn vector from single array object to collection of floats #253
Changing serialization for knn vector from single array object to collection of floats #253
Conversation
Adding results of performance tests perf_test_results.zip |
Codecov Report
@@ Coverage Diff @@
## main #253 +/- ##
============================================
+ Coverage 83.22% 83.32% +0.09%
- Complexity 865 883 +18
============================================
Files 123 127 +4
Lines 3780 3832 +52
Branches 359 361 +2
============================================
+ Hits 3146 3193 +47
- Misses 473 477 +4
- Partials 161 162 +1
Continue to review full report at Codecov.
|
import java.util.Random; | ||
import java.util.stream.IntStream; | ||
|
||
public class VectorSerializerTests extends KNNTestCase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding unit tests to keep percentage of code coverage greater or equals to the current one
src/main/java/org/opensearch/knn/index/codec/VectorSerializer.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/codec/VectorSerializerFactory.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/codec/VectorSerializerFactory.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/codec/VectorSerializerFactory.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/codec/VectorSerializerFactory.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/codec/VectorSerializerFactory.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/codec/VectorAsCollectionOfFloatsSerializer.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/codec/VectorAsArraySerializer.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/KNNVectorScriptDocValues.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/codec/VectorAsArraySerializer.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/codec/VectorSerializerFactory.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/codec/VectorAsArraySerializer.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/codec/VectorAsCollectionOfFloatsSerializer.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/codec/VectorAsCollectionOfFloatsSerializer.java
Outdated
Show resolved
Hide resolved
e7d2593
to
da398af
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good!
* @param byteStream stream of bytes that will be used for deserialization to array of floats | ||
* @return array of floats deserialized from the stream | ||
* @throws IOException | ||
* @throws ClassNotFoundException |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to have ClassNotFoundException here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It comes from the serializer implementation, ObjectInputStream.readObject().
I've dig deeper into exceptions and seems we just catch and re-throw RuntimeException at the higher level, so I moved this logic inside the serializer and removed all exceptions from the interface method signature. I hope it makes sense
* | ||
* The OpenSearch Contributors require contributions made to | ||
* this file be licensed under the Apache-2.0 license or a | ||
* compatible open source license. | ||
* | ||
* Modifications Copyright OpenSearch Contributors. See | ||
* GitHub history for details. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't need to expand license description
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, let me revert to a shorter header, seems this one has been added by signoff command automatically
/* | ||
* SPDX-License-Identifier: Apache-2.0 | ||
* | ||
* The OpenSearch Contributors require contributions made to | ||
* this file be licensed under the Apache-2.0 license or a | ||
* compatible open source license. | ||
* | ||
* Modifications Copyright OpenSearch Contributors. See | ||
* GitHub history for details. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/* | |
* SPDX-License-Identifier: Apache-2.0 | |
* | |
* The OpenSearch Contributors require contributions made to | |
* this file be licensed under the Apache-2.0 license or a | |
* compatible open source license. | |
* | |
* Modifications Copyright OpenSearch Contributors. See | |
* GitHub history for details. | |
*/ | |
/* | |
* Copyright OpenSearch Contributors | |
* SPDX-License-Identifier: Apache-2.0 | |
*/ |
public float[] byteToFloatArray(ByteArrayInputStream byteStream) { | ||
final byte[] vectorAsByteArray = new byte[byteStream.available()]; | ||
byteStream.read(vectorAsByteArray, 0, byteStream.available()); | ||
final float[] vector = new float[vectorAsByteArray.length / BYTES_IN_FLOAT]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we move it to a variable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, it will make code more readable. let me do the change
} | ||
|
||
private static byte highByte(short shortValue) { | ||
return (byte) (shortValue>>8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need constant instead of number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, changing in next revision
|
||
public void testVectorSerializerFactory() { | ||
final KNNVectorSerializer defaultSerializer = KNNVectorSerializerFactory.getDefaultSerializer(); | ||
assertNotNull(defaultSerializer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we assert what is the default Serializer type you are expecting too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can check on exact serializer based on functionality it supports. Current default is for collection of floats, so it should be able to deserialize collection of floats without any exception and to the same original content. Is this something you're looking for in such check?
5508fa8
186bf04
to
8e946f5
Compare
…lection of floats Signed-off-by: Martin Gaievski <gaievski@amazon.com>
-added getDefaultSerializer to Factory - moved SerializationMode enum to a separate file - added javadocs and comments - adjust format, added missing endline characters Signed-off-by: Martin Gaievski <gaievski@amazon.com>
- replace Vector by KNNVector in class names and variables - fixed method names in Serializer interface - replace number of bytes in float from number to constant Signed-off-by: Martin Gaievski <gaievski@amazon.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
- rework factory method getSerializerByStreamContent - added test case for stream of unsupported content - removed exceptions from Serializer interface method's signatures, changed it to unchecked runtime exception - simplify license header in new classes Signed-off-by: Martin Gaievski <gaievski@amazon.com>
8e946f5
to
7a3a1a9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks!
…lection of floats (opensearch-project#253) * Changing serialization for knn vector from single array object to collection of floats rev2: * Addressing PR comments: - added getDefaultSerializer to Factory - moved SerializationMode enum to a separate file - added javadocs and comments - adjust format, added missing endline characters rev3: * Addressing multiple review comments: - replace Vector by KNNVector in class names and variables - fixed method names in Serializer interface - replace number of bytes in float from number to constant rev4: * Moving new classes under index.codec.util rev5: * Addressing multiple review comments: - rework factory method getSerializerByStreamContent - added test case for stream of unsupported content - removed exceptions from Serializer interface method's signatures, changed it to unchecked runtime exception - simplify license header in new classes Signed-off-by: Martin Gaievski <gaievski@amazon.com>
…lection of floats (opensearch-project#253) * Changing serialization for knn vector from single array object to collection of floats rev2: * Addressing PR comments: - added getDefaultSerializer to Factory - moved SerializationMode enum to a separate file - added javadocs and comments - adjust format, added missing endline characters rev3: * Addressing multiple review comments: - replace Vector by KNNVector in class names and variables - fixed method names in Serializer interface - replace number of bytes in float from number to constant rev4: * Moving new classes under index.codec.util rev5: * Addressing multiple review comments: - rework factory method getSerializerByStreamContent - added test case for stream of unsupported content - removed exceptions from Serializer interface method's signatures, changed it to unchecked runtime exception - simplify license header in new classes Signed-off-by: Martin Gaievski <gaievski@amazon.com>
…lection of floats (opensearch-project#253) * Changing serialization for knn vector from single array object to collection of floats rev2: * Addressing PR comments: - added getDefaultSerializer to Factory - moved SerializationMode enum to a separate file - added javadocs and comments - adjust format, added missing endline characters rev3: * Addressing multiple review comments: - replace Vector by KNNVector in class names and variables - fixed method names in Serializer interface - replace number of bytes in float from number to constant rev4: * Moving new classes under index.codec.util rev5: * Addressing multiple review comments: - rework factory method getSerializerByStreamContent - added test case for stream of unsupported content - removed exceptions from Serializer interface method's signatures, changed it to unchecked runtime exception - simplify license header in new classes Signed-off-by: Martin Gaievski <gaievski@amazon.com>
Description
Changing logic of k-NN vector serialization/deserialization in order to decrease memory taken by the index. Existing serialization stores vector as Java array, which has 27 bytes of overhead for each individual array (as per stream protocol). In newly implemented logic vector is serialized as a collection of individual numbers. During deserialization we can restore this collection to an array assuming it length from number of bytes in byte stream.
Logic has been added in a backward compatibility manner for deserialization. Initially we read first 27 bytes of the byte stream and identify if this is a serialized array based on the same stream protocol grammar. Depending on result of the check we apply old logic (deserialize as a single array object) or a new logic (deserialize as collection of individual floats).
Performance stayed roughly the same time-wise but we see an improvement in memory usage. Below are results of 2 tests comparing existing and new approaches. For testing we have used the benchmark tool from main k-NN branch with two data sets Fashion-MNIST and SIFT from here.
Diff (changed - original) for SIFT dataset:
Diff (changed - original) for Fashion-MNIST dataset:
Issues Resolved
knn_vector codec minor overhead
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.