Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changing serialization for knn vector from single array object to collection of floats #253

Conversation

martin-gaievski
Copy link
Member

Description

Changing logic of k-NN vector serialization/deserialization in order to decrease memory taken by the index. Existing serialization stores vector as Java array, which has 27 bytes of overhead for each individual array (as per stream protocol). In newly implemented logic vector is serialized as a collection of individual numbers. During deserialization we can restore this collection to an array assuming it length from number of bytes in byte stream.

Logic has been added in a backward compatibility manner for deserialization. Initially we read first 27 bytes of the byte stream and identify if this is a serialized array based on the same stream protocol grammar. Depending on result of the check we apply old logic (deserialize as a single array object) or a new logic (deserialize as collection of individual floats).

Performance stayed roughly the same time-wise but we see an improvement in memory usage. Below are results of 2 tests comparing existing and new approaches. For testing we have used the benchmark tool from main k-NN branch with two data sets Fashion-MNIST and SIFT from here.

Diff (changed - original) for SIFT dataset:

{
  "test_took": -37602.98385410011,
  "train_model_took_total": -717.2553265995593,
  "create_index_took_total": 58.110593299596985,
  "ingest_took_total": 248.3000000000029,
  "ingest_took_p50": -0.5999999999999943,
  "ingest_took_p90": -1.799999999999983,
  "ingest_took_p99": -0.6000000000000227,
  "refresh_index_store_kb_total": -14444.896679687488,
  "refresh_index_took_total": -152.2593566997159,
  "force_merge_took_total": -119.27976410053702,
  "query_took_total": -36920.59999999998,
  "query_took_p50": -0.6999999999999993,
  "query_took_p90": -1.0999999999999979,
  "query_took_p99": -2.6999999999999993,
  "query_memory_kb_total": -0.8000000000029104,
  "query_recall@K_total": 0.004822199999999999,
  "query_recall@1_total": -0.00017999999999984695
}

Diff (changed - original) for Fashion-MNIST dataset:

{
  "test_took": -1796.963270999724,
  "create_index_took_total": -195.58473960000993,
  "ingest_took_total": 158.40000000000146,
  "ingest_took_p50": -0.19999999999998863,
  "ingest_took_p90": -0.19999999999998863,
  "ingest_took_p99": 4.0,
  "refresh_index_store_kb_total": 23103.330078125,
  "refresh_index_took_total": -4277.6522125998745,
  "force_merge_took_total": 60.2736812000615,
  "query_took_total": 2457.600000000035,
  "query_took_p50": -0.20000000000000018,
  "query_took_p90": -0.1999999999999993,
  "query_took_p99": -0.6999999999999993,
  "query_memory_kb_total": -1.5,
  "query_recall@K_total": 8.959999999724744e-06,
  "query_recall@1_total": 0.0
}

Issues Resolved

knn_vector codec minor overhead

Check List

  • New functionality includes testing.
    • All tests pass
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@martin-gaievski
Copy link
Member Author

Adding results of performance tests perf_test_results.zip

@codecov-commenter
Copy link

codecov-commenter commented Jan 8, 2022

Codecov Report

Merging #253 (5508fa8) into main (fee4026) will increase coverage by 0.09%.
The diff coverage is 86.56%.

Impacted file tree graph

@@             Coverage Diff              @@
##               main     #253      +/-   ##
============================================
+ Coverage     83.22%   83.32%   +0.09%     
- Complexity      865      883      +18     
============================================
  Files           123      127       +4     
  Lines          3780     3832      +52     
  Branches        359      361       +2     
============================================
+ Hits           3146     3193      +47     
- Misses          473      477       +4     
- Partials        161      162       +1     
Impacted Files Coverage Δ
...n/java/org/opensearch/knn/index/KNNIndexShard.java 92.85% <ø> (ø)
...nn/index/codec/KNN80Codec/KNN80CompoundFormat.java 100.00% <ø> (ø)
...index/codec/KNN80Codec/KNN80DocValuesConsumer.java 82.00% <ø> (ø)
...n/index/codec/util/KNNVectorAsArraySerializer.java 62.50% <62.50%> (ø)
.../util/KNNVectorAsCollectionOfFloatsSerializer.java 85.71% <85.71%> (ø)
...n/index/codec/util/KNNVectorSerializerFactory.java 96.15% <96.15%> (ø)
...opensearch/knn/index/KNNVectorScriptDocValues.java 90.47% <100.00%> (+8.65%) ⬆️
...ain/java/org/opensearch/knn/index/VectorField.java 75.00% <100.00%> (-6.82%) ⬇️
.../opensearch/knn/index/codec/util/KNNCodecUtil.java 93.75% <100.00%> (ø)
...search/knn/index/codec/util/SerializationMode.java 100.00% <100.00%> (ø)
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fee4026...5508fa8. Read the comment docs.

import java.util.Random;
import java.util.stream.IntStream;

public class VectorSerializerTests extends KNNTestCase {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding unit tests to keep percentage of code coverage greater or equals to the current one

@martin-gaievski martin-gaievski self-assigned this Jan 10, 2022
@martin-gaievski martin-gaievski added v1.3.0 Issues and PRs related to version 1.3.0 Enhancements Increases software capabilities beyond original client specifications labels Jan 10, 2022
@martin-gaievski martin-gaievski linked an issue Jan 10, 2022 that may be closed by this pull request
@martin-gaievski martin-gaievski force-pushed the custom_serialization_for_vector branch from e7d2593 to da398af Compare January 11, 2022 23:10
@martin-gaievski martin-gaievski marked this pull request as ready for review January 12, 2022 18:43
Shivamdhar
Shivamdhar previously approved these changes Jan 12, 2022
Copy link
Contributor

@Shivamdhar Shivamdhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good!

* @param byteStream stream of bytes that will be used for deserialization to array of floats
* @return array of floats deserialized from the stream
* @throws IOException
* @throws ClassNotFoundException
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to have ClassNotFoundException here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It comes from the serializer implementation, ObjectInputStream.readObject().
I've dig deeper into exceptions and seems we just catch and re-throw RuntimeException at the higher level, so I moved this logic inside the serializer and removed all exceptions from the interface method signature. I hope it makes sense

Comment on lines 3 to 9
*
* The OpenSearch Contributors require contributions made to
* this file be licensed under the Apache-2.0 license or a
* compatible open source license.
*
* Modifications Copyright OpenSearch Contributors. See
* GitHub history for details.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need to expand license description

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, let me revert to a shorter header, seems this one has been added by signoff command automatically

Comment on lines 1 to 4
/*
* SPDX-License-Identifier: Apache-2.0
*
* The OpenSearch Contributors require contributions made to
* this file be licensed under the Apache-2.0 license or a
* compatible open source license.
*
* Modifications Copyright OpenSearch Contributors. See
* GitHub history for details.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/*
* SPDX-License-Identifier: Apache-2.0
*
* The OpenSearch Contributors require contributions made to
* this file be licensed under the Apache-2.0 license or a
* compatible open source license.
*
* Modifications Copyright OpenSearch Contributors. See
* GitHub history for details.
*/
/*
* Copyright OpenSearch Contributors
* SPDX-License-Identifier: Apache-2.0
*/

public float[] byteToFloatArray(ByteArrayInputStream byteStream) {
final byte[] vectorAsByteArray = new byte[byteStream.available()];
byteStream.read(vectorAsByteArray, 0, byteStream.available());
final float[] vector = new float[vectorAsByteArray.length / BYTES_IN_FLOAT];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we move it to a variable

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, it will make code more readable. let me do the change

}

private static byte highByte(short shortValue) {
return (byte) (shortValue>>8);
Copy link
Member

@VijayanB VijayanB Jan 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need constant instead of number

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, changing in next revision


public void testVectorSerializerFactory() {
final KNNVectorSerializer defaultSerializer = KNNVectorSerializerFactory.getDefaultSerializer();
assertNotNull(defaultSerializer);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we assert what is the default Serializer type you are expecting too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can check on exact serializer based on functionality it supports. Current default is for collection of floats, so it should be able to deserialize collection of floats without any exception and to the same original content. Is this something you're looking for in such check?

VijayanB
VijayanB previously approved these changes Jan 12, 2022
@martin-gaievski martin-gaievski dismissed stale reviews from VijayanB and Shivamdhar via 5508fa8 January 12, 2022 23:46
@martin-gaievski martin-gaievski force-pushed the custom_serialization_for_vector branch 2 times, most recently from 186bf04 to 8e946f5 Compare January 13, 2022 00:12
…lection of floats

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
-added getDefaultSerializer to Factory
- moved SerializationMode enum to a separate file
- added javadocs and comments
- adjust format, added missing endline characters

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
- replace Vector by KNNVector in class names and variables
- fixed method names in Serializer interface
- replace number of bytes in float from number to constant

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
- rework factory method getSerializerByStreamContent
- added test case for stream of unsupported content
- removed exceptions from Serializer interface method's signatures, changed it to unchecked runtime exception
- simplify license header in new classes

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
@martin-gaievski martin-gaievski force-pushed the custom_serialization_for_vector branch from 8e946f5 to 7a3a1a9 Compare January 13, 2022 00:27
Copy link
Member

@jmazanec15 jmazanec15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks!

@martin-gaievski martin-gaievski merged commit e781ad3 into opensearch-project:main Jan 13, 2022
martin-gaievski added a commit to martin-gaievski/k-NN that referenced this pull request Mar 7, 2022
…lection of floats (opensearch-project#253)

* Changing serialization for knn vector from single array object to collection of floats

rev2:
* Addressing PR comments:
- added getDefaultSerializer to Factory
- moved SerializationMode enum to a separate file
- added javadocs and comments
- adjust format, added missing endline characters
rev3:
* Addressing multiple review comments:
- replace Vector by KNNVector in class names and variables
- fixed method names in Serializer interface
- replace number of bytes in float from number to constant
rev4:
* Moving new classes under index.codec.util
rev5:
* Addressing multiple review comments:
- rework factory method getSerializerByStreamContent
- added test case for stream of unsupported content
- removed exceptions from Serializer interface method's signatures, changed it to unchecked runtime exception
- simplify license header in new classes

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
martin-gaievski added a commit to martin-gaievski/k-NN that referenced this pull request Mar 7, 2022
…lection of floats (opensearch-project#253)

* Changing serialization for knn vector from single array object to collection of floats

rev2:
* Addressing PR comments:
- added getDefaultSerializer to Factory
- moved SerializationMode enum to a separate file
- added javadocs and comments
- adjust format, added missing endline characters
rev3:
* Addressing multiple review comments:
- replace Vector by KNNVector in class names and variables
- fixed method names in Serializer interface
- replace number of bytes in float from number to constant
rev4:
* Moving new classes under index.codec.util
rev5:
* Addressing multiple review comments:
- rework factory method getSerializerByStreamContent
- added test case for stream of unsupported content
- removed exceptions from Serializer interface method's signatures, changed it to unchecked runtime exception
- simplify license header in new classes

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
martin-gaievski added a commit to martin-gaievski/k-NN that referenced this pull request Mar 30, 2022
…lection of floats (opensearch-project#253)

* Changing serialization for knn vector from single array object to collection of floats

rev2:
* Addressing PR comments:
- added getDefaultSerializer to Factory
- moved SerializationMode enum to a separate file
- added javadocs and comments
- adjust format, added missing endline characters
rev3:
* Addressing multiple review comments:
- replace Vector by KNNVector in class names and variables
- fixed method names in Serializer interface
- replace number of bytes in float from number to constant
rev4:
* Moving new classes under index.codec.util
rev5:
* Addressing multiple review comments:
- rework factory method getSerializerByStreamContent
- added test case for stream of unsupported content
- removed exceptions from Serializer interface method's signatures, changed it to unchecked runtime exception
- simplify license header in new classes

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancements Increases software capabilities beyond original client specifications v1.3.0 Issues and PRs related to version 1.3.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

knn_vector codec minor overhead
5 participants