Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removing the vec file extension from INDEX_STORE_HYBRID_NIO_EXTENSIONS, to ensure the no performance degradation for vector search via Lucene Engine. #9528

Merged
merged 1 commit into from
Aug 24, 2023

Conversation

navneet1v
Copy link
Contributor

@navneet1v navneet1v commented Aug 23, 2023

Description

Removing the vec file extension from INDEX_STORE_HYBRID_NIO_EXTENSIONS, to ensure the no performance degradation for vector search via Lucene Engine.

This PR: https://github.com/opensearch-project/OpenSearch/pull/8508/files added .vec file extension in INDEX_STORE_HYBRID_NIO_EXTENSIONS and deprecated the setting : INDEX_STORE_HYBRID_MMAP_EXTENSIONS. Which made .vec not to be Mmapped. This resulted in below problems:

  1. K-NN plugin was adding the required files like .vec and .vex to INDEX_STORE_HYBRID_MMAP_EXTENSIONS to make sure that files are Mmapped, for best performance. But now as the setting is deprecated the Unit test started to fail and K-NN plugin need to hack around to make sure that unit tests are succeeding. https://github.com/opensearch-project/k-NN/blob/3df8308bbe7ce6559b559f32920c31a6b0ae3f20/src/test/java/org/opensearch/knn/index/KNNSettingsTests.java#L67-L70
  2. As the setting is going to be deprecated there is no point in using that setting. Hence to make sure that K-NN plugin can remove this piece of code I am removing .vec file from INDEX_STORE_HYBRID_NIO_EXTENSIONS.

This change will ensure that no custom logic is present in k-NN plugin going forward and if new list get created going forward k-NN plugin doesn't require any change.

Related Issues

NA

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@navneet1v navneet1v added backport PRs or issues specific to backporting features or enhancments backport 2.x Backport to 2.x branch v2.10.0 and removed backport PRs or issues specific to backporting features or enhancments labels Aug 23, 2023
@navneet1v navneet1v mentioned this pull request Aug 24, 2023
6 tasks
@navneet1v
Copy link
Contributor Author

Do we know what kind of impact this will have on the Lucene vector files?

@msfroh vec and vem lucene files only. These files are used by Lucene to store information related to vectors.

we probably should have always had .vec in the MMap list

Yes we should had it always. The way it was happening earlier was K-NN plugin was adding them(vec and vex files) from 2.5 version of OpenSearch to MMap Extension list. After the release of OpenSearch 2.4 we found that k-NN search for Lucene engine became slower. Please refer these graphs for lucene: opensearch-project/k-NN#576 (comment) . The same was also reported by a customer opensearch-project/k-NN#637 (comment) (detailed analysis is there) after they upgrade to OpenSearch 2.4 from OpenSearch 2.3 . Some workaround was also provided to customers but we fixed the issue in 2.5 release of OpenSearch.

This is the reason I am pushing towards a better solution, which is removing .vec file from NIO file extension list and completely remove the logic from K-NN plugin. If we don't do that K-NN plugin needs to keep on overriding the files in future which is not a scalable solution as these settings can be deprecated and new settings can come in future.

I am adding @martin-gaievski from k-NN plugin maintainers who did the deep-dive if he wants to add anything else.

@navneet1v
Copy link
Contributor Author

For reference I am adding the quick benchmarks that I did after removing the k-NN logic which add these files in the MMAP.

{
  "metadata": {
    "test_name": "Lucene",
    "test_id": "Lucene Index workflow",
    "date": "08/24/2023 16:02:23",
    "python_version": "3.8.16 (default, Mar  2 2023, 03:21:46) \n[GCC 11.2.0]",
    "os_version": "Linux-5.4.250-173.369.amzn2int.x86_64-x86_64-with-glibc2.17",
    "processor": "x86_64, 16 cores",
    "memory": "21340106752 (used) / 43812683776 (available) / 65887473664 (total)"
  },
  "results": {
    "test_took": 265919.0,
    "query_took_total": 265919.0,
    "query_took_p50": 27.0,
    "query_took_p90": 33.0,
    "query_took_p99": 36.0,
    "query_took_p99.9": 38.5,
    "query_took_p100": 53.4,
    "query_client_time_total": 292433.4,
    "query_client_time_p50": 30.0,
    "query_client_time_p90": 35.0,
    "query_client_time_p99": 38.9,
    "query_client_time_p99.9": 54.9,
    "query_client_time_p100": 301.0,
    "query_memory_kb_total": 0.0,
    "query_recall@K_total": 0.9990970000000001,
    "query_recall@1_total": 1.0
  }
}

if vec and vex files are mmaped.

{
  "metadata": {
    "test_name": "Lucene",
    "test_id": "Lucene Index workflow",
    "date": "08/23/2023 20:47:04",
    "python_version": "3.8.16 (default, Mar  2 2023, 03:21:46) \n[GCC 11.2.0]",
    "os_version": "Linux-5.4.250-173.369.amzn2int.x86_64-x86_64-with-glibc2.17",
    "processor": "x86_64, 16 cores",
    "memory": "20670394368 (used) / 44482461696 (available) / 65887473664 (total)"
  },
  "results": {
    "test_took": 68728.0,
    "query_took_total": 68728.0,
    "query_took_p50": 7.0,
    "query_took_p90": 8.0,
    "query_took_p99": 9.0,
    "query_took_p99.9": 10.0,
    "query_took_p100": 97.0,
    "query_client_time_total": 92300.0,
    "query_client_time_p50": 9.0,
    "query_client_time_p90": 10.0,
    "query_client_time_p99": 11.0,
    "query_client_time_p99.9": 47.0,
    "query_client_time_p100": 277.0,
    "query_memory_kb_total": 0.0,
    "query_recall@K_total": 0.999083,
    "query_recall@1_total": 1.0
  }
}

@msfroh
Copy link
Collaborator

msfroh commented Aug 24, 2023

Right -- my question was less around the k-nn plugin vector files and more about the Lucene vector files (where both implementations decided to use .vec as an extension).

I'm pretty sure the Lucene vector files should be MMapped -- do we have numbers to confirm?

@navneet1v
Copy link
Contributor Author

Right -- my question was less around the k-nn plugin vector files and more about the Lucene vector files (where both implementations decided to use .vec as an extension).

I'm pretty sure the Lucene vector files should be MMapped -- do we have numbers to confirm?

@msfroh what I am saying is K-NN plugin uses the Lucene Vector field, when customer try to use Lucene engine for vector search. So, vec and vex are Lucene files only. They are not k-NN plugin specific.

This is the quick benchmarks I have done: #9528 (comment)

@msfroh
Copy link
Collaborator

msfroh commented Aug 24, 2023

what I am saying is K-NN plugin uses the Lucene Vector field, when customer try to use Lucene engine for vector search. So, vec and vex are Lucene files only. They are not k-NN plugin specific.

Okay, thanks! Now I (almost) get it. I'm a little slow. (So what's up with the .vex versus .vem difference?)

@navneet1v
Copy link
Contributor Author

navneet1v commented Aug 24, 2023

what I am saying is K-NN plugin uses the Lucene Vector field, when customer try to use Lucene engine for vector search. So, vec and vex are Lucene files only. They are not k-NN plugin specific.

Okay, thanks! Now I (almost) get it. I'm a little slow. (So what's up with the .vex versus .vem difference?)

@msfroh

Here is the official documentation: https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.html

So .vex is the main file that has the HNSW graph present. and during the vector search this is an important file
.vem has the metadata info.
.vec has all the vectors.

As per the last deep-dive done by @martin-gaievski vec and vex needs to be mmaped for performance. .vem didn't have impact on the latency. Hence vec and vex needs to be MMapped.

@martin-gaievski please add anything if I have missed.

@martin-gaievski
Copy link
Member

what I am saying is K-NN plugin uses the Lucene Vector field, when customer try to use Lucene engine for vector search. So, vec and vex are Lucene files only. They are not k-NN plugin specific.

Okay, thanks! Now I (almost) get it. I'm a little slow. (So what's up with the .vex versus .vem difference?)

@msfroh

Here is the official documentation: https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.html

So .vex is the main file that has the HNSW graph present. .vem has the metadata info. .vec has all the vectors.

As per the last deep-dive done by @martin-gaievski vec and vex needs to be mmaped for performance. .vem didn't have impact on the latency. Hence vec and vex needs to be MMapped.

@martin-gaievski please add anything if I have missed.

Confirming, as per our testing vec and vex are contributing a lot to performance of both ingestion and search, we didn't see effect of vem beeing mmaped.

@msfroh
Copy link
Collaborator

msfroh commented Aug 24, 2023

Awesome! Thanks for clearing all of that up.

Feel free to @ me as soon as you've fixed the changelog conflict and I'm happy to merge this.

…S, to ensure the no performance degradation for vector search via Lucene Engine.

Signed-off-by: Navneet Verma <navneev@amazon.com>
@navneet1v
Copy link
Contributor Author

@msfroh fixed the conflicts. Please check

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@msfroh msfroh merged commit a4024e7 into opensearch-project:main Aug 24, 2023
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 24, 2023
…S, to ensure the no performance degradation for vector search via Lucene Engine. (#9528)

Signed-off-by: Navneet Verma <navneev@amazon.com>
(cherry picked from commit a4024e7)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
austintlee pushed a commit to austintlee/OpenSearch that referenced this pull request Aug 25, 2023
…S, to ensure the no performance degradation for vector search via Lucene Engine. (opensearch-project#9528)

Signed-off-by: Navneet Verma <navneev@amazon.com>
Gaganjuneja pushed a commit to Gaganjuneja/OpenSearch that referenced this pull request Aug 28, 2023
…S, to ensure the no performance degradation for vector search via Lucene Engine. (opensearch-project#9528)

Signed-off-by: Navneet Verma <navneev@amazon.com>
Gaganjuneja pushed a commit to Gaganjuneja/OpenSearch that referenced this pull request Aug 28, 2023
…S, to ensure the no performance degradation for vector search via Lucene Engine. (opensearch-project#9528)

Signed-off-by: Navneet Verma <navneev@amazon.com>
Gaganjuneja pushed a commit to Gaganjuneja/OpenSearch that referenced this pull request Aug 28, 2023
…S, to ensure the no performance degradation for vector search via Lucene Engine. (opensearch-project#9528)

Signed-off-by: Navneet Verma <navneev@amazon.com>
Signed-off-by: Gagan Juneja <gjjuneja@amazon.com>
Gaganjuneja pushed a commit to Gaganjuneja/OpenSearch that referenced this pull request Aug 28, 2023
…S, to ensure the no performance degradation for vector search via Lucene Engine. (opensearch-project#9528)

Signed-off-by: Navneet Verma <navneev@amazon.com>
Signed-off-by: Gagan Juneja <gjjuneja@amazon.com>
andrross pushed a commit that referenced this pull request Aug 28, 2023
…S, to ensure the no performance degradation for vector search via Lucene Engine. (#9528)

Signed-off-by: Navneet Verma <navneev@amazon.com>
(cherry picked from commit a4024e7)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
kkmr pushed a commit to kkmr/OpenSearch that referenced this pull request Aug 28, 2023
…S, to ensure the no performance degradation for vector search via Lucene Engine. (opensearch-project#9528)

Signed-off-by: Navneet Verma <navneev@amazon.com>
Signed-off-by: Kiran Reddy <kkreddy@amazon.com>
andrross pushed a commit that referenced this pull request Aug 29, 2023
…S, to ensure the no performance degradation for vector search via Lucene Engine. (#9528)

Signed-off-by: Navneet Verma <navneev@amazon.com>
(cherry picked from commit a4024e7)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
andrross pushed a commit that referenced this pull request Sep 1, 2023
…S, to ensure the no performance degradation for vector search via Lucene Engine. (#9528)

Signed-off-by: Navneet Verma <navneev@amazon.com>
(cherry picked from commit a4024e7)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
andrross pushed a commit that referenced this pull request Sep 5, 2023
…S, to ensure the no performance degradation for vector search via Lucene Engine. (#9528)

Signed-off-by: Navneet Verma <navneev@amazon.com>
(cherry picked from commit a4024e7)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
andrross pushed a commit that referenced this pull request Sep 5, 2023
…S, to ensure the no performance degradation for vector search via Lucene Engine. (#9528) (#9540)

(cherry picked from commit a4024e7)

Signed-off-by: Navneet Verma <navneev@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
kaushalmahi12 pushed a commit to kaushalmahi12/OpenSearch that referenced this pull request Sep 12, 2023
…S, to ensure the no performance degradation for vector search via Lucene Engine. (opensearch-project#9528)

Signed-off-by: Navneet Verma <navneev@amazon.com>
Signed-off-by: Kaushal Kumar <ravi.kaushal97@gmail.com>
brusic pushed a commit to brusic/OpenSearch that referenced this pull request Sep 25, 2023
…S, to ensure the no performance degradation for vector search via Lucene Engine. (opensearch-project#9528)

Signed-off-by: Navneet Verma <navneev@amazon.com>
Signed-off-by: Ivan Brusic <ivan.brusic@flocksafety.com>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
…S, to ensure the no performance degradation for vector search via Lucene Engine. (opensearch-project#9528)

Signed-off-by: Navneet Verma <navneev@amazon.com>
Signed-off-by: Shivansh Arora <hishiv@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch v2.10.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants