
[FEATURE] Creating Vector data structures Greedily #1942

Closed
navneet1v opened this issue Aug 8, 2024 · 5 comments
Labels: enhancement, Features, indexing-improvements, Roadmap:Vector Database/GenAI, v2.18.0

Comments


navneet1v commented Aug 8, 2024

Description

As of version 2.13 of OpenSearch, whenever a segment is created we create the data structures required for vector search (i.e., graphs for the HNSW algorithm, buckets for the IVF algorithm, etc.). When segments get merged, unlike inverted file indices and BKD trees, these data structures are not merged; rather, we create them from scratch (true for native engines, and for Lucene if deletes are present). Example: if we merge 2 segments with 1K documents each, the graphs created in both segments are discarded and a new graph over 2K documents is built from scratch. This wastes compute (building vector search data structures is very expensive) and slows down build time for vector indices.

Hence, the idea is that we should build these data structures greedily.

  1. Delaying vector data structure creation:
    1. For bulk ingestion, the proposal is to completely stop creating vector data structures during segment creation and segment merges. We should only create the vector data structures once all indexing and merges (including force merge, if possible) are completed. This ensures the vector data structures are created only once. Refer to Appendix A of GH issue [META] [Build-Time] Improving Build time for Vector Indices #1599 for how we can do this.
    2. If a user is constantly indexing and searching, not building graphs will lead to high latency, and the user might not have a chance to run a force merge to trigger graph creation. Hence, we should build the capability for OpenSearch to decide, based on a certain threshold, which segments need a graph. This will improve ingestion speed. Ref this community request: [FEATURE] Separate refresh interval for KNN segment merge #1577

Having the capability to disable graph creation entirely is an extreme option, intended for cases that need high-speed indexing, index rebuilds, etc. On top of this feature, the next capability to be added is threshold-based graph builds. This will ensure the greedy graph-build capability serves more general use cases, with search still possible even when a graph is not present.
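The last sentence relies on a fallback: when no graph exists for a segment, search can still be served by an exact brute-force scan over the stored vectors. A minimal sketch of that fallback (illustrative only; the function name and signatures are hypothetical, not the plugin's API):

```python
import heapq
import math

def exact_knn(query, vectors, k, space_type="l2"):
    """Brute-force exact k-NN over raw stored vectors.

    Stands in for the fallback used when no HNSW graph / IVF structure
    was built for a segment (e.g. the build was skipped by a threshold).
    Returns the ids of the k best-scoring vectors.
    """
    def score(v):
        if space_type == "innerproduct":
            # Higher inner product is better
            return sum(q * x for q, x in zip(query, v))
        # Negative L2 distance, so "largest score" means "nearest"
        return -math.sqrt(sum((q - x) ** 2 for q, x in zip(query, v)))

    return heapq.nlargest(k, range(len(vectors)), key=lambda i: score(vectors[i]))

docs = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
print(exact_knn([0.9, 1.1], docs, k=2))  # → [1, 0]
```

This is O(n) per query, which is why, as the experiments later in this thread show, exact search is only acceptable below a certain segment size.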

  1. Incremental graph creation: For native engines (NMSLIB and Faiss), the vector data structures are built during segment creation (i.e., OpenSearch refresh), which leads to spikes in CPU and sometimes throttling due to high CPU (we saw this in various benchmarks). At other times, when segments are not being created, CPU utilization stays very low. This leads to uneven CPU usage (basically a sawtooth curve of CPU utilization), and users cannot push more documents to OpenSearch at steady state. Specifically for streaming-ingestion use cases, we can create graphs incrementally during ingestion (the Lucene engine already implements this). This spreads the CPU utilization peak over a period of time, and when the OpenSearch refresh happens the graph is already created and available for search, reducing the search-after-indexing time.
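The incremental idea above can be sketched as a stub: pay the per-vector insertion cost at ingestion time, so that refresh has nothing left to build. The class and method names are hypothetical illustrations, not engine code:

```python
class IncrementalGraphBuilder:
    """Illustrative stub of incremental graph construction: the build
    cost is amortized per document instead of spiking at refresh."""

    def __init__(self):
        self.inserted = []

    def add(self, doc_id, vector):
        # In a real engine this would insert one node into the HNSW
        # graph; here we only record it to show where the cost is paid.
        self.inserted.append((doc_id, vector))

    def on_refresh(self):
        # By refresh time the graph already exists, so the segment is
        # immediately searchable: no CPU spike, no post-refresh build.
        return len(self.inserted)

builder = IncrementalGraphBuilder()
for doc_id, vec in enumerate([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]):
    builder.add(doc_id, vec)   # cost paid per document, not per refresh
print(builder.on_refresh())    # → 3 vectors already in the graph
```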

References:

  1. Meta Issue: [META] [Build-Time] Improving Build time for Vector Indices #1599
  2. Community request: [FEATURE] Separate refresh interval for KNN segment merge #1577
navneet1v commented:

With the changes done as part of #1938 and #1853, we will have the groundwork for incremental graph creation. Once those issues are resolved, we can start working on this feature.

navneet1v added the indexing-improvements and Features labels on Aug 8, 2024
navneet1v commented:

So for 1.i), the process will be:

  1. The user disables graph creation for the index and indexes all the vectors.
  2. The user re-enables graph creation for the index.
  3. The user runs a force merge to 1 segment to ensure all k-NN data structures are created.
  4. Perform search.

The idea is that, with #2007, there will be a speed-up in 2, and overall there will be a reduction in build time.
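The workflow above could be driven through the REST API roughly as follows. This sketch only builds the request tuples, sending nothing to a cluster; the setting name follows the one proposed later in this thread, and the field name in the final query is a made-up example:

```python
SETTING = "index.knn.advanced.approximate_threshold"

def workflow(index):
    """Build the (method, path, body) sequence for the
    disable -> ingest -> enable -> force-merge -> search flow."""
    return [
        # 1. Disable graph creation, then bulk-index the vectors.
        ("PUT", f"/{index}/_settings", {SETTING: -1}),
        # 2. Re-enable graph creation (0 = always build).
        ("PUT", f"/{index}/_settings", {SETTING: 0}),
        # 3. Force merge to one segment so all k-NN structures are built.
        ("POST", f"/{index}/_forcemerge?max_num_segments=1", None),
        # 4. Search as usual.
        ("GET", f"/{index}/_search",
         {"query": {"knn": {"my_vector": {"vector": [0.1, 0.2], "k": 10}}}}),
    ]

steps = workflow("my-test-knn-index")
print(steps[0])
```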


VijayanB commented:

In this comment we will look at some options for exposing a setting, “approximate_threshold”, which controls when to build vector data structures during flush and merge operations. If no value is provided during index creation, we will use a default value to decide whether or not to build vector data structures during flush and merge operations.

The possible values for this parameter are:

  • -1 => the user wants to disable building vector data structures during indexing
  • 0 => the user wants to always build vector data structures (this is the behavior up to 2.17)
  • 1 .. INTEGER.MAX_VALUE - 2 => build vector data structures only when a segment contains more than this number of documents
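The value semantics above can be captured in a small decision helper. This is a sketch, not plugin code; the function name is hypothetical:

```python
def should_build_structures(doc_count, threshold):
    """Decide whether to build vector data structures for a segment
    during flush/merge, given the approximate_threshold value:

    -1 -> never build during indexing
     0 -> always build (behavior up to 2.17)
     n -> build only when the segment holds more than n documents
    """
    if threshold < 0:
        return False
    if threshold == 0:
        return True
    return doc_count > threshold

print(should_build_structures(10_000, -1))      # False: builds disabled
print(should_build_structures(10_000, 0))       # True: always build
print(should_build_structures(10_000, 15_000))  # False: below threshold
print(should_build_structures(20_000, 15_000))  # True: above threshold
```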

Proposal

1. Keep “approximate_threshold” as a dynamic index setting

In this approach, users can set this value per index. They can use the “Update Index Settings” API to update this value on a live index at any time.

Example:

PUT my-test-knn-index/_settings
{
      "index.knn.advanced.approximate_threshold":-1
}

Pros:

  1. It is easy to update the setting by passing only the property and the desired value.
  2. The setting is shared by all vector fields of the same index; users don’t have to update individual fields explicitly.
  3. This is aligned with the existing property “index.knn.advanced.filtered_exact_search_threshold".

Cons

  1. Different vector fields of the same index cannot have different values.

2. Keep “approximate_threshold” as a mapping parameter

In this approach, users set this value as a mapping parameter on the knn_vector field type. They can use the Update Mapping API to update this value.

Example:

PUT my-test-knn-index/_mappings
{
  "properties": {
    "my_vector1": {
      "type": "knn_vector",
      "space_type": "l2",
      "dimension": 2,
      "approximate_threshold": -1,
      "method": {
        "name": "hnsw",
        "engine": "nmslib",
        "parameters": {
          "ef_construction": 128,
          "m": 24
        }
      }
    }
  }
}

Pros

  1. Different vector fields of the same index can have different values. This gives users more control, since they can tune the parameter individually at the field level.
  2. This works well if the user sets it once during index creation and forgets it.

Cons

  1. This doesn’t provide a good user experience if users want to update the value after index creation. We would like to provide the same experience as an index setting, for example:
PUT my-test-knn-index/_mappings
{
  "properties": {
    "my_vector1": {
      "type": "knn_vector",
      "approximate_threshold": -1
    }
  }
}

But in order to update a mapping parameter, users have to pass the other, non-updatable parameters as part of the update-mapping API request. In the above example, to update “approximate_threshold", we would first have to use the GET Mapping API to fetch the non-updatable properties (type, space_type, dimension, method, etc.) and then pass their exact values in the update-mapping request alongside the changed ”approximate_threshold".
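The read-modify-write dance described above could look like the following sketch: fetch the current field mapping, change only approximate_threshold, and re-send the full field definition. Field and parameter names follow the example mapping earlier in this comment; the helper itself is hypothetical:

```python
import copy

def updated_mapping_body(current_field_mapping, field, threshold):
    """Re-send the entire field definition (type, dimension, method, ...)
    with only approximate_threshold changed, since the update-mapping API
    requires the non-updatable parameters to be repeated verbatim."""
    field_def = copy.deepcopy(current_field_mapping)
    field_def["approximate_threshold"] = threshold
    return {"properties": {field: field_def}}

# Result of a prior GET <index>/_mapping call for this field (abridged)
current = {
    "type": "knn_vector",
    "space_type": "l2",
    "dimension": 2,
    "approximate_threshold": -1,
    "method": {"name": "hnsw", "engine": "nmslib",
               "parameters": {"ef_construction": 128, "m": 24}},
}
body = updated_mapping_body(current, "my_vector1", 0)
print(body["properties"]["my_vector1"]["approximate_threshold"])  # → 0
```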

3. Combine 1 & 2

In this approach we provide both the index setting and the mapping parameter: the index setting acts as a global default for every field, and users can override it for any specific field via the mapping parameter.
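Under this combined approach, per-field resolution could work like the following sketch (hypothetical helper; a field-level mapping value wins over the index-level setting):

```python
def effective_threshold(index_setting, field_override=None):
    """Resolve the approximate_threshold for one field: the field-level
    mapping parameter overrides the index setting; otherwise fall back
    to the index-wide default."""
    return field_override if field_override is not None else index_setting

print(effective_threshold(15_000))        # → 15000, index default applies
print(effective_threshold(15_000, -1))    # → -1, field opts out of builds
```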

Pros

  1. It is easy to update the index setting by passing only the property and the desired value.
  2. The index setting is shared by all vector fields of the same index; users don’t have to update individual fields explicitly.
  3. This is partially aligned with the existing property “index.knn.advanced.filtered_exact_search_threshold".
  4. Different vector fields of the same index can have different values by overriding at the field level. This gives users more control, since they can tune the parameter individually per field.

Cons

  1. This doesn’t solve the problem of having to pass non-updatable fields in the update-mapping API request. However, that only arises in the specific case where users override the value for certain fields. This significantly reduces the friction for most use cases, where the user either has an index with one vector field or wants the same setting for every vector field.

Finalized approach

Since there is no workaround that lets us use a mapping parameter and still provide a good update experience, we will first introduce this as an index setting (Proposal 1), and, if we receive substantial feedback from the community asking for a mapping parameter, we can introduce one in a later iteration (effectively Proposal 3).

VijayanB commented:

Default value for Approximate threshold index setting

Objective

In this document we look at experiments we conducted to compare the performance of a k-NN index with and without vector data structures. The metrics from these experiments help us find the crossover point below which vector search performance is not affected if the k-NN plugin uses exact search instead of approximate search. The motivation for identifying the right default value is mainly to reduce indexing time and/or CPU utilization during indexing, since we found that graph building directly drives high CPU utilization and total indexing time.

Experiment Configuration

If a setting is not specified below, the default from the latest 2.x release is used.

Cluster Configuration

| Setting | Value |
|---|---|
| No. of data nodes | 3 |
| No. of cores | 8 |
| Memory (GB) | 64 |
| No. of cluster manager nodes | 1 |
| Storage | 50 GB |

Index Configuration

| Setting | Value |
|---|---|
| No. of primaries | 3 |
| No. of replicas | 0 |
| Engine | Faiss |
| Space type | inner product |
| Dimension | 768 |

Workload Configuration

| Setting | Value |
|---|---|
| Bulk size | 100 |
| Total indexing clients | 10 |
| Search clients | 1 or 10 |
| K | 100 |
| Total search queries | 1000 |

We ran two types of experiments to compare performance. In the first, we do not create the vector data structures during indexing and perform exact search in the k-NN search API; this is simulated by setting “approximate_threshold” to -1.

In the second, we create the required data structures during indexing and use them to perform approximate search in the k-NN search API (the default behavior); this is simulated by setting “approximate_threshold” to 0.

Experiment 1 (index.knn.approximate_threshold = -1 ):

Indexing

Figure 1: P50 service time comparison
Figure 2: P90 service time comparison
Figure 3: Max CPU utilization
Figure 4: Max throughput (docs/sec)

Observation

As expected, if we don’t build the graph as part of the indexing process, indexing latency improves compared to the default setup, where we build the index while flushing segments.

P90 Service Time

| Total num vectors | P90 latency with graph (ms) | P90 latency without graph (ms) | Improvement (%) |
|---|---|---|---|
| 1K | 71 | 44 | 38.03 |
| 2K | 63 | 42 | 33.33 |
| 5K | 47 | 40 | 14.89 |
| 10K | 55 | 43 | 21.82 |
| 15K | 63 | 43 | 31.75 |
| 25K | 48 | 40 | 16.67 |
| 50K | 46 | 43.6 | 5.22 |
| 100K | 63 | 42.6 | 32.38 |

CPU Utilization

| Total num vectors | With graph (%) | Without graph (%) | Improvement (%) |
|---|---|---|---|
| 1K | 10 | 1 | 90 |
| 2K | 10 | 1 | 90 |
| 5K | 9 | 6 | 33.33 |
| 10K | 9 | 6 | 33.33 |
| 15K | 13 | 6 | 53.85 |
| 25K | 13 | 7 | 46.15 |
| 50K | 26 | 20 | 23.08 |
| 100K | 27 | 22 | 18.52 |

Search

For search we conducted two sets of experiments, with 1 client and with 10 parallel clients, to capture the trend as more clients are used. Also, before running search, the indices were already loaded into memory using the warmup API.

Figure 5: P50 service time (client = 1)
Figure 6: P90 service time (client = 1)
Figure 7: Max CPU utilization (client = 1)
Figure 8: Max throughput (client = 1)
Figure 9: P90 service time (client = 10)
Figure 10: Max CPU utilization (client = 10)

Observation

As expected, approximate search performs better as the number of vectors increases. In the above experiment, once the number of vectors grows beyond 50K, exact search performs poorly compared to approximate search. In the metrics below, a positive value represents an improvement and a negative value a degradation relative to the default behavior.

P90 Service Time ( client = 1 )

| Total num vectors | P90 latency with graph (ms) | P90 latency without graph (ms) | Difference (ms) |
|---|---|---|---|
| 1K | 4.5 | 4 | 0.5 |
| 2K | 5 | 4.8 | 0.2 |
| 5K | 5.5 | 5.4 | 0.1 |
| 10K | 5.8 | 5.7 | 0.1 |
| 15K | 5.9 | 5.8 | 0.1 |
| 25K | 8.8 | 7.8 | 1 |
| 50K | 10.8 | 11.1 | -0.3 |
| 100K | 13.2 | 22 | -8.8 |

CPU Utilization ( client = 1 )

| Total num vectors | With graph (%) | Without graph (%) | Difference |
|---|---|---|---|
| 1K | 5 | 5 | 0 |
| 2K | 11 | 5 | 6 |
| 5K | 12 | 7 | 5 |
| 10K | 8 | 7 | 1 |
| 15K | 10 | 7 | 3 |
| 25K | 20 | 16 | 4 |
| 50K | 30 | 25 | 5 |
| 100K | 30 | 37.88 | -7.88 |

Conclusion

Based on the indexing and search comparisons, we decided to go with 15K as the default threshold.

navneet1v commented:

Closing this GH issue, as the feature will launch with version 2.18 of OpenSearch.
