[BUG] Unable to index document containing knn_vector using NLP ingest pipeline #613

Closed
whittssg opened this issue Apr 18, 2024 · 13 comments
Labels
bug Something isn't working

Comments

@whittssg

whittssg commented Apr 18, 2024

What is the bug?

Indexing a document that contains a knn_vector field fails with an error.

How can one reproduce the bug?

?

What is the expected behavior?

It works

What is your host/environment?

Windows (latest stable release, downloaded and set up today)


I followed this tutorial and everything works perfectly, but when I try to index something via this library it errors:
https://opensearch.org/docs/latest/search-plugins/neural-search-tutorial/

This is the code I am using in C# to create the index:

var response = openSearchClient.Indices.Create("df2_kasf", indexes => indexes
        .Settings(s => s
            .Setting("index.knn", true)
            .DefaultPipeline("nlp-ingest-pipeline"))
        .Map<kas>(x => x
            .AutoMap()
            .Properties(p => p
                .KnnVector(kv => kv
                    .Name(n => n.passage_embedding)
                    .Dimension(768)
                    .Method(m => m
                        .Engine("lucene")
                        .SpaceType("l2")
                        .Name("hnsw"))))));

The index creates without error but indexing gives this error:

{Type: mapper_parsing_exception Reason: "failed to parse field [passage_embedding] of type [knn_vector] in document with id '000017498'. Preview of field's value: 'having a good day'" CausedBy: "Type: illegal_argument_exception Reason: "Vector dimension mismatch. Expected: 768, Given: 0"}

I am filling the passage_embedding with text.

Is there something else I need to specify in the .NET library for vector indexing?

Thanks

@whittssg added the bug and untriaged labels Apr 18, 2024
@Xtansia removed the untriaged label Apr 18, 2024
@Xtansia
Collaborator

Xtansia commented Apr 18, 2024

Hi @whittssg,

Could you please provide a more complete code sample including how you've created the ingest pipeline and how you're attempting to index the document?

@whittssg
Author

whittssg commented Apr 18, 2024

@Xtansia I followed the tutorial on https://opensearch.org/docs/latest/search-plugins/neural-search-tutorial/ exactly.... names included.. Manually indexing works via the dashboard dev tools

@Xtansia changed the title from "[BUG]" to "[BUG] Unable to index document containing knn_vector using NLP ingest pipeline" Apr 18, 2024
@Xtansia
Collaborator

Xtansia commented Apr 18, 2024

@whittssg If manually indexing via dev tools works then please share the code you're using to attempt to index the document in C#.

@whittssg
Author

whittssg commented Apr 18, 2024

I am just filling the field that is specified as the vector field in the index creation... nothing special:

  newka.passage_embedding = "some cool text to test this search";
  if (counter % 5 == 0 && counter != 0)
  {
      Console.WriteLine("KA - Bulk Inserting: " + counter);

      var result = openSearchClient.Bulk(descriptor);
      descriptor = new BulkDescriptor();


  }

and as i mentioned above this is how i created the index in c#:

indexes => indexes
    .Settings(s => s
        .Setting("index.knn", true)
        .DefaultPipeline("nlp-ingest-pipeline"))
    .Map<kas>(x => x
        .AutoMap()
        .Properties(p => p
            .KnnVector(kv => kv
                .Name(n => n.passage_embedding)
                .Dimension(768)
                .Method(m => m
                    .Engine("lucene")
                    .SpaceType("l2")
                    .Name("hnsw")))

@Xtansia
Collaborator

Xtansia commented Apr 18, 2024

In the tutorial, the ingest pipeline is configured to take an input field named text and write its embeddings into a field named passage_embedding: https://opensearch.org/docs/latest/search-plugins/neural-search-tutorial/#step-2a-create-an-ingest-pipeline-for-neural-search
Note the field_map configuration:

PUT /_ingest/pipeline/nlp-ingest-pipeline
{
  "description": "An NLP ingest pipeline",
  "processors": [
    {
      "text_embedding": {
        "model_id": "aVeif4oB5Vm0Tdw8zYO2",
        "field_map": {
          "text": "passage_embedding"
        }
      }
    }
  ]
}

So when you're indexing documents, you need to send your string in the text field, as is done in the tutorial, not directly into the passage_embedding field: https://opensearch.org/docs/latest/search-plugins/neural-search-tutorial/#step-2c-ingest-documents-into-the-index

PUT /my-nlp-index/_doc/1
{
  "text": "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .",
  "id": "4319130149.jpg"
}
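
To make the shape concrete, here is a minimal C# sketch of the document body the pipeline expects. It deliberately uses only System.Text.Json rather than the OpenSearch client, and the `TutorialDoc` class and `Serialize` helper are hypothetical names introduced for illustration. The point is that `text` carries the string, while `passage_embedding` is left unset so the `text_embedding` processor can fill it in.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical document type for illustration; not part of the client library.
public class TutorialDoc
{
    [JsonPropertyName("id")]
    public string Id { get; set; } = "";

    // The ingest pipeline's field_map reads its input from this field.
    [JsonPropertyName("text")]
    public string Text { get; set; } = "";

    // Left null on the client side: the text_embedding processor writes the vector here.
    [JsonPropertyName("passage_embedding")]
    public float[]? PassageEmbedding { get; set; }
}

public static class Program
{
    // Serializes the document, omitting null properties so that
    // passage_embedding is not sent at all.
    public static string Serialize(TutorialDoc doc) =>
        JsonSerializer.Serialize(doc, new JsonSerializerOptions
        {
            DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull
        });

    public static void Main()
    {
        var doc = new TutorialDoc { Id = "4319130149.jpg", Text = "having a good day" };
        Console.WriteLine(Serialize(doc));
    }
}
```

Running this prints a JSON object that contains only `id` and `text`. Putting the string into `passage_embedding` instead is what triggers the "Vector dimension mismatch" error, because that field is mapped as knn_vector.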

@whittssg
Author

oh ok, so do i need to change this (i changed the field to text):
.KnnVector(kv => kv .Name(n => n.text).Dimension(768).Method(m => m.Engine("lucene").SpaceType("l2").Name("hnsw")))

I changed it to text as you can see above, but now I get the same error:

illegal_argument_exception Reason: "Vector dimension mismatch. Expected: 768, Given: 0""}

@whittssg
Author

So it will index if i change that but querying will give this error:

"error": { "root_cause": [ { "type": "query_shard_exception", "reason": "failed to create query: Field 'text' is not knn_vector type.", "index": "df2_kasf", "index_uuid": "XiphTWaXQpWeq_wST-lNwA" } ], "type

I think i am missing something obvious

@whittssg
Author

To match the tutorial, should the index creation read like this:

.KnnVector(kv => kv .Name(n => n.text).Dimension(768).Method(m => m.Engine("lucene").SpaceType("l2").Name("hnsw")))
or should it be
.KnnVector(kv => kv .Name(n => n.passage_embedding).Dimension(768) .ModelId("").Method(m => m.Engine("lucene").SpaceType("l2").Name("hnsw")))

@whittssg
Author

The only way my code indexes is if I comment out this:
KnnVector(kv => kv .Name(n => n.passage_embedding).Dimension(768) .ModelId("").Method(m => m.Engine("lucene").SpaceType("l2").Name("hnsw")))

I can see documents indexing once this is gone.

Maybe I am just missing something stupid. Is there an example somewhere for this type of search?

@whittssg
Author

I think I should have been clearer: I followed everything in that tutorial up to the creation of the index (I went through it all and everything worked perfectly, but now I want to do it via C#). So instead of creating the models etc. in C#, I started at the create-index step in C# (since the models etc. were already created via the PUT requests in the tutorial). Since I am creating the index in C#, do I need to specify the KnnVector in the index creation and the field that is associated with it? I thought it should be this:

.KnnVector(kv => kv.Name(n => n.passage_embedding).Dimension(768).ModelId("").Method(m => m.Engine("lucene").SpaceType("l2").Name("hnsw")))

Then I should just fill the text field and it should work, but nope.

Thanks for your help, by the way.

@Xtansia
Collaborator

Xtansia commented Apr 19, 2024

I haven't actually run this code yet but the below should be roughly what's needed. I'm going to work on creating a full working sample.

Your document class would look something like:

public class NlpDoc
{
    public NlpDoc()
    {
    }

    public NlpDoc(string id, string text)
    {
        Id = id;
        Text = text;
    }

    public string Id { get; set; }
    public string Text { get; set; }
    [PropertyName("passage_embedding")]
    public float[] PassageEmbedding { get; set; }
}

Creating the index would look something like:

var resp = await client.Indices.CreateAsync(
    indexName,
    i => i
        .Settings(s => s
            .Setting("index.knn", true)
            .DefaultPipeline(pipelineName))
        .Map<NlpDoc>(m => m
            .Properties(p => p
                .Text(t => t.Name(d => d.Id))
                .KnnVector(k => k
                    .Name(d => d.PassageEmbedding)
                    .Dimension(768)
                    .Method(km => km
                        .Engine("lucene")
                        .SpaceType("l2")
                        .Name("hnsw")))
                .Text(t => t.Name(d => d.Text)))));

Indexing the documents would look like:

var docs = new[]
{
    new NlpDoc("4319130149.jpg", "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena ."),
    new NlpDoc("1775029934.jpg", "A wild animal races across an uncut field with a minimal amount of trees ."),
    new NlpDoc("2664027527.jpg", "People line the stands which advertise Freemont 's orthopedics , a cowboy rides a light brown bucking bronco ."),
    new NlpDoc("4427058951.jpg", "A man who is riding a wild horse in the rodeo is very near to falling off ."),
    new NlpDoc("2691147709.jpg", "A rodeo cowboy , wearing a cowboy hat , is being thrown off of a wild white horse .")
};
var resp = await client.IndexManyAsync(docs, indexName);
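
Searching is not covered by the snippets above. As a rough sketch, assuming the neural query shape from the tutorial linked earlier in this thread, the query has to name the knn_vector field (passage_embedding), not text. The `BuildNeuralQuery` helper is a hypothetical name, the model ID is the tutorial's placeholder, and only System.Text.Json is used here, so how the body is sent to the cluster is left open.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class NeuralQueryExample
{
    // Hypothetical helper: builds the JSON body of a neural query
    // against the given knn_vector field.
    public static string BuildNeuralQuery(string vectorField, string queryText, string modelId, int k)
    {
        var body = new Dictionary<string, object>
        {
            ["query"] = new Dictionary<string, object>
            {
                ["neural"] = new Dictionary<string, object>
                {
                    // The key is the knn_vector field that the pipeline populated.
                    [vectorField] = new Dictionary<string, object>
                    {
                        ["query_text"] = queryText,
                        ["model_id"] = modelId,
                        ["k"] = k
                    }
                }
            }
        };
        return JsonSerializer.Serialize(body);
    }

    public static void Main()
    {
        // Model ID copied from the tutorial's example; replace it with your own.
        Console.WriteLine(BuildNeuralQuery("passage_embedding", "wild west", "aVeif4oB5Vm0Tdw8zYO2", 5));
    }
}
```

Pointing the same query at text would reproduce the "Field 'text' is not knn_vector type" error quoted earlier, since text is mapped as a plain text field.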

@whittssg
Author

I will give it a go tomorrow; a sample would be awesome. Thanks again.

@Xtansia
Collaborator

Xtansia commented Apr 19, 2024

@whittssg I've created a working sample in #614 and confirmed this isn't a bug in the client. I'm going to close this issue in favour of #372 for adding proper support and typings for the feature.

@Xtansia closed this as not planned Apr 19, 2024