Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add parsing logic for neural query #15

Merged
merged 6 commits into from
Oct 11, 2022

Conversation

jmazanec15
Copy link
Member

Description

Adds new query builder to plugin: NeuralQueryBuilder. Adds logic to parse the user provided parameters for the NeuralQueryBuilder as well as a few other smaller methods. Query text to dense vector logic will be added in a future PR

The query has the following structure:

{
  "neural": {
    "<VECTOR_FIELD>": {
    "query_text": "string",
    "model_id": "string",
    "k": int,
    "boost": float, (optional - comes with default query builder)
    "_name": "string", (optional - comes with default query builder)
  }
}

Adds unit tests as well as test utility class

Issues Resolved

#14 (partially resolved)

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@jmazanec15 jmazanec15 requested a review from a team October 6, 2022 23:24
Comment on lines 57 to 60
public NeuralQueryBuilder fieldName(String fieldName) {
this.fieldName = fieldName;
return this;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we see if we can use @builder of lombok.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this, but I didnt see how it would work, given we are extending a builder class. Instead, I followed how AbstractQueryBuilder handled it for the boost field: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/query/AbstractQueryBuilder.java#L172

Comment on lines 40 to 53
static final ParseField QUERY_TEXT_FIELD = new ParseField("query_text");

static final ParseField MODEL_ID_FIELD = new ParseField("model_id");

static final ParseField K_FIELD = new ParseField("k");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why these are not private? and is package private?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be we can use @VisibleForTesting for clarification.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it was for testing purposes. I will add annotation.

Comment on lines 57 to 61
public NeuralQueryBuilder fieldName(String fieldName) {
this.fieldName = fieldName;
return this;
}

/**
* Set the queryText that will be translated into the dense query vector used for k-NN search.
*
* @param queryText Text of a query that should be translated to a dense vector
* @return this
*/
public NeuralQueryBuilder queryText(String queryText) {
this.queryText = queryText;
return this;
}

/**
* Set the modelId that should produce the dense query vector
*
* @param modelId ID of model to produce query vector
* @return this
*/
public NeuralQueryBuilder modelId(String modelId) {
this.modelId = modelId;
return this;
}

/**
* Set the number of neighbors that should be retrieved during k-NN search
*
* @param k number of neighbors to be retrieved in k-NN query
* @return this
*/
public NeuralQueryBuilder k(int k) {
this.k = k;
return this;
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can use @Setter for these. any reason for not using setters?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I wasn't sure how to use lombok for this, but it looks like it is possible with:

@Setter
@Accessors(chain=true,fluent=true)

Comment on lines +218 to +187
@Override
protected int doHashCode() {
return new HashCodeBuilder().append(fieldName).append(queryText).append(modelId).append(k).toHashCode();
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason for not using @equals and hascode?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reason for this is that we are not implementing "hashCode" and "equals" method, but instead "doHashCode" and "doEquals".

private static final float BOOST = 1.8f;
private static final String QUERY_NAME = "queryName";

public void testFromXContent_valid_withDefaults() throws IOException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see if we want to use @SneakyThrows

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update

assertEquals(K, neuralQueryBuilder.getK());
}

public void testFromXContent_valid_withOptionals() throws IOException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see if we want to rename the test function to something like testFromXConent_whenXXXX_thenYYYYY for all the functions. This can help in just getting to know what is being tested with what just be reading the name of the fuction.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will update

out.writeString(this.fieldName);
out.writeString(this.queryText);
out.writeString(this.modelId);
out.writeInt(this.k);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see if want to use writeVInt() instead of Int. VInt uses less bytes than int.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update

int k1 = 1;
int k2 = 2;

NeuralQueryBuilder neuralQueryBuilder1 = new NeuralQueryBuilder().fieldName(fieldName1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than using 1,2,3,4,5 can we rename them in fashion with which just by variable name we can tell what is missing/what default values we have used

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update

Comment on lines 40 to 53
static final ParseField QUERY_TEXT_FIELD = new ParseField("query_text");

static final ParseField MODEL_ID_FIELD = new ParseField("model_id");

static final ParseField K_FIELD = new ParseField("k");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be we can use @VisibleForTesting for clarification.

requireValue(neuralQueryBuilder.getQueryText(), "Query text must be provided for neural query");
requireValue(neuralQueryBuilder.getFieldName(), "Field name must be provided for neural query");
requireValue(neuralQueryBuilder.getModelId(), "Model ID must be provided for neural query");
requireValue(neuralQueryBuilder.getK(), "K must be provided for neural query");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't k have a default value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about this. Yes, I think we can make it 10 by default.

}

@Override
protected Query doToQuery(QueryShardContext queryShardContext) throws IOException {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the purpose of this function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"toQuery" will build a Query object using the queryShardContext and the queryBuilder. "doToQuery" is used internally to produce the query in AbstractQueryBuilder.

Its null right now because we will be implementing in future PR. Ill add a comment.

@ylwu-amzn
Copy link

{
  "neural": {
    "<VECTOR_FIELD>": {
    "query_text": "string",
    "model_id": "string",
    "k": int,
    "boost": float, (optional - comes with default query builder)
    "_name": "string", (optional - comes with default query builder)
  }
}

From this query example, user could run neural search on multiple VECTOR_FIELDs , Is that true?

@jmazanec15
Copy link
Member Author

{
  "neural": {
    "<VECTOR_FIELD>": {
    "query_text": "string",
    "model_id": "string",
    "k": int,
    "boost": float, (optional - comes with default query builder)
    "_name": "string", (optional - comes with default query builder)
  }
}

From this query example, user could run neural search on multiple VECTOR_FIELDs , Is that true?

No, user would be able to run one neural search on one field per query.

@ylwu-amzn
Copy link

ylwu-amzn commented Oct 11, 2022

{
  "neural": {
    "<VECTOR_FIELD>": {
    "query_text": "string",
    "model_id": "string",
    "k": int,
    "boost": float, (optional - comes with default query builder)
    "_name": "string", (optional - comes with default query builder)
  }
}

From this query example, user could run neural search on multiple VECTOR_FIELDs , Is that true?

No, user would be able to run one neural search on one field per query.

If we can only support one filed per neural search, how about change to this format to avoid confusion

{
  "neural": {
    "vector_field": "string",
    "query_text": "string",
    "model_id": "string",
    "k": int,
    "boost": float, (optional - comes with default query builder)
    "_name": "string", (optional - comes with default query builder)
  }
}

User may think this could be a correct neural query

{
  "neural": {
    
      "<VECTOR_FIELD1>": {
          "query_text": "string",
          "model_id": "string",
          "k": int,
          "boost": float, (optional - comes with default query builder)
          "_name": "string", (optional - comes with default query builder)
      },
      
      "<VECTOR_FIELD2>": {
        "query_text": "string",
        "model_id": "string",
        "k": int,
        "boost": float, (optional - comes with default query builder)
        "_name": "string", (optional - comes with default query builder)
      }
      
  }
}

@jmazanec15
Copy link
Member Author

@ylwu-amzn Yes, I see what you mean. The main reason that I opted to make the field name the key was to follow OpenSearch convention.

For instance, for the text matching query types, the field is the key. Similarly, this is how it is done for k-NN.

Adds new query builder to plugin: NeuralQueryBuilder. Adds logic to parse the
user provided parameters for the NeuralQueryBuilder as well as a few
other smaller methods. Query text to dense vector logic will be added in
a future PR

Adds unit tests.

Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
@jmazanec15 jmazanec15 force-pushed the issue-14 branch 2 times, most recently from 1721893 to 74fb5a1 Compare October 11, 2022 17:09
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
@@ -119,6 +119,7 @@ allprojects {
}

repositories {
mavenLocal()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Nit-pick] should we add this? I think we can remove this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added it explicitly because it is required for developing against ml-commons unpublished changes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it is required for local development, but ideally we should not put this, as it can lead to errors with different people can different versions.

These should be added as a part of Development.md file that we need to enable mavenLocal(). But good for now.

@ylwu-amzn
Copy link

ylwu-amzn commented Oct 11, 2022

@ylwu-amzn Yes, I see what you mean. The main reason that I opted to make the field name the key was to follow OpenSearch convention.

For instance, for the text matching query types, the field is the key. Similarly, this is how it is done for k-NN.

Thanks, that makes sense. Make sure the error message is readable. Tell user we only support one query field. like this one

POST /_search
{
  "query": {
    "match_phrase_prefix": {
      "message": {
        "query": "quick brown f"
      },
      "k1": {
        "query": "test"
      }
    }
  }
}
{
  "error" : {
    "root_cause" : [
      {
        "type" : "parsing_exception",
        "reason" : "[match_phrase_prefix] query doesn't support multiple fields, found [message] and [k1]",
        "line" : 7,
        "col" : 13
      }
    ],
    "type" : "parsing_exception",
    "reason" : "[match_phrase_prefix] query doesn't support multiple fields, found [message] and [k1]",
    "line" : 7,
    "col" : 13
  },
  "status" : 400
}

Signed-off-by: John Mazanec <jmazane@amazon.com>
@jmazanec15
Copy link
Member Author

@ylwu-amzn Makes sense. Just updated.

Signed-off-by: John Mazanec <jmazane@amazon.com>
@jmazanec15 jmazanec15 merged commit 393141d into opensearch-project:main Oct 11, 2022
@jmazanec15 jmazanec15 added the Features Introduces a new unit of functionality that satisfies a requirement label Oct 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Features Introduces a new unit of functionality that satisfies a requirement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants