#639: allow metadata fields and score opensearch function #228

Merged
merged 41 commits into from
Mar 21, 2023

@acarbonetto acarbonetto commented Feb 15, 2023

Description

OpenSearch reserved fields (_id, _index, _sort, _score, _maxscore) currently cannot be used in SQL clauses (SELECT, WHERE, ORDER BY) because the parser rejects identifiers that start with an underscore (_).

  • This ticket adds specific identifiers to the language, and opens up support for OpenSearch reserved identifiers.
  • As an aside, identifiers that start with a double underscore (such as __myCoolField) are acceptable as identifiers.

This ticket allows the score(), score_query() and scorequery() functions to wrap relevance-search queries, which forces the _score and _maxscore metadata fields to be returned with values. For some queries, _score returns null unless the score() function is included.

  • The score function also accepts an optional boost argument that boosts the score of the child relevance function.

Example - metadata fields returned:

SELECT calcs.key, str0, _id, _sort, _score, _maxscore FROM calcs WHERE _id="5"
Result:
{ "key04", "OFFICE SUPPLIES", "5", -2, null, null }

Example - Metadata fields not requested are not displayed:

SELECT *, _id FROM bigint WHERE _id="2"
Result (assuming bigint only has one field):
[9223372026854775807, "2"]

Example - relevance search without and with score function

SELECT _id, _index, _score, _maxscore, str0, str1 FROM calcs WHERE QUERY_STRING([\"*\"], `BINDING SUPPLIES`);
result: 
[
    "7",
    "calcs",
    null,
    null,
    "OFFICE SUPPLIES",
    "BINDING SUPPLIES"
]

SELECT _id, _index, _score, _maxscore, str0, str1 FROM calcs WHERE SCORE(QUERY_STRING([\"*\"], `BINDING SUPPLIES`));
result:
[
    "7",
    "calcs",
    2.4849067,
    2,
    "OFFICE SUPPLIES",
    "BINDING SUPPLIES"
]

Example - boost score on the _sql/_explain call:

SELECT _id, _score, _maxscore FROM calcs WHERE SCORE(QUERY_STRING([\"*\"], `BINDING SUPPLIES`, boost=2.5));
result: 
[
    "7",
    6.2122664,
    6
]

SELECT _id, _score, _maxscore FROM calcs WHERE SCOREQUERY(QUERY_STRING([\"*\"], `BINDING SUPPLIES`, boost=2.5), 2.0);
result: 
[
    "7",
    12.424533,
    12
]

SELECT _id, _score, _maxscore FROM calcs WHERE SCORE_QUERY(QUERY_STRING([\"*\"], `BINDING SUPPLIES`), 2.5);
result:
[
    "7",
    6.2122664,
    6
]
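
The numbers in the boost examples are consistent with the boost acting as a plain multiplier on the base score: 2.4849067 × 2.5 ≈ 6.2122664, and a further factor of 2.0 gives ≈ 12.424533. The sketch below only reproduces that arithmetic. `ScoreBoostSketch` and `boostedScore` are hypothetical names, and the multiplicative model is an assumption read off the example results, not a statement about how OpenSearch computes the score internally.

```java
// Illustrative only: models each boost as a multiplier applied to the base score.
final class ScoreBoostSketch {
    private ScoreBoostSketch() {}

    /** Returns the base relevance score scaled by each boost factor in turn. */
    static double boostedScore(double baseScore, double... boosts) {
        double score = baseScore;
        for (double boost : boosts) {
            score *= boost;
        }
        return score;
    }

    public static void main(String[] args) {
        double base = 2.4849067; // _score from the SCORE(QUERY_STRING(...)) example
        // Compare with the 6.2122664 reported for boost=2.5
        System.out.println(boostedScore(base, 2.5));
        // Compare with the 12.424533 reported for boost=2.5 inside SCOREQUERY(..., 2.0)
        System.out.println(boostedScore(base, 2.5, 2.0));
    }
}
```

The small differences in the last digits are expected if the reported values are single-precision floats.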

Issues Resolved

Check List

  • New functionality includes testing.
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented.
    • New functionality has javadoc added
    • New functionality has user manual doc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

codecov bot commented Feb 15, 2023

Codecov Report

❗ No coverage uploaded for pull request base (integ-metadata-fields@d44cd39). Click here to learn what that means.
The diff coverage is n/a.

@@                   Coverage Diff                    @@
##             integ-metadata-fields     #228   +/-   ##
========================================================
  Coverage                         ?   98.47%           
  Complexity                       ?     3875           
========================================================
  Files                            ?      345           
  Lines                            ?     9648           
  Branches                         ?      626           
========================================================
  Hits                             ?     9501           
  Misses                           ?      142           
  Partials                         ?        5           
Flag Coverage Δ
sql-engine 98.47% <0.00%> (?)

@dai-chen

not sure if opensearch-project#788 can be resolved as well?

@acarbonetto
Author

not sure if opensearch-project#788 can be resolved as well?

This problem will occur for any field that shares a name with a function. The workaround for now will be to use backticks.

@acarbonetto acarbonetto marked this pull request as ready for review February 17, 2023 17:09
@dai-chen

not sure if opensearch-project#788 can be resolved as well?

This problem will occur for any field that shares a name with a function. The workaround for now will be to use backticks.

I think the issue you're referring to is already fixed here: opensearch-project#1191. I tried to fix score in this PR but ran into some problems.

* @param context analysis context for the query
* @return resolved relevance function
*/
public Expression visitScoreFunction(ScoreFunction node, AnalysisContext context) {

I think we need to add this to the opensearch module as a storage-specific function. Personally, I think we should prioritize opensearch-project#811. It will become more and more difficult as we keep adding OpenSearch logic to the core engine.

Author

Agreed. I've got it on our radar, and we can start to scope it out. As is, there's quite a bit of work to do to pull out the opensearch specific classes, but I think it's do-able.

We've had storage-specific function support since opensearch-project#1354. Can we start this now instead of adding more and more special logic to core? Agreed, there is quite a lot of work to move everything to opensearch, but maybe it is easy to do for this single PR?

Author

Let me take a look. There's already a lot in this PR.
Would you mind if we move the score function out as a proof of concept for a set of OpenSearch storage engine functions?

public static final String METADATA_FIELD_SCORE = "_score";
public static final String METADATA_FIELD_MAXSCORE = "_maxscore";
public static final String METADATA_FIELD_SORT = "_sort";
public static final java.util.Map<String, ExprCoreType> METADATAFIELD_TYPE_MAP = new HashMap<>() {

@dai-chen dai-chen Feb 17, 2023

We may not want to hardcode this in core. I'm not sure whether adding these fields in OpenSearchIndex.getFieldMapping() can work for you. However, we need to consider meta columns in other storage engines. For example, S3 may have $path and $partition in each row.

Author

I can't use OpenSearchIndex as it would create a dependency on opensearch in core. Not ideal. We kind of need to wait until we have user-defined functions that can override core visitors, like QualifiedName.

Sorry, I didn't mean you need to depend on the opensearch module. OpenSearchIndex is a subclass of our Table interface. As I understand it, Analyzer fetches the field list from Table during query analysis. I saw you've added the isMetaField flag, so I don't see why we want to hardcode the metadata field list in the core module.

Author

Oh. Interesting. I'll see if I can hook that up.

Author

Please see changes in:

  • core/src/main/java/org/opensearch/sql/analysis/TypeEnvironment.java (for interface changes in the environment)
  • opensearch/src/main/java/org/opensearch/sql/opensearch/storage/OpenSearchIndex.java (specific changes for OpenSearch indexes), and
  • core/src/main/java/org/opensearch/sql/analysis/ExpressionAnalyzer.java the visitQualifiedName function
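
The idea discussed in this thread, letting each table report its own metadata fields instead of hardcoding them in core, can be sketched as follows. This is a simplified illustration, not the code from this PR: `TableSketch`, `OpenSearchIndexSketch`, `getReservedFieldTypes`, and the string type names are assumptions standing in for the real `Table`, `OpenSearchIndex`, and `ExprCoreType`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the core Table interface.
interface TableSketch {
    /** Regular fields defined by the table's own mapping. */
    Map<String, String> getFieldTypes();

    /** Storage-specific metadata fields; empty unless a storage engine overrides it. */
    default Map<String, String> getReservedFieldTypes() {
        return Map.of();
    }
}

// Hypothetical OpenSearch-side implementation that contributes its metadata fields.
final class OpenSearchIndexSketch implements TableSketch {
    @Override
    public Map<String, String> getFieldTypes() {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("str0", "STRING"); // example field; a real index reads its mapping
        return fields;
    }

    @Override
    public Map<String, String> getReservedFieldTypes() {
        // The type names here are assumed for illustration.
        Map<String, String> reserved = new LinkedHashMap<>();
        reserved.put("_id", "STRING");
        reserved.put("_index", "STRING");
        reserved.put("_score", "FLOAT");
        reserved.put("_maxscore", "FLOAT");
        reserved.put("_sort", "LONG");
        return reserved;
    }
}
```

With this shape, core code only depends on the interface, and another storage engine (for example one exposing $path or $partition) could supply its own reserved fields without touching core.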

@acarbonetto acarbonetto force-pushed the dev-metadata-fields branch 2 times, most recently from b4912da to c8f0520 Compare March 7, 2023 22:18
Signed-off-by: Andrew Carbonetto <andrewc@bitquilltech.com>
… function

@@ -280,7 +280,7 @@ void search() {

 Iterator<ExprValue> hits = response1.iterator();
 assertTrue(hits.hasNext());
-assertEquals(exprTupleValue, hits.next());
+assertEquals(exprTupleValue.tupleValue().get("id"), hits.next().tupleValue().get("id"));

Why was this change necessary?

Author

It was matching the whole object and failing because the Boolean objects weren't equal. This could be removed (I think) if I change the Boolean to a boolean.

Author

Sorry, I got that wrong. We are now returning the metafields in the response from OpenSearch, so they appear in the tuple. Comparing only the id field keeps the spirit of the test the same.
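
To illustrate why the whole-tuple comparison stopped working, here is a minimal sketch that uses plain maps in place of `ExprTupleValue`. The class and method names, and the sample values, are hypothetical; the point is only that a returned tuple carrying extra metadata entries is no longer equal to the expected tuple, while a single-field comparison still holds.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Illustrative only: plain maps stand in for ExprTupleValue.
final class MetafieldAssertionSketch {
    private MetafieldAssertionSketch() {}

    /** The tuple the test expects, containing only the document's own field. */
    static Map<String, Object> expectedTuple() {
        Map<String, Object> tuple = new LinkedHashMap<>();
        tuple.put("id", 1);
        return tuple;
    }

    /** The tuple as returned once metadata fields are included in the response. */
    static Map<String, Object> returnedTuple() {
        Map<String, Object> tuple = new LinkedHashMap<>(expectedTuple());
        tuple.put("_id", "1");        // assumed sample metadata values
        tuple.put("_index", "test");
        return tuple;
    }

    /** Compares one named field of two tuples, ignoring every other entry. */
    static boolean sameField(Map<String, Object> a, Map<String, Object> b, String key) {
        return Objects.equals(a.get(key), b.get(key));
    }

    public static void main(String[] args) {
        System.out.println(expectedTuple().equals(returnedTuple()));               // whole-tuple comparison
        System.out.println(sameField(expectedTuple(), returnedTuple(), "id"));     // field-level comparison
    }
}
```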

}

@Test
public void testMetafieldIdentifierTest() throws IOException {

What happens when getting a meta field from a Prometheus data source?

Author

The parser will fail. There's no workaround for Prometheus data sources for _id, _score, _index, _maxscore and _sort. But it will work for other fields that begin with an underscore (e.g. _routing or __fieldname) and are not strictly defined as OpenSearch meta-fields.
In other words: this is an improvement on the existing functionality, which would just fail for ALL fields starting with an underscore.

``score_query(search_expression, boost)``
``scorequery(search_expression, boost)``

The score function returns the _score of any documents matching the enclosed relevance-search expression. The SCORE function expects two

As a user, it is unexpected that SELECT _score, phrase FROM phrases returns a _score value, yet SELECT _score, phrase FROM phrases WHERE match('phrase', 'my') returns a _score of null.

I'd expect _score to be returned if requested since it can be calculated for any OpenSearch query.

Author

This is an OpenSearch value. I would think users need to understand the OpenSearch side of things. One could ask why OpenSearch doesn't also just calculate the score for everything...

The plugin could also set track_scores = true if the user asks for _score field and is using relevance search.

At the very least, it should be mentioned here that _score will be null unless the relevance query is wrapped in score(...).
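
The track_scores suggestion could look roughly like the sketch below. This is not what the PR implements: `TrackScoresSketch`, `selectsScore`, and `buildSearchBody` are hypothetical helpers, and a plain map stands in for the real search request builder. The only OpenSearch detail relied on is that the search request body accepts a boolean `track_scores` option.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: decides whether to ask OpenSearch to track scores.
final class TrackScoresSketch {
    private TrackScoresSketch() {}

    /** True when the SELECT list asks for a score-related metadata field. */
    static boolean selectsScore(List<String> selectedFields) {
        return selectedFields.contains("_score") || selectedFields.contains("_maxscore");
    }

    /** Builds a search body map, adding track_scores only when a score field is selected. */
    static Map<String, Object> buildSearchBody(List<String> selectedFields,
                                               Map<String, Object> query) {
        Map<String, Object> body = new LinkedHashMap<>();
        body.put("query", query);
        if (selectsScore(selectedFields)) {
            body.put("track_scores", true);
        }
        return body;
    }
}
```

Under this approach a user would not need to wrap the relevance query in score(...) just to see a non-null _score, at the cost of always computing scores when the field is selected.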

public TypeEnvironment(TypeEnvironment parent, SymbolTable symbolTable) {
this.parent = parent;
this.symbolTable = symbolTable;
this.reservedSymbolTable = new SymbolTable();

Why is a separate symbol table needed? Why can't we add metafields to symbolTable?

Author

To separate out 'reserved fields' (that are in all tables) from fields that are specific to the index. The code needs to know the difference between these two lists later on.
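
A minimal sketch of that separation, assuming plain maps in place of `SymbolTable` and string type names in place of `ExprCoreType`. All names other than `symbolTable` and `reservedSymbolTable` are hypothetical. It shows one place where the distinction matters: `SELECT *` expands to index fields only, while a reserved field still resolves when requested by name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Illustrative only: two lookup tables, one for index fields and one for reserved fields.
final class TypeEnvironmentSketch {
    private final Map<String, String> symbolTable = new TreeMap<>();
    private final Map<String, String> reservedSymbolTable = new TreeMap<>();

    void define(String name, String type) {
        symbolTable.put(name, type);
    }

    void defineReserved(String name, String type) {
        reservedSymbolTable.put(name, type);
    }

    /** Resolves a name against index fields first, then reserved fields. */
    Optional<String> resolve(String name) {
        String type = symbolTable.get(name);
        if (type == null) {
            type = reservedSymbolTable.get(name);
        }
        return Optional.ofNullable(type);
    }

    /** Fields that SELECT * expands to: index fields only, never reserved ones. */
    List<String> expandSelectAll() {
        return new ArrayList<>(symbolTable.keySet());
    }
}
```

This matches the behaviour in the description, where `SELECT *, _id` returns the index's own fields plus only the metadata field that was requested explicitly.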

@acarbonetto acarbonetto merged commit 3e4e9d7 into integ-metadata-fields Mar 21, 2023
forestmvey pushed a commit that referenced this pull request Apr 10, 2023
…ore OpenSearch function (#228) (opensearch-project#1456)

Allow metadata fields and score OpenSearch function.

Signed-off-by: Andrew Carbonetto <andrewc@bitquilltech.com>
acarbonetto added a commit that referenced this pull request Apr 10, 2023
…ore OpenSearch function (#228) (opensearch-project#1456)

Allow metadata fields and score OpenSearch function.

Signed-off-by: Andrew Carbonetto <andrewc@bitquilltech.com>
(cherry picked from commit e805151)
forestmvey pushed a commit that referenced this pull request Apr 13, 2023
…e opensearch function (#228) (opensearch-project#1508)

* opensearch-project#639: Support OpenSearch metadata fields and the score OpenSearch function (#228) (opensearch-project#1456)

Signed-off-by: Andrew Carbonetto <andrewc@bitquilltech.com>
Co-authored-by: Andrew Carbonetto <andrewc@bitquilltech.com>
forestmvey pushed a commit that referenced this pull request Apr 13, 2023
…e opensearch function (#228) (opensearch-project#1509)

* opensearch-project#639: Support OpenSearch metadata fields and the score OpenSearch function (#228) (opensearch-project#1456)

Signed-off-by: Andrew Carbonetto <andrewc@bitquilltech.com>
Co-authored-by: Andrew Carbonetto <andrewc@bitquilltech.com>
acarbonetto added a commit that referenced this pull request Apr 18, 2023
…elds and the score OpenSearch function (#228) (opensearch-project#1456)

Allow metadata fields and score OpenSearch function.

Signed-off-by: Andrew Carbonetto <andrewc@bitquilltech.com>
6 participants