Add customized result index in data source etc #2212

Merged
3 commits merged into opensearch-project:main on Oct 5, 2023

Conversation

@kaituo (Contributor) commented Oct 4, 2023

Description

This PR

  • Introduce a `spark.flint.datasource.name` parameter for data source specification (a minimal sketch follows this list).
  • Enhance data source creation to allow custom result indices; fall back to the default index when a custom one is unavailable.
  • Include error details in the async result response, sourced from the result index.
  • Migrate to `org.apache.spark.sql.FlintJob` following updates in OpenSearch-Spark.
  • Populate query status from the result index rather than the EMR-S job status, to handle edge cases where jobs succeed but queries or mappings fail.
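
For illustration only: the data source name reaches the Spark job as a configuration property on the spark-submit parameter string. Below is a minimal sketch of how such a property could be appended; the class, method, and sample data source name are hypothetical, not the exact code in this PR (the real builder lives in SparkSubmitParameters with different names).

```java
// Hypothetical sketch of building "--conf key=value" pairs for Spark submit parameters.
public class SubmitParamsSketch {

  private final StringBuilder params = new StringBuilder();

  // Appends a single Spark configuration property.
  public SubmitParamsSketch withConfig(String key, String value) {
    params.append(" --conf ").append(key).append("=").append(value);
    return this;
  }

  public String build() {
    return params.toString().trim();
  }

  public static void main(String[] args) {
    // "my_glue" is an assumed data source name, purely for illustration.
    String submitParams = new SubmitParamsSketch()
        .withConfig("spark.flint.datasource.name", "my_glue")
        .build();
    System.out.println(submitParams); // --conf spark.flint.datasource.name=my_glue
  }
}
```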

Testing done:

  1. Manual testing verifying that async queries work both with and without a custom result index.
  2. Added new unit tests.

Check List

  • New functionality includes testing.
    • All tests pass, including unit tests, integration tests, and doctests.
  • New functionality has been documented.
    • New functionality has javadoc added
    • New functionality has user manual doc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

This PR
- Introduce `spark.flint.datasource.name` parameter for data source specification.
- Enhance data source creation to allow custom result indices; fallback to default if unavailable.
- Include error details in the async result response, sourced from the result index.
- Migrate to `org.apache.spark.sql.FlintJob` following updates in OpenSearch-Spark.
- Populate query status from result index over EMR-S job status to handle edge cases where jobs may succeed, but queries or mappings fail.

Testing done:
1. Manual testing verifying that async queries work both with and without a custom result index.
2. Added new unit tests.

Signed-off-by: Kaituo Li <kaituo@amazon.com>

@codecov (codecov bot) commented Oct 5, 2023

Codecov Report

Merging #2212 (a5a92e8) into main (5df6105) will decrease coverage by 0.01%.
The diff coverage is 96.92%.

@@             Coverage Diff              @@
##               main    #2212      +/-   ##
============================================
- Coverage     96.37%   96.36%   -0.01%     
- Complexity     4722     4727       +5     
============================================
  Files           439      439              
  Lines         12650    12686      +36     
  Branches        869      872       +3     
============================================
+ Hits          12191    12225      +34     
- Misses          450      452       +2     
  Partials          9        9              
Flag         Coverage Δ
sql-engine   96.36% <96.92%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Files Coverage Δ
...rch/sql/datasources/utils/XContentParserUtils.java 100.00% <100.00%> (ø)
...park/asyncquery/AsyncQueryExecutorServiceImpl.java 100.00% <100.00%> (ø)
...h/sql/spark/asyncquery/model/AsyncQueryResult.java 100.00% <100.00%> (ø)
.../spark/asyncquery/model/SparkSubmitParameters.java 98.59% <100.00%> (+0.02%) ⬆️
...h/sql/spark/client/EmrServerlessClientImplEMR.java 100.00% <100.00%> (ø)
...g/opensearch/sql/spark/client/StartJobRequest.java 100.00% <ø> (ø)
...earch/sql/spark/data/constants/SparkConstants.java 0.00% <ø> (ø)
...rch/sql/spark/dispatcher/SparkQueryDispatcher.java 100.00% <100.00%> (ø)
...sql/spark/response/JobExecutionResponseReader.java 100.00% <100.00%> (ø)
.../transport/TransportGetAsyncQueryResultAction.java 100.00% <100.00%> (ø)
... and 2 more

@@ -8,13 +8,22 @@
public class SparkConstants {

Contributor Author:
added

@@ -14,4 +14,6 @@ properties:
  keyword:
    type: keyword
  connector:
    type: keyword
  resultIndex:

Member:
Do we need indexing on this?

Contributor Author:
yeah, we need to save it and retrieve it later.

Collaborator @penghuo Oct 5, 2023:
keyword looks good, no full text search required.
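
To make the save-and-retrieve-with-fallback behaviour concrete, here is a minimal sketch, assuming the default index name from the SparkConstants diff further down (`.query_execution_result`); the class and method names are illustrative, not the plugin's actual API.

```java
// Hypothetical sketch: prefer the data source's stored resultIndex, otherwise fall
// back to the default response buffer index used by async queries.
public final class ResultIndexResolverSketch {

  // Default taken from SparkConstants.SPARK_RESPONSE_BUFFER_INDEX_NAME in this PR's diff.
  private static final String DEFAULT_RESULT_INDEX = ".query_execution_result";

  static String resolve(String customResultIndex) {
    if (customResultIndex == null || customResultIndex.trim().isEmpty()) {
      return DEFAULT_RESULT_INDEX;
    }
    return customResultIndex;
  }

  public static void main(String[] args) {
    System.out.println(resolve(null));                // .query_execution_result
    System.out.println(resolve("my_custom_results")); // my_custom_results
  }
}
```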

Signed-off-by: Kaituo Li <kaituo@amazon.com>
  public static final String SPARK_RESPONSE_BUFFER_INDEX_NAME = ".query_execution_result";
  // TODO should be replaced with mvn jar.
  public static final String FLINT_INTEGRATION_JAR =
-     "s3://spark-datasource/flint-spark-integration-assembly-0.1.0-SNAPSHOT.jar";
+     "s3://flint-data-dp-eu-west-1-beta/code/flint/sql-job.jar";

Collaborator:
revert this, it is for Spark datasource, not used in flint.

Contributor Author:
reverted

penghuo previously approved these changes Oct 5, 2023
vmmusings previously approved these changes Oct 5, 2023
// A job being successful does not mean there was no error in execution. For example, even if
// the result index mapping is incorrect, we still write the query result and let the job finish.
if (result.has(DATA_FIELD)) {

Member:
We are setting the result object only when the job status from EMR is success.

In the case of an index query, the EMR JobRun state is RUNNING, but the status in the result should be successful.

Member:
Probably we can merge this and handle the use case as a bug in a separate PR.

Collaborator:
let's track in issue #2214

Contributor Author:
yeah, we can handle this use case as a bug separately.
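
For context on this thread: the change prefers the status written to the result index over the EMR-S job run state, since a run can finish (or still be RUNNING, as noted above) while the query outcome is already recorded. Below is a minimal sketch of that precedence, assuming the org.json `JSONObject` type used in the surrounding diff; the field names and fallback are hypothetical, not the plugin's exact schema.

```java
import org.json.JSONObject;

// Hypothetical sketch: report the status stored in the result-index document when one
// exists, and only fall back to the EMR-S job run state otherwise.
public final class QueryStatusSketch {

  static String resolveStatus(JSONObject resultIndexDoc, String emrJobState) {
    if (resultIndexDoc != null && resultIndexDoc.has("status")) {
      // A result document exists, so its status (and any error details) win.
      return resultIndexDoc.getString("status");
    }
    // No result document yet: fall back to the EMR-S job run state.
    return emrJobState;
  }

  public static void main(String[] args) {
    JSONObject doc = new JSONObject().put("status", "SUCCESS").put("error", "");
    System.out.println(resolveStatus(doc, "RUNNING"));  // SUCCESS (result index wins)
    System.out.println(resolveStatus(null, "RUNNING")); // RUNNING (no result yet)
  }
}
```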

@penghuo added the v2.11.0 (Issues targeting release v2.11.0) label on Oct 5, 2023
Signed-off-by: Kaituo Li <kaituo@amazon.com>
@kaituo dismissed stale reviews from vmmusings and penghuo via a5a92e8 on October 5, 2023, 04:35
@penghuo merged commit 70450e4 into opensearch-project:main on Oct 5, 2023
14 of 21 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 5, 2023
* Add customized result index in data source etc

This PR
- Introduce `spark.flint.datasource.name` parameter for data source specification.
- Enhance data source creation to allow custom result indices; fallback to default if unavailable.
- Include error details in the async result response, sourced from the result index.
- Migrate to `org.apache.spark.sql.FlintJob` following updates in OpenSearch-Spark.
- Populate query status from result index over EMR-S job status to handle edge cases where jobs may succeed, but queries or mappings fail.

Testing done:
1. Manual testing verifying that async queries work both with and without a custom result index.
2. Added new unit tests.

Signed-off-by: Kaituo Li <kaituo@amazon.com>

* address comments

Signed-off-by: Kaituo Li <kaituo@amazon.com>

* revert incorrect change

Signed-off-by: Kaituo Li <kaituo@amazon.com>

---------

Signed-off-by: Kaituo Li <kaituo@amazon.com>
(cherry picked from commit 70450e4)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
penghuo pushed a commit that referenced this pull request Oct 5, 2023
* Add customized result index in data source etc

This PR
- Introduce `spark.flint.datasource.name` parameter for data source specification.
- Enhance data source creation to allow custom result indices; fallback to default if unavailable.
- Include error details in the async result response, sourced from the result index.
- Migrate to `org.apache.spark.sql.FlintJob` following updates in OpenSearch-Spark.
- Populate query status from result index over EMR-S job status to handle edge cases where jobs may succeed, but queries or mappings fail.

Testing done:
1. Manual testing verifying that async queries work both with and without a custom result index.
2. Added new unit tests.



* address comments



* revert incorrect change

Signed-off-by: Kaituo Li <kaituo@amazon.com>

---------


(cherry picked from commit 70450e4)

Signed-off-by: Kaituo Li <kaituo@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Labels
backport 2.x, v2.11.0 (Issues targeting release v2.11.0)
3 participants