Add customized result index in data source etc #2212

Merged
3 commits merged into opensearch-project:main on Oct 5, 2023

Conversation

@kaituo (Contributor) commented Oct 4, 2023

Description

This PR

  • Introduce a `spark.flint.datasource.name` parameter for data source specification (a minimal sketch follows this list).
  • Enhance data source creation to allow custom result indices; fall back to the default index when a custom one is unavailable.
  • Include error details in the async result response, sourced from the result index.
  • Migrate to `org.apache.spark.sql.FlintJob` following updates in OpenSearch-Spark.
  • Populate query status from the result index rather than the EMR-S job status, to handle edge cases where jobs succeed but queries or mappings fail.
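
For illustration only: the data source name reaches the Spark job as a configuration property on the spark-submit parameter string. Below is a minimal sketch of how such a property could be appended; the class, method, and sample data source name are hypothetical, not the exact code in this PR (the real builder lives in SparkSubmitParameters with different names).

```java
// Hypothetical sketch of building "--conf key=value" pairs for Spark submit parameters.
public class SubmitParamsSketch {

  private final StringBuilder params = new StringBuilder();

  // Appends a single Spark configuration property.
  public SubmitParamsSketch withConfig(String key, String value) {
    params.append(" --conf ").append(key).append("=").append(value);
    return this;
  }

  public String build() {
    return params.toString().trim();
  }

  public static void main(String[] args) {
    // "my_glue" is an assumed data source name, purely for illustration.
    String submitParams = new SubmitParamsSketch()
        .withConfig("spark.flint.datasource.name", "my_glue")
        .build();
    System.out.println(submitParams); // --conf spark.flint.datasource.name=my_glue
  }
}
```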

Testing done:

  1. Manual testing verifying that async queries work both with and without a custom result index.
  2. Added new unit tests.

Check List

  • New functionality includes testing.
    • All tests pass, including unit tests, integration tests, and doctests.
  • New functionality has been documented.
    • New functionality has javadoc added
    • New functionality has user manual doc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

This PR
- Introduce `spark.flint.datasource.name` parameter for data source specification.
- Enhance data source creation to allow custom result indices; fallback to default if unavailable.
- Include error details in the async result response, sourced from the result index.
- Migrate to `org.apache.spark.sql.FlintJob` following updates in OpenSearch-Spark.
- Populate query status from result index over EMR-S job status to handle edge cases where jobs may succeed, but queries or mappings fail.

Testing done:
1. Manual testing verifying that async queries work both with and without a custom result index.
2. Added new unit tests.

Signed-off-by: Kaituo Li <kaituo@amazon.com>

@codecov (codecov bot) commented Oct 5, 2023

Codecov Report

Merging #2212 (a5a92e8) into main (5df6105) will decrease coverage by 0.01%.
The diff coverage is 96.92%.

@@             Coverage Diff              @@
##               main    #2212      +/-   ##
============================================
- Coverage     96.37%   96.36%   -0.01%     
- Complexity     4722     4727       +5     
============================================
  Files           439      439              
  Lines         12650    12686      +36     
  Branches        869      872       +3     
============================================
+ Hits          12191    12225      +34     
- Misses          450      452       +2     
  Partials          9        9              
Flag         Coverage Δ
sql-engine   96.36% <96.92%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Files Coverage Δ
...rch/sql/datasources/utils/XContentParserUtils.java 100.00% <100.00%> (ø)
...park/asyncquery/AsyncQueryExecutorServiceImpl.java 100.00% <100.00%> (ø)
...h/sql/spark/asyncquery/model/AsyncQueryResult.java 100.00% <100.00%> (ø)
.../spark/asyncquery/model/SparkSubmitParameters.java 98.59% <100.00%> (+0.02%) ⬆️
...h/sql/spark/client/EmrServerlessClientImplEMR.java 100.00% <100.00%> (ø)
...g/opensearch/sql/spark/client/StartJobRequest.java 100.00% <ø> (ø)
...earch/sql/spark/data/constants/SparkConstants.java 0.00% <ø> (ø)
...rch/sql/spark/dispatcher/SparkQueryDispatcher.java 100.00% <100.00%> (ø)
...sql/spark/response/JobExecutionResponseReader.java 100.00% <100.00%> (ø)
.../transport/TransportGetAsyncQueryResultAction.java 100.00% <100.00%> (ø)
... and 2 more

@@ -8,13 +8,22 @@
public class SparkConstants {

Contributor Author:
added

@@ -14,4 +14,6 @@ properties:
  keyword:
    type: keyword
  connector:
    type: keyword
  resultIndex:

Member:
Do we need indexing on this?

Contributor Author:
yeah, we need to save it and retrieve it later.

Collaborator @penghuo Oct 5, 2023:
keyword looks good, no full text search required.
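
To make the save-and-retrieve-with-fallback behaviour concrete, here is a minimal sketch, assuming the default index name from the SparkConstants diff further down (`.query_execution_result`); the class and method names are illustrative, not the plugin's actual API.

```java
// Hypothetical sketch: prefer the data source's stored resultIndex, otherwise fall
// back to the default response buffer index used by async queries.
public final class ResultIndexResolverSketch {

  // Default taken from SparkConstants.SPARK_RESPONSE_BUFFER_INDEX_NAME in this PR's diff.
  private static final String DEFAULT_RESULT_INDEX = ".query_execution_result";

  static String resolve(String customResultIndex) {
    if (customResultIndex == null || customResultIndex.trim().isEmpty()) {
      return DEFAULT_RESULT_INDEX;
    }
    return customResultIndex;
  }

  public static void main(String[] args) {
    System.out.println(resolve(null));                // .query_execution_result
    System.out.println(resolve("my_custom_results")); // my_custom_results
  }
}
```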

Signed-off-by: Kaituo Li <kaituo@amazon.com>
  public static final String SPARK_RESPONSE_BUFFER_INDEX_NAME = ".query_execution_result";
  // TODO should be replaced with mvn jar.
  public static final String FLINT_INTEGRATION_JAR =
-     "s3://spark-datasource/flint-spark-integration-assembly-0.1.0-SNAPSHOT.jar";
+     "s3://flint-data-dp-eu-west-1-beta/code/flint/sql-job.jar";

Collaborator:
revert this, it is for Spark datasource, not used in flint.

Contributor Author:
reverted

penghuo previously approved these changes Oct 5, 2023
vmmusings previously approved these changes Oct 5, 2023
// A job being successful does not mean there was no error in execution. For example, even if
// the result index mapping is incorrect, we still write the query result and let the job finish.
if (result.has(DATA_FIELD)) {

Member:
We are setting the result object only when the job status from EMR is success.

In the case of an index query, the EMR JobRun state is RUNNING, but the status in the result should be successful.

Member:
Probably we can merge this and handle the use case as a bug in a separate PR.

Collaborator:
let's track in issue #2214

Contributor Author:
yeah, we can handle this use case as a bug separately.
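
For context on this thread: the change prefers the status written to the result index over the EMR-S job run state, since a run can finish (or still be RUNNING, as noted above) while the query outcome is already recorded. Below is a minimal sketch of that precedence, assuming the org.json `JSONObject` type used in the surrounding diff; the field names and fallback are hypothetical, not the plugin's exact schema.

```java
import org.json.JSONObject;

// Hypothetical sketch: report the status stored in the result-index document when one
// exists, and only fall back to the EMR-S job run state otherwise.
public final class QueryStatusSketch {

  static String resolveStatus(JSONObject resultIndexDoc, String emrJobState) {
    if (resultIndexDoc != null && resultIndexDoc.has("status")) {
      // A result document exists, so its status (and any error details) win.
      return resultIndexDoc.getString("status");
    }
    // No result document yet: fall back to the EMR-S job run state.
    return emrJobState;
  }

  public static void main(String[] args) {
    JSONObject doc = new JSONObject().put("status", "SUCCESS").put("error", "");
    System.out.println(resolveStatus(doc, "RUNNING"));  // SUCCESS (result index wins)
    System.out.println(resolveStatus(null, "RUNNING")); // RUNNING (no result yet)
  }
}
```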

@penghuo added the v2.11.0 (Issues targeting release v2.11.0) label on Oct 5, 2023
Signed-off-by: Kaituo Li <kaituo@amazon.com>
@kaituo dismissed stale reviews from vmmusings and penghuo via a5a92e8 on October 5, 2023, 04:35
@penghuo merged commit 70450e4 into opensearch-project:main on Oct 5, 2023
14 of 21 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 5, 2023
* Add customized result index in data source etc

This PR
- Introduce `spark.flint.datasource.name` parameter for data source specification.
- Enhance data source creation to allow custom result indices; fallback to default if unavailable.
- Include error details in the async result response, sourced from the result index.
- Migrate to `org.apache.spark.sql.FlintJob` following updates in OpenSearch-Spark.
- Populate query status from result index over EMR-S job status to handle edge cases where jobs may succeed, but queries or mappings fail.

Testing done:
1. Manual testing verifying that async queries work both with and without a custom result index.
2. Added new unit tests.

Signed-off-by: Kaituo Li <kaituo@amazon.com>

* address comments

Signed-off-by: Kaituo Li <kaituo@amazon.com>

* revert incorrect change

Signed-off-by: Kaituo Li <kaituo@amazon.com>

---------

Signed-off-by: Kaituo Li <kaituo@amazon.com>
(cherry picked from commit 70450e4)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
penghuo pushed a commit that referenced this pull request Oct 5, 2023
* Add customized result index in data source etc

This PR
- Introduce `spark.flint.datasource.name` parameter for data source specification.
- Enhance data source creation to allow custom result indices; fallback to default if unavailable.
- Include error details in the async result response, sourced from the result index.
- Migrate to `org.apache.spark.sql.FlintJob` following updates in OpenSearch-Spark.
- Populate query status from result index over EMR-S job status to handle edge cases where jobs may succeed, but queries or mappings fail.

Testing done:
1. Manual testing verifying that async queries work both with and without a custom result index.
2. Added new unit tests.



* address comments



* revert incorrect change

Signed-off-by: Kaituo Li <kaituo@amazon.com>

---------


(cherry picked from commit 70450e4)

Signed-off-by: Kaituo Li <kaituo@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Labels
backport 2.x, v2.11.0 (Issues targeting release v2.11.0)
3 participants