Add customized result index in data source etc #2212
This PR:
- Introduce `spark.flint.datasource.name` parameter for data source specification.
- Enhance data source creation to allow custom result indices; fall back to the default if unavailable.
- Include error details in the async result response, sourced from the result index.
- Migrate to `org.apache.spark.sql.FlintJob` following updates in OpenSearch-Spark.
- Populate query status from the result index rather than the EMR-S job status, to handle edge cases where jobs may succeed but queries or mappings fail.

Testing done:
1. Manual testing, including checking that async queries still work with and without a custom result index.
2. Added new unit tests.

Signed-off-by: Kaituo Li <kaituo@amazon.com>
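The fallback for the result index could look roughly like the sketch below. This is an illustration only, not the code from this PR: the class `ResultIndexResolver` and the method `resolve` are hypothetical, the default is assumed to be the `.query_execution_result` name quoted from `SparkConstants` in this PR, and "unavailable" is read here as "not configured on the data source".

```java
class ResultIndexResolver {
  // Assumed default: the index name quoted from SparkConstants in this PR.
  static final String DEFAULT_RESULT_INDEX = ".query_execution_result";

  // Hypothetical helper: return the custom result index when one is configured,
  // otherwise fall back to the default index.
  static String resolve(String customResultIndex) {
    if (customResultIndex == null || customResultIndex.trim().isEmpty()) {
      return DEFAULT_RESULT_INDEX;
    }
    return customResultIndex.trim();
  }

  public static void main(String[] args) {
    // No custom index configured: the default is used.
    System.out.println(resolve(null));
    // Custom index configured (the name is made up for illustration).
    System.out.println(resolve("query_results_my_datasource"));
  }
}
```

Keeping the fallback in one place means callers never have to special-case data sources that were created without a custom result index.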
Codecov Report
@@ Coverage Diff @@
## main #2212 +/- ##
============================================
- Coverage 96.37% 96.36% -0.01%
- Complexity 4722 4727 +5
============================================
Files 439 439
Lines 12650 12686 +36
Branches 869 872 +3
============================================
+ Hits 12191 12225 +34
- Misses 450 452 +2
Partials 9 9
@@ -8,13 +8,22 @@
public class SparkConstants {
added
spark/src/main/java/org/opensearch/sql/spark/dispatcher/SparkQueryDispatcher.java
@@ -14,4 +14,6 @@ properties:
keyword:
type: keyword
connector:
type: keyword
resultIndex:
Do we need indexing on this?
yeah, we need to save it and retrieve it later.
keyword looks good, no full text search required.
Signed-off-by: Kaituo Li <kaituo@amazon.com>
public static final String SPARK_RESPONSE_BUFFER_INDEX_NAME = ".query_execution_result";
// TODO should be replaced with mvn jar.
public static final String FLINT_INTEGRATION_JAR =
"s3://spark-datasource/flint-spark-integration-assembly-0.1.0-SNAPSHOT.jar";
"s3://flint-data-dp-eu-west-1-beta/code/flint/sql-job.jar";
revert this, it is for Spark datasource, not used in flint.
reverted
// a job is successful does not mean there is no error in execution. For example, even if result
// index mapping is incorrect, we still write query result and let the job finish.
if (result.has(DATA_FIELD)) {
We set the result object only when the job status from EMR is success.
For an index query, the EMR JobRun state is RUNNING, but the status in the result should be successful.
We can probably merge this and handle that use case as a bug in a separate PR.
let's track in issue #2214
yeah, we can handle this use case as a bug separately.
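To make the discussion above concrete, here is a rough sketch of preferring the status recorded in the result index over the EMR-S job state. It is not the PR's implementation: the class `QueryStatusSketch`, the method `buildResponse`, and the field names `status` and `error` are assumptions for illustration, and a plain `Map` stands in for the JSON result document. The index-query case raised above (EMR state RUNNING while the query has succeeded) is deliberately not special-cased here, since it is tracked separately in #2214.

```java
import java.util.HashMap;
import java.util.Map;

class QueryStatusSketch {
  // Hypothetical field names in the result index document.
  static final String STATUS_FIELD = "status";
  static final String ERROR_FIELD = "error";

  // Prefer the status written to the result index; fall back to the EMR-S job
  // state only when no status has been written yet.
  static Map<String, String> buildResponse(Map<String, String> resultDoc, String emrJobState) {
    Map<String, String> response = new HashMap<>();
    if (resultDoc != null && resultDoc.containsKey(STATUS_FIELD)) {
      response.put("status", resultDoc.get(STATUS_FIELD));
      String error = resultDoc.get(ERROR_FIELD);
      if (error != null && !error.isEmpty()) {
        // Surface error details from the result index in the async response.
        response.put("error", error);
      }
    } else {
      response.put("status", emrJobState);
    }
    return response;
  }

  public static void main(String[] args) {
    // The job finished, but the result index recorded a failure.
    Map<String, String> doc = new HashMap<>();
    doc.put(STATUS_FIELD, "FAILED");
    doc.put(ERROR_FIELD, "result index mapping is incorrect");
    System.out.println(buildResponse(doc, "SUCCESS"));
  }
}
```

With this ordering, a job that EMR-S reports as successful still surfaces as failed when the result index says so, which is the edge case the PR description calls out.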
iiiiii-off-by: Kaituo Li <kaituo@amazon.com>
* Add customized result index in data source etc

This PR
- Introduce `spark.flint.datasource.name` parameter for data source specification.
- Enhance data source creation to allow custom result indices; fallback to default if unavailable.
- Include error details in the async result response, sourced from the result index.
- Migrate to `org.apache.spark.sql.FlintJob` following updates in OpenSearch-Spark.
- Populate query status from result index over EMR-S job status to handle edge cases where jobs may succeed, but queries or mappings fail.

Testing done:
1. manual testing including if with or without custom result index async query still works
2. added new unit tests

Signed-off-by: Kaituo Li <kaituo@amazon.com>

* address comments

Signed-off-by: Kaituo Li <kaituo@amazon.com>

* revert incorrect change

iiiiii-off-by: Kaituo Li <kaituo@amazon.com>

---------

Signed-off-by: Kaituo Li <kaituo@amazon.com>
(cherry picked from commit 70450e4)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>