What is the bug?
The Spark streaming job continues running even if (a) the OpenSearch cluster or (b) the metadata log index becomes unavailable. As a result, data integrity and system stability can be compromised without immediate feedback or error handling from the monitoring system.
How can one reproduce the bug?
Steps to reproduce the behavior:
1. Set up a Spark streaming job by creating a Flint index with auto refresh (a rough sketch follows these steps).
2. Simulate a failure, for example by shutting down the OpenSearch cluster or deleting the metadata log index.
3. Observe that the Spark streaming job continues to run without failing or raising significant errors.
4. Check the monitoring or logging outputs and note the lack of appropriate error handling or job termination.
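As a rough illustration of step 1, a Flint index with auto refresh can be created through the Flint SQL extension, which starts the background streaming job that the index monitor tracks. The table, column, and datasource names below are placeholders, and the exact syntax should be double-checked against the Flint documentation:

```scala
// Hypothetical repro sketch: create a Flint skipping index with auto refresh.
// This starts a background Spark streaming job that FlintSparkIndexMonitor tracks.
// Table name, column, and datasource ("mys3") are placeholders.
spark.sql(
  """CREATE SKIPPING INDEX ON mys3.default.http_logs
    |  (status VALUE_SET)
    |  WITH (auto_refresh = true)
    |""".stripMargin)

// After the index is created, stop the OpenSearch cluster or delete the
// metadata log index (e.g. .query_execution_request_mys3) and observe that
// the streaming job keeps running.
```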
What is the expected behavior?
The expected behavior is for the Spark streaming job to either halt or switch to a safe mode when it can no longer communicate with the OpenSearch cluster or access the metadata log index. It should raise alerts or error messages that inform system administrators, or halt further data processing to prevent data corruption or loss.
Possible solutions:
External system intervention: Implement a mechanism where an external monitoring system observes heartbeat metrics and terminates the streaming job when it detects errors or missing heartbeats.
Internal job termination: Enhance the index monitor so that it terminates the streaming job after a predefined number of consecutive failures to update the heartbeat. This would let the job manage its own failure state and shut down gracefully if it consistently fails to report its status (see the sketch after this list).
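A minimal sketch of the internal job termination idea, assuming the monitor's scheduled task can be wrapped in a failure counter and that the streaming query is named after the Flint index. HeartbeatGuard, recordHeartbeat, and maxFailures are hypothetical names, not part of the existing FlintSparkIndexMonitor API:

```scala
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper around the heartbeat update performed by the index monitor.
class HeartbeatGuard(spark: SparkSession, indexName: String, maxFailures: Int = 3) {

  private val consecutiveFailures = new AtomicInteger(0)

  /** Invoked by the scheduled monitor task; `update` is the heartbeat refresh,
   *  e.g. the startTransaction(...) call that updates the metadata log entry. */
  def recordHeartbeat(update: => Unit): Unit = {
    try {
      update
      consecutiveFailures.set(0) // heartbeat succeeded, reset the counter
    } catch {
      case e: Exception =>
        // Below the threshold, mirror the current behavior: log and keep going.
        if (consecutiveFailures.incrementAndGet() >= maxFailures) {
          // Too many consecutive heartbeat failures: stop the streaming job
          // owned by this index instead of letting it run unattended.
          spark.streams.active
            .filter(_.name == indexName)
            .foreach(_.stop())
          throw e // surface the error so the scheduled task fails visibly
        }
    }
  }
}
```

Whether the failure threshold and the stop-versus-alert behavior should be configurable (for example via a Flint option) is a design choice left to the maintainers.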
Do you have any additional context?
Code: https://github.com/opensearch-project/opensearch-spark/blob/main/flint-spark-integration/src/main/scala/org/opensearch/flint/spark/FlintSparkIndexMonitor.scala#L69
Sample log:
24/05/15 12:32:27 INFO FlintSparkIndexMonitor: Scheduler trigger index monitor task for flint_mys3_default_prabarch_p3_index
24/05/15 12:32:27 INFO FlintSparkIndexMonitor: Streaming job is still active
24/05/15 12:32:27 INFO FlintOpenSearchClient: Starting transaction on index flint_mys3_default_prabarch_p3_index and data source mys3
24/05/15 12:32:27 INFO RetryableHttpAsyncClient: Building retryable http async client with options: FlintRetryOptions{maxRetries=3, retryableStatusCodes=429,502, retryableExceptionClassNames=Optional.empty}
24/05/15 12:32:27 INFO HttpStatusCodeResultPredicate: Checking if status code is retryable: 404
24/05/15 12:32:27 INFO HttpStatusCodeResultPredicate: Status code 404 check result: false
24/05/15 12:32:27 WARN FlintOpenSearchClient: Metadata log index not found .query_execution_request_mys3
24/05/15 12:32:27 ERROR FlintSparkIndexMonitor: Failed to update index log entry
java.lang.IllegalStateException: Metadata log index not found .query_execution_request_mys3
at org.opensearch.flint.core.storage.FlintOpenSearchClient.startTransaction(FlintOpenSearchClient.java:114) ~[opensearch-spark-standalone_2.12-latest.jar:0.4.0-SNAPSHOT]
at org.opensearch.flint.core.storage.FlintOpenSearchClient.startTransaction(FlintOpenSearchClient.java:126) ~[opensearch-spark-standalone_2.12-latest.jar:0.4.0-SNAPSHOT]
at org.opensearch.flint.spark.FlintSparkIndexMonitor.$anonfun$startMonitor$1(FlintSparkIndexMonitor.scala:51) ~[opensearch-spark-standalone_2.12-latest.jar:0.4.0-SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
at java.lang.Thread.run(Thread.java:840) ~[?:?]