Enhance index monitor to terminate streaming job on consecutive errors #346

dai-chen · 2024-05-17T18:27:26Z

Description

This PR enhances FlintSparkIndexMonitor to make index monitoring more robust and observable. The monitor now tracks consecutive errors and, once a configurable maximum is reached, terminates both the streaming job and itself, so that a job whose critical dependency is unavailable no longer keeps running and holding resources.

Add new Spark conf for monitor initial delay, interval and max error count
Add FlintSparkIndexMonitorTask with counter tracking number of consecutive errors
Add logic to terminate the streaming job and the monitor task once the max error count is reached
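
The counting behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual FlintSparkIndexMonitorTask code: `MonitorSettings`, `ConsecutiveErrorCounter`, and `attempt` are hypothetical names, and the default values are simply the ones printed in the test log in this description.

```scala
import scala.util.control.NonFatal

// Hypothetical names throughout; this is not the FlintSparkIndexMonitor source.
final case class MonitorSettings(
    initialDelaySeconds: Int = 15, // values as printed in the test log of this PR
    intervalSeconds: Int = 60,
    maxErrorCount: Int = 5)

final class ConsecutiveErrorCounter(settings: MonitorSettings) {
  private var errors = 0

  /** Runs one monitoring attempt. Returns true once the job should be terminated. */
  def attempt(updateIndexLog: () => Unit): Boolean = synchronized {
    try {
      updateIndexLog()
      errors = 0 // a single success resets the streak
      false
    } catch {
      case NonFatal(_) =>
        errors += 1
        errors >= settings.maxErrorCount
    }
  }

  def consecutiveErrors: Int = synchronized(errors)
}
```

Because only consecutive failures count, a transient outage that clears before the limit is reached leaves the streaming job running.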

Testing with EMR and AOS:

scala> sc.setLogLevel("INFO")

# Start index refresh streaming job
sql("CREATE SKIPPING INDEX ON myglue.ds_tables.http_logs_perf_tiny (status VALUE_SET) WITH (auto_refresh = true)")

24/05/17 21:28:13 INFO FlintSparkIndexMonitor: Starting index monitor for flint_myglue_ds_tables_http_logs_perf_tiny_skipping_index with configuration:
 - Initial delay: 15 seconds
 - Interval: 60 seconds
 - Max error count: 5

# Set read only on AOS
PUT _cluster/settings
{
   "persistent":{
      "cluster.blocks.read_only": true
   }
}

# Monitor starts failing
24/05/17 21:28:24 ERROR FlintSparkIndexMonitor: Failed to update index log entry, consecutive errors: 1

24/05/17 21:29:24 ERROR FlintSparkIndexMonitor: Failed to update index log entry, consecutive errors: 2
...

# Remove read only on AOS
PUT _cluster/settings
{
   "persistent":{
      "cluster.blocks.read_only": false
   }
}

# Monitor can update successfully
24/05/17 21:30:24 INFO FlintSparkIndexMonitor: Scheduler trigger index monitor task for flint_myglue_ds_tables_http_logs_perf_tiny_skipping_index
24/05/17 21:30:24 INFO FlintSparkIndexMonitor: Streaming job is still active

# Set read only on AOS again
PUT _cluster/settings
{
   "persistent":{
      "cluster.blocks.read_only": true
   }
}

# Counter is reset and restarts from 1
24/05/17 21:31:25 ERROR FlintSparkIndexMonitor: Failed to update index log entry, consecutive errors: 1

24/05/17 21:32:25 ERROR FlintSparkIndexMonitor: Failed to update index log entry, consecutive errors: 2

...

# Monitor stops streaming job and itself once max error count reached
24/05/17 21:35:26 ERROR FlintSparkIndexMonitor: Failed to update index log entry, consecutive errors: 5
java.lang.IllegalStateException: Failed to commit transaction operation
	at org.opensearch.flint.core.metadata.log.DefaultOptimisticTransaction.commit(DefaultOptimisticTransaction.java:125) ~[flint-spark-integration-assembly-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
	at org.opensearch.flint.spark.FlintSparkIndexMonitor$FlintSparkIndexMonitorTask.run(FlintSparkIndexMonitor.scala:106) ~[flint-spark-integration-assembly-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:829) ~[?:?]
24/05/17 21:35:26 INFO FlintSparkIndexMonitor: Terminating streaming job and index monitor for flint_myglue_ds_tables_http_logs_perf_tiny_skipping_index
24/05/17 21:35:26 INFO DAGScheduler: Asked to cancel job group 02c0af66-b3a9-4611-8fc2-9259206f83a4
24/05/17 21:35:26 ERROR MicroBatchExecution: Query flint_myglue_ds_tables_http_logs_perf_tiny_skipping_index [id = ba336769-f33d-4da8-8ba2-46630cbdcd92, runId = 02c0af66-b3a9-4611-8fc2-9259206f83a4] terminated with error ...
24/05/17 21:35:26 INFO DAGScheduler: Asked to cancel job group 02c0af66-b3a9-4611-8fc2-9259206f83a4
24/05/17 21:35:26 INFO MicroBatchExecution: Query flint_myglue_ds_tables_http_logs_perf_tiny_skipping_index [id = ba336769-f33d-4da8-8ba2-46630cbdcd92, runId = 02c0af66-b3a9-4611-8fc2-9259206f83a4] was stopped
24/05/17 21:35:26 INFO FlintSparkIndexMonitor: Cancelling scheduled task for index flint_myglue_ds_tables_http_logs_perf_tiny_skipping_index
24/05/17 21:35:26 INFO FlintSparkIndexMonitor: Streaming job and index monitor terminated
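
The termination sequence in the log above (stop the streaming job, then cancel the scheduled monitor task) can be sketched roughly as follows. The stack trace shows the task running on a ScheduledThreadPoolExecutor; everything else here is an assumption for illustration, including the names `SelfCancellingMonitor`, `check`, and `stopJob`, and the use of milliseconds instead of seconds.

```scala
import java.util.concurrent.{CountDownLatch, Executors, ScheduledFuture, TimeUnit}
import java.util.concurrent.atomic.{AtomicInteger, AtomicReference}
import scala.util.control.NonFatal

// Hypothetical sketch; not the actual FlintSparkIndexMonitor implementation.
object SelfCancellingMonitor {

  /**
   * Schedules `check` periodically. After `maxErrorCount` consecutive failures
   * it calls `stopJob`, cancels its own scheduled task, and releases the latch.
   */
  def start(
      check: () => Unit,
      stopJob: () => Unit,
      maxErrorCount: Int,
      initialDelayMs: Long,
      intervalMs: Long): CountDownLatch = {
    val executor = Executors.newSingleThreadScheduledExecutor()
    val errors = new AtomicInteger(0)
    val terminated = new CountDownLatch(1)
    val handle = new AtomicReference[ScheduledFuture[_]]()

    val task = new Runnable {
      override def run(): Unit =
        try {
          check()
          errors.set(0) // success resets the consecutive error count
        } catch {
          case NonFatal(_) =>
            if (errors.incrementAndGet() >= maxErrorCount) {
              try stopJob() // stop the streaming job first
              finally {
                // then cancel the monitor itself, even if stopJob failed
                Option(handle.get()).foreach(_.cancel(false))
                executor.shutdown()
                terminated.countDown()
              }
            }
        }
    }

    handle.set(
      executor.scheduleWithFixedDelay(task, initialDelayMs, intervalMs, TimeUnit.MILLISECONDS))
    terminated
  }
}
```

Cancelling inside `finally` is a design choice of this sketch: it keeps the monitor from lingering if stopping the job itself throws.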

Issues Resolved

#344

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Chen Dai <daichen@amazon.com>

#346)
* Add error counter and terminate logic in index monitor
  Signed-off-by: Chen Dai <daichen@amazon.com>
* Add new Spark conf for max error count and interval
  Signed-off-by: Chen Dai <daichen@amazon.com>
* Add new Spark conf for initial delay too
  Signed-off-by: Chen Dai <daichen@amazon.com>
* Update user manual
  Signed-off-by: Chen Dai <daichen@amazon.com>
---------
Signed-off-by: Chen Dai <daichen@amazon.com>
(cherry picked from commit 9de4f28)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

#346) (#347)
* Add error counter and terminate logic in index monitor
* Add new Spark conf for max error count and interval
* Add new Spark conf for initial delay too
* Update user manual
---------
(cherry picked from commit 9de4f28)
Signed-off-by: Chen Dai <daichen@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Add error counter and terminate logic in index monitor

4838949

Signed-off-by: Chen Dai <daichen@amazon.com>

dai-chen added the enhancement, 0.5, and backport 0.4 labels May 17, 2024

dai-chen self-assigned this May 17, 2024

dai-chen linked an issue May 17, 2024 that may be closed by this pull request

[BUG] Flint index refresh job continues despite critical dependency unavailable #344

Closed

dai-chen added 3 commits May 17, 2024 13:35

Add new Spark conf for max error count and interval

4836c6e

Signed-off-by: Chen Dai <daichen@amazon.com>

Add new Spark conf for initial delay too

2990adf

Signed-off-by: Chen Dai <daichen@amazon.com>

Update user manual

2864b28

Signed-off-by: Chen Dai <daichen@amazon.com>

dai-chen marked this pull request as ready for review May 17, 2024 20:55

dai-chen requested review from rupal-bq, vmmusings, penghuo, seankao-az, anirudha, kaituo and YANG-DB as code owners May 17, 2024 20:55

penghuo approved these changes May 17, 2024


dai-chen merged commit 9de4f28 into opensearch-project:main May 17, 2024
4 checks passed

opensearch-trigger-bot bot mentioned this pull request May 17, 2024

[Backport 0.4] Enhance index monitor to terminate streaming job on consecutive errors #347

Merged

dai-chen deleted the terminate-streaming-job-in-index-monitor branch May 20, 2024 23:30

dai-chen mentioned this pull request May 28, 2024

[BUG] Update heartbeat failed, REPL / Streaming Job does not exist correctly #293

Open
