
Prevent cycle sending #5251

Merged: 1 commit merged into apache:master on May 31, 2022

Conversation

jiangpengcheng (Contributor)

Stop sending activation messages back to the QueueManager itself when a cycle happens; recover the MemoryQueue instead.
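
In other words, when the leader key in etcd points back at this scheduler but the local MemoryQueue is gone, recover the queue locally instead of forwarding the activation to itself. A minimal sketch of that idea (not the actual patch; CyclePreventionSketch, SchedulerEndpoint, recoverQueue, and forwardTo are illustrative stand-ins, not OpenWhisk code):

    // Minimal sketch (illustration only), not the actual patch.
    // SchedulerEndpoint, recoverQueue and forwardTo are hypothetical stand-ins.
    object CyclePreventionSketch {

      final case class SchedulerEndpoint(host: String, port: Int)
      final case class ActivationMessage(action: String, payload: String)

      val self: SchedulerEndpoint = SchedulerEndpoint("scheduler-0", 8080)
      private var queues = Map.empty[String, List[ActivationMessage]]

      // Hypothetical recovery: re-create the in-memory queue instead of re-sending the message.
      def recoverQueue(action: String): Unit =
        queues = queues.updated(action, List.empty)

      def forwardTo(endpoint: SchedulerEndpoint, msg: ActivationMessage): Unit =
        println(s"forwarding ${msg.action} to $endpoint")

      // The leader key in etcd records which scheduler owns the queue for this action.
      def handleActivation(leader: SchedulerEndpoint, msg: ActivationMessage): Unit =
        if (leader == self) {
          // Before the fix, the manager could send the message back to itself and loop.
          // Recover the missing MemoryQueue locally and enqueue the activation instead.
          if (!queues.contains(msg.action)) recoverQueue(msg.action)
          queues = queues.updated(msg.action, queues(msg.action) :+ msg)
        } else {
          forwardTo(leader, msg)
        }

      def main(args: Array[String]): Unit = {
        handleActivation(self, ActivationMessage("helloAction", "{}"))
        println(queues)
      }
    }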

Description

Related issue and scope

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Scheduler
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@bdoyle0182 (Contributor)

I have an issue where the namespaceContainer metric still reports containers for a namespace even after no activations have run for that namespace in a long time. The value just remains constant forever until I restart the scheduler, at which point it correctly drops to 0 and stops emitting for that namespace. My thought was that something is getting stuck in memory with the memory queue even after it should have been shut down, since the metric is reported from that actor. And it's weird that it would still report containers when there are none in etcd for the namespace; even if the memory queue wasn't properly shut down, I'd assume it would be updated with the correct value while it keeps emitting, unless it's stuck in a zombie state or something. Do you think this could be the same issue?

@codecov-commenter commented May 27, 2022

Codecov Report

Merging #5251 (1a6c99d) into master (1a6c99d) will not change coverage.
The diff coverage is n/a.

❗ Current head 1a6c99d differs from pull request most recent head b5f7aaf. Consider uploading reports for the commit b5f7aaf to get more accurate results

@@           Coverage Diff           @@
##           master    #5251   +/-   ##
=======================================
  Coverage   79.82%   79.82%           
=======================================
  Files         238      238           
  Lines       14009    14009           
  Branches      567      567           
=======================================
  Hits        11183    11183           
  Misses       2826     2826           


@jiangpengcheng (Contributor, Author)

> I have an issue where the namespaceContainer metric still reports containers for a namespace even after no activations have run for that namespace in a long time. The value just remains constant forever until I restart the scheduler, at which point it correctly drops to 0 and stops emitting for that namespace. My thought was that something is getting stuck in memory with the memory queue even after it should have been shut down, since the metric is reported from that actor. And it's weird that it would still report containers when there are none in etcd for the namespace; even if the memory queue wasn't properly shut down, I'd assume it would be updated with the correct value while it keeps emitting, unless it's stuck in a zombie state or something. Do you think this could be the same issue?

Do you mean these metrics in MemoryQueue.scala?

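    // Per-namespace gauges: existing and in-progress container counts for this invocation namespace.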
    MetricEmitter.emitGaugeMetric(
      LoggingMarkers.SCHEDULER_NAMESPACE_CONTAINER(invocationNamespace),
      namespaceContainerCount.existingContainerNumByNamespace)
    MetricEmitter.emitGaugeMetric(
      LoggingMarkers.SCHEDULER_NAMESPACE_INPROGRESS_CONTAINER(invocationNamespace),
      namespaceContainerCount.inProgressContainerNumByNamespace)

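    // Per-action gauges: existing containers and container creations still in progress for this action.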
    MetricEmitter.emitGaugeMetric(
      LoggingMarkers.SCHEDULER_ACTION_CONTAINER(invocationNamespace, action.asString),
      containers.size)
    MetricEmitter.emitGaugeMetric(
      LoggingMarkers.SCHEDULER_ACTION_INPROGRESS_CONTAINER(invocationNamespace, action.asString),
      creationIds.size)

It looks like some memory queues under that namespace are not terminated.
The scheduler provides an HTTP API, queue/status, which returns the status of the memory queues inside it; you can check whether all queues are terminated when the error happens.
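
For example, a rough way to dump that status from outside the scheduler (a sketch only; the host, port, exact path, and response shape are deployment-specific assumptions):

    // Rough check (illustration only): fetch the scheduler's queue status endpoint and
    // print it, so lingering (non-terminated) queues can be spotted by eye.
    // The URL below is a placeholder; use your scheduler's actual address and path.
    import scala.io.Source

    object QueueStatusCheck {
      def main(args: Array[String]): Unit = {
        val url = "http://localhost:8080/queue/status" // hypothetical address and path
        val source = Source.fromURL(url)
        try println(source.mkString)
        finally source.close()
      }
    }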

This issue is caused by the MemoryQueue being removed while its leader key in etcd is not, so the two problems are not related.

@ningyougang (Contributor)

LGTM

@style95 merged commit a75950a into apache:master on May 31, 2022
JesseStutler pushed a commit to JesseStutler/openwhisk that referenced this pull request Jul 13, 2022
@style95 mentioned this pull request on Jul 31, 2022 and on Oct 10, 2022