
Prevent cycle sending #5251

Merged: 1 commit merged into apache:master on May 31, 2022

Conversation

jiangpengcheng (Contributor)

Stop sending activation messages back to the QueueManager itself when a cycle happens; recover the MemoryQueue instead.
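
In other words, when the leader key in etcd points back at this scheduler but the local MemoryQueue is gone, recover the queue locally instead of forwarding the activation to itself. A minimal sketch of that idea (not the actual patch; CyclePreventionSketch, SchedulerEndpoint, recoverQueue, and forwardTo are illustrative stand-ins, not OpenWhisk code):

    // Minimal sketch (illustration only), not the actual patch.
    // SchedulerEndpoint, recoverQueue and forwardTo are hypothetical stand-ins.
    object CyclePreventionSketch {

      final case class SchedulerEndpoint(host: String, port: Int)
      final case class ActivationMessage(action: String, payload: String)

      val self: SchedulerEndpoint = SchedulerEndpoint("scheduler-0", 8080)
      private var queues = Map.empty[String, List[ActivationMessage]]

      // Hypothetical recovery: re-create the in-memory queue instead of re-sending the message.
      def recoverQueue(action: String): Unit =
        queues = queues.updated(action, List.empty)

      def forwardTo(endpoint: SchedulerEndpoint, msg: ActivationMessage): Unit =
        println(s"forwarding ${msg.action} to $endpoint")

      // The leader key in etcd records which scheduler owns the queue for this action.
      def handleActivation(leader: SchedulerEndpoint, msg: ActivationMessage): Unit =
        if (leader == self) {
          // Before the fix, the manager could send the message back to itself and loop.
          // Recover the missing MemoryQueue locally and enqueue the activation instead.
          if (!queues.contains(msg.action)) recoverQueue(msg.action)
          queues = queues.updated(msg.action, queues(msg.action) :+ msg)
        } else {
          forwardTo(leader, msg)
        }

      def main(args: Array[String]): Unit = {
        handleActivation(self, ActivationMessage("helloAction", "{}"))
        println(queues)
      }
    }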

Description

Related issue and scope

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Scheduler
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@bdoyle0182 (Contributor)

I have an issue where the namespaceContainer metric still reports containers for a namespace even after no activations have run for that namespace in a long time. The value just remains constant forever until I restart the scheduler, at which point it correctly drops to 0 and stops emitting for that namespace. My thought was that something is getting stuck in memory with the memory queue even after it should have been shut down, since the metric is reported from that actor. And it's weird that it would still report containers when there are none in etcd for the namespace; even if the memory queue wasn't properly shut down, I'd assume it would be updated with the correct value while it keeps emitting, unless it's stuck in a zombie state or something. Do you think this could be the same issue?

@codecov-commenter commented May 27, 2022

Codecov Report

Merging #5251 (1a6c99d) into master (1a6c99d) will not change coverage.
The diff coverage is n/a.

❗ Current head 1a6c99d differs from pull request most recent head b5f7aaf. Consider uploading reports for the commit b5f7aaf to get more accurate results

@@           Coverage Diff           @@
##           master    #5251   +/-   ##
=======================================
  Coverage   79.82%   79.82%           
=======================================
  Files         238      238           
  Lines       14009    14009           
  Branches      567      567           
=======================================
  Hits        11183    11183           
  Misses       2826     2826           


@jiangpengcheng (Contributor, Author)

> I have an issue where the namespaceContainer metric still reports containers for a namespace even after no activations have run for that namespace in a long time. The value just remains constant forever until I restart the scheduler, at which point it correctly drops to 0 and stops emitting for that namespace. My thought was that something is getting stuck in memory with the memory queue even after it should have been shut down, since the metric is reported from that actor. And it's weird that it would still report containers when there are none in etcd for the namespace; even if the memory queue wasn't properly shut down, I'd assume it would be updated with the correct value while it keeps emitting, unless it's stuck in a zombie state or something. Do you think this could be the same issue?

Do you mean these metrics in MemoryQueue.scala?

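    // Per-namespace gauges: existing and in-progress container counts for this invocation namespace.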
    MetricEmitter.emitGaugeMetric(
      LoggingMarkers.SCHEDULER_NAMESPACE_CONTAINER(invocationNamespace),
      namespaceContainerCount.existingContainerNumByNamespace)
    MetricEmitter.emitGaugeMetric(
      LoggingMarkers.SCHEDULER_NAMESPACE_INPROGRESS_CONTAINER(invocationNamespace),
      namespaceContainerCount.inProgressContainerNumByNamespace)

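    // Per-action gauges: existing containers and container creations still in progress for this action.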
    MetricEmitter.emitGaugeMetric(
      LoggingMarkers.SCHEDULER_ACTION_CONTAINER(invocationNamespace, action.asString),
      containers.size)
    MetricEmitter.emitGaugeMetric(
      LoggingMarkers.SCHEDULER_ACTION_INPROGRESS_CONTAINER(invocationNamespace, action.asString),
      creationIds.size)

It looks like some memory queues under that namespace are not terminated.
The scheduler provides an HTTP API, queue/status, which returns the status of the memory queues inside it; you can check whether all queues are terminated when the error happens.
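
For example, a rough way to dump that status from outside the scheduler (a sketch only; the host, port, exact path, and response shape are deployment-specific assumptions):

    // Rough check (illustration only): fetch the scheduler's queue status endpoint and
    // print it, so lingering (non-terminated) queues can be spotted by eye.
    // The URL below is a placeholder; use your scheduler's actual address and path.
    import scala.io.Source

    object QueueStatusCheck {
      def main(args: Array[String]): Unit = {
        val url = "http://localhost:8080/queue/status" // hypothetical address and path
        val source = Source.fromURL(url)
        try println(source.mkString)
        finally source.close()
      }
    }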

This issue is caused by the MemoryQueue being removed while its leader key in etcd is not, so the two problems are not related.

@ningyougang (Contributor)

LGTM

@style95 merged commit a75950a into apache:master on May 31, 2022
JesseStutler pushed a commit to JesseStutler/openwhisk that referenced this pull request Jul 13, 2022
@style95 mentioned this pull request on Jul 31, 2022 and on Oct 10, 2022