Skip to content
This repository has been archived by the owner on Nov 1, 2023. It is now read-only.

Exponential backoff for failed notifications #2555

Merged

Conversation

tevoinea
Copy link
Member

Summary of the Pull Request

What is this about?

#2553

Today, our file-changes-poison queue does not get consumed and we manually re-queue those messages. This PR automatically re-processes those messages using an exponential backoff. The backoff times are:

// 1 - 5 mins
// 2 - 25 mins
// 3 - 2 hours 5 mins
// 4 - 10 hours 25 mins
// 5 - 2 days 4 hours 5 mins
// >6 - 7 days +/- 1 day of random variance so all the messages don't dequeue at the same time

Once a message hits the 1 week backoff point (or 6 attempts), we log a warning (though we should consider adding a webhook as well).

Info on Pull Request

What does this include?

  • Consuming the file-changes-poison queue
  • Exponential backoff retry for poison queue messages
  • Logging when poison queue messages are really failing
  • Rename failTaskOnTransientError to isLastRetryAttempt to more accurately reflect the behavior of that flag (it was changed in a previous PR)

Validation Steps Performed

How does someone test & validate?

  • Enqueue a poison queue message
  • Run the C# code such that it will fail when processing the message (aka: always throw an exception)
  • Validate the message was consumed from the queue
  • Validate the message was put back on the queue but hidden (azure portal shows 0 of 1 messages)
  • Re-run the C# code but don't throw any exceptions
  • Validate the message was consumed from the queue after the visibility period and make sure it was created successfully and not put back in the poison queue

Teo Voinea added 2 commits October 24, 2022 18:59
@codecov-commenter
Copy link

Codecov Report

Merging #2555 (f28a733) into main (cafba18) will decrease coverage by 5.31%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##             main    #2555      +/-   ##
==========================================
- Coverage   29.84%   24.52%   -5.32%     
==========================================
  Files         288      119     -169     
  Lines       35682    12446   -23236     
==========================================
- Hits        10648     3052    -7596     
+ Misses      25034     9394   -15640     
Impacted Files Coverage Δ
...piService/ApiService/Functions/QueueFileChanges.cs 0.00% <0.00%> (ø)
...rvice/TestHooks/NotificationOperationsTestHooks.cs 0.00% <0.00%> (ø)
...ce/ApiService/onefuzzlib/NotificationOperations.cs 0.00% <0.00%> (ø)
...Service/ApiService/onefuzzlib/notifications/Ado.cs 0.00% <0.00%> (ø)
...rc/agent/input-tester/src/test_result/fast_fail.rs
src/agent/onefuzz-agent/src/main.rs
src/agent/coverage/src/cache.rs
src/agent/coverage/src/block/pe_provider.rs
src/agent/dynamic-library/src/linux.rs
src/agent/srcview/src/lib.rs
... and 163 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@tevoinea tevoinea requested review from stishkin, chkeita and Porges and removed request for stishkin and chkeita October 25, 2022 12:49
@tevoinea tevoinea merged commit 0e8876b into microsoft:main Oct 26, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Nov 25, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants