
Commit df5f21a

Merge pull request #4490 from hhunter-ms/issue_4463
Resiliency policies updates
2 parents 2b51021 + 60e6c47 commit df5f21a

14 files changed, with 533 additions and 349 deletions


daprdocs/content/en/developing-applications/building-blocks/service-invocation/howto-invoke-services-grpc.md

Lines changed: 2 additions & 0 deletions
@@ -309,6 +309,8 @@ context.AddMetadata("dapr-stream", "true");

### Streaming gRPCs and Resiliency

> Currently, resiliency policies are not supported for service invocation via gRPC.

When proxying streaming gRPCs, due to their long-lived nature, [resiliency]({{< ref "resiliency-overview.md" >}}) policies are applied on the "initial handshake" only. As a consequence:

- If the stream is interrupted after the initial handshake, it will not be automatically re-established by Dapr. Your application will be notified that the stream has ended, and will need to recreate it.

daprdocs/content/en/operations/resiliency/policies.md

Lines changed: 0 additions & 330 deletions
This file was deleted.
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
---
type: docs
title: "Resiliency policies"
linkTitle: "Policies"
weight: 200
description: "Configure resiliency policies for timeouts, retries, and circuit breakers"
---

Define timeouts, retries, and circuit breaker policies under `policies`. Each policy is given a name so you can refer to it from the [`targets` section in the resiliency spec]({{< ref targets.md >}}).
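
As a rough illustration of how named policies and targets fit together, the sketch below defines one policy of each kind and applies all three to a single app. The resource name `myresiliency`, the policy names `general`, `important`, and `simpleCB`, the app ID `appB`, and all durations and thresholds are placeholders chosen for this example, not values taken from this change.

```yaml
apiVersion: dapr.io/v1alpha1
kind: Resiliency
metadata:
  name: myresiliency          # hypothetical resource name
spec:
  policies:
    timeouts:
      general: 5s             # named timeout policy
    retries:
      important:              # named retry policy
        policy: constant
        duration: 5s
        maxRetries: 30
    circuitBreakers:
      simpleCB:               # named circuit breaker policy
        maxRequests: 1
        timeout: 30s
        trip: consecutiveFailures >= 5
  targets:
    apps:
      appB:                   # hypothetical app ID; refers to the policies by name
        timeout: general
        retry: important
        circuitBreaker: simpleCB
```

Because targets refer to policies only by name, the same policy can be reused across several targets without repeating its settings.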
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
---
type: docs
title: "Circuit breaker resiliency policies"
linkTitle: "Circuit breakers"
weight: 30
description: "Configure resiliency policies for circuit breakers"
---

Circuit breaker policies are used when other applications, services, or components are experiencing elevated failure rates. Circuit breakers reduce load by monitoring requests and shutting off all traffic to the impacted service when certain criteria are met.

After a certain number of requests fail, circuit breakers "trip" or open to prevent cascading failures. By doing this, circuit breakers give the service time to recover from its outage instead of flooding it with events.

The circuit breaker can also enter a "half-open" state, allowing partial traffic through to see if the system has healed.

Once requests resume being successful, the circuit breaker returns to the "closed" state and allows traffic to completely resume.

## Circuit breaker policy format

```yaml
spec:
  policies:
    circuitBreakers:
      pubsubCB:
        maxRequests: 1
        interval: 8s
        timeout: 45s
        trip: consecutiveFailures > 8
```

## Spec metadata

| Circuit breaker option | Description |
| ---------------------- | ----------- |
| `maxRequests` | The maximum number of requests allowed to pass through when the circuit breaker is half-open (recovering from failure). Defaults to `1`. |
| `interval` | The cyclical period of time used by the circuit breaker to clear its internal counts. If set to 0 seconds, this never clears. Defaults to `0s`. |
| `timeout` | The period of the open state (directly after failure) until the circuit breaker switches to half-open. Defaults to `60s`. |
| `trip` | A [Common Expression Language (CEL)](https://github.com/google/cel-spec) statement that is evaluated by the circuit breaker. When the statement evaluates to true, the circuit breaker trips and becomes open. Defaults to `consecutiveFailures > 5`. Other available variables are `requests` and `totalFailures`, where `requests` is the number of calls, successful or failed, before the circuit opens and `totalFailures` is the total (not necessarily consecutive) number of failed attempts before the circuit opens. Examples: `requests > 5` and `totalFailures > 3`. |

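
To make the `trip` options more concrete, here is a sketch of two circuit breaker policies, one tripping on total failures and one on consecutive failures, each applied to an app target. The policy names `burstCB` and `strictCB`, the app IDs `appA` and `appB`, and all durations and thresholds are hypothetical values chosen for illustration.

```yaml
spec:
  policies:
    circuitBreakers:
      # Opens after more than 3 failures, not necessarily consecutive.
      # Internal counts are cleared every 30s because `interval` is non-zero.
      burstCB:
        maxRequests: 1
        interval: 30s
        timeout: 60s
        trip: totalFailures > 3
      # Opens after more than 5 failures in a row.
      # `interval` is omitted, so it stays at the 0s default and counts are not cleared on a timer.
      strictCB:
        maxRequests: 2
        timeout: 20s
        trip: consecutiveFailures > 5
  targets:
    apps:
      appA:                 # hypothetical app ID
        circuitBreaker: burstCB
      appB:                 # hypothetical app ID
        circuitBreaker: strictCB
```

In this sketch, `strictCB` lets up to two probe requests through while half-open and moves from open to half-open after 20 seconds, whereas `burstCB` waits the full 60 seconds.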

## Next steps
- [Learn more about default resiliency policies]({{< ref default-policies.md >}})
- Learn more about:
  - [Retry policies]({{< ref retries-overview.md >}})
  - [Timeout policies]({{< ref timeouts.md >}})

## Related links

Try out one of the Resiliency quickstarts:
- [Resiliency: Service-to-service]({{< ref resiliency-serviceinvo-quickstart.md >}})
- [Resiliency: State Management]({{< ref resiliency-state-quickstart.md >}})
Lines changed: 173 additions & 0 deletions
@@ -0,0 +1,173 @@
---
type: docs
title: "Default resiliency policies"
linkTitle: "Default policies"
weight: 40
description: "Learn more about the default resiliency policies for timeouts, retries, and circuit breakers"
---

In resiliency, you can set default policies, which have a broad scope. This is done through reserved keywords that let Dapr know when to apply the policy. There are 3 default policy types:

- `DefaultRetryPolicy`
- `DefaultTimeoutPolicy`
- `DefaultCircuitBreakerPolicy`

If these policies are defined, they are used for every operation to a service, application, or component. They can also be made more specific by appending additional keywords. The specific policies follow the pattern `Default%sRetryPolicy`, `Default%sTimeoutPolicy`, and `Default%sCircuitBreakerPolicy`, where `%s` is replaced by the target of the policy.

Below is a table of all possible default policy keywords and how they translate into a policy name.

| Keyword                          | Target Operation                                     | Example Policy Name                                         |
| -------------------------------- | ---------------------------------------------------- | ----------------------------------------------------------- |
| `App`                            | Service invocation.                                  | `DefaultAppRetryPolicy`                                     |
| `Actor`                          | Actor invocation.                                    | `DefaultActorTimeoutPolicy`                                 |
| `Component`                      | All component operations.                            | `DefaultComponentCircuitBreakerPolicy`                      |
| `ComponentInbound`               | All inbound component operations.                    | `DefaultComponentInboundRetryPolicy`                        |
| `ComponentOutbound`              | All outbound component operations.                   | `DefaultComponentOutboundTimeoutPolicy`                     |
| `StatestoreComponentOutbound`    | All statestore component operations.                 | `DefaultStatestoreComponentOutboundCircuitBreakerPolicy`    |
| `PubsubComponentOutbound`        | All outbound pubsub (publish) component operations.  | `DefaultPubsubComponentOutboundRetryPolicy`                 |
| `PubsubComponentInbound`         | All inbound pubsub (subscribe) component operations. | `DefaultPubsubComponentInboundTimeoutPolicy`                |
| `BindingComponentOutbound`       | All outbound binding (invoke) component operations.  | `DefaultBindingComponentOutboundCircuitBreakerPolicy`       |
| `BindingComponentInbound`        | All inbound binding (read) component operations.     | `DefaultBindingComponentInboundRetryPolicy`                 |
| `SecretstoreComponentOutbound`   | All secretstore component operations.                | `DefaultSecretstoreComponentOutboundTimeoutPolicy`          |
| `ConfigurationComponentOutbound` | All configuration component operations.              | `DefaultConfigurationComponentOutboundCircuitBreakerPolicy` |
| `LockComponentOutbound`          | All lock component operations.                       | `DefaultLockComponentOutboundRetryPolicy`                   |

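
The naming pattern can be illustrated with a short sketch. It assumes that default timeout and circuit breaker policies are declared under the same `timeouts` and `circuitBreakers` sections as named policies; the durations and thresholds are illustrative only.

```yaml
spec:
  policies:
    timeouts:
      # No keyword: applies to any operation without a more specific timeout.
      DefaultTimeoutPolicy: 10s
      # Keyword `Actor`: applies to actor invocation.
      DefaultActorTimeoutPolicy: 30s
      # Keyword `ComponentOutbound`: applies to all outbound component operations.
      DefaultComponentOutboundTimeoutPolicy: 5s
    circuitBreakers:
      # Keyword `App`: applies to service invocation.
      DefaultAppCircuitBreakerPolicy:
        maxRequests: 1
        timeout: 30s
        trip: consecutiveFailures > 5
```

No `targets` section is needed for these entries, since the reserved names themselves tell Dapr where each policy applies.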

## Policy hierarchy resolution

Default policies are applied if the operation being executed matches the policy type and if there is no more specific policy targeting it. For each target type (app, actor, and component), the policy with the highest priority is a Named Policy, one that targets that construct specifically.

If none exists, the policies are applied from most specific to most broad.

## How default policies and built-in retries work together

In the case of the [built-in retries]({{< ref override-default-retries.md >}}), default policies do not stop the built-in retry policies from running. Both are used together but only under specific circumstances.

For service and actor invocation, the built-in retries deal specifically with issues connecting to the remote sidecar (when needed). As these are important to the stability of the Dapr runtime, they are not disabled **unless** a named policy is specifically referenced for an operation. In some instances, there may be additional retries from both the built-in retry and the default retry policy, but this prevents an overly weak default policy from reducing the sidecar's availability/success rate.

Policy resolution hierarchy for applications, from most specific to most broad:

1. Named Policies in App Targets
2. Default App Policies / Built-In Service Retries
3. Default Policies / Built-In Service Retries

Policy resolution hierarchy for actors, from most specific to most broad:

1. Named Policies in Actor Targets
2. Default Actor Policies / Built-In Actor Retries
3. Default Policies / Built-In Actor Retries

Policy resolution hierarchy for components, from most specific to most broad:

1. Named Policies in Component Targets
2. Default Component Type + Component Direction Policies / Built-In Actor Reminder Retries (if applicable)
3. Default Component Direction Policies / Built-In Actor Reminder Retries (if applicable)
4. Default Component Policies / Built-In Actor Reminder Retries (if applicable)
5. Default Policies / Built-In Actor Reminder Retries (if applicable)

As an example, take the following solution consisting of three applications, three components, and two actor types:

Applications:

- AppA
- AppB
- AppC

Components:

- Redis Pubsub: pubsub
- Redis statestore: statestore
- CosmosDB Statestore: actorstore

Actors:

- EventActor
- SummaryActor

Below is a resiliency spec that uses both default and named policies and applies them to the targets.

```yaml
spec:
  policies:
    retries:
      # Global Retry Policy
      DefaultRetryPolicy:
        policy: constant
        duration: 1s
        maxRetries: 3

      # Global Retry Policy for Apps
      DefaultAppRetryPolicy:
        policy: constant
        duration: 100ms
        maxRetries: 5

      # Global Retry Policy for Actors
      DefaultActorRetryPolicy:
        policy: exponential
        maxInterval: 15s
        maxRetries: 10

      # Global Retry Policy for Inbound Component operations
      DefaultComponentInboundRetryPolicy:
        policy: constant
        duration: 5s
        maxRetries: 5

      # Global Retry Policy for Statestores
      DefaultStatestoreComponentOutboundRetryPolicy:
        policy: exponential
        maxInterval: 60s
        maxRetries: -1

      # Named policy
      fastRetries:
        policy: constant
        duration: 10ms
        maxRetries: 3

      # Named policy
      retryForever:
        policy: exponential
        maxInterval: 10s
        maxRetries: -1

  targets:
    apps:
      appA:
        retry: fastRetries

      appB:
        retry: retryForever

    actors:
      EventActor:
        retry: retryForever

    components:
      actorstore:
        retry: fastRetries
```

The table below is a breakdown of which policies are applied when attempting to call the various targets in this solution.

| Target             | Policy Used                                       |
| ------------------ | ------------------------------------------------- |
| AppA               | fastRetries                                       |
| AppB               | retryForever                                      |
| AppC               | DefaultAppRetryPolicy / DaprBuiltInServiceRetries |
| pubsub - Publish   | DefaultRetryPolicy                                |
| pubsub - Subscribe | DefaultComponentInboundRetryPolicy                |
| statestore         | DefaultStatestoreComponentOutboundRetryPolicy     |
| actorstore         | fastRetries                                       |
| EventActor         | retryForever                                      |
| SummaryActor       | DefaultActorRetryPolicy                           |

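
The component hierarchy described earlier can be sketched the same way. In this illustrative spec (durations and retry counts are arbitrary, and no named component targets are defined), four default retry policies of decreasing specificity are declared. The comments note which operations each one would be expected to cover under the resolution order for components.

```yaml
spec:
  policies:
    retries:
      # Level 2: component type + direction.
      # Expected to cover outbound pubsub (publish) operations.
      DefaultPubsubComponentOutboundRetryPolicy:
        policy: constant
        duration: 500ms
        maxRetries: 5

      # Level 3: direction only.
      # Expected to cover outbound operations on other component types,
      # such as state stores and bindings.
      DefaultComponentOutboundRetryPolicy:
        policy: constant
        duration: 1s
        maxRetries: 3

      # Level 4: all components.
      # Expected to cover inbound component operations, since no
      # inbound-specific default is defined here.
      DefaultComponentRetryPolicy:
        policy: constant
        duration: 2s
        maxRetries: 3

      # Level 5: global default.
      # Expected to cover everything else, such as service and actor invocation.
      DefaultRetryPolicy:
        policy: constant
        duration: 5s
        maxRetries: 1
```

Adding a named policy under `targets.components` for a specific component would take precedence over all four defaults for that component.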

## Next steps

[Learn how to override default retry policies.]({{< ref override-default-retries.md >}})

## Related links

Try out one of the Resiliency quickstarts:
- [Resiliency: Service-to-service]({{< ref resiliency-serviceinvo-quickstart.md >}})
- [Resiliency: State Management]({{< ref resiliency-state-quickstart.md >}})
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
---
type: docs
title: "Retry and back-off resiliency policies"
linkTitle: "Retries"
weight: 20
description: "Configure resiliency policies for retries and back-offs"
---
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
---
type: docs
title: "Override default retry resiliency policies"
linkTitle: "Override default retries"
weight: 20
description: "Learn how to override the default retry resiliency policies for specific APIs"
---

Dapr provides [default retries]({{< ref default-policies.md >}}) for any unsuccessful request, such as failures and transient errors. Within a resiliency spec, you have the option to override Dapr's default retry logic by defining policies with reserved, named keywords. For example, defining a policy with the name `DaprBuiltInServiceRetries` overrides the default retries for failures between sidecars via service-to-service requests. Policy overrides are not applied to specific targets.

> Note: Although you can override default values with more robust retries, you cannot override with lesser values than the provided default value, or completely remove default retries. This prevents unexpected downtime.

Below is a table that describes Dapr's default retries and the policy keywords to override them:

| Capability | Override Keyword | Default Retry Behavior | Description |
| ---------- | ---------------- | ---------------------- | ----------- |
| Service Invocation | DaprBuiltInServiceRetries | Per-call retries are performed with a backoff interval of 1 second, up to a threshold of 3 times. | Sidecar-to-sidecar requests (a service invocation method call) that fail and result in a gRPC code `Unavailable` or `Unauthenticated` |
| Actors | DaprBuiltInActorRetries | Per-call retries are performed with a backoff interval of 1 second, up to a threshold of 3 times. | Sidecar-to-sidecar requests (an actor method call) that fail and result in a gRPC code `Unavailable` or `Unauthenticated` |
| Actor Reminders | DaprBuiltInActorReminderRetries | Per-call retries are performed with an exponential backoff with an initial interval of 500ms, up to a maximum of 60s, for a duration of 15 minutes. | Requests that fail to persist an actor reminder to a state store |
| Initialization Retries | DaprBuiltInInitializationRetries | Per-call retries are performed 3 times with an exponential backoff, an initial interval of 500ms, and for a duration of 10s. | Failures when making a request to an application to retrieve a given spec. For example, failure to retrieve a subscription, component, or resiliency specification |

The resiliency spec example below shows overriding the default retries for _all_ service invocation requests by using the reserved, named keyword `DaprBuiltInServiceRetries`.

Also defined is a retry policy called `retryForever` that is only applied to the appB target. appB uses the `retryForever` retry policy, while all other service invocation failures are retried using the overridden `DaprBuiltInServiceRetries` default policy.

```yaml
spec:
  policies:
    retries:
      DaprBuiltInServiceRetries: # Overrides default retry behavior for service-to-service calls
        policy: constant
        duration: 5s
        maxRetries: 10

      retryForever: # A user defined retry policy replaces default retries. Targets rely solely on the applied policy.
        policy: exponential
        maxInterval: 15s
        maxRetries: -1 # Retry indefinitely

  targets:
    apps:
      appB: # app-id of the target service
        retry: retryForever
```
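
The same mechanism applies to the other override keywords in the table. As a sketch, the fragment below overrides the built-in actor retries; the interval and retry count are illustrative values chosen to exceed the documented default of 3 retries.

```yaml
spec:
  policies:
    retries:
      DaprBuiltInActorRetries: # Overrides default retry behavior for actor method calls between sidecars
        policy: constant
        duration: 2s           # illustrative value
        maxRetries: 5          # illustrative value, above the default of 3
```

As with the service invocation override, no `targets` entry is involved, because the reserved policy name alone determines where the override takes effect.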

## Related links

Try out one of the Resiliency quickstarts:
- [Resiliency: Service-to-service]({{< ref resiliency-serviceinvo-quickstart.md >}})
- [Resiliency: State Management]({{< ref resiliency-state-quickstart.md >}})
