Project SIMMY: Support for failure / chaos injection policies to test resilience of the system #499

mebjas · 2018-09-14T13:11:54Z

Having this library helps me introduce resilience to my project, but I don't want expected or unexpected failures to occur to test it out

The library could be wrongly implemented, testing the scenarios is not straight forward
Mocking failure of a dependency for example a cloud SaaS / PaaS service is not always provided straight away.

Describe your proposed or preferred solution:

A way to mock failures of dependencies (any service dependency for example).
This should be strict framework that takes care of when to fail based on some external factors - maybe global configuration or some rule.
Way to revert easily controlling the blast radius
Production grade to run this in production system with automation.

Describe any alternative options you've considered:
Doing it myself, manually disabling the service which supports that and checking if it really breaks the circuit and moves to fallback.

Any additional info?
We can definitely take learning from Netflix Simian Army.

reisenberger · 2018-09-16T21:09:02Z

Interesting question and ideas @mebjas - I had similar ideas in the past (as discussed with @joelhulen / @jhalterman / others) that the Polly architecture could be used for fault-injection.

Obviously there are established solutions in the market for mocking third-party endpoints, like Mountebank or Wiremock; and likewise many tools for disrupting traffic/terminating instances etc at the broader container or cloud infrastructure layer. But fault-injection within Polly could perhaps complement and serve different use cases:

Injecting faults via Polly by definition would happen in-process. This means no external infrastructure/tool/http-interception needed.
For the dev/test phase, easy to mock failure of a dependency to test the envisaged Polly strategy or other response.
Not all services are amenable to http mocking through tools such as Mountebank or Wiremock, eg if you are calling some Azure/AWS service through the AWS or Azure SDK.
Simulating failure of a dependency in process does allow you to simulate that failure for a specific consumer/group-of-consumers without actually killing the real dependency or necessarily killing it for all consumers.

It would be relatively easy to implement in Polly - ideas coming in a follow-up post.

reisenberger · 2018-09-16T21:14:05Z

Fault injection proposal for Polly

For non-generic Polly policies:

var chaosPolicy = Policy.InjectFault(
    Exception | Func<Context, Exception> fault,  // The fault to inject
    decimal | Func<Context, decimal> injectionRate, // A decimal between 0 and 1 inclusive.  The policy will inject the fault, randomly, that proportion of the time, eg 0.01 means inject the fault randomly 1% of the time.
    Func<Context, bool> enabled // Faults are only injected when returns true.
);

By decimal | Func<decimal>, I mean an overload offering this parameter simply as decimal, and another overload offering it as Func<decimal>.

For generic `Policy<TResult>` policies governing executions returning `TResult`:

Same as above, but in addition to the Exception options, there would also be overloads offering the following three options for the fault parameter:

TResult fault
Func<Context, TResult> fault
Func<Context, DelegateResult<TResult>> fault
- This last one gives the ability to build an InjectFault policy which sometimes injects an exception, sometimes a return result (eg http calls may fail with exception or an http status code).

Operation

If enabled(...) returns true, and if a random number between 0 and 1 is less than injectionRate, return or throw the fault instead of executing the delegate passed to the policy for execution.
Otherwise execute the passed delegate as normal.

Usage

Typical usage would probably be to configure a InjectFault policy as the innermost part of a PolicyWrap (so it is closest to where the passed delegate would normally execute):

Policy.WrapAsync(retry, breaker, timeout, faultInject)

When InjectFault didn't inject a fault but merely acted as a pass-through, its performance should be comparable to NoOpPolicy: this has been benchmarked as adding approx 1microsecond (1 millionth of a second) to a call.

It would also be possible to use dynamic configuration during running to include or exclude the InjectFaultPolicy altogether, if desired.

The Func<Context, ...> signatures of all parameters would allow fine-grained control of instances or calls participating in the simulated faults, and indeed of the simulated faults themselves. Polly.Context can be set up with any custom input data; this could be eg drawn from configuration or environment variables, to control when, whether and how instances participate in the simulation.

Thoughts?

What do people think? How can we improve on this proposal?

I am going to mark this 'up-for-grabs' as it is very simple to implement following existing Polly patterns (great way for others to get involved in the proj), and I want to focus my Polly time on features such as Distributed Breaker which depend more on existing deep Polly knowledge. This would be an awesome feature for a new contributor to get public acknowledgment!

The parts to implement a policy are described here. A simple way to implement this would be to copy-and-rename all the parts for NoOpPolicy and then start adding the functionality described above.

reisenberger · 2018-09-16T21:20:19Z

Latency injection proposal

Another obvious failure injection would be latency:

var latencyPolicy = Policy.InjectLatency(
    TimeSpan | Func<Context, Timespan> latency, // The latency to inject.
    decimal | Func<Context, decimal> injectionRate, // A decimal between 0 and 1 (inclusive).  The policy will inject the latency, randomly, that proportion of the time, eg 0.01 means inject latency 1% of the time.
    Func<Context, bool> enabled // Latency is only injected when returns true.
);

Operation

If enabled returns true, and if a random number between 0 and 1 is less than injectionRate, pause for the specified latency.
Then execute the passed delegate as normal.

Questions

Should the policy pause, then execute the passed delegate? Or should it execute the passed delegate, obtain the result, then pause before returning the result or failure? Are there any use cases where it makes any difference?

mebjas · 2018-09-17T04:09:24Z

@reisenberger Thanks for response! I think I can take this up if that's how it works.

mebjas · 2018-09-17T07:45:11Z

Latency injection proposal

Another obvious failure injection would be latency:
var latencyPolicy = Policy.InjectLatency(
    TimeSpan | Func<Context, Timespan> latency, // The latency to inject.
    decimal | Func<Context, decimal> injectionRate, // A decimal between 0 and 1 (inclusive).  The policy will inject the latency, randomly, that proportion of the time, eg 0.01 means inject latency 1% of the time.
    Func<Context, bool> enabled // Latency is only injected when returns true.
);
Operation

If enabled returns true, and if a random number between 0 and 1 is less than injectionRate, pause for the specified latency.

Then execute the passed delegate as normal.

Isn't this just another type of fault. If the fault could be passed as an argument we need not write individual classes for each then. Latency injection can be shared as an example.

Questions

Should the policy pause, then execute the passed delegate? Or should it execute the passed delegate, obtain the result, then pause before returning the result or failure? Are there any use cases where it makes any difference?

Both the cases might be needed. While, before the operation makes more sense to me and use case in my mind. If I were to wrap it around the DB call for example - I might be looking at a timeout or so with operation failure. While if the latency is injected post execution that would mean - actual operation succeeded while the upper layer timed out thus forcing me to handle more scenarios. Which ain't bad either 🤕

vany0114 · 2018-09-17T16:41:14Z

Hey guys, I would like to know if someone is working on this, if don't I would like to help on this, then you can assign to me this task. 😃

reisenberger · 2018-09-17T20:03:01Z

@vany0114 @mebjas I think you have both volunteered to work on this. Is there the option to maybe take one policy (of InjectFault; InjectLatency) each, and review each other's contribution? (I will review too). Or find some other way to collaborate? (Open to suggestions...). Thank you both for your enthusiasm!

vany0114 · 2018-09-17T20:13:55Z

Of course, I like that idea, so let @mebjas decide which policy wants to work on since he's the owner of the issue, then I could work on the other issue!

mebjas · 2018-09-17T20:43:39Z

@reisenberger @vany0114 If they were supposed to be different policies I could go with InjectFault. As of now I just started pushing to this branch - https://github.com/mebjas/Polly/tree/dev-mebjas/src/Polly.Shared/InjectFault

I have a doubt though, how is InjectFault different from InjectLatency?
If the definition looked like following:

var chaosPolicy = Policy.InjectFault(
    Exception | Func<Context, Exception> fault | Action<Context> fault,  // The fault to inject
    decimal | Func<Context, decimal> injectionRate, // A decimal between 0 and 1 inclusive.  The policy will inject the fault, randomly, that proportion of the time, eg 0.01 means inject the fault randomly 1% of the time.
    Func<Context, bool> enabled // Faults are only injected when returns true.
);

I could use it as following for example to inject latency.

Policy.InjectFault(
    (ctx) => { Thread.Sleep(5000); },
    0.2,
    (ctx) => { return true; }
);

In this way we could define different kinds of faults within the delegate fault itself.

reisenberger · 2018-09-17T20:54:51Z

@mebjas Very interesting comment above - I was typing my generalisation/abstraction idea simultaneously. I will still post it (below) and we can all consider further.

@mebjas This is a great question! :

Isn't this [latency] just another type of fault. If the fault could be passed as an argument we need not write individual classes for each then.

We could consider two separate questions:

(a) is the concept of Policy.InjectLatency(...) a primary-enough concept in chaos-engineering, that it is worth providing this direct syntax for?
(b) is there enough commonality of functionality, that we can use a common class (or abstract base class) for the features discussed in this thread?

Re (a), my thought was that latency-injection is a primary-enough concept that it is worth providing some simple/obvious public syntax, even if we (b) generalise the implementation internally.

Re (b), I had the thought yesterday that we could generalise both cases by abstracting out the general injection concept, something like:

    Policy.InjectBehaviour( // Or: InjectCustom // Or just: Inject
        Func<Context, CancellationToken, (DelegateResult<TResult>, bool)> preExecute, 
            // The DelegateResult<TResult> output parameter of preExecute is an _optional_ substitute result (which would replace executing the passed delegate). 
            // The bool output parameter of preExecute is whether to use this replacement result, or call as normal the delegate which was passed to .Execute/Async(...).
        Func<Context, CancellationToken, DelegateResult<TResult>, DelegateResult<TResult>> postExecute, 
            // The input DelegateResult<TResult> parameter is the result from executing the delegate passed for execution by .Execute/Async(...). 
            // The output DelegateResult<TResult> parameter ... allows postExecute to augment, extend, or even replace the original obtained result.
        Func<Context, bool> isOperative
           // ... encapsulates any logic (however complex) 
    );

Maybe it's easier to communicate that the synchronous implementation (BehaviourInjectionPolicyEngine) would be something like:

internal static TResult Implementation<TResult>(Func<Context, CancellationToken, TResult> action, Context context, CancellationToken cancellationToken, Func<Context, CancellationToken, (DelegateResult<TResult>, bool)> preExecute, Func<Context, CancellationToken, DelegateResult<TResult>, DelegateResult<TResult>> postExecute,  Func<Context, bool> isOperative)
{
    if (!isOperative(context)) return action(context, cancellationToken);

    (DelegateResult<TResult> result, bool substitute) preExecuteResult = preExecute(context, cancellationToken);

    if (preExecuteResult.substitute) return preExecuteResult.result;

    DelegateResult<TResult> normalExecutionResult = action(context, cancellationToken);

    return postExecute(context, cancellationToken, normalExecutionResult);
}

This builds a generalised pre-execution-injection and post-execution-injection architecture, which users could repurpose for any means they want (if we exposed it publicly, not just as a hidden abstract base-class). Maybe they want to use postExecute to augment results returned from a particular call. Maybe they want some custom logging in postExecute without having to use FallbackPolicy as a logging workaround.

I don't have an immediate preference - first, just sharing ideas, and we can all reflect.

reisenberger · 2018-09-17T21:10:15Z

( Apologies: poss some mistakes in syntax around whether some of the return values are DelegateResult<TResult> or just TResult in the example above, but it's to illustrate the concept. I can correct tomorrow. )

mebjas · 2018-09-17T21:34:42Z

@reisenberger

(a) is the concept of Policy.InjectLatency(...) a primary-enough concept in chaos-engineering, that it is worth providing this direct syntax for?
(b) is there enough commonality of functionality, that we can use a common class (or abstract base class) for the features discussed in this thread?

Re (a), my thought was that latency-injection is a primary-enough concept that it is worth providing some simple/obvious public syntax, even if we (b) generalise the implementation internally.

Thinking of it like this, yes it makes sense. Policy.InjectLatency(...) could be an example implementation of Policy.InjectFault(...) . Thus making both options available as first class citizens within Polly. Or they could both share common execution.

Re (b), I had the thought yesterday that we could generalise both cases by abstracting out the general injection concept, something like:

While this seem to be best generalization which cover both of the above and give users ability to inject any behavior before or after the real delegate execution, is this the real use case of chaos engineering here to inject any piece of code? Would it not lead to wrong usage? Exposing Policy.InjectLatency(...), Policy.InjectFault(...) would ensure focused usage.

Has this kind of generalization been used for any other concepts implemented in Polly?

vany0114 · 2018-09-17T21:34:59Z

@reisenberger I like the first option:

latency-injection is a primary-enough concept that it is worth providing some simple/obvious public syntax, even if we (b) generalize the implementation internally.

I think the generalization is a great idea, but I also think an API should be descriptive enough, having said that, the latency should be considered as an exception itself? I don't think so. I think we should separate both concerns even if we gonna call/use the same base class or something internally, for both InjectFault and InjectLatency

mebjas · 2018-09-19T16:21:30Z

@reisenberger @vany0114 Added code for InjectFault policy here. Added support for

Action<Context> fault

in InjectFaultEngine.cs which can be used to build InjectLatency policy.

@reisenberger
What all need to be done before sending a PR here. I see unit tests need to be written. Is Readme and wiki changes also needed?
Also, please have a look and tell if I have set of on wrong course code wise.

vany0114 · 2018-09-19T16:39:37Z

thanks a lot, @mebjas!, so, @reisenberger, should I wait for mebjas code will be merged in order to start working on InjectLatency and make my PR?

reisenberger · 2018-09-19T17:00:23Z

Thanks @mebjas @vany0114 both. I will review the code pushed so far and then advise further!

bartelink · 2018-09-20T01:47:30Z

As I read this, I the name Policy.Monkey is jumping out as the name for the high level InjectXxx driver API - alongside that, obviously nodding to and/or sharing names and doc with Simian Army equivalents should be a primary part of the design and naming in order to make it all just make sense without having to read or write docs.

mebjas · 2018-09-20T05:01:42Z

As I read this, I the name Policy.Monkey is jumping out as the name for the high level InjectXxx driver API - alongside that, obviously nodding to and/or sharing names and doc with Simian Army equivalents should be a primary part of the design and naming in order to make it all just make sense without having to read or write docs.

Make sense to me. Just correct me if I am wrong - Your suggestion is to have Policy.Monkey to be the base and we write Policy.InjectXXX on top of it right?

bartelink · 2018-09-20T06:27:15Z

Policy.InjectBehaviour( // Or: InjectCustom // Or just: Inject

Sorry should have been clearer; I feel the top level function and namespace can be such that you write Policy.Handle<X>.Monkey(...) or maybe .Chaos - you're monkeying with the call. Also namespace Polly.Monkey can house specific implementation artifacts.

Func<Context, bool> isOperative

Also, for any chaos method, I'd say the rate parameter and/or decision-to-apply-this-time function should go near the front of the list of args (assuming this fits with other APIs - I don't know the feel of the DSL well). I have given little thought, but assume you'd have a rate and an optional shouldApply which defaults to ()=>true at the desired rate etc.

(this is thinking aloud - I've not looked at a chaos suite or DSL for good ways to do it)

Perhaps Monkey Can control rate and shouldApply and then invoke a Func<context....>, which then gives you access to a mini-DSL which lets you do specific behaviors ?

Policy.Monkey( .1, new [] { Polly.Monkey.Latency(() => <timespan>), Polly.Monkey.Mangle(result => update(result)) } )

or

var latency = Polly.Monkey.Latency(<specify range in various ways>);
var mangle = Polly.Monkey.Mangle<TResult>(<specify handler in various ways>);
var chaos =
    Policy.Monkey( ctx => !ctx.ContainsKey["DisableChaos"], factor =>
    {
       if (factor < .95) return null; /// to signal no chaos this time
       else if (factor < .99) latency
       else mangle
   })....   
// alternately
var chaos2 =
    Policy.Monkey( (ctx, factor) =>
    {
       if (factor < .95 || !ctx.ContainsKey["DisableChaos"]) return null; /// to signal no chaos this time
       else if (factor < .99) return latency;
       else return mangle;
   })....   
} )

Also maybe a .Monkeys overload can return an enumerable of IMonkeys to apply for people who want to admit the potential of >1

The other thing that's important (to me at least!), is to have a way in which to examine the monkey(s) selected for application (and perhaps even post-application) in order that I can log the ones applied in some meaningful way

This should also mesh well with being able to neatly plug in and/or test monkeys you add (or provide custom ones)

declaration: the above is coIoured by me being in the middle of thinking about (and implementing) a call policy wrapping lib (which I intend to open source after bedding into our systems) which uses Polly at its heart. The body of my .Monkey selector method would be driven off a table in a config DSL external to the app. In a yamly type syntax, the policy might define sets of monkeys (tribes?!) which can be applied to actions:-

tribe general:
   rate: .10, type: addLatency(min:5ms, max: 2s, p95: .2s)
   rate: .01, type: badJson,
   rate: .01, type: noBody
   rate: .01, type: timeout
   rate: .01, type: isolate
   rate: .01, type: brokenCircuit
tribe latencyOnly:
   rate: .10, type: addLatency(min:5ms, max: 2s, p95: .2s)
   rate: .01, type: timeout

actions:
  viewCart
    monkeys: general
    logging: whenDebug
  addToCart
    monkeys: general 
  placeOrder
    monkeys: latencyOnly
    logging: max

I'd parse those into a Discriminated Union which holds the values and would then be looking to map each one to constructing an IMonkey, and use the logging to control how much to do.

Also for the Latency one - the latency should probably be configurable to include or exclude the time the actual underlying call took (or whether the circuit is isolated or broken etc).

Now, before I go off the deep end, can someone please provide links to some sensible prior art for this triangular wheel I'm reinventing ;)

reisenberger · 2018-09-20T07:56:05Z

Liking a lot of the thinking in / behind this, @bartelink ! (Some things we mby have slightly different style to do in Polly, but I can cover that off, it's awesome just to hear all the ideas/intent so that we can aim to meet them and/or doco how to achieve.) Longer response to follow ...

Thanks for all the awesome input/contribs on this thread folks - keep it coming!

reisenberger · 2018-09-25T07:43:13Z

@mebjas @vany0114 @bartelink Thanks again for all the great input/contributions! I am on the road rest of week but will aim to look at this one of the evenings if events allow.

mebjas · 2018-09-26T18:50:46Z

@bartelink Thanks for describing it in such details. As @reisenberger said there are certain ways to how things are implemented in Polly, adhering to that we can implement this in layers. Leaving the method names aside it seem every one agree the fault injection policy shall be governed by the following parameters itself

fault - Exception | Func <Context, Exception> | Action<Context> ...
rate - Decimal | Func <Context, Decimal>
isOperative (or enabled) - Func <Context, Decimal>

@reisenberger Is it ok if I send a PR to the dev branch after completing everything required by the code of conduct here and have further discussions over there?

bartelink · 2018-09-29T01:14:36Z

@mebjas If you do post a dev branch WIP PR, I'd be interested in seeing the API you're working on taking shape - I can't promise I'll have any deep insight, but am happy to provide any feedback I can (FYI #496 contains links to a sneak peak of the library I hinted at. I've yet to commence work on fleshing an example format whereby one'd declare chaos rules alongside normal processing rules per Action (that aspect of the code is very far from polished atm, I'm hoping to address that some time next week)).

mebjas · 2018-10-02T09:11:55Z

@reisenberger @bartelink - check this out, I haven't written the tests yet but structured the code like this.
PR - #508

reisenberger · 2018-10-02T09:28:12Z

Thanks @mebjas ! Now looking at this.

reisenberger · 2018-10-02T21:40:41Z

Thanks again @mebjas for #508 ! (comments posted there).

@mebjas @bartelink @vany0114 / anyone: suggest we let @mebjas (and @vany0114 ?) get this to fully functionally tested, then return to any more philosophical questions about scope, naming, abstract base classes etc from this thread ... (tho all detailed comment/review/sugg re WIP on #508 welcome in the meantime!). Thanks.

vany0114 · 2018-10-02T21:50:05Z

sure, let me take a look at @mebjas PR. I will happy to help with this 😃

reisenberger · 2018-10-03T08:56:57Z

@vany0114 Cool. Probably worth letting @mebjas finish his part before beginning yours.

mebjas · 2018-10-14T14:46:01Z

@reisenberger @vany0114

I have covered 3/4 checks. Including the code and unit tests. Please have a look and let me know if anything else is needed.

reisenberger · 2018-10-14T18:28:02Z

Thank you @mebjas ! I will review as soon as possible.

reisenberger · 2018-12-02T10:36:11Z

Cross-posting to catch anybody who has contributed to / is following / later reads this thread: we propose to promote the chaos-engineering dimension to its own package.

reisenberger · 2019-01-20T18:07:04Z

Anyone interested in the chaos-engineering functionality (/cc @martincostello / @bartelink / anyone!), please see the new Simmy repo where the functionality is being finalised.

Comment welcome on any of the Issues there! A few issues represent remaining minor API enhancements I would like to get in place before a v0.1 release.

Some issues shape the API and there is some particular request for community feedback.

reisenberger added the feature label Sep 14, 2018

reisenberger added the up-for-grabs label Sep 16, 2018

mebjas mentioned this issue Oct 1, 2018

[WIP-DO NOT MERGE] Add InjectFault and InjectLatency features #508

Closed

4 tasks

reisenberger added the in progress label Oct 2, 2018

reisenberger mentioned this issue Oct 20, 2018

[Unexpected timeouts observed against local Azure Function cold-starts] #518

Closed

reisenberger added ready and removed in progress labels Dec 30, 2018

reisenberger removed the up-for-grabs label Jan 20, 2019

reisenberger added this to the v7.0.0 milestone Feb 3, 2019

reisenberger mentioned this issue Feb 6, 2019

Update AddTransientHttpErrorPolicy once Polly v7 is released dotnet/extensions#1058

Closed

mebjas mentioned this issue Mar 1, 2019

Initial README.md Polly-Contrib/Simmy#28

Merged

reisenberger changed the title ~~Support for failure / chaos injection policies to test resilience of the system~~ Project SIMMY: Support for failure / chaos injection policies to test resilience of the system Mar 2, 2019

reisenberger mentioned this issue Mar 2, 2019

[WIP-DO NOT MERGE] Chaos and Fault Injection polices #553

Closed

reisenberger closed this as completed Jun 27, 2019

Project SIMMY: Support for failure / chaos injection policies to test resilience of the system #499

Project SIMMY: Support for failure / chaos injection policies to test resilience of the system #499

Comments

mebjas commented Sep 14, 2018

reisenberger commented Sep 16, 2018

reisenberger commented Sep 16, 2018

Fault injection proposal for Polly

For non-generic Polly policies:

For generic Policy<TResult> policies governing executions returning TResult:

Operation

Usage

Thoughts?

reisenberger commented Sep 16, 2018

Latency injection proposal

Operation

Questions

mebjas commented Sep 17, 2018

mebjas commented Sep 17, 2018 • edited Loading

Latency injection proposal

Operation

Questions

vany0114 commented Sep 17, 2018 • edited Loading

reisenberger commented Sep 17, 2018 • edited Loading

vany0114 commented Sep 17, 2018

mebjas commented Sep 17, 2018

reisenberger commented Sep 17, 2018

reisenberger commented Sep 17, 2018

mebjas commented Sep 17, 2018

vany0114 commented Sep 17, 2018 • edited Loading

mebjas commented Sep 19, 2018 • edited Loading

vany0114 commented Sep 19, 2018

reisenberger commented Sep 19, 2018

bartelink commented Sep 20, 2018

mebjas commented Sep 20, 2018

bartelink commented Sep 20, 2018 • edited Loading

reisenberger commented Sep 20, 2018

reisenberger commented Sep 25, 2018

mebjas commented Sep 26, 2018

bartelink commented Sep 29, 2018

mebjas commented Oct 2, 2018

reisenberger commented Oct 2, 2018

reisenberger commented Oct 2, 2018

vany0114 commented Oct 2, 2018

reisenberger commented Oct 3, 2018

mebjas commented Oct 14, 2018

reisenberger commented Oct 14, 2018

reisenberger commented Dec 2, 2018

reisenberger commented Jan 20, 2019

For generic `Policy<TResult>` policies governing executions returning `TResult`:

mebjas commented Sep 17, 2018 •

edited

Loading

vany0114 commented Sep 17, 2018 •

edited

Loading

reisenberger commented Sep 17, 2018 •

edited

Loading

vany0114 commented Sep 17, 2018 •

edited

Loading

mebjas commented Sep 19, 2018 •

edited

Loading

bartelink commented Sep 20, 2018 •

edited

Loading