Project SIMMY: Support for failure / chaos injection policies to test resilience of the system #499
Comments
Interesting question and ideas @mebjas - I had similar ideas in the past (as discussed with @joelhulen / @jhalterman / others) that the Polly architecture could be used for fault-injection. Obviously there are established solutions in the market for mocking third-party endpoints, like Mountebank or Wiremock; and likewise many tools for disrupting traffic/terminating instances etc at the broader container or cloud infrastructure layer. But fault-injection within Polly could perhaps complement and serve different use cases:
It would be relatively easy to implement in Polly - ideas coming in a follow-up post. |
Fault injection proposal for Polly
For non-generic Polly policies:
var chaosPolicy = Policy.InjectFault(
    Exception | Func<Context, Exception> fault, // The fault to inject.
    decimal | Func<Context, decimal> injectionRate, // A decimal between 0 and 1 inclusive. The policy will inject the fault, randomly, that proportion of the time, e.g. 0.01 means inject the fault randomly 1% of the time.
    Func<Context, bool> enabled // Faults are only injected when this returns true.
);
By For generic
|
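To make the proposed semantics concrete, here is a minimal standalone sketch of the injection decision. It is not Polly code: `FaultInjectionSketch` and its `Execute` method are hypothetical names, `Context` is left out, `double` stands in for the proposal's `decimal`, and the `Random` instance is passed in only so the behaviour can be tested.

```csharp
using System;

// Hypothetical helper, not part of Polly. It mirrors the three parameters
// named in the proposal: fault, injectionRate and enabled.
public static class FaultInjectionSketch
{
    public static TResult Execute<TResult>(
        Func<TResult> action,
        Exception fault,
        double injectionRate,
        Func<bool> enabled,
        Random random)
    {
        if (injectionRate < 0 || injectionRate > 1)
            throw new ArgumentOutOfRangeException(nameof(injectionRate));

        // NextDouble() returns a value in [0, 1), so a rate of 0 never
        // injects and a rate of 1 always injects.
        if (enabled() && random.NextDouble() < injectionRate)
            throw fault;

        // No fault injected: run the wrapped delegate as normal.
        return action();
    }
}
```

With a rate of `0.01`, roughly one call in a hundred would throw the fault instead of running the delegate, and setting `enabled` to return `false` switches the injection off without removing the wrapper.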
Latency injection proposal
Another obvious failure injection would be latency:
var latencyPolicy = Policy.InjectLatency(
    TimeSpan | Func<Context, TimeSpan> latency, // The latency to inject.
    decimal | Func<Context, decimal> injectionRate, // A decimal between 0 and 1 (inclusive). The policy will inject the latency, randomly, that proportion of the time, e.g. 0.01 means inject latency 1% of the time.
    Func<Context, bool> enabled // Latency is only injected when this returns true.
);
Operation
Questions
|
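A similar standalone sketch for latency, assuming the delay is applied before the wrapped delegate runs. Again this is not Polly code: `LatencyInjectionSketch` is a hypothetical name, `Context` is omitted, `double` replaces `decimal`, and the `delay` callback is an assumption added so that tests can observe the delay instead of actually sleeping.

```csharp
using System;

// Hypothetical helper, not part of Polly. The delay is injected before the
// wrapped delegate executes; injecting it afterwards is the other option.
public static class LatencyInjectionSketch
{
    public static TResult Execute<TResult>(
        Func<TResult> action,
        TimeSpan latency,
        double injectionRate,
        Func<bool> enabled,
        Random random,
        Action<TimeSpan> delay)
    {
        if (injectionRate < 0 || injectionRate > 1)
            throw new ArgumentOutOfRangeException(nameof(injectionRate));

        // Same random gate as for fault injection: NextDouble() is in [0, 1).
        if (enabled() && random.NextDouble() < injectionRate)
            delay(latency);

        // The delegate always runs; only its start is postponed.
        return action();
    }
}
```

In real use the `delay` argument could simply be `System.Threading.Thread.Sleep`, whose `TimeSpan` overload matches `Action<TimeSpan>`.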
@reisenberger Thanks for response! I think I can take this up if that's how it works. |
Isn't this just another type of fault? If the fault could be passed as an argument, we would not need to write individual classes for each one. Latency injection can be shared as an example.
Both cases might be needed, though injecting before the operation makes more sense to me and matches the use case I have in mind. If I were to wrap it around a DB call, for example, I might be looking at a timeout together with an operation failure. If the latency is injected after execution, the actual operation has succeeded while the upper layer has timed out, which forces me to handle more scenarios. Which ain't bad either 🤕 |
Hey guys, I would like to know if someone is working on this. If not, I would like to help, and you can assign this task to me. 😃 |
@vany0114 @mebjas I think you have both volunteered to work on this. Is there the option to maybe take one policy (of InjectFault; InjectLatency) each, and review each other's contribution? (I will review too). Or find some other way to collaborate? (Open to suggestions...). Thank you both for your enthusiasm! |
Of course, I like that idea. Let @mebjas decide which policy he wants to work on, since he's the owner of the issue, and then I can work on the other one! |
@reisenberger @vany0114 If they were supposed to be different policies, I could go with InjectFault. For now I have just started pushing to this branch - https://github.com/mebjas/Polly/tree/dev-mebjas/src/Polly.Shared/InjectFault I have a doubt though, how is
var chaosPolicy = Policy.InjectFault(
    Exception | Func<Context, Exception> fault | Action<Context> fault, // The fault to inject.
    decimal | Func<Context, decimal> injectionRate, // A decimal between 0 and 1 inclusive. The policy will inject the fault, randomly, that proportion of the time, e.g. 0.01 means inject the fault randomly 1% of the time.
    Func<Context, bool> enabled // Faults are only injected when this returns true.
);
I could use it as follows, for example, to inject latency:
Policy.InjectFault(
    (ctx) => { Thread.Sleep(5000); },
    0.2,
    (ctx) => { return true; }
);
In this way we could define different kinds of faults within the delegate. |
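The idea of passing the fault as a delegate can be sketched without Polly as follows. `InjectActionSketch` is a hypothetical name, the `Context` argument of the proposed `Action<Context>` is dropped for brevity, and `double` replaces `decimal`; the point is only that one entry point can cover both exceptions and latency.

```csharp
using System;

// Hypothetical helper, not part of Polly. The injected behaviour is an
// arbitrary Action, so it may throw (a fault) or block (latency).
public static class InjectActionSketch
{
    public static TResult Execute<TResult>(
        Func<TResult> action,
        Action behaviour,
        double injectionRate,
        Func<bool> enabled,
        Random random)
    {
        // Run the injected behaviour that proportion of the time, while enabled.
        if (enabled() && random.NextDouble() < injectionRate)
            behaviour();

        // Reached only if the behaviour did not throw.
        return action();
    }
}
```

Passing `() => { throw new TimeoutException("injected"); }` as `behaviour` turns this into fault injection, while passing `() => System.Threading.Thread.Sleep(5000)` turns it into latency injection, which is the equivalence the comment above describes.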
@mebjas Very interesting comment above - I was typing my generalisation/abstraction idea simultaneously. I will still post it (below) and we can all consider further. @mebjas This is a great question! :
We could consider two separate questions: (a) is the concept of
Re (a), my thought was that latency-injection is a primary-enough concept that it is worth providing some simple/obvious public syntax, even if we (b) generalise the implementation internally.
Re (b), I had the thought yesterday that we could generalise both cases by abstracting out the general injection concept, something like:
Policy.InjectBehaviour( // Or: InjectCustom // Or just: Inject
    Func<Context, CancellationToken, (DelegateResult<TResult>, bool)> preExecute,
    // The DelegateResult<TResult> output parameter of preExecute is an _optional_ substitute result (which would replace executing the passed delegate).
    // The bool output parameter of preExecute is whether to use this replacement result, or call as normal the delegate which was passed to .Execute/Async(...).
    Func<Context, CancellationToken, DelegateResult<TResult>, DelegateResult<TResult>> postExecute,
    // The input DelegateResult<TResult> parameter is the result from executing the delegate passed for execution by .Execute/Async(...).
    // The output DelegateResult<TResult> parameter ... allows postExecute to augment, extend, or even replace the original obtained result.
    Func<Context, bool> isOperative
    // ... encapsulates any logic (however complex)
);
Maybe it's easier to communicate that the synchronous implementation (
This builds a generalised pre-execution-injection and post-execution-injection architecture, which users could repurpose for any means they want (if we exposed it publicly, not just as a hidden abstract base-class). Maybe they want to use I don't have an immediate preference - first, just sharing ideas, and we can all reflect. |
( Apologies: poss some mistakes in syntax around whether some of the return values are |
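A rough synchronous sketch of how such pre/post hooks might compose. This is one interpretation rather than the design proposed above: `InjectBehaviourSketch` is a hypothetical name, `Context`, `CancellationToken` and `DelegateResult<TResult>` are replaced by plain values, and it assumes that `postExecute` only sees results produced by the real delegate, not substitutes.

```csharp
using System;

// Hypothetical helper, not part of Polly. Simplified from the proposal:
// plain TResult instead of DelegateResult<TResult>, no Context or
// CancellationToken.
public static class InjectBehaviourSketch
{
    public static TResult Execute<TResult>(
        Func<TResult> action,
        Func<(TResult Substitute, bool UseSubstitute)> preExecute,
        Func<TResult, TResult> postExecute,
        Func<bool> isOperative)
    {
        // When not operative, behave as if the wrapper were absent.
        if (!isOperative())
            return action();

        // preExecute may short-circuit with a substitute result.
        var (substitute, useSubstitute) = preExecute();
        if (useSubstitute)
            return substitute;

        // Otherwise run the real delegate and let postExecute
        // augment or replace its result.
        return postExecute(action());
    }
}
```

Under these assumptions, fault injection would be a `preExecute` that throws, and latency injection would be a `preExecute` (or `postExecute`) that sleeps before handing control back.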
Thinking of it like this, yes it makes sense.
This seems to be the best generalization: it covers both of the above and gives users the ability to inject any behavior before or after the real delegate executes. But is injecting an arbitrary piece of code the real use case for chaos engineering here? Would it not lead to wrong usage? Exposing Has this kind of generalization been used for any other concepts implemented in Polly? |
@reisenberger I like the first option:
I think the generalization is a great idea, but I also think an API should be descriptive enough. Having said that, should latency be considered an exception itself? I don't think so. I think we should separate the two concerns, even if we are going to call/use the same base class or something internally, for both |
@reisenberger @vany0114 Added code for
in @reisenberger |
Thanks a lot, @mebjas! So, @reisenberger, should I wait until mebjas's code is merged before I start working on |
As I read this, I the name |
Makes sense to me. Just correct me if I am wrong - your suggestion is to have |
Sorry should have been clearer; I feel the top level function and namespace can be such that you write
Also, for any chaos method, I'd say the (this is thinking aloud - I've not looked at a chaos suite or DSL for good ways to do it) Perhaps
or
Also maybe a
The other thing that's important (to me at least!) is to have a way to examine the monkey(s) selected for application (and perhaps even post-application), so that I can log the ones applied in some meaningful way.
This should also mesh well with being able to neatly plug in and/or test monkeys you add (or provide custom ones).
Declaration: the above is coloured by me being in the middle of thinking about (and implementing) a call policy wrapping lib (which I intend to open source after bedding into our systems) which uses Polly at its heart. The body of my
I'd parse those into a Discriminated Union which holds the values and would then be looking to map each one to constructing an Also for the Latency one - the latency should probably be configurable to include or exclude the time the actual underlying call took (or whether the circuit is isolated or broken etc). Now, before I go off the deep end, can someone please provide links to some sensible prior art for this triangular wheel I'm reinventing ;) |
Liking a lot of the thinking in / behind this, @bartelink ! (Some things we mby have slightly different style to do in Polly, but I can cover that off, it's awesome just to hear all the ideas/intent so that we can aim to meet them and/or doco how to achieve.) Longer response to follow ... Thanks for all the awesome input/contribs on this thread folks - keep it coming! |
@mebjas @vany0114 @bartelink Thanks again for all the great input/contributions! I am on the road rest of week but will aim to look at this one of the evenings if events allow. |
@bartelink Thanks for describing it in such detail. As @reisenberger said, there are certain ways things are implemented in Polly; adhering to that, we can implement this in layers. Leaving the method names aside, it seems everyone agrees the fault injection policy should be governed by the following parameters
@reisenberger Is it ok if I send a PR to the dev branch after completing everything required by the code of conduct here and have further discussions over there? |
@mebjas If you do post a dev branch WIP PR, I'd be interested in seeing the API you're working on taking shape - I can't promise I'll have any deep insight, but am happy to provide any feedback I can (FYI #496 contains links to a sneak peek of the library I hinted at. I've yet to commence work on fleshing out an example format whereby one'd declare chaos rules alongside normal processing rules per Action (that aspect of the code is very far from polished atm, I'm hoping to address that some time next week)). |
@reisenberger @bartelink - check this out, I haven't written the tests yet but structured the code like this. |
Thanks @mebjas ! Now looking at this. |
Thanks again @mebjas for #508 ! (comments posted there). @mebjas @bartelink @vany0114 / anyone: suggest we let @mebjas (and @vany0114 ?) get this to fully functionally tested, then return to any more philosophical questions about scope, naming, abstract base classes etc from this thread ... (tho all detailed comment/review/sugg re WIP on #508 welcome in the meantime!). Thanks. |
Sure, let me take a look at @mebjas's PR. I will be happy to help with this 😃 |
I have covered 3/4 checks, including the code and unit tests. Please have a look and let me know if anything else is needed. |
Thank you @mebjas ! I will review as soon as possible. |
Cross-posting to catch anybody who has contributed to / is following / later reads this thread: we propose to promote the chaos-engineering dimension to its own package. |
Anyone interested in the chaos-engineering functionality (/cc @martincostello / @bartelink / anyone!), please see the new Simmy repo where the functionality is being finalised. Comment welcome on any of the Issues there! A few issues represent remaining minor API enhancements I would like to get in place before a v0.1 release. Some issues shape the API and there is some particular request for community feedback. |
Having this library helps me introduce resilience to my project, but I don't want to wait for expected or unexpected failures to occur in order to test it out
Describe your proposed or preferred solution:
Describe any alternative options you've considered:
Doing it myself, manually disabling the service which supports that and checking if it really breaks the circuit and moves to fallback.
Any additional info?
We can definitely learn from Netflix's Simian Army.