
fix: use gen_server:cast/2 in ldclient_event_server:add_event/3 #115

Closed

Conversation


@vlymar vlymar commented Dec 21, 2023

Requirements

  • I have added test coverage for new or changed functionality
  • I have followed the repository's pull request submission guidelines
  • I have validated my changes against all supported platform versions

Related issues

This PR is a follow-up to #64. To recap that PR: we're currently unable to use LD in a high-volume, latency-sensitive service because ldclient's event server creates backpressure in the service.

Describe the solution you've provided

#64 converts add_event from a call to a cast. This PR extends #64 with a simple guard that ensures the event server's events buffer is not at capacity before proceeding. Note that this is redundant with existing downstream capacity checks, but by checking capacity early we have the opportunity to shed load as early as possible. It seems like we could consider removing some of the downstream guards, but I'll defer that to the LD team.
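For illustration, here is a minimal sketch of the caller-side change being described (the get_server_name/1 lookup is a hypothetical stand-in for however the SDK resolves the per-Tag server, not an actual ldclient function):

%% Sketch only: get_server_name/1 is hypothetical.
add_event(Event, Tag, Options) ->
    ServerName = get_server_name(Tag),
    %% Previously a gen_server:call, which blocked flag evaluation until the
    %% event server replied. A cast returns immediately; the capacity guard in
    %% handle_cast (see the diff below) drops events once the buffer is full.
    gen_server:cast(ServerName, {add_event, Event, Tag, Options}).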

Describe alternatives you've considered

In #64 it was suggested that we check the size of the mailbox. I discussed the PR with @unix1 and he recommended we check the length of events (Zurab, feel free to correct me if I misunderstood). I briefly looked into checking mailbox size and it seems like it would add some complexity that I'm not sure we need. The capacity check should handily shed load. The key problem to solve is preventing flag evaluation from being delayed by the event server being busy POSTing event payloads.

I haven't added any tests but it seems that this case is already covered by this test?

Additional context

I worked at LD from 2018 to 2022, and I know the lengths LD goes to in order to keep flag evaluation as efficient as possible. We're eager to get this problem fixed so we can use LD in some critical parts of our service.

See this snippet of the Go SDK for an example of how the problem is solved there.

@vlymar vlymar changed the title Resurrected Proposal: use gen_server:cast/2 in ldclient_event_server:add_event/3 fixl: use gen_server:cast/2 in ldclient_event_server:add_event/3 Dec 21, 2023
@vlymar vlymar changed the title fixl: use gen_server:cast/2 in ldclient_event_server:add_event/3 fix: use gen_server:cast/2 in ldclient_event_server:add_event/3 Dec 21, 2023
@vlymar vlymar marked this pull request as ready for review December 21, 2023 21:15
@vlymar vlymar requested a review from a team December 21, 2023 21:15
handle_call({flush, Tag}, _From, #{events := Events, summary_event := SummaryEvent, flush_interval := FlushInterval, timer_ref := TimerRef} = State) ->
_ = erlang:cancel_timer(TimerRef),
ok = ldclient_event_process_server:send_events(Tag, Events, SummaryEvent),
NewTimerRef = erlang:send_after(FlushInterval, self(), {flush, Tag}),
{reply, ok, State#{events := [], summary_event := #{}, timer_ref := NewTimerRef}}.

handle_cast({add_event, Event, Tag, Options}, #{events := Events, summary_event := SummaryEvent, capacity := Capacity} = State) when length(Events) < Capacity ->
Member
@kinyoklion kinyoklion Dec 21, 2023

So, my concern from the original PR still applies here: the capacity is checked when the cast is handled, not when it is cast.

This means that if handle_cast is delayed, the mailbox can grow without bound.

I need to do some testing to validate this concern.

The cast itself happens on the process doing the eval; handle_cast happens sometime later, when the event server is free to do so. So if the event server is delayed, the casts accumulate in the mailbox.

The linked Go example is equivalent to a check in add_event, as it checks before putting the event into its equivalent of a mailbox. It uses a double-buffer system, effectively emulating the mailbox.

So: the call puts the event in an inbox, the inbox is processed later, and the event is added to a buffer. The inbox size and the event buffer size are decoupled. The difference is that in Erlang the mailbox will grow unbounded.

Member

I was originally just going to use erlang:process_info(Pid, message_queue_len) to check the size, and if it was over some limit X, then discard the event. But I am unclear whether process_info itself would add to the demand.
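For reference, a minimal sketch of that idea (the limit, the registered name, and the silent drop are illustrative assumptions, not SDK behavior):

%% Sketch only: MAX_QUEUE_LEN and the registered name are illustrative, and the
%% server process is assumed to be registered and alive.
-define(MAX_QUEUE_LEN, 10000).

add_event(Event, Tag, Options) ->
    Pid = whereis(ldclient_event_server),
    case erlang:process_info(Pid, message_queue_len) of
        {message_queue_len, Len} when Len > ?MAX_QUEUE_LEN ->
            ok;  %% mailbox already too long: drop the event instead of casting
        _ ->
            gen_server:cast(Pid, {add_event, Event, Tag, Options})
    end.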

Member

Now, maybe because of how trivial the code within handle_cast is, it could never get backed up. So, I need to consider that.

Member

OK. So, a basic test scenario:

-module(a).
-behaviour(gen_server).

-export([start_link/0, do_thing/1, bombard/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

bombard() ->
    gen_server:cast(?MODULE, {thing, "BOOM"}),
    bombard().

do_thing(Input) ->
    io:format("sending cast message: ~p~n",[Input]),
    gen_server:cast(?MODULE, {thing, Input}).

init([]) -> {ok, []}. %% no treatment of info here!

handle_call(Msg, _From, Events) ->
    io:format("Unexpected message: ~p~n",[Msg]),
    {reply, Msg, Events}.

%% Below capacity: accumulate the event in the buffer.
handle_cast({thing, Event}, Events) when length(Events) < 10000 ->
    {noreply, [Event|Events]};
%% At capacity: drop the event and leave the buffer unchanged.
handle_cast({thing, _}, Events) ->
    {noreply, Events}.

handle_info(Msg, Events) ->
    io:format("Unexpected message: ~p~n",[Msg]),
    {noreply, Events}.

terminate(_, _Events) ->
    ok.

code_change(_OldVsn, Events, _Extra) ->
    {ok, Events}. 

Here we have a gen_server that puts 10,000 events into a buffer and then subsequently does nothing with them.

Using an erlang shell:

c(a).
a:start_link().
a:bombard().
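
To watch the growth while a:bombard() runs (it loops forever, so this goes in a second shell attached to the same node):

%% Reports the current mailbox length and total memory of the gen_server process.
erlang:process_info(whereis(a), [message_queue_len, memory]).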

The memory of the process grows rapidly, and eventually it would crash. (On Linux the OOM killer will kill it.)

So, I think the system either needs back pressure, as provided by call, or a limiter on the calling side.

I am going to be out next week, but I will check into this again when I return. It is possible I need a better emulation scenario. But, based on how frequently some customers have exceeded event capacity (sometimes by millions), I am concerned about this. Unlike most potential problems, unbounded mailbox growth takes down the entire node.
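
One possible shape for such a calling-side limiter (illustrative only; not part of the SDK or this PR) is a shared atomics counter that bounds how many casts can be sitting in the mailbox at once:

%% Illustrative sketch: Counter is created once with atomics:new(1, []) and
%% shared with callers. It is incremented before casting and decremented as
%% handle_cast drains messages, so the mailbox itself stays bounded.
-define(MAX_IN_FLIGHT, 10000).

add_event(Counter, Event) ->
    case atomics:add_get(Counter, 1, 1) of
        N when N > ?MAX_IN_FLIGHT ->
            atomics:sub(Counter, 1, 1),  %% undo the increment and shed the event
            ok;
        _ ->
            gen_server:cast(?MODULE, {add_event, Event, Counter})
    end.

handle_cast({add_event, Event, Counter}, Events) ->
    atomics:sub(Counter, 1, 1),  %% one message drained from the mailbox
    {noreply, [Event | Events]}.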

Author

Thanks for looking into it! I hear your concerns. I thought this thread was interesting: https://elixirforum.com/t/is-erlang-process-info-pid-message-queue-len-a-heavy-operation/20358. My interpretation of Jose's answer is that process_info isn't safe to use in our scenario (because we'd be calling it from an external process).

My thinking is that we're discussing tradeoffs between a hypothetical problem (mailbox overwhelmed) and an observed problem (latency via backpressure). My gut says that a system capable of overwhelming the mailbox (in spite of existing load shedding) will be severely degraded by the current implementation too. Unbounded memory growth is bad, but discovering that LaunchDarkly is the bottleneck in your service is also bad. I feel that we can proceed with this iterative improvement (which solves an immediate customer pain point) and consider more robust load shedding if there are reports of high memory usage in the SDK?

That said, I appreciate your thoroughness, and obviously collecting data is better than my gut feelings 😅.

(By the way Jose Valim is the creator of Elixir).

Author

Sorry for the late reply @kinyoklion, I was out for the holidays. Happy new year!

> If the system is consistently slowing down because of back pressure, then it means it would be consistently adding things into that mailbox beyond processing capacity as well.

The question on my mind is how efficiently the event server will shed load from the mailbox in between event POSTs. Assuming the customer is OK with dropping events and load shedding is efficient, it doesn't matter that the mailbox fills beyond processing capacity. Of course, if the system is able to OOM via the mailbox while the event server is busy POSTing, that would be a problem 😅.

> I am leaning more toward making this just be an option for now.

This would be a great path forward for us!

> I also think I will look into improving the performance of the event server, and a worker pool may be another option.

This makes a lot of sense - moving event POSTs off the flag eval critical path would be huge and might let us keep event enqueueing as a call.

Member

Happy new year!

I put the configuration based version in review here: #117

I am still a little concerned about the underlying performance situation. The POST itself is on a cast, so the only thing in the call is basically reformatting some things and appending arrays, aside from an LRU thing that I think should be optimized.

What is the events capacity you have set, and how large are your contexts?

Author

> I am still a little concerned about the underlying performance situation. The POST itself is on a cast, so the only thing in the call is basically reformatting some things and appending arrays, aside from an LRU thing that I think should be optimized.

Your comment made me realize that I had incorrectly assumed that the event server blocks while flushing events to LaunchDarkly. This made me reconsider whether LaunchDarkly was actually creating significant backpressure in our system. I dug into our metrics from the previous incident and didn't actually find any concrete evidence that LD was the cause: a timer metric around flag evals showed flag evaluations slowing way down, but our system was also incredibly CPU-constrained at the time. The incident happened before I joined Knock and was a while ago (March '23), so a lot of context is now missing.

I think we can close this PR (again 😅) due to the lack of evidence that the SDK is impacting our application. I'm going to advocate for us to carefully reintroduce flagging in the high-volume service, and we'll see how flag eval latency behaves. If there are any problems, I'll start a new GitHub issue. I apologize for the wild goose chase! Thanks @kinyoklion!

Member

Great. I will still keep in mind some things we can do in the future to make sure event delivery is performant.

Author

thank you!

@kinyoklion kinyoklion closed this Jan 4, 2024