fix: use gen_server:cast/2 in ldclient_event_server:add_event/3 #115
Conversation
handle_call({flush, Tag}, _From, #{events := Events, summary_event := SummaryEvent, flush_interval := FlushInterval, timer_ref := TimerRef} = State) ->
    _ = erlang:cancel_timer(TimerRef),
    ok = ldclient_event_process_server:send_events(Tag, Events, SummaryEvent),
    NewTimerRef = erlang:send_after(FlushInterval, self(), {flush, Tag}),
    {reply, ok, State#{events := [], summary_event := #{}, timer_ref := NewTimerRef}}.

handle_cast({add_event, Event, Tag, Options}, #{events := Events, summary_event := SummaryEvent, capacity := Capacity} = State) when length(Events) < Capacity ->
So, my concern from the original PR still applies to this. The capacity is checked when the cast is handled, not at the point of casting.
This means that if handle_cast is delayed, then the mailbox would have unbounded growth.
I need to do some testing to validate this concern.
So the cast call itself happens on the process doing the eval; handle_cast happens sometime later, when the event server is free to do so. So, if the event server is delayed, the casts accumulate in the mailbox.
The linked Go example is equivalent to a check in add_event, as it is checking before putting the event into its equivalent of a mailbox. It uses a double-buffer system, effectively emulating the mailbox.
So a call puts the event into an inbox, the inbox is processed later, and the event is added to a buffer. The inbox size and the event buffer size are decoupled. The difference is that in Erlang the mailbox will grow unbounded.
I was originally just going to use erlang:process_info(Pid, message_queue_len) to check the size and, if it was over some limit X, discard the event. But I am unclear whether process_info itself would be adding to the demand.
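For illustration only, a caller-side check of that sort might look roughly like the sketch below; the function name, message shape, and limit are hypothetical and not part of the SDK:

%% Illustrative sketch, not SDK code: check the event server's mailbox depth
%% before casting, and drop the event when the queue is already deep.
add_event_if_room(ServerRef, Msg, MaxQueueLen) ->
    case whereis(ServerRef) of
        undefined ->
            dropped; %% server not running; drop the event
        Pid ->
            case erlang:process_info(Pid, message_queue_len) of
                {message_queue_len, Len} when Len < MaxQueueLen ->
                    gen_server:cast(Pid, Msg);
                _ ->
                    dropped %% mailbox too deep (or process just exited); drop
            end
    end.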
Now, maybe the code within handle_cast is trivial enough that it could never get backed up. So, I need to consider that.
OK, so here is a basic test scenario:
-module(a).
-behaviour(gen_server).

-export([start_link/0, do_thing/1, bombard/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Cast to the server as fast as possible, forever.
bombard() ->
    gen_server:cast(?MODULE, {thing, "BOOM"}),
    bombard().

do_thing(Input) ->
    io:format("sending cast message: ~p~n", [Input]),
    gen_server:cast(?MODULE, {thing, Input}).

init([]) -> {ok, []}. %% no treatment of info here!

handle_call(Msg, _From, Events) ->
    io:format("Unexpected message: ~p~n", [Msg]),
    {reply, Msg, Events}.

%% Buffer up to 10000 events, then drop everything after that.
handle_cast({thing, Event}, Events) when length(Events) < 10000 ->
    {noreply, [Event | Events]};
handle_cast({thing, _}, Events) ->
    {noreply, Events}.

handle_info(Msg, Events) ->
    io:format("Unexpected message: ~p~n", [Msg]),
    {noreply, Events}.

terminate(_, _Events) ->
    ok.

code_change(_OldVsn, Events, _Extra) ->
    {ok, Events}.
Here we have a gen_server that puts up to 10000 events into a buffer and then does nothing with any event after that.
Using an erlang shell:
c(a).
a:start_link().
a:bombard().
The memory of the process grows rapidly and it would eventually crash. (On Linux the OOM killer will eventually kill it.)
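As a small variation (an assumption on my part, using the same module a), the growth can be watched without tying up the shell by running the bombardment in a spawned process and then polling the server's mailbox depth and memory:

spawn(fun a:bombard/0).
erlang:process_info(whereis(a), [message_queue_len, memory]).

Calling process_info repeatedly should show message_queue_len climbing whenever the server cannot keep up with the caster.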
So, I think the system either needs back pressure, as provided by call, or we need a limiter on the calling side.
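One hypothetical shape for such a calling-side limiter is sketched below using the counters module; the module name, wiring, and API here are illustrative only and not part of the SDK, and because the check and the increment are not atomic the bound is approximate rather than exact:

%% Illustrative sketch, not SDK code: cap the number of in-flight cast messages
%% with a shared counter. Callers only cast while the counter is below the limit;
%% the server decrements it after handling each message, keeping the mailbox
%% depth roughly bounded.
-module(limited_cast).
-export([new/1, try_cast/3, done/1]).

new(Limit) ->
    %% The counter reference also needs to be shared with the server,
    %% e.g. via its state or persistent_term.
    {counters:new(1, [write_concurrency]), Limit}.

try_cast({Ref, Limit}, Server, Msg) ->
    case counters:get(Ref, 1) of
        N when N < Limit ->
            counters:add(Ref, 1, 1),
            gen_server:cast(Server, Msg);
        _ ->
            dropped %% limit reached: shed the message on the calling side
    end.

%% Called by the server once it has finished handling a message.
done({Ref, _Limit}) ->
    counters:sub(Ref, 1, 1).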
I am going to be out next week, but I will check into this again when I return. It is possible I need a better emulation scenario. But, based on how frequently some customers have exceeded event capacity (sometimes by millions), I am concerned about this. Unlike most potential problems, unbounded mailbox growth takes down the entire node.
Thanks for looking into it! I hear your concerns. I thought this thread was interesting: https://elixirforum.com/t/is-erlang-process-info-pid-message-queue-len-a-heavy-operation/20358. My interpretation of Jose's answer is that it's not safe to use in our scenario (because we'd be calling it from an external process).
My thinking is that we're discussing tradeoffs between a hypothetical problem (mailbox overwhelmed) and an observed problem (latency via backpressure). My gut says that a system capable of overwhelming the mailbox (in spite of existing load shedding) will be severely degraded by the current implementation too. Unbounded memory growth is bad, but discovering that LaunchDarkly is the bottleneck in your service is also bad. I feel that we can proceed with this iterative improvement (which solves an immediate customer pain point) and consider more robust load shedding if there are reports of high memory usage in the SDK?
That said I appreciate your thoroughness and obviously collecting data is better than my gut feelings 😅.
(By the way Jose Valim is the creator of Elixir).
Sorry for the late reply @kinyoklion, I was out for the holidays. Happy new year!
If the system is consistently slowing down because of back pressure, then it means it would be consistently adding things into that mailbox beyond processing capacity as well.
The question on my mind is how efficiently will the event server load shed the mailbox in between event POSTs? Assuming the customer is ok with dropping events and load shedding is efficient, it doesn't matter that mailbox processing is beyond capacity. Of course, if the system is able to OOM the mailbox while the event server is busy POSTing, that would be a problem 😅.
I am leaning more toward making this just be an option for now.
This would be a great path forward for us!
I also think I will look into improving the performance of the event server, and a worker pool may be another option.
This makes a lot of sense - moving event POSTs off the flag eval critical path would be huge and might let us keep event enqueueing as a call.
Happy new year!
I put the configuration based version in review here: #117
I am still a little concerned about the underlying performance situation. The post itself is on a cast, so the only thing in the call is basically reformatting some things and appending arrays. Aside from an LRU thing that I think should be optimized.
What is the events capacity you have set, and how large are your contexts?
I am still a little concerned about the underlying performance situation. The post itself is on a cast, so the only thing in the call is basically reformatting some things and appending arrays. Aside from an LRU thing that I think should be optimized.
Your comment made me realize that I had incorrectly assumed that the event server blocks while flushing events to LaunchDarkly. This made me reconsider whether LaunchDarkly was actually creating significant backpressure in our system. I dug into our metrics from the previous incident and I didn't actually find any concrete evidence LD was the cause - a timer metric around flag evals showed flag evaluations slowing way down, but our system was also incredibly CPU constrained at the time. The incident happened before I joined Knock, and was a while ago (March '23), so a lot of context is now missing.
I think we can close this PR (again 😅) due to the lack of evidence that the SDK is impacting our application. I'm going to advocate for us to carefully reintroduce flagging in the high-volume service, and we'll see how flag eval latency behaves. If there are any problems I'll start a new GitHub issue. I apologize for the wild goose chase! Thanks @kinyoklion!
Great. I will still keep in mind some things we can do in the future to make sure event delivery is performant.
thank you!
Requirements
Related issues
This PR is a follow-up to #64. To recap that PR: we're currently unable to use LD in a high-volume, latency-sensitive service because ldclient's event server creates backpressure in the service.
Describe the solution you've provided
#64 converts add_event from a call to a cast. This PR extends #64 with a simple guard that ensures the event server's events buffer is not at capacity before proceeding. Note that this is redundant with existing downstream capacity checks, but by checking capacity early we have the opportunity to load shed as early as possible. It seems like we could consider removing some of the downstream guards, but I'll defer that to the LD team.
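For illustration, the shape of the guard is roughly the following; the clause bodies are simplified placeholders (add_event_to_buffer/4 is a hypothetical stand-in for the existing event-processing logic), not the SDK's actual code:

%% Sketch only: accept the event while the buffer is below capacity; otherwise
%% drop it immediately, shedding load before any further processing happens.
handle_cast({add_event, Event, Tag, Options},
            #{events := Events, capacity := Capacity} = State)
        when length(Events) < Capacity ->
    {noreply, add_event_to_buffer(Event, Tag, Options, State)};
handle_cast({add_event, _Event, _Tag, _Options}, State) ->
    {noreply, State}.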
Describe alternatives you've considered
In #64 it was suggested that we check the size of the mailbox. I discussed the PR with @unix1 and he recommended we check the length of events (Zurab, feel free to correct me if I misunderstood). I briefly looked into checking mailbox size and it seems like it would add some complexity that I'm not sure we need. The capacity check should handily shed load. The key problem to solve is to prevent flag evaluation from being delayed by the event server being busy POSTing event payloads.
I haven't added any tests but it seems that this case is already covered by this test?
Additional context
I worked at LD from 2018-2022 and I know the lengths LD goes to in order to keep flag evaluation as efficient as possible. We're eager to get this problem fixed so we can use LD in some critical parts of our service.
See this snippet of the Go SDK for an example of how the problem is solved there.