Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tries to start consumer that is already started #286

Closed
jasongoodwin opened this issue May 9, 2018 · 8 comments
Closed

Tries to start consumer that is already started #286

jasongoodwin opened this issue May 9, 2018 · 8 comments

Comments

@jasongoodwin
Copy link
Contributor

jasongoodwin commented May 9, 2018

We see cases where KafkaEx tries to start a consumer that is already running for some reason. Looks pretty benign. I'm assuming this is internal to kafka_ex but let me know if I'm off here.
Started doing this after a crash observed in a genserver. Continues to throw these errors into the log.

We have a bug, but it seems to spin kafka_ex into this weird behaviour

Kafka_ex 0.8.0

May 08 23:30:03 customer1 funnelcloud-1.log: [2018-05-09 03:30:03.713] error -- : -GenServer #PID<0.16837.0> terminating
May 08 23:30:03 customer1 funnelcloud-1.log: ** (CaseClauseError) no case clause matching: {:error, {:already_started, #PID<0.16839.0>}}
May 08 23:30:03 customer1 funnelcloud-1.log:     (kafka_ex) lib/kafka_ex/consumer_group.ex:261: KafkaEx.ConsumerGroup.start_consumer/5
May 08 23:30:03 customer1 funnelcloud-1.log:     (kafka_ex) lib/kafka_ex/consumer_group/manager.ex:390: KafkaEx.ConsumerGroup.Manager.start_consumer/2
May 08 23:30:03 customer1 funnelcloud-1.log:     (kafka_ex) lib/kafka_ex/consumer_group/manager.ex:276: KafkaEx.ConsumerGroup.Manager.sync/2
May 08 23:30:03 customer1 funnelcloud-1.log:     (kafka_ex) lib/kafka_ex/consumer_group/manager.ex:149: KafkaEx.ConsumerGroup.Manager.handle_info/2
May 08 23:30:03 customer1 funnelcloud-1.log:     (stdlib) gen_server.erl:616: :gen_server.try_dispatch/4
May 08 23:30:03 customer1 funnelcloud-1.log:     (stdlib) gen_server.erl:686: :gen_server.handle_msg/6
May 08 23:30:03 customer1 funnelcloud-1.log:     (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
May 08 23:30:03 customer1 funnelcloud-1.log: Last message: :timeout
May 08 23:30:04 customer1 funnelcloud-1.log: =CRASH REPORT==== 9-May-2018::03:30:03 ===
May 08 23:30:04 customer1 funnelcloud-1.log:   crasher:
May 08 23:30:04 customer1 funnelcloud-1.log:     initial call: Elixir.KafkaEx.ConsumerGroup.Manager:init/1
May 08 23:30:04 customer1 funnelcloud-1.log:     pid: <0.16837.0>
May 08 23:30:04 customer1 funnelcloud-1.log:     registered_name: []
May 08 23:30:04 customer1 funnelcloud-1.log:     exception error: no case clause matching {error,
May 08 23:30:04 customer1 funnelcloud-1.log:                                               {already_started,<0.16839.0>}}
May 08 23:30:04 customer1 funnelcloud-1.log:       in function  'Elixir.KafkaEx.ConsumerGroup':start_consumer/5 (lib/kafka_ex/consumer_group.ex, line 261)
May 08 23:30:04 customer1 funnelcloud-1.log:       in call from 'Elixir.KafkaEx.ConsumerGroup.Manager':start_consumer/2 (lib/kafka_ex/consumer_group/manager.ex, line 390)
May 08 23:30:04 customer1 funnelcloud-1.log:       in call from 'Elixir.KafkaEx.ConsumerGroup.Manager':sync/2 (lib/kafka_ex/consumer_group/manager.ex, line 276)
May 08 23:30:04 customer1 funnelcloud-1.log:       in call from 'Elixir.KafkaEx.ConsumerGroup.Manager':handle_info/2 (lib/kafka_ex/consumer_group/manager.ex, line 149)
May 08 23:30:04 customer1 funnelcloud-1.log:       in call from gen_server:try_dispatch/4 (gen_server.erl, line 616)
May 08 23:30:04 customer1 funnelcloud-1.log:       in call from gen_server:handle_msg/6 (gen_server.erl, line 686)
May 08 23:30:04 customer1 funnelcloud-1.log:     ancestors: [<0.2248.0>,'Elixir.MachineState.Supervisor',<0.2241.0>]
May 08 23:30:04 customer1 funnelcloud-1.log:     message_queue_len: 0
May 08 23:30:04 customer1 funnelcloud-1.log:     messages: []
May 08 23:30:04 customer1 funnelcloud-1.log:     links: [<0.2248.0>]
May 08 23:30:04 customer1 funnelcloud-1.log:     dictionary: []
May 08 23:30:04 customer1 funnelcloud-1.log:     trap_exit: true
May 08 23:30:04 customer1 funnelcloud-1.log:     status: running
May 08 23:30:04 customer1 funnelcloud-1.log:     heap_size: 987
May 08 23:30:04 customer1 funnelcloud-1.log:     stack_size: 27
May 08 23:30:04 customer1 funnelcloud-1.log:     reductions: 710
May 08 23:30:04 customer1 funnelcloud-1.log:   neighbours:
May 08 23:30:04 customer1 funnelcloud-1.log: =SUPERVISOR REPORT==== 9-May-2018::03:30:03 ===
May 08 23:30:04 customer1 funnelcloud-1.log:      Supervisor: {<0.2248.0>,'Elixir.KafkaEx.ConsumerGroup'}
May 08 23:30:04 customer1 funnelcloud-1.log:      Context:    child_terminated
May 08 23:30:04 customer1 funnelcloud-1.log:      Reason:     {{case_clause,{error,{already_started,<0.16839.0>}}},
May 08 23:30:04 customer1 funnelcloud-1.log:                   [{'Elixir.KafkaEx.ConsumerGroup',start_consumer,5,
May 08 23:30:04 customer1 funnelcloud-1.log:                        [{file,"lib/kafka_ex/consumer_group.ex"},{line,261}]},
May 08 23:30:04 customer1 funnelcloud-1.log:                    {'Elixir.KafkaEx.ConsumerGroup.Manager',start_consumer,2,
May 08 23:30:04 customer1 funnelcloud-1.log:                        [{file,"lib/kafka_ex/consumer_group/manager.ex"},
May 08 23:30:04 customer1 funnelcloud-1.log:                         {line,390}]},
May 08 23:30:04 customer1 funnelcloud-1.log:                    {'Elixir.KafkaEx.ConsumerGroup.Manager',sync,2,
May 08 23:30:04 customer1 funnelcloud-1.log:                        [{file,"lib/kafka_ex/consumer_group/manager.ex"},
May 08 23:30:04 customer1 funnelcloud-1.log:                         {line,276}]},
May 08 23:30:04 customer1 funnelcloud-1.log:                    {'Elixir.KafkaEx.ConsumerGroup.Manager',handle_info,2,
May 08 23:30:04 customer1 funnelcloud-1.log:                        [{file,"lib/kafka_ex/consumer_group/manager.ex"},
May 08 23:30:04 customer1 funnelcloud-1.log:                         {line,149}]},
May 08 23:30:04 customer1 funnelcloud-1.log:                    {gen_server,try_dispatch,4,
May 08 23:30:04 customer1 funnelcloud-1.log:                        [{file,"gen_server.erl"},{line,616}]},
May 08 23:30:04 customer1 funnelcloud-1.log:                    {gen_server,handle_msg,6,
May 08 23:30:04 customer1 funnelcloud-1.log:                        [{file,"gen_server.erl"},{line,686}]},
May 08 23:30:04 customer1 funnelcloud-1.log:                    {proc_lib,init_p_do_apply,3,
May 08 23:30:04 customer1 funnelcloud-1.log:                        [{file,"proc_lib.erl"},{line,247}]}]}
May 08 23:30:04 customer1 funnelcloud-1.log:      Offender:   [{pid,<0.16837.0>},
May 08 23:30:04 customer1 funnelcloud-1.log:                   {id,'Elixir.KafkaEx.ConsumerGroup.Manager'},
May 08 23:30:04 customer1 funnelcloud-1.log:                   {mfargs,
May 08 23:30:04 customer1 funnelcloud-1.log:                       {'Elixir.KafkaEx.ConsumerGroup.Manager',start_link,
May 08 23:30:04 customer1 funnelcloud-1.log:                           ['Elixir.MachineState.KafkaConsumer',
May 08 23:30:04 customer1 funnelcloud-1.log:                            <<"machine_state">>,
May 08 23:30:04 customer1 funnelcloud-1.log:                            [<<"commands">>],
May 08 23:30:04 customer1 funnelcloud-1.log:                            [{supervisor_pid,<0.2248.0>}]]}},
May 08 23:30:04 customer1 funnelcloud-1.log:                   {restart_type,permanent},
May 08 23:30:04 customer1 funnelcloud-1.log:                   {shutdown,5000},
May 08 23:30:04 customer1 funnelcloud-1.log:                   {child_type,worker}]
@jasongoodwin
Copy link
Contributor Author

Seems like it can get stuck in a loop where it does this indefinitely. It's hard for me to characterize further without reproducing but here is an error we see before it spins out trying to restart the existing consumer.

[2018-06-02 12:21:33.465] error -- : application=kafka_ex -Receiving data from broker "....":9092 failed with :timeout
[2018-06-02 12:21:33.465] error -- : application=kafka_ex -Unable to fetch metadata from any brokers.  Timeout is 60000.
[2018-06-02 12:21:33.513] error -- : -GenServer #PID<0.25315.0> terminating
** (RuntimeError) Unable to fetch metadata from any brokers.  Timeout is 60000.
    (kafka_ex) lib/kafka_ex/server.ex:356: KafkaEx.Server0P9P0.retrieve_metadata/6
    (kafka_ex) lib/kafka_ex/server.ex:337: KafkaEx.Server0P9P0.update_metadata/1
    (kafka_ex) lib/kafka_ex/server.ex:333: KafkaEx.Server0P9P0.kafka_server_update_metadata/1
    (stdlib) gen_server.erl:616: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:686: :gen_server.handle_msg/6
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message: :update_metadata

@jasongoodwin
Copy link
Contributor Author

jasongoodwin commented Aug 9, 2018

saw this again today. restarting elixir resolved it.
Note that the ancestor process was different in both cases logged.

Above it's one app in elixir (Elixir.MachineState.KafkaConsumer). Today it was another
ancestors: [<0.2127.0>,'Elixir.BroadcastSequencer.Supervisor',

@joshuawscott
Copy link
Member

Are you still on 0.8.0, or have you moved to the latest?

@IanVaughan
Copy link
Contributor

@jasongoodwin You may be interested to read/comment on #260

@jasongoodwin
Copy link
Contributor Author

jasongoodwin commented Aug 16, 2018

I'm using kafka 9+ (0.11 in prod) but maybe it's set incorrectly somewhere? Using the default config which appears to be 0.9.0 so don't think this is relevant. We've been running it for a while and it has only appeared a few times around restarts of the kafka consumer.

@joshuawscott
Copy link
Member

I meant for KafkaEx, sorry @jasongoodwin

@jasongoodwin
Copy link
Contributor Author

Sorry for the slow reply - kafka_ex 0.8.2
I still see these periodically.

I just remembered we're actually on a fork of 0.8.2 but I don't believe any of the changes would cause this - we added the ability to supply init params to gen consumers iirc. But it's another variable. It's just impossible to reproduce - it just periodically appears in running systems. We have 32 or so production servers running 24/7 and they are not common enough to cause users to notice any interruption but can occur for an hour at a time. Eg They seem to appear every minute or two for an hour and then stop in some recent observed incidents.

Fork:
https://github.com/funnelcloudinc/kafka_ex

I'll look at it and make sure it isn't a bug we introduced - if I'm the only one complaining about this I would be suspect of our fork but we also run it at a pretty big volume and we track every crash report we encounter. I'm sure that's different than how other people may be running it.

		Nov 29 19:08:42 ... app-1.log: [2018-11-30 00:08:42.193] error -- : -GenServer #PID<0.12038.0> terminating
		

Nov 29 19:08:42 ... app-1.log: ** (CaseClauseError) no case clause matching: {:error, {:already_started, #PID<0.12040.0>}}
		

Nov 29 19:08:42 ... app-1.log:     (kafka_ex) lib/kafka_ex/consumer_group.ex:261: KafkaEx.ConsumerGroup.start_consumer/5
		

Nov 29 19:08:42 ... app-1.log:     (kafka_ex) lib/kafka_ex/consumer_group/manager.ex:390: KafkaEx.ConsumerGroup.Manager.start_consumer/2
		

Nov 29 19:08:42 ... app-1.log:     (kafka_ex) lib/kafka_ex/consumer_group/manager.ex:276: KafkaEx.ConsumerGroup.Manager.sync/2
		

Nov 29 19:08:42 ... app-1.log:     (kafka_ex) lib/kafka_ex/consumer_group/manager.ex:149: KafkaEx.ConsumerGroup.Manager.handle_info/2
		

Nov 29 19:08:42 ... app-1.log:     (stdlib) gen_server.erl:616: :gen_server.try_dispatch/4
		

Nov 29 19:08:42 ... app-1.log:     (stdlib) gen_server.erl:686: :gen_server.handle_msg/6
		

Nov 29 19:08:42 ... app-1.log:     (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
		

Nov 29 19:08:42 ... app-1.log: Last message: :timeout
		

Nov 29 19:08:42 ... app-1.log: =CRASH REPORT==== 30-Nov-2018::00:08:42 ===
		

Nov 29 19:08:42 ... app-1.log:   crasher:
		

Nov 29 19:08:42 ... app-1.log:     initial call: Elixir.KafkaEx.ConsumerGroup.Manager:init/1
		

Nov 29 19:08:42 ... app-1.log:     pid: <0.12038.0>
		

Nov 29 19:08:42 ... app-1.log:     registered_name: []
		

Nov 29 19:08:42 ... app-1.log:     exception error: no case clause matching {error,
		

Nov 29 19:08:42 ... app-1.log:                                               {already_started,<0.12040.0>}}
		

Nov 29 19:08:42 ... app-1.log:       in function  'Elixir.KafkaEx.ConsumerGroup':start_consumer/5 (lib/kafka_ex/consumer_group.ex, line 261)
		

Nov 29 19:08:42 ... app-1.log:       in call from 'Elixir.KafkaEx.ConsumerGroup.Manager':start_consumer/2 (lib/kafka_ex/consumer_group/manager.ex, line 390)
		

Nov 29 19:08:42 ... app-1.log:       in call from 'Elixir.KafkaEx.ConsumerGroup.Manager':sync/2 (lib/kafka_ex/consumer_group/manager.ex, line 276)
		

Nov 29 19:08:42 ... app-1.log:       in call from 'Elixir.KafkaEx.ConsumerGroup.Manager':handle_info/2 (lib/kafka_ex/consumer_group/manager.ex, line 149)
		

Nov 29 19:08:42 ... app-1.log:       in call from gen_server:try_dispatch/4 (gen_server.erl, line 616)
		

Nov 29 19:08:42 ... app-1.log:       in call from gen_server:handle_msg/6 (gen_server.erl, line 686)
		

Nov 29 19:08:42 ... app-1.log:     ancestors: [<0.2155.0>,'Elixir.Notifier.Supervisor',<0.2150.0>]
		

Nov 29 19:08:42 ... app-1.log:     message_queue_len: 0
		

Nov 29 19:08:42 ... app-1.log:     messages: []
		

Nov 29 19:08:42 ... app-1.log:     links: [<0.2155.0>]
		

Nov 29 19:08:42 ... app-1.log:     dictionary: []
		

Nov 29 19:08:42 ... app-1.log:     trap_exit: true
		

Nov 29 19:08:42 ... app-1.log:     status: running
		

Nov 29 19:08:42 ... app-1.log:     heap_size: 610
		

Nov 29 19:08:42 ... app-1.log:     stack_size: 27
		

Nov 29 19:08:42 ... app-1.log:     reductions: 913
		

Nov 29 19:08:42 ... app-1.log:   neighbours:
		

Nov 29 19:08:42 ... app-1.log: =SUPERVISOR REPORT==== 30-Nov-2018::00:08:42 ===
		

Nov 29 19:08:42 ... app-1.log:      Supervisor: {<0.2155.0>,'Elixir.KafkaEx.ConsumerGroup'}
		

Nov 29 19:08:42 ... app-1.log:      Context:    child_terminated
		

Nov 29 19:08:42 ... app-1.log:      Reason:     {{case_clause,{error,{already_started,<0.12040.0>}}},
		

Nov 29 19:08:42 ... app-1.log:                   [{'Elixir.KafkaEx.ConsumerGroup',start_consumer,5,
		

Nov 29 19:08:42 ... app-1.log:                        [{file,"lib/kafka_ex/consumer_group.ex"},{line,261}]},
		

Nov 29 19:08:42 ... app-1.log:                    {'Elixir.KafkaEx.ConsumerGroup.Manager',start_consumer,2,
		

Nov 29 19:08:42 ... app-1.log:                        [{file,"lib/kafka_ex/consumer_group/manager.ex"},
		

Nov 29 19:08:42 ... app-1.log:                         {line,390}]},
		

Nov 29 19:08:42 ... app-1.log:                    {'Elixir.KafkaEx.ConsumerGroup.Manager',sync,2,
		

Nov 29 19:08:42 ... app-1.log:                        [{file,"lib/kafka_ex/consumer_group/manager.ex"},
		

Nov 29 19:08:42 ... app-1.log:                         {line,276}]},
		

Nov 29 19:08:42 ... app-1.log:                    {'Elixir.KafkaEx.ConsumerGroup.Manager',handle_info,2,
		

Nov 29 19:08:42 ... app-1.log:                        [{file,"lib/kafka_ex/consumer_group/manager.ex"},
		

Nov 29 19:08:42 ... app-1.log:                         {line,149}]},
		

Nov 29 19:08:42 ... app-1.log:                    {gen_server,try_dispatch,4,
		

Nov 29 19:08:42 ... app-1.log:                        [{file,"gen_server.erl"},{line,616}]},
		

Nov 29 19:08:42 ... app-1.log:                    {gen_server,handle_msg,6,
		

Nov 29 19:08:42 ... app-1.log:                        [{file,"gen_server.erl"},{line,686}]},
		

Nov 29 19:08:42 ... app-1.log:                    {proc_lib,init_p_do_apply,3,
		

Nov 29 19:08:42 ... app-1.log:                        [{file,"proc_lib.erl"},{line,247}]}]}
		

Nov 29 19:08:42 ... app-1.log:      Offender:   [{pid,<0.12038.0>},
		

Nov 29 19:08:42 ... app-1.log:                   {id,'Elixir.KafkaEx.ConsumerGroup.Manager'},
		

Nov 29 19:08:42 ... app-1.log:                   {mfargs,
		

Nov 29 19:08:42 ... app-1.log:                       {'Elixir.KafkaEx.ConsumerGroup.Manager',start_link,
		

Nov 29 19:08:42 ... app-1.log:                           ['Elixir.Notifier.KafkaConsumer',<<"notifier">>,
		

Nov 29 19:08:42 ... app-1.log:                            [<<"events">>,<<"incidents">>],
		

Nov 29 19:08:42 ... app-1.log:                            [{supervisor_pid,<0.2155.0>}]]}},
		

Nov 29 19:08:42 ... app-1.log:                   {restart_type,permanent},
		

Nov 29 19:08:42 ... app-1.log:                   {shutdown,5000},
		

Nov 29 19:08:42 ... app-1.log:                   {child_type,worker}]

@joshuawscott
Copy link
Member

This sounds like a race condition, and a few of those were fixed in KafkaEx 0.8.3 when moving the heartbeat to a separate process.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants