Partition stuck and do not receive any message. fetch_state:validate-epoch-wait #4484

Phil1972 · 2023-10-25T14:55:16Z

Hi,

Before opening an issue, I would prefer to ask here first to see whether I can find out anything about our problem.
We do not use transactions in our message producers. We have many producers and, in this simple case, only one consumer process (in a Kubernetes pod), which processes all 60 partitions. Messages are spread more or less evenly across producers and partitions. I do not see any evident issues on the producer side.

At some point, one or more partitions stop receiving messages, and I can observe consumer lag on Confluent's cluster.

We implemented a partition monitor that checks whether a partition has received no messages for a certain period of time. We recently got into this situation when we stopped getting messages from partition 35 (out of 60), and this is what the stats show for that partition:

"35": { "partition": 35, "broker": 21, "leader": 21, "desired": true, "unknown": false, "msgq_cnt": 0, "msgq_bytes": 0, "xmit_msgq_cnt": 0, "xmit_msgq_bytes": 0, "fetchq_cnt": 0, "fetchq_size": 0, "fetch_state": "validate-epoch-wait", "query_offset": -1001, "next_offset": 1245176, "app_offset": 1245176, "stored_offset": -1001, "stored_leader_epoch": -1, "commited_offset": 1245176, "committed_offset": 1245176, "committed_leader_epoch": 120, "eof_offset": 1245176, "lo_offset": -1, "hi_offset": -1, "ls_offset": -1, "consumer_lag": -1, "consumer_lag_stored": -1, "leader_epoch": 121, "txmsgs": 0, "txbytes": 0, "rxmsgs": 615, "rxbytes": 710580, "msgs": 615, "rx_ver_drops": 0, "msgs_inflight": 0, "next_ack_seq": 0, "next_err_seq": 0, "acked_msgid": 0 },
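A check like the one described can also be driven directly by the statistics payload rather than by message timing. The following is only a minimal sketch in Python, assuming the stats arrive as the standard librdkafka JSON document with partitions nested under `topics` → `<topic>` → `partitions`. The function name `find_stuck_partitions` and the topic name `my-topic` are hypothetical and used for illustration only.

```python
import json

STUCK_STATE = "validate-epoch-wait"


def find_stuck_partitions(stats_json, stuck_state=STUCK_STATE):
    """Return sorted (topic, partition) pairs whose fetch_state equals stuck_state."""
    stats = json.loads(stats_json)
    stuck = []
    for topic_name, topic in stats.get("topics", {}).items():
        for part in topic.get("partitions", {}).values():
            # The stats also contain an internal partition with id -1; skip it.
            if part.get("partition", -1) < 0:
                continue
            if part.get("fetch_state") == stuck_state:
                stuck.append((topic_name, part["partition"]))
    return sorted(stuck)


# Hypothetical, heavily trimmed stats document for illustration.
sample_stats = json.dumps({
    "topics": {
        "my-topic": {
            "partitions": {
                "35": {"partition": 35, "fetch_state": "validate-epoch-wait"},
                "36": {"partition": 36, "fetch_state": "active"},
                "-1": {"partition": -1, "fetch_state": "none"},
            }
        }
    }
})
stuck_partitions = find_stuck_partitions(sample_stats)
```

With the trimmed sample above, only partition 35 of the hypothetical topic is reported. A single sample in this state is not necessarily a fault, so a real monitor would probably want to see it persist across several stats intervals before acting.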

What is the meaning of validate-epoch-wait?

We plan to restart the pod whenever we encounter this issue, since we know neither why it happens nor how to recover from it.

Any insight or help on this issue would be appreciated,

Thanks,

mensfeld · 2023-10-25T14:59:47Z

What librdkafka version?

1 reply

Phil1972 Oct 25, 2023
Author

Sorry, I should have mentioned that we usually follow the latest NuGet package:
Include="Confluent.Kafka" Version="2.2.0"

Phil1972 · 2023-10-25T20:30:43Z

I also looked at the documentation, and I do not see validate-epoch-wait listed as a documented value of 'fetch_state'.

0 replies

Phil1972 · 2023-10-30T12:29:16Z

I am not sure if anyone reads this discussion, but the issue happened again over the last few days. It is the same problem: a partition stopped receiving messages, and the partition statistics show the same state:

{ "partition": 32, "broker": 4, "leader": 4, "desired": true, "unknown": false, "msgq_cnt": 0, "msgq_bytes": 0, "xmit_msgq_cnt": 0, "xmit_msgq_bytes": 0, "fetchq_cnt": 0, "fetchq_size": 0, "fetch_state": "validate-epoch-wait", "query_offset": -1001, "next_offset": 9245051, "app_offset": 9245051, "stored_offset": -1001, "stored_leader_epoch": -1, "commited_offset": 9245051, "committed_offset": 9245051, "committed_leader_epoch": 163, "eof_offset": 9245050, "lo_offset": -1, "hi_offset": -1, "ls_offset": -1, "consumer_lag": -1, "consumer_lag_stored": -1, "leader_epoch": 164, "txmsgs": 0, "txbytes": 0, "rxmsgs": 16839, "rxbytes": 21849190, "msgs": 16839, "rx_ver_drops": 0, "msgs_inflight": 0, "next_ack_seq": 0, "next_err_seq": 0, "acked_msgid": 0 }

i.e., "fetch_state": "validate-epoch-wait"

3 replies

mensfeld Oct 30, 2023

I read them. I just haven't had time to look into the code yet ;(

Phil1972 Oct 30, 2023
Author

I looked in the .NET code and could not find that enum value. I guess it might be in the C code, and I am not sure whether that is available. In any case, I will let you look into this and see if you can come up with something that would explain our issue. For now I had to build a monitoring service that restarts the pods when this problem is detected, which is really only a temporary patch. Thanks.
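For readers building a similar stopgap, one way to avoid restarting on a short-lived occurrence is to act only once a partition has stayed in the state for some time. This is a rough Python sketch, assuming the state may also appear briefly during normal operation (for example around a leader change). The class name `StuckPartitionTracker` and the 30-second threshold are hypothetical, and this is not the monitoring service described above.

```python
import time


class StuckPartitionTracker:
    """Track how long each partition has continuously reported the stuck state."""

    def __init__(self, max_stuck_seconds=60.0, stuck_state="validate-epoch-wait"):
        self.max_stuck_seconds = max_stuck_seconds
        self.stuck_state = stuck_state
        # partition id -> time at which the stuck state was first observed
        self._first_seen = {}

    def observe(self, partition, fetch_state, now=None):
        """Record one stats sample; return True once the partition has been stuck too long."""
        now = time.monotonic() if now is None else now
        if fetch_state != self.stuck_state:
            # Any other state resets the timer for this partition.
            self._first_seen.pop(partition, None)
            return False
        first = self._first_seen.setdefault(partition, now)
        return now - first >= self.max_stuck_seconds


# Illustration with explicit timestamps (in seconds) instead of the real clock.
tracker = StuckPartitionTracker(max_stuck_seconds=30.0)
first_sample = tracker.observe(35, "validate-epoch-wait", now=0.0)   # just entered the state
later_sample = tracker.observe(35, "validate-epoch-wait", now=31.0)  # still stuck after 31 s
```

Feeding `observe` from each statistics callback keeps the restart decision tied to how long the state persists, rather than to a single sample.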

Phil1972 Oct 30, 2023
Author

OK, found it in the C++ code. Not sure what it means, though, or why it happens :)

KentPihl · 2023-11-01T13:13:13Z

We had the same problem. Try the new release, 2.3.0. It seems to solve the problem so far; we have been running it for several days now without problems.

https://github.com/confluentinc/confluent-kafka-dotnet/releases

0 replies
