[fix] retry producer creation upon error after successful topic lookup #1139

Open: wants to merge 4 commits into base: master

zzzming (Contributor) commented Nov 24, 2023

Fixes #1138

Motivation

The newPartitionProducer() function should retry grabCnx(), similar to the grabCnx() retry logic in reconnectToBroker.

Java producer has this retry logic.

During producer creation, after a successful topic lookup in grabCnx() in producer_partition.go, if a network issue occurs before the command to create the producer is sent, grabCnx() exits without retrying.

This implementation follows the same retry logic as reconnectToBroker.

We had frequent failures during initial producer creation under unstable network conditions.

The problem is tricky to reproduce, but we observe it more frequently during the initialization stage of Azure pods. After implementing the grabCnx() retry in newPartitionProducer(), the problem went away. The error usually shows the connection being closed (EOF) by the other side, but the logs on the Pulsar side indicate that the broker did not close it. The cause may be network issues between the producer pod and the Pulsar cluster, which is why a grabCnx() retry is needed.

System configuration

Pulsar version: 2.10

Modifications

In the newPartitionProducer() function, add a retry of grabCnx() that uses the same retry logic as the grabCnx() retry in reconnectToBroker.
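The retry described above can be illustrated with a minimal, self-contained sketch. This is not the code from this pull request: retryGrabCnx, the three sentinel errors, and the doubling backoff with a cap are hypothetical stand-ins chosen for illustration, and the client's real backoff policy may differ. The sketch only shows the general shape: retry transient failures, stop immediately on errors that a retry cannot fix.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel errors; the real client defines its own error values.
var (
	errTopicNotFound   = errors.New("topic not found")
	errTopicTerminated = errors.New("topic terminated")
	errConnClosed      = errors.New("connection closed: EOF")
)

// retryGrabCnx calls grabCnx until it succeeds, returns a non-retryable
// error, or has been tried maxRetries times. The delay between attempts
// doubles after each failure and is capped at maxDelay.
func retryGrabCnx(grabCnx func() error, maxRetries int, initial, maxDelay time.Duration) error {
	delay := initial
	var err error
	for attempt := 1; attempt <= maxRetries; attempt++ {
		if err = grabCnx(); err == nil {
			return nil
		}
		// Retrying cannot help when the topic is missing or terminated.
		if errors.Is(err, errTopicNotFound) || errors.Is(err, errTopicTerminated) {
			return err
		}
		if attempt == maxRetries {
			break
		}
		time.Sleep(delay)
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxRetries, err)
}

func main() {
	calls := 0
	// Simulate two dropped connections followed by a successful creation.
	flaky := func() error {
		calls++
		if calls < 3 {
			return errConnClosed
		}
		return nil
	}
	err := retryGrabCnx(flaky, 5, 10*time.Millisecond, 100*time.Millisecond)
	fmt.Println(calls, err) // prints: 3 <nil>
}
```

Capping the number of attempts keeps producer creation from blocking forever when the network stays down, while the early return preserves the original fail-fast behavior for missing or terminated topics.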

Verifying this change

  • [x] Make sure that the change passes the CI checks.

This change is already covered by existing tests, such as (please describe tests).

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API: (no)
  • The schema: (no)
  • The default values of configurations: (no)
  • The wire protocol: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

@zzzming zzzming changed the title [fix] retry producer creation upon error after succssful topic lookip [fix] retry producer creation upon error after succssful topic lookup Nov 24, 2023
lhotari (Member) commented Nov 24, 2023

Great work @zzzming! I'll review again after you reply to the question.

}
p.log.WithError(err).Error("Failed to create producer at newPartitionProducer")
errMsg := err.Error()
if strings.Contains(errMsg, errTopicNotFount) {
Member:

Suggested change
- if strings.Contains(errMsg, errTopicNotFount) {
+ if errors.Is(err, ErrTopicNotfound) {

Contributor Author:

Rebased onto the latest master and fixed the error evaluation per your review comment.
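For readers following this thread, a short sketch of why the sentinel comparison is more robust than substring matching. The names errTopicGone, wrappedErr, unrelatedErr, and both helper functions are hypothetical and only stand in for the client's exported error values; the sketch assumes the sentinel is wrapped with %w somewhere along the call path.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errTopicGone is a hypothetical sentinel standing in for an exported error value.
var errTopicGone = errors.New("topic not found")

var (
	// wrappedErr carries the sentinel inside added context.
	wrappedErr = fmt.Errorf("create producer: %w", errTopicGone)
	// unrelatedErr only happens to mention the substring in its text.
	unrelatedErr = errors.New("lookup for tenant TopicNotFoundDemo timed out")
)

// isTopicGone reports whether err is, or wraps, the sentinel.
func isTopicGone(err error) bool {
	return errors.Is(err, errTopicGone)
}

// containsTopicGone is the substring-based check, shown for comparison.
func containsTopicGone(err error) bool {
	return strings.Contains(err.Error(), "TopicNotFound")
}

func main() {
	fmt.Println(isTopicGone(wrappedErr), containsTopicGone(wrappedErr))     // prints: true false
	fmt.Println(isTopicGone(unrelatedErr), containsTopicGone(unrelatedErr)) // prints: false true
}
```

The substring check depends on the exact message text, so it can miss a wrapped sentinel and match an unrelated error, whereas errors.Is follows the wrap chain and compares error values.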

break
}

if strings.Contains(errMsg, "TopicTerminatedError") {
Member:

Suggested change
- if strings.Contains(errMsg, "TopicTerminatedError") {
+ if errors.Is(err, ErrTopicTerminated) {

Contributor Author:

Fixed.

nodece (Member) left a comment:

LGTM. Could you rebase this PR? #1143 exports some error variables, so you need to update your PR.

@zzzming zzzming force-pushed the reconnectAfterLookup branch from c676c7b to a84c97d Compare January 12, 2024 19:00
zzzming (Contributor Author) commented Jan 12, 2024

@nodece I fixed this based on your review comments. CI does not seem to run. Does it require approval to run?

eolivelli commented:

CI triggered.

nodece (Member) commented Jan 17, 2024

Ping @zzzming

Development

Successfully merging this pull request may close these issues.

retry producer creation upon error after successful topic lookup
4 participants