Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix producer deadlock after write failure #378

Merged
merged 1 commit into from
Oct 9, 2020

Conversation

merlimat
Copy link
Contributor

@merlimat merlimat commented Oct 9, 2020

Motivation

There is a deadlock that can happen in Go client when the client has a write failure and tries to process that.

The issue is that Go mutexes are not re-entrant and we trigger a connection.Close() while already holding the connection mutex.

goroutine 1077 [semacquire, 83 minutes]:
sync.runtime_SemacquireMutex(0xc00c31fb04, 0xc110a12000, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00c31fb00)
	/usr/local/go/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
github.com/apache/pulsar-client-go/pulsar/internal.(*connection).Close(0xc00c31fb00)
	/go/pkg/mod/cd.splunkdev.com/streamlio/pulsar-client-go@v0.1.1-streamlio-5/pulsar/internal/connection.go:718 +0x547
github.com/apache/pulsar-client-go/pulsar.(*partitionProducer).ReceivedSendReceipt(0xc0033926e0, 0xc09ba0fe00)
	/go/pkg/mod/cd.splunkdev.com/streamlio/pulsar-client-go@v0.1.1-streamlio-5/pulsar/producer_partition.go:475 +0x6f0
github.com/apache/pulsar-client-go/pulsar/internal.(*connection).handleSendReceipt(0xc00c31fb00, 0xc09ba0fe00)
	/go/pkg/mod/cd.splunkdev.com/streamlio/pulsar-client-go@v0.1.1-streamlio-5/pulsar/internal/connection.go:588 +0xee
github.com/apache/pulsar-client-go/pulsar/internal.(*connection).internalReceivedCommand(0xc00c31fb00, 0xc00e40e8c0, 0x0, 0x0)
	/go/pkg/mod/cd.splunkdev.com/streamlio/pulsar-client-go@v0.1.1-streamlio-5/pulsar/internal/connection.go:507 +0x1ce
github.com/apache/pulsar-client-go/pulsar/internal.(*connection).run(0xc00c31fb00)
	/go/pkg/mod/cd.splunkdev.com/streamlio/pulsar-client-go@v0.1.1-streamlio-5/pulsar/internal/connection.go:368 +0x2db
github.com/apache/pulsar-client-go/pulsar/internal.(*connection).start.func1(0xc00c31fb00)
	/go/pkg/mod/cd.splunkdev.com/streamlio/pulsar-client-go@v0.1.1-streamlio-5/pulsar/internal/connection.go:230 +0x71
created by github.com/apache/pulsar-client-go/pulsar/internal.(*connection).start
	/go/pkg/mod/cd.splunkdev.com/streamlio/pulsar-client-go@v0.1.1-streamlio-5/pulsar/internal/connection.go:226 +0x3f

Modifications

We don't need to hold the connection lock while the producer is processing the write failure. Releasing the lock earlier is fixing the problem.

@merlimat merlimat added this to the 0.3.0 milestone Oct 9, 2020
@merlimat merlimat self-assigned this Oct 9, 2020
Copy link
Member

@wolfstudy wolfstudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM +1

@wolfstudy wolfstudy merged commit 02b244e into apache:master Oct 9, 2020
@merlimat merlimat deleted the fix-producer-deadlock branch October 9, 2020 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants