Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Proxy] Fix port exhaustion and connection issues in Pulsar Proxy #14078

Merged
merged 11 commits into from
Feb 8, 2022

Conversation

lhotari
Copy link
Member

@lhotari lhotari commented Jan 31, 2022

Fixes #14075
Fixes #13923

Motivation

Pulsar Proxy can get into a state where it stops proxying Broker connections while Admin API proxying keeps working.
The proxy logs are filled with this type of warnings:

[pulsar-proxy-io-2-1] WARN  org.apache.pulsar.client.impl.ConnectionPool - Failed to open connection to pulsar-dev-broker/172.20.4.120:6650 : io.netty.channel.AbstractChannel$AnnotatedConnectException:
connect(.      .) failed: Cannot assign requested address: pulsar-dev-broker.pulsar.svc.cluster.local/172.20.4.120:6650

The "Cannot assign requested address" error message is a sign of a port exhaustion issue where there are many connections open, possibly hanging.

Additional context

One possible reason for the broken hanging connections could be a race condition that shows up in logs this way:

[pulsar-proxy-io-2-3] WARN  io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.lang.UnsupportedOperationException: null
        at org.apache.pulsar.common.protocol.PulsarDecoder.handleProducer(PulsarDecoder.java:479)
        at org.apache.pulsar.common.protocol.PulsarDecoder.channelRead(PulsarDecoder.java:193)
        at org.apache.pulsar.proxy.server.ProxyConnection.channelRead(ProxyConnection.java:193)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [io.netty-netty-transport-4.1.72.Final.jar:4.1.72.Final]

This is reported as #13923 and it is fixed as part of the same PR.

Modifications

  • Optimize the proxy connection to fail-fast if the target broker isn't active

    • This reduces the number of hanging connections when unavailable brokers aren't unnecessarily attempted to be reached.
    • Pulsar client will retry connecting after a back off timeout
  • Fixes the race condition in the Pulsar Proxy when opening a connection since that
    could lead to invalid states and hanging connections

  • Add connect timeout handling to proxy connection

    • default to 10000 ms which is also the default of client's connect timeout
  • Add read timeout handling to incoming connection and proxied connection

    • the ping/pong keepalive messages should prevent the timeout happening,
      however the connection might be in a state where keepalives aren't happening.
      • therefore, it's better to have a connection level read timeout to prevent broken connections left
        hanging in the proxy

@lhotari lhotari added type/bug The PR fixed a bug or issue reported a bug area/proxy doc-not-needed Your PR changes do not impact docs labels Jan 31, 2022
@lhotari lhotari self-assigned this Jan 31, 2022
Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
I left some minor suggestions

we should cherry pick this patch to 2.9 and 2.8 branches

@lhotari lhotari force-pushed the lh-fix-port-exhaustion-in-proxy branch from e4b4081 to d8c21cf Compare January 31, 2022 16:25
lhotari added a commit to lhotari/pulsar-helm-chart that referenced this pull request Jan 31, 2022
- enables service discovery in the proxy
- required by apache/pulsar#14078 changes
@lhotari lhotari marked this pull request as draft February 2, 2022 20:48
@lhotari lhotari force-pushed the lh-fix-port-exhaustion-in-proxy branch from d8c21cf to 01f4d47 Compare February 4, 2022 15:40
@lhotari
Copy link
Member Author

lhotari commented Feb 4, 2022

@merlimat There are some test failures, which I'll address next week. Please review the high level approach to see if that's fine.

Fixes apache#14075
Fixes apache#13923

- Optimize the proxy connection to fail-fast if the target broker isn't active
  - This reduces the number of hanging connections when unavailable brokers aren't unnecessarily attempted to be reached.
  - Pulsar client will retry connecting after a back off timeout

- Fixes the race condition in the Pulsar Proxy when opening a connection since that
  could lead to invalid states and hanging connections

- Add connect timeout handling to proxy connection
  - default to 10000 ms which is also the default of client's connect timeout

- Add read timeout handling to incoming connection and proxied connection
  - the ping/pong keepalive messages should prevent the timeout happening,
    however it's possible that the connection is in a state where keepalives aren't happening.
    - therefore it's better to have a connection level read timeout prevent broken connections left
      hanging in the proxy
- the test wasn't testing the proxy at all
@lhotari lhotari marked this pull request as ready for review February 7, 2022 12:15
@lhotari lhotari requested review from merlimat and eolivelli February 7, 2022 12:22
Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Left one comment about a test for ports.

import org.apache.curator.shaded.com.google.common.net.InetAddresses;
import org.testng.annotations.Test;

public class BrokerProxyValidatorTest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be appropriate to add a test named shouldPreventInvalidPort.

@eolivelli eolivelli merged commit 640b4e6 into apache:master Feb 8, 2022
lhotari added a commit to lhotari/pulsar that referenced this pull request Feb 9, 2022
lhotari added a commit that referenced this pull request Feb 9, 2022
@lhotari lhotari added cherry-picked/branch-2.7 Archived: 2.7 is end of life cherry-picked/branch-2.8 Archived: 2.8 is end of life labels Feb 9, 2022
lhotari added a commit that referenced this pull request Feb 9, 2022
@lhotari lhotari added the cherry-picked/branch-2.9 Archived: 2.9 is end of life label Feb 9, 2022
lhotari added a commit to datastax/pulsar that referenced this pull request Feb 9, 2022
…ache#14078)

(cherry picked from commit 640b4e6)
(cherry picked from commit 3d2e6ce)
(cherry picked from commit b3bac91)
lhotari added a commit to datastax/pulsar that referenced this pull request Feb 9, 2022
Nicklee007 pushed a commit to Nicklee007/pulsar that referenced this pull request Apr 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/proxy cherry-picked/branch-2.7 Archived: 2.7 is end of life cherry-picked/branch-2.8 Archived: 2.8 is end of life cherry-picked/branch-2.9 Archived: 2.9 is end of life doc-not-needed Your PR changes do not impact docs release/2.7.5 release/2.8.3 release/2.9.2 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
6 participants