set minPartitionsAutoDiscoveryInterval to prevent partition metadata lookup overwhelm brokers #876

Open
wants to merge 2 commits into
base: master

zzzming
Contributor

@zzzming zzzming commented Oct 28, 2022

Set minPartitionsAutoDiscoveryInterval to prevent partition metadata lookups from overwhelming brokers

Motivation

This issue is described in a client application's PR: signalfx/splunk-otel-collector#2185

On a 2.7 broker, a producer configured with a very short partition auto-discovery interval can overwhelm the broker with partition metadata lookups. We have observed very high CPU usage. In an extreme case, a broker can run at 100% CPU even without any topic loaded. The broker stack trace looks like this:

org.apache.pulsar.broker.service.PulsarCommandSenderImpl.sendPartitionMetadataResponse(PulsarCommandSenderImpl.java:65)
	at org.apache.pulsar.broker.service.ServerCnx.lambda$null$7(ServerCnx.java:455)
	at org.apache.pulsar.broker.service.ServerCnx$$Lambda$666/0x000000084070a440.apply(Unknown Source)
	at java.util.concurrent.CompletableFuture.uniHandle(java.base@11.0.13/CompletableFuture.java:930)
	at java.util.concurrent.CompletableFuture.uniHandleStage(java.base@11.0.13/CompletableFuture.java:946)
	at java.util.concurrent.CompletableFuture.handle(java.base@11.0.13/CompletableFuture.java:2266)
	at org.apache.pulsar.broker.service.ServerCnx.lambda$handlePartitionMetadataRequest$8(ServerCnx.java:452)
	at org.apache.pulsar.broker.service.ServerCnx$$Lambda$661/0x0000000840708040.apply(Unknown Source)
	at java.util.concurrent.CompletableFuture.uniApplyNow(java.base@11.0.13/CompletableFuture.java:680)
	at java.util.concurrent.CompletableFuture.uniApplyStage(java.base@11.0.13/CompletableFuture.java:658)
	at java.util.concurrent.CompletableFuture.thenApply(java.base@11.0.13/CompletableFuture.java:2094)
	at org.apache.pulsar.broker.service.ServerCnx.handlePartitionMetadataRequest(ServerCnx.java:449)
	at org.apache.pulsar.common.protocol.PulsarDecoder.channelRead(PulsarDecoder.java:122)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.handler.flow.FlowControlHandler.dequeue(FlowControlHandler.java:200)
	at io.netty.handler.flow.FlowControlHandler.channelRead(FlowControlHandler.java:162)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:296)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1534)
	at io.netty.handler.ssl.SslHandler.decodeNonJdkCompatible(SslHandler.java:1295)
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1332)
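
To give a rough sense of the load, here is a back-of-the-envelope sketch in Go. It assumes each producer sends one partition metadata lookup per discovery interval, which is a simplification and not measured behavior. The helper name `lookupsPerSecond` is hypothetical and is not part of the client.

```go
package main

import (
	"fmt"
	"time"
)

// lookupsPerSecond is a hypothetical helper. It estimates the broker-side
// request rate under the assumption that every producer issues exactly one
// partition metadata lookup per discovery interval.
func lookupsPerSecond(producers int, interval time.Duration) float64 {
	if interval <= 0 {
		// This sketch treats a non-positive interval as "no periodic lookups".
		return 0
	}
	return float64(producers) / interval.Seconds()
}

func main() {
	intervals := []time.Duration{time.Millisecond, 500 * time.Millisecond, time.Second}
	for _, interval := range intervals {
		fmt.Printf("1000 producers, interval %v: about %.0f lookups/s\n",
			interval, lookupsPerSecond(1000, interval))
	}
}
```

Under this assumption the rate scales inversely with the interval, so raising a sub-second interval to one second cuts the lookup rate proportionally.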

Modifications

Set one second as the floor value for PartitionsAutoDiscoveryInterval. This prevents the client from issuing lookup calls at a very high frequency.
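
The idea can be sketched as follows. This is a minimal illustration and not the actual diff: the constant name is taken from the PR title, the helper `clampDiscoveryInterval` is hypothetical, and the sketch assumes that every value below the floor, including zero, is raised to one second. How the client treats an unset (zero) interval may differ.

```go
package main

import (
	"fmt"
	"time"
)

// minPartitionsAutoDiscoveryInterval is the one-second floor described above.
const minPartitionsAutoDiscoveryInterval = time.Second

// clampDiscoveryInterval is a hypothetical helper that raises any interval
// shorter than the floor to the floor and leaves longer intervals unchanged.
func clampDiscoveryInterval(interval time.Duration) time.Duration {
	if interval < minPartitionsAutoDiscoveryInterval {
		return minPartitionsAutoDiscoveryInterval
	}
	return interval
}

func main() {
	requested := []time.Duration{time.Millisecond, 500 * time.Millisecond, time.Minute}
	for _, interval := range requested {
		fmt.Printf("requested %v, effective %v\n", interval, clampDiscoveryInterval(interval))
	}
}
```

Clamping silently keeps existing configurations working; rejecting too-short values with an error would be the stricter alternative.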

Verifying this change

  • Make sure that the change passes the CI checks.

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API: (no)
  • The schema: (no)
  • The default values of configurations: (no)
  • The wire protocol: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

@pgier
Contributor

pgier commented Oct 31, 2022

It might be nice in the future to add a field partitionsAutoDiscoveryIntervalSeconds and deprecate the current one. Then the time unit is more clear to the user.

@shibd
Member

shibd commented Feb 9, 2023

It might be nice in the future to add a field partitionsAutoDiscoveryIntervalSeconds and deprecate the current one. Then the time unit is more clear to the user.

+1
