-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PIP-121: Pulsar cluster level auto failover on client side #13316
PIP-121: Pulsar cluster level auto failover on client side #13316
Conversation
@hangc0276:Thanks for your contribution. For this PR, do we need to update docs? |
@hangc0276:Thanks for providing doc info! |
pulsar-client/src/main/java/org/apache/pulsar/client/impl/AutoClusterFailover.java
Show resolved
Hide resolved
* Close the resource that the provider allocated. | ||
* | ||
*/ | ||
default void close() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to make the interface inherit AutoCloseable
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, we can remove the default empty implementation here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other serviceUrlProvider implementation also need to implement close()
method, so I add the default implementation.
this.primaryFailedTimestamp = -1; | ||
this.primaryRecoverTimestamp = -1; | ||
this.secondaryFailedTimestamp = -1; | ||
this.timer = new Timer("pulsar-service-provider"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to use ScheduledThreadPoolExecutor
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
} | ||
|
||
if (primaryFailedTimestamp == -1) { | ||
primaryFailedTimestamp = System.currentTimeMillis(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be better to use System.nanoTime()
to calculate the time locally to avoid time drifts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
public void run() { | ||
// current pulsar serviceUrl is primary | ||
if (currentPulsarServiceUrl.equals(primary)) { | ||
if (probeAvailable(primary, timeout)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about failover after a succession of failures (3 for example, and should be configurable) to avoid unnecessary switches?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation is switch by failure delay time, it can control the client failover time. If we use failure times, the failover time will out of control.
try { | ||
URLConnection conn = pulsarUrlProvider.openConnection(); | ||
inputStream = conn.getInputStream(); | ||
return new String(IOUtils.toByteArray(inputStream), StandardCharsets.UTF_8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the service url as the response body might not be very extensible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 We should use a JSON response here
} | ||
|
||
Socket socket = new Socket(); | ||
socket.connect(new InetSocketAddress(parseHost(hostAndPort), parsePort(hostAndPort)), timeout); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be better to make a request to the server to ensure it is actually in a good state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation is probe whether the service port is open or not. If we want to probe the broker state, it's better to provide a health check command for broker service on broker side. I can use another PR to provide this feature on broker side. @merlimat Do you have any other ideas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you mentioned, we use geo-replication to support cluster failover, that means there might be have more than two cluster. In the implementation, we only can configure two clusters for the failover. I am thinking of, do we need to support more clusters to failover to make the cluster is more robust?
private static String parseHostAndPort(String url) { | ||
if (Strings.isNullOrEmpty(url) || !url.startsWith("pulsar")) { | ||
throw new IllegalArgumentException("'" + url + "' isn't an Pulsar service URL"); | ||
} | ||
|
||
int uriSeparatorPos = url.indexOf("://"); | ||
if (uriSeparatorPos == -1) { | ||
throw new IllegalArgumentException("'" + url + "' isn't an URI."); | ||
} | ||
return url.substring(uriSeparatorPos + 3); | ||
} | ||
|
||
private static String parseHost(String hostAndPort) { | ||
int portSeparatorPos = hostAndPort.indexOf(":"); | ||
if (portSeparatorPos == -1) { | ||
throw new IllegalArgumentException("'" + hostAndPort + "' isn't an URI."); | ||
} | ||
return hostAndPort.substring(0, portSeparatorPos); | ||
} | ||
|
||
private static Integer parsePort(String hostAndPort) { | ||
int portSeparatorPos = hostAndPort.indexOf(":"); | ||
if (portSeparatorPos == -1) { | ||
throw new IllegalArgumentException("'" + hostAndPort + "' isn't an URI."); | ||
} | ||
return Integer.valueOf(hostAndPort.substring(portSeparatorPos+1)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to use URL
class to parse a URL and you can get the host and port easily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think URI
could be better here, because URL
assumes a "known" protocol scheme.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hangc0276 please address this comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
currentPulsarServiceUrl, System.currentTimeMillis() - primaryFailedTimestamp); | ||
} | ||
} else { // current pulsar service URL is secondary, probe whether we need to switch back to primary. | ||
if (!probeAvailable(currentPulsarServiceUrl, timeout)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make the url check and switch as a method? I think most of the following logic is duplicated with the primary service url check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Maybe one could even go further and automatically choose from all of the replicated clusters, the nearest one (via latency probe/from pre-probed list)? |
@zymap if we have multi secondary clusters, different clients will switch to different secondary clusters. There are some problems.
So, in current implementation, we can support only one secondary cluster first. |
@hpvd Please refer to #13316 (comment) |
* @param secondary | ||
* @return | ||
*/ | ||
AutoClusterFailoverBuilder secondary(String secondary); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could take a list/set of serviceUrls, just to make it more general
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could take a list/set of serviceUrls, just to make it more general
done
} | ||
|
||
Socket socket = new Socket(); | ||
socket.connect(new InetSocketAddress(parseHost(hostAndPort), parsePort(hostAndPort)), TIMEOUT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will check for TCP connectivity, which is a good start, though there can be many cases in which it will give the false impression that cluster is healthy:
- We're connecting to a Pulsar proxy, but there are no available brokers
- Using Istio on server side, which always accept the connection even if the broker is in a bad state
- We might have deadlocks in (all) brokers and while the connections get accepted, the brokers are not able to serve them.
We should consider to have a more in depth test to:
- Check that we can authenticate with brokers
- (Maybe) Estimate how many brokers are up & healthy in the cluster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the current probe port method has many disadvantages. We can use this method first, and in next step, we'd better provide a health check command, just like Zookeeper ruok
command, to probe the healthy of the target Pulsar cluster.
private static String parseHostAndPort(String url) { | ||
if (Strings.isNullOrEmpty(url) || !url.startsWith("pulsar")) { | ||
throw new IllegalArgumentException("'" + url + "' isn't an Pulsar service URL"); | ||
} | ||
|
||
int uriSeparatorPos = url.indexOf("://"); | ||
if (uriSeparatorPos == -1) { | ||
throw new IllegalArgumentException("'" + url + "' isn't an URI."); | ||
} | ||
return url.substring(uriSeparatorPos + 3); | ||
} | ||
|
||
private static String parseHost(String hostAndPort) { | ||
int portSeparatorPos = hostAndPort.indexOf(":"); | ||
if (portSeparatorPos == -1) { | ||
throw new IllegalArgumentException("'" + hostAndPort + "' isn't an URI."); | ||
} | ||
return hostAndPort.substring(0, portSeparatorPos); | ||
} | ||
|
||
private static Integer parsePort(String hostAndPort) { | ||
int portSeparatorPos = hostAndPort.indexOf(":"); | ||
if (portSeparatorPos == -1) { | ||
throw new IllegalArgumentException("'" + hostAndPort + "' isn't an URI."); | ||
} | ||
return Integer.valueOf(hostAndPort.substring(portSeparatorPos+1)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think URI
could be better here, because URL
assumes a "known" protocol scheme.
try { | ||
URLConnection conn = pulsarUrlProvider.openConnection(); | ||
inputStream = conn.getInputStream(); | ||
return new String(IOUtils.toByteArray(inputStream), StandardCharsets.UTF_8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 We should use a JSON response here
private volatile String currentPulsarServiceUrl; | ||
private final URL pulsarUrlProvider; | ||
private final ScheduledExecutorService executor; | ||
private final int interval = 30_000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should take the poll interval as a config in the builder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should take the poll interval as a config in the builder
done
private final long intervalMs; | ||
private static final int TIMEOUT = 30_000; | ||
|
||
private AutoClusterFailover(String primary, List<String> secondary, long failoverDelayNs, long switchBackDelayNs, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make the arguments as configuration data or something else to avoid passing so many configurations into the constructor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
* @param authentication | ||
* @return | ||
*/ | ||
AutoClusterFailoverBuilder secondaryAuthentication(List<Authentication> authentication); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to use a map to keep the secondary URLs and the authentication? I think we should consider users will not add the authentication in order with the secondary URL lists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
* @param tlsTrustCertsFilePath | ||
* @return | ||
*/ | ||
AutoClusterFailoverBuilder secondaryTlsTrustCertsFilePath(List<String> tlsTrustCertsFilePath); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same above.
* @param tlsTrustStorePath | ||
* @return | ||
*/ | ||
AutoClusterFailoverBuilder secondaryTlsTrustStorePath(List<String> tlsTrustStorePath); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same above.
* @param tlsTrustStorePassword | ||
* @return | ||
*/ | ||
AutoClusterFailoverBuilder secondaryTlsTrustStorePassword(List<String> tlsTrustStorePassword); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same above
private static String parseHostAndPort(String url) { | ||
if (Strings.isNullOrEmpty(url) || !url.startsWith("pulsar")) { | ||
throw new IllegalArgumentException("'" + url + "' isn't an Pulsar service URL"); | ||
} | ||
|
||
int uriSeparatorPos = url.indexOf("://"); | ||
if (uriSeparatorPos == -1) { | ||
throw new IllegalArgumentException("'" + url + "' isn't an URI."); | ||
} | ||
return url.substring(uriSeparatorPos + 3); | ||
} | ||
|
||
private static String parseHost(String hostAndPort) { | ||
int portSeparatorPos = hostAndPort.indexOf(":"); | ||
if (portSeparatorPos == -1) { | ||
throw new IllegalArgumentException("'" + hostAndPort + "' isn't an URI."); | ||
} | ||
return hostAndPort.substring(0, portSeparatorPos); | ||
} | ||
|
||
private static Integer parsePort(String hostAndPort) { | ||
int portSeparatorPos = hostAndPort.indexOf(":"); | ||
if (portSeparatorPos == -1) { | ||
throw new IllegalArgumentException("'" + hostAndPort + "' isn't an URI."); | ||
} | ||
return Integer.valueOf(hostAndPort.substring(portSeparatorPos+1)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hangc0276 please address this comment
@@ -746,7 +754,7 @@ public void shutdown() throws PulsarClientException { | |||
} catch (PulsarClientException e) { | |||
throwable = e; | |||
} | |||
if (conf != null && conf.getAuthentication() != null) { | |||
if (conf.getAuthentication() != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why remove the conf != null
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I add it back.
String authParamsString = controlledConfiguration.getAuthParamsString(); | ||
String token = controlledConfiguration.getToken(); | ||
|
||
switch (authPluginClassName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remembered that we can configure the authPlugin
and authParams
for all auth plugins, why do we need to case the auth plugins? Does the AuthenticationFactory.create(authPlugin, authParam)
can not handle all things?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@hpvd I added the |
967ace6
to
5d9b48b
Compare
) Related to apache#13315 ### Modification 1. add Pulsar cluster level auto failover
Related to #13315
Modification