TLS Authentication in Kubernetes, Pulsar 2.6.1 - Broker crash loop on startup due to 401 in WorkerService.start(..) #8536

Closed
devinbost opened this issue Nov 12, 2020 · 1 comment
Labels
type/bug The PR fixed a bug or issue reported a bug

devinbost commented Nov 12, 2020

Describe the bug
After configuring TLS Authentication in Pulsar 2.6.1 with this helm chart: https://github.com/devinbost/pulsar-helm-chart/tree/tls-auth
the broker gets stuck in a restart loop due to the WorkerService crashing with:

WARN org.apache.pulsar.client.admin.internal.BaseResource - [http://pulsar-ci-broker-0.pulsar-ci-broker.pulsar.svc.cluster.local:8080/admin/v2/persistent/public/functions/assignments] Failed to perform http put request: javax.ws.rs.NotAuthorizedException: HTTP 401 Unauthorized

during the WorkerService.start(..) method execution.
With TLS Authentication enabled, the endpoint above should be the TLS endpoint (https://pulsar-ci-broker-0.pulsar-ci-broker.pulsar.svc.cluster.local:8443/admin/v2/persistent/public/functions/assignments), not the non-TLS endpoint. This may be why the PUT to functions/assignments returns a 401 on broker startup.

To Reproduce
Steps to reproduce the behavior:

  1. Clone the tls-auth branch of my fork of the Pulsar helm chart by running:
    git clone https://github.com/devinbost/pulsar-helm-chart.git
    cd pulsar-helm-chart
    git checkout tls-auth

  2. Start minikube with an appropriate number of CPUs:
    minikube start --memory=8192 --cpus=6 --cni=bridge

  3. Run the following commands to set up the Kubernetes environment, tokens, certs, and keys:
    ./scripts/cert-manager/install-cert-manager.sh
    ./scripts/pulsar/prepare_helm_release.sh -n pulsar -k pulsar-ci -c --pulsar-superusers superadmin,proxy-admin,broker-admin,client-admin
    ./scripts/pulsar/upload_tls.sh -k pulsar-ci -d ./.ci/tls

  4. Install the local helm chart with the values file specified:
    helm install --values examples/values-minikube-with-tls-and-jwt.yaml pulsar-ci ./charts/pulsar/

  5. After giving the pods time to start, get the logs from the broker:
    kubectl -n pulsar logs pulsar-ci-broker-0

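The logs should show the 401 warning above, followed by the broker restarting.

Expected behavior
The broker should start cleanly: WorkerService.start(..) should complete its PUT on the function assignment topic without a 401, and the broker should not enter a restart loop.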

Environment

  • minikube v1.14.2 on Darwin 10.15.7
  • Kubernetes v1.19.2 on Docker 19.03.8 ...
  • Enabled addons: storage-provisioner, default-storageclass
  • kubectl is configured to use "minikube"

Additional Context
Here is the WorkerConfig provided to the WorkerService, as reported in the logs:

01:07:20.757 [main] INFO  org.apache.pulsar.functions.worker.WorkerService - Worker Configs: {
  "workerId" : "c-pulsar-ci-fw-pulsar-ci-broker-0.pulsar-ci-broker.pulsar.svc.cluster.local-8080",
  "workerHostname" : "pulsar-ci-broker-0.pulsar-ci-broker.pulsar.svc.cluster.local",
  "workerPort" : 8080,
  "workerPortTls" : 6751,
  "authenticateMetricsEndpoint" : true,
  "includeStandardPrometheusMetrics" : false,
  "jvmGCMetricsLoggerClassName" : null,
  "numHttpServerThreads" : 8,
  "configurationStoreServers" : "pulsar-ci-zookeeper:2281",
  "zooKeeperSessionTimeoutMillis" : 30000,
  "zooKeeperOperationTimeoutSeconds" : 30,
  "zooKeeperCacheExpirySeconds" : 300,
  "connectorsDirectory" : "./connectors",
  "narExtractionDirectory" : "/tmp",
  "validateConnectorConfig" : false,
  "functionsDirectory" : "./functions",
  "functionMetadataTopicName" : "metadata",
  "functionWebServiceUrl" : null,
  "pulsarServiceUrl" : "pulsar://pulsar-ci-broker-0.pulsar-ci-broker.pulsar.svc.cluster.local:6650",
  "pulsarWebServiceUrl" : "http://pulsar-ci-broker-0.pulsar-ci-broker.pulsar.svc.cluster.local:8080",
  "clusterCoordinationTopicName" : "coordinate",
  "pulsarFunctionsNamespace" : "public/functions",
  "pulsarFunctionsCluster" : "pulsar-ci",
  "numFunctionPackageReplicas" : 1,
  "downloadDirectory" : "download/pulsar_functions",
  "stateStorageServiceUrl" : null,
  "functionAssignmentTopicName" : "assignments",
  "schedulerClassName" : "org.apache.pulsar.functions.worker.scheduler.RoundRobinScheduler",
  "failureCheckFreqMs" : 30000,
  "rescheduleTimeoutMs" : 60000,
  "initialBrokerReconnectMaxRetries" : 60,
  "assignmentWriteMaxRetries" : 60,
  "instanceLivenessCheckFreqMs" : 30000,
  "clientAuthenticationPlugin" : "org.apache.pulsar.client.impl.auth.AuthenticationTls",
  "clientAuthenticationParameters" : "tlsCertFile:/pulsar/certs/broker/tls.crt,tlsKeyFile:/pulsar/certs/broker/tls.key",
  "bookkeeperClientAuthenticationPlugin" : null,
  "bookkeeperClientAuthenticationParametersName" : null,
  "bookkeeperClientAuthenticationParameters" : null,
  "topicCompactionFrequencySec" : 1800,
  "tlsEnabled" : true,
  "tlsCertificateFilePath" : null,
  "tlsKeyFilePath" : null,
  "tlsTrustCertsFilePath" : "/pulsar/certs/ca/ca.crt",
  "tlsAllowInsecureConnection" : false,
  "tlsRequireTrustedClientCertOnConnect" : false,
  "useTls" : false,
  "tlsHostnameVerificationEnable" : false,
  "tlsCertRefreshCheckDurationSec" : 300,
  "authenticationEnabled" : true,
  "authenticationProviders" : [ "org.apache.pulsar.broker.authentication.AuthenticationProviderToken", "org.apache.pulsar.broker.authentication.AuthenticationProviderTls" ],
  "authorizationEnabled" : true,
  "authorizationProvider" : "org.apache.pulsar.broker.authorization.PulsarAuthorizationProvider",
  "superUserRoles" : [ "broker-admin", "client-admin", "proxy-admin" ],
  "properties" : { },
  "brokerClientTrustCertsFilePath" : null,
  "functionRuntimeFactoryClassName" : "org.apache.pulsar.functions.runtime.kubernetes.KubernetesRuntimeFactory",
  "functionRuntimeFactoryConfigs" : {
    "changeConfigMap" : "pulsar-ci-functions-worker-config",
    "changeConfigMapNamespace" : "pulsar",
    "expectedMetricsCollectionInterval" : "30",
    "extraFunctionDependenciesDir" : null,
    "installUserCodeDependencies" : "true",
    "javaInstanceJarLocation" : null,
    "jobNamespace" : "pulsar",
    "logDirectory" : "/tmp",
    "pulsarAdminUrl" : "https://pulsar-ci-broker:8443/",
    "pulsarDockerImageName" : "apachepulsar/pulsar:2.6.1",
    "pulsarRootDir" : "/pulsar",
    "pulsarServiceUrl" : "pulsar+ssl://pulsar-ci-broker:6651/",
    "pythonInstanceLocation" : null,
    "submittingInsidePod" : "true"
  },
  "secretsProviderConfiguratorClassName" : null,
  "secretsProviderConfiguratorConfig" : null,
  "functionInstanceMinResources" : null,
  "functionAuthProviderClassName" : "org.apache.pulsar.functions.auth.KubernetesSecretsTokenAuthProvider",
  "runtimeCustomizerClassName" : null,
  "runtimeCustomizerConfig" : { },
  "maxPendingAsyncRequests" : 1000,
  "threadContainerFactory" : null,
  "processContainerFactory" : null,
  "kubernetesContainerFactory" : {
    "k8Uri" : null,
    "jobNamespace" : "pulsar",
    "pulsarDockerImageName" : "apachepulsar/pulsar:2.6.1",
    "imagePullPolicy" : null,
    "pulsarRootDir" : "/pulsar",
    "configAdminCLI" : null,
    "submittingInsidePod" : true,
    "pulsarServiceUrl" : "pulsar+ssl://pulsar-ci-broker:6651/",
    "pulsarAdminUrl" : "https://pulsar-ci-broker:8443/",
    "installUserCodeDependencies" : true,
    "pythonDependencyRepository" : null,
    "pythonExtraDependencyRepository" : null,
    "extraFunctionDependenciesDir" : null,
    "customLabels" : null,
    "expectedMetricsCollectionInterval" : 30,
    "changeConfigMap" : "pulsar-ci-functions-worker-config",
    "changeConfigMapNamespace" : "pulsar",
    "percentMemoryPadding" : 0,
    "cpuOverCommitRatio" : 1.0,
    "memoryOverCommitRatio" : 1.0,
    "grpcPort" : 9093,
    "metricsPort" : 9094,
    "narExtractionDirectory" : "/tmp"
  },
  "functionMetadataTopic" : "persistent://public/functions/metadata",
  "clusterCoordinationTopic" : "persistent://public/functions/coordinate",
  "functionAssignmentTopic" : "persistent://public/functions/assignments",
  "tlsTrustChainBytes" : "LS0tLS1C. . . =",
  "workerWebAddress" : "http://pulsar-ci-broker-0.pulsar-ci-broker.pulsar.svc.cluster.local:8080"
}
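
A minimal sketch for spotting plain-HTTP admin endpoints in a config dump like the one above. The class and method names are hypothetical and not part of Pulsar; the map is a hand-copied subset of the URL settings from the log, and unset (null) values are skipped.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical diagnostic (not part of Pulsar): lists settings whose URL uses plain http. */
final class NonTlsEndpointCheck {

    private NonTlsEndpointCheck() {}

    /** Returns the keys whose value is a URL with the "http" scheme, in map order. */
    static List<String> plainHttpKeys(Map<String, String> urlSettings) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : urlSettings.entrySet()) {
            if (entry.getValue() == null) {
                continue; // unset in the dump, e.g. functionWebServiceUrl
            }
            String scheme = URI.create(entry.getValue()).getScheme();
            if ("http".equals(scheme)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String host = "pulsar-ci-broker-0.pulsar-ci-broker.pulsar.svc.cluster.local";
        // Values copied from the Worker Configs log output in this report.
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("pulsarWebServiceUrl", "http://" + host + ":8080");
        settings.put("functionWebServiceUrl", null);
        settings.put("pulsarAdminUrl", "https://pulsar-ci-broker:8443/");
        settings.put("workerWebAddress", "http://" + host + ":8080");

        // Prints: [pulsarWebServiceUrl, workerWebAddress]
        System.out.println(plainHttpKeys(settings));
    }
}
```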

Clues and Possible Solution

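The only admin endpoints in the WorkerConfig that are NOT TLS are pulsarWebServiceUrl and workerWebAddress, both of which are http://pulsar-ci-broker-0.pulsar-ci-broker.pulsar.svc.cluster.local:8080.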

When we create the brokerAdmin client, we use the pulsarWebServiceUrl: https://github.com/apache/pulsar/blob/master/pulsar-functions/worker/src/main/java/org/apache/pulsar/functions/worker/WorkerService.java#L146

The first PUT on the function assignment topic uses the brokerAdminclient here: https://github.com/apache/pulsar/blob/master/pulsar-functions/worker/src/main/java/org/apache/pulsar/functions/worker/WorkerService.java#L169

Although we check a few TLS-related configurations (tlsTrustCertsFilePath, allowTlsInsecureConnection, enableTlsHostnameVerificationEnable) when we create the PulsarAdmin client, nothing appears to switch the client to a TLS endpoint when TLS Authentication is enabled.

If a TLS endpoint is required to resolve the 401 response issue, we need to add logic to check if TLS Authentication is enabled; and when TLS Authentication is enabled, we need to use a TLS endpoint when creating the AdminClient instances, such as brokerAdmin.
We could easily add the logic to resolve the correct URL to the brokerConfig class (ServiceConfiguration) since this class already knows if brokerClientTlsEnabled is true/false: https://github.com/apache/pulsar/blob/master/pulsar-broker-common/src/main/java/org/apache/pulsar/broker/ServiceConfiguration.java#L1656
Then, that value could be assigned to a property on workerConfig before injecting workerConfig into our WorkerService: https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/PulsarBrokerStarter.java#L175
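
The proposal could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Pulsar code: AdminUrlResolver and resolveAdminUrl are hypothetical names, the tlsEnabled parameter stands in for a flag such as brokerClientTlsEnabled, and the ports are the 8080 and 8443 values seen in this report.

```java
/**
 * Hypothetical helper (not part of Pulsar): chooses the admin URL that the
 * broker-embedded functions worker would use, based on whether TLS is enabled.
 */
final class AdminUrlResolver {

    private AdminUrlResolver() {}

    /**
     * @param tlsEnabled stand-in for a flag such as brokerClientTlsEnabled
     * @param host       broker hostname
     * @param webPort    non-TLS web service port
     * @param webPortTls TLS web service port, or null if none is configured
     * @return an https URL when TLS is enabled, otherwise an http URL
     */
    static String resolveAdminUrl(boolean tlsEnabled, String host, int webPort, Integer webPortTls) {
        if (tlsEnabled) {
            if (webPortTls == null) {
                // Fail fast instead of silently falling back to the non-TLS endpoint.
                throw new IllegalStateException(
                        "TLS is enabled but no TLS web service port is configured");
            }
            return "https://" + host + ":" + webPortTls;
        }
        return "http://" + host + ":" + webPort;
    }

    public static void main(String[] args) {
        String host = "pulsar-ci-broker-0.pulsar-ci-broker.pulsar.svc.cluster.local";
        // With TLS enabled, the worker would be handed the 8443 endpoint.
        System.out.println(resolveAdminUrl(true, host, 8080, 8443));
        // Without TLS, the current behaviour (8080) is preserved.
        System.out.println(resolveAdminUrl(false, host, 8080, 8443));
    }
}
```

The resolved value would then be set on workerConfig (for example as pulsarWebServiceUrl) before it is passed to the WorkerService. Throwing when the TLS port is missing is a design choice in this sketch, so a misconfiguration shows up as a clear startup error instead of a 401.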

I'd like to get some feedback on this proposal.

@devinbost devinbost added the type/bug The PR fixed a bug or issue reported a bug label Nov 12, 2020
sijie commented Nov 12, 2020

@devinbost It doesn't seem to be a pulsar core issue. Can you create the issue under pulsar-helm-chart repo?
