
Mutating webhook times out when EventListener has multiple triggers #356

Closed
kaczyns opened this issue Jan 17, 2020 · 5 comments
Labels: kind/bug, maybe-next-milestone

Comments

kaczyns commented Jan 17, 2020

Expected Behavior

I can use the Tekton Dashboard to create multiple webhooks (more than eight or so).

Actual Behavior

When creating the ninth webhook, the mutating webhook triggers-webhook.tekton.dev times out while processing the EventListener that was updated by the Tekton Dashboard.

Steps to Reproduce the Problem

  1. Install tekton dashboard v0.3.0 https://github.com/tektoncd/dashboard/releases/tag/v0.3.0
  2. Create multiple webhooks https://github.com/tektoncd/experimental/blob/master/webhooks-extension/docs/GettingStarted.md#creating-a-new-webhook
  3. After the 8th (or so) webhook is created, subsequent creations fail because the mutating webhook times out.

Additional Info

Multiple users have reported this problem (kabanero-io/kabanero-foundation#150). All are using OpenShift 4.2, openshift-pipelines-operator 0.8 (which includes Triggers 0.1.0), and Tekton Dashboard 0.3.0. It appears that in OpenShift the mutating webhook timeout is 13 seconds. The EventListener has about 24 triggers in it (3 per GitHub webhook) before the mutating webhook starts to time out.

Perhaps a single EventListener was not intended to contain so many triggers... in any case it would be good to know if this is a reasonable configuration.

I rebuilt the tekton-triggers-webhook container and added an elapsed-time measurement to the mutating admission webhook method (vendor/knative.dev/pkg/webhook/webhook.go, logging time.Since(ttStart)):

	// ttStart := time.Now() is captured earlier in the handler (not shown in this snippet)
	logger.Infof("AdmissionReview for %#v: %s/%s response=%#v elapsedTime=%v",
		review.Request.Kind, review.Request.Namespace, review.Request.Name, reviewResponse, time.Since(ttStart))

The elapsed time increases by about 1.2 seconds per GitHub webhook that I create on my system (using the dashboard). The container can be found here:

docker.io/kaczyns/webhook-dd1edc925ee1772a9f76e2c1bc291ef6@sha256:a1c5a6bce855e97135cbbc7bfea8d7a6862084b19d4ef8434df4b677f7503227
ncskier (Member) commented Jan 17, 2020

I was able to reproduce this issue with Tekton Triggers v0.1.0 installed on OpenShift after creating an EventListener with 37 triggers:

Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s

I wrote up this quick bash script to reproduce the error if anyone wants to try it out:

#!/bin/bash

# Cleanup
rm -f eventlistener_*.yaml

# Define variables
max=50
if [ ! -z "$1" ]; then
  max="$1"
fi
range=40
if [ ! -z "$2" ]; then
  range="$2"
fi
min=$((max - range))
if [ $min -le 0 ]; then
  min=1
fi
echo "Max triggers: $max"
echo "Min triggers: $min"
echo ""

# Generate EventListener files
cat << EOF >> "eventlistener_0.yaml"
apiVersion: tekton.dev/v1alpha1
kind: EventListener
metadata:
  name: listener
spec:
  serviceAccountName: tekton-triggers-example-sa
  triggers:
EOF
for ((i=1; i<="$max"; i++)); do
  prevFile="eventlistener_$((i - 1)).yaml"
  newFile="eventlistener_${i}.yaml"
  cp "$prevFile" "$newFile"
  cat << EOF >> "$newFile"
    - name: foo-trig
      binding:
        name: pipeline-binding
      template:
        name: pipeline-template
EOF
done

# Apply files
for ((i="$min"; i<="$max"; i++)); do
  # Apply files and record how long it takes
  echo "### $i triggers ###"
  file="eventlistener_${i}.yaml"
  startTime="$(date -u +%s)"
  kubectl apply -f "$file"
  endTime="$(date -u +%s)"
  elapsed="$(($endTime - $startTime))"
  echo "${elapsed}s"
  echo ""
done
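
For reference, the script takes an optional maximum trigger count and range as positional arguments (e.g. ./reproduce.sh 50 40, assuming it is saved as reproduce.sh, a placeholder name); with the defaults above it applies EventListeners with 10 through 50 triggers and prints how long each kubectl apply takes.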

Next, I'll see at what point I can reproduce the error with Triggers v0.2.0.

Update:
So far, this is not an issue in Triggers v0.2.0. I was able to use my testing script above to create an EventListener with 2600 triggers. Beyond that point I started running into an issue with my YAML files being too big:

The EventListener "listener" is invalid: metadata.annotations: Too long: must have at most 262144 characters

(That error comes from kubectl apply storing the object's full configuration in the kubectl.kubernetes.io/last-applied-configuration annotation, which runs into the 262144-character annotation limit.)

The validation in v0.2.0 is a lot simpler than in v0.1.0. I think that the timeout error in v0.1.0 is probably coming from us querying the Kubernetes clientset a number of times: https://github.com/tektoncd/triggers/blob/v0.1.0/pkg/apis/triggers/v1alpha1/event_listener_validation.go#L54.
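
To make the scaling concrete, here is a rough, self-contained sketch (in Go, but not the actual Triggers validation code; the lookup type, the trigger struct, and the 50 ms latency are made up for illustration) of why per-trigger clientset lookups make webhook latency grow linearly with the number of triggers, while purely structural checks, roughly what v0.2.0 does, stay cheap:

// Not the actual Triggers code: a sketch of clientset-backed validation
// (v0.1.0-style) versus purely structural validation (v0.2.0-style).
package main

import (
	"errors"
	"fmt"
	"time"
)

// lookup stands in for one Kubernetes API round-trip (hypothetical helper).
type lookup func(namespace, name string) error

type trigger struct {
	BindingName  string
	TemplateName string
}

// v0.1.0-style: every trigger costs API calls inside the admission webhook.
func validateWithClientset(ns string, triggers []trigger, getBinding, getTemplate lookup) error {
	for _, tr := range triggers {
		if err := getBinding(ns, tr.BindingName); err != nil {
			return err
		}
		if err := getTemplate(ns, tr.TemplateName); err != nil {
			return err
		}
	}
	return nil
}

// v0.2.0-style: field checks only, no API traffic.
func validateStructurally(triggers []trigger) error {
	for _, tr := range triggers {
		if tr.BindingName == "" || tr.TemplateName == "" {
			return errors.New("trigger must reference a binding and a template")
		}
	}
	return nil
}

func main() {
	triggers := make([]trigger, 24)
	for i := range triggers {
		triggers[i] = trigger{BindingName: "pipeline-binding", TemplateName: "pipeline-template"}
	}
	// Simulate ~50ms of API latency per lookup.
	slowGet := func(string, string) error { time.Sleep(50 * time.Millisecond); return nil }

	start := time.Now()
	_ = validateWithClientset("default", triggers, slowGet, slowGet)
	fmt.Println("clientset-style validation:", time.Since(start)) // roughly 2.4s for 24 triggers

	start = time.Now()
	_ = validateStructurally(triggers)
	fmt.Println("structural validation:", time.Since(start)) // microseconds
}

With 24 triggers and two simulated 50 ms lookups each, the clientset-style path alone spends roughly 2.4 s inside the webhook, the same order of magnitude as the timeouts reported above.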

My test script is not very realistic, because it just creates a number of duplicate triggers. However, it doesn't look like there's a way to exploit anything in the EventListener validation logic to make it time out.

ncskier (Member) commented Jan 17, 2020

Also, we should probably create an e2e test to prevent a regression from occurring. To do this, I think we'll need a "realistic" maximum number of triggers that we think an EventListener should be able to have. I'm not really sure what this semi-arbitrary number should be.
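
As a starting point, here is a rough sketch of what such a regression test could look like, shelling out to kubectl from Go; the 1000-trigger count, the 30-second budget, the listener name, and the assumption that pipeline-binding and pipeline-template already exist are placeholders rather than details of the real e2e suite:

// Sketch only: assumes kubectl is pointed at a test cluster and that
// pipeline-binding and pipeline-template already exist in the namespace.
package e2e

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
	"testing"
	"time"
)

func TestEventListenerWithManyTriggers(t *testing.T) {
	const numTriggers = 1000 // placeholder for the agreed-upon "realistic" maximum

	// Build an EventListener manifest with numTriggers identical triggers.
	var sb strings.Builder
	sb.WriteString(`apiVersion: tekton.dev/v1alpha1
kind: EventListener
metadata:
  name: scale-listener
spec:
  serviceAccountName: tekton-triggers-example-sa
  triggers:
`)
	for i := 0; i < numTriggers; i++ {
		sb.WriteString(fmt.Sprintf(`    - name: trig-%d
      binding:
        name: pipeline-binding
      template:
        name: pipeline-template
`, i))
	}

	f, err := os.CreateTemp("", "eventlistener-*.yaml")
	if err != nil {
		t.Fatal(err)
	}
	defer os.Remove(f.Name())
	if _, err := f.WriteString(sb.String()); err != nil {
		t.Fatal(err)
	}
	f.Close()

	// The admission webhook has to answer well inside the API server's timeout,
	// so fail if applying the EventListener takes too long.
	start := time.Now()
	out, err := exec.Command("kubectl", "apply", "-f", f.Name()).CombinedOutput()
	if err != nil {
		t.Fatalf("kubectl apply failed: %v\n%s", err, out)
	}
	if elapsed := time.Since(start); elapsed > 30*time.Second {
		t.Fatalf("creating an EventListener with %d triggers took %v", numTriggers, elapsed)
	}
}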

ncskier pushed a commit to ncskier/triggers that referenced this issue Jan 20, 2020
Issue tektoncd#356 found a bug in Triggers v0.1.0. EventListeners with too many
triggers (around 27 in this user's case) could not be created due to an
Admission Webhook timeout. This bug makes Triggers v0.1.0 unusable at a
large scale.

PR tektoncd#263 (which came after v0.1.0) fixes the timeout issue by removing
clientset use from the EventListener validation code. So, I have
manually re-applied the changes made in PR tektoncd#263 on top of v0.1.0. This
should fix the timeout bug seen in v0.1.0.
dibyom (Member) commented Jan 27, 2020

I think having some performance numbers on the number of Triggers we can support would be nice!

Given that we relaxed the validation to not check for the presence of bindings/templates, I don't think creating the EL is where we'll run into issues.

We do use the Kubernetes/Triggers clientset to fetch resources in the EL sink at runtime (e.g. fetch the EL, then iterate over all the triggers; fetch the secret for GH/GL interceptors, etc.). That might be problematic if we have a large number of incoming requests.
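
A self-contained sketch of that per-request pattern (in Go, but not the real sink code; the apiGet helper, the hard-coded trigger count, and the secret name are invented for illustration), mainly to show that the API latency is paid on the hot path of every incoming event:

// Sketch only: models an event sink that re-fetches its EventListener and
// per-interceptor secrets on every incoming request.
package main

import (
	"log"
	"net/http"
	"time"
)

// apiGet stands in for one clientset Get call (hypothetical helper).
type apiGet func(kind, namespace, name string) error

func newSinkHandler(namespace, elName string, get apiGet) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()

		// 1. Fetch the EventListener on every request.
		if err := get("EventListener", namespace, elName); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}

		// 2. Iterate over the triggers; for GitHub/GitLab interceptors, fetch the
		//    webhook secret as well, so each trigger can add another API call.
		const numTriggers = 24
		for i := 0; i < numTriggers; i++ {
			if err := get("Secret", namespace, "github-secret"); err != nil {
				http.Error(w, err.Error(), http.StatusInternalServerError)
				return
			}
		}

		log.Printf("handled event in %v", time.Since(start))
		w.WriteHeader(http.StatusAccepted)
	}
}

func main() {
	// Simulate ~50ms of API latency per lookup; under a burst of events this
	// per-request cost multiplies quickly.
	fakeGet := func(kind, namespace, name string) error {
		time.Sleep(50 * time.Millisecond)
		return nil
	}
	http.Handle("/", newSinkHandler("default", "listener", fakeGet))
	log.Fatal(http.ListenAndServe(":8080", nil))
}

With these simulated numbers, each event spends over a second in API calls alone, which is the kind of cost that becomes a problem under a large number of incoming requests.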

vdemeester (Member) commented:

/kind bug

tekton-robot added the kind/bug label Jan 27, 2020
ncskier (Member) commented Feb 3, 2020

I'm closing this issue since the bug has been resolved in the latest release of Triggers (v0.2.1).

I'm opening a couple of issues for follow-up work (see the regression-test commits referenced below).

ncskier closed this as completed Feb 3, 2020
tekton-robot pushed a commit that referenced this issue Feb 20, 2020
Fixes #405

Issue #356 found a bug in Triggers v0.1.0 (fixed in v0.2.0) where the
admission webhook times out after adding around 35 Triggers to an
EventListener.

This PR adds a regression test where we create an EventListener with
1000 Triggers.

Also, moves the cluster-scope resource cleanup code (from
7ba80a7)
out of init_test.go and into eventlistener_test.go.
pradeepitm12 pushed a commit to pradeepitm12/triggers that referenced this issue Mar 16, 2021