
Mutating webhook times out when EventListener has multiple triggers #356

Closed
kaczyns opened this issue Jan 17, 2020 · 5 comments
Labels: kind/bug, maybe-next-milestone

Comments

kaczyns commented Jan 17, 2020

Expected Behavior

I can use the Tekton Dashboard to create multiple webhooks (more than eight or so).

Actual Behavior

When creating the ninth webhook, the mutating webhook triggers-webhook.tekton.dev times out while processing the EventListener that was updated by the Tekton Dashboard.

Steps to Reproduce the Problem

  1. Install tekton dashboard v0.3.0 https://github.com/tektoncd/dashboard/releases/tag/v0.3.0
  2. Create multiple webhooks https://github.com/tektoncd/experimental/blob/master/webhooks-extension/docs/GettingStarted.md#creating-a-new-webhook
  3. After the 8th (or so) webhook is created, subsequent creations fail because the mutating webhook times out.

Additional Info

Multiple users have reported this problem (kabanero-io/kabanero-foundation#150). All are using OpenShift 4.2, openshift-pipelines-operator 0.8 (which includes Triggers 0.1.0), and Tekton Dashboard 0.3.0. It appears that in OpenShift the mutating webhook timeout is 13 seconds. The EventListener has about 24 triggers in it (3 per GitHub webhook) before the mutating webhook starts to time out.

Perhaps a single EventListener was not intended to contain so many triggers... in any case it would be good to know if this is a reasonable configuration.

I rebuilt the tekton-triggers-webhook container and added an elapsed-time measurement to the mutating admission webhook method (vendor/knative.dev/pkg/webhook/webhook.go, logging time.Since(ttStart)):

	// ttStart := time.Now() is captured earlier in the handler (not shown in this snippet)
	logger.Infof("AdmissionReview for %#v: %s/%s response=%#v elapsedTime=%v",
		review.Request.Kind, review.Request.Namespace, review.Request.Name, reviewResponse, time.Since(ttStart))

The elapsed time increases by about 1.2 seconds per GitHub webhook that I create on my system (using the dashboard). The container can be found here:

docker.io/kaczyns/webhook-dd1edc925ee1772a9f76e2c1bc291ef6@sha256:a1c5a6bce855e97135cbbc7bfea8d7a6862084b19d4ef8434df4b677f7503227
ncskier (Member) commented Jan 17, 2020

I was able to reproduce this issue with Tekton Triggers v0.1.0 installed on OpenShift after creating an EventListener with 37 triggers:

Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s

I wrote up this quick bash script to reproduce the error if anyone wants to try it out:

#!/bin/bash

# Cleanup
rm -f eventlistener_*.yaml

# Define variables
max=50
if [ ! -z "$1" ]; then
  max="$1"
fi
range=40
if [ ! -z "$2" ]; then
  range="$2"
fi
min=$((max - range))
if [ $min -le 0 ]; then
  min=1
fi
echo "Max triggers: $max"
echo "Min triggers: $min"
echo ""

# Generate EventListener files
cat << EOF >> "eventlistener_0.yaml"
apiVersion: tekton.dev/v1alpha1
kind: EventListener
metadata:
  name: listener
spec:
  serviceAccountName: tekton-triggers-example-sa
  triggers:
EOF
for ((i=1; i<="$max"; i++)); do
  prevFile="eventlistener_$((i - 1)).yaml"
  newFile="eventlistener_${i}.yaml"
  cp "$prevFile" "$newFile"
  cat << EOF >> "$newFile"
    - name: foo-trig
      binding:
        name: pipeline-binding
      template:
        name: pipeline-template
EOF
done

# Apply files
for ((i="$min"; i<="$max"; i++)); do
  # Apply files and record how long it takes
  echo "### $i triggers ###"
  file="eventlistener_${i}.yaml"
  startTime="$(date -u +%s)"
  kubectl apply -f "$file"
  endTime="$(date -u +%s)"
  elapsed="$(($endTime - $startTime))"
  echo "${elapsed}s"
  echo ""
done
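
For reference, the script takes an optional maximum trigger count and range as positional arguments (e.g. ./reproduce.sh 50 40, assuming it is saved as reproduce.sh, a placeholder name); with the defaults above it applies EventListeners with 10 through 50 triggers and prints how long each kubectl apply takes.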

Next, I'll see at what point I can reproduce the error with Triggers v0.2.0.

Update:
So far, this is not an issue in Triggers v0.2.0. I was able to use my testing script above to create an EventListener with 2600 triggers. Beyond that point I started running into an issue with my YAML files being too big:

The EventListener "listener" is invalid: metadata.annotations: Too long: must have at most 262144 characters

(That error comes from kubectl apply storing the object's full configuration in the kubectl.kubernetes.io/last-applied-configuration annotation, which runs into the 262144-character annotation limit.)

The validation in v0.2.0 is a lot simpler than in v0.1.0. I think that the timeout error in v0.1.0 is probably coming from us querying the Kubernetes clientset a number of times: https://github.com/tektoncd/triggers/blob/v0.1.0/pkg/apis/triggers/v1alpha1/event_listener_validation.go#L54.
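
To make the scaling concrete, here is a rough, self-contained sketch (in Go, but not the actual Triggers validation code; the lookup type, the trigger struct, and the 50 ms latency are made up for illustration) of why per-trigger clientset lookups make webhook latency grow linearly with the number of triggers, while purely structural checks, roughly what v0.2.0 does, stay cheap:

// Not the actual Triggers code: a sketch of clientset-backed validation
// (v0.1.0-style) versus purely structural validation (v0.2.0-style).
package main

import (
	"errors"
	"fmt"
	"time"
)

// lookup stands in for one Kubernetes API round-trip (hypothetical helper).
type lookup func(namespace, name string) error

type trigger struct {
	BindingName  string
	TemplateName string
}

// v0.1.0-style: every trigger costs API calls inside the admission webhook.
func validateWithClientset(ns string, triggers []trigger, getBinding, getTemplate lookup) error {
	for _, tr := range triggers {
		if err := getBinding(ns, tr.BindingName); err != nil {
			return err
		}
		if err := getTemplate(ns, tr.TemplateName); err != nil {
			return err
		}
	}
	return nil
}

// v0.2.0-style: field checks only, no API traffic.
func validateStructurally(triggers []trigger) error {
	for _, tr := range triggers {
		if tr.BindingName == "" || tr.TemplateName == "" {
			return errors.New("trigger must reference a binding and a template")
		}
	}
	return nil
}

func main() {
	triggers := make([]trigger, 24)
	for i := range triggers {
		triggers[i] = trigger{BindingName: "pipeline-binding", TemplateName: "pipeline-template"}
	}
	// Simulate ~50ms of API latency per lookup.
	slowGet := func(string, string) error { time.Sleep(50 * time.Millisecond); return nil }

	start := time.Now()
	_ = validateWithClientset("default", triggers, slowGet, slowGet)
	fmt.Println("clientset-style validation:", time.Since(start)) // roughly 2.4s for 24 triggers

	start = time.Now()
	_ = validateStructurally(triggers)
	fmt.Println("structural validation:", time.Since(start)) // microseconds
}

With 24 triggers and two simulated 50 ms lookups each, the clientset-style path alone spends roughly 2.4 s inside the webhook, the same order of magnitude as the timeouts reported above.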

My test script is not very realistic, because it just creates a number of duplicate triggers. However, it doesn't look like there's a way to exploit anything in the EventListener validation logic to make it time out.

ncskier (Member) commented Jan 17, 2020

Also, we should probably create an e2e test to prevent a regression from occurring. To do this, I think we'll need a "realistic" maximum number of triggers that we think an EventListener should be able to have. I'm not really sure what this semi-arbitrary number should be.
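
As a starting point, here is a rough sketch of what such a regression test could look like, shelling out to kubectl from Go; the 1000-trigger count, the 30-second budget, the listener name, and the assumption that pipeline-binding and pipeline-template already exist are placeholders rather than details of the real e2e suite:

// Sketch only: assumes kubectl is pointed at a test cluster and that
// pipeline-binding and pipeline-template already exist in the namespace.
package e2e

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
	"testing"
	"time"
)

func TestEventListenerWithManyTriggers(t *testing.T) {
	const numTriggers = 1000 // placeholder for the agreed-upon "realistic" maximum

	// Build an EventListener manifest with numTriggers identical triggers.
	var sb strings.Builder
	sb.WriteString(`apiVersion: tekton.dev/v1alpha1
kind: EventListener
metadata:
  name: scale-listener
spec:
  serviceAccountName: tekton-triggers-example-sa
  triggers:
`)
	for i := 0; i < numTriggers; i++ {
		sb.WriteString(fmt.Sprintf(`    - name: trig-%d
      binding:
        name: pipeline-binding
      template:
        name: pipeline-template
`, i))
	}

	f, err := os.CreateTemp("", "eventlistener-*.yaml")
	if err != nil {
		t.Fatal(err)
	}
	defer os.Remove(f.Name())
	if _, err := f.WriteString(sb.String()); err != nil {
		t.Fatal(err)
	}
	f.Close()

	// The admission webhook has to answer well inside the API server's timeout,
	// so fail if applying the EventListener takes too long.
	start := time.Now()
	out, err := exec.Command("kubectl", "apply", "-f", f.Name()).CombinedOutput()
	if err != nil {
		t.Fatalf("kubectl apply failed: %v\n%s", err, out)
	}
	if elapsed := time.Since(start); elapsed > 30*time.Second {
		t.Fatalf("creating an EventListener with %d triggers took %v", numTriggers, elapsed)
	}
}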

ncskier pushed a commit to ncskier/triggers that referenced this issue Jan 20, 2020
Issue tektoncd#356 found a bug in Triggers v0.1.0. EventListeners with too many
triggers (around 27 in this user's case) could not be created due to an
Admission Webhook timeout. This bug makes Triggers v0.1.0 unusable at a
large scale.

PR tektoncd#263 (which came after v0.1.0) fixes the timeout issue by removing
clientset use from the EventListener validation code. So, I have
manually re-applied the changes made in PR tektoncd#263 on top of v0.1.0. This
should fix the timeout bug seen in v0.1.0.
dibyom (Member) commented Jan 27, 2020

I think having some performance numbers on the number of Triggers we can support would be nice!

Given that we relaxed the validation to not check for the presence of bindings/templates, I don't think creating the EL is where we'll run into issues.

We do use the Kubernetes/Triggers clientset to fetch resources in the EL sink at runtime (e.g. fetch the EL, then iterate over all the triggers; fetch the secret for GH/GL interceptors, etc.). That might be problematic if we have a large number of incoming requests.
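
A self-contained sketch of that per-request pattern (in Go, but not the real sink code; the apiGet helper, the hard-coded trigger count, and the secret name are invented for illustration), mainly to show that the API latency is paid on the hot path of every incoming event:

// Sketch only: models an event sink that re-fetches its EventListener and
// per-interceptor secrets on every incoming request.
package main

import (
	"log"
	"net/http"
	"time"
)

// apiGet stands in for one clientset Get call (hypothetical helper).
type apiGet func(kind, namespace, name string) error

func newSinkHandler(namespace, elName string, get apiGet) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()

		// 1. Fetch the EventListener on every request.
		if err := get("EventListener", namespace, elName); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}

		// 2. Iterate over the triggers; for GitHub/GitLab interceptors, fetch the
		//    webhook secret as well, so each trigger can add another API call.
		const numTriggers = 24
		for i := 0; i < numTriggers; i++ {
			if err := get("Secret", namespace, "github-secret"); err != nil {
				http.Error(w, err.Error(), http.StatusInternalServerError)
				return
			}
		}

		log.Printf("handled event in %v", time.Since(start))
		w.WriteHeader(http.StatusAccepted)
	}
}

func main() {
	// Simulate ~50ms of API latency per lookup; under a burst of events this
	// per-request cost multiplies quickly.
	fakeGet := func(kind, namespace, name string) error {
		time.Sleep(50 * time.Millisecond)
		return nil
	}
	http.Handle("/", newSinkHandler("default", "listener", fakeGet))
	log.Fatal(http.ListenAndServe(":8080", nil))
}

With these simulated numbers, each event spends over a second in API calls alone, which is the kind of cost that becomes a problem under a large number of incoming requests.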

vdemeester (Member) commented:

/kind bug

tekton-robot added the kind/bug label Jan 27, 2020
ncskier (Member) commented Feb 3, 2020

I'm closing this issue since the bug has been resolved in the latest release of Triggers (v0.2.1).

I'm opening a couple of issues for follow-up work (see the regression-test commits referenced below).

ncskier closed this as completed Feb 3, 2020
tekton-robot pushed a commit that referenced this issue Feb 20, 2020
Fixes #405

Issue #356 found a bug in Triggers v0.1.0 (fixed in v0.2.0) where the
admission webhook times out after adding around 35 Triggers to an
EventListener.

This PR adds a regression test where we create an EventListener with
1000 Triggers.

Also, moves the cluster-scope resource cleanup code (from
7ba80a7)
out of init_test.go and into eventlistener_test.go.
pradeepitm12 pushed a commit to pradeepitm12/triggers that referenced this issue Mar 16, 2021