
Forcing eventlistener sink to resolve eventlistener #977

Merged
2 commits merged into tektoncd:master on Mar 5, 2021

Conversation

Contributor

@jmcshane jmcshane commented Mar 2, 2021

Changes

This change forces the eventlistenersink process to resolve
the EventListener before starting the HTTP server.

This means that the sink HTTP server (and its readiness probe)
will not start listening until the EventListenerLister is able to resolve
the EventListener from the API server.

This is especially useful in startup cases, but can also assist
if the pod is started without permission to read the EventListener
object. In that situation, with this change, the eventlistener
pod will restart with a logged error message about the lack of access to
that specific API resource.

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you
review them:

  • Includes tests (if functionality changed/added)
  • Includes docs (if user facing)
  • Commit messages follow commit message best practices
  • Release notes block has been filled in or deleted (only if no user facing changes)

See the contribution guide for more details.

Release Notes

Force eventlistener pod to resolve EventListener resource before startup

@tekton-robot tekton-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Mar 2, 2021
@tekton-robot tekton-robot requested review from khrm and vtereso March 2, 2021 03:42
@tekton-robot tekton-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Mar 2, 2021
@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-triggers-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/sink/initialization.go 53.3% 32.0% -21.3

@jmcshane jmcshane force-pushed the eventlistener-resolve branch from 40536ad to a64e56e Compare March 2, 2021 04:00
@tekton-robot tekton-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 2, 2021
@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-triggers-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/sink/initialization.go 53.3% 58.3% 5.0

@jmcshane jmcshane force-pushed the eventlistener-resolve branch from a64e56e to b2c0d1a Compare March 2, 2021 12:06
@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-triggers-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/sink/initialization.go 53.3% 58.3% 5.0

Duration: 50 * time.Millisecond,
Factor: 2.0,
Jitter: 0.3,
Steps: 10,
Member

If I'm understanding this correctly, it will try a max of 10 times, right? (Not try for a max of 5 seconds?)

Contributor Author

So, it's a little of both. Steps is the max number of attempts; if it ever hits the cap, it sets the wait time to that amount and then bails out.

In this case, the first wait time is 50 milliseconds with a scale factor of 2 and jitter of 0.3. The jitter means the effective scale factor will be somewhere between 1.7 and 2.3. This means that after 10 steps, it waits somewhere around 10 seconds max before it fails.
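(For reference, here is a minimal sketch of how a k8s.io/apimachinery wait.Backoff with these values is typically driven to gate startup. resolveEventListener and the HTTP setup are illustrative stand-ins, not the actual Triggers code.)

package main

import (
	"errors"
	"log"
	"net/http"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// resolveEventListener is a hypothetical stand-in for the lister lookup of the
// EventListener resource.
func resolveEventListener() error { return errors.New("not resolvable yet") }

func main() {
	// The values quoted in the diff above: first wait of 50ms, roughly doubling
	// each step (Jitter randomizes each wait a bit), with at most 10 attempts.
	backoff := wait.Backoff{
		Duration: 50 * time.Millisecond,
		Factor:   2.0,
		Jitter:   0.3,
		Steps:    10,
	}

	// ExponentialBackoff keeps calling the condition until it returns true,
	// returns an error, or the step budget runs out (wait.ErrWaitTimeout);
	// Steps bounds the number of attempts rather than setting a wall-clock deadline.
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := resolveEventListener(); err != nil {
			return false, nil // not resolvable yet; keep retrying
		}
		return true, nil
	})
	if err != nil {
		log.Fatalf("EventListener never resolved: %v", err)
	}

	// Only once resolution succeeds does the sink's HTTP server (and with it
	// the readiness probe endpoint) start listening.
	log.Fatal(http.ListenAndServe(":8080", nil))
}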

Member

makes sense...do you think 10 seconds is quick enough? I was thinking 30s might be a better default?

Contributor Author

Let me increase the factor and the number of steps. Again, this is basically a question of:

how long will it take for the k8s API server to commit the eventlistener

I think closer to 15-20 seconds is appropriate, especially considering that this is a function that blocks the startup of the HTTP endpoint. Let me modify the factor so it ends up around this value.

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-triggers-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/sink/initialization.go 53.3% 54.5% 1.2

@dibyom dibyom left a comment

Just two minor comments. Otherwise good to go!

pkg/sink/initialization.go (outdated review thread, resolved)
r.WaitForEventListenerOrDie()
}
cmd := exec.Command(os.Args[0], "-test.run=TestWaitForEventlistener_Fatal") //nolint:gosec
cmd.Env = append(os.Environ(), "EL_FATAL_CRASH=1")
Member

this is clever...though we might be able to get away with passing in a test logger and capturing its output and then comparing (like we do here:

logger := zaptest.NewLogger(t, zaptest.WrapOptions(zap.WrapCore(func(zapcore.Core) zapcore.Core { return core }))).Sugar()
sink.Logger = logger
ts := httptest.NewServer(http.HandlerFunc(sink.HandleEvent))
defer ts.Close()
resp, err := http.Post(ts.URL, "application/json", bytes.NewReader(tc.eventBody))
if err != nil {
	t.Fatalf("error making request to eventListener: %s", err)
}
if resp.StatusCode != tc.wantStatusCode {
	t.Fatalf("Status code mismatch: got %d, want %d", resp.StatusCode, http.StatusInternalServerError)
}
if tc.wantErrLogMsg != "" {
	matches := logs.FilterMessage(tc.wantErrLogMsg)
	if matches == nil || matches.Len() == 0 {
		t.Fatalf("did not find log entry: %s.\n Logs are: %v", tc.wantErrLogMsg, logs.All())
	}
}

What do you think?

Another option would be to have the function return an error and then have main.go call the log.Fatal.
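(A rough sketch of that second option, not taken from the PR; waitForEventListener and the logger wiring are illustrative names. The helper reports failure as an error and main owns the fatal exit, which keeps the function testable without a subprocess trick.)

package main

import (
	"fmt"

	"go.uber.org/zap"
)

// waitForEventListener is an illustrative helper: it reports failure as an
// error instead of calling Fatal itself. The retry/lister logic is elided.
func waitForEventListener() error {
	// ... retry the EventListener lookup here ...
	return fmt.Errorf("timed out resolving EventListener")
}

func main() {
	logger := zap.NewExample().Sugar()

	// main decides how to terminate; tests can call waitForEventListener
	// directly and assert on the returned error instead of intercepting Fatal.
	if err := waitForEventListener(); err != nil {
		logger.Fatalf("failed to resolve EventListener: %v", err)
	}
}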

Contributor Author

Yes, let me take a look at this strategy and try to implement it here.

Contributor Author

So, it doesn't look like this is possible, because the behavior of Fatal can't be overridden even though there is a no-op action in the library; basically, it's not a supported action on Fatal.

I updated this function to return an error and panic instead.

@dibyom dibyom added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 4, 2021
@jmcshane jmcshane mentioned this pull request Mar 5, 2021
4 tasks
@jmcshane jmcshane force-pushed the eventlistener-resolve branch from 13bd2d7 to 7d926bb Compare March 5, 2021 03:19
@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-triggers-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/sink/initialization.go 53.3% 65.2% 11.9

@jmcshane jmcshane force-pushed the eventlistener-resolve branch from 7d926bb to 4401def Compare March 5, 2021 03:21
@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-triggers-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/sink/initialization.go 53.3% 65.2% 11.9

This means that the sink HTTP process (and readiness probe)
will not pass until the EventListenerLister is able to resolve
the EventListener from the API server.

This is especially useful in startup cases, but can also assist
if the pod is started without permission to read the EventListener
object. In this situation, given this change, the eventlistener
pod will restart with a logged error message about lack of access to
that specific API resource.
@jmcshane jmcshane force-pushed the eventlistener-resolve branch from 4401def to 78c2587 Compare March 5, 2021 03:29
@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-triggers-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/sink/initialization.go 53.3% 65.2% 11.9

@dibyom
Member

dibyom commented Mar 5, 2021

/approve

Just one minor thing: change the panic to logger.Fatal in main.

@tekton-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dibyom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 5, 2021
Co-authored-by: Dibyo Mukherjee <dibyo@google.com>
@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-triggers-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/sink/initialization.go 53.3% 65.2% 11.9

@dibyom
Member

dibyom commented Mar 5, 2021

/test pull-tekton-triggers-integration-tests

@dibyom
Member

dibyom commented Mar 5, 2021

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 5, 2021
@tekton-robot tekton-robot merged commit 42be9fa into tektoncd:master Mar 5, 2021
dibyom added a commit to dibyom/triggers that referenced this pull request Mar 19, 2021
Previously, we had a race where the EL would start serving traffic
before the lister caches were synced. This leads to the intermittent
resolution issues described in tektoncd#896. tektoncd#977 was an attempt to fix this for
the EventListener resource. This commit fixes it for all resource types
by first registering all the listers, and then syncing the cache before
serving traffic.

Tested by modifying the e2e test to run the intermittently failing test
10 times. Without this fix it fails; with it, it does not (see tektoncd#1012).

Fixes tektoncd#896

Signed-off-by: Dibyo Mukherjee <dibyo@google.com>
tekton-robot pushed a commit that referenced this pull request Mar 25, 2021
Previously, we had a race where the EL would start serving traffic
before the lister caches were synced. This leads to the intermittent
resolution issues described in #896. #977 was an attempt to fix this for
the EventListener resource. This commit fixes it for all resource types
by first registering all the listers, and then syncing the cache before
serving traffic. We will wait for the cache to sync for 1 minute before timing
out.

Tested by modifying the e2e test to run the intermittently failing test 10
times. Without this fix it fails; with it, it does not (see #1012).

Fixes #896

Signed-off-by: Dibyo Mukherjee <dibyo@google.com>
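
(As a rough illustration of the register-then-sync pattern this commit message describes, here is a sketch using plain client-go informers; the ConfigMap informer stands in for the actual Triggers listers, and none of this is the real Triggers code.)

package main

import (
	"context"
	"log"
	"net/http"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// 1. Register every informer/lister the sink needs before starting the factory.
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	cmInformer := factory.Core().V1().ConfigMaps().Informer()

	// 2. Start the informers, then block until their caches have synced,
	//    giving up after one minute.
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	factory.Start(ctx.Done())
	if !cache.WaitForCacheSync(ctx.Done(), cmInformer.HasSynced) {
		log.Fatal("timed out waiting for informer caches to sync")
	}

	// 3. Only now start serving traffic, so listers never miss objects that
	//    already existed in the cluster before startup.
	log.Fatal(http.ListenAndServe(":8080", nil))
}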