Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve tunnel availability #375

Merged
merged 1 commit into from
Jul 21, 2021
Merged

Conversation

aholic
Copy link
Contributor

@aholic aholic commented Jul 2, 2021

What type of PR is this?

/kind bug

What this PR does / why we need it:

  1. before this pr, all csr(pending/approved/other status) will try enqueue every 10 seconds, but the default queue is only 10qps, 100 bucket. tunnel-server is not able to approve csr when large number of csr comes.
  2. the connection is not available before certificate signed, so tunnel-agent must wait for the certificate signed

Which issue(s) this PR fixes:

None

Special notes for your reviewer:

/assign @rambohe-ch

Does this PR introduce a user-facing change?

None

other Note

@openyurt-bot
Copy link
Collaborator

@aholic: GitHub didn't allow me to assign the following users: your_reviewer.

Note that only openyurtio members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

What type of PR is this?

/kind bug

What this PR does / why we need it:

  1. before this pr, all csr(pending/approved/other status) will try enqueue every 10 seconds, but the default queue is only 10qps, 100 bucket. tunnel-server is not able to approve csr when large number of csr comes.
  2. the connection is not available before certificate signed, so tunnel-agent must wait for the certificate signed

Which issue(s) this PR fixes:

None

Special notes for your reviewer:

/assign @rambohe-ch

Does this PR introduce a user-facing change?

None

other Note

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openyurt-bot openyurt-bot added the kind/bug kind/bug label Jul 2, 2021
@openyurt-bot openyurt-bot added the size/M size/M: 30-99 label Jul 2, 2021
@aholic
Copy link
Contributor Author

aholic commented Jul 2, 2021

/assign @Fei-Guo

@rambohe-ch
Copy link
Member

/lgtm

@openyurt-bot openyurt-bot added the lgtm lgtm label Jul 2, 2021
@@ -150,7 +150,7 @@ func (o *ServerOptions) Config() (*config.Config, error) {
if err != nil {
return nil, err
}
cfg.SharedInformerFactory = informers.NewSharedInformerFactory(cfg.Client, 10*time.Second)
cfg.SharedInformerFactory = informers.NewSharedInformerFactory(cfg.Client, 24*time.Hour)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you change the resync time?

Copy link
Contributor Author

@aholic aholic Jul 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my situation, i saw thousands tunnel certificate-sign-requests was in pending/approved status. All of them was enqueued every 10 seconds, which full filled the work-queue. The pending certificate-sign-requests could not be approved(after 24hours pending, it was deleted automatically). So i did two things:

  1. make the re-sync period longer. I think there's no need for the re-sync period to be so short, it's just a method to fix some unexpected situation.
  2. filter before enqueue, only do enqueue for pending tunnel certificate-sign-requests

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. Alternatively, we may set it to 0 to disable resync .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, 0 is also ok. let's just make it a relative long period, in case of some unexpected situation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fei-Guo Set a relative long period(like 24 * time.Hour) for resync is more reasonable than disabling resync, in case of some unexpected situation.

@@ -91,6 +94,17 @@ func Run(cfg *config.CompletedConfig, stopCh <-chan struct{}) error {
}
agentCertMgr.Start()

// 2.1. waiting for the certificate is generated
_ = wait.PollUntil(15*time.Second, func() (bool, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 15 second poll interval too long?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it to reduce qps pressure of api-server. I don't want to see an avalanche when thousands of node comes(or other situations). 5 second or 15 second will not hurt user-experience, but an avalanche will.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everybody waits 15 seconds does not eliminated the bursty quests but only delay them. You will need to add random delay to smooth the QPS or just give a reasonable delay for getting a signed certificate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's set it to 5 second first
i will update it to a random delay sometime later, when I'm free..

@openyurt-bot openyurt-bot removed the lgtm lgtm label Jul 7, 2021
@rambohe-ch
Copy link
Member

/lgtm

@openyurt-bot openyurt-bot added the lgtm lgtm label Jul 13, 2021
@rambohe-ch
Copy link
Member

@Fei-Guo If you have not any other comments, i'd like to approve this pr.

@Fei-Guo
Copy link
Member

Fei-Guo commented Jul 21, 2021

/approve
/lgtm

@openyurt-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aholic, Fei-Guo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openyurt-bot openyurt-bot added the approved approved label Jul 21, 2021
@openyurt-bot openyurt-bot merged commit 0bc28b7 into openyurtio:master Jul 21, 2021
MrGirl pushed a commit to MrGirl/openyurt that referenced this pull request Mar 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved approved kind/bug kind/bug lgtm lgtm size/M size/M: 30-99
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants