-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AppCreate diagnostic #16658
AppCreate diagnostic #16658
Conversation
75397b5
to
969d4e8
Compare
ae25bd4
to
01994df
Compare
01994df
to
353f7ec
Compare
14ade1a
to
c3c7591
Compare
@openshift/sig-master I would like to have online ops start trying this out and getting feedback on actual usage with 3.9; for that to happen, I will need some reviews this week. |
pkg/oc/admin/diagnostics/config.go
Outdated
switch index { | ||
case 0: | ||
errmsg = fmt.Sprintf("--%s specified that client config should be at %s\n", confFlagName, path) | ||
case len(paths) - 1: // config in ~/.kube |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is really fragile...
pkg/oc/admin/diagnostics/config.go
Outdated
errmsg := "" | ||
switch index { | ||
case 0: | ||
errmsg = fmt.Sprintf("--%s specified that client config should be at %s\n", confFlagName, path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if no explicit config file is passed in, does this check even make sense? won't confFlagValue be ""?
pkg/oc/admin/diagnostics/config.go
Outdated
case len(paths) - 1: // config in ~/.kube | ||
// no error message indicated if it is not there... user didn't say it would be | ||
default: // can be multiple paths from the env var in theory; all cases should go here | ||
if len(os.Getenv(config.OpenShiftConfigPathEnvVar)) != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trying to craft specific messages for specific indices in the loading order seems weird
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. This is very old code... I couldn't think of anything cleaner at the time; perhaps I can do better now.
signal.Notify(sig, os.Interrupt, syscall.SIGTERM) | ||
go func() { | ||
<-sig | ||
d.out.Warn("DCluAC001", nil, "Received interrupt; aborting diagnostic") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this actually abort the other gofunc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it doesn't. There is no way to abort a goroutine - it's collaborative concurrency, the goroutine has to want to stop. The main one keeps running after an interrupt, we're just not paying attention to it any longer. The only thing I could think of is to set up another channel and check it at various points to see if an interrupt occurred, but that seemed even messier. Do you have a different suggestion?
This is not new at all BTW, just code moved around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missed it was a move
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the second commit is stuff moving around
|
||
<-done // wait until either finishes | ||
signal.Stop(sig) | ||
d.logResult() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can run in parallel with the assignment on line 282 or 268-269 if interrupt is received, and crash with a data race error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you hit interrupt, how concerned are you about race conditions crashing diagnostics? I'm not sure it's even possible here - yes, data could conceivably be written into the object while the result is being logged, but would you get anything worse than bad output? - however it doesn't seem like an important edge case. But I'd be happy to use a better pattern for handling interrupts if one is known. I think the rest of the product... probably just exits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure if an individual diagnostic could get interrupted and the overall process was expected to keep going
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that's the idea... it moves on to run the next diagnostic if any. but at that point nothing is going to use the result from the previous diagnostic. Part of the reason you want it to keep going is to give it a chance to clean up the resources it created, so there's an actual benefit...
6d82a45
to
a2d934d
Compare
pkg/oc/admin/project/new_project.go
Outdated
@@ -105,6 +108,9 @@ func (o *NewProjectOptions) complete(f *clientcmd.Factory, args []string) error | |||
} | |||
|
|||
func (o *NewProjectOptions) Run(useNodeSelector bool) error { | |||
if o.Output == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
either move this to complete(), or compute a local var defaulting to os.Stdout... don't generally want to mutate options in the Run() method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 that makes sense. Hopefully I can rely on complete()
being called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course, it turns out there are a bunch of tests that simply construct the options directly and don't run complete()
on them (indeed, they can't since it's private). So I can either change every test, or use a local var with default like you said.
672f41c
to
d115033
Compare
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
/retest |
looks like about 50 things went wrong... |
/retest Please review the full test history for this PR and help us cut down flakes. |
2 similar comments
/retest Please review the full test history for this PR and help us cut down flakes. |
/retest Please review the full test history for this PR and help us cut down flakes. |
More strange errors... |
un lgtm-ing to calm down the retest bot. Those test integration failures are real and caused by this pull. |
/test origin-verify |
d115033
to
6a22b82
Compare
looks like we're back to normal flakes. |
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/16658/test_pull_request_origin_end_to_end/9231/ was #18522 (already resolved) |
updated and rebased last week, @soltysh can i get a re-lgtm now that the merge window is reopened? the bit that I needed to change was in new-project... using a local variable to default the output writer because tests didn't set it or complete it |
6a22b82
to
6c78e37
Compare
ready for re-review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: soltysh, sosiouxme The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
/test all [submit-queue is verifying that this PR is safe to merge] |
@sosiouxme: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Automatic merge from submit-queue (batch tested with PRs 16658, 18643). |
Implements https://trello.com/c/Zv4hVlyQ/130-diagnostic-to-recreate-app-create-loop-script as a diagnostic.
https://trello.com/c/Zv4hVlyQ/27-3-continue-appcreate-diagnostic-work
https://trello.com/c/aNWlMtMk/61-demo-merge-appcreate-diagnostic
https://trello.com/c/H0jsgQwu/63-3-complete-appcreate-diagnostic-functionality
Status:
Not yet addressed in this PR (depending on how reviews progress vs development):
Builds on top of #17773 for handling parameters to the diagnostic as well as #17857 which is a refactor on top of that.