Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] tctl "leaking" plugin server processes when exiting with an error #235

Open
agy opened this issue Jul 22, 2022 · 2 comments
Open

[Bug] tctl "leaking" plugin server processes when exiting with an error #235

agy opened this issue Jul 22, 2022 · 2 comments
Labels
bug Something isn't working

Comments

@agy
Copy link

agy commented Jul 22, 2022

What are you really trying to do?

We use a tctl-authorization-plugin with the tctl CLI for our Temporal clusters. We had reports from our users that many instances of our plugin was left running in the background after they had completed running their commands.

After some investigation I realized that the plugin server and client were handled correctly when CLI commands were completing successfully. However whenever the CLI commands were returning an error (specifically exiting with non-zero) the plugin server was left running in the background (aka "leaking").

I am very familiar with Hashicorp's go-plugin framework.

Describe the bug

When tctl exits with an error it calls os.Exit(1). This results in the process being immediately terminated without calling any of the deferred functions. This means that the stopPlugins function is not called, leaving the plugin server running.

Example command:

$ tctl wf show
Error: Option workflow_id is required
exit status 1

More in depth description

Stepping through the CLI application we can see that:

  1. The application calls ShowHistory(c)
  2. Which calls getRequiredOption(c, FlagWorkflowID)
  3. Which in turn validates the workflow ID and calls ErrorAndExit(...) since the value hasn't been set
  4. This then calls osExit(1)
  5. This is a variable which is set to os.Exit

Quoting the Go documentation:

Exit causes the current program to exit with the given status code. Conventionally, code zero indicates success, non-zero an error. The program terminates immediately; deferred functions are not run.

For portability, the status code should be in the range [0, 125].

This means that the deferred stopPlugins function is never called and the plugin server is left running.

Note: I'm intentionally eliding the plugin setup, GRPC interceptor, etc. since it works as expected on the happy path.

Minimal Reproduction

I don't know of a minimal authorization plugin that I can run so this is a little more vague than I would like. Apologies in advance.

The Temporal environment variables (somewhat redacted):

$ env | grep '^TEMPORAL_CLI'
TEMPORAL_CLI_PLUGIN_HEADERS_PROVIDER=tctl-auth
TEMPORAL_CLI_NAMESPACE=default
TEMPORAL_CLI_TLS_KEY=
TEMPORAL_CLI_TLS_DISABLE_HOST_VERIFICATION=false
TEMPORAL_CLI_TLS_SERVER_NAME=
TEMPORAL_CLI_PLUGIN_DATA_CONVERTER=tctl-data-converter
TEMPORAL_CLI_ADDRESS=temporal.server.host.name:port
TEMPORAL_CLI_TLS_CERT=
TEMPORAL_CLI_TLS_CA=https://ca.server.host.name/path/to/ca/chain

The unhappy path:

  1. Ensure that the authorization plugin is installed and in your path (tctl-auth in our case)
  2. Run tctl wf show
  3. See that the command exits with a non-zero exit code and a useful error message
  4. See that the tctl-auth process is left running in the background

Example:

$ which tctl-auth
/usr/local/bin/tctl-auth

$ ps aux | grep '[t]ctl-auth' ; echo $?
1

$ tctl wf show
Error: Option workflow_id is required
exit status 1

$ ps aux | grep '[t]ctl-auth' ; echo $?
<username> 50280   0.0  0.1 34882124  20196 s000  S     7:29PM   0:00.03 tctl-auth
0

The happy path:

  1. Ensure that the authorization plugin is installed and in your path (tctl-auth in our case)
  2. Run tctl wf show -w <valid-workflow-id>
  3. See that the command exits with a zero exit code
  4. See that no tctl-auth processes are left running in the background

Example:

$ which tctl-auth
/usr/local/bin/tctl-auth

$ ps aux | grep '[t]ctl-auth' ; echo $?
1

$ tctl wf show -w <valid-workflow-id>
[... workflow output elided ...]

$ ps aux | grep '[t]ctl-auth' ; echo $?
1

Environment/Versions

  • OS and processor: Mac AMD64
  • tctl Version: 1.16.2
  • Temporal Version: n/a
  • Are you using Docker or Kubernetes or building Temporal from source? n/a

Additional context

Our authentication plugin is a Go application using cobra for CLI command parsing and boils down to:

func TctlAuthCommand() *cobra.Command {
	return &cobra.Command{
		Use:   "tctl-auth",
		Short: "[...]",
		RunE: func(_ *cobra.Command, _ []string) error {
			pluginMap := map[string]plugin.Plugin{
				cliPlugin.HeadersProviderPluginType: &cliPlugin.HeadersProviderPlugin{
					Impl: &provider{},
				},
			}
			plugin.Serve(&plugin.ServeConfig{
				HandshakeConfig: cliPlugin.PluginHandshakeConfig,
				Plugins:         pluginMap,
			})
			return nil
		},
	}
}

var _ cliPlugin.HeadersProvider = &provider{}

type provider struct {}

func (p provider) GetHeaders(_ context.Context) (map[string]string, error) {
	jwtToken, err := GetJWT()
	return map[string]string{"Authorization": jwtToken}, err
}
@agy agy added the bug Something isn't working label Jul 22, 2022
@jlegrone
Copy link

jlegrone commented Jul 29, 2022

It seems like this is probably the main culprit:

tctl/cli_curr/util.go

Lines 401 to 405 in 681f0e4

// ErrorAndExit print easy to understand error msg first then error detail in a new line
func ErrorAndExit(msg string, err error) {
printError(msg, err)
osExit(1)
}

Rather than calling os.Exit directly, tctl should be able to return an ExitCoder error from each command's ActionFunc.

@feedmeapples
Copy link
Contributor

feedmeapples commented Aug 16, 2022

this should be fixed in the new tctl v2 since we don't os.Exit anymore and always process error handling through urfave/cli/v2 error handler

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants