Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli: add cockroach debug job-trace #65324

Merged
merged 1 commit into from
Jun 1, 2021

Conversation

adityamaru
Copy link
Contributor

@adityamaru adityamaru commented May 17, 2021

This change adds a new cli command - cockroach debug job-trace
that can be used to grab information from inflight trace spans
associated with the execution of a particular job.
The command takes a jobID and file destination where the
information will be dumped. It also requires a --url of the
node from which to pull the inflight spans.

In the future we will have a means of pulling inflight spans from
the cluster, but at the moment this command must be run against every
node to get a complete picture of job execution.

Informs: #64992

Release note (cli change): Adds a cockroach debug job-trace command
that takes 2 arguments: and file destination, along with a
--url pointing to the node on which to execute this command.
The command pulls information about inflight trace spans associated
with the job and dumps it to the file destination.

@adityamaru adityamaru requested a review from dt May 17, 2021 10:39
@adityamaru adityamaru requested a review from a team as a code owner May 17, 2021 10:39
@adityamaru adityamaru requested a review from a team May 17, 2021 10:39
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@adityamaru adityamaru requested review from pbardea and removed request for a team and dt May 17, 2021 10:40
@adityamaru
Copy link
Contributor Author

The first two commits are from #65322.

@jordanlewis
Copy link
Member

Excellent.

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r1, 2 of 2 files at r2, 4 of 4 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @pbardea)


pkg/cli/debug_job_trace.go, line 34 at r3 (raw file):

	Args:  cobra.MinimumNArgs(2),
	RunE: MaybeDecorateGRPCError(
		func(command *cobra.Command, args []string) error {

nit: please move this function to a named function below.


pkg/cli/debug_job_trace.go, line 65 at r3 (raw file):

	}

	jobID, err := strconv.Atoi(jobIDStr)

This check is parameter validation, and should be done upfront in the command before starting to open connections etc.


pkg/cli/debug_job_trace.go, line 71 at r3 (raw file):

	var jobQuery = fmt.Sprintf(`SELECT trace_id FROM crdb_internal.jobs WHERE job_id='%d'`, jobID)
	rows, err := sqlConn.Query(jobQuery, nil)

please pass the query inside sqlConn.Query and use a placeholder to pass jobID.


pkg/cli/debug_job_trace.go, line 72 at r3 (raw file):

	var jobQuery = fmt.Sprintf(`SELECT trace_id FROM crdb_internal.jobs WHERE job_id='%d'`, jobID)
	rows, err := sqlConn.Query(jobQuery, nil)
	if err != nil {

I think you need to call rows.Close().
Please extract the retrieval of the traceID into a separate function, and ensure rows.Close() is called within that function.


pkg/cli/debug_job_trace.go, line 103 at r3 (raw file):

	var resRows [][]string
	spanRows, err := sqlConn.Query(query, nil)

ditto calling spanRows.Close().


pkg/cli/debug_job_trace.go, line 108 at r3 (raw file):

	}

	populateRow := func(vals []driver.Value) ([]string, error) {

I don't understand why you need this code -- what was wrong with calling runQueryAndFormatResults()?


pkg/sql/crdb_internal.go, line 623 at r2 (raw file):

	error              		STRING,
	coordinator_id     		INT,
  trace_id              INT

nit: spacing is off

Copy link
Contributor Author

@adityamaru adityamaru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @knz, and @pbardea)


pkg/cli/debug_job_trace.go, line 34 at r3 (raw file):

Previously, knz (kena) wrote…

nit: please move this function to a named function below.

Done.


pkg/cli/debug_job_trace.go, line 65 at r3 (raw file):

Previously, knz (kena) wrote…

This check is parameter validation, and should be done upfront in the command before starting to open connections etc.

Done.


pkg/cli/debug_job_trace.go, line 71 at r3 (raw file):

Previously, knz (kena) wrote…

please pass the query inside sqlConn.Query and use a placeholder to pass jobID.

Done.


pkg/cli/debug_job_trace.go, line 72 at r3 (raw file):

Previously, knz (kena) wrote…

I think you need to call rows.Close().
Please extract the retrieval of the traceID into a separate function, and ensure rows.Close() is called within that function.

Done.


pkg/cli/debug_job_trace.go, line 103 at r3 (raw file):

Previously, knz (kena) wrote…

ditto calling spanRows.Close().

Deleted this code since it is no longer required.


pkg/cli/debug_job_trace.go, line 108 at r3 (raw file):

Previously, knz (kena) wrote…

I don't understand why you need this code -- what was wrong with calling runQueryAndFormatResults()?

doh, that's exactly what I want. Thanks for always pointing me to the right place 🙂

@adityamaru
Copy link
Contributor Author

Rebased on #65322 that was merged, and also cherry-picked #65454. The last commit is the pertinent one!

@adityamaru adityamaru requested a review from knz May 19, 2021 10:12
Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good overall, but I'm missing the test code. Can you point me to it?

Reviewed 5 of 11 files at r4, 7 of 7 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @pbardea)


pkg/cli/debug_job_trace.go, line 72 at r3 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

Done.

nice 💯


pkg/cli/debug_job_trace.go, line 35 at r5 (raw file):

func runDebugJobTrace(_ *cobra.Command, args []string) error {
	jobID, err := strconv.Atoi(args[0])

Maybe strconv.ParseInt here and pass it as int64 below. This avoids a cast.


pkg/sql/crdb_internal.go, line 623 at r2 (raw file):

Previously, knz (kena) wrote…

nit: spacing is off

still off

Copy link
Contributor

@pbardea pbardea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to tests

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru)


pkg/cli/debug_job_trace.go, line 104 at r5 (raw file):

	}

	var inflightSpanQuery = `WITH spans AS(

nit: formatting of the query looks a bit of

@adityamaru adityamaru force-pushed the debug-job-tracing branch 2 times, most recently from 628654a to faef8c3 Compare May 25, 2021 14:12
@adityamaru
Copy link
Contributor Author

I'm actually struggling a little while writing a test for this. I've added a WIP commit for the general structure but I'm running into two main issues:

  1. When running the command through the CLI test, the outputted file is of a different format, as compared to testing it locally.

Test is outputting:

Screen Shot 2021-05-25 at 2 52 51 PM

Local testing is outputting:

Screen Shot 2021-05-25 at 2 56 57 PM

The latter is what I would expect, any ideas why there is a difference in behavior @knz?

  1. I'm not too sure what to really test the output against. I was thinking of grepping the outputted file for the payload and matching that column against what we expect to see. We can't compare it against a static file in testdata because of the nature of the output. Do you all have any other ideas?

@adityamaru adityamaru requested a review from pbardea May 25, 2021 15:12
@knz
Copy link
Contributor

knz commented May 25, 2021

you need to use --format to choose a specific output format and make your output the same in every case

(The default value of --format is different depending on how you invoke the command)

@knz
Copy link
Contributor

knz commented May 25, 2021

Do you all have any other ideas?

We have other tests (e.g. for node status) that expect specific outputs based on regular rexpressions.

Copy link
Contributor Author

@adityamaru adityamaru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @pbardea)


pkg/cli/debug_job_trace.go, line 35 at r5 (raw file):

Previously, knz (kena) wrote…

Maybe strconv.ParseInt here and pass it as int64 below. This avoids a cast.

Done.


pkg/sql/crdb_internal.go, line 623 at r2 (raw file):

Previously, knz (kena) wrote…

still off

fixed the trace_id prefix indent. the col type indentation appears fine in my editor, but not in reviewable.

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code structure/style LGTM, but I'll let your team chime in on correctness.

Reviewed 2 of 3 files at r6, 5 of 5 files at r7.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @knz)


pkg/sql/crdb_internal.go, line 621 at r6 (raw file):

	fraction_completed 		FLOAT,
	high_water_timestamp	DECIMAL,
	error              		STRING,

While you're at it, perhaps you could put that STRING and the INT just below it on the same column as the rest.

@adityamaru
Copy link
Contributor Author

While you're at it, perhaps you could put that STRING and the INT just below it on the same column as the rest.

These only appear shifted in reviewable for some reason. Github and in my IDE the spacing is identical for all the columns.

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @knz, and @pbardea)


pkg/sql/crdb_internal.go, line 621 at r6 (raw file):

Previously, knz (kena) wrote…

While you're at it, perhaps you could put that STRING and the INT just below it on the same column as the rest.

I'm realizing what the issue is: this vtable definition uses tabs, not spaces, in its schema.
all the other vtables uses spaces - this one is the exception.

So the improvement would be to convert this to use spaces like everything else: two spaces at the beginning, then spaces to align the columns.

Copy link
Contributor Author

@adityamaru adityamaru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @pbardea)


pkg/sql/crdb_internal.go, line 621 at r6 (raw file):

Previously, knz (kena) wrote…

I'm realizing what the issue is: this vtable definition uses tabs, not spaces, in its schema.
all the other vtables uses spaces - this one is the exception.

So the improvement would be to convert this to use spaces like everything else: two spaces at the beginning, then spaces to align the columns.

nice! all done 🙂

@knz
Copy link
Contributor

knz commented May 26, 2021

my previous review stands

Code structure/style LGTM, but I'll let your team chime in on correctness.

Copy link
Contributor

@pbardea pbardea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@adityamaru adityamaru force-pushed the debug-job-tracing branch 3 times, most recently from 8d78069 to 77cb68e Compare May 31, 2021 13:34
@adityamaru
Copy link
Contributor Author

Had to fix up a few things to get the test passing:

  • debug job-trace no longer creates a "job-trace" directory, but just creates whatever path is passed in as the CLI argument.
  • test cleans up directory after itself.
  • we would occasionally see a "contention event" in the trace for a job. We are only really interested in the header and the trace we injected, so the test only checks for those now.

This change adds a new cli command - `cockroach debug job-trace`
that can be used to grab information from inflight trace spans
associated with a particular jobs execution.
The command takes a `jobID` and file destination where the
information will be dumped. It also requires a `--url` of the
node from which to pull the inflight spans.

In the future we will have a means of pulling inflight spans from
the cluster, but at the moment this command must be run against every
node to get a complete picture of job execution.

Release note (cli change): Adds a `cockroach debug job-trace` command
that takes 2 arguments: <jobID> and file destination, alongwith a
`--url` pointing to the node on which to execute this command against.
The command pulls information about inflight trace spans associated
with the job and dumps it to the file destination.
@adityamaru
Copy link
Contributor Author

Seeing an Execution timeout in CI. Not sure what is causing this yet.

@adityamaru
Copy link
Contributor Author

Going ahead and merging this. The bazel failure is unrelated and seems to have been occurring on most builds since ~3-4 days ago - pkg/sql/sql_test_/sql_test

TFTR!
bors r=pbardea,knz

@craig
Copy link
Contributor

craig bot commented Jun 1, 2021

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants