Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

importccl: Parallelize avro import #45269

Merged
merged 4 commits into from
Mar 3, 2020
Merged

Conversation

miretskiy
Copy link
Contributor

@miretskiy miretskiy commented Feb 21, 2020

Parallelize avro importer to improve its throughput (2.8x improvement).

Touches #40374.
Fixes #45097.

Release notes (performance): Faster avro import

@miretskiy miretskiy requested review from dt and spaskob February 21, 2020 00:15
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@miretskiy miretskiy force-pushed the fast_avro branch 4 times, most recently from 6f93b6c to 4805318 Compare February 25, 2020 18:25
Copy link
Contributor

@spaskob spaskob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @miretskiy, and @spaskob)


pkg/ccl/importccl/read_import_avro.go, line 232 at r1 (raw file):

		return false
	}
	c.rejected <- fmt.Sprintf("%s\n", row)

I am confused, isn't avro a binary format? What is the meaning of row in this case?


pkg/ccl/importccl/read_import_base.go, line 356 at r1 (raw file):

type namedInput struct {
	reader *fileReader
	name   string

could you add comments what name and idx represent?


pkg/ccl/importccl/read_import_base.go, line 360 at r1 (raw file):

}

// A scanner over avro input.

why is the comment specific to avro?

@miretskiy miretskiy force-pushed the fast_avro branch 5 times, most recently from ab759ac to 5c30103 Compare February 26, 2020 17:42
Copy link
Contributor Author

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @spaskob)


pkg/ccl/importccl/read_import_avro.go, line 232 at r1 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

I am confused, isn't avro a binary format? What is the meaning of row in this case?

You are not confused; avro may or may not be binary.
However, at this stage, we have read avro input file and converted that input into a native go type
(for example: map, or an array, or a byte array, or an int, etc). So, i'm just printing that representation of the row that we have failed to handle.

I wonder if I should print %v instead though....


pkg/ccl/importccl/read_import_base.go, line 356 at r1 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

could you add comments what name and idx represent?

Done.


pkg/ccl/importccl/read_import_base.go, line 360 at r1 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

why is the comment specific to avro?

Because of copy & paste.

Updated.

Copy link
Contributor Author

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @spaskob)


pkg/ccl/importccl/read_import_base.go, line 360 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Because of copy & paste.

Updated.

Done.

Copy link
Contributor

@spaskob spaskob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @miretskiy, and @spaskob)


pkg/ccl/importccl/read_import_avro.go, line 232 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

You are not confused; avro may or may not be binary.
However, at this stage, we have read avro input file and converted that input into a native go type
(for example: map, or an array, or a byte array, or an int, etc). So, i'm just printing that representation of the row that we have failed to handle.

I wonder if I should print %v instead though....

We may need to discuss how rejected file is going to be used. The initial idea is that it is in the same format as the input so that the user can repeat the import but only reading from the rejected file. If this is not feasible for avro we should reconsider this approach.


pkg/ccl/importccl/read_import_csv.go, line 82 at r2 (raw file):

}

func (c *csvInputReader) readFile(

I wonder if this function can be refactored away in the future as well.


pkg/ccl/importccl/read_import_csv.go, line 192 at r2 (raw file):

				col := conv.VisibleCols[i]
				err = wrapRowErr(err, s.input.name, rowNum, pgcode.Syntax,
					"parse %q as %s", col.Name, col.Type.SQLString())

I wonder if we should also include the field in the error message.

Copy link
Contributor Author

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @spaskob)


pkg/ccl/importccl/read_import_avro.go, line 232 at r1 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

We may need to discuss how rejected file is going to be used. The initial idea is that it is in the same format as the input so that the user can repeat the import but only reading from the rejected file. If this is not feasible for avro we should reconsider this approach.

It's not feasible for avro.


pkg/ccl/importccl/read_import_csv.go, line 82 at r2 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

I wonder if this function can be refactored away in the future as well.

perhaps... Not sure if it's possible to refactor e.g. mysqldump or pgdump formats,
but perhaps move this to parallelImporter in base and just call that


pkg/ccl/importccl/read_import_csv.go, line 192 at r2 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

I wonder if we should also include the field in the error message.

Perhaps -- I tried to keep things the same where possible. if you want, I can change this err message.

@miretskiy
Copy link
Contributor Author

@spaskob @dt
finally: all ci builds are green. I found some issues (through existing tests), which I have fixed in avro records handling.

Copy link
Contributor Author

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @spaskob)


pkg/ccl/importccl/read_import_avro.go, line 232 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

It's not feasible for avro.

Done.

@miretskiy
Copy link
Contributor Author

@dt @spaskob please review latest revision. Thanks.

@miretskiy miretskiy force-pushed the fast_avro branch 3 times, most recently from 746014a to fc53315 Compare February 28, 2020 23:20
Copy link
Member

@dt dt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 10 files at r1, 4 of 6 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @miretskiy, and @spaskob)


pkg/ccl/importccl/read_import_base.go, line 431 at r3 (raw file):

	Scan() bool

	// Err returns an error (if any) encountered when processing avro stream.

avro mentioned in comment


pkg/ccl/importccl/read_import_base.go, line 523 at r3 (raw file):

			// Scan more data, unless we're done.
			select {
			case <-ctx.Done():

I'm surprised this didn't show up in profiling, but .Done() in hot loops is usually performance problem -- we usually pull it into local above the loop done := ctx.Done() and even then sometimes even avoid checking the ch on every iteration as it is still somewhat expensive (i believe the prior CSV code only checked it on flushes).


pkg/ccl/importccl/read_import_base.go, line 568 at r3 (raw file):

	p.b.data = append(p.b.data, data)

	if len(p.b.data) == cap(p.b.data) {

this seems brittle, or at least subtle -- i had to read this a couple times before I'd convinced myself that this would actually behave correctly, i.e. that append or something else would never re-alloc the batch. IMO it'd be clearer just to add batchSize as a field on p.

@miretskiy miretskiy force-pushed the fast_avro branch 2 times, most recently from d407f37 to 6f9e247 Compare March 2, 2020 15:30
Copy link
Contributor Author

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @spaskob)


pkg/ccl/importccl/read_import_base.go, line 431 at r3 (raw file):

Previously, dt (David Taylor) wrote…

avro mentioned in comment

Done.


pkg/ccl/importccl/read_import_base.go, line 523 at r3 (raw file):

Previously, dt (David Taylor) wrote…

I'm surprised this didn't show up in profiling, but .Done() in hot loops is usually performance problem -- we usually pull it into local above the loop done := ctx.Done() and even then sometimes even avoid checking the ch on every iteration as it is still somewhat expensive (i believe the prior CSV code only checked it on flushes).

Done.


pkg/ccl/importccl/read_import_base.go, line 568 at r3 (raw file):

Previously, dt (David Taylor) wrote…

this seems brittle, or at least subtle -- i had to read this a couple times before I'd convinced myself that this would actually behave correctly, i.e. that append or something else would never re-alloc the batch. IMO it'd be clearer just to add batchSize as a field on p.

Done.

Copy link
Contributor

@spaskob spaskob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @miretskiy, and @spaskob)


pkg/ccl/importccl/read_import_csv.go, line 192 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Perhaps -- I tried to keep things the same where possible. if you want, I can change this err message.

You have 4 things to show but only 2 placeholders

Rerun and update stats for csv import benchmark.

Release notes: None
Copy link
Contributor Author

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @spaskob)


pkg/ccl/importccl/read_import_base.go, line 523 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Done.

Good point. Removed, and, as per your suggestion moved to flush.


pkg/ccl/importccl/read_import_csv.go, line 192 at r2 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

You have 4 things to show but only 2 placeholders

You misread the code a bit; I've changed line wrapping to make it clear that i'm calling a function with 3 arguments (one of which is an error constructed with Wrapf(...))

@miretskiy
Copy link
Contributor Author

I reran and updated before/after benchmark for csv import. Refactoring and introduction of the interface{} (instead of the string) didn't have a significant impact:
benchstat ~/before.txt ~/after.txt
name old time/op new time/op delta
ConvertRecord-16 2.27µs ±11% 2.29µs ± 8% ~ (p=0.494 n=10+10)

name old speed new speed delta
ConvertRecord-16 53.0MB/s ±12% 52.5MB/s ± 9% ~ (p=0.494 n=10+10)

name old alloc/op new alloc/op delta
ConvertRecord-16 3.63kB ± 0% 3.61kB ± 0% -0.68% (p=0.000 n=9+9)

name old allocs/op new allocs/op delta
ConvertRecord-16 100 ± 0% 101 ± 0% +1.00% (p=0.000 n=10+10)

Yevgeniy Miretskiy added 3 commits March 2, 2020 14:09
Refactor csv importer and pull its parallel import related
logic into separate helpers.

This refactoring is done in preparation of making avro imports
(and possibly others) use the same functionality.

Release notes: None
Add a benchmark testing avro import performance.

Release notes: None
Parallelize avro importer to improve its throughput (2.8x improvement).

Fixes cockroachdb#45097

Release notes (performance): Faster avro import
@miretskiy
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Mar 2, 2020

Canceled (will resume)

@miretskiy
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Mar 2, 2020

Already running a review

@craig
Copy link
Contributor

craig bot commented Mar 2, 2020

Build failed (retrying...)

@miretskiy
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Mar 2, 2020

Already running a review

@craig
Copy link
Contributor

craig bot commented Mar 3, 2020

Build succeeded

@craig craig bot merged commit cf8cd92 into cockroachdb:master Mar 3, 2020
@miretskiy miretskiy deleted the fast_avro branch March 3, 2020 01:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

importccl: faster Avro imports
4 participants