Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

importer: use new parquet writer for EXPORT INTO PARQUET #104234

Merged
merged 2 commits into from
Jun 5, 2023

Conversation

jayshrivastava
Copy link
Contributor

@jayshrivastava jayshrivastava commented Jun 1, 2023

roachtest: add test for EXPORT INTO PARQUET

This change adds a rochtest which sets up a 3-node CRDB cluster
on 8vCPU machines initialized with the TPC-C database containing
250 warehouses. Then, it executes 30 EXPORT INTO PARQUET statements
concurrently, repeatedly for 10 minutes. The main purpose of this
test is to use it for benchmarking. It sets up grafana as well
so metrics can be observed during the test.

This change also adds a daily roachtest which runs exports
on a 100 warehouse TPCC database, without setting up grafana.

Informs: #103317
Release note: None


importer: use new parquet writer for EXPORT INTO PARQUET

Previously, EXPORT INTO PARQUET used a library which had
very poor memory usage. This change updates this command to
use the new parquet writer in util/parquet, which is newer
and does not use excessive memory.

To ensure backwards compatability, as much testing code as
possible is left untouched to ensure the old EXPORT INTO PARQUET
behavior is the same when using the new library. Some of the the
old production code and dependencies are also left untouched
because they are used in the test code.

This commit leaves some TODOs to clean up the tests and production
code once we have confidence in the new library.

Informs: #104329
Informs: #104278
Informs: #104279
Closes: #103317

Release note (general change): EXPORT INTO PARQUET will now
use a new internal implementation for writing parquet files
using parquet spec version 2.6. There should be no significant
impact to the structure of files being written. There is one
minor change:

  • all columns written to parquet files will be nullable
    (ie. the parquet repetition type is OPTIONAL)

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@jayshrivastava jayshrivastava changed the title Export new parquet importer: use new parquet writer for EXPORT INTO PARQUET Jun 2, 2023
Copy link
Contributor Author

@jayshrivastava jayshrivastava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I posted the memstats here: https://drive.google.com/drive/folders/1xIevGiHsC95jnsQQF3Y1N5kSzltKMo1w?usp=share_link

In summary, I ran the roachtest and found that EXPORT with the new library uses

  • 42% less memory (1703707920B / 2948454120B). This is a difference 1.24 GB when running concurrent exports on a 14GB database.
  • 72% fewer allocs (2598922890 / 9259896304)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/importer/exportparquet_test.go line 140 at r2 (raw file):

		require.Equal(t, expected.(*tree.DJSON).JSON.String(),
			actual.(*tree.DJSON).JSON.String())
	case types.EnumFamily:

Before I added these changes, we would not write the additional metadata (ex. collation) to parquet files. The metadata was a part of the encoders and decoders here:

return tree.NewDCollatedString(string(x.([]byte)), typ.Locale(), &tree.CollationEnvironment{})

IMO if we aren't writing such things to files, then there is no point in testing them.

@jayshrivastava jayshrivastava requested review from a team and miretskiy and removed request for a team June 2, 2023 20:52
@jayshrivastava jayshrivastava marked this pull request as ready for review June 2, 2023 20:52
@jayshrivastava jayshrivastava requested review from a team as code owners June 2, 2023 20:52
@jayshrivastava jayshrivastava requested review from herkolategan, smg260 and cucaroach and removed request for a team June 2, 2023 20:52
Copy link
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r1, 5 of 7 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach, @herkolategan, @jayshrivastava, and @smg260)


go.mod line 216 at r3 (raw file):

	github.com/stretchr/testify v1.8.3
	github.com/twpayne/go-geom v1.4.2
	github.com/uber-go/atomic v1.4.0

maybe we shouldn't add this dependency? What does uber-go/atomic has that atomic does not?


pkg/cmd/roachtest/tests/export_parquet.go line 35 at r3 (raw file):

		Name:            "export/parquet",
		Owner:           registry.OwnerCDC,
		Tags:            registry.Tags("manual"),

do we want to have smaller non-manual test? Or does it exist?


pkg/sql/importer/exportparquet.go line 37 at r3 (raw file):

	"github.com/cockroachdb/cockroach/pkg/util/bitarray"
	"github.com/cockroachdb/cockroach/pkg/util/duration"
	crlparquet "github.com/cockroachdb/cockroach/pkg/util/parquet"

add todo or just do it here -- we need memory monitor for this stuff.


pkg/sql/importer/exportparquet.go line 643 at r3 (raw file):

			compression = crlparquet.CompressionGZIP
		case roachpb.IOFileFormat_Auto, roachpb.IOFileFormat_None:
			compression = crlparquet.CompressionNone

crlparquet supports other compression formats, right?
Perhaps add an issue/todo to make export support those too.
Also, todo to refactor string -> codec parsing.


pkg/sql/importer/exportparquet_test.go line 145 at r3 (raw file):

			actual.(*tree.DEnum).LogicalRep)
	case types.CollatedStringFamily:
		// Only the value of the collated string matters, not that additional properties

and it does appear that parquet does not have notion of collation... am I correct?

Copy link
Contributor Author

@jayshrivastava jayshrivastava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach, @herkolategan, @miretskiy, and @smg260)


go.mod line 216 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

maybe we shouldn't add this dependency? What does uber-go/atomic has that atomic does not?

Done. Accidentally used the wrong "atomic" pkg.


pkg/cmd/roachtest/tests/export_parquet.go line 35 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

do we want to have smaller non-manual test? Or does it exist?

I added one for TPCC-100.


pkg/sql/importer/exportparquet.go line 37 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

add todo or just do it here -- we need memory monitor for this stuff.

Done. Added an TODO and an issue.


pkg/sql/importer/exportparquet.go line 643 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

crlparquet supports other compression formats, right?
Perhaps add an issue/todo to make export support those too.
Also, todo to refactor string -> codec parsing.

Done. Added TODO.


pkg/sql/importer/exportparquet_test.go line 145 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

and it does appear that parquet does not have notion of collation... am I correct?

You are correct.

This change adds a rochtest which sets up a 3-node CRDB cluster
on 8vCPU machines initialized with the TPC-C database containing
250 warehouses. Then, it executes 30 `EXPORT INTO PARQUET` statements
concurrently, repeatedly for 10 minutes. The main purpose of this
test is to use it for benchmarking. It sets up grafana as well
so metrics can be observed during the test.

This change also adds a daily roachtest which runs exports
on a 100 warehouse TPCC database, without setting up grafana.

Informs: cockroachdb#103317
Release note: None
Previously, `EXPORT INTO PARQUET` used a library which had
very poor memory usage. This change updates this command to
use the new parquet writer in `util/parquet`, which is newer
and does not use excessive memory.

To ensure backwards compatability, as much testing code as
possible is left untouched to ensure the old `EXPORT INTO PARQUET`
behavior is the same when using the new library. Some of the the
old production code and dependencies are also left untouched
because they are used in the test code.

This commit leaves some TODOs to clean up the tests and production
code once we have confidence in the new library.

Informs: cockroachdb#104329
Informs: cockroachdb#104278
Informs: cockroachdb#104279
Closes: cockroachdb#103317

Release note (general change): `EXPORT INTO PARQUET` will now
use a new internal implementation for writing parquet files
using parquet spec version 2.6. There should be no significant
impact to the structure of files being written. There is one
minor change:
- all columns written to parquet files will be nullable
  (ie. the parquet repetition type is `OPTIONAL`)
@jayshrivastava
Copy link
Contributor Author

bors r=miretskiy

@craig
Copy link
Contributor

craig bot commented Jun 5, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jun 5, 2023

Build succeeded:

@RaduBerinde
Copy link
Member

pkg/cmd/roachtest/tests/export_parquet.go line 125 at r5 (raw file):

		Name:            "export/parquet/tpcc-100",
		Owner:           registry.OwnerCDC,
		Tags:            registry.Tags("daily"),

"daily" is not a tag used by anything. AFAICT this means that this test will not run as part of any suite, just like "manual" above. Was the intention for this to be part of the nightly suite?

@jayshrivastava
Copy link
Contributor Author

@RaduBerinde Yeah this should be nightly.

jayshrivastava added a commit to jayshrivastava/cockroach that referenced this pull request Mar 21, 2024
This change removes the github.com/fraugster/parquet-go dependency and all code which uses it. This
libary has been unused since cockroachdb#104234 landed. This change officially removes the code and dependency
from the repo.

Closes: cockroachdb#104278
Release note: None
Epic: None
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this pull request Mar 28, 2024
This change removes the github.com/fraugster/parquet-go dependency and all code which uses it. This
libary has been unused since cockroachdb#104234 landed. This change officially removes the code and dependency
from the repo.

This change also adds support for Oid-like types such as regrole, regtype etc to util/parquet. These types are
used in export parquet testing and were previously not supported by util/parquet.

Closes: cockroachdb#104278
Release note: None
Epic: None
craig bot pushed a commit that referenced this pull request Apr 1, 2024
120834: export: remove deprecated parquet library r=jayshrivastava a=jayshrivastava

This change removes the github.com/fraugster/parquet-go dependency and all code which uses it. This library has been unused since #104234 landed. This change officially removes the code and dependency from the repo.

This change also adds support for Oid-like types such as regrole, regtype etc to util/parquet. These types are
used in export parquet testing and were previously not supported by util/parquet.

Closes: #104278
Release note: None
Epic: None

Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

export: parquet causes node to OOM
4 participants