importer: use new parquet writer for EXPORT INTO PARQUET #104234
Conversation
Force-pushed from 50605c3 to 826809e.
I posted the memstats here: https://drive.google.com/drive/folders/1xIevGiHsC95jnsQQF3Y1N5kSzltKMo1w?usp=share_link
In summary, I ran the roachtest and found that EXPORT with the new library uses
- 42% less memory (1703707920 B vs. 2948454120 B). This is a difference of 1.24 GB when running concurrent exports on a 14 GB database.
- 72% fewer allocs (2598922890 vs. 9259896304)
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/sql/importer/exportparquet_test.go
line 140 at r2 (raw file):
require.Equal(t, expected.(*tree.DJSON).JSON.String(), actual.(*tree.DJSON).JSON.String())
case types.EnumFamily:
Before I added these changes, we would not write the additional metadata (ex. collation) to parquet files. The metadata was a part of the encoders and decoders here:
cockroach/pkg/sql/importer/exportparquet.go
Line 163 in 826809e
return tree.NewDCollatedString(string(x.([]byte)), typ.Locale(), &tree.CollationEnvironment{})
IMO if we aren't writing such things to files, then there is no point in testing them.
Force-pushed from 826809e to 593d98d.
Reviewed 3 of 3 files at r1, 5 of 7 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach, @herkolategan, @jayshrivastava, and @smg260)
go.mod
line 216 at r3 (raw file):
github.com/stretchr/testify v1.8.3
github.com/twpayne/go-geom v1.4.2
github.com/uber-go/atomic v1.4.0
Maybe we shouldn't add this dependency? What does uber-go/atomic have that atomic does not?
pkg/cmd/roachtest/tests/export_parquet.go
line 35 at r3 (raw file):
Name: "export/parquet",
Owner: registry.OwnerCDC,
Tags: registry.Tags("manual"),
Do we want to have a smaller non-manual test? Or does it exist?
pkg/sql/importer/exportparquet.go
line 37 at r3 (raw file):
"github.com/cockroachdb/cockroach/pkg/util/bitarray"
"github.com/cockroachdb/cockroach/pkg/util/duration"
crlparquet "github.com/cockroachdb/cockroach/pkg/util/parquet"
Add a TODO or just do it here -- we need a memory monitor for this stuff.
pkg/sql/importer/exportparquet.go
line 643 at r3 (raw file):
compression = crlparquet.CompressionGZIP
case roachpb.IOFileFormat_Auto, roachpb.IOFileFormat_None:
compression = crlparquet.CompressionNone
crlparquet supports other compression formats, right?
Perhaps add an issue/todo to make export support those too.
Also, todo to refactor string -> codec parsing.
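The suggested string -> codec parsing refactor could look something like a table lookup in place of a switch over string literals. All identifiers below (`CompressionCodec`, `parseCodec`, the codec names) are hypothetical stand-ins for this sketch, not crlparquet's actual API.

```go
package main

import (
	"fmt"
	"strings"
)

// CompressionCodec stands in for the kind of enum a parquet library exposes.
type CompressionCodec int

const (
	CompressionNone CompressionCodec = iota
	CompressionGZIP
	CompressionSnappy
)

// codecByName centralizes the string -> codec mapping so adding a codec is a
// one-line change rather than a new switch case.
var codecByName = map[string]CompressionCodec{
	"":       CompressionNone,
	"none":   CompressionNone,
	"gzip":   CompressionGZIP,
	"snappy": CompressionSnappy,
}

func parseCodec(s string) (CompressionCodec, error) {
	c, ok := codecByName[strings.ToLower(s)]
	if !ok {
		return CompressionNone, fmt.Errorf("unsupported compression codec %q", s)
	}
	return c, nil
}

func main() {
	c, err := parseCodec("GZIP")
	fmt.Println(c == CompressionGZIP, err) // true <nil>
}
```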
pkg/sql/importer/exportparquet_test.go
line 145 at r3 (raw file):
actual.(*tree.DEnum).LogicalRep)
case types.CollatedStringFamily:
// Only the value of the collated string matters, not the additional properties
And it does appear that parquet does not have a notion of collation... am I correct?
Force-pushed from 593d98d to 81bd612.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach, @herkolategan, @miretskiy, and @smg260)
go.mod
line 216 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
maybe we shouldn't add this dependency? What does uber-go/atomic has that atomic does not?
Done. Accidentally used the wrong "atomic" package.
pkg/cmd/roachtest/tests/export_parquet.go
line 35 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
do we want to have smaller non-manual test? Or does it exist?
I added one for TPCC-100.
pkg/sql/importer/exportparquet.go
line 37 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
add todo or just do it here -- we need memory monitor for this stuff.
Done. Added a TODO and an issue.
pkg/sql/importer/exportparquet.go
line 643 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
crlparquet supports other compression formats, right?
Perhaps add an issue/todo to make export support those too.
Also, todo to refactor string -> codec parsing.
Done. Added TODO.
pkg/sql/importer/exportparquet_test.go
line 145 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
and it does appear that parquet does not have notion of collation... am I correct?
You are correct.
Force-pushed from 81bd612 to 68ae867.
roachtest: add test for EXPORT INTO PARQUET

This change adds a roachtest which sets up a 3-node CRDB cluster on 8vCPU machines initialized with the TPC-C database containing 250 warehouses. Then, it executes 30 `EXPORT INTO PARQUET` statements concurrently, repeatedly for 10 minutes. The main purpose of this test is to use it for benchmarking. It sets up grafana as well so metrics can be observed during the test.

This change also adds a daily roachtest which runs exports on a 100-warehouse TPCC database, without setting up grafana.

Informs: cockroachdb#103317

Release note: None
Force-pushed from 68ae867 to 859fb68.
Previously, `EXPORT INTO PARQUET` used a library which had very poor memory usage. This change updates the command to use the new parquet writer in `util/parquet`, which does not use excessive memory.

To ensure backwards compatibility, as much testing code as possible is left untouched so that the old `EXPORT INTO PARQUET` behavior can be verified to be the same when using the new library. Some of the old production code and dependencies are also left untouched because they are used in the test code. This commit leaves some TODOs to clean up the tests and production code once we have confidence in the new library.

Informs: cockroachdb#104329
Informs: cockroachdb#104278
Informs: cockroachdb#104279
Closes: cockroachdb#103317

Release note (general change): `EXPORT INTO PARQUET` will now use a new internal implementation for writing parquet files using parquet spec version 2.6. There should be no significant impact to the structure of files being written. There is one minor change:
- all columns written to parquet files will be nullable (i.e. the parquet repetition type is `OPTIONAL`)
Force-pushed from 859fb68 to c1e6a64.
bors r=miretskiy
Build failed (retrying...)
Build succeeded
"daily" is not a tag used by anything. AFAICT this means that this test will not run as part of any suite, just like "manual" above. Was the intention for this to be part of the nightly suite?
@RaduBerinde Yeah, this should be nightly.
This change removes the github.com/fraugster/parquet-go dependency and all code which uses it. This library has been unused since cockroachdb#104234 landed. This change officially removes the code and dependency from the repo. This change also adds support for Oid-like types such as regrole, regtype etc. to util/parquet. These types are used in export parquet testing and were previously not supported by util/parquet. Closes: cockroachdb#104278 Release note: None Epic: None
120834: export: remove deprecated parquet library r=jayshrivastava a=jayshrivastava
Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com>