-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
importccl: enabling saving of bad csv rows to a side file #41430
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 8 of 11 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt and @spaskob)
pkg/ccl/importccl/import_stmt.go, line 292 at r1 (raw file):
// is passed in all import formats not just DELIMITED.. if _, ok := opts[importOptionSaveRejected]; ok { format.MysqlOut.SaveRejected = true
this doesn't feel like a per-format option to me? I guess the level at which we can reject depends on the format, and support will arrive in formats one-by-one, but the option feels more general to me and should be either at the job level or at least at the top level in format options.
pkg/ccl/importccl/import_stmt_test.go, line 240 at r1 (raw file):
// MySQL OUTFILE // If err field is non-empty, the query filed specifies what expect // to get from the rows that are parsed correctly (see option experimental_save_rejected).
I think we should also have a field that is there content of the rejected row file?
pkg/ccl/importccl/read_import_base.go, line 133 at r1 (raw file):
rejected = make(chan string) } grp := ctxgroup.WithContext(ctx)
tiny nit, but I might move the if rejected != nil
out here and only spin up the group / use the the group to call fileFunc in the non-nil branch, keeping the direct call to fileFunc
in normal, non-rejecting execution, just to avoid the extra goroutine(s) unless we need them.
pkg/ccl/importccl/read_import_base.go, line 140 at r1 (raw file):
suffix := ".rejected" if es.Conf().Provider == roachpb.ExportStorageProvider_Http { suffix = "/rejected"
Why is this case different?
pkg/ccl/importccl/read_import_base.go, line 142 at r1 (raw file):
suffix = "/rejected" } conf, err := storageccl.ExportStorageConfFromURI(dataFile + suffix)
I don't think you can blindly append to the string URI since almost all of them use query parameters to pass options, like keys or settings.
I think you need to do net/url.Parse()
, append to the url.Path
field, and then serialize the result to a string and pass that to makeStorage, so that trailing parameters are correctly preserved.
pkg/ccl/importccl/read_import_base.go, line 146 at r1 (raw file):
return err } rejectedStorage, err := storageccl.MakeExportStorage(ctx, conf, settings)
I think we should wait to open this file until the first rejected row shows up in the loop below, so that we don't open (and create) it at all if all the rows process without rejection.
pkg/roachpb/io-formats.proto, line 77 at r1 (raw file):
optional int32 escape = 6 [(gogoproto.nullable) = false]; // If true, don't abort on failures but instead save the offending row and keep on. optional bool save_rejected = 7 [(gogoproto.nullable) = false];
ditto comment above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt)
pkg/ccl/importccl/import_stmt.go, line 292 at r1 (raw file):
Previously, dt (David Taylor) wrote…
this doesn't feel like a per-format option to me? I guess the level at which we can reject depends on the format, and support will arrive in formats one-by-one, but the option feels more general to me and should be either at the job level or at least at the top level in format options.
agreed, notice the TODO just above that addresses this. Since this will be fairly big refactor are you ok with keeping it in DELIMITED at first and once we nail DELIMITED I will refactor it out and start implementing for other modes.
pkg/ccl/importccl/import_stmt_test.go, line 240 at r1 (raw file):
Previously, dt (David Taylor) wrote…
I think we should also have a field that is there content of the rejected row file?
yes, this is tbd until i figure out the threading through the http test server
pkg/ccl/importccl/read_import_base.go, line 133 at r1 (raw file):
Previously, dt (David Taylor) wrote…
tiny nit, but I might move the
if rejected != nil
out here and only spin up the group / use the the group to call fileFunc in the non-nil branch, keeping the direct call tofileFunc
in normal, non-rejecting execution, just to avoid the extra goroutine(s) unless we need them.
done
pkg/ccl/importccl/read_import_base.go, line 140 at r1 (raw file):
Previously, dt (David Taylor) wrote…
Why is this case different?
http uri is 192.0.0.1:<port#> and you append .rejected to it becomes invalid http request
this is still WIP, hacking it like this for now to have unit tests working
pkg/ccl/importccl/read_import_base.go, line 142 at r1 (raw file):
Previously, dt (David Taylor) wrote…
I don't think you can blindly append to the string URI since almost all of them use query parameters to pass options, like keys or settings.
I think you need to do
net/url.Parse()
, append to theurl.Path
field, and then serialize the result to a string and pass that to makeStorage, so that trailing parameters are correctly preserved.
thanks, TBD
pkg/ccl/importccl/read_import_base.go, line 146 at r1 (raw file):
Previously, dt (David Taylor) wrote…
I think we should wait to open this file until the first rejected row shows up in the loop below, so that we don't open (and create) it at all if all the rows process without rejection.
good point, done
pkg/roachpb/io-formats.proto, line 77 at r1 (raw file):
Previously, dt (David Taylor) wrote…
ditto comment above.
ack
828da27
to
75f30cd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt)
pkg/ccl/importccl/import_stmt_test.go, line 240 at r1 (raw file):
Previously, spaskob (Spas Bojanov) wrote…
yes, this is tbd until i figure out the threading through the http test server
done
pkg/ccl/importccl/read_import_base.go, line 146 at r1 (raw file):
Previously, spaskob (Spas Bojanov) wrote…
good point, done
Done.
PTAL |
75f30cd
to
c077306
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 7 of 11 files at r2, 1 of 3 files at r3, 3 of 3 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @spaskob)
pkg/ccl/importccl/read_import_base.go, line 115 at r4 (raw file):
atFirstLine = false } if atFirstLine {
nit: I think we could just use len(buf) > 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt)
pkg/ccl/importccl/read_import_base.go, line 115 at r4 (raw file):
Previously, dt (David Taylor) wrote…
nit: I think we could just use
len(buf) > 0
you are right, I initially thought that buf can be empty but that there's still an error but in that case it still does not make sense to write the reject file anyway.
When importing from large csv files it's common to have a few offending rows. Currently `IMPORT` will abort at the first error which makes for a tedious task of manually fixing ech problem and re-running. Instead users can specify new option `WITH experimental_save_rejected` which will not stop on bad rows but save them in a side file called `<original_csv_file>.rejected` and continue. The user then can re-run the import command using `IMPORT INTO` using the rejected file after fixing the problems in it. Release note (sql change): enable skipping of faulty rows in IMPORT.
c077306
to
97788de
Compare
bors r+ |
41430: importccl: enabling saving of bad csv rows to a side file r=spaskob a=spaskob importccl: enabling saving of bad csv rows to a side file When importing from large csv files it's common to have a few offending rows. Currently `IMPORT` will abort at the first error which makes for a tedious task of manually fixing ech problem and re-running. Instead users can specify new option `WITH experimental_save_rejected` which will not stop on bad rows but save them in a side file called `<original_csv_file>.rejected` and continue. The user then can re-run the import command using `IMPORT INTO` using the rejected file after fixing the problems in it. Touches #40374. Release note (sql change): enable skipping of faulty rows in IMPORT. Co-authored-by: Spas Bojanov <spas@cockroachlabs.com>
Build succeeded |
There was severe error introduced by cockroachdb#41430 that causes the original IMPORT file to be overwritten by the rejected file. Release note (bug fix): fix bug when using ‘experimental_save_rejected’ that would cause the rejected row file to overwrite the original input file.
42398: importccl: fix rejected file overwriting the original file r=spaskob a=spaskob There was severe error introduced by #41430 that causes the original IMPORT file to be overwritten by the rejected file. Release note (bug fix): Release note (bug fix): fix bug when using ‘experimental_save_rejected’ that would cause the rejected row file to overwrite the original input file. Co-authored-by: Spas Bojanov <spas@cockroachlabs.com>
importccl: enabling saving of bad csv rows to a side file
When importing from large csv files it's common to have a few offending rows.
Currently
IMPORT
will abort at the first error which makes for a tedioustask of manually fixing ech problem and re-running. Instead users can specify
new option
WITH experimental_save_rejected
which will not stop on bad rowsbut save them in a side file called
<original_csv_file>.rejected
andcontinue. The user then can re-run the import command using
IMPORT INTO
usingthe rejected file after fixing the problems in it.
Touches #40374.
Release note (sql change): enable skipping of faulty rows in IMPORT.