release-19.2: IMPORT CSV experimental_save_rejected support #42391
Conversation
When importing from large CSV files it's common to have a few offending rows. Currently `IMPORT` will abort at the first error, which makes for a tedious cycle of manually fixing each problem and re-running. Instead, users can specify the new option `WITH experimental_save_rejected`, which will not stop on bad rows but save them in a side file called `<original_csv_file>.rejected` and continue. The user can then fix the problems in that file and re-run the import on it with `IMPORT INTO`. Release note (sql change): enable skipping of faulty rows in IMPORT.
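A sketch of the intended workflow; the table name, schema, and file URL are illustrative, with only the option name and `.rejected` naming taken from the description above:

```sql
-- Initial import; bad rows are written to users.csv.rejected instead of
-- aborting the whole statement.
IMPORT TABLE users (id INT PRIMARY KEY, name STRING)
    CSV DATA ('nodelocal:///data/users.csv')
    WITH experimental_save_rejected;

-- After hand-fixing the rows in users.csv.rejected, ingest just those rows.
IMPORT INTO users (id, name)
    CSV DATA ('nodelocal:///data/users.csv.rejected');
```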
This will allow for easier testing of this logic without needing to go through the `SQL` layer. Release note: none.
This makes it possible to run an import standalone and makes it clear that the distSQL processor is used only for propagating errors and status messages to the controller. Release note: none.
This is part of an ongoing refactor to simplify the IMPORT code base. In particular, we remove the calls to inputFinished, which is supposed to be called after all input files are ingested in order to close the channel kvCh on which the KVs are sent to the routine that drains this channel and sends them to KV. Instead, the creation and closing of the channel are moved closer to where it is used. inputFinished was really only needed for a special case in CSV, where we have a fan-out to a set of workers that forward the KVs to kvCh, so the closing logic had to run after these workers were done. Now the reading of the files and the workers are grouped so that we can wait for all routines in the group to finish and then close the channel, as sketched below. This will simplify how we save rejected rows. Touches: cockroachdb#40374. Release note: none.
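A minimal sketch of the grouping pattern this describes, using `golang.org/x/sync/errgroup`; the names (`runImport`, `readFile`, `kvBatch`) are illustrative stand-ins, not the actual cockroachdb identifiers:

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// kvBatch is a stand-in for the real batch type sent on kvCh.
type kvBatch []string

func readFile(ctx context.Context, name string, kvCh chan<- kvBatch) error {
	// Real code would parse the file and fan out to conversion workers;
	// here we just send one placeholder batch.
	select {
	case kvCh <- kvBatch{name + ":row1"}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func runImport(ctx context.Context, files []string) error {
	kvCh := make(chan kvBatch)

	// File readers and their workers share one group, so kvCh can be
	// closed exactly once, after every producer has returned. This
	// replaces the inputFinished callback.
	g, gCtx := errgroup.WithContext(ctx)
	for _, f := range files {
		f := f
		g.Go(func() error { return readFile(gCtx, f, kvCh) })
	}
	go func() {
		_ = g.Wait() // any error is surfaced by the g.Wait() below
		close(kvCh)
	}()

	// Consumer: drain the channel and flush batches to KV.
	for batch := range kvCh {
		fmt.Println("flush", batch)
	}
	return g.Wait()
}

func main() {
	if err := runImport(context.Background(), []string{"a.csv", "b.csv"}); err != nil {
		fmt.Println("import failed:", err)
	}
}
```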
This was promised to a client and will be backported to 19.2.1. The feature should stay undocumented for now since the semantics and UX are still not well understood. To make the change work, we had to remove the parsing workers from `conv.start` and move them to `readFile`, which means that a separate set of workers is brought up and torn down for each file (see the sketch below). The tally of the total number of rejected rows was also moved to `read_import_base.go`. Release note (sql change): add undocumented experimental_save_rejected option to CSV IMPORT.
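A rough sketch of that shape, with the same caveat that every identifier here is a hypothetical stand-in: parse workers live inside `readFile` and exist only for the duration of one file, while the rejected-row tally is shared across files:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"sync"
	"sync/atomic"
)

// totalRejected plays the role of the tally that moved to read_import_base.go.
var totalRejected int64

func readFile(path string, numWorkers int) error {
	in, err := os.Open(path)
	if err != nil {
		return err
	}
	defer in.Close()

	// Bad rows go to <original_csv_file>.rejected, as described above.
	rej, err := os.Create(path + ".rejected")
	if err != nil {
		return err
	}
	defer rej.Close()

	rows := make(chan string)
	var rejMu sync.Mutex

	// Parse workers are brought up per file and torn down once this
	// file has been fully read.
	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for row := range rows {
				if strings.Count(row, ",") != 1 { // stand-in for a real parse error
					rejMu.Lock()
					fmt.Fprintln(rej, row) // save the bad row and keep going
					rejMu.Unlock()
					atomic.AddInt64(&totalRejected, 1)
					continue
				}
				// ... convert the row to KVs and forward them ...
			}
		}()
	}

	sc := bufio.NewScanner(in)
	for sc.Scan() {
		rows <- sc.Text()
	}
	close(rows)
	wg.Wait()
	return sc.Err()
}

func main() {
	_ = os.WriteFile("demo.csv", []byte("a,b\nbad row\nc,d\n"), 0o644)
	if err := readFile("demo.csv", 2); err != nil {
		fmt.Println(err)
	}
	fmt.Println("rejected rows:", atomic.LoadInt64(&totalRejected))
}
```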
We shouldn't be introducing new features (especially experimental ones) in patch releases.
In the meantime, @spaskob and I are going to look through this again to see if there are any ways to minimize the backport diff further / more tightly confine it to just the applicable CSV code.
I think we ended up not wanting to add this in a patch release after all, and with 20.1 coming out in the coming weeks, we can close this.
Backport:
Please see individual PRs for details.
/cc @cockroachdb/release