release-19.2: IMPORT CSV experimental_save_rejected support #42391
Conversation
When importing from large CSV files it's common to have a few offending rows. Currently `IMPORT` will abort at the first error, which makes for a tedious cycle of manually fixing each problem and re-running. Instead, users can specify the new option `WITH experimental_save_rejected`, which will not stop on bad rows but save them in a side file called `<original_csv_file>.rejected` and continue. The user can then fix the problems in that file and re-run the import on it with `IMPORT INTO`. Release note (sql change): enable skipping of faulty rows in IMPORT.
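A sketch of the intended workflow; the table name, schema, and file URL are illustrative, with only the option name and `.rejected` naming taken from the description above:

```sql
-- Initial import; bad rows are written to users.csv.rejected instead of
-- aborting the whole statement.
IMPORT TABLE users (id INT PRIMARY KEY, name STRING)
    CSV DATA ('nodelocal:///data/users.csv')
    WITH experimental_save_rejected;

-- After hand-fixing the rows in users.csv.rejected, ingest just those rows.
IMPORT INTO users (id, name)
    CSV DATA ('nodelocal:///data/users.csv.rejected');
```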
This will allow for easier testing of this logic without needing to go through the `SQL` layer. Release note: none.
This makes it possible to run an import standalone and makes it clear that the distSQL processor is used only for propagating errors and status messages to the controller. Release note: none.
This is part of an ongoing refactor to simplify the IMPORT code base. In particular, we remove the calls to inputFinished, which is supposed to be called after all input files are ingested in order to close the channel kvCh on which the KVs are sent to the routine that drains this channel and sends them to KV. Instead, the creation and closing of the channel are moved closer to where it is used. inputFinished was really only needed for a special case in CSV, where we have a fan-out to a set of workers that forward the KVs to kvCh, so the closing logic had to run after these workers were done. Now the reading of the files and the workers are grouped so that we can wait for all routines in the group to finish and then close the channel, as sketched below. This will simplify how we save rejected rows. Touches: cockroachdb#40374. Release note: none.
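A minimal sketch of the grouping pattern this describes, using `golang.org/x/sync/errgroup`; the names (`runImport`, `readFile`, `kvBatch`) are illustrative stand-ins, not the actual cockroachdb identifiers:

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// kvBatch is a stand-in for the real batch type sent on kvCh.
type kvBatch []string

func readFile(ctx context.Context, name string, kvCh chan<- kvBatch) error {
	// Real code would parse the file and fan out to conversion workers;
	// here we just send one placeholder batch.
	select {
	case kvCh <- kvBatch{name + ":row1"}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func runImport(ctx context.Context, files []string) error {
	kvCh := make(chan kvBatch)

	// File readers and their workers share one group, so kvCh can be
	// closed exactly once, after every producer has returned. This
	// replaces the inputFinished callback.
	g, gCtx := errgroup.WithContext(ctx)
	for _, f := range files {
		f := f
		g.Go(func() error { return readFile(gCtx, f, kvCh) })
	}
	go func() {
		_ = g.Wait() // any error is surfaced by the g.Wait() below
		close(kvCh)
	}()

	// Consumer: drain the channel and flush batches to KV.
	for batch := range kvCh {
		fmt.Println("flush", batch)
	}
	return g.Wait()
}

func main() {
	if err := runImport(context.Background(), []string{"a.csv", "b.csv"}); err != nil {
		fmt.Println("import failed:", err)
	}
}
```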
This was promised to a client and will be backported to 19.2.1. The feature should stay undocumented for now since the semantics and UX are still not well understood. To make the change work, we had to remove the parsing workers from `conv.start` and move them to `readFile`, which means that a separate set of workers is brought up and torn down for each file (see the sketch below). The tally of the total number of rejected rows was also moved to `read_import_base.go`. Release note (sql change): add undocumented experimental_save_rejected option to CSV IMPORT.
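A rough sketch of that shape, with the same caveat that every identifier here is a hypothetical stand-in: parse workers live inside `readFile` and exist only for the duration of one file, while the rejected-row tally is shared across files:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"sync"
	"sync/atomic"
)

// totalRejected plays the role of the tally that moved to read_import_base.go.
var totalRejected int64

func readFile(path string, numWorkers int) error {
	in, err := os.Open(path)
	if err != nil {
		return err
	}
	defer in.Close()

	// Bad rows go to <original_csv_file>.rejected, as described above.
	rej, err := os.Create(path + ".rejected")
	if err != nil {
		return err
	}
	defer rej.Close()

	rows := make(chan string)
	var rejMu sync.Mutex

	// Parse workers are brought up per file and torn down once this
	// file has been fully read.
	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for row := range rows {
				if strings.Count(row, ",") != 1 { // stand-in for a real parse error
					rejMu.Lock()
					fmt.Fprintln(rej, row) // save the bad row and keep going
					rejMu.Unlock()
					atomic.AddInt64(&totalRejected, 1)
					continue
				}
				// ... convert the row to KVs and forward them ...
			}
		}()
	}

	sc := bufio.NewScanner(in)
	for sc.Scan() {
		rows <- sc.Text()
	}
	close(rows)
	wg.Wait()
	return sc.Err()
}

func main() {
	_ = os.WriteFile("demo.csv", []byte("a,b\nbad row\nc,d\n"), 0o644)
	if err := readFile("demo.csv", 2); err != nil {
		fmt.Println(err)
	}
	fmt.Println("rejected rows:", atomic.LoadInt64(&totalRejected))
}
```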
We shouldn't be introducing new features (especially experimental ones) in patch releases.
In the meantime, @spaskob and I are going to look through this again to see if there are any ways to minimize the backport diff further / more tightly confine it to just the applicable CSV code.
I think we ended up not wanting to add this in a patch release after all, and with 20.1 coming out in the coming weeks, we can close this.
Backport:
Please see individual PRs for details.
/cc @cockroachdb/release