
release-19.2: IMPORT CSV experimental_save_rejected support #42391

Conversation

@spaskob spaskob (Contributor) commented Nov 12, 2019

Backport:

Please see individual PRs for details.

/cc @cockroachdb/release

Spas Bojanov and others added 5 commits November 11, 2019 10:30
When importing from large CSV files it's common to have a few offending rows.
Currently `IMPORT` aborts at the first error, which makes for a tedious cycle of
manually fixing each problem and re-running. Instead, users can specify the
new option `WITH experimental_save_rejected`, which will not stop on bad rows
but save them in a side file called `<original_csv_file>.rejected` and
continue. After fixing the problems in the rejected file, the user can re-run
the import on that file using `IMPORT INTO`.

Release note (sql change): enable skipping of faulty rows in IMPORT.
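A hypothetical usage sketch of the workflow described above (the table schema, column names, and `nodelocal` URLs are illustrative placeholders, not taken from this PR; the option is experimental and undocumented):

```sql
-- Initial import; bad rows are saved to users.csv.rejected instead of aborting.
IMPORT TABLE users (id INT PRIMARY KEY, name STRING)
    CSV DATA ('nodelocal:///users.csv')
    WITH experimental_save_rejected;

-- After fixing the rows in users.csv.rejected, re-run with IMPORT INTO.
IMPORT INTO users (id, name)
    CSV DATA ('nodelocal:///users.csv.rejected');
```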
This allows easier testing of this logic without going through the `SQL` layer.

Release note: none.
This makes it possible to run an import standalone and makes it clear
that the distSQL processor is only used for propagating error and status
messages to the controller.

Release note: none.
This is part of an ongoing refactor to simplify the IMPORT code base.
In particular, we remove calls to inputFinished, which is supposed to be
called after all input files are ingested in order to close the channel kvCh on
which the KVs are sent to the routine that drains the channel and sends them to KV.
Instead, the creation and closing of the channel are moved closer to where it is used.
inputFinished was really only needed for a special case in CSV, where we fan out to
a set of workers that forward KVs to kvCh, so the closing logic had to run
after these workers were done. Now the file-reading routines and the workers are grouped,
so that we can wait for all routines in the group to finish and then close the channel.
This will simplify how we save rejected rows; see the issue below.

Touches: cockroachdb#40374.

Release note: none.
This was promised to a client and will be backported to 19.2.1.
The feature should stay undocumented for now since the semantics
and UX are still not well understood.

To make the change work, we had to remove the parsing workers from
`conv.start` and move them to `readFile`, which means that for each
file a separate set of workers is brought up and torn down. Also, the
tally of the total number of rejected rows was moved to `read_import_base.go`.

Release note (sql change): add undocumented experimental_save_rejected
option to CSV IMPORT.
@cockroach-teamcity (Member)

This change is Reviewable

@spaskob spaskob requested a review from dt November 12, 2019 02:29
@bdarnell bdarnell (Contributor) left a comment


We shouldn't be introducing new features (especially experimental ones) in patch releases.

@dt dt (Member) commented Nov 12, 2019

This change was intended to move the experimental_save_rejected functionality that was added for the DELIMITED format in 19.2 so that it also applies to the CSV format, as the customer for whom it was introduced is now using CSV instead. I think we do want to backport this for them, since it was added to 19.2 at their request, but as discussed offline, we'll wait to do so in 19.2.2, in case any serious bugs are discovered in 19.2.0 in the coming weeks that require a quick release of a pure-bug-fix 19.2.1.

In the meantime, @spaskob and I are going to look through this again to see if there are any ways to minimize the backport diff further / more tightly confine it to just the applicable CSV code (i.e. if there is any way to not include the runImport refactor / leave the no-op inputFinished calls for now, etc).

@dt dt (Member) commented Mar 31, 2020

I think we ended up not wanting to add this in a patch release after all, and with 20.1 coming out in the coming weeks, I think we can close this.

@spaskob spaskob closed this Mar 31, 2020