Save the offending data that IMPORT is rejecting #40374
Comments
To support this, we can extend the IMPORT syntax to let the user pass an additional cloud storage URI, to which we will write any malformed input lines. The code to parse cloud storage URIs and write to cloud storage is in export_storage.go; statements that already use this functionality include BACKUP, EXPORT, and CREATE CHANGEFEED. As this is being added rather late in the release, I'd hesitate to implement it via a SQL grammar change at this point. Instead, we could implement it via an experimental option. I'm thinking the syntax could be something like:
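A rough sketch of what such a statement might look like; the option name, table, and URIs below are hypothetical, since the comment above does not pin down the exact syntax:

```sql
-- Hypothetical option name, table, and URIs: the idea is to pass a cloud
-- storage destination to which any malformed input lines are written.
IMPORT TABLE orders (id INT PRIMARY KEY, total DECIMAL, note STRING)
    CSV DATA ('nodelocal:///orders.csv')
    WITH experimental_rejected_rows_uri = 'nodelocal:///orders-rejected/';
```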
We should try to implement this for the delimited and csv import formats to start. I think it should be possible to make this work for the mysqldump and pg_dump formats in the future as well, but there are other open UX questions we'll need to answer if we want to do that. Open question: should the cloud storage URI be a file name or a directory name? Since IMPORT runs in parallel across multiple nodes, if we want to write all the malformed lines to a single file, we'll need to buffer them on a single coordinator node and have that node write the file all at once. Alternatively, if we allow the path to be a directory, each node could accumulate its malformed lines separately and write its own file.
… fields Previously, DELIMITED import mode would complain about importing the field `"This a nested single quote " inside quoted field"` when importing with the option `WITH fields_enclosed_by '"'`. The MySQL implementation handles this case correctly. For more info and examples see cockroachdb#40959. Touches cockroachdb#40374. Fixes cockroachdb#40959. Release note (enterprise change): Fixes incorrect behavior for enclosed fields. Release justification: This is low risk since only the business logic of import has changed. Moreover, this functionality has been requested by a potential client.
@rolandcrosby we can ask whether a single file or multiple files are preferred with AWM.
40960: importccl: DELIMITED should handle delimiters only at end of fields r=spaskob a=spaskob Previously, DELIMITED import mode would complain about importing the field `"This a nested single quote " inside quoted field"` when importing with the option `WITH fields_enclosed_by '"'`. The MySQL implementation handles this case correctly. For more info and examples see #40959. Touches #40374. Fixes #40959. Release note (enterprise change): Fixes incorrect behavior for enclosed fields. Release justification: This is low risk since only the business logic of import has changed. Moreover, this functionality has been requested by a potential client. Co-authored-by: spaskob <spas@cockroachlabs.com>
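For illustration, a statement of the shape below exercises the case described in this PR; the table definition and file path are hypothetical, and only the `WITH fields_enclosed_by '"'` option is taken from the text above:

```sql
-- Hypothetical table and path. A data row whose quoted field contains a bare
-- quote character, such as:
--   1<TAB>"This a nested single quote " inside quoted field"
-- was previously rejected by DELIMITED import but is accepted by MySQL and,
-- after this change, by IMPORT as well.
IMPORT TABLE notes (id INT PRIMARY KEY, body STRING)
    DELIMITED DATA ('nodelocal:///notes.txt')
    WITH fields_enclosed_by '"';
```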
41430: importccl: enabling saving of bad csv rows to a side file r=spaskob a=spaskob importccl: enabling saving of bad csv rows to a side file When importing from large csv files it's common to have a few offending rows. Currently `IMPORT` aborts at the first error, which makes for a tedious task of manually fixing each problem and re-running. Instead, users can specify the new option `WITH experimental_save_rejected`, which will not stop on bad rows but save them in a side file called `<original_csv_file>.rejected` and continue. After fixing the problems in that file, the user can re-run the import on the rejected file using `IMPORT INTO`. Touches #40374. Release note (sql change): enable skipping of faulty rows in IMPORT. Co-authored-by: Spas Bojanov <spas@cockroachlabs.com>
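A rough sketch of the workflow this enables; the table, columns, and paths are hypothetical, while the option name and the `.rejected` suffix come from the PR description above:

```sql
-- Hypothetical table and paths; experimental_save_rejected and the .rejected
-- side file are as described in the PR above.
IMPORT TABLE orders (id INT PRIMARY KEY, qty INT, note STRING)
    CSV DATA ('nodelocal:///orders.csv')
    WITH experimental_save_rejected;

-- Fix the rows written to orders.csv.rejected, then re-import only those rows.
IMPORT INTO orders (id, qty, note)
    CSV DATA ('nodelocal:///orders.csv.rejected');
```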
This is part of an ongoing refactor to simplify the IMPORT code base. In particular, we remove calls to inputFinished, which is supposed to be called after all input files are ingested in order to close the channel kvCh, on which the KVs are sent to the routine that drains the channel and sends them to KV. Instead, the creation and closing of the channel are moved closer to where it is used. inputFinished was really only used for a special case in CSV where we have a fan-out to a set of workers that forward the KVs to kvCh, so the closing logic needs to be called after these workers are done. Now the reading of the files and the workers are grouped so that we can wait for all routines in the group to finish and then close the channel. This will simplify how we save rejected rows. See the issue below. Touches: cockroachdb#40374. Release note: none.
41972: importccl: remove unnecessary interface function r=spaskob a=spaskob importccl: remove unnecessary interface function This is part of an ongoing refactor to simplify the IMPORT code base. In particular, we remove calls to inputFinished, which is supposed to be called after all input files are ingested in order to close the channel kvCh, on which the KVs are sent to the routine that drains the channel and sends them to KV. Instead, the creation and closing of the channel are moved closer to where it is used. inputFinished was really only used for a special case in CSV where we have a fan-out to a set of workers that forward the KVs to kvCh, so the closing logic needs to be called after these workers are done. Now the reading of the files and the workers are grouped so that we can wait for all routines in the group to finish and then close the channel. This will simplify how we save rejected rows. See the issue below. Touches: #40374. Release note: none. Co-authored-by: Spas Bojanov <pachob@gmail.com>
42112: importccl: add experimental_save_rejected option to CSV IMPORT r=spaskob a=spaskob This was promised to a client and will be backported to 19.2.1. The feature should stay undocumented for now since the semantics and UX are still not well understood. To make the change work, we had to remove the parsing workers from `conv.start` and move them to `readFile`, which means that for each file a separate set of workers is brought up and torn down. Also, the tally of the total number of rejected rows was moved to `read_import_base.go`. Touches #40374. Release note (sql change): add undocumented experimental_save_rejected option to CSV IMPORT. Co-authored-by: Spas Bojanov <spas@cockroachlabs.com>
45269: importccl: Parallelize avro import r=miretskiy a=miretskiy Parallelize the avro importer to improve its throughput (2.8x improvement). Touches #40374. Fixes #45097. Release notes (performance): Faster avro import

45482: storage: integrate Concurrency Manager into Replica request path r=nvanbenschoten a=nvanbenschoten Related to #41720. Related to #44976. This commit integrates the new concurrency package into the storage package. Each Replica is given a concurrency manager, which replaces its existing latch manager and txn wait queue. The change also uses the concurrency manager to simplify the role of the intent resolver: the intent resolver no longer directly handles WriteIntentErrors, and as a result we are able to delete the contention queue entirely.

With this change, all requests are now sequenced through the concurrency manager. When sequencing, latches are acquired and conflicting locks are detected. If any locks are found, the requests wait in lock wait-queues for the locks to be resolved. This is a major deviation from how things currently work because today, even with the contention queue, requests end up waiting for conflicting transactions to commit/abort in the txnWaitQueue after at least one RPC. Now, requests wait directly next to the intents/locks that they are waiting on and react directly to the resolution of these intents/locks.

Once requests are sequenced by the concurrency manager, they are theoretically fully isolated from all conflicting requests. However, this is not strictly true today because we have not yet pulled all replicated locks into the concurrency manager's lock table; we will do so in a future change. Until then, the concurrency manager maintains a notion of "intent discovery", which is integrated into the Replica-level concurrency retry loop. Performance numbers will be published shortly. This will be followed by performance numbers using the SELECT FOR UPDATE locking (#40205) improvements that this change enables.

45484: sql: simplify connection state machine - stop tracking retry intent r=andreimatei a=andreimatei Before this patch, the SQL connection state machine had an optimization: if a transaction that hadn't used "SAVEPOINT cockroach_restart" encountered a retriable error that we can't auto-retry, then we'd release the txn's locks eagerly and enter the Aborted state, as opposed to transactions that had used "SAVEPOINT cockroach_restart", which go to RestartWait. This optimization is a significant complication for the state machine, so this patch removes it. All transactions now go to RestartWait and wait for a ROLLBACK to release the locks. On the flip side, "RELEASE SAVEPOINT cockroach_restart" and "ROLLBACK SAVEPOINT cockroach_restart" now work even for transactions that haven't explicitly declared that savepoint, which is nice, although I don't promise I'll keep it working. Release note: None

Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com> Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@spaskob can you describe what else you see happening on this issue?
Nothing in particular - feel free to close it after your PR merges.
Is your feature request related to a problem? Please describe.
The IMPORT command stops immediately when the data does not match the import specification:
`pq: nodelocal:///quotetestdata.csv: row 4: expected 3 fields, got 2`
It takes extra steps to look at the reported row and track down each offending line, and this becomes an iterative process if there are many errors, especially when the offending data is near the end of a large file that takes hours to process.
Describe the solution you'd like
Save the offending data to another destination so that it can be fixed and re-imported on its own, rather than reprocessing the entire file again only to find one error at a time.
Describe alternatives you've considered
Additional context