*: parse the data source directly into data and skip the KV encoder #145
Conversation
Force-pushed from ce043f9 to f0dadce
Force-pushed from f81aead to 44db439
@lonng Some metrics are temporarily removed; we need to decide whether to tweak the metrics or the process. The old process:
New process:
The following metrics were affected by this change and may need to be repurposed:
@kennytm I think we could add some new metrics about:
Force-pushed from b83efb5 to 5624ac0
This skips the more complex pingcap/parser, and speeds up parsing by 50%. We have also refactored the KV delivery mechanism to use channels directly, and revamped the metrics:
- The metrics about engines now live in their own `engines` counter; the `tables` counter is exclusively about tables.
- Removed `block_read_seconds`, `block_read_bytes`, and `block_encode_seconds`, since the concept of "block" no longer applies; they are replaced by equivalents named `row_***`.
- Removed `chunk_parser_read_row_seconds`, which overlapped with `row_read_seconds`.
- Changed `block_deliver_bytes` into a histogram vec with kind=index or kind=data, and introduced `block_deliver_kv_pairs`.
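A minimal sketch of what the reshaped deliver metrics could look like with prometheus/client_golang (the exported names, help strings, and bucket choices here are illustrative, not the PR's actual values):

```go
package metric

import "github.com/prometheus/client_golang/prometheus"

var (
	// block_deliver_bytes as a histogram vec, partitioned by kind=data/index.
	BlockDeliverBytesHistogram = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "lightning_block_deliver_bytes",
			Help:    "number of bytes delivered to the importer in one batch",
			Buckets: prometheus.ExponentialBuckets(512, 2, 10),
		},
		[]string{"kind"},
	)
	// The newly introduced block_deliver_kv_pairs, with the same kind label.
	BlockDeliverKVPairsHistogram = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "lightning_block_deliver_kv_pairs",
			Help:    "number of KV pairs delivered to the importer in one batch",
			Buckets: prometheus.ExponentialBuckets(16, 2, 10),
		},
		[]string{"kind"},
	)
)

func init() {
	prometheus.MustRegister(BlockDeliverBytesHistogram, BlockDeliverKVPairsHistogram)
}
```

Observations would then be recorded per kind, e.g. `BlockDeliverBytesHistogram.WithLabelValues("data").Observe(float64(size))`.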
Only kill Lightning if the whole chunk is imported exactly. The chunk checkpoint may be recorded before a chunk is fully written, which would hit the failpoint more than 5 times.
This helps debug some mysterious cancellations where the log is inhibited. Added IsReallyContextCanceledError() for code logic affected by the error type.
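A sketch of what such a helper might look like, assuming pingcap/errors for cause extraction (the actual unwrapping logic in the PR may differ):

```go
package common

import (
	"context"

	"github.com/pingcap/errors"
)

// IsReallyContextCanceledError reports whether err is caused by a genuine
// context cancellation, so callers can branch on the error type instead of
// matching log strings. Wrapped errors are unwrapped to their root cause.
func IsReallyContextCanceledError(err error) bool {
	return errors.Cause(err) == context.Canceled
}
```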
Force-pushed from 5624ac0 to 216c812
/run-all-tests
I feel that Lightning lacks unit tests.
lightning/restore/checkpoints.go (Outdated)
	_, err = chunkStmt.ExecContext(
		c, tableName, engineID,
-		value.Key.Path, value.Key.Offset, value.Columns, value.ShouldIncludeRowID,
+		value.Key.Path, value.Key.Offset, colPerm,
I only know we call InsertEngineCheckpoints after calling populateChunks. If so, the columns column is always empty.
Removed this and the checksum (always 0 too).
func (cr *chunkRestore) saveCheckpoint(t *TableRestore, engineID int32, rc *RestoreController) {
	rc.saveCpCh <- saveCp{
		tableName: t.tableName,
		merger: &RebaseCheckpointMerger{
Is it enough to only save it in L529? It seems AllocBase wouldn't change.
Unfortunately no. AllocBase needs to be larger than every value of _tidb_rowid (or the integer primary key), which cannot be determined until we've read all the data. Added a comment for this.
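To illustrate the point, a sketch only (`rebaseAfterChunk` and `maxRowID` are our names, and autoid.Allocator's Rebase signature has varied across TiDB versions): the base can only be raised once the maximum row ID of the data is known.

```go
package restore

import "github.com/pingcap/tidb/meta/autoid"

// rebaseAfterChunk is illustrative: maxRowID is the largest _tidb_rowid (or
// integer primary key) observed while encoding the chunk. Raising the base
// past it guarantees freshly allocated row IDs cannot collide with imported
// ones; allocIDs=false means we only move the base without reserving a range.
func rebaseAfterChunk(alloc autoid.Allocator, tableID, maxRowID int64) error {
	return alloc.Rebase(tableID, maxRowID, false)
}
```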
{"t2", "CREATE TABLE `t2` (`c1` varchar(30000) NOT NULL)", "failed to ExecDDLSQL `mockdb`.`t2`:.*"}, | ||
{"t3", "CREATE TABLE `t3-a` (`c1-a` varchar(5) NOT NULL)", ""}, | ||
{"t1", "CREATE TABLE `t1` (`c1` varchar(5) NOT NULL)"}, | ||
// {"t2", "CREATE TABLE `t2` (`c1` varchar(30000) NOT NULL)"}, // no longer able to create this kind of table. |
Is this case meaningless now? Or should we add some error cases?
In this PR we no longer parse the CREATE TABLE DDL, and instead directly unmarshal the JSON result from TiDB (calling tables.TableFromMeta). So yeah, this case becomes meaningless: either you can't create a VARCHAR(30000) in TiDB, or you can and it produces proper JSON which Lightning accepts without error. Anyway, added a separate unit test to ensure malformed table info produces an error.
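Roughly, the new flow looks like this (a sketch: `loadTable` is our name, and the exact TableFromMeta signature depends on the TiDB version vendored):

```go
package restore

import (
	"encoding/json"

	"github.com/pingcap/parser/model"
	"github.com/pingcap/tidb/meta/autoid"
	"github.com/pingcap/tidb/table"
	"github.com/pingcap/tidb/table/tables"
)

// loadTable decodes the table info JSON fetched from TiDB and builds a
// table.Table from it. A malformed or inconsistent TableInfo surfaces as an
// error here instead of during DDL parsing.
func loadTable(raw []byte, alloc autoid.Allocator) (table.Table, error) {
	var tblInfo model.TableInfo
	if err := json.Unmarshal(raw, &tblInfo); err != nil {
		return nil, err
	}
	return tables.TableFromMeta(alloc, &tblInfo)
}
```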
@@ -19,11 +19,12 @@ import (
	"strings"

	. "github.com/pingcap/check"
+	"github.com/pingcap/errors"
We don't add more tests for it, only the old cases?
The old unit tests didn't compile since the interface has changed. There are no new or deleted tests in this file.
Rest LGTM
/run-all-tests
Rest LGTM
@GregoryIan PTAL again
LGTM
/run-all-tests
What problem does this PR solve?
In our testing with a 4.1 TB workload, we found that parsing SQL takes almost half of the time needed to encode a row. Since we have already used a parser to extract each row, parsing it again wastes computing resources. Additionally, for CSV we had to perform the complex and unnecessary chain Parse CSV → Reconstruct SQL → Parse SQL.
What is changed and how it works?
We change the Lightning parsers to directly produce an array of `types.Datum` for both CSV and SQL. We also get rid of the abstraction layer `KvEncoder` (since it only accepts SQL statements), and directly use `(*table.Table).AddRecord` to convert the `[]types.Datum` into KV pairs. This slashes half of the encoding time according to our experiments.
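As a sketch of the new path (the helper name and surrounding plumbing are ours, and AddRecord's exact signature varies across TiDB versions):

```go
package kv

import (
	"github.com/pingcap/tidb/sessionctx"
	"github.com/pingcap/tidb/table"
	"github.com/pingcap/tidb/types"
)

// encodeRow feeds a parsed row straight into the table's KV encoder: the
// parser already produced []types.Datum, and AddRecord encodes it into KV
// pairs in the session's transaction buffer, with no SQL text in between.
func encodeRow(se sessionctx.Context, tbl table.Table, row []types.Datum) (int64, error) {
	return tbl.AddRecord(se, row)
}
```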
Check List
Tests
Side effects
Related changes