Fix multiple issues by switching database drivers #63

Merged
8 commits merged into timescale:master on Jun 17, 2022

Conversation

@jchampio (Contributor) commented Jun 9, 2022

lib/pq has been in maintenance mode for a while, and issue #61 appears to have run into one of its idiosyncrasies: its COPY implementation assumes that you're using a query generated via pq.CopyIn(), which uses the default TEXT format, so it runs all of the incoming data through an additional escaping layer.

Our code uses CSV by default (and there appears to be no way to use TEXT format, since we're using the old COPY syntax), which means that incoming CSV containing its own escapes will be double-escaped and corrupted. This is most visible with bytea columns, but I found similar issues with tab and backslash characters, and there are probably other problematic cases too.
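To make the corruption concrete, here's a rough sketch of the failure mode in Go. The escaping below approximates lib/pq's TEXT-format backslash escaping; it is not the library's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	// A CSV line that already carries its own escapes: a bytea hex
	// literal and a literal backslash in a text field.
	line := `1,"\x00ff",C:\temp`

	// TEXT-format escaping doubles every backslash on the way out,
	// even though the server will parse the stream as CSV.
	escaped := strings.ReplaceAll(line, `\`, `\\`)

	fmt.Println(escaped) // 1,"\\x00ff",C:\\temp -- corrupted on arrival
}
```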

To fix, switch from lib/pq over to jackc/pgx. We were already depending on a part of this library before, so the new dependency isn't as big of a change as it would have been otherwise, but the switch isn't free. The compiled binary gains roughly 1.5 MB in size -- likely due to pgx's extensive type conversion system, which is unfortunate because we're not using it. Further optimization could probably be done, at the expense of having most of the DB logic go through the low-level APIs rather than database/sql.
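At the database/sql level, the swap itself is mostly a driver-name change. A minimal sketch, with a placeholder conninfo:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/jackc/pgx/v4/stdlib" // registers the "pgx" driver name
)

func main() {
	// Placeholder conninfo; any libpq-style connection string works.
	db, err := sql.Open("pgx", "host=localhost dbname=postgres")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```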

I've backfilled several tests to try to characterize the previous and new behavior. The new logic has been moved to db.CopyFromLines() for easier testing. While I was doing that, I took a look through some of the open issues and found that #31 just happens to be fixed by this change as well, presumably because pgx isn't forcing a specific DateStyle. The new interface also makes it easy to grab the actual number of rows touched, so I went ahead and fixed our row count reporting for multi-line CSV rows. (But note that the core bug for #19 and #50 remains unfixed.)
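For reference, here's a sketch of what that low-level COPY path can look like with pgx. The function shape, table name, and COPY statement are illustrative, not the PR's exact code:

```go
package db

import (
	"context"
	"database/sql"
	"io"

	"github.com/jackc/pgx/v4/stdlib"
)

// copyFrom streams r to the server and returns the true row count.
func copyFrom(ctx context.Context, conn *sql.Conn, r io.Reader) (int64, error) {
	var rows int64
	err := conn.Raw(func(driverConn interface{}) error {
		// With jackc/pgx/v4/stdlib, the raw driver connection is a *stdlib.Conn.
		pgConn := driverConn.(*stdlib.Conn).Conn().PgConn()
		tag, err := pgConn.CopyFrom(ctx, r, "COPY metrics FROM STDIN CSV")
		if err != nil {
			return err
		}
		rows = tag.RowsAffected() // rows, not lines: a multi-line CSV row counts once
		return nil
	})
	return rows, err
}
```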

The fact that multiple issues are fixed at once is nice, but it highlights what a big change this would be. It's probably safe to assume that there are other subtle changes in behavior. Assuming this approach is acceptable, I need to continue finding and testing corner cases to try to better characterize all of those user-visible changes. I also need to develop some real-world test cases, both for sanity checking and for performance/stress testing.

WDYT?

Roadmap of the patchset

  • move and backfill tests for the old processBatch() implementation, renaming it to db.CopyFromLines()
  • switch out the lib/pq driver for jackc/pgx/v4/stdlib, updating expected test behavior
  • remove dead code after the switch
  • fix the reported row count using the pgconn's COPY result
  • further backfill timestamp tests now that DateStyle copies appear to be fixed
  • new: move and backfill tests for scan(), renaming it to batch.Scan()
  • new: reimplement batch.Scan() using net.Buffers and bufio.Reader for a considerable performance improvement

Miscellanea

  • minimum Go version has been bumped to 1.13
  • utility version has been bumped to 0.4.0-dev (this is a major breaking change)
  • to avoid skipping the new tests, you must specify TEST_CONNINFO (see the README)
  • new: the -token-size option is no longer applicable and has been removed

jchampio added 6 commits June 8, 2022 11:02
This will make it easier to separate the tests of the public API from
the tests of the internal API. I would have done this at the same time I
added the public tests, but the resulting diff was not easy to read.

Add a new helper function for the COPY FROM logic, to make it easier to
test. The new tests are an attempt to capture the current behavior, but
note that the current behavior may not necessarily be correct/desirable.

To run the new tests, `go test` must be provided with a conninfo string
pointing to the database under test; see the README.
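A sketch of how such a test helper might skip, assuming the TEST_CONNINFO convention from the README (the helper name and package are hypothetical):

```go
package db_test

import (
	"os"
	"testing"
)

// testConnInfo returns the conninfo for the database under test, or
// skips the calling test if none was provided.
func testConnInfo(t *testing.T) string {
	t.Helper()
	conninfo := os.Getenv("TEST_CONNINFO")
	if conninfo == "" {
		t.Skip("TEST_CONNINFO not set; skipping tests that need a database")
	}
	return conninfo
}
```
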
lib/pq has been in maintenance mode for a while, and issue timescale#61 appears
to have run into one of its idiosyncrasies: its COPY implementation
assumes that you're using a query generated via pq.CopyIn(), which uses
the default TEXT format, so it runs all of the incoming data through an
additional escaping layer.

Our code uses CSV by default (and there appears to be no way to use TEXT
format, since we're using the old COPY syntax), which means that
incoming CSV containing its own escapes will be double-escaped and
corrupted. This is most visible with bytea columns, but the tests
currently document additional problems with tab and backslash
characters, and there are probably other problematic cases too.

To fix, switch from lib/pq over to jackc/pgx, and reimplement
db.CopyFromLines() using the PgConn.CopyFrom() API. We were already
depending on a part of this library before, so the new dependency isn't
as big of a change as it would have been otherwise, but the switch isn't
free. The compiled binary gains roughly 1.5 MB in size -- likely due to
jackc's extensive type conversion system, which is unfortunate because
we're not using it. Further optimization could probably be done, at the
expense of having most of the DB logic go through the low-level APIs
rather than database/sql.

We make use of the new sql.Conn.Raw() method to easily drop down to the
lowest API level, so bump our minimum Go version to 1.13. (1.12 has been
EOL for about three years now.) This escaping fix is a breaking change
for anyone who may have already worked around this problem, so bump the
utility's version to 0.4.0.

The splitChar argument is now unused.

As mentioned in issues timescale#19 and timescale#50, our COPY FROM implementation assumes
that one line of CSV corresponds to one row, but that's not true -- a
quoted string may be spread over multiple lines. Fix our reported row
count by looking at the result of the COPY operation.

This does NOT solve the more general issue of multiline rows, which is
that if the batch boundary comes down in the middle of a row, we'll
fail. But it is a step towards more correct behavior.
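A tiny illustration of why line counts and row counts diverge:

```go
// Three physical lines, but only two CSV rows: the first row's quoted
// field contains an embedded newline.
const batch = "1,\"first line\nsecond line\"\n" + // row 1 spans two lines
	"2,plain\n" // row 2
// COPY's command tag reports 2 rows affected; counting '\n' bytes says 3.
```
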
As mentioned in timescale#31, lib/pq appears to ignore the DateStyle setting and
formats timestamps with MDY order. This has been serendipitously fixed
by switching to the pgx driver; add a test to make sure it doesn't
regress again.
@CLAassistant commented Jun 9, 2022

CLA assistant check
All committers have signed the CLA.

internal/db/db.go: review comment (outdated, resolved)
@jchampio (Contributor, Author)
@svenklemm Thanks for the review! Would you prefer that I fix the new performance regression before merging?

@svenklemm (Member)
Either is fine with me so pick whatever you prefer.

@jchampio (Contributor, Author)
I'll plan to fix this in the same PR, then. The slowdown is pretty significant and I have an approach that is looking much faster in preliminary testing.

jchampio added 2 commits June 14, 2022 14:37
Move scan() to batch.Scan(). Previous references to global variables
are now either provided directly as parameters, or moved up to the
caller in the case of the logging code.

Tests have been backfilled in preparation for a new implementation in
the next commit.

Unlike the prior lib/pq implementation, PgConn.CopyFrom() does not
perform any internal buffering to reduce the number of network writes,
so the current implementation leads to one-write-per-line and a
significant slowdown. The overhead of Fprintln()ing to an io.Pipe is not
helping things, either.

Instead, read the accumulated lines directly into a net.Buffers
instance. Its Read() implementation interacts well with CopyFrom() --
each network write will push as much data as is available. Since the
data needs to include the end-of-line terminators, switch from
bufio.Scanner to the lower-level bufio.Reader API, which won't strip our
line endings. This is a more verbose interface than Scanner, but it
gives us close to full control over how and when copies are made, for an
even bigger performance improvement. db.CopyFromLines() now takes an
io.Reader directly (and the io.Pipe has disappeared completely).

There is an additional compatibility break introduced here, which is
that --token-size has been removed (it's no longer applicable to the
implementation now that bufio.Scanner is gone).
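A minimal sketch of the batching approach described above (the function and its parameters are illustrative, not the PR's exact batch.Scan()):

```go
package batch

import (
	"bufio"
	"io"
	"net"
)

// readBatch collects up to maxLines lines, newline terminators included,
// into a net.Buffers for a single COPY write path.
func readBatch(r *bufio.Reader, maxLines int) (net.Buffers, error) {
	var bufs net.Buffers
	for i := 0; i < maxLines; i++ {
		// ReadBytes returns a fresh copy (unlike ReadSlice), so the
		// slice is safe to retain, and the '\n' terminator is kept.
		line, err := r.ReadBytes('\n')
		if len(line) > 0 {
			bufs = append(bufs, line)
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
	}
	return bufs, nil
}
```

Since net.Buffers implements io.Reader (with a pointer receiver), &bufs can be handed straight to PgConn.CopyFrom(), and each read drains as much buffered data as the destination allows, keeping network writes large.
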
@jchampio (Contributor, Author) commented Jun 15, 2022

Okay, the scan() implementation has been moved to its own package for testing, and I've rewritten it to pull []byte slices directly into the io.Reader used by PgConn. This leads to an overall performance improvement over the lib/pq implementation, and the CPU usage of the utility drops considerably since we're not copying as much data around.

As a result of moving away from bufio.Scanner, the --token-size argument is no longer applicable and has been removed.

jchampio requested a review from svenklemm on Jun 15, 2022
jchampio merged commit 1064df0 into timescale:master on Jun 17, 2022
@jchampio (Contributor, Author)
Thanks for the reviews!
