Fix multiple issues by switching database drivers #63

Merged
8 commits merged into timescale:master on Jun 17, 2022

Conversation

@jchampio (Contributor) commented Jun 9, 2022

lib/pq has been in maintenance mode for a while, and issue #61 appears to have run into one of its idiosyncrasies: its COPY implementation assumes that you're using a query generated via pq.CopyIn(), which uses the default TEXT format, so it runs all of the incoming data through an additional escaping layer.

Our code uses CSV by default (and there appears to be no way to use TEXT format, since we're using the old COPY syntax), which means that incoming CSV containing its own escapes will be double-escaped and corrupted. This is most visible with bytea columns, but I found similar issues with tab and backslash characters, and there are probably other problematic cases too.
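To make the corruption concrete, here's a rough sketch of the failure mode in Go. The escaping below approximates lib/pq's TEXT-format backslash escaping; it is not the library's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	// A CSV line that already carries its own escapes: a bytea hex
	// literal and a literal backslash in a text field.
	line := `1,"\x00ff",C:\temp`

	// TEXT-format escaping doubles every backslash on the way out,
	// even though the server will parse the stream as CSV.
	escaped := strings.ReplaceAll(line, `\`, `\\`)

	fmt.Println(escaped) // 1,"\\x00ff",C:\\temp -- corrupted on arrival
}
```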

To fix, switch from lib/pq over to jackc/pgx. We were already depending on a part of this library before, so the new dependency isn't as big of a change as it would have been otherwise, but the switch isn't free. The compiled binary gains roughly 1.5 MB in size -- likely due to pgx's extensive type conversion system, which is unfortunate because we're not using it. Further optimization could probably be done, at the expense of having most of the DB logic go through the low-level APIs rather than database/sql.
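At the database/sql level, the swap itself is mostly a driver-name change. A minimal sketch, with a placeholder conninfo:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/jackc/pgx/v4/stdlib" // registers the "pgx" driver name
)

func main() {
	// Placeholder conninfo; any libpq-style connection string works.
	db, err := sql.Open("pgx", "host=localhost dbname=postgres")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```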

I've backfilled several tests to try to characterize the previous and new behavior. The new logic has been moved to db.CopyFromLines() for easier testing. While I was doing that, I took a look through some of the open issues and found that #31 just happens to be fixed by this change as well, presumably because pgx isn't forcing a specific DateStyle. The new interface also makes it easy to grab the actual number of rows touched, so I went ahead and fixed our row count reporting for multi-line CSV rows. (But note that the core bug for #19 and #50 remains unfixed.)
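For reference, here's a sketch of what that low-level COPY path can look like with pgx. The function shape, table name, and COPY statement are illustrative, not the PR's exact code:

```go
package db

import (
	"context"
	"database/sql"
	"io"

	"github.com/jackc/pgx/v4/stdlib"
)

// copyFrom streams r to the server and returns the true row count.
func copyFrom(ctx context.Context, conn *sql.Conn, r io.Reader) (int64, error) {
	var rows int64
	err := conn.Raw(func(driverConn interface{}) error {
		// With jackc/pgx/v4/stdlib, the raw driver connection is a *stdlib.Conn.
		pgConn := driverConn.(*stdlib.Conn).Conn().PgConn()
		tag, err := pgConn.CopyFrom(ctx, r, "COPY metrics FROM STDIN CSV")
		if err != nil {
			return err
		}
		rows = tag.RowsAffected() // rows, not lines: a multi-line CSV row counts once
		return nil
	})
	return rows, err
}
```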

The fact that multiple issues are fixed at once is nice, but it highlights what a big change this would be. It's probably safe to assume that there are other subtle changes in behavior. Assuming this approach is acceptable, I need to continue finding and testing corner cases to try to better characterize all of those user-visible changes. I also need to develop some real-world test cases, both for sanity checking and for performance/stress testing.

WDYT?

Roadmap of the patchset

  • move and backfill tests for the old processBatch() implementation, renaming it to db.CopyFromLines()
  • switch out the lib/pq driver for jackc/pgx/v4/stdlib, updating expected test behavior
  • remove dead code after the switch
  • fix the reported row count using the pgconn's COPY result
  • further backfill timestamp tests now that DateStyle copies appear to be fixed
  • new: move and backfill tests for scan(), renaming it to batch.Scan()
  • new: reimplement batch.Scan() using net.Buffers and bufio.Reader for a considerable performance improvement

Miscellanea

  • minimum Go version has been bumped to 1.13
  • utility version has been bumped to 0.4.0-dev (this is a major breaking change)
  • to avoid skipping the new tests, you must specify TEST_CONNINFO (see the README)
  • new: the -token-size option is no longer applicable and has been removed

jchampio added 6 commits June 8, 2022 11:02
This will make it easier to separate the tests of the public API from
the tests of the internal API. I would have done this at the same time I
added the public tests, but the resulting diff was not easy to read.

Add a new helper function for the COPY FROM logic, to make it easier to
test. The new tests are an attempt to capture the current behavior, but
note that the current behavior may not necessarily be correct/desirable.

To run the new tests, `go test` must be provided with a conninfo string
pointing to the database under test; see the README.
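A sketch of how such a test helper might skip, assuming the TEST_CONNINFO convention from the README (the helper name and package are hypothetical):

```go
package db_test

import (
	"os"
	"testing"
)

// testConnInfo returns the conninfo for the database under test, or
// skips the calling test if none was provided.
func testConnInfo(t *testing.T) string {
	t.Helper()
	conninfo := os.Getenv("TEST_CONNINFO")
	if conninfo == "" {
		t.Skip("TEST_CONNINFO not set; skipping tests that need a database")
	}
	return conninfo
}
```
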
lib/pq has been in maintenance mode for a while, and issue timescale#61 appears
to have run into one of its idiosyncrasies: its COPY implementation
assumes that you're using a query generated via pq.CopyIn(), which uses
the default TEXT format, so it runs all of the incoming data through an
additional escaping layer.

Our code uses CSV by default (and there appears to be no way to use TEXT
format, since we're using the old COPY syntax), which means that
incoming CSV containing its own escapes will be double-escaped and
corrupted. This is most visible with bytea columns, but the tests
currently document additional problems with tab and backslash
characters, and there are probably other problematic cases too.

To fix, switch from lib/pq over to jackc/pgx, and reimplement
db.CopyFromLines() using the PgConn.CopyFrom() API. We were already
depending on a part of this library before, so the new dependency isn't
as big of a change as it would have been otherwise, but the switch isn't
free. The compiled binary gains roughly 1.5 MB in size -- likely due to
jackc's extensive type conversion system, which is unfortunate because
we're not using it. Further optimization could probably be done, at the
expense of having most of the DB logic go through the low-level APIs
rather than database/sql.

We make use of the new sql.Conn.Raw() method to easily drop down to the
lowest API level, so bump our minimum Go version to 1.13. (1.12 has been
EOL for about three years now.) This escaping fix is a breaking change
for anyone who may have already worked around this problem, so bump the
utility's version to 0.4.0.

The splitChar argument is now unused.

As mentioned in issues timescale#19 and timescale#50, our COPY FROM implementation assumes
that one line of CSV corresponds to one row, but that's not true -- a
quoted string may be spread over multiple lines. Fix our reported row
count by looking at the result of the COPY operation.

This does NOT solve the more general issue of multiline rows, which is
that if the batch boundary comes down in the middle of a row, we'll
fail. But it is a step towards more correct behavior.
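A tiny illustration of why line counts and row counts diverge:

```go
// Three physical lines, but only two CSV rows: the first row's quoted
// field contains an embedded newline.
const batch = "1,\"first line\nsecond line\"\n" + // row 1 spans two lines
	"2,plain\n" // row 2
// COPY's command tag reports 2 rows affected; counting '\n' bytes says 3.
```
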
As mentioned in timescale#31, lib/pq appears to ignore the DateStyle setting and
formats timestamps with MDY order. This has been serendipitously fixed
by switching to the pgx driver; add a test to make sure it doesn't
regress again.
@CLAassistant commented Jun 9, 2022

CLA assistant check
All committers have signed the CLA.

internal/db/db.go: review comment (outdated, resolved)
@jchampio (Contributor, Author)
@svenklemm Thanks for the review! Would you prefer that I fix the new performance regression before merging?

@svenklemm (Member)
Either is fine with me so pick whatever you prefer.

@jchampio (Contributor, Author)
I'll plan to fix this in the same PR, then. The slowdown is pretty significant and I have an approach that is looking much faster in preliminary testing.

jchampio added 2 commits June 14, 2022 14:37
Move scan() to batch.Scan(). Previous references to global variables
are now either provided directly as parameters, or moved up to the
caller in the case of the logging code.

Tests have been backfilled in preparation for a new implementation in
the next commit.

Unlike the prior lib/pq implementation, PgConn.CopyFrom() does not
perform any internal buffering to reduce the number of network writes,
so the current implementation leads to one-write-per-line and a
significant slowdown. The overhead of Fprintln()ing to an io.Pipe is not
helping things, either.

Instead, read the accumulated lines directly into a net.Buffers
instance. Its Read() implementation interacts well with CopyFrom() --
each network write will push as much data as is available. Since the
data needs to include the end-of-line terminators, switch from
bufio.Scanner to the lower-level bufio.Reader API, which won't strip our
line endings. This is a more verbose interface than Scanner, but it
gives us close to full control over how and when copies are made, for an
even bigger performance improvement. db.CopyFromLines() now takes an
io.Reader directly (and the io.Pipe has disappeared completely).

There is an additional compatibility break introduced here, which is
that --token-size has been removed (it's no longer applicable to the
implementation now that bufio.Scanner is gone).
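A minimal sketch of the batching approach described above (the function and its parameters are illustrative, not the PR's exact batch.Scan()):

```go
package batch

import (
	"bufio"
	"io"
	"net"
)

// readBatch collects up to maxLines lines, newline terminators included,
// into a net.Buffers for a single COPY write path.
func readBatch(r *bufio.Reader, maxLines int) (net.Buffers, error) {
	var bufs net.Buffers
	for i := 0; i < maxLines; i++ {
		// ReadBytes returns a fresh copy (unlike ReadSlice), so the
		// slice is safe to retain, and the '\n' terminator is kept.
		line, err := r.ReadBytes('\n')
		if len(line) > 0 {
			bufs = append(bufs, line)
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
	}
	return bufs, nil
}
```

Since net.Buffers implements io.Reader (with a pointer receiver), &bufs can be handed straight to PgConn.CopyFrom(), and each read drains as much buffered data as the destination allows, keeping network writes large.
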
@jchampio (Contributor, Author) commented Jun 15, 2022

Okay, the scan() implementation has been moved to its own package for testing, and I've rewritten it to pull []byte slices directly into the io.Reader used by PgConn. This leads to an overall performance improvement over the lib/pq implementation, and the CPU usage of the utility drops considerably since we're not copying as much data around.

As a result of moving away from bufio.Scanner, the --token-size argument is no longer applicable and has been removed.

jchampio requested a review from svenklemm on Jun 15, 2022
jchampio merged commit 1064df0 into timescale:master on Jun 17, 2022
@jchampio (Contributor, Author)
Thanks for the reviews!
