Fix multiple issues by switching database drivers #63
Conversation
This will make it easier to separate the tests of the public API from the tests of the internal API. I would have done this at the same time I added the public tests, but the resulting diff was not easy to read.
Add a new helper function for the COPY FROM logic, to make it easier to test. The new tests are an attempt to capture the current behavior, but note that the current behavior may not necessarily be correct/desirable. To run the new tests, `go test` must be provided with a conninfo string pointing to the database under test; see the README.
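For context, here is one plausible shape such a test helper could take. This is a sketch, not the repository's actual code: the helper name `mustConnect` is hypothetical, and the use of the `TEST_CONNINFO` environment variable is an assumption based on its mention later in this PR (the README describes the real mechanism).

```go
package db_test

import (
	"database/sql"
	"os"
	"testing"

	// Registers the "pgx" driver name with database/sql.
	_ "github.com/jackc/pgx/v4/stdlib"
)

// mustConnect skips the test unless TEST_CONNINFO points at a database under
// test. Hypothetical sketch; the repository's helper may differ.
func mustConnect(t *testing.T) *sql.DB {
	conninfo := os.Getenv("TEST_CONNINFO")
	if conninfo == "" {
		t.Skip("TEST_CONNINFO not set; see the README")
	}
	db, err := sql.Open("pgx", conninfo)
	if err != nil {
		t.Fatal(err)
	}
	return db
}
```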
lib/pq has been in maintenance mode for a while, and issue timescale#61 appears to have run into one of its idiosyncrasies: its COPY implementation assumes that you're using a query generated via pq.CopyIn(), which uses the default TEXT format, so it runs all of the incoming data through an additional escaping layer. Our code uses CSV by default (and there appears to be no way to use TEXT format, since we're using the old COPY syntax), which means that incoming CSV containing its own escapes will be double-escaped and corrupted. This is most visible with bytea columns, but the tests currently document additional problems with tab and backslash characters, and there are probably other problematic cases too.

To fix, switch from lib/pq over to jackc/pgx, and reimplement db.CopyFromLines() using the PgConn.CopyFrom() API. We were already depending on a part of this library before, so the new dependency isn't as big of a change as it would have been otherwise, but the switch isn't free. The compiled binary gains roughly 1.5 MB in size -- likely due to jackc's extensive type conversion system, which is unfortunate because we're not using it. Further optimization could probably be done, at the expense of having most of the DB logic go through the low-level APIs rather than database/sql.

We make use of the new sql.Conn.Raw() method to easily drop down to the lowest API level, so bump our minimum Go version to 1.13. (1.12 has been EOL for about three years now.)

This escaping fix is a breaking change for anyone who may have already worked around this problem, so bump the utility's version to 0.4.0.
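As a hedged illustration of the `sql.Conn.Raw()` approach described above (the function name `copyFromReader` and its exact shape are assumptions, not the PR's code):

```go
// A minimal sketch of reaching pgconn's COPY API from database/sql via
// sql.Conn.Raw(), assuming the jackc/pgx/v4/stdlib driver.
package db

import (
	"context"
	"database/sql"
	"io"

	"github.com/jackc/pgx/v4/stdlib"
)

// copyFromReader streams r through PostgreSQL's COPY protocol and returns
// the number of rows copied, as reported by the server's command tag.
func copyFromReader(ctx context.Context, conn *sql.Conn, copySQL string, r io.Reader) (int64, error) {
	var rowCount int64
	err := conn.Raw(func(driverConn interface{}) error {
		// With the pgx stdlib driver, the raw driver connection wraps a
		// *pgx.Conn, which exposes the low-level *pgconn.PgConn.
		pgConn := driverConn.(*stdlib.Conn).Conn().PgConn()
		tag, err := pgConn.CopyFrom(ctx, r, copySQL)
		if err != nil {
			return err
		}
		rowCount = tag.RowsAffected()
		return nil
	})
	return rowCount, err
}
```

Dropping to the pgconn layer only for the COPY itself lets the rest of the code stay on the high-level database/sql API.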
The splitChar argument is now unused.
As mentioned in issues timescale#19 and timescale#50, our COPY FROM implementation assumes that one line of CSV corresponds to one row, but that's not true -- a quoted string may be spread over multiple lines. Fix our reported row count by looking at the result of the COPY operation. This does NOT solve the more general issue of multiline rows, which is that if the batch boundary comes down in the middle of a row, we'll fail. But it is a step towards more correct behavior.
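To see why counting lines overestimates rows, here is a small self-contained example; it uses encoding/csv purely for illustration (the tool itself does not parse CSV this way):

```go
// A quoted CSV field may contain newlines, so two physical lines can be
// one logical row.
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

func main() {
	input := "\"first\nsecond\",42\n" // two lines, one row
	records, err := csv.NewReader(strings.NewReader(input)).ReadAll()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(records)) // 1
}
```

COPY's command tag, by contrast, reports the number of rows the server actually ingested, which is what the fix reads back.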
As mentioned in timescale#31, lib/pq appears to ignore the DateStyle setting and formats timestamps with MDY order. This has been serendipitously fixed by switching to the pgx driver; add a test to make sure it doesn't regress again.
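A hypothetical sketch of the shape such a regression test could take, reusing the `mustConnect` helper sketched earlier; the PR's actual test may differ:

```go
// Hypothetical DateStyle regression test. Under 'ISO, DMY', the date
// '01/02/2000' is the 1st of February; a driver that forces MDY order
// would see January 2nd instead.
func TestDateStyleRespected(t *testing.T) {
	ctx := context.Background()
	db := mustConnect(t)
	defer db.Close()

	// Pin a single connection so the SET applies to the query that follows.
	conn, err := db.Conn(ctx)
	if err != nil {
		t.Fatal(err)
	}
	defer conn.Close()

	if _, err := conn.ExecContext(ctx, "SET DateStyle = 'ISO, DMY'"); err != nil {
		t.Fatal(err)
	}
	var month int
	row := conn.QueryRowContext(ctx, "SELECT extract(month FROM '01/02/2000'::date)::int")
	if err := row.Scan(&month); err != nil {
		t.Fatal(err)
	}
	if month != 2 {
		t.Errorf("DateStyle ignored: got month %d, want 2", month)
	}
}
```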
@svenklemm Thanks for the review! Would you prefer that I fix the new performance regression before merging?
Either is fine with me, so pick whatever you prefer.
I'll plan to fix this in the same PR, then. The slowdown is pretty significant, and I have an approach that is looking much faster in preliminary testing.
Move scan() to batch.Scan(). Previous references to global variables are now either provided directly as parameters, or moved up to the caller in the case of the logging code. Tests have been backfilled in preparation for a new implementation in the next commit.
Unlike the prior lib/pq implementation, PgConn.CopyFrom() does not perform any internal buffering to reduce the number of network writes, so the current implementation leads to one write per line and a significant slowdown. The overhead of Fprintln()ing to an io.Pipe is not helping things, either.

Instead, read the accumulated lines directly into a net.Buffers instance. Its Read() implementation interacts well with CopyFrom() -- each network write will push as much data as is available. Since the data needs to include the end-of-line terminators, switch from bufio.Scanner to the lower-level bufio.Reader API, which won't strip our line endings. This is a more verbose interface than Scanner, but it gives us close to full control over how and when copies are made, for an even bigger performance improvement.

db.CopyFromLines() now takes an io.Reader directly (and the io.Pipe has disappeared completely). There is an additional compatibility break introduced here: --token-size has been removed, since it's no longer applicable to the implementation now that bufio.Scanner is gone.
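A rough sketch of that batching idea, under the assumption of a hypothetical `nextBatch` helper; the PR's actual code may differ:

```go
// Accumulate whole lines (newlines included) into a net.Buffers, whose
// Read() hands CopyFrom() as much data as is available per network write.
package batch

import (
	"bufio"
	"io"
	"net"
)

// nextBatch reads up to maxLines lines from r. bufio.Reader.ReadBytes keeps
// the '\n' terminator (unlike bufio.Scanner) and returns a fresh copy, so
// the slices stay valid as more data is read.
func nextBatch(r *bufio.Reader, maxLines int) (net.Buffers, error) {
	bufs := make(net.Buffers, 0, maxLines)
	for i := 0; i < maxLines; i++ {
		line, err := r.ReadBytes('\n')
		if len(line) > 0 {
			bufs = append(bufs, line)
		}
		if err == io.EOF {
			if len(bufs) == 0 {
				return nil, io.EOF // nothing left to copy
			}
			return bufs, nil // a final, unterminated line is still data
		}
		if err != nil {
			return bufs, err
		}
	}
	return bufs, nil
}
```

Since net.Buffers defines Read on a pointer receiver and consumes the buffers as they are read, the batch would be passed as `&bufs` to PgConn.CopyFrom().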
Okay, the performance fix is pushed. As a result of moving away from `bufio.Scanner`, the `--token-size` option is no longer applicable, so it has been removed.
Thanks for the reviews!
`lib/pq` has been in maintenance mode for a while, and issue #61 appears to have run into one of its idiosyncrasies: its `COPY` implementation assumes that you're using a query generated via `pq.CopyIn()`, which uses the default `TEXT` format, so it runs all of the incoming data through an additional escaping layer. Our code uses `CSV` by default (and there appears to be no way to use `TEXT` format, since we're using the old `COPY` syntax), which means that incoming CSV containing its own escapes will be double-escaped and corrupted. This is most visible with `bytea` columns, but I found similar issues with tab and backslash characters, and there are probably other problematic cases too.
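As a simplified stand-in for that double-escaping (lib/pq's real TEXT-format escaping layer handles more than just backslashes), consider what happens to a CSV `bytea` literal when every backslash is doubled on its way to the server:

```go
// Simplified illustration only: doubling backslashes, as a TEXT-format
// escaping layer does, corrupts CSV data that carries its own escapes.
package main

import (
	"fmt"
	"strings"
)

func main() {
	line := `\x68656c6c6f` // a bytea value in hex format, as it appears in the CSV
	escaped := strings.ReplaceAll(line, `\`, `\\`)
	fmt.Println(escaped) // \\x68656c6c6f -- no longer a valid bytea literal
}
```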
To fix, switch from `lib/pq` over to `jackc/pgx`. We were already depending on a part of this library before, so the new dependency isn't as big of a change as it would have been otherwise, but the switch isn't free. The compiled binary gains roughly 1.5 MB in size -- likely due to `pgx`'s extensive type conversion system, which is unfortunate because we're not using it. Further optimization could probably be done, at the expense of having most of the DB logic go through the low-level APIs rather than `database/sql`.

I've backfilled several tests to try to characterize the previous and new behavior. The new logic has been moved to `db.CopyFromLines()` for easier testing. While I was doing that, I took a look through some of the open issues and found that #31 just happens to be fixed by this change as well, presumably because `pgx` isn't forcing a specific `DateStyle`. The new interface also makes it easy to grab the actual number of rows touched, so I went ahead and fixed our row count reporting for multi-line CSV rows. (But note that the core bug for #19 and #50 remains unfixed.)

The fact that multiple issues are fixed at once is nice, but it highlights what a big change this would be. It's probably safe to assume that there are other subtle changes in behavior. Assuming this approach is acceptable, I need to continue finding and testing corner cases to try to better characterize all of those user-visible changes. I also need to develop some real-world test cases, both for sanity checking and for performance/stress testing.
WDYT?
Roadmap of the patchset

- Extract the `processBatch()` implementation, renaming it to `db.CopyFromLines()`
- Swap the `lib/pq` driver for `jackc/pgx/v4/stdlib`, updating expected test behavior
- Fix the reported row count using `pgconn`'s `COPY` result
- `DateStyle` copies appear to be fixed
- Move `scan()`, renaming it to `batch.Scan()`
- Reimplement `batch.Scan()` using `net.Buffers` and `bufio.Reader` for a considerable performance improvement

Miscellanea

- New tests require `TEST_CONNINFO` (see the README)
- The `-token-size` option is no longer applicable and has been removed