Multi-line records are not supported! panic: pq: unterminated CSV quoted field #50
First of all, the reported row count is wrong, but it looks like the copy works anyway, just with a wrong result output.

How to reproduce

Table to insert into:

CREATE TABLE test (
  id SERIAL PRIMARY KEY,
  state TEXT,
  time TIMESTAMPTZ,
  date TEXT
)

Content of test file
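The file contents were not preserved in this report. For illustration, here is a hypothetical reconstruction of its first record, pieced together from the batch debug output and the COPY 9 result in the follow-up comment below (exact whitespace is a guess; the "sofrware" typo appears in the original data). The point is that the fourth column is a single quoted CSV value spanning nine physical lines, with embedded quotes doubled per CSV rules:

1,failed,2019-07-29 07:43:14.197455,"---
:image:
  :name: ubuntu:16.04
:before_script:
- echo ""Install sofrware""
- apt-get update && apt-get install -y sudo wget
:script:
- errcount=0
"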
$ timescaledb-parallel-copy --table test --file data.csv
COPY 52

52 rows were reported as inserted, which is apparently the number of physical lines in the file rather than the 5 records Postgres actually created. But the table itself looks correct:

postgres=# select * from test limit 5;
id | state | time | date
----+--------+-------------------------------+--------------------------------------------------
1 | failed | 2019-07-29 07:43:14.197455+00 | ---
| | | :image:
| | | :name: ubuntu:16.04
| | | :before_script:
| | | - echo "Install sofrware"
| | | - apt-get update && apt-get install -y sudo wget
| | | :script:
| | | - errcount=0
| | |
2 | failed | 2019-08-29 08:38:31.747244+00 | ---
| | | :image:
| | | :name: ubuntu:16.04
| | | :before_script:
| | | - echo "Install sofrware"
| | | - apt-get update && apt-get install -y sudo wget
| | | :script:
| | | - echo "Tests"
| | | - "./tests.sh"
| | |
3 | failed | 2019-09-29 08:52:44.77306+00 | ---
| | | :image:
| | | :name: docker:stable
| | | :services:
| | | - :name: docker:stable-dind
| | | :before_script:
| | | - docker info
| | | :script:
| | | - echo "Tests"
| | | - "./tests.sh"
| | |
4 | failed | 2019-10-29 09:31:14.098283+00 | ---
| | | :image:
| | | :name: ubuntu:16.04
| | | :services:
| | | - :name: docker:stable-dind
| | | :before_script:
| | | - docker info
| | | :script:
| | | - echo "Tests"
| | | - "./tests.sh"
| | |
5 | failed | 2019-11-29 09:55:21.588764+00 | ---
| | | :image:
| | | :name: docker:stable
| | | :services:
| | | - :name: docker:stable-dind
| | | :before_script:
| | | - docker info
| | | :script:
| | | - echo "Tests"
| | | - "./tests.sh"
| | |
(5 rows)
I seem to have found the cause of the error. It's all about the same problem of multi-line records. Specifying the --batch-size and --workers options changes where it fails, as demonstrated below.

How to reproduce

Table to insert into:

CREATE TABLE test (
  id SERIAL PRIMARY KEY,
  state TEXT,
  time TIMESTAMPTZ,
  date TEXT
)

Content of test file

Single worker

--batch-size 1 --workers 1

$ timescaledb-parallel-copy --table test --file data.csv --batch-size 1 --workers 1
>>> batch=&{[1,failed,2019-07-29 07:43:14.197455,"---]}
>>> copyCmd=COPY "public"."test" FROM STDIN WITH DELIMITER ',' CSV
panic: pq: unterminated CSV quoted field
goroutine 19 [running]:
main.processBatches(0xc0000c4350, 0xc0000ca660)
/home/binakot/repos/github/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:263 +0x978
created by main.main
/home/binakot/repos/github/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:148 +0x1bb
--batch-size 2 --workers 1

$ timescaledb-parallel-copy --table test --file data.csv --batch-size 2 --workers 1
>>> batch=&{[1,failed,2019-07-29 07:43:14.197455,"--- :image:]}
>>> copyCmd=COPY "public"."test" FROM STDIN WITH DELIMITER ',' CSV
panic: pq: unterminated CSV quoted field
goroutine 6 [running]:
main.processBatches(0xc00001a460, 0xc00002a720)
/home/binakot/repos/github/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:263 +0x978
created by main.main
/home/binakot/repos/github/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:148 +0x1bb

--batch-size 3 --workers 1

$ timescaledb-parallel-copy --table test --file data.csv --batch-size 3 --workers 1
>>> batch=&{[1,failed,2019-07-29 07:43:14.197455,"--- :image: :name: ubuntu:16.04]}
>>> copyCmd=COPY "public"."test" FROM STDIN WITH DELIMITER ',' CSV
panic: pq: unterminated CSV quoted field
goroutine 6 [running]:
main.processBatches(0xc00001a460, 0xc00002a720)
/home/binakot/repos/github/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:263 +0x978
created by main.main
/home/binakot/repos/github/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:148 +0x1bb

--batch-size 100 --workers 1

$ timescaledb-parallel-copy --table test --file data.csv --batch-size 100 --workers 1
COPY 9

Two workers

--batch-size 1 --workers 2

$ timescaledb-parallel-copy --table test --file data.csv --batch-size 1 --workers 2
>>> batch=&{[:image:]}
>>> copyCmd=COPY "public"."test" FROM STDIN WITH DELIMITER ',' CSV
panic: pq: invalid input syntax for integer: ":image:"
goroutine 19 [running]:
main.processBatches(0xc0000c4350, 0xc0000ca660)
/home/binakot/repos/github/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:263 +0x978
created by main.main
/home/binakot/repos/github/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:148 +0x1bb

The debug output above was produced with extra logging code. The current implementation depends entirely on reading the file line by line. We need to think about how to make sure the data is read by records rather than by lines, because any worker can end up in the middle of a multi-line record. A sketch of the idea follows.
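A minimal sketch of what record-aware reading could look like, assuming the CSV default where QUOTE and ESCAPE are both '"' (this is an illustration, not the tool's actual code): keep appending physical lines to the current record for as long as an unclosed quote leaves us inside a field.

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// insideQuotes reports whether we are still inside a quoted field
// after consuming line, given whether we started inside one. With
// QUOTE == ESCAPE == '"', an escaped quote ("") toggles the state
// twice and cancels out, so simple parity tracking is enough.
func insideQuotes(line string, inQuotes bool) bool {
	for i := 0; i < len(line); i++ {
		if line[i] == '"' {
			inQuotes = !inQuotes
		}
	}
	return inQuotes
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	var record strings.Builder
	inQuotes := false
	for scanner.Scan() {
		record.WriteString(scanner.Text())
		inQuotes = insideQuotes(scanner.Text(), inQuotes)
		if inQuotes {
			// Mid-record: the newline belongs to the quoted value.
			record.WriteByte('\n')
			continue
		}
		fmt.Printf("record: %q\n", record.String())
		record.Reset()
	}
}

A batcher built on top of this would only cut a batch at a record boundary, so no worker could ever start in the middle of a quoted value.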
@RobAtticus Hello! Any ideas on how to fix this?
Also had this issue. Had to resort to …
As mentioned in issues timescale#19 and timescale#50, our COPY FROM implementation assumes that one line of CSV corresponds to one row, but that's not true -- a quoted string may be spread over multiple lines. Fix our reported row count by looking at the result of the COPY operation.

This does NOT solve the more general issue of multiline rows, which is that if the batch boundary comes down in the middle of a row, we'll fail. But it is a step towards more correct behavior.
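The row-count fix can be pictured like this: rather than counting input lines, read the rows-affected value that Postgres itself reports for the COPY. A hedged sketch using database/sql and lib/pq (the connection string and column values are placeholders; assumes a lib/pq version whose closing CopyIn Exec reports the COPY count):

package main

import (
	"database/sql"
	"fmt"
	"log"

	"github.com/lib/pq" // importing also registers the "postgres" driver
)

func main() {
	db, err := sql.Open("postgres", "dbname=postgres sslmode=disable") // placeholder conninfo
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	txn, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	stmt, err := txn.Prepare(pq.CopyIn("test", "id", "state", "time", "date"))
	if err != nil {
		log.Fatal(err)
	}
	// One record whose last column contains a newline.
	if _, err := stmt.Exec(1, "failed", "2019-07-29 07:43:14.197455", "---\n:image:"); err != nil {
		log.Fatal(err)
	}
	// The final empty Exec ends the COPY; its result carries the
	// count Postgres reports (records, not physical lines).
	res, err := stmt.Exec()
	if err != nil {
		log.Fatal(err)
	}
	n, err := res.RowsAffected()
	if err != nil {
		log.Fatal(err)
	}
	if err := stmt.Close(); err != nil {
		log.Fatal(err)
	}
	if err := txn.Commit(); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("COPY %d\n", n)
}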
my comment was rubbish. deleting.... using …
As mentioned in 8da678e, the batching algorithm assumes that one row consists of exactly one line. If a row contains multi-line quoted column values, and we happen to split that row across multiple batches, we'll fail the COPY.

Add a naive CSV parser to Scan() which prevents this accidental splitting. This parser searches the incoming CSV for the QUOTE and ESCAPE characters (which can be customized by the new -quote and -escape flags to the tool), and prevents Scan() from sending out a new batch if it detects that we're still inside a quoted value.

Because it's imperative that our parser matches the Postgres CSV parser exactly (and the rules for parsing are not as intuitive as the rules for production), a wide variety of test cases have been added. Running the suite with TEST_CONNINFO will additionally run these test cases against a live Postgres server, as a sanity check to ensure that each case was coded correctly.

Fixes timescale#19 and timescale#50.

This implementation adds a significant CPU hit, in the form of naive iteration over every byte in the input. I duplicated each benchmark, to measure the case where QUOTE and ESCAPE are the same and the case where they are different. (The code paths diverge enough that it may be useful to optimize them separately.)

For the following machine (with CPU scaling disabled to the best of my knowledge):

goos: linux
goarch: amd64
pkg: github.com/timescale/timescaledb-parallel-copy/internal/batch
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz

benchstat reports the following changes as of this commit, when compared to the previous one:

$ benchstat original.txt multiline.txt
name                                                     old time/op  new time/op  delta
Scan/warmup_(disregard)_(standard_escapes)-8             294µs ±23%    623µs ± 8%  +111.59%  (p=0.008 n=5+5)
Scan/warmup_(disregard)_(custom_escapes)-8               294µs ±23%    617µs ± 1%  +109.60%  (p=0.008 n=5+5)
Scan/no_quotes_(standard_escapes)-8                      273µs ± 2%    625µs ± 2%  +129.08%  (p=0.008 n=5+5)
Scan/no_quotes_(custom_escapes)-8                        273µs ± 2%    633µs ± 1%  +132.08%  (p=0.008 n=5+5)
Scan/some_quotes_at_the_beginning_(standard_escapes)-8   281µs ± 6%    835µs ± 2%  +196.96%  (p=0.008 n=5+5)
Scan/some_quotes_at_the_beginning_(custom_escapes)-8     281µs ± 6%    850µs ± 2%  +202.22%  (p=0.008 n=5+5)
Scan/some_quotes_in_the_middle_(standard_escapes)-8      280µs ± 5%    856µs ± 2%  +205.66%  (p=0.008 n=5+5)
Scan/some_quotes_in_the_middle_(custom_escapes)-8        280µs ± 5%    879µs ± 2%  +214.06%  (p=0.008 n=5+5)
Scan/all_quotes_(standard_escapes)-8                     332µs ±10%   1106µs ± 1%  +232.77%  (p=0.008 n=5+5)
Scan/all_quotes_(custom_escapes)-8                       332µs ±10%   1129µs ± 2%  +239.91%  (p=0.008 n=5+5)
Scan/nothing_but_quotes_(standard_escapes)-8             303µs ± 5%    859µs ± 1%  +183.87%  (p=0.008 n=5+5)
Scan/nothing_but_quotes_(custom_escapes)-8               303µs ± 5%   1140µs ± 1%  +276.61%  (p=0.008 n=5+5)

This is a pretty significant slowdown -- we've reduced the speed by a factor of two or three.
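The committed parser itself isn't shown here; as a rough illustration of the quote-state tracking the commit describes, here is a minimal self-contained sketch (my own, simplified; Postgres edge cases around NULL markers and delimiters are ignored). It follows the rule that inside a quoted value the ESCAPE character makes an immediately following QUOTE or ESCAPE character literal, which also covers the common case where the two characters are the same:

package main

import "fmt"

// stillQuoted reports whether consuming one physical line leaves the
// parser inside a quoted CSV value. A batcher would call this per
// line and refuse to end a batch while it returns true.
func stillQuoted(line []byte, inQuoted bool, quote, escape byte) bool {
	for i := 0; i < len(line); i++ {
		c := line[i]
		if inQuoted && c == escape && i+1 < len(line) {
			if next := line[i+1]; next == escape || next == quote {
				i++ // escaped QUOTE/ESCAPE stays part of the value
				continue
			}
		}
		if c == quote {
			inQuoted = !inQuoted
		}
	}
	return inQuoted
}

func main() {
	// With the default QUOTE == ESCAPE == '"', a doubled quote is an
	// escaped quote and does not close the value: prints true.
	fmt.Println(stillQuoted([]byte(`1,failed,"--- ""`), false, '"', '"'))
	// This line closes the quoted value: prints false.
	fmt.Println(stillQuoted([]byte(`1,failed,"---"`), false, '"', '"'))
}

Assuming standard Go flag syntax, a file written with non-default characters could then presumably be loaded with something like timescaledb-parallel-copy --table test --file data.csv -quote '"' -escape '\' (the flag names are taken from the commit text above; the values are illustrative).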
Okay, as of the above commit, multiline record splitting should no longer happen. If you have the ability to test that out and confirm, that'd be fantastic! Please reopen if you find any issues.
Got an error when using timescaledb-parallel-copy. Later I will compile the app from source, try to find the reason, and prepare a test data set for reproduction.

The error may be related to the following issues: #19 #24 #31