
Multi-line records are not supported! panic: pq: unterminated CSV quoted field #50

Closed
binakot opened this issue Apr 29, 2021 · 6 comments


binakot commented Apr 29, 2021

Got an error while using timescaledb-parallel-copy.

$ timescaledb-parallel-copy -version
timescaledb-parallel-copy 0.3.0 (linux amd64)

$ timescaledb-parallel-copy --db-name postgres --table table --file data.csv
panic: pq: unterminated CSV quoted field

goroutine 21 [running]:
main.processBatches(0xc0000b6370, 0xc0000926c0)
        /go/src/github.com/timescale/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:262 +0x8ad
created by main.main
        /go/src/github.com/timescale/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:148 +0x1d2

Later I will compile the app from source, try to find the cause, and prepare a test data set for reproduction.
The error may be related to the following issues: #19 #24 #31


binakot commented Apr 29, 2021

First of all, timescaledb-parallel-copy reports a wrong count of inserted rows when the CSV has multi-line records.
Related to issue #19.

But it looks like the data is actually inserted correctly, despite the wrong count.

How to reproduce

Table to insert into:

CREATE TABLE test (
  id SERIAL PRIMARY KEY,
  state TEXT,
  time TIMESTAMPTZ,
  date TEXT
)

Content of test file data.csv:

1,failed,2019-07-29 07:43:14.197455,"---
:image:
  :name: ubuntu:16.04
:before_script:
- echo ""Install sofrware""
- apt-get update && apt-get install -y sudo wget
:script:
- errcount=0
"
2,failed,2019-08-29 08:38:31.747244,"---
:image:
  :name: ubuntu:16.04
:before_script:
- echo ""Install sofrware""
- apt-get update && apt-get install -y sudo wget
:script:
- echo ""Tests""
- ""./tests.sh""
"
3,failed,2019-09-29 08:52:44.77306,"---
:image:
  :name: docker:stable
:services:
- :name: docker:stable-dind
:before_script:
- docker info
:script:
- echo ""Tests""
- ""./tests.sh""
"
4,failed,2019-10-29 09:31:14.098283,"---
:image:
  :name: ubuntu:16.04
:services:
- :name: docker:stable-dind
:before_script:
- docker info
:script:
- echo ""Tests""
- ""./tests.sh""
"
5,failed,2019-11-29 09:55:21.588764,"---
:image:
  :name: docker:stable
:services:
- :name: docker:stable-dind
:before_script:
- docker info
:script:
- echo ""Tests""
- ""./tests.sh""
"
$ timescaledb-parallel-copy --table test --file data.csv
COPY 52

The tool reports 52 inserted rows. That is not the number of records in the CSV; it is the number of lines in the file.

But the table itself looks correct:

postgres=# select * from test limit 1;
 id | state  |             time              |                       date                       
----+--------+-------------------------------+--------------------------------------------------
  1 | failed | 2019-07-29 07:43:14.197455+00 | ---                                             
    |        |                               | :image:                                         
    |        |                               |   :name: ubuntu:16.04                           
    |        |                               | :before_script:                                 
    |        |                               | - echo "Install sofrware"                       
    |        |                               | - apt-get update && apt-get install -y sudo wget
    |        |                               | :script:                                        
    |        |                               | - errcount=0                                    
    |        |                               | 
  2 | failed | 2019-08-29 08:38:31.747244+00 | ---                                             
    |        |                               | :image:                                         
    |        |                               |   :name: ubuntu:16.04                           
    |        |                               | :before_script:                                 
    |        |                               | - echo "Install sofrware"                       
    |        |                               | - apt-get update && apt-get install -y sudo wget
    |        |                               | :script:                                        
    |        |                               | - echo "Tests"                                  
    |        |                               | - "./tests.sh"                                  
    |        |                               | 
  3 | failed | 2019-09-29 08:52:44.77306+00  | ---                                             
    |        |                               | :image:                                         
    |        |                               |   :name: docker:stable                          
    |        |                               | :services:                                      
    |        |                               | - :name: docker:stable-dind                     
    |        |                               | :before_script:                                 
    |        |                               | - docker info                                   
    |        |                               | :script:                                        
    |        |                               | - echo "Tests"                                  
    |        |                               | - "./tests.sh"                                  
    |        |                               | 
  4 | failed | 2019-10-29 09:31:14.098283+00 | ---                        
    |        |                               | :image:                    
    |        |                               |   :name: ubuntu:16.04      
    |        |                               | :services:                 
    |        |                               | - :name: docker:stable-dind
    |        |                               | :before_script:            
    |        |                               | - docker info              
    |        |                               | :script:                   
    |        |                               | - echo "Tests"             
    |        |                               | - "./tests.sh"             
    |        |                               | 
  5 | failed | 2019-11-29 09:55:21.588764+00 | ---                        
    |        |                               | :image:                    
    |        |                               |   :name: docker:stable     
    |        |                               | :services:                 
    |        |                               | - :name: docker:stable-dind
    |        |                               | :before_script:            
    |        |                               | - docker info              
    |        |                               | :script:                   
    |        |                               | - echo "Tests"             
    |        |                               | - "./tests.sh"             
    |        |                               | 

(5 rows)


binakot commented Apr 30, 2021

I seem to have found the cause of the error. It's the same multi-line record problem.

The --batch-size option (5000 by default) counts lines, not records, without accounting for the fact that a single record can span several lines.

How to reproduce

Table to insert into:

CREATE TABLE test (
  id SERIAL PRIMARY KEY,
  state TEXT,
  time TIMESTAMPTZ,
  date TEXT
)

Content of test file data.csv:

1,failed,2019-07-29 07:43:14.197455,"---
:image:
  :name: ubuntu:16.04
:before_script:
- echo ""Install sofrware""
- apt-get update && apt-get install -y sudo wget
:script:
- errcount=0
"

Single worker

--batch-size 1 --workers 1

$ timescaledb-parallel-copy --table test --file data.csv --batch-size 1 --workers 1

>>> batch=&{[1,failed,2019-07-29 07:43:14.197455,"---]}
>>> copyCmd=COPY "public"."test" FROM STDIN WITH DELIMITER ',' CSV
panic: pq: unterminated CSV quoted field

goroutine 19 [running]:
main.processBatches(0xc0000c4350, 0xc0000ca660)
        /home/binakot/repos/github/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:263 +0x978
created by main.main
        /home/binakot/repos/github/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:148 +0x1bb

--batch-size 2 --workers 1

$ timescaledb-parallel-copy --table test --file data.csv --batch-size 2 --workers 1

>>> batch=&{[1,failed,2019-07-29 07:43:14.197455,"--- :image:]}
>>> copyCmd=COPY "public"."test" FROM STDIN WITH DELIMITER ',' CSV
panic: pq: unterminated CSV quoted field

goroutine 6 [running]:
main.processBatches(0xc00001a460, 0xc00002a720)
        /home/binakot/repos/github/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:263 +0x978
created by main.main
        /home/binakot/repos/github/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:148 +0x1bb

--batch-size 3 --workers 1

$ timescaledb-parallel-copy --table test --file data.csv --batch-size 3 --workers 1

>>> batch=&{[1,failed,2019-07-29 07:43:14.197455,"--- :image:   :name: ubuntu:16.04]}
>>> copyCmd=COPY "public"."test" FROM STDIN WITH DELIMITER ',' CSV
panic: pq: unterminated CSV quoted field

goroutine 6 [running]:
main.processBatches(0xc00001a460, 0xc00002a720)
        /home/binakot/repos/github/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:263 +0x978
created by main.main
        /home/binakot/repos/github/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:148 +0x1bb

--batch-size 100 --workers 1

$ timescaledb-parallel-copy --table test --file data.csv --batch-size 100 --workers 1

COPY 9

Two workers

--batch-size 1 --workers 2

$ timescaledb-parallel-copy --table test --file data.csv --batch-size 1 --workers 2

>>> batch=&{[:image:]}
>>> copyCmd=COPY "public"."test" FROM STDIN WITH DELIMITER ',' CSV
panic: pq: invalid input syntax for integer: ":image:"

goroutine 19 [running]:
main.processBatches(0xc0000c4350, 0xc0000ca660)
        /home/binakot/repos/github/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:263 +0x978
created by main.main
        /home/binakot/repos/github/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:148 +0x1bb

Debug output was produced by adding fmt.Printf(">>> batch=%s\r\n>>> copyCmd=%s\r\n", batch, copyCmd) here: https://github.com/timescale/timescaledb-parallel-copy/blob/master/cmd/timescaledb-parallel-copy/main.go#L262

The current implementation depends entirely on reading the file line by line. We need to think about how to read the data by records rather than by lines, because any worker can end up starting in the middle of a multi-line record.

@binakot binakot changed the title panic: pq: unterminated CSV quoted field Multi-line records are not supported! panic: pq: unterminated CSV quoted field Apr 30, 2021

binakot commented Apr 30, 2021

@RobAtticus Hello!

Any ideas on how to fix this?
Maybe there is some workaround for multi-line records?
Or does it require additional implementation work?

@ryanmjacobs

Also had this issue. Had to resort to psql -d new_db -c "\COPY conditions FROM old_db.csv CSV"

jchampio added a commit to jchampio/timescaledb-parallel-copy that referenced this issue Jun 9, 2022
As mentioned in issues timescale#19 and timescale#50, our COPY FROM implementation assumes
that one line of CSV corresponds to one row, but that's not true -- a
quoted string may be spread over multiple lines. Fix our reported row
count by looking at the result of the COPY operation.

This does NOT solve the more general issue of multiline rows, which is
that if the batch boundary comes down in the middle of a row, we'll
fail. But it is a step towards more correct behavior.

mindrunner commented Jul 19, 2022

My earlier comment was wrong, so I deleted it.

Using psql is the proper workaround, as mentioned by @ryanmjacobs.

jchampio added a commit to jchampio/timescaledb-parallel-copy that referenced this issue Aug 11, 2022
As mentioned in 8da678e, the batching algorithm assumes that one row
consists of exactly one line. If a row contains multi-line quoted column
values, and we happen to split that row across multiple batches, we'll
fail the COPY.

Add a naive CSV parser to Scan() which prevents this accidental
splitting. This parser searches the incoming CSV for the QUOTE and
ESCAPE characters (which can be customized by the new -quote and -escape
flags to the tool), and prevents Scan() from sending out a new batch if
it detects that we're still inside a quoted value.

Because it's imperative that our parser matches the Postgres CSV parser
exactly (and the rules for parsing are not as intuitive as the rules for
production), a wide variety of test cases have been added. Running the
suite with TEST_CONNINFO will additionally run these test cases against
a live Postgres server, as a sanity check to ensure that each case was
coded correctly.

Fixes timescale#19 and timescale#50.

This implementation adds a significant CPU hit, in the form of naive
iteration over every byte in the input. I duplicated each benchmark, to
measure the case where QUOTE and ESCAPE are the same and the case where
they are different. (The code paths diverge enough that it may be useful
to optimize them separately.)

For the following machine (with CPU scaling disabled to the best of my
knowledge):

    goos: linux
    goarch: amd64
    pkg: github.com/timescale/timescaledb-parallel-copy/internal/batch
    cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz

benchstat reports the following changes as of this commit, when compared
to the previous one:

    $ benchstat original.txt multiline.txt
    name                                                    old time/op  new time/op  delta
    Scan/warmup_(disregard)_(standard_escapes)-8             294µs ±23%   623µs ± 8%  +111.59%  (p=0.008 n=5+5)
    Scan/warmup_(disregard)_(custom_escapes)-8               294µs ±23%   617µs ± 1%  +109.60%  (p=0.008 n=5+5)
    Scan/no_quotes_(standard_escapes)-8                      273µs ± 2%   625µs ± 2%  +129.08%  (p=0.008 n=5+5)
    Scan/no_quotes_(custom_escapes)-8                        273µs ± 2%   633µs ± 1%  +132.08%  (p=0.008 n=5+5)
    Scan/some_quotes_at_the_beginning_(standard_escapes)-8   281µs ± 6%   835µs ± 2%  +196.96%  (p=0.008 n=5+5)
    Scan/some_quotes_at_the_beginning_(custom_escapes)-8     281µs ± 6%   850µs ± 2%  +202.22%  (p=0.008 n=5+5)
    Scan/some_quotes_in_the_middle_(standard_escapes)-8      280µs ± 5%   856µs ± 2%  +205.66%  (p=0.008 n=5+5)
    Scan/some_quotes_in_the_middle_(custom_escapes)-8        280µs ± 5%   879µs ± 2%  +214.06%  (p=0.008 n=5+5)
    Scan/all_quotes_(standard_escapes)-8                     332µs ±10%  1106µs ± 1%  +232.77%  (p=0.008 n=5+5)
    Scan/all_quotes_(custom_escapes)-8                       332µs ±10%  1129µs ± 2%  +239.91%  (p=0.008 n=5+5)
    Scan/nothing_but_quotes_(standard_escapes)-8             303µs ± 5%   859µs ± 1%  +183.87%  (p=0.008 n=5+5)
    Scan/nothing_but_quotes_(custom_escapes)-8               303µs ± 5%  1140µs ± 1%  +276.61%  (p=0.008 n=5+5)

This is a pretty significant slowdown -- we've reduced the speed by a
factor of two or three.
@jchampio

Okay, as of the above commit, multi-line record splitting should no longer happen. If you have the ability to test that out and confirm, that'd be fantastic! Please reopen if you find any issues.
