
Multi-line records are not supported! panic: pq: unterminated CSV quoted field #50

Closed
binakot opened this issue Apr 29, 2021 · 6 comments


binakot commented Apr 29, 2021

Got an error while using timescaledb-parallel-copy.

$ timescaledb-parallel-copy -version
timescaledb-parallel-copy 0.3.0 (linux amd64)

$ timescaledb-parallel-copy --db-name postgres --table table --file data.csv
panic: pq: unterminated CSV quoted field

goroutine 21 [running]:
main.processBatches(0xc0000b6370, 0xc0000926c0)
        /go/src/github.com/timescale/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:262 +0x8ad
created by main.main
        /go/src/github.com/timescale/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:148 +0x1d2

Later I will compile the app from source, try to find the cause, and prepare a test data set for reproduction.
The error may be related to the following issues: #19 #24 #31


binakot commented Apr 29, 2021

First of all, timescaledb-parallel-copy reports a wrong count of inserted rows when the CSV has multi-line records.
Related to issue #19.

But it looks like the data is actually inserted correctly, despite the wrong count.

How to reproduce

Table to insert into:

CREATE TABLE test (
  id SERIAL PRIMARY KEY,
  state TEXT,
  time TIMESTAMPTZ,
  date TEXT
)

Content of test file data.csv:

1,failed,2019-07-29 07:43:14.197455,"---
:image:
  :name: ubuntu:16.04
:before_script:
- echo ""Install sofrware""
- apt-get update && apt-get install -y sudo wget
:script:
- errcount=0
"
2,failed,2019-08-29 08:38:31.747244,"---
:image:
  :name: ubuntu:16.04
:before_script:
- echo ""Install sofrware""
- apt-get update && apt-get install -y sudo wget
:script:
- echo ""Tests""
- ""./tests.sh""
"
3,failed,2019-09-29 08:52:44.77306,"---
:image:
  :name: docker:stable
:services:
- :name: docker:stable-dind
:before_script:
- docker info
:script:
- echo ""Tests""
- ""./tests.sh""
"
4,failed,2019-10-29 09:31:14.098283,"---
:image:
  :name: ubuntu:16.04
:services:
- :name: docker:stable-dind
:before_script:
- docker info
:script:
- echo ""Tests""
- ""./tests.sh""
"
5,failed,2019-11-29 09:55:21.588764,"---
:image:
  :name: docker:stable
:services:
- :name: docker:stable-dind
:before_script:
- docker info
:script:
- echo ""Tests""
- ""./tests.sh""
"
$ timescaledb-parallel-copy --table test --file data.csv
COPY 52

The tool reports 52 inserted rows. That is not the number of records in the CSV; it is the number of lines in the file.

But the table itself looks correct:

postgres=# select * from test limit 1;
 id | state  |             time              |                       date                       
----+--------+-------------------------------+--------------------------------------------------
  1 | failed | 2019-07-29 07:43:14.197455+00 | ---                                             
    |        |                               | :image:                                         
    |        |                               |   :name: ubuntu:16.04                           
    |        |                               | :before_script:                                 
    |        |                               | - echo "Install sofrware"                       
    |        |                               | - apt-get update && apt-get install -y sudo wget
    |        |                               | :script:                                        
    |        |                               | - errcount=0                                    
    |        |                               | 
  2 | failed | 2019-08-29 08:38:31.747244+00 | ---                                             
    |        |                               | :image:                                         
    |        |                               |   :name: ubuntu:16.04                           
    |        |                               | :before_script:                                 
    |        |                               | - echo "Install sofrware"                       
    |        |                               | - apt-get update && apt-get install -y sudo wget
    |        |                               | :script:                                        
    |        |                               | - echo "Tests"                                  
    |        |                               | - "./tests.sh"                                  
    |        |                               | 
  3 | failed | 2019-09-29 08:52:44.77306+00  | ---                                             
    |        |                               | :image:                                         
    |        |                               |   :name: docker:stable                          
    |        |                               | :services:                                      
    |        |                               | - :name: docker:stable-dind                     
    |        |                               | :before_script:                                 
    |        |                               | - docker info                                   
    |        |                               | :script:                                        
    |        |                               | - echo "Tests"                                  
    |        |                               | - "./tests.sh"                                  
    |        |                               | 
  4 | failed | 2019-10-29 09:31:14.098283+00 | ---                        
    |        |                               | :image:                    
    |        |                               |   :name: ubuntu:16.04      
    |        |                               | :services:                 
    |        |                               | - :name: docker:stable-dind
    |        |                               | :before_script:            
    |        |                               | - docker info              
    |        |                               | :script:                   
    |        |                               | - echo "Tests"             
    |        |                               | - "./tests.sh"             
    |        |                               | 
  5 | failed | 2019-11-29 09:55:21.588764+00 | ---                        
    |        |                               | :image:                    
    |        |                               |   :name: docker:stable     
    |        |                               | :services:                 
    |        |                               | - :name: docker:stable-dind
    |        |                               | :before_script:            
    |        |                               | - docker info              
    |        |                               | :script:                   
    |        |                               | - echo "Tests"             
    |        |                               | - "./tests.sh"             
    |        |                               | 

(5 rows)


binakot commented Apr 30, 2021

I seem to have found the cause of the error. It's the same multi-line record problem.

The --batch-size option (5000 by default) counts lines, not records, without accounting for the fact that a single record can span several lines.

How to reproduce

Table to insert into:

CREATE TABLE test (
  id SERIAL PRIMARY KEY,
  state TEXT,
  time TIMESTAMPTZ,
  date TEXT
)

Content of test file data.csv:

1,failed,2019-07-29 07:43:14.197455,"---
:image:
  :name: ubuntu:16.04
:before_script:
- echo ""Install sofrware""
- apt-get update && apt-get install -y sudo wget
:script:
- errcount=0
"

Single worker

--batch-size 1 --workers 1

$ timescaledb-parallel-copy --table test --file data.csv --batch-size 1 --workers 1

>>> batch=&{[1,failed,2019-07-29 07:43:14.197455,"---]}
>>> copyCmd=COPY "public"."test" FROM STDIN WITH DELIMITER ',' CSV
panic: pq: unterminated CSV quoted field

goroutine 19 [running]:
main.processBatches(0xc0000c4350, 0xc0000ca660)
        /home/binakot/repos/github/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:263 +0x978
created by main.main
        /home/binakot/repos/github/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:148 +0x1bb

--batch-size 2 --workers 1

$ timescaledb-parallel-copy --table test --file data.csv --batch-size 2 --workers 1

>>> batch=&{[1,failed,2019-07-29 07:43:14.197455,"--- :image:]}
>>> copyCmd=COPY "public"."test" FROM STDIN WITH DELIMITER ',' CSV
panic: pq: unterminated CSV quoted field

goroutine 6 [running]:
main.processBatches(0xc00001a460, 0xc00002a720)
        /home/binakot/repos/github/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:263 +0x978
created by main.main
        /home/binakot/repos/github/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:148 +0x1bb

--batch-size 3 --workers 1

$ timescaledb-parallel-copy --table test --file data.csv --batch-size 3 --workers 1

>>> batch=&{[1,failed,2019-07-29 07:43:14.197455,"--- :image:   :name: ubuntu:16.04]}
>>> copyCmd=COPY "public"."test" FROM STDIN WITH DELIMITER ',' CSV
panic: pq: unterminated CSV quoted field

goroutine 6 [running]:
main.processBatches(0xc00001a460, 0xc00002a720)
        /home/binakot/repos/github/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:263 +0x978
created by main.main
        /home/binakot/repos/github/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:148 +0x1bb

--batch-size 100 --workers 1

$ timescaledb-parallel-copy --table test --file data.csv --batch-size 100 --workers 1

COPY 9

Two workers

--batch-size 1 --workers 2

$ timescaledb-parallel-copy --table test --file data.csv --batch-size 1 --workers 2

>>> batch=&{[:image:]}
>>> copyCmd=COPY "public"."test" FROM STDIN WITH DELIMITER ',' CSV
panic: pq: invalid input syntax for integer: ":image:"

goroutine 19 [running]:
main.processBatches(0xc0000c4350, 0xc0000ca660)
        /home/binakot/repos/github/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:263 +0x978
created by main.main
        /home/binakot/repos/github/timescaledb-parallel-copy/cmd/timescaledb-parallel-copy/main.go:148 +0x1bb

Debug output was produced by adding fmt.Printf(">>> batch=%s\r\n>>> copyCmd=%s\r\n", batch, copyCmd) here: https://github.com/timescale/timescaledb-parallel-copy/blob/master/cmd/timescaledb-parallel-copy/main.go#L262

The current implementation depends entirely on reading the file line by line. We need to think about how to read the data by records rather than by lines, because any worker can end up starting in the middle of a multi-line record.

@binakot binakot changed the title panic: pq: unterminated CSV quoted field Multi-line records are not supported! panic: pq: unterminated CSV quoted field Apr 30, 2021

binakot commented Apr 30, 2021

@RobAtticus Hello!

Any ideas on how to fix this?
Maybe there is some workaround for multi-line records?
Or does it require additional implementation work?

@ryanmjacobs

Also had this issue. Had to resort to psql -d new_db -c "\COPY conditions FROM old_db.csv CSV"

jchampio added a commit to jchampio/timescaledb-parallel-copy that referenced this issue Jun 9, 2022
As mentioned in issues timescale#19 and timescale#50, our COPY FROM implementation assumes
that one line of CSV corresponds to one row, but that's not true -- a
quoted string may be spread over multiple lines. Fix our reported row
count by looking at the result of the COPY operation.

This does NOT solve the more general issue of multiline rows, which is
that if the batch boundary comes down in the middle of a row, we'll
fail. But it is a step towards more correct behavior.

mindrunner commented Jul 19, 2022

My earlier comment was wrong, so I deleted it.

Using psql is the proper workaround, as mentioned by @ryanmjacobs.

jchampio added a commit to jchampio/timescaledb-parallel-copy that referenced this issue Aug 11, 2022
As mentioned in 8da678e, the batching algorithm assumes that one row
consists of exactly one line. If a row contains multi-line quoted column
values, and we happen to split that row across multiple batches, we'll
fail the COPY.

Add a naive CSV parser to Scan() which prevents this accidental
splitting. This parser searches the incoming CSV for the QUOTE and
ESCAPE characters (which can be customized by the new -quote and -escape
flags to the tool), and prevents Scan() from sending out a new batch if
it detects that we're still inside a quoted value.

Because it's imperative that our parser matches the Postgres CSV parser
exactly (and the rules for parsing are not as intuitive as the rules for
production), a wide variety of test cases have been added. Running the
suite with TEST_CONNINFO will additionally run these test cases against
a live Postgres server, as a sanity check to ensure that each case was
coded correctly.

Fixes timescale#19 and timescale#50.

This implementation adds a significant CPU hit, in the form of naive
iteration over every byte in the input. I duplicated each benchmark, to
measure the case where QUOTE and ESCAPE are the same and the case where
they are different. (The code paths diverge enough that it may be useful
to optimize them separately.)

For the following machine (with CPU scaling disabled to the best of my
knowledge):

    goos: linux
    goarch: amd64
    pkg: github.com/timescale/timescaledb-parallel-copy/internal/batch
    cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz

benchstat reports the following changes as of this commit, when compared
to the previous one:

    $ benchstat original.txt multiline.txt
    name                                                    old time/op  new time/op  delta
    Scan/warmup_(disregard)_(standard_escapes)-8             294µs ±23%   623µs ± 8%  +111.59%  (p=0.008 n=5+5)
    Scan/warmup_(disregard)_(custom_escapes)-8               294µs ±23%   617µs ± 1%  +109.60%  (p=0.008 n=5+5)
    Scan/no_quotes_(standard_escapes)-8                      273µs ± 2%   625µs ± 2%  +129.08%  (p=0.008 n=5+5)
    Scan/no_quotes_(custom_escapes)-8                        273µs ± 2%   633µs ± 1%  +132.08%  (p=0.008 n=5+5)
    Scan/some_quotes_at_the_beginning_(standard_escapes)-8   281µs ± 6%   835µs ± 2%  +196.96%  (p=0.008 n=5+5)
    Scan/some_quotes_at_the_beginning_(custom_escapes)-8     281µs ± 6%   850µs ± 2%  +202.22%  (p=0.008 n=5+5)
    Scan/some_quotes_in_the_middle_(standard_escapes)-8      280µs ± 5%   856µs ± 2%  +205.66%  (p=0.008 n=5+5)
    Scan/some_quotes_in_the_middle_(custom_escapes)-8        280µs ± 5%   879µs ± 2%  +214.06%  (p=0.008 n=5+5)
    Scan/all_quotes_(standard_escapes)-8                     332µs ±10%  1106µs ± 1%  +232.77%  (p=0.008 n=5+5)
    Scan/all_quotes_(custom_escapes)-8                       332µs ±10%  1129µs ± 2%  +239.91%  (p=0.008 n=5+5)
    Scan/nothing_but_quotes_(standard_escapes)-8             303µs ± 5%   859µs ± 1%  +183.87%  (p=0.008 n=5+5)
    Scan/nothing_but_quotes_(custom_escapes)-8               303µs ± 5%  1140µs ± 1%  +276.61%  (p=0.008 n=5+5)

This is a pretty significant slowdown -- we've reduced the speed by a
factor of two or three.
@jchampio

Okay, as of the above commit, multi-line record splitting should no longer happen. If you have the ability to test that out and confirm, that'd be fantastic! Please reopen if you find any issues.
