Fixes to CSV encoding/line endings/dialect inference #432

mildbyte · 2021-04-07T15:07:41Z

* Autodetect the encoding using chardet
* Add CSV plugin configuration options for encoding, dialect (e.g. "excel") and the sample size used for inference
* Bump the sample size to 64KB to have a better chance of inferring the dialect for wider tables
* Autogenerate column names for unnamed columns
* Handle Mac-style and other newlines (universal newlines mode)
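
For illustration, the encoding and dialect inference described above could look roughly like the sketch below. This is not the code from this PR: `infer_csv_options` and `SAMPLE_SIZE` are hypothetical names, and the fallback to UTF-8 when chardet is missing or undecided is an assumption made to keep the example self-contained.

```python
import csv
import io

try:
    import chardet  # third-party; treated as optional in this sketch
except ImportError:
    chardet = None

SAMPLE_SIZE = 65536  # 64KB, the sample size mentioned in the description


def infer_csv_options(stream, sample_size=SAMPLE_SIZE):
    """Guess (encoding, dialect) from the first `sample_size` bytes of a binary stream."""
    sample = stream.read(sample_size)

    encoding = None
    if chardet is not None:
        # chardet.detect returns a dict; "encoding" may be None if it can't decide
        encoding = chardet.detect(sample)["encoding"]
    encoding = encoding or "utf-8"  # assumed fallback, not taken from the PR

    # The sample can end in the middle of a multi-byte character, so don't fail on it
    text = sample.decode(encoding, errors="replace")
    # Normalize Windows ("\r\n") and Mac-style ("\r") newlines before sniffing
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    dialect = csv.Sniffer().sniff(text)
    return encoding, dialect


encoding, dialect = infer_csv_options(io.BytesIO(b"a;b;c\r\n1;2;3\r\n4;5;6\r\n"))
print(encoding, repr(dialect.delimiter))
```

`csv.Sniffer().sniff` raises `csv.Error` when it cannot determine a delimiter, so a real plugin would likely want a fallback dialect or an explicit override for that case.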


mildbyte added 5 commits April 7, 2021 12:59

Fix pre-commit.

1baadc8

Clean up the CSV FDW and factor out the encoding/dialect inference into a separate module. Get the CSV plugin to also infer the file's encoding and get it to handle Windows line endings properly. Also make the sample size for inference customizable.

1f7b8e6

Fix tests and some types. Add generating names for unnamed columns (e.g. col_1) since PG doesn't like empty column names. Add an integration test for the end-to-end querying + import through FDW with an unnamed column.

67c98ba
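
A minimal sketch of generating names for unnamed columns, assuming the name is derived from the 1-based column position (the commit message only gives `col_1` as an example). `fill_column_names` is a hypothetical helper, not a function from this PR.

```python
def fill_column_names(header):
    """Replace empty or whitespace-only header cells with positional names."""
    return [
        name if name.strip() else f"col_{position}"
        for position, name in enumerate(header, start=1)
    ]


print(fill_column_names(["", "name", " "]))
```

This sketch does not guard against a generated name colliding with an existing column that is already called, say, `col_1`.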

Fixup paths (CSV resources moved)

375f1d3

Make sure we can handle universal newlines at inference time and add a test for Mac-style newlines.

924f46c
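
To show what universal newlines mode means for a Mac-style file, here is a small standard-library sketch. `read_rows` is a hypothetical helper and the plugin's actual implementation may differ.

```python
import csv
import io


def read_rows(raw, encoding="utf-8"):
    """Parse CSV bytes whose lines may end in "\\r", "\\n" or "\\r\\n"."""
    # newline=None turns on universal newlines: every line ending is translated to "\n"
    text = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding, newline=None)
    return list(csv.reader(text))


# Mac-style input: lines are terminated by a lone carriage return
print(read_rows(b"a,b\r1,2\r3,4\r"))
```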

mildbyte merged commit 09e0f56 into master Apr 7, 2021

mildbyte deleted the feature/csv-encoding-inference branch April 7, 2021 15:24

mildbyte added a commit that referenced this pull request Apr 7, 2021

Bump version: 0.2.11 → 0.2.12

9f64e86

* Fixes to the Snowflake data source (#421)
* Add automatic encoding, newline and dialect inference to the CSV data source (#432)
