Panic when reading from partitioned datasets with columns that have `'` in them #9269

alamb · 2024-02-19T07:55:14Z

Describe the bug

There is a bug when reading from partitioned tables that have commas in their names

Here is the test
https://github.com/apache/arrow-datafusion/blob/b2a04519da97c2ff81789ef41dd652870794a73a/datafusion/sqllogictest/test_files/copy.slt#L109

To Reproduce

Run this script

-- create a table with quotes in the column names
create table test ("'test'" varchar, "'test2'" varchar, "'test3'" varchar);
insert into test VALUES ('a', 'x', 'aa'), ('b','y', 'bb'), ('c', 'z', 'cc');
copy test to '/tmp/escape_quote' (format csv, partition_by '''test2'',''test3''');

-- read back from the table
CREATE EXTERNAL TABLE validate_partitioned_escape_quote STORED AS CSV
LOCATION '/tmp/escape_quote/' PARTITIONED BY ("'test2'", "'test3'");

-- This panics
select * from validate_partitioned_escape_quote;

Here is an example:

❯ -- create a table with quotes in the column names
create table test ("'test'" varchar, "'test2'" varchar, "'test3'" varchar);
insert into test VALUES ('a', 'x', 'aa'), ('b','y', 'bb'), ('c', 'z', 'cc');
copy test to '/tmp/escape_quote' (format csv, partition_by '''test2'',''test3''');

0 rows in set. Query took 0.008 seconds.

+-------+
| count |
+-------+
| 3     |
+-------+
1 row in set. Query took 0.009 seconds.

+-------+
| count |
+-------+
| 3     |
+-------+
1 row in set. Query took 0.029 seconds.

❯ -- read back from the table
CREATE EXTERNAL TABLE validate_partitioned_escape_quote STORED AS CSV
LOCATION '/tmp/escape_quote/' PARTITIONED BY ("'test2'", "'test3'");

0 rows in set. Query took 0.004 seconds.

❯ -- This panics
select * from validate_partitioned_escape_quote;

thread 'thread 'tokio-runtime-workertokio-runtime-worker' panicked at ' panicked at /Users/andrewlamb/Software/arrow-datafusion/datafusion/core/src/datasource/physical_plan/file_scan_config.rs/Users/andrewlamb/Software/arrow-datafusion/datafusion/core/src/datasource/physical_plan/file_scan_config.rs::248:thread '54248:
:tokio-runtime-workerindex out of bounds: the len is 0 but the index is 054' panicked at
/Users/andrewlamb/Software/arrow-datafusion/datafusion/core/src/datasource/physical_plan/file_scan_config.rs:248:
:index out of bounds: the len is 0 but the index is 054
:
index out of bounds: the len is 0 but the index is 0
stack backtrace:
   0: rust_begin_unwind
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:72:14
   2: core::panicking::panic_bounds_check
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:208:5
   3: datafusion::datasource::physical_plan::file_scan_config::PartitionColumnProjector::project
   4: <datafusion::datasource::physical_plan::file_stream::FileStream<F> as futures_core::stream::Stream>::poll_next
   5: datafusion_physical_plan::stream::RecordBatchReceiverStreamBuilder::run_input::{{closure}}
   6: tokio::runtime::task::core::Core<T,S>::poll
   7: tokio::runtime::task::harness::Harness<T,S>::poll
   8: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
   9: tokio::runtime::scheduler::multi_thread::worker::Context::run
  10: tokio::runtime::context::runtime::enter_runtime
  11: tokio::runtime::scheduler::multi_thread::worker::run
  12: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
  13: tokio::runtime::task::core::Core<T,S>::poll
  14: tokio::runtime::task::harness::Harness<T,S>::poll
  15: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Expected behavior

Note the data is written correctly

andrewlamb@Andrews-MacBook-Pro:~/Software/influxdb_iox$ find /tmp/escape_quote
/tmp/escape_quote
/tmp/escape_quote/'test2'=x
/tmp/escape_quote/'test2'=x/'test3'=aa
/tmp/escape_quote/'test2'=x/'test3'=aa/3zMw255TXFQxId14.csv
/tmp/escape_quote/'test2'=y
/tmp/escape_quote/'test2'=y/'test3'=bb
/tmp/escape_quote/'test2'=y/'test3'=bb/3zMw255TXFQxId14.csv
/tmp/escape_quote/'test2'=z
/tmp/escape_quote/'test2'=z/'test3'=cc
/tmp/escape_quote/'test2'=z/'test3'=cc/3zMw255TXFQxId14.csv

andrewlamb@Andrews-MacBook-Pro:~/Software/influxdb_iox$ cat /tmp/escape_quote/\'test2\'\=x/\'test3\'\=aa/3zMw255TXFQxId14.csv
'test'
a

Additional context

@devinjdangelo found this in #9240

The text was updated successfully, but these errors were encountered:

Add reference to #9269

devinjdangelo · 2024-02-19T14:14:45Z

I tried searching documentation of various engines to see if ' is allowed in partition columns. I didn't find anything concrete.

However, I tried the equivalent example in DuckDB and it does work (see below). It would probably be best to tighten up our parsing of non standard column names. It may be easier to make this robust by extending sqlparser-rs upstream. The recent discussion on the mailing list is relevant https://lists.apache.org/thread/q80j49poyg99x2c01900312qz7ps9wgp

devinjd@devinjd$ ./duckdb 
v0.10.0 20b1486d11
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D create table test ("'test'" varchar, "'test2'" varchar, "'test3'" varchar);
D insert into test VALUES ('a', 'x', 'aa'), ('b','y', 'bb'), ('c', 'z', 'cc');
D select * from test;
┌─────────┬─────────┬─────────┐
│ 'test'  │ 'test2' │ 'test3' │
│ varchar │ varchar │ varchar │
├─────────┼─────────┼─────────┤
│ a       │ x       │ aa      │
│ b       │ y       │ bb      │
│ c       │ z       │ cc      │
└─────────┴─────────┴─────────┘
D copy test to '/tmp/escape_quote' (format csv, partition_by ('''test2''','''test3'''));
D select * from read_csv('/tmp/escape_quote/*/*/*.csv', hive_partitioning=1, header=true);
┌─────────┬─────────┬─────────┐
│ 'test'  │ 'test2' │ 'test3' │
│ varchar │ varchar │ varchar │
├─────────┼─────────┼─────────┤
│ a       │ x       │ aa      │
│ b       │ y       │ bb      │
│ c       │ z       │ cc      │
└─────────┴─────────┴─────────┘

Add reference to #9269

Jefffrey · 2024-04-05T09:48:45Z

Another case:

DataFusion CLI v37.0.0
❯ create external table test123(a string, ```a=b``` string) stored as parquet location '/tmp/test123/' partitioned by (`a=b`);
0 row(s) fetched.
Elapsed 0.002 seconds.

❯ insert into test123 select 'a', 'b';
+-------+
| count |
+-------+
| 1     |
+-------+
1 row(s) fetched.
Elapsed 0.007 seconds.

❯ select * from test123;
thread 'main' panicked at /home/jeffrey/Code/arrow-datafusion/datafusion/core/src/datasource/physical_plan/file_scan_config.rs:261:54:
index out of bounds: the len is 0 but the index is 0
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
datafusion-cli$

Files exist on disk:

datafusion-cli$ ll /tmp/test123/
total 0
drwxr-xr-x   2 jeffrey jeffrey   60 Apr  5 20:43 '%60a=b%60=b'/
drwxrwxrwt 119 root    root    2.6K Apr  5 20:43  ../
drwxr-xr-x   3 jeffrey jeffrey   60 Apr  5 20:43  ./
datafusion-cli$ ll /tmp/test123/\%60a=b\%60=b/
total 4.0K
-rw-r--r-- 1 jeffrey jeffrey 282 Apr  5 20:43 33T7kTcVyecaVk07.parquet
drwxr-xr-x 3 jeffrey jeffrey  60 Apr  5 20:43 ../
drwxr-xr-x 2 jeffrey jeffrey  60 Apr  5 20:43 ./
datafusion-cli$

Note I had to get cheeky with the column and partition names, as just this didn't work:

DataFusion CLI v37.0.0
❯ create external table test123(a string, `a=b` string) stored as parquet location '/tmp/test123/' partitioned by (`a=b`);
Arrow error: Schema error: Unable to get field named "`a=b`". Valid fields: ["a", "a=b"]

alamb added the bug Something isn't working label Feb 19, 2024

alamb mentioned this issue Feb 19, 2024

Support Copy To Partitioned Files #9240

Merged

alamb added a commit that referenced this issue Feb 19, 2024

Add ticket reference to test

681d9c2

Add reference to #9269

alamb mentioned this issue Feb 19, 2024

Minor: Add ticket reference to test #9270

Merged

andygrove pushed a commit that referenced this issue Feb 21, 2024

Add ticket reference to test (#9270)

202f285

Add reference to #9269

Jefffrey mentioned this issue Apr 5, 2024

Validate partitions columns in CREATE EXTERNAL TABLE if table already exists. #9912

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Panic when reading from partitioned datasets with columns that have `'` in them #9269

Panic when reading from partitioned datasets with columns that have `'` in them #9269

alamb commented Feb 19, 2024

devinjdangelo commented Feb 19, 2024 •

edited

Loading

Jefffrey commented Apr 5, 2024

Panic when reading from partitioned datasets with columns that have ' in them #9269

Panic when reading from partitioned datasets with columns that have ' in them #9269

Comments

alamb commented Feb 19, 2024

Describe the bug

To Reproduce

Expected behavior

Additional context

devinjdangelo commented Feb 19, 2024 • edited Loading

Jefffrey commented Apr 5, 2024

Panic when reading from partitioned datasets with columns that have `'` in them #9269

Panic when reading from partitioned datasets with columns that have `'` in them #9269

devinjdangelo commented Feb 19, 2024 •

edited

Loading