Panic when reading from partitioned datasets with columns that have ' in them
#9269
Comments
I tried searching documentation of various engines to see if ' is allowed in partition columns. I didn't find anything concrete. However, I tried the equivalent example in DuckDB and it does work (see below).
It would probably be best to tighten up our parsing of non-standard column names. It may be easier to make this robust by extending sqlparser-rs upstream. The recent discussion on the mailing list is relevant: https://lists.apache.org/thread/q80j49poyg99x2c01900312qz7ps9wgp
devinjd@devinjd$ ./duckdb
v0.10.0 20b1486d11
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D create table test ("'test'" varchar, "'test2'" varchar, "'test3'" varchar);
D insert into test VALUES ('a', 'x', 'aa'), ('b','y', 'bb'), ('c', 'z', 'cc');
D select * from test;
┌─────────┬─────────┬─────────┐
│ 'test' │ 'test2' │ 'test3' │
│ varchar │ varchar │ varchar │
├─────────┼─────────┼─────────┤
│ a │ x │ aa │
│ b │ y │ bb │
│ c │ z │ cc │
└─────────┴─────────┴─────────┘
D copy test to '/tmp/escape_quote' (format csv, partition_by ('''test2''','''test3'''));
D select * from read_csv('/tmp/escape_quote/*/*/*.csv', hive_partitioning=1, header=true);
┌─────────┬─────────┬─────────┐
│ 'test' │ 'test2' │ 'test3' │
│ varchar │ varchar │ varchar │
├─────────┼─────────┼─────────┤
│ a │ x │ aa │
│ b │ y │ bb │
│ c │ z │ cc │
└─────────┴─────────┴─────────┘
Another case:
DataFusion CLI v37.0.0
❯ create external table test123(a string, ```a=b``` string) stored as parquet location '/tmp/test123/' partitioned by (`a=b`);
0 row(s) fetched.
Elapsed 0.002 seconds.
❯ insert into test123 select 'a', 'b';
+-------+
| count |
+-------+
| 1 |
+-------+
1 row(s) fetched.
Elapsed 0.007 seconds.
❯ select * from test123;
thread 'main' panicked at /home/jeffrey/Code/arrow-datafusion/datafusion/core/src/datasource/physical_plan/file_scan_config.rs:261:54:
index out of bounds: the len is 0 but the index is 0
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
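The panic message (`the len is 0 but the index is 0`) is what Rust reports when an empty collection is indexed directly. As a minimal sketch of the alternative, assuming a hypothetical helper named `first_partition_value` that does not reproduce the actual code in `file_scan_config.rs`, checked access turns the same situation into an error the caller can report:

```rust
/// Hypothetical helper (not DataFusion's API): return the first parsed
/// partition value, or an error if nothing was parsed from the path.
/// Indexing with `values[0]` would panic on an empty slice instead.
fn first_partition_value(values: &[String]) -> Result<&str, String> {
    values
        .first()
        .map(|s| s.as_str())
        .ok_or_else(|| "no partition values parsed from path".to_string())
}
```

With an empty slice this returns `Err(..)` rather than aborting the CLI session.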
datafusion-cli$
Files exist on disk:
datafusion-cli$ ll /tmp/test123/
total 0
drwxr-xr-x 2 jeffrey jeffrey 60 Apr 5 20:43 '%60a=b%60=b'/
drwxrwxrwt 119 root root 2.6K Apr 5 20:43 ../
drwxr-xr-x 3 jeffrey jeffrey 60 Apr 5 20:43 ./
datafusion-cli$ ll /tmp/test123/\%60a=b\%60=b/
total 4.0K
-rw-r--r-- 1 jeffrey jeffrey 282 Apr 5 20:43 33T7kTcVyecaVk07.parquet
drwxr-xr-x 3 jeffrey jeffrey 60 Apr 5 20:43 ../
drwxr-xr-x 2 jeffrey jeffrey 60 Apr 5 20:43 ./
datafusion-cli$
Note I had to get cheeky with the column and partition names, as just this didn't work:
DataFusion CLI v37.0.0
❯ create external table test123(a string, `a=b` string) stored as parquet location '/tmp/test123/' partitioned by (`a=b`);
Arrow error: Schema error: Unable to get field named "`a=b`". Valid fields: ["a", "a=b"]
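To illustrate why a directory named `%60a=b%60=b` is awkward to read back, here is a minimal sketch of two ways to interpret a hive-style `column=value` path segment. The function names (`percent_decode`, `naive_split`, `value_for_column`) are hypothetical and this is not DataFusion's implementation; it only assumes that `%60` is the percent-encoding of a backtick and that the reader knows the partition column names from the table definition. Decoding before matching is a simplification.

```rust
/// Decode `%XX` escapes in a path segment (minimal sketch).
fn percent_decode(s: &str) -> String {
    let b = s.as_bytes();
    let mut out = Vec::with_capacity(b.len());
    let mut i = 0;
    while i < b.len() {
        if b[i] == b'%' && i + 2 < b.len() {
            let hi = (b[i + 1] as char).to_digit(16);
            let lo = (b[i + 2] as char).to_digit(16);
            if let (Some(hi), Some(lo)) = (hi, lo) {
                out.push((hi * 16 + lo) as u8);
                i += 3;
                continue;
            }
        }
        out.push(b[i]);
        i += 1;
    }
    String::from_utf8_lossy(&out).into_owned()
}

/// Split on the first '=' without consulting the schema.
/// A column name that itself contains '=' is cut in the wrong place.
fn naive_split(segment: &str) -> Option<(String, String)> {
    let (key, value) = segment.split_once('=')?;
    Some((percent_decode(key), percent_decode(value)))
}

/// Schema-aware variant: match the known column name as a prefix,
/// then require the '=' separator, and return the remainder as the value.
fn value_for_column(segment: &str, column: &str) -> Option<String> {
    let decoded = percent_decode(segment);
    let rest = decoded.strip_prefix(column)?;
    rest.strip_prefix('=').map(str::to_string)
}
```

On the segment from the listing above, `naive_split` yields a key that matches no column in the table, while `value_for_column` with the column name `` `a=b` `` recovers the value `b`.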
Describe the bug
There is a bug when reading from partitioned tables whose partition column names contain single quotes (')
Here is the test
https://github.com/apache/arrow-datafusion/blob/b2a04519da97c2ff81789ef41dd652870794a73a/datafusion/sqllogictest/test_files/copy.slt#L109
To Reproduce
Run this script
Here is an example:
Expected behavior
Note the data is written correctly
Additional context
@devinjdangelo found this in #9240