Type name case sensitivity #10085

chliang71 · 2024-06-06T17:21:02Z

Description

Currently when checking type children here and here, it is using string equal which is case sensitive.

There is also a [test|https://github.com/facebookincubator/velox/blob/main/velox/type/tests/TypeTest.cpp#L405] which has a todo comment regarding case behavior.

Internally we are running into cases like
Field not found: foobar. Available fields are: x, y, fooBar

So I wonder is there any particular reason/use case to be case sensitive? Or should we make it case insensitive? If yes, I can file a PR.

The text was updated successfully, but these errors were encountered:

Yuhta · 2024-06-06T17:37:43Z

I would say the case sensitive matching is a reasonable default. Making it configurable without performance hit is also complicated. After all you can maintain a mapping from the user-facing names to normalized names in planner, and worker (thus Velox) only need to deal with normalized names thus case sensitivity should not be a problem here.

yingsu00 · 2024-06-08T04:03:04Z

My understanding that whether a column name is case sensitive depends on the database implementation and there is no rule to say if it should be case insensitive. https://www.scaler.com/topics/is-sql-case-sensitive/. So if Velox chooses to allow it to be case sensitive then it is what it is. @Yuhta Is this a common consensus in Meta?

Yuhta · 2024-06-11T16:13:47Z

@yingsu00 It's more about design the library to support more cases. Velox is at the bottom level so it has to be in the most strict mode, and engine authors can relax the criteria by normalizing the names. If Velox chose to match name case-insensitively, there would be nothing engine authors can do to make it case sensitive at higher level.

yingsu00 · 2024-06-12T01:41:55Z

I consulted more people on this matter, and as @tdcmeehan pointed out, "Case sensitivity is a property of storage. For example, Hive and Iceberg are case insensitive, but some SQL databases are case sensitive. It is not a property of the engine--the engine needs to respect the convention adopted by storage. " If this is correct, implementing sense insensitiviy in just one single Engine probably won't work. "Suppose Spark wrote fooBar, and Presto wrote FooBar. In the Presto planner, we say, OK, the normalized name is foobar. We plan out the query and one worker is reading fooBar, another is reading FooBar--how can we accommodate case insensitivity without knowing in advance that we've written out the column in two different formats"

It seems to me that we may want to normalize the column names to lower case in the Hive connector which supports both Hive table format and Iceberg table format. One possible place for it to happen is in SplitReader::adaptColumns() (and IcebergSplitReader::adaptColumns() if needed for any semantic differences between these two table formats). The general type system in the core Velox and Presto planner, however, need to remain case sensitive. Does it make sense?

Yuhta · 2024-06-12T14:07:30Z

No by storage he means metastore, not Velox. Files do not keep public facing names, that's the job of metastore. The names in files are just data stream identifiers. Engines (or the storage submodule in engines) need to match user facing names to these identifiers (with the help of metastore), and different engines can have different ways to match them (by name, case sensitive/insensitive, by position, by other unique id than name). Different engines using the same storage do need to keep consistent though, otherwise there will be confusion.

Yuhta · 2024-06-12T14:11:48Z

We do have an option to normalize all names in file to lower case to make matching by name case insensitive easier for now, but we do have plan to have some configurable mechanism to make the column matching more generic. The implementation would likely to be replace the names in file with some changed names, like what DWRF reader is doing when hive.orc.use-column-names=false), but the interface is to be designed.

yingsu00 · 2024-07-05T20:12:32Z

@Yuhta We met similar issue when the column name contains spaces. I created #10388. I think the say to solve it is to

send the HiveColumnIndex from Presto to Velox in HiveColumnHandle
Support this use-column-names=false in Velox
Add space as a valid char in Tokenizer.

Our team is already working on 1)

Yuhta · 2024-07-05T21:52:47Z

@yingsu00 I replied in #10388 that we should be able to solve it without (1) because we use HiveTableHandle::dataColumns to get full table schema.

chliang71 added the enhancement New feature or request label Jun 6, 2024

This was referenced Jul 16, 2024

Include HiveColumnIndex in HiveColumnHandle to Velox prestodb/presto#23130

Closed

Iceberg table writer writes space to _x20 in column names in the Parquet file prestodb/presto#23129

Closed

Summary of Parquet reader Issues #9560

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Type name case sensitivity #10085

Type name case sensitivity #10085

chliang71 commented Jun 6, 2024

Yuhta commented Jun 6, 2024

yingsu00 commented Jun 8, 2024

Yuhta commented Jun 11, 2024

yingsu00 commented Jun 12, 2024

Yuhta commented Jun 12, 2024 •

edited

Loading

Yuhta commented Jun 12, 2024 •

edited

Loading

yingsu00 commented Jul 5, 2024

Yuhta commented Jul 5, 2024 •

edited

Loading

Type name case sensitivity #10085

Type name case sensitivity #10085

Comments

chliang71 commented Jun 6, 2024

Description

Yuhta commented Jun 6, 2024

yingsu00 commented Jun 8, 2024

Yuhta commented Jun 11, 2024

yingsu00 commented Jun 12, 2024

Yuhta commented Jun 12, 2024 • edited Loading

Yuhta commented Jun 12, 2024 • edited Loading

yingsu00 commented Jul 5, 2024

Yuhta commented Jul 5, 2024 • edited Loading

Yuhta commented Jun 12, 2024 •

edited

Loading

Yuhta commented Jun 12, 2024 •

edited

Loading

Yuhta commented Jul 5, 2024 •

edited

Loading