sql: Identifiers should always be case-folded unless double quoted #8862
Comments
@knz thoughts?
👍 of course. I didn't realize previously that pg's SQL was actually case-sensitive with pre-normalization during parsing. This makes perfect sense. I just checked and you can have two tables "a" and "A" side-by-side. This greatly simplifies the code throughout our code base and will improve performance of lookups everywhere. I am definitely in favor of this change. The only remaining question is whether double quotes disable all normalizations or just case folding. We currently also normalize Unicode characters; it seems to me we may still want to do this even for quoted names (during parsing too, of course). (This would also incidentally supersede the discussion in #8200.)
Is this possibly why liquibase can't detect that the table "databasechangelog" has already been created, and fails every time it's run against cockroachdb except the first?
Hi @poetix, I apologize for the delayed response. I don't believe anyone on the team has had a chance to take a look at liquibase, so I can't say for sure what the problem is there. That said, at first glance, I don't think this issue would be causing the problem you're seeing. I say this because this issue causes problems when a table is created with upper-case characters and then looked up using lower-case characters. In Postgres, there is an implicit case-folding during table creation, so the capitalized table name will be converted to lower case, allowing the subsequent lookup to succeed. Until this issue is resolved, CockroachDB won't perform a similar case-folding. It sounds like you created the table
Regarding the 1.0 release, I don't think I'm realistically going to be able to get to this in the next two weeks. Perhaps @justinj might want to take a look?
Perhaps we could start by rejecting identifiers that do not normalize to themselves unless they are quoted? This will ensure that any future change post-beta will not break existing databases.
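To make the "reject unless quoted" idea concrete, here is a minimal sketch. It assumes normalization is plain lower-casing, which is a simplification; the helper names `normalize` and `check_identifier` are hypothetical and are not CockroachDB functions.

```python
def normalize(name: str) -> str:
    """Hypothetical normalization: plain lower-casing (an assumption)."""
    return name.lower()


def check_identifier(name: str, quoted: bool) -> str:
    """Accept quoted names verbatim; reject bare names that normalization would change."""
    if quoted or normalize(name) == name:
        return name
    raise ValueError(
        f"identifier {name!r} must be double-quoted or written as {normalize(name)!r}"
    )
```

Under this rule every name that reaches storage without quotes is already in normalized form, so a later switch to parse-time folding would not change how those names resolve.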
CockroachDB is case-preserving while PostgreSQL is case-folding. Seems like every little difference in semantics will be uncovered by an ORM. Moving to case-folding is straightforward. The first step would be to introduce case-folding to the scanner for non-quoted identifiers. And then a subsequent step would be to remove the case-insensitive comparisons scattered throughout the code, though that could wait until later.
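The first step, folding in the scanner, could look roughly like the sketch below. It is only an illustration: `scan_identifier` is a hypothetical name, the real scanner is not written in Python, and lower-casing stands in for whatever folding rule is actually used.

```python
def scan_identifier(token: str) -> str:
    """Return the name that an identifier token denotes.

    Double-quoted tokens keep their exact spelling (with "" unescaped to ");
    bare tokens are folded to lower case.
    """
    if len(token) >= 2 and token.startswith('"') and token.endswith('"'):
        return token[1:-1].replace('""', '"')
    return token.lower()
```

With the folding done once here, code further down the pipeline can compare names byte for byte.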
Removing the case-insensitive comparisons will be backwards-incompatible unless we also migrate existing tables to their case-folded names (but we can't tell any more whether the double-quote case-preserving semantics would have been intended).
For 1.0, could we simply change the semantics of quoted identifiers to also compare in a case-insensitive way? Supporting case sensitivity in quoted identifiers doesn't seem that important to me. How often do customers need to have tables
I'd be fine with making table names always be case-insensitive.
This is a bit complicated; see the original PR example:
It's not only table names that need to be case-insensitive; we'd need to make
Just had an extensive discussion with @nvanbenschoten and @danhhz about this.
Context: CockroachDB currently stores identifiers in storage using the same case that was used in CREATE (e.g.
There are two competing proposals in the issue history so far: do what Nathan proposed initially, which is to make our SQL engine case-sensitive and more like pg; or do what Jordan suggested in #8862 (comment), which is merely to normalize the names in the
After discussing with Nathan and Dan we think that the first proposal is better for two reasons. One is that it removes the overhead of normalization in the query planner, which is currently required whenever a descriptor is looked up or when a view descriptor is used (a performance gain); the second is that it makes the experience less surprising for users (in Dan's words, "case insensitivity in filesystems is known to cause headaches").
There are 8 types of names currently case-insensitive in CockroachDB: databases, tables, columns, indexes, functions, session variables, cluster settings, and usernames. This issue will remove case insensitivity only for the first 4, i.e. stored things. (We will need to do function names too eventually, when we support user-defined and db-stored functions, but not doing this right now is not a problem for future backward-compatibility, because all the function names currently supported by CockroachDB are already normalized.)
What follows is a plan to achieve this. To illustrate the plan we consider a database that already contains a table called
At this point we have achieved the main goal, which is that all existing descriptors can still be used; they will behave as if they had been initially created double-quoted. Names that happen to already be normalized (the most common case: lowercase with no special characters) can be used without double quotes; for example, in the context above, everything using table
However, at this point we still have an issue: what if the database contained a view
@justinj @eisenstatdavid how comfortable would you be with this?
@eisenstatdavid can you take this on for 1.0?
Sure.
A few questions:
I also don't remember if there are situations where case-mapping, Unicode normalizing, and comparing the bytes is different from computing case-folding equality. I'll research that and what Postgres does. FWIW, my first instinct is to leave double-quoted identifiers alone, not even to Unicode normalize them, so that queries with double-quotes are never contingent on what the Unicode library is doing, and to use
Non-ASCII identifiers are allowed. For case-folding we use a slightly-modified version of the unicode rules, collapsing dotted and dotless I into a single character so that case-folding is not locale-sensitive. I think we currently normalize to NFC only when we are also case-folding. I believe double-quoted identifiers are left exactly as in the input but haven't verified this.
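As a rough illustration of the rule described in the comment above (collapse the I variants, lower-case, then normalize to NFC), here is a sketch. The exact mapping CockroachDB uses is not shown in this thread, so the translation table and the name `fold_name` are assumptions.

```python
import unicodedata

# Assumption: treat dotted capital I (U+0130) and dotless small i (U+0131)
# as plain "i" so the result does not depend on a Turkish-style locale.
_I_VARIANTS = str.maketrans({"\u0130": "i", "\u0131": "i"})


def fold_name(name: str) -> str:
    """Fold a bare identifier: collapse I variants, lower-case, then NFC."""
    collapsed = name.translate(_I_VARIANTS)
    return unicodedata.normalize("NFC", collapsed.lower())
```

In this sketch `fold_name` would be applied only to bare identifiers; double-quoted identifiers would bypass it entirely, matching the "left exactly as in the input" behavior mentioned above.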
Bumping to 1.1 because the high-priority work is done.
An earlier change introduced pre-normalization of descriptor names upon descriptor creation, thereby aiming for two goals:
- normalize descriptor names upon creation of the descriptor, so as to avoid the time overhead of re-normalizing the name on every access;
- creating a distinction, like one exists in postgres, between descriptors created with the syntax `"A"` and the syntax `A` - the latter is normalized, the former is not. This makes case sensitivity opt-in for client applications.
Unfortunately, prior to this patch, most SQL statements also used a case-insensitive *lookup* of database/table/view/column names from storage, preventing the aforementioned benefits.
This patch completes the earlier change by making name lookups case-sensitive. An exhaustive test is modified/introduced to confirm that the change is invisible to most common use cases -- i.e. as long as client apps were not double quoting their identifiers.
Fixes cockroachdb#8862. Fixes cockroachdb#16858.
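The combination this commit message describes, normalizing once at parse time and then doing exact lookups, can be sketched with a toy catalog. `parse_name` and `Catalog` are hypothetical names used only for illustration; they do not mirror CockroachDB's descriptor code.

```python
def parse_name(text: str, quoted: bool) -> str:
    """Parse-time step: fold bare names, keep double-quoted names verbatim."""
    return text if quoted else text.lower()


class Catalog:
    """Toy descriptor store keyed by the already-normalized name."""

    def __init__(self) -> None:
        self._tables = {}

    def create_table(self, name: str) -> None:
        if name in self._tables:
            raise ValueError(f"table {name!r} already exists")
        self._tables[name] = {"name": name}

    def lookup(self, name: str):
        # Exact, case-sensitive match: no re-normalization on access.
        return self._tables.get(name)


catalog = Catalog()
catalog.create_table(parse_name("A", quoted=False))  # stored under "a"
catalog.create_table(parse_name("A", quoted=True))   # stored under "A"
```

Both tables coexist, and a query that never uses double quotes only ever sees the lower-case one, which is why the change is meant to be invisible to such clients.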
There were some commented-out test cases in a test that tests backupccl's descriptor matching logic. The test cases exercise CockroachDB's handling of case folding inside double quotes. This was handled by cockroachdb#8862, but this test case was never uncommented. Release note: None
42607: backupccl: enable case folding desc matching test r=pbardea a=pbardea There were some commented-out test cases in a test that tests backupccl's descriptor matching logic. The test cases exercise CockroachDB's handling of case folding inside double quotes. This was handled by #8862, but this test case was never uncommented. Release note: None Co-authored-by: Paul Bardea <pbardea@gmail.com>
The Postgres JDBC driver makes a few assumptions about SQL identifiers (table names, column names, etc.) being case-folded unless double quoted. This provides the desired behavior that most usages of identifiers are case-insensitive. PG's documentation on this behavior is here.
Postgres Behavior Example:
In CockroachDB however, we handle identifiers differently. Instead of performing this case-folding early in our parser, we have case-insensitive comparisons scattered throughout our code. This means that the case of object names in descriptors could be lower-case, upper-case, or mixed. This is proving to be problematic.
Because the PGJDBC driver makes the assumption that all identifiers are case-folded unless explicitly told not to (using double quotes), it often performs case-sensitive comparisons against introspective statement results. This means that a series of statements like the following work for Postgres, but don't work for us.
Note also that the `select * from "A";` example above would select from table `a` in CockroachDB, which is a deviation from Postgres and may not be desirable.
I propose that we change our SQL parser to perform case-folding early on, and always store case-folded names in db/table descriptors (again, unless told not to with double quotes). This will allow us to ignore identifier case throughout the rest of the `sql` package, and will fix the two issues discussed above for free.
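The driver-side mismatch described in this issue can be illustrated with a small sketch. This is not PGJDBC code and not the original statements from the issue; `table_exists` and the two name lists are hypothetical stand-ins for a client check and for introspection output.

```python
def table_exists(introspected_names, wanted: str) -> bool:
    """Client-side check in the style described above: fold the unquoted
    identifier, then compare case-sensitively against introspection output."""
    return wanted.lower() in introspected_names


# Hypothetical introspection results after an unquoted CREATE TABLE Users:
case_preserving_server = ["Users"]  # spelling from CREATE kept as written
case_folding_server = ["users"]     # name folded at parse time

found_when_preserved = table_exists(case_preserving_server, "Users")
found_when_folded = table_exists(case_folding_server, "Users")
```

Here `found_when_preserved` is `False` and `found_when_folded` is `True`: a client that assumes folding only finds the table when the server stores the folded name.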