sql: Identifiers should always be case-folded unless double quoted #8862

nvanbenschoten · 2016-08-27T03:21:19Z

The Postgres JDBC driver makes a few assumptions about SQL identifiers (table names, column names, etc.) being case-folded unless double quoted. This provides the desired behavior that most usages of identifiers are case-insensitive. PG's documentation on this behavior is here.

Postgres Behavior Example:

postgres=# CREATE TABLE a (i int);
CREATE TABLE
postgres=# select * from a;
 i

---
(0 rows)

postgres=# select * from A;
 i

---
(0 rows)

postgres=# select * from "a";
 i

---
(0 rows)

postgres=# select * from "A";
ERROR:  relation "A" does not exist
LINE 1: select * from "A";
                      ^

In CockroachDB however, we handle identifiers differently. Instead of performing this case-folding early in our parser, we have case-insensitive comparisons scattered throughout our code. This means that the case of object names in descriptors could be lower-case, upper-case, or mixed. This is proving to be problematic.

Because the PGJDBC driver makes the assumption that all identifiers are case-folded unless explicitly told not to (using double quotes), it often performs case-sensitive comparisons against introspective statement results. This means that a series of statements like the following work for Postgres, but don't work for us.

CREATE TABLE Test (a int);
SELECT * FROM pg_class WHERE relname = 'test'

Note also that the select * from "A"; example above would select from table a in CockroachDB, which is a deviation from Postgres and may not be desirable.

I propose that we change our SQL parser to perform case-folding early on, and always store case-folded names in db/table descriptors (again, unless told not to with double quotes). This will allow us to ignore identifier case throughout the rest of the sql package, and will fix the two issues discussed above for free.

The text was updated successfully, but these errors were encountered:

nvanbenschoten · 2016-08-27T03:21:31Z

@knz thoughts?

knz · 2016-08-27T08:56:04Z

👍 of course.

I didn't realize previously that pg's SQL was actually case-sensitive with pre-normalization during parsing. This makes perfect sense. I just checked and you can have two tables "a" and "A" side-by-side. This greatly simplifies the code throughout our code base and will improve performance of lookups everywhere. I am definitely in favor of this change.

The only remaining question is whether double quotes disable all normalizations or just case folding. We currently also normalize unicode characters; it seems to me we may still want to do this even for quoted names (during parsing too, of course).

(This would also incidentally supersede the discussion in #8200)

poetix · 2017-02-14T10:25:31Z

Is this possibly why liquibase can't detect that the table "databasechangelog" has already been created, and fails every time it's run against cockroachdb except the first?

nvanbenschoten · 2017-04-10T01:56:44Z

Hi @poetix, I apologize for the delayed response. I don't believe anyone on the team has had a chance to take a look at liquibase, so I can't say for sure what the problem is there. That said, at first glance, I don't think this issue would be causing the problem you're seeing. I say this because this issue will cause issues when creating tables with capitalized characters and then searching for that table using lower case characters. In Postgres, there is an implicit case-folding during table creation, so the capitalized table name will be converted to lower case, allowing the subsequent lookup to succeed. Until this issue is resolved, CockroachDB won't perform a similar case-folding. It sounds like you created the table databasechangelog with lower case characters, so a failed lookup wouldn't be a result of this issue. Would you mind opening up a separate issue with any other auxiliary information that might serve useful when debugging? Thanks!

nvanbenschoten · 2017-04-10T01:58:47Z

Regarding the 1.0 release, I don't think I'm realistically going to be able to get to this in the next two weeks. Perhaps @justinj might want to take a look?

knz · 2017-04-10T10:42:34Z

Perhaps we could start by rejecting identifiers that do not normalize to themselves unless they are quoted? This will ensure that any future change post-beta will not break existing databases.

petermattis · 2017-04-10T12:40:45Z

CockroachDB is case-preserving while PostgreSQL is case-folding. Seems like every little difference in semantics will be uncovered by an ORM.

Moving to case-folding is straightforward. The first step would be to introduce case-folding to the scanner for non-quoted identifiers. And then a subsequent step would be to remove the case-insensitive comparisons scattered throughout the code, though that could wait until later.

bdarnell · 2017-04-10T19:16:47Z

Removing the case-insensitive comparisons will be backwards-incompatible unless we also migrate existing tables to their case-folded names (but we can't tell any more whether the double-quote case-preserving semantics would have been intended).

jordanlewis · 2017-04-18T16:13:42Z

For 1.0, could we simply change the semantics of quoted identifiers to also compare in a case-insensitive way? Supporting case sensitivity in quoted identifiers doesn't seem that important to me. How often do customers need to have tables a and A be distinct?

bdarnell · 2017-04-18T16:43:05Z

I'd be fine with making table names always be case-insensitive.

knz · 2017-04-18T17:05:35Z

This is a bit complicated, see the original PR example:

CREATE TABLE Test (a int);
SELECT * FROM pg_class WHERE relname = 'test

It's not only table names that need to be case-insensitive; we'd need to make relname and anything related also use a new SQL data type with case-insensitive comparisons.

knz · 2017-04-20T18:08:38Z

Just had an extensive discussion with @nvanbenschoten and @danhhz about this.

Context: CockroachDB currently stores identifiers in storage using the same case that was used in CREATE (e.g. CREATE TABLE RedCamel causes a descriptor with name "RedCamel") but performs case-insensitive lookups (so that select * from redcamel works).

There are two competing proposals in the issue history so far: do what Nathan proposed initially, which is to make our SQL engine case-sensitive and more like pg; or do what Jordan suggested in #8862 (comment), which is merely to normalize the names in the pg_catalog virtual tables and hope for the best.

After discussing with Nathan and Dan we think that the first proposal is better for two reasons. One is that it removes the overhead of normalization in the query planner, which is currently required whenever a descriptor is looked up or when a view descriptor is used (a performance gain); the second is that it makes the experience less surprising for users (in Dan's words, "case insensitivity in filesystems is known to cause headaches").

There are 8 types of names currently case-insensitive in CockroachDB: databases, tables, columns, indexes, functions, session variables, cluster settings, and usernames. This issue will remove case insensitivity only for the first 4, i.e. stored things. (We will need to do function names too eventually, when we support user-defined an db-stored functions, but not doing this right now is not a problem for future backward-compatibility, because all the function names currently supported by CockroachDB are already normalized.)

What follows is a plan to achieve this. To illustrate the plan we consider a database that already contains a table called foo and another table called RedCamel.

remove case-insensitive comparisons throughout the sql package. After this step, the user cannot use the query select * from redcamel any more; however at that point they can still use select * from RedCamel and select * from "RedCamel". At this point SHOW CREATE TABLE RedCamel still prints "CREATE TABLE RedCamel(...)".
change the pretty-printing of names to double-quote identifiers that are not equal to their normalization. This will cause SHOW CREATE TABLE RedCamel to now print out "CREATE TABLE "RedCamel"(...)"
conditionally normalize identifiers during lexing, i.e. normalize during lexing if the identifier is not double-quoted, and do not normalize if the identifier is quoted. After this step, the user cannot use select * from RedCamel any more, but they can still use select * from "RedCamel". Also now the user must enter SHOW CREATE TABLE "RedCamel" to see the table definition.

At this point we have achieved the main goal, which is that all existing descriptors can still be used; they will behave as if they had been initially created double-quoted. Names that happen to already be normalized (the most common case: lowercase with no special characters) can be used without double quotes, for example in the context above everything using table foo is unchanged; even queries like select * from Foo will continue to work (because we normalize Foo to foo during scanning).

However at this point we still have an issue, what if the database contained a view pc_camels defined with the query SELECT * FROM redcamels? After the change in step 1 above, a query on pc_camels would fail because the name "redcamels" cannot be found any more. (Ideally, #10083 would be fixed already and we would not need to consider this case, but we can still find a solution before #10083 is solved.) To address this we need one additional step:

create a new version number for table descriptors; then:
- when using a view descriptor with a version smaller than that new number, migrate it to the new version by rewriting the query to use properly quoted identifiers. This will need to analyze (one-off) the query with case insensitive comparisons. A custom rewriter is needed for this.
- when the version number is up to date, just process the view as usual.
- (don't forget to document the new version number, we have a doc/rfc for that somewhere I believe)

knz · 2017-04-20T20:18:16Z

@justinj @eisenstatdavid how comfortable would you be with this?

cuongdo · 2017-04-20T20:23:59Z

@eisenstatdavid can you take this on for 1.0?

eisenstatdavid · 2017-04-20T20:33:59Z

Sure.

eisenstatdavid · 2017-04-24T13:42:04Z

A few questions:

Are non-ASCII identifiers allowed? (If not, then ignore the rest of this comment.)
While identifiers in double quotes are not case-mapped, are they
- treated as their constituent bytes?
- put into a Unicode normal form? NFC/NFD or NFKC/NFKD (which makes pairs like 9 and ⁹ equivalent)?
To do case mapping, do we use Golang's code-point-by-code-point functions (e.g., https://golang.org/pkg/strings/#ToUpper) or the ones that follow the Unicode standard (https://godoc.org/golang.org/x/text/cases)? The former will be more stable, but the latter is presumably more intuitive (at least if your language agrees with the default Unicode locale) and we're already planning to vendor that library if the tables change.

I also don't remember if there are situations where case-mapping, Unicode normalizing, and comparing the bytes is different from computing case-folding equality. I'll research that and what Postgres does.

FWIW, my first instinct is to leave double-quoted identifiers alone, not even to Unicode normalize them, so that queries with double-quotes are never contingent on what the Unicode library is doing, and to use golang.org/x/text/cases (and vendor it eventually).

bdarnell · 2017-04-24T14:42:56Z

Non-ASCII identifiers are allowed. For case-folding we use a slightly-modified version of the unicode rules, collapsing dotted and dotless I into a single character so that case-folding is not locale-sensitive.

I think we currently normalize to NFC only when we are also case-folding. I believe double-quoted identifiers are left exactly as in the input but haven't verified this.

knz · 2017-04-24T16:24:32Z

ditto what ben says ben
when nathan and I discussed this we envisioned that double quoted identifiers must be normalized but not case mapped. I would agree with your decision to not do any kind of normalization at all for quoted identifiers, but then we need to communicate this clearly in our docs (feel free to add a docs issue to follow up)
so far we've used ToUpper for case mapping but feel free to be smarter if the behavior remains the same for ASCII names.

eisenstatdavid · 2017-04-29T02:45:21Z

Bumping to 1.1 because the high-priority work is done.

An earlier change introduced pre-normalization of descriptor names upon descriptor creation, thereby aiming for two goals: - normalize descriptor names upon creation of the descriptor, so as to avoid the time overhead of re-normalizing the name on every access; - creating a distinction, like one exists in postgres, between descriptors created with the syntax `"A"` and the syntax `A` - the latter is normalized, the former is not. This makes case sensitivity opt-in for client applications. Unfortunately, prior to this patch, most SQL statements also used a case-insensitive *lookup* of database/table/view/column names from storage, preventing the aforementioned benefits. This patch completes the earlier change by making name lookups case-sensitive. An exhaustive test is modified/introduced to confirm that the change is invisible to most common use cases -- i.e. as long as client apps were not double quoting their identifiers. Fixes cockroachdb#8862. Fixes cockroachdb#16858.

There were some commented out test cases in a test that tests backupccl's descriptor matching logic. The tests cases exercise CockroachDB's handing of case folding inside double quotes. This was handled by cockroachdb#8862, but this test case was never uncommented. Release note: None

42607: backupccl: enable case folding desc matching test r=pbardea a=pbardea There were some commented out test cases in a test that tests backupccl's descriptor matching logic. The tests cases exercise CockroachDB's handing of case folding inside double quotes. This was handled by #8862, but this test case was never uncommented. Release note: None Co-authored-by: Paul Bardea <pbardea@gmail.com>

nvanbenschoten added the A-sql-pgcompat Semantic compatibility with PostgreSQL label Oct 19, 2016

nvanbenschoten mentioned this issue Oct 19, 2016

sql: Hibernate should work with CockroachDB #10079

Closed

12 tasks

knz mentioned this issue Oct 22, 2016

sql: add CREATE USER stmt and pgwire password-based authentication #9794

Merged

knz added the A-sql-semantics label Oct 27, 2016

nvanbenschoten mentioned this issue Nov 14, 2016

sql: support qualified name resolution and search path for function names #10595

Merged

jordanlewis mentioned this issue Dec 6, 2016

sql: Hibernate expects pg_attribute.attname to be lowercased #12122

Closed

petermattis added this to the 1.0 milestone Feb 22, 2017

spencerkimball assigned nvanbenschoten Apr 3, 2017

knz added the high priority label Apr 18, 2017

cuongdo assigned eisenstatdavid and unassigned nvanbenschoten Apr 20, 2017

eisenstatdavid mentioned this issue Apr 27, 2017

sql: scanner downcases unquoted identifiers #15434

Merged

eisenstatdavid removed high priority A-sql-pgcompat Semantic compatibility with PostgreSQL labels Apr 29, 2017

eisenstatdavid modified the milestones: 1.1, 1.0 Apr 29, 2017

jordanlewis mentioned this issue Jun 28, 2017

sql: improper case handling in REGCLASS casts #16767

Closed

knz mentioned this issue Jul 6, 2017

sql: make database, table, view and columns lookups properly case-sensitive #16884

Merged

knz closed this as completed in #16884 Jul 17, 2017

pbardea mentioned this issue Nov 20, 2019

backupccl: enable case folding desc matching test #42607

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: Identifiers should always be case-folded unless double quoted #8862

sql: Identifiers should always be case-folded unless double quoted #8862

nvanbenschoten commented Aug 27, 2016

nvanbenschoten commented Aug 27, 2016

knz commented Aug 27, 2016 •

edited

Loading

poetix commented Feb 14, 2017

nvanbenschoten commented Apr 10, 2017

nvanbenschoten commented Apr 10, 2017 •

edited

Loading

knz commented Apr 10, 2017

petermattis commented Apr 10, 2017

bdarnell commented Apr 10, 2017

jordanlewis commented Apr 18, 2017

bdarnell commented Apr 18, 2017

knz commented Apr 18, 2017

knz commented Apr 20, 2017 •

edited by eisenstatdavid

Loading

knz commented Apr 20, 2017 •

edited

Loading

cuongdo commented Apr 20, 2017 •

edited

Loading

eisenstatdavid commented Apr 20, 2017

eisenstatdavid commented Apr 24, 2017

bdarnell commented Apr 24, 2017

knz commented Apr 24, 2017

eisenstatdavid commented Apr 29, 2017

sql: Identifiers should always be case-folded unless double quoted #8862

sql: Identifiers should always be case-folded unless double quoted #8862

Comments

nvanbenschoten commented Aug 27, 2016

nvanbenschoten commented Aug 27, 2016

knz commented Aug 27, 2016 • edited Loading

poetix commented Feb 14, 2017

nvanbenschoten commented Apr 10, 2017

nvanbenschoten commented Apr 10, 2017 • edited Loading

knz commented Apr 10, 2017

petermattis commented Apr 10, 2017

bdarnell commented Apr 10, 2017

jordanlewis commented Apr 18, 2017

bdarnell commented Apr 18, 2017

knz commented Apr 18, 2017

knz commented Apr 20, 2017 • edited by eisenstatdavid Loading

knz commented Apr 20, 2017 • edited Loading

cuongdo commented Apr 20, 2017 • edited Loading

eisenstatdavid commented Apr 20, 2017

eisenstatdavid commented Apr 24, 2017

bdarnell commented Apr 24, 2017

knz commented Apr 24, 2017

eisenstatdavid commented Apr 29, 2017

knz commented Aug 27, 2016 •

edited

Loading

nvanbenschoten commented Apr 10, 2017 •

edited

Loading

knz commented Apr 20, 2017 •

edited by eisenstatdavid

Loading

knz commented Apr 20, 2017 •

edited

Loading

cuongdo commented Apr 20, 2017 •

edited

Loading