Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: Identifiers should always be case-folded unless double quoted #8862

Closed
nvanbenschoten opened this issue Aug 27, 2016 · 19 comments
Closed
Assignees
Milestone

Comments

@nvanbenschoten
Copy link
Member

The Postgres JDBC driver makes a few assumptions about SQL identifiers (table names, column names, etc.) being case-folded unless double quoted. This provides the desired behavior that most usages of identifiers are case-insensitive. PG's documentation on this behavior is here.

Postgres Behavior Example:

postgres=# CREATE TABLE a (i int);
CREATE TABLE
postgres=# select * from a;
 i

---
(0 rows)

postgres=# select * from A;
 i

---
(0 rows)

postgres=# select * from "a";
 i

---
(0 rows)

postgres=# select * from "A";
ERROR:  relation "A" does not exist
LINE 1: select * from "A";
                      ^

In CockroachDB however, we handle identifiers differently. Instead of performing this case-folding early in our parser, we have case-insensitive comparisons scattered throughout our code. This means that the case of object names in descriptors could be lower-case, upper-case, or mixed. This is proving to be problematic.

Because the PGJDBC driver makes the assumption that all identifiers are case-folded unless explicitly told not to (using double quotes), it often performs case-sensitive comparisons against introspective statement results. This means that a series of statements like the following work for Postgres, but don't work for us.

CREATE TABLE Test (a int);
SELECT * FROM pg_class WHERE relname = 'test'

Note also that the select * from "A"; example above would select from table a in CockroachDB, which is a deviation from Postgres and may not be desirable.

I propose that we change our SQL parser to perform case-folding early on, and always store case-folded names in db/table descriptors (again, unless told not to with double quotes). This will allow us to ignore identifier case throughout the rest of the sql package, and will fix the two issues discussed above for free.

@nvanbenschoten
Copy link
Member Author

@knz thoughts?

@knz
Copy link
Contributor

knz commented Aug 27, 2016

👍 of course.

I didn't realize previously that pg's SQL was actually case-sensitive with pre-normalization during parsing. This makes perfect sense. I just checked and you can have two tables "a" and "A" side-by-side. This greatly simplifies the code throughout our code base and will improve performance of lookups everywhere. I am definitely in favor of this change.

The only remaining question is whether double quotes disable all normalizations or just case folding. We currently also normalize unicode characters; it seems to me we may still want to do this even for quoted names (during parsing too, of course).

(This would also incidentally supersede the discussion in #8200)

@poetix
Copy link

poetix commented Feb 14, 2017

Is this possibly why liquibase can't detect that the table "databasechangelog" has already been created, and fails every time it's run against cockroachdb except the first?

@petermattis petermattis added this to the 1.0 milestone Feb 22, 2017
@nvanbenschoten
Copy link
Member Author

Hi @poetix, I apologize for the delayed response. I don't believe anyone on the team has had a chance to take a look at liquibase, so I can't say for sure what the problem is there. That said, at first glance, I don't think this issue would be causing the problem you're seeing. I say this because this issue will cause issues when creating tables with capitalized characters and then searching for that table using lower case characters. In Postgres, there is an implicit case-folding during table creation, so the capitalized table name will be converted to lower case, allowing the subsequent lookup to succeed. Until this issue is resolved, CockroachDB won't perform a similar case-folding. It sounds like you created the table databasechangelog with lower case characters, so a failed lookup wouldn't be a result of this issue. Would you mind opening up a separate issue with any other auxiliary information that might serve useful when debugging? Thanks!

@nvanbenschoten
Copy link
Member Author

nvanbenschoten commented Apr 10, 2017

Regarding the 1.0 release, I don't think I'm realistically going to be able to get to this in the next two weeks. Perhaps @justinj might want to take a look?

@knz
Copy link
Contributor

knz commented Apr 10, 2017

Perhaps we could start by rejecting identifiers that do not normalize to themselves unless they are quoted? This will ensure that any future change post-beta will not break existing databases.

@petermattis
Copy link
Collaborator

CockroachDB is case-preserving while PostgreSQL is case-folding. Seems like every little difference in semantics will be uncovered by an ORM.

Moving to case-folding is straightforward. The first step would be to introduce case-folding to the scanner for non-quoted identifiers. And then a subsequent step would be to remove the case-insensitive comparisons scattered throughout the code, though that could wait until later.

@bdarnell
Copy link
Contributor

Removing the case-insensitive comparisons will be backwards-incompatible unless we also migrate existing tables to their case-folded names (but we can't tell any more whether the double-quote case-preserving semantics would have been intended).

@jordanlewis
Copy link
Member

For 1.0, could we simply change the semantics of quoted identifiers to also compare in a case-insensitive way? Supporting case sensitivity in quoted identifiers doesn't seem that important to me. How often do customers need to have tables a and A be distinct?

@bdarnell
Copy link
Contributor

I'd be fine with making table names always be case-insensitive.

@knz
Copy link
Contributor

knz commented Apr 18, 2017

This is a bit complicated, see the original PR example:

CREATE TABLE Test (a int);
SELECT * FROM pg_class WHERE relname = 'test

It's not only table names that need to be case-insensitive; we'd need to make relname and anything related also use a new SQL data type with case-insensitive comparisons.

@knz
Copy link
Contributor

knz commented Apr 20, 2017

Just had an extensive discussion with @nvanbenschoten and @danhhz about this.

Context: CockroachDB currently stores identifiers in storage using the same case that was used in CREATE (e.g. CREATE TABLE RedCamel causes a descriptor with name "RedCamel") but performs case-insensitive lookups (so that select * from redcamel works).

There are two competing proposals in the issue history so far: do what Nathan proposed initially, which is to make our SQL engine case-sensitive and more like pg; or do what Jordan suggested in #8862 (comment), which is merely to normalize the names in the pg_catalog virtual tables and hope for the best.

After discussing with Nathan and Dan we think that the first proposal is better for two reasons. One is that it removes the overhead of normalization in the query planner, which is currently required whenever a descriptor is looked up or when a view descriptor is used (a performance gain); the second is that it makes the experience less surprising for users (in Dan's words, "case insensitivity in filesystems is known to cause headaches").

There are 8 types of names currently case-insensitive in CockroachDB: databases, tables, columns, indexes, functions, session variables, cluster settings, and usernames. This issue will remove case insensitivity only for the first 4, i.e. stored things. (We will need to do function names too eventually, when we support user-defined an db-stored functions, but not doing this right now is not a problem for future backward-compatibility, because all the function names currently supported by CockroachDB are already normalized.)

What follows is a plan to achieve this. To illustrate the plan we consider a database that already contains a table called foo and another table called RedCamel.

  1. remove case-insensitive comparisons throughout the sql package. After this step, the user cannot use the query select * from redcamel any more; however at that point they can still use select * from RedCamel and select * from "RedCamel". At this point SHOW CREATE TABLE RedCamel still prints "CREATE TABLE RedCamel(...)".
  2. change the pretty-printing of names to double-quote identifiers that are not equal to their normalization. This will cause SHOW CREATE TABLE RedCamel to now print out "CREATE TABLE "RedCamel"(...)"
  3. conditionally normalize identifiers during lexing, i.e. normalize during lexing if the identifier is not double-quoted, and do not normalize if the identifier is quoted. After this step, the user cannot use select * from RedCamel any more, but they can still use select * from "RedCamel". Also now the user must enter SHOW CREATE TABLE "RedCamel" to see the table definition.

At this point we have achieved the main goal, which is that all existing descriptors can still be used; they will behave as if they had been initially created double-quoted. Names that happen to already be normalized (the most common case: lowercase with no special characters) can be used without double quotes, for example in the context above everything using table foo is unchanged; even queries like select * from Foo will continue to work (because we normalize Foo to foo during scanning).

However at this point we still have an issue, what if the database contained a view pc_camels defined with the query SELECT * FROM redcamels? After the change in step 1 above, a query on pc_camels would fail because the name "redcamels" cannot be found any more. (Ideally, #10083 would be fixed already and we would not need to consider this case, but we can still find a solution before #10083 is solved.) To address this we need one additional step:

  1. create a new version number for table descriptors; then:
    • when using a view descriptor with a version smaller than that new number, migrate it to the new version by rewriting the query to use properly quoted identifiers. This will need to analyze (one-off) the query with case insensitive comparisons. A custom rewriter is needed for this.
    • when the version number is up to date, just process the view as usual.
    • (don't forget to document the new version number, we have a doc/rfc for that somewhere I believe)

@knz
Copy link
Contributor

knz commented Apr 20, 2017

@justinj @eisenstatdavid how comfortable would you be with this?

@cuongdo
Copy link
Contributor

cuongdo commented Apr 20, 2017

@eisenstatdavid can you take this on for 1.0?

@eisenstatdavid
Copy link

Sure.

@eisenstatdavid
Copy link

A few questions:

  1. Are non-ASCII identifiers allowed? (If not, then ignore the rest of this comment.)
  2. While identifiers in double quotes are not case-mapped, are they
    • treated as their constituent bytes?
    • put into a Unicode normal form? NFC/NFD or NFKC/NFKD (which makes pairs like 9 and ⁹ equivalent)?
  3. To do case mapping, do we use Golang's code-point-by-code-point functions (e.g., https://golang.org/pkg/strings/#ToUpper) or the ones that follow the Unicode standard (https://godoc.org/golang.org/x/text/cases)? The former will be more stable, but the latter is presumably more intuitive (at least if your language agrees with the default Unicode locale) and we're already planning to vendor that library if the tables change.

I also don't remember if there are situations where case-mapping, Unicode normalizing, and comparing the bytes is different from computing case-folding equality. I'll research that and what Postgres does.

FWIW, my first instinct is to leave double-quoted identifiers alone, not even to Unicode normalize them, so that queries with double-quotes are never contingent on what the Unicode library is doing, and to use golang.org/x/text/cases (and vendor it eventually).

@bdarnell
Copy link
Contributor

Non-ASCII identifiers are allowed. For case-folding we use a slightly-modified version of the unicode rules, collapsing dotted and dotless I into a single character so that case-folding is not locale-sensitive.

I think we currently normalize to NFC only when we are also case-folding. I believe double-quoted identifiers are left exactly as in the input but haven't verified this.

@knz
Copy link
Contributor

knz commented Apr 24, 2017

  1. ditto what ben says ben
  2. when nathan and I discussed this we envisioned that double quoted identifiers must be normalized but not case mapped. I would agree with your decision to not do any kind of normalization at all for quoted identifiers, but then we need to communicate this clearly in our docs (feel free to add a docs issue to follow up)
  3. so far we've used ToUpper for case mapping but feel free to be smarter if the behavior remains the same for ASCII names.

@eisenstatdavid
Copy link

Bumping to 1.1 because the high-priority work is done.

@eisenstatdavid eisenstatdavid removed high priority A-sql-pgcompat Semantic compatibility with PostgreSQL labels Apr 29, 2017
@eisenstatdavid eisenstatdavid modified the milestones: 1.1, 1.0 Apr 29, 2017
knz added a commit to knz/cockroach that referenced this issue Jul 7, 2017
An earlier change introduced pre-normalization of descriptor names
upon descriptor creation, thereby aiming for two goals:

- normalize descriptor names upon creation of the descriptor, so as to
  avoid the time overhead of re-normalizing the name on every access;

- creating a distinction, like one exists in postgres, between
  descriptors created with the syntax `"A"` and the syntax `A` - the
  latter is normalized, the former is not. This makes case sensitivity
  opt-in for client applications.

Unfortunately, prior to this patch, most SQL statements also used a
case-insensitive *lookup* of database/table/view/column names from
storage, preventing the aforementioned benefits.

This patch completes the earlier change by making name lookups
case-sensitive. An exhaustive test is modified/introduced to confirm
that the change is invisible to most common use cases -- i.e. as long
as client apps were not double quoting their identifiers.

Fixes cockroachdb#8862.
Fixes cockroachdb#16858.
knz added a commit to knz/cockroach that referenced this issue Jul 14, 2017
An earlier change introduced pre-normalization of descriptor names
upon descriptor creation, thereby aiming for two goals:

- normalize descriptor names upon creation of the descriptor, so as to
  avoid the time overhead of re-normalizing the name on every access;

- creating a distinction, like one exists in postgres, between
  descriptors created with the syntax `"A"` and the syntax `A` - the
  latter is normalized, the former is not. This makes case sensitivity
  opt-in for client applications.

Unfortunately, prior to this patch, most SQL statements also used a
case-insensitive *lookup* of database/table/view/column names from
storage, preventing the aforementioned benefits.

This patch completes the earlier change by making name lookups
case-sensitive. An exhaustive test is modified/introduced to confirm
that the change is invisible to most common use cases -- i.e. as long
as client apps were not double quoting their identifiers.

Fixes cockroachdb#8862.
Fixes cockroachdb#16858.
knz added a commit to knz/cockroach that referenced this issue Jul 17, 2017
An earlier change introduced pre-normalization of descriptor names
upon descriptor creation, thereby aiming for two goals:

- normalize descriptor names upon creation of the descriptor, so as to
  avoid the time overhead of re-normalizing the name on every access;

- creating a distinction, like one exists in postgres, between
  descriptors created with the syntax `"A"` and the syntax `A` - the
  latter is normalized, the former is not. This makes case sensitivity
  opt-in for client applications.

Unfortunately, prior to this patch, most SQL statements also used a
case-insensitive *lookup* of database/table/view/column names from
storage, preventing the aforementioned benefits.

This patch completes the earlier change by making name lookups
case-sensitive. An exhaustive test is modified/introduced to confirm
that the change is invisible to most common use cases -- i.e. as long
as client apps were not double quoting their identifiers.

Fixes cockroachdb#8862.
Fixes cockroachdb#16858.
knz added a commit to knz/cockroach that referenced this issue Jul 17, 2017
An earlier change introduced pre-normalization of descriptor names
upon descriptor creation, thereby aiming for two goals:

- normalize descriptor names upon creation of the descriptor, so as to
  avoid the time overhead of re-normalizing the name on every access;

- creating a distinction, like one exists in postgres, between
  descriptors created with the syntax `"A"` and the syntax `A` - the
  latter is normalized, the former is not. This makes case sensitivity
  opt-in for client applications.

Unfortunately, prior to this patch, most SQL statements also used a
case-insensitive *lookup* of database/table/view/column names from
storage, preventing the aforementioned benefits.

This patch completes the earlier change by making name lookups
case-sensitive. An exhaustive test is modified/introduced to confirm
that the change is invisible to most common use cases -- i.e. as long
as client apps were not double quoting their identifiers.

Fixes cockroachdb#8862.
Fixes cockroachdb#16858.
pbardea added a commit to pbardea/cockroach that referenced this issue Nov 20, 2019
There were some commented out test cases in a test that tests
backupccl's descriptor matching logic. The tests cases exercise
CockroachDB's handing of case folding inside double quotes.

This was handled by cockroachdb#8862, but this test case was never uncommented.

Release note: None
craig bot pushed a commit that referenced this issue Nov 20, 2019
42607: backupccl: enable case folding desc matching test r=pbardea a=pbardea

There were some commented out test cases in a test that tests
backupccl's descriptor matching logic. The tests cases exercise
CockroachDB's handing of case folding inside double quotes.

This was handled by #8862, but this test case was never uncommented.

Release note: None

Co-authored-by: Paul Bardea <pbardea@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants