support Unicode string literals prefixed with U& #1354

Closed
lovasoa opened this issue Jul 27, 2024 · 0 comments · Fixed by #1355

lovasoa commented Jul 27, 2024

Postgres supports string literals prefixed with U&, which can include backslash-escaped Unicode code points. The documentation describes the same syntax for quoted identifiers:

A variant of quoted identifiers allows including escaped Unicode characters identified by their code points. This variant starts with U& (upper or lower case U followed by ampersand) immediately before the opening double quote, without any spaces in between, for example U&"foo". (Note that this creates an ambiguity with the operator &. Use spaces around the operator to avoid this problem.) Inside the quotes, Unicode characters can be specified in escaped form by writing a backslash followed by the four-digit hexadecimal code point number or alternatively a backslash followed by a plus sign followed by a six-digit hexadecimal code point number. For example, the identifier "data" could be written as

U&"d\0061t\+000061"

The following less trivial example writes the Russian word “slon” (elephant) in Cyrillic letters:

U&"\0441\043B\043E\043D"

If a different escape character than backslash is desired, it can be specified using the UESCAPE clause after the string, for example:

U&"d!0061t!+000061" UESCAPE '!'

The escape character can be any single character other than a hexadecimal digit, the plus sign, a single quote, a double quote, or a whitespace character. Note that the escape character is written in single quotes, not double quotes, after UESCAPE.

To include the escape character in the identifier literally, write it twice.

Either the 4-digit or the 6-digit escape form can be used to specify UTF-16 surrogate pairs to compose characters with code points larger than U+FFFF, although the availability of the 6-digit form technically makes this unnecessary. (Surrogate pairs are not stored directly, but are combined into a single code point.)

https://www.postgresql.org/docs/current/sql-syntax-lexical.html
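To make the escape rules quoted above concrete, here is a minimal sketch of a decoder for the text between the quotes. It is only an illustration of the documented rules, not the code used by sqlparser-rs or Postgres; the names `decode_unicode_escapes` and `is_valid_uescape_char` are hypothetical, and the sketch assumes quote doubling has already been resolved by the caller.

```rust
/// Hypothetical helper: is `c` allowed as a UESCAPE character?
/// Per the quoted rules: anything except a hex digit, '+', a quote, or whitespace.
fn is_valid_uescape_char(c: char) -> bool {
    !(c.is_ascii_hexdigit() || c == '+' || c == '\'' || c == '"' || c.is_whitespace())
}

/// Hypothetical helper: decode the body of a `U&'...'` literal.
/// `escape` is the escape character (backslash unless UESCAPE says otherwise).
fn decode_unicode_escapes(body: &str, escape: char) -> Result<String, String> {
    let chars: Vec<char> = body.chars().collect();
    let mut out = String::new();
    // A high surrogate that is still waiting for its low half.
    let mut pending_high: Option<u32> = None;
    let mut i = 0;
    while i < chars.len() {
        let is_escape = chars[i] == escape;
        let is_doubled = is_escape && chars.get(i + 1) == Some(&escape);
        if !is_escape || is_doubled {
            // Plain character, or a doubled escape character meaning itself.
            if pending_high.is_some() {
                return Err("unpaired high surrogate".to_string());
            }
            out.push(chars[i]);
            i += if is_doubled { 2 } else { 1 };
            continue;
        }
        // `\XXXX` (4 hex digits) or `\+XXXXXX` (6 hex digits).
        let (start, len) = if chars.get(i + 1) == Some(&'+') {
            (i + 2, 6)
        } else {
            (i + 1, 4)
        };
        let end = start + len;
        if end > chars.len() {
            return Err("truncated escape sequence".to_string());
        }
        let hex: String = chars[start..end].iter().collect();
        // Check digits explicitly: from_str_radix would also accept a leading '+'.
        if !hex.chars().all(|c| c.is_ascii_hexdigit()) {
            return Err(format!("invalid hex digits: {hex}"));
        }
        let code = u32::from_str_radix(&hex, 16).map_err(|e| e.to_string())?;
        i = end;
        match (pending_high.take(), code) {
            // Low surrogate completing a pair: combine into one code point.
            (Some(high), 0xDC00..=0xDFFF) => {
                let combined = 0x10000 + ((high - 0xD800) << 10) + (code - 0xDC00);
                let ch = char::from_u32(combined)
                    .ok_or_else(|| "invalid surrogate pair".to_string())?;
                out.push(ch);
            }
            (Some(_), _) => return Err("unpaired high surrogate".to_string()),
            // High surrogate: remember it and wait for the low half.
            (None, 0xD800..=0xDBFF) => pending_high = Some(code),
            (None, _) => {
                let ch = char::from_u32(code)
                    .ok_or_else(|| format!("invalid code point: {code:X}"))?;
                out.push(ch);
            }
        }
    }
    if pending_high.is_some() {
        return Err("unpaired high surrogate".to_string());
    }
    Ok(out)
}
```

With these helpers, `decode_unicode_escapes(r"d\0061t\+000061", '\\')` yields `data`, and passing `'!'` as the escape character handles the UESCAPE example the same way.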

Currently sqlparser-rs parses U&'x' as the binary operation & applied to an identifier u and a string literal.
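One way to avoid that misparse is to check for the prefix before falling back to identifier tokenization. The sketch below is only an illustration under that assumption: `unicode_literal_quote` is a hypothetical name, it is not part of the sqlparser-rs API, and the actual fix may be structured differently.

```rust
/// Hypothetical helper: if `input` begins a Unicode-escaped literal,
/// return its opening quote character (`'` for a string, `"` for an identifier).
/// The prefix must be `U&` or `u&` immediately followed by the quote,
/// with no whitespace in between.
fn unicode_literal_quote(input: &str) -> Option<char> {
    let mut chars = input.chars();
    match (chars.next(), chars.next(), chars.next()) {
        (Some('U' | 'u'), Some('&'), Some(q @ ('\'' | '"'))) => Some(q),
        _ => None,
    }
}
```

Under this check, `U&'x'` is recognized as a single literal, while `u & 'x'` (with spaces) still falls through to be parsed as the `&` operator applied to `u` and a string.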

Initially reported in sqlpage/SQLPage#511

Other databases support the syntax:

lovasoa changed the title from "support postgres Unicode literals" to "support Unicode string literals prefixed with U&" on Jul 27, 2024
lovasoa added a commit to lovasoa/sqlparser-rs that referenced this issue Jul 27, 2024