support Unicode string literals prefixed with `U&` #1354

lovasoa · 2024-07-27T19:52:46Z

Postgres supports string literals and quoted identifiers prefixed with `U&`, which can include backslash-escaped Unicode code points. The documentation describes the identifier form as follows:

A variant of quoted identifiers allows including escaped Unicode characters identified by their code points. This variant starts with U& (upper or lower case U followed by ampersand) immediately before the opening double quote, without any spaces in between, for example U&"foo". (Note that this creates an ambiguity with the operator &. Use spaces around the operator to avoid this problem.) Inside the quotes, Unicode characters can be specified in escaped form by writing a backslash followed by the four-digit hexadecimal code point number or alternatively a backslash followed by a plus sign followed by a six-digit hexadecimal code point number. For example, the identifier "data" could be written as

U&"d\0061t+000061"

The following less trivial example writes the Russian word “slon” (elephant) in Cyrillic letters:

U&"\0441\043B\043E\043D"

If a different escape character than backslash is desired, it can be specified using the UESCAPE clause after the string, for example:

U&"d!0061t!+000061" UESCAPE '!'

The escape character can be any single character other than a hexadecimal digit, the plus sign, a single quote, a double quote, or a whitespace character. Note that the escape character is written in single quotes, not double quotes, after UESCAPE.

To include the escape character in the identifier literally, write it twice.

Either the 4-digit or the 6-digit escape form can be used to specify UTF-16 surrogate pairs to compose characters with code points larger than U+FFFF, although the availability of the 6-digit form technically makes this unnecessary. (Surrogate pairs are not stored directly, but are combined into a single code point.)

https://www.postgresql.org/docs/current/sql-syntax-lexical.html
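
The escape rules quoted above can be sketched as a small decoder. This is only an illustration under stated assumptions: `decode_unicode_escapes` and `is_valid_uescape_char` are hypothetical names, not part of sqlparser-rs or Postgres, and the input is assumed to be the text between the quotes with any doubled-quote handling already done.

```rust
/// Checks the UESCAPE rule: any single character except a hexadecimal digit,
/// the plus sign, a single quote, a double quote, or whitespace.
/// (Hypothetical helper, for illustration only.)
fn is_valid_uescape_char(c: char) -> bool {
    !(c.is_ascii_hexdigit() || c == '+' || c == '\'' || c == '"' || c.is_whitespace())
}

/// Decodes the body of a `U&`-prefixed literal or identifier.
/// `escape` is `\` by default, or the character given in the UESCAPE clause.
/// (Hypothetical helper; not the sqlparser-rs implementation.)
fn decode_unicode_escapes(body: &str, escape: char) -> Result<String, String> {
    if !is_valid_uescape_char(escape) {
        return Err(format!("invalid escape character {escape:?}"));
    }
    let chars: Vec<char> = body.chars().collect();
    let mut out = String::new();
    // A high surrogate that is still waiting for its low surrogate.
    let mut pending_high: Option<u32> = None;
    let mut i = 0;
    while i < chars.len() {
        let is_escape = chars[i] == escape;
        let is_doubled = is_escape && chars.get(i + 1) == Some(&escape);
        if !is_escape || is_doubled {
            if pending_high.is_some() {
                return Err("unpaired high surrogate".to_string());
            }
            // A doubled escape character stands for itself.
            out.push(chars[i]);
            i += if is_doubled { 2 } else { 1 };
            continue;
        }
        // Escape followed by `+` takes six hex digits, otherwise four.
        let (start, len) = if chars.get(i + 1) == Some(&'+') {
            (i + 2, 6)
        } else {
            (i + 1, 4)
        };
        let end = start + len;
        if end > chars.len() {
            return Err("truncated escape sequence".to_string());
        }
        let hex: String = chars[start..end].iter().collect();
        if !hex.chars().all(|c| c.is_ascii_hexdigit()) {
            return Err(format!("invalid hex digits in escape: {hex}"));
        }
        let code = u32::from_str_radix(&hex, 16).map_err(|e| e.to_string())?;
        i = end;
        match (pending_high.take(), code) {
            // High + low surrogate: combine into a single code point.
            (Some(high), 0xDC00..=0xDFFF) => {
                let combined = 0x10000 + ((high - 0xD800) << 10) + (code - 0xDC00);
                out.push(char::from_u32(combined).ok_or("invalid code point")?);
            }
            (Some(_), _) => return Err("unpaired high surrogate".to_string()),
            (None, 0xD800..=0xDBFF) => pending_high = Some(code),
            // A lone low surrogate or out-of-range value fails here.
            (None, _) => out.push(char::from_u32(code).ok_or("invalid code point")?),
        }
    }
    if pending_high.is_some() {
        return Err("unpaired high surrogate".to_string());
    }
    Ok(out)
}
```

With this sketch, `decode_unicode_escapes(r"d\0061t\+000061", '\\')` and `decode_unicode_escapes("d!0061t!+000061", '!')` both yield `data`, matching the documentation's examples.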

Currently, sqlparser-rs parses `U&'x'` as the binary operator `&` applied to an identifier `U` and a string literal `'x'`.
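
To avoid that misparse, a tokenizer has to recognize the prefix before it falls back to reading an identifier. A minimal sketch of that check, assuming a hypothetical helper `starts_unicode_literal` (this is not the code merged in #1355):

```rust
/// Returns true when `input` begins a Unicode-escaped literal or identifier:
/// `U` or `u`, then `&`, then an opening quote, with no spaces in between.
/// (Hypothetical helper, for illustration only.)
fn starts_unicode_literal(input: &str) -> bool {
    let mut chars = input.chars();
    matches!(chars.next(), Some('U' | 'u'))
        && chars.next() == Some('&')
        && matches!(chars.next(), Some('\'' | '"'))
}
```

Under this check, `U&'x'` and `u&"foo"` are treated as a single token, while `U & 'x'` (with spaces) still reads as the `&` operator applied to `U` and `'x'`.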

Initially reported in sqlpage/SQLPage#511

Other databases support the syntax:

lovasoa changed the title ~~support postgres Unicode literals~~ support Unicode string literals prefixed with U& Jul 27, 2024

This was referenced Jul 27, 2024

support unicode literal strings prefixed with U& sqlpage/SQLPage#512

Closed

Support for postgres String Constants with Unicode Escapes #1355

Merged

lovasoa added a commit to lovasoa/sqlparser-rs that referenced this issue Jul 27, 2024

add support for postgres String Constants with Unicode Escapes

69ea53e

fixes apache#1354

alamb closed this as completed in #1355 Jul 29, 2024
