Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected eliminated conditions in the connection with utf8mb4 charset #53730

Closed
pcqz opened this issue May 31, 2024 · 4 comments · Fixed by #54179
Closed

Unexpected eliminated conditions in the connection with utf8mb4 charset #53730

pcqz opened this issue May 31, 2024 · 4 comments · Fixed by #54179

Comments

@pcqz
Copy link

pcqz commented May 31, 2024

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

create table t (id int primary key, name varchar(20));
insert into t values(1, 'a'), (2, 'a ');
select * from t where name='a' and length(name)=1;

2. What did you expect to see? (Required)

mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)

mysql>  select * from t where name='a' and length(name)=1;
+----+------+
| id | name |
+----+------+
|  1 | a    |
+----+------+
1 row in set (0.00 sec)
mysql> explain select * from t where name='a' and length(name)=1;
+-------------------------+---------+-----------+---------------+--------------------------------------------------+
| id                      | estRows | task      | access object | operator info                                    |
+-------------------------+---------+-----------+---------------+--------------------------------------------------+
| TableReader_7           | 0.00    | root      |               | data:Selection_6                                 |
| └─Selection_6           | 0.00    | cop[tikv] |               | eq(length(test.t.name), 1), eq(test.t.name, "a") |
|   └─TableFullScan_5     | 2.00    | cop[tikv] | table:t       | keep order:false, stats:pseudo                   |
+-------------------------+---------+-----------+---------------+--------------------------------------------------+
3 rows in set (0.00 sec)

3. What did you see instead (Required)

mysql> set names utf8mb4;
Query OK, 0 rows affected (0.00 sec)

mysql>  select * from t where name='a' and length(name)=1;
+----+------+
| id | name |
+----+------+
|  1 | a    |
|  2 | a    |
+----+------+
2 rows in set (0.00 sec)

mysql> explain select * from t where name='a' and length(name)=1;
+-------------------------+---------+-----------+---------------+--------------------------------+
| id                      | estRows | task      | access object | operator info                  |
+-------------------------+---------+-----------+---------------+--------------------------------+
| TableReader_7           | 0.00    | root      |               | data:Selection_6               |
| └─Selection_6           | 0.00    | cop[tikv] |               | eq(test.t.name, "a")           |
|   └─TableFullScan_5     | 2.00    | cop[tikv] | table:t       | keep order:false, stats:pseudo |
+-------------------------+---------+-----------+---------------+--------------------------------+
3 rows in set (0.00 sec)

4. What is your TiDB version? (Required)

v6.5.6

@elsa0520
Copy link
Contributor

Mysql behavior

The utf8, UTF8MB3, UTF8MB4 have the same behavior.

mysql> set names utf8;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> show warnings;
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                                                                                                                     |
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Warning | 3719 | 'utf8' is currently an alias for the character set UTF8MB3, but will be an alias for UTF8MB4 in a future release. Please consider using UTF8MB4 in order to be unambiguous. |
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> select * from t where name='a' and length(name)=1;
+----+------+
| id | name |
+----+------+
|  1 | a    |
+----+------+
1 row in set (0.01 sec)

mysql> set names utf8mb4;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from t where name='a' and length(name)=1;
+----+------+
| id | name |
+----+------+
|  1 | a    |
+----+------+
1 row in set (0.01 sec)

@elsa0520
Copy link
Contributor

elsa0520 commented Jun 17, 2024

What behavior of uft8/uft8mb4?

The trailing spaces will be ignore when compared with other string.
So the "a " is same as "a".

But, it doesn't mean the tailing spaces should be ignore in the length function. The length of "a " is 2 not 1.

SET NAMES utf8;
mysql> SELECT length('a ');
+--------------+
| length('a ') |
+--------------+
|            2 |
+--------------+
1 row in set (0.00 sec)

SET NAMES utf8mb4;
mysql> SELECT length('a ');
+--------------+
| length('a ') |
+--------------+
|            2 |
+--------------+
1 row in set (0.00 sec)

From mysql
https://dev.mysql.com/doc/refman/8.4/en/charset-binary-collations.html

For PAD SPACE collations, trailing spaces are insignificant in comparisons; strings are compared without regard to trailing spaces.

NO PAD collations treat trailing spaces as significant in comparisons, like any other character.

The differing behaviors can be demonstrated using the two utf8mb4 binary collations, one of which is PAD SPACE, the other of which is NO PAD. The example also shows how to use the INFORMATION_SCHEMA COLLATIONS table to determine the pad attribute for collations.

mysql> SELECT COLLATION_NAME, PAD_ATTRIBUTE
       FROM INFORMATION_SCHEMA.COLLATIONS
       WHERE COLLATION_NAME LIKE 'utf8mb4%bin';
+------------------+---------------+
| COLLATION_NAME   | PAD_ATTRIBUTE |
+------------------+---------------+
| utf8mb4_bin      | PAD SPACE     |
| utf8mb4_0900_bin | NO PAD        |
+------------------+---------------+
mysql> SET NAMES utf8mb4 COLLATE utf8mb4_bin;
mysql> SELECT 'a ' = 'a';
+------------+
| 'a ' = 'a' |
+------------+
|          1 |
+------------+
mysql> SET NAMES utf8mb4 COLLATE utf8mb4_0900_bin;
mysql> SELECT 'a ' = 'a';
+------------+
| 'a ' = 'a' |
+------------+
|          0 |
+------------+

@elsa0520
Copy link
Contributor

Constant propagate

where name='a' and length(name) = 1

The predicate name='a' couldn't propagate to length('a'). Because the collation of utfxxx_bin ignore the tailling space, while length function shouldn't ignore it.

@elsa0520
Copy link
Contributor

The constant propagate bug

@ti-chi-bot ti-chi-bot bot closed this as completed in 521c999 Jun 27, 2024
ti-chi-bot bot pushed a commit that referenced this issue Jul 4, 2024
ti-chi-bot bot pushed a commit that referenced this issue Aug 1, 2024
ti-chi-bot bot pushed a commit that referenced this issue Nov 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants