Emojis and 4-byte characters cause Jethro to break #754

jefft · 2022-01-29T04:28:05Z

Jethro currently breaks when emojis are entered. For instance, if I try to save SMS text containing '😊' into a Person note, it fails and an error is logged:

SQLSTATE[22007]: Invalid datetime format: 1366 Incorrect string value: '\xF0\x9F\x98\x8A' for column `jethro`.`note_comment`.`contents` at row 1
Line 240 of /srv/www/jethro/2.31.1/include/db_object.class.php

USER:       1
REFERER:    https://jethro/?view=_edit_note&note_type=person&noteid=20649
USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36

REQUEST:
Array
(
    [view] => _edit_note
    [note_type] => person
    [noteid] => 20649
    [update_note_submitted] => 1
    [contents] => All good thanks, looking at getting back very soon
Thanks for reaching out 😊
    [status] => pending
    [assignee] => 1
    [action_date_d] => 24
    [action_date_m] => 1
    [action_date_y] => 2022
)

The problem is that MySQL's 'utf8' encoding isn't real UTF-8, but a 3-byte subset that doesn't support all characters. The correct encoding to use in MySQL is 'utf8mb4'. For more information, see here or here.
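
To make the 3-byte limit concrete, here is a minimal Python sketch. It is not part of Jethro, and the helper name `needs_utf8mb4` is made up for illustration. It shows that 😊 encodes to the same four bytes quoted in the error message, and flags any string containing a code point above U+FFFF, which is what needs 4 bytes in UTF-8.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character needs 4 bytes in UTF-8.

    Code points above U+FFFF encode to 4 bytes, which MySQL's
    3-byte 'utf8' charset cannot store.
    """
    return any(ord(ch) > 0xFFFF for ch in text)


smiley = "\N{SMILING FACE WITH SMILING EYES}"  # the 😊 character, U+1F60A
smiley_bytes = smiley.encode("utf-8")          # same bytes as in the error log

print(smiley_bytes)                                        # b'\xf0\x9f\x98\x8a'
print(needs_utf8mb4("Thanks for reaching out " + smiley))  # True
print(needs_utf8mb4("café"))                               # False
```

A check like this could be run over note text before saving, but that is only a suggestion, not something the PR does.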

When I checked my Jethro database I found the database-level encoding was correct:

MariaDB [jethro]> SELECT default_character_set_name FROM information_schema.SCHEMATA
    -> WHERE schema_name = "jethro";
+----------------------------+
| default_character_set_name |
+----------------------------+
| utf8mb4                    |
+----------------------------+

but table-level encodings were wrong:

MariaDB [jethro]> SELECT CCSA.character_set_name FROM information_schema.`TABLES` T, information_schema.`COLLATION_CHARACTER_SET_APPLICABILITY` CCSA WHERE CCSA.collation_name = T.table_collation AND T.table_schema = "jethro";
+--------------------+
| character_set_name |
+--------------------+
| utf8               |
| utf8               |
| utf8               |
| utf8               |
| utf8               |

Fixing this needs two changes:

The database and table encodings need to be changed from utf8 to utf8mb4.
Jethro needs to initiate connections with utf8mb4 rather than utf8, as is currently hardcoded when Jethro constructs the connection dsn:

jethro-pmm/include/jethrodb.php

Line 53 in 269b7b4

$dsn = $type . ':host=' . $host . (strlen($port) ? (';port=' . $port) : '') . ';dbname=' . DB_DATABASE . ';charset=utf8';

Correspondingly, the fix requires two parts:

Fix the database encodings. I followed this advice to generate appropriate SQL to upgrade all database tables. The result is in upgrades/2022-upgrade-to-2.32.sql
Fix the database DSN connection string; done in include/jethrodb.php.

This requires MySQL 5.5.3 and above (released in 2010). I have adjusted the minimum MySQL version in README.md.

Note: I hope changing character sets isn't opening a can of worms. I've only tested on one (mariadb 10.3.32) instance. There may be more places where utf8 needs changing to utf8mb4. I hope other users can test and confirm that upgrades go smoothly.


tbar0970 · 2022-02-03T10:03:55Z

Hmm, I couldn't reproduce the original problem in my systems. Will dig in further to work out why.

tbar0970 · 2022-03-17T22:17:58Z

Nice summary of the issue here: https://make.wordpress.org/core/2015/04/02/the-utf8mb4-upgrade/

Holding off on this for now since on MySQL 5.1 emojis are working fine...

vanoudt · 2022-09-05T01:46:53Z

> Hmm, I couldn't reproduce the original problem in my systems. Will dig in further to work out why.

Could it be that you have the sms sanitising option turned on? That would automatically strip out all multi-byte characters... and you might never see the error!

vanoudt · 2022-09-05T01:52:28Z

I've updated #606 to match this too!

tbar0970 · 2022-09-05T04:13:51Z

> > Hmm, I couldn't reproduce the original problem in my systems. Will dig in further to work out why.
>
> Could it be that you have the sms sanitising option turned on? That would automatically strip out all multi-byte characters... and you might never see the error!

That wouldn't be it - the original problem was triggered by pasting emojis into a Jethro note, not by sending anything out.

tbar0970 · 2023-03-07T22:25:18Z

Hey I'm thinking now would be a good time to merge this.
Not sure what "closed by deleting the head repository" means or came from?

jefft · 2023-03-07T23:59:01Z

> Holding off on this for now since on MySQL 5.1 emojis are working fine...

Yes, it was wise to hold off for that reason. The SQL in this PR would corrupt some older databases.

As your experience shows, emojis in MySQL 5.1 do work, if your database encoding is latin1 and you're using ancient PHP (5.3.3) which seems to hardcode latin1 as the transfer encoding (ignoring charset=utf8 in include/jethrodb.php). Say we have a latin1 table, and we ask MySQL to insert the 4 utf8 bytes representing 😊. MySQL goes "close enough lol ¯\_(ツ)_/¯" and adds the 4 raw bytes, despite them not being valid latin1. When asked to retrieve data, MySQL goes 'here's your bytes whatever they are'. Everything just works because PHP and MySQL store and load the raw bytes unaltered.
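
The byte pass-through can be imitated in a few lines of Python. This is a sketch under an assumption: Python's "latin-1" codec (ISO-8859-1) stands in for MySQL's latin1, which is closer to cp1252, so it only approximates what the server does.

```python
# Assumption: Python's "latin-1" codec approximates MySQL's latin1 here.
utf8_bytes = "😊".encode("utf-8")      # the 4 bytes handed to MySQL: F0 9F 98 8A

stored = utf8_bytes.decode("latin-1")  # each byte is treated as one latin1 character
fetched = stored.encode("latin-1")     # reading back returns the same raw bytes

print(len(stored))                     # 4
print(fetched == utf8_bytes)           # True
print(fetched.decode("utf-8") == "😊") # True
```

Nothing is ever interpreted as UTF-8 on the server side in this model, which is why the round trip appears to work.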

Now what happens if you take one of these latin1 databases containing utf8 bytes, and run the SQL I proposed in this PR? E.g.:

ALTER TABLE `family_note` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

I suspect that might break your characters. Say you have a '😊' encoded as 4 bytes: MySQL will try to map those 4 utf8 bytes one by one, to whatever unicode characters live at those 4 codepoints.
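
A small Python sketch of that suspected failure mode follows. It assumes Python's "latin-1" codec as a stand-in for MySQL's latin1 (which is closer to cp1252), so the exact bytes a real server produces may differ; the doubling effect is the point.

```python
raw = "😊".encode("utf-8")  # F0 9F 98 8A sitting in a latin1 column

# A charset conversion treats each stored byte as its own latin1 character
# and re-encodes every one of them as UTF-8.
converted = raw.decode("latin-1").encode("utf-8")

print(converted)       # b'\xc3\xb0\xc2\x9f\xc2\x98\xc2\x8a'
print(len(converted))  # 8

# Undoing the mistaken step recovers the original character.
repaired = converted.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired == "😊")  # True
```

The 4-byte smiley becomes 8 bytes of unrelated characters, which matches the worry that a plain CONVERT TO would mangle data in this kind of database.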

For latin1 databases, what's needed is a more involved process involving a mysqldump, then search & replace, and import, as documented at https://www.whitesmith.co/blog/latin1-to-utf8/. I have a draft guide written on how to do that.

What about the less severe case, where tables are created as utf8, and we run SQL to CONVERT TO CHARACTER SET utf8mb4? I think that would be safe.
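
For that utf8-only case, the per-table statements could be generated rather than typed by hand. The sketch below is hypothetical (the function name `conversion_sql` is invented, and this is not the script used to produce the upgrade SQL in this PR); it simply emits one statement per table in the same form as the ALTER TABLE example above.

```python
from typing import Iterable, List


def conversion_sql(tables: Iterable[str]) -> List[str]:
    """Build one CONVERT TO statement per table name."""
    statements = []
    for name in tables:
        # Quote the identifier, doubling any embedded backticks.
        quoted = "`" + name.replace("`", "``") + "`"
        statements.append(
            f"ALTER TABLE {quoted} CONVERT TO CHARACTER SET utf8mb4 "
            "COLLATE utf8mb4_unicode_ci;"
        )
    return statements


for line in conversion_sql(["family_note", "note_comment"]):
    print(line)
```

The table list would come from information_schema in practice; it is hardcoded here to keep the example self-contained.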

So there are 2 kinds of Jethro databases: old latin1 databases that need surgery, and utf8 databases that just need converting, and we cannot recommend the same SQL for both cases. Perhaps we'll need a PHP script, upgrades/2023-fixcharsets.php, which does the right thing.

tbar0970 · 2023-03-08T01:28:49Z

Nice analysis.

Any Jethro system created in the last few years will have utf8 encoding on each table, but possibly latin1 as the encoding for the overall database. Does this put it in category 1 (major surgery) or category 2 (smooth conversion)?

If it's only a few really ancient databases that are in category 1, I'd be happy to cover them with a warning in the upgrade notes and some instructions about how to dump, tweak and re-import. And/or ancient emojis can just suffer.

jefft · 2023-03-08T02:27:48Z

> Any Jethro system created in the last few years will have utf8 encoding on each table, but possibly latin1 as the encoding for the overall database. Does this put it in category 1 (major surgery) or category 2 (smooth conversion)?

I don't know, but I'd expect the column-level charset to take precedence. If your columns are utf8 you're probably okay - just run some SQL to convert it (and the columns) to utf8mb4.

> If it's only a few really ancient databases that are in category 1, I'd be happy to cover them with a warning in the upgrade notes and some instructions about how to dump, tweak and re-import.

Yes, I think that's fine.

I've tweaked my comment above, to point out that 😁 has a 4-byte utf8 representation.

As for how to tell category 1 and category 2 databases apart, 4-byte emojis are a useful shibboleth. Ask your Jethro "can you pronounce 😁"?

A modern Jethro using utf8 will not be able to store that. utf8 can only store 3-byte unicode, not 4-byte. You can see a utf8 Jethro fall over at https://easyjethro.com.au/demo/.
However an old Jethro, with latin1 database and PHP 5.x will be able to store and display 😁. If the smiley shows, you know you'll need database surgery to convert latin1 to utf8mb4.

tbar0970 · 2023-04-12T13:06:17Z

Further diagnostic steps from Jeff:

Try running this query:
SELECT T.table_name, CCSA.character_set_name FROM information_schema.TABLES T, information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA WHERE CCSA.collation_name = T.table_collation AND T.table_schema = DATABASE();

If you get lots of 'latin1', but smileys still work, your database is basically corrupt - storing utf8 bytes in latin1 tables. The fix is as described in https://www.whitesmith.co/blog/latin1-to-utf8/, to dump the database with no charset transcoding:

mysqldump --routines --events --no-data --skip-set-charset --default-character-set=latin1 jethro > jethro.sql

Then do a massive search & replace on the .sql dump:

# -E enables extended regexes, which the +, | and ( ) in these patterns need
cat jethro.sql | sed -E \
  -e 's/CHARACTER SET [^ ]+/CHARACTER SET utf8mb4/g' \
  -e 's/DEFAULT CHARSET=[^;]+;/DEFAULT CHARSET=utf8mb4;/g' \
  -e 's/SET (character_set_client|character_set_results)( +)= ([^@][^ ]+)([ ;])/SET \1\2= utf8mb4\4/g' \
  -e 's/SET (collation_connection)( +)= ([^ ]+) /SET \1\2= utf8mb4_bin /g' \
  -e 's/COLLATE [^ ]+_ci/COLLATE utf8mb4_unicode_ci/g' \
  -e 's/SET NAMES [^ ]+ /SET NAMES utf8mb4 /g' \
  > fixedjethro.sql

Then import the fixed SQL.
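
To preview what the search & replace does before touching a real dump, the same six substitutions can be tried in Python. This is only a sketch: the function name `rewrite_charsets` is invented, and it applies the patterns line by line, as sed does.

```python
import re

# (pattern, replacement) pairs mirroring the sed expressions above
RULES = [
    (r"CHARACTER SET [^ ]+", "CHARACTER SET utf8mb4"),
    (r"DEFAULT CHARSET=[^;]+;", "DEFAULT CHARSET=utf8mb4;"),
    (r"SET (character_set_client|character_set_results)( +)= ([^@][^ ]+)([ ;])",
     r"SET \1\2= utf8mb4\4"),
    (r"SET (collation_connection)( +)= ([^ ]+) ", r"SET \1\2= utf8mb4_bin "),
    (r"COLLATE [^ ]+_ci", "COLLATE utf8mb4_unicode_ci"),
    (r"SET NAMES [^ ]+ ", "SET NAMES utf8mb4 "),
]


def rewrite_charsets(sql: str) -> str:
    """Apply every rule to each line of a SQL dump, in order."""
    rewritten = []
    for line in sql.split("\n"):
        for pattern, replacement in RULES:
            line = re.sub(pattern, replacement, line)
        rewritten.append(line)
    return "\n".join(rewritten)


sample = ") ENGINE=InnoDB DEFAULT CHARSET=latin1;"
print(rewrite_charsets(sample))  # ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
```

Running it on a few representative lines of the dump is a cheap way to spot a pattern that matches more or less than intended.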

jefft · 2024-09-30T10:49:11Z

That diagnostic SQL query, with non-broken formatting:

SELECT t.table_name, 
       ccsa.character_set_name,
       t.table_collation
FROM information_schema.`tables` t,
     information_schema.`collation_character_set_applicability` ccsa
WHERE ccsa.collation_name = t.table_collation
  AND t.table_schema = database();

This shows the encoding and collation of each table, e.g.:

+----------------------------------+--------------------+--------------------+
| table_name                       | character_set_name | table_collation    |
+----------------------------------+--------------------+--------------------+
| person_status                    | utf8mb4            | utf8mb4_bin        |
| 2fa_trust                        | utf8mb4            | utf8mb4_bin        |
| custom_field                     | utf8mb4            | utf8mb4_unicode_ci |
| person_group_membership_status   | utf8mb4            | utf8mb4_unicode_ci |
| _abstract_note_old_backup        | utf8mb4            | utf8mb4_unicode_ci |
| staff_member                     | utf8mb4            | utf8mb4_unicode_ci |
....

Encodings

Your character_set_name might be:

utf8mb4 - this is ideal
utf8 or utf8mb3 (utf8 is an alias for utf8mb3) - this is less ideal but livable. You won't be able to store 4-byte characters like 😁 in Jethro. You might like to convert all tables to utf8mb4.
latin1 - this is a problem if your database has any non-ASCII, international characters (including emojis). See above comment on how to dump and re-import your database.
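
The three cases above boil down to a small lookup. The sketch below is hypothetical (the function name and the short action labels are invented); it takes a character_set_name value from the diagnostic query and returns the suggested action.

```python
def charset_advice(character_set_name: str) -> str:
    """Map a table's character set to the action suggested in the list above."""
    name = character_set_name.lower()
    if name == "utf8mb4":
        return "ok"
    if name in ("utf8", "utf8mb3"):
        return "convert"            # optional: CONVERT TO CHARACTER SET utf8mb4
    if name == "latin1":
        return "dump-and-reimport"  # only matters if non-ASCII data is present
    return "unknown"


for cs in ("utf8mb4", "utf8mb3", "latin1"):
    print(cs, "->", charset_advice(cs))
```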

Collations

As for collations: whatever collation your non-binary tables use, it must be the same for all of them:

SELECT DISTINCT CHARACTER_SET_NAME,
                table_collation,
                count(*)
FROM information_schema.`tables` t,
     information_schema.`collation_character_set_applicability` ccsa
WHERE ccsa.collation_name = t.table_collation
  AND t.table_schema = database()
GROUP BY 1,
         2
ORDER BY 3 DESC;
+--------------------+--------------------+----------+
| CHARACTER_SET_NAME | table_collation    | count(*) |
+--------------------+--------------------+----------+
| utf8mb4            | utf8mb4_unicode_ci |       56 |
| utf8mb4            | utf8mb4_bin        |        4 |
+--------------------+--------------------+----------+
2 rows in set (0.002 sec)

If you mix, say, utf8_general_ci and utf8_unicode_ci, you will get "General error: 1271 Illegal mix of collations for operation 'UNION'" errors.
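
Checking the query output for a mix can be automated. The sketch below is hypothetical: the function name is invented, and it assumes that binary collations can be recognised by a `_bin` suffix and ignored, as the "non-binary tables" wording suggests.

```python
from typing import Iterable, Set


def mixed_collations(collations: Iterable[str]) -> Set[str]:
    """Return the non-binary collations if more than one is in use, else an empty set."""
    # Assumption: a "_bin" suffix marks a binary collation, which is ignored.
    non_binary = {c for c in collations if not c.endswith("_bin")}
    return non_binary if len(non_binary) > 1 else set()


print(mixed_collations(["utf8mb4_unicode_ci", "utf8mb4_bin"]))
# set()
print(sorted(mixed_collations(["utf8_general_ci", "utf8_unicode_ci"])))
# ['utf8_general_ci', 'utf8_unicode_ci']
```

The first call matches the healthy output shown above; the second is the kind of mix that leads to error 1271.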

jefft added 3 commits January 29, 2022 14:00

Use real 4-bytes-per-char utf8 rather than mysql's degenerate 3-byte …

bb28ada

…version. This allows chars like emojis 😊 in Jethro if the database + tables are in utf8mb4 too.

Upgrade existing databases from 3-bit to proper 4-bit utf8

48cc40b

Bump minimum MySQL version to 5.5.3, which is the first to support ut…

e05f34f

…f8mb4

jefft closed this by deleting the head repository Jan 23, 2023

tbar0970 reopened this Mar 7, 2023

jefft closed this Mar 7, 2023

jefft mentioned this pull request Jul 11, 2024

Database migration to new server #1058

Open

jefft mentioned this pull request Sep 30, 2024

SQL error in report #1084

Closed
