Emojis and 4-byte characters cause Jethro to break #754
…version. This allows chars like emojis 😊 in Jethro if the database + tables are in utf8mb4 too.
Hmm, I couldn't reproduce the original problem in my systems. Will dig in further to work out why. |
Nice summary of the issue here: https://make.wordpress.org/core/2015/04/02/the-utf8mb4-upgrade/ Holding off on this for now since on MySQL 5.1 emojis are working fine... |
Could it be that you have the sms sanitising option turned on? That would automatically strip out all multi-byte characters... and you might never see the error! |
I've updated #606 to match this too! |
That wouldn't be it - the original problem was triggered by pasting emojis into a Jethro note, not by sending anything out. |
Hey I'm thinking now would be a good time to merge this. |
Yes, it was wise to hold off for that reason. The SQL in this PR would corrupt some older databases. As your experience shows, emojis in MySQL 5.1 do work, if your database encoding is
Now what happens if you take one of these
ALTER TABLE `family_note` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
I suspect that might break your characters. Say you have a '😊' encoded as 4 bytes: MySQL will try to map those 4 utf8 bytes one by one, to whatever unicode characters live at those 4 codepoints. For
What about the less severe case, where tables are created as
So there's 2 kinds of Jethro databases: those with old |
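The byte-by-byte remapping described in this comment can be illustrated outside MySQL. This is only a sketch: Python's ISO-8859-1 codec stands in for a latin1 column (MySQL's latin1 is closer to cp1252, so the exact garbage characters would differ), and the variable names are made up for illustration.

```python
# Illustration only: Python's "latin-1" codec stands in for a latin1 column.
emoji = "😊"

utf8_bytes = emoji.encode("utf-8")      # the 4 bytes the client sends
misread = utf8_bytes.decode("latin-1")  # each byte read as its own character
converted = misread.encode("utf-8")     # what a charset conversion would then store

print(len(utf8_bytes), len(misread), len(converted))

# Reversing the steps recovers the emoji, as long as nothing was truncated.
recovered = converted.decode("utf-8").encode("latin-1").decode("utf-8")
print(recovered == emoji)
```

The single emoji turns into four unrelated characters, and a conversion that trusts the declared charset stores those four characters rather than the original one.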
Nice analysis. Any Jethro system created in the last few years will have utf8 encoding on each table, but possibly latin1 as the encoding for the overall database. Does this put it in category 1 (major surgery) or category 2 (smooth conversion)? If it's only a few really ancient databases that are in category 1, I'd be happy to cover them with a warning in the upgrade notes and some instructions about how to dump, tweak and re-import. And/or ancient emojis can just suffer. |
I don't know, but I'd expect the column-level charset to take precedence. If your columns are
Yes, I think that's fine. I've tweaked my comment above, to point out that 😁 has a 4-byte utf8 representation. As for how to tell category 1 and category 2 databases apart, 4-byte emojis are a useful shibboleth. Ask your Jethro "can you pronounce 😁"? |
Further diagnostic steps from Jeff: Try running this query:
If you get lots of 'latin1', but smileys still work, your database is basically corrupt - storing utf8 bytes in latin1 tables. The fix is as described in https://www.whitesmith.co/blog/latin1-to-utf8/, to dump the database with no charset transcoding:
mysqldump --routines --events --no-data --skip-set-charset --default-character-set=latin1 jethro > jethro.sql
Then do a massive search & replace on the .sql dump (note the -E flag: without extended regular expressions, sed treats the + as a literal character):
sed -E -e 's/CHARACTER SET [^ ]+/CHARACTER SET utf8mb4/g' jethro.sql
Then import the fixed SQL. |
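For anyone who would rather test the search & replace on a few lines before running it over a whole dump, here is a rough Python equivalent. It is a sketch, not the procedure from the linked article: `rewrite_charset` is a hypothetical helper, it matches `\w+` rather than `[^ ]+` so that a trailing comma or semicolon survives, and it also rewrites the `CHARSET=` form on the assumption that the dump contains it. It leaves `COLLATE` clauses alone, so those still need checking by hand.

```python
import re

# Hypothetical helper for illustration; not part of Jethro or mysqldump.
_CHARACTER_SET = re.compile(r"CHARACTER SET \w+")
_CHARSET_EQ = re.compile(r"CHARSET=\w+")


def rewrite_charset(sql, target="utf8mb4"):
    """Replace declared character sets in a SQL dump with `target`."""
    sql = _CHARACTER_SET.sub(f"CHARACTER SET {target}", sql)
    return _CHARSET_EQ.sub(f"CHARSET={target}", sql)


# Made-up sample lines in the style of a schema dump.
print(rewrite_charset("`note` text CHARACTER SET latin1,"))
print(rewrite_charset(") ENGINE=InnoDB DEFAULT CHARSET=latin1;"))
```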
That diagnostic SQL query, with non-broken formatting:
SELECT t.table_name,
       ccsa.character_set_name,
       t.table_collation
FROM information_schema.`tables` t,
     information_schema.`collation_character_set_applicability` ccsa
WHERE ccsa.collation_name = t.table_collation
  AND t.table_schema = database();
This shows the encoding and collation of each table, e.g.:
+----------------------------------+--------------------+--------------------+
| table_name                       | character_set_name | table_collation    |
+----------------------------------+--------------------+--------------------+
| person_status                    | utf8mb4            | utf8mb4_bin        |
| 2fa_trust                        | utf8mb4            | utf8mb4_bin        |
| custom_field                     | utf8mb4            | utf8mb4_unicode_ci |
| person_group_membership_status   | utf8mb4            | utf8mb4_unicode_ci |
| _abstract_note_old_backup        | utf8mb4            | utf8mb4_unicode_ci |
| staff_member                     | utf8mb4            | utf8mb4_unicode_ci |
....
Encodings
Your
Collations
As for collations: whatever the table collation is (for non-binary tables), it must be the same for all tables:
SELECT DISTINCT
CHARACTER_SET_NAME,
table_collation,
count(*)
FROM information_schema.`tables` t,
information_schema.`collation_character_set_applicability` ccsa
WHERE ccsa.collation_name = t.table_collation
AND t.table_schema = database()
GROUP BY 1,
2
ORDER BY 3 DESC;
+--------------------+--------------------+----------+
| CHARACTER_SET_NAME | table_collation    | count(*) |
+--------------------+--------------------+----------+
| utf8mb4            | utf8mb4_unicode_ci |       56 |
| utf8mb4            | utf8mb4_bin        |        4 |
+--------------------+--------------------+----------+
2 rows in set (0.002 sec)
If you mix, say, |
Jethro currently breaks when emojis are entered. For instance, if I try to save SMS text containing '😊' into a Person note, it fails and an error is logged:
The problem is that MySQL's 'utf8' encoding isn't real UTF-8, but a 3-byte subset that doesn't support all characters. The correct encoding to use in MySQL is 'utf8mb4'. For more information, see here or here.
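Whether a given piece of text would hit the 3-byte limit can be checked before it reaches the database. A rough sketch in Python, where `needs_utf8mb4` is a made-up name for illustration and not part of Jethro: any code point above U+FFFF takes 4 bytes in UTF-8, which is exactly what MySQL's 'utf8' cannot store.

```python
def needs_utf8mb4(text):
    """Return True if any character needs 4 bytes in UTF-8."""
    # Code points above U+FFFF fall outside the Basic Multilingual Plane.
    return any(ord(ch) > 0xFFFF for ch in text)


print(needs_utf8mb4("See you Sunday 😊"))
print(needs_utf8mb4("café, 日本語"))
```

Accented Latin letters and most CJK text fit within 3 bytes, which is why many databases ran for years without showing the problem.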
When I checked my Jethro database I found the database-level encoding was correct:
but table-level encodings were wrong:
Fixing this needs two changes:
1. Tables need converting from utf8 to utf8mb4.
2. The database connection needs to use utf8mb4 rather than utf8, as is currently hardcoded when Jethro constructs the connection dsn: jethro-pmm/include/jethrodb.php, line 53 in 269b7b4
Correspondingly, the fix requires two parts:
This requires MySQL 5.5.3 and above (released in 2010). I have adjusted the minimum MySQL version in README.md.
Note: I hope changing character sets isn't opening a can of worms. I've only tested on one (mariadb 10.3.32) instance. There may be more places where utf8 needs changing to utf8mb4. I hope other users can test and confirm that upgrades go smoothly.