Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mysql mb4 #6409

Open
wants to merge 28 commits into
base: release-2.1
Choose a base branch
from
Open

Mysql mb4 #6409

wants to merge 28 commits into from

Conversation

sbulen
Copy link
Contributor

@sbulen sbulen commented Dec 14, 2020

Partial for #7173
Fixes #6405
Fixes #7776
Fixes #5108

Overview...
Mysql charsets supported should be utf8 (for 2.1 users migrated before utf8mb4 support) and utf8mb4. The only difference between the two is entity encoding needed for 4-byte utf8 chars when not yet utf8mb4. pg supports mb4 already.

All conditional utf8 logic (utf8 vs iso-8859) is removed.

Current thinking:

  • Install/upgrade to utf8mb4
  • Needs new 'convert to mb4', like 2.0 had with utf8
  • Yes, db changes are necessary, index size constraints (*** Can be avoided if we have change min mysql version to 5.7 & mandate InnoDB...***)
  • Stripped non-utf8/utf8mb4 logic wherever found; assume utf8/utf8mb4 ONLY going forward in 2.1; Making it easier to refactor down the road

191...

  • Where does 191 come from? (int) 767/4... Index size max varies based on innodb vs myisam & the innodb row format, which changed over time... Some configs/versions, the max is 767, for some, it's 1000, & for the rest is 3072. 767 is a safe assumption.
  • To do a good a/b test, you should emulate a prefix length of 767. You can do this by using innodb tables & changing innodb_default_row_format to 'compact', which was the default prior to mysql 5.7. (Or, use MyISAM, which has a limit of 1000.)

Trying to keep field meanings straight...

  • global_character_set - used in http headers "UTF-8"; had been used as a flag whether utf8 conversion had been run (no change)
  • $txt['lang_character_set'] - used in http headers "UTF-8" (no change)
  • $context['character_set'] = derived from the above 2 settings, used in http headers (no change)
  • $db_character_set = "utf8" or "utf8mb4" (requires changes throughout...)
  • The above 4 really need to be in sync...
  • $utf8 & $context['utf8'] = flags used throughout code to determine utf8/utf8mb4 string support - most often used to add u modifier to regexes, derived from above (NO LONGER NEEDED requires changes throughout...)
  • $db_mb4 = flag for whether mb4 is supported; set after utf8mb4 conversion (or pg in use)... Only used in entity conversion. This really should derived in load.php, not set by user.

Read up on index prefixes here:
https://dev.mysql.com/doc/refman/8.0/en/column-indexes.html
https://dev.mysql.com/doc/refman/8.0/en/innodb-limits.html#:~:text=The%20index%20key%20prefix%20length%20limit%20is%20767%20bytes%20for,4%20bytes%20for%20each%20character.

mb4 support history:
https://mysqlserverteam.com/mysql-8-0-when-to-use-utf8mb3-over-utf8mb4/

Mysql usage & support by version:
https://www.eversql.com/mysql-8-adoption-usage-rate/
https://fromdual.com/support-for-mysql-from-oracle

@pr-triage pr-triage bot added the PR: draft label Dec 14, 2020
@sbulen sbulen added MySQL Database Charset/Encoding UTF8 & mb4 encoding related issues labels Dec 14, 2020
@albertlast
Copy link
Collaborator

In my eyes would be this a huge uplift to redurce the complexity,
by force everthing to be utf8mb4 AND to raise the min mysql version to 5.5.

So I like the pr.

@MissAllSunday MissAllSunday mentioned this pull request Dec 15, 2020
@Underdog-01
Copy link
Contributor

I see that in your changes are using utf8mb4_general_ci and I would like to propose using utf8mb4_unicode_ci.
"general" does not use the unicode rules for sorting and may have unexpected results when used with various languages.

The added performance of using general over unicode makes little difference with modern CPU & memory power/size.

@Underdog-01
Copy link
Contributor

Also text fields may be affected in some circumstances.
They may require being changed to their upper tiered counterpart.
(ie. text to mediumtext)

@sbulen
Copy link
Contributor Author

sbulen commented Jan 18, 2021

Also text fields may be affected in some circumstances.
They may require being changed to their upper tiered counterpart.
(ie. text to mediumtext)

Pretty sure we're good here... The settings are now by characters, not bytes.

The problems experienced db-structure-wise were the index lengths...

@sbulen
Copy link
Contributor Author

sbulen commented Jan 18, 2021

I see that in your changes are using utf8mb4_general_ci and I would like to propose using utf8mb4_unicode_ci.
"general" does not use the unicode rules for sorting and may have unexpected results when used with various languages.

The added performance of using general over unicode makes little difference with modern CPU & memory power/size.

I did some tests & couldn't find an example where there was a difference.

Can you identify a scenario?

I understand the benefit in theory, just not sure it's really there...

@sbulen
Copy link
Contributor Author

sbulen commented Jan 18, 2021

& THANK YOU for the feedback. I think this is an important topic.

@Underdog-01
Copy link
Contributor

It's data sorting for various languages that will present obstacles when using general.
A brief explanation with some examples can be found on the MySQL forum:
https://forums.mysql.com/read.php?103,187048,188748

@Underdog-01
Copy link
Contributor

Underdog-01 commented Jan 18, 2021

BTW I agree that this is a very important topic.
The MySQL website states that utf8mb3 is deprecated and in the very near future will be removed
Look at the first "Note" here:
https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html

@sbulen sbulen mentioned this pull request Feb 4, 2021
@live627 live627 added this to the The future milestone Mar 2, 2021
@jdarwood007
Copy link
Member

As per my other comments on this. This is a big change in how we handle the database layer. As such its too late in 2.1 to make this. So this should not make 2.1 and be closed out. Future versions of SMF we should make this from the get go with support. Of course this means that future versions of SMF will need to support the conversion to mb4 either manually or during the upgrade process.

@live627
Copy link
Contributor

live627 commented Mar 21, 2021 via email

@albertlast
Copy link
Collaborator

in my eyes smf 2.1 should run only in mb4 mode and should drop mb3 support,
since it get again mysql and pg similiar and
you don't had the struggle with newer mysql version where mb4 get the new default coalation when you define a column as utf8.

@sbulen
Copy link
Contributor Author

sbulen commented Mar 22, 2021

As per my other comments on this. This is a big change in how we handle the database layer. As such its too late in 2.1 to make this. So this should not make 2.1 and be closed out. Future versions of SMF we should make this from the get go with support. Of course this means that future versions of SMF will need to support the conversion to mb4 either manually or during the upgrade process.

I disagree with you on this. SMF2.1 should be mb4, IMO. But we've waited too long & we are appropriately maintaining scope. This is appropriate for 2.1.x.

There is nothing different at all in how we handle the db layer, no changes in DB calls at all. This PR mirrors how we dealt with mb3 in 2.0.

This PR is about 10% fixing bugs, 70% cleaning up old latin1 logic that should have already been removed from 2.1, and 20% migrating to mb4. But we agreed to wait after 2.1, so we are.

@jdarwood007
Copy link
Member

Well I do disagree with that because of all the alterations to the core functionality of code. But I will stand by my statement made on others. This has to make RC4 or its not making 2.1. We can't introduce this between RC4 and final.

@jdarwood007
Copy link
Member

Fyi, this is also showing as a draft PR.

@sbulen
Copy link
Contributor Author

sbulen commented Mar 23, 2021

It's marked as a draft because the dev team had agreed to wait until after 2.1 goes live for mb4 support. I want to ensure it doesnt get merged prematurely.

other/Settings.php Show resolved Hide resolved
Sources/ManageMaintenance.php Show resolved Hide resolved
@sbulen
Copy link
Contributor Author

sbulen commented Sep 24, 2021

I brought this current, in case folks wanted to test. I incorporated most feedback above. This required a rebase due to many layers of conflicts.

I have not yet responded to @live627's feedback on changes interfering with postgresql. In my testing, it hasn't - this works fine in pg as-is. Still trying to understand why, though...

@albertlast
Copy link
Collaborator

I brought this current, in case folks wanted to test. I incorporated most feedback above. This required a rebase due to many layers of conflicts.

I have not yet responded to @live627's feedback on changes interfering with postgresql. In my testing, it hasn't - this works fine in pg as-is. Still trying to understand why, though...

Exists any issue with pg?

@sbulen
Copy link
Contributor Author

sbulen commented Sep 24, 2021

I haven't seen any issues with pg yet. Still early in testing, though.

@albertlast
Copy link
Collaborator

From my pov the whole topic didn't exists in pg.

Signed by Shawn Bulen, bulens@pacbell.net
Copy link
Member

@jdarwood007 jdarwood007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approving this on the grounds the code looks good. But we need to get it tested. I don't have any non english forums to test on (other than the obvious). So we need to work on ensuring stability of this and that it converts just fine.

@albertlast
Copy link
Collaborator

Smf.org would be a nice place 😀

@sbulen
Copy link
Contributor Author

sbulen commented Jul 9, 2023

I did a ton of testing, but like, 2.5 years ago...

It needs to be repeated with the current codebase.

@jdarwood007
Copy link
Member

@albertlast Of course we will, once we feel its stable enough for a production environment. We eat our own food.

Signed by Shawn Bulen, bulens@pacbell.net
@sbulen
Copy link
Contributor Author

sbulen commented Jul 12, 2023

Just fixed an issue where log_search_subjects was still getting inappropriately truncated (byte-based substr).

@sbulen
Copy link
Contributor Author

sbulen commented Jul 12, 2023

I am still re-testing this one.

Just found an issue with upgrading old forums - mainly due to index renames over the years.

@jdarwood007
Copy link
Member

Would it seem logical to track all columns that would be affected in the index, convert, then drop and recreate that index? Its a bit of a long way around, but we can find what columns are in a index.

@sbulen
Copy link
Contributor Author

sbulen commented Jul 13, 2023

This PR currently drops & recreates the indexes. Both in the maint function & in the upgrader. The upgrader already tries various old names. The maint function needs to do so as well.

Also, we have a new index added after this PR (and before 2.1 release) that needs tweaking as well...

@jdarwood007
Copy link
Member

What happens for custom indexes that may not be included in this? Will the conversion fail? If it fails, I was thinking we track all indexes where we can find columns that are converted, then rebuild those indexes with a drop/create.

@sbulen
Copy link
Contributor Author

sbulen commented Jul 13, 2023

Hmmm... Custom indexes might be out there. For that matter, custom columns with custom indexes might be problematic also.

To rebuild the indexes, we'd have to save off the index build info (parse it from SHOW CREATE TABLE), tweak it, & recreate the indexes. This is feeling pretty complex, for what I expect is an edge case.

My take is that covering the known variants covers us 99.9%+ of the time. The folks who have customized their DB to that degree will need to complete the conversion manually. Kinda the price to pay for making said customizations...

Signed by Shawn Bulen, bulens@pacbell.net
@@ -226,6 +226,8 @@ function smf_db_create_table($table_name, $columns, $indexes = array(), $paramet
$table_query .= ') ENGINE=' . $parameters['engine'];
if (!empty($db_character_set) && $db_character_set == 'utf8')
$table_query .= ' DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci';
else
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe an esle if would be better here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's either 'utf8' (currently an alias of 'utf8mb3') or 'utf8mb4'. These are the only two options at this point. And as noted in the description to this PR, we are defaulting to 'utf8mb4' going forward.

@jdarwood007
Copy link
Member

jdarwood007 commented Jul 13, 2023

Hmmm... Custom indexes might be out there. For that matter, custom columns with custom indexes might be problematic also.

I don't know about custom columns. That wouldn't affect this because we didn't convert that column to mb4.

To rebuild the indexes, we'd have to save off the index build info (parse it from SHOW CREATE TABLE), tweak it, & recreate the indexes. This is feeling pretty complex, for what I expect is an edge case.

We have database tools that can list indexes. $smcFunc['db_list_indexes'] does what you need I think.

@sbulen
Copy link
Contributor Author

sbulen commented Jul 16, 2023

I'll take a peek at recreating all indexes...

It's feeling like it's getting too complex for an edge case, though.

Note the CONVERT command converts all columns, and we are stepping thru all tables.

Signed by Shawn Bulen, bulens@pacbell.net
@jdarwood007
Copy link
Member

Understable it can get complex with edge cases. Which is why I wondered what happens for custom indexes. I need to setup a few to test. I suspect they would be corrupted and not work. But that converting the table should still happen? Or do they need to drop first for the conversion to occur?

@sbulen
Copy link
Contributor Author

sbulen commented Jul 17, 2023

I don't think the tables or indexes would get corrupted at any point.

If there is a custom index that meets the criteria (on any column, custom or not), the conversion will error out, & processing will stop. The table will not be converted. The index will need to be removed, the process restarted, and the index re-added after completion.

I doubt we will see a lot of custom indexes that meet the exact criteria. Probably none. Again, I consider this an unlikely edge case.

@Sesquipedalian
Copy link
Member

Note: This PR targets release-2.1, but it has been assigned to the milestones for SMF 3.0. The PR will need to be updated and resubmitted to target release-3.0. I am leaving it open for now so that it doesn't get forgotten.

@Sesquipedalian
Copy link
Member

Further note: Much has changed in SMF 3.0, and I don't want to lay the burden for reimplementing all of this on @sbulen, so I have assigned #7938 to myself. I will use this PR as a reference, since @sbulen has already thought through all the things that need to be done.

@sbulen
Copy link
Contributor Author

sbulen commented Jun 30, 2024

I think the new approach should be more along the lines of what was discussed over in #7938 than the approach used here.

Specifically, if we can ensure InnoDB and COMPACT rows, then there is no need for a 191 byte limitation on the indexes. The DB structures themselves don't need to change, the DB engine & storage traits do. The real question is the best way to do that.

I suggest that as a pre-upgrader step, outside of the normal upgrade process, so we don't end up overloading the upgrader logic as we did in 2.1. Such a pre-upgrader can also check for missing standard columns, missing default values, etc., the types of things that tend to bork the upgrader.

Also:

  • UTF8MB4_unicode_ci
  • Ripping out all the old ISO-8859 logic in various places. You should be able to grep it & only find any references in the upgrader (if that...).
  • Similar with regex - all the logic that chooses between ISO-8859 & utf8 needs to go (like in this PR); all regexes need the "u" going forward - all those ternaries should go...
  • No need for a utf8 conversion form & logic (like this PR has)

Last, & possibly most important - we should get rid of all of the entity-encoding logic for multi-byte characters altogether. All those $smcFunc calls should go - it's overly complicated & we are always chasing bugs in it.

We should be storing MB3/4 chars as chars, not entities. That's the point.

@Sesquipedalian
Copy link
Member

Sesquipedalian commented Jun 30, 2024

Excellent points all around. But in this PR you did ping pretty much all of the different bits that need to be addressed. Even if we address them somewhat differently for 3.0, we still need to address all these things. So it's excellent reference material.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Charset/Encoding UTF8 & mb4 encoding related issues Database MySQL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Emojis cannot be used in SMF logo fields Search issue - multi-byte words truncated utfmb4 database
6 participants