Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change MySQL defaults from broken utf8 to fixed utf8mb4 #851

Closed
wants to merge 3 commits into from

Conversation

DHager
Copy link

@DHager DHager commented May 4, 2015

This is a conservative echo of #317 .

Essentially MySQL's uft8 character set is broken and does not support full UTF-8, and a better alternative, utf8mb4 has existed for about five years now. When a 4-byte UTF-8 character comes in for a utf8 table, by default MySQL will truncate the string and log a warning.

Insofar as Doctrine has any default configuration values for MySQL, I think utf8mb4 is a better, safer choice. If anybody is running MySQL <5.5.3 and has problems, they should have a fairly easy time figuring out what's going wrong due to how distinctive the string is. (The other way around, searching for utf8, is not.)

@doctrinebot
Copy link

Hello,

thank you for creating this pull request. I have automatically opened an issue
on our Jira Bug Tracker for you. See the issue link:

http://www.doctrine-project.org/jira/browse/DBAL-1224

We use Jira to track the state of pull requests and the versions they got
included in.

@Ocramius
Copy link
Member

Ocramius commented May 4, 2015

This seems reasonable to me, but only if an example string in a functional test is also being provided (in order to prove the validity of the change), especially around the data truncation you reported in the description

@fprochazka
Copy link
Contributor

Doctrine is already parsing the database version in order to create Platform objects. I would preffer a condition for this, like Nette did nette/http#28 + nette/database@7988663

@DHager
Copy link
Author

DHager commented May 5, 2015

@Ocramius: I don't think a test is worthwhile in this case, because there's no new behavior of Doctrine to test, nor any particular way to "solve" or react the issue in Doctrine. (Yet?) We'd just be writing tests to assert how third-party non-PHP code behaves.

Here's a simple way to validate the issue, with thanks to @mathiasbynens for the example problem-character:

-- Create table using utf8 with two columns, one problematic and one OK
CREATE TABLE `trunctest` (
    `utf8text` VARCHAR(255) CHARACTER SET 'utf8' COLLATE 'utf8_general_ci' NOT NULL,
    `mb4text` VARCHAR(255) CHARACTER SET 'utf8mb4' COLLATE 'utf8mb4_general_ci' NOT NULL
);

-- Hex is equivalent to 'foo𝌆bar' and makes things simpler 
SELECT UNHEX('666F6FF09D8C86626172') INTO @astralstring;

-- Will generate a warning, but still insert "foo" into the utf8 field
INSERT INTO trunctest (`utf8text`,`mb4text`) VALUE (@astralstring,@astralstring);

-- Now the two fields don't match up!
SELECT utf8text, mb4text, (char_length(utf8text) = char_length(mb4text)) as success FROM trunctest;

-- Cleanup
DROP TABLE `trunctest`;

Turning on MySQLs "strict" mode should promote the warning to an error, but that's not the default and people would still need to go through the effort of changing their tables if they need to support the problem-characters.

@DHager
Copy link
Author

DHager commented May 5, 2015

@fprochazka: I notice there's a MySQL57Platform class, do you mean creating a new one like MySQL553Platform that "layers on" the use of utf8mb4?

Alternately, it might be possible to initialize the Platform to a charset/collation default based on what we can glean from the database connection (i.e. with select @@character_set_server, @@collation_server;) ... But that introduces some new risk. For example, I bet a lot of existing installations are still set to latin1/latin1_swedish_ci, and having that pop in the schema tool is going to be a nasty surprise, since the new value is not a superset of what they used to be using.

@Ocramius
Copy link
Member

Ocramius commented May 5, 2015

I don't think a test is worthwhile in this case

A change is only justified by a test here, IMO: we show why this default is
being changed and the test should fail (and decisions should be taken) if
the change is reverted.

You are not really testing the functionality, you are just hardening
support by preventing regressions ;-)

Marco Pivetta

http://twitter.com/Ocramius

http://ocramius.github.com/

On 5 May 2015 at 03:56, Darien notifications@github.com wrote:

@fprochazka https://github.com/fprochazka: I notice there's a
MySQL57Platform class, do you mean creating a new one like
MySQL553Platform that "layers on" the use of utf8mb4?


Reply to this email directly or view it on GitHub
#851 (comment).

@DHager
Copy link
Author

DHager commented May 5, 2015

@fprochazka: I've got a side-branch that uses a MySQL553Platform where each platform can override the default collation/charset, but unfortunately it causes a bunch of problems with the AbstractMySQLPlatformTestCase, because it assumes all MySQL platforms will always use one and only one collation and charset between them.

@DHager
Copy link
Author

DHager commented May 5, 2015

Ah, yes, now the build-server is telling me something I should've remembered from my own work: utf8mb4 introduces a new quirk, where varchar(255) cannot be indexed because it's (potentially) too long for InnoDB.

It seems like there are a few different choices, turning on MySQL-specific behavior:

  1. Hardcode the default charset/collation, keeping it as utf8, with the lurking incompatibility issue compared the UTF8 used by the rest of the tech-stack.
  2. Hardcode the default charset/collation as utf8mb4, and make people using varchar(255) indexes either specify a charset in the options for that field, or specify a lower length.
  3. Create a system that inspects the database-state on startup to determine which to use, which may bite users who never changed their settings from latin1 and were depending on Doctrine to quietly always override it for them.
  4. Make it possible for users to specify--through PHP code--the database-wide charset/collation defaults.

@DHager
Copy link
Author

DHager commented May 5, 2015

@Ocramius, @fprochazka : At least for my immediate use-case, I think I've found a better solution with doctrine/DoctrineBundle#420 instead.

@DHager DHager closed this May 5, 2015
@Ocramius Ocramius reopened this May 6, 2015
@Ocramius
Copy link
Member

Ocramius commented May 6, 2015

Reopening as it is worth discussing the upgrade.

@MisatoTremor
Copy link
Contributor

As a follow up on the post about varchar(255):

This is because the maximum size of an InnoDB index is 767 bytes. Thus in 3 byte utf8 a varchar type field can be at max 255 chars long (255 * 3 -> 765), while in 4 byte utf8 it can only be 191 chars long (191 * 4 -> 764).

Starting with MySQL 5.6.3 you can circumvent this by configuring it with the setting innodb_large_prefix=ON (defaults is OFF) which itself needs innodb_file_format=Barracuda (default is Antelope) AND creating the table with the option ROW_FORMAT set to either DYNAMIC or COMPRESSED. The maximum index size will then be 3072 bytes, so varchar can be longer again.
Starting with MyQL 5.7.7 however innodb_file_format=Barracuda and innodb_large_prefix=ON will be the default values and even be removed as configurable options in a later release.

Maybe these things should be considered to build a stable solution for this.
I think creating a platform classes for at least 5.5.3 and 5.7.7 should be suitable.

What do you think @Ocramius?

@deeky666
Copy link
Member

deeky666 commented Jan 9, 2016

This one is rather tricky and considering the amount of risks I doubt we can easily change this in 2.x. Also I wonder IF we changed it somehow, how we can keep BC and consistency here. When users would start to upgrade they'd have new tables with charset utf8mb4 and old ones being stuck at utf8. So manual upgrade would be required here then. IMO the best option for Doctrine would be to not define a default charset and collation at all and let the user or database server decide about it. This might surely raise some other issues (like the index problem discussed) if a particular charset is used but the can of worms is open already anyways as you can define the default table charset and collation via table options.

@Ocramius
Copy link
Member

Ocramius commented Jan 9, 2016

IMO the best option for Doctrine would be to not define a default charset and collation at all and let the user or database server decide about it.

Disagree with this, according to history (remember latin1?)

@deeky666
Copy link
Member

deeky666 commented Jan 9, 2016

@Ocramius yeah what about it? Is it wrong to use latin1? For example in my company we are dealing with german customers only and everything is bound to latin1 from the customers side. We are sometimes having a hard job converting from latin1 to utf8 (our application and database default) forth and back. Also we could gain performance and disk space when using latin1 instead of utf8 not to mention sorting/comparing issues latin1_german2_ci is what we actually need for germanized applications.

@deeky666
Copy link
Member

deeky666 commented Jan 9, 2016

So what can we do in 2.x go get at least an intermediate solution? I tend to prefer the idea of introducing a MySql553Platform, making the default charset depend on the platform used (utf8mb4 for MySQL > 5.5.3) and have a decent upgrade instruction. As a change of the default charset only affects newly created tables where a custom charset is not defined by the user, this could maybe acceptable?

@realityking
Copy link

@deeky666 Even if all your customers in Germany, they might have a name that doesn't fit into latin1.

Storing non-legacy data in anything but Unicode really shouldn't be encouraged.

@mcfedr
Copy link

mcfedr commented Feb 15, 2016

I am very much in favour of this, there is small backwards compatibility issue that is worth talking about.

The default length of a string type is 255 in Doctrine, which fits in the maximum key length for InnoDB. If the default charset is going to be utf8mb the max key length is 191 characters.

This will cause the generated index SQL for many string columns to fail as it doesn't include key length.

@MisatoTremor
Copy link
Contributor

@mcfedr See #851 (comment) for details about this.

@ste93cry
Copy link
Contributor

ste93cry commented May 6, 2016

any news on this?

@ThaDafinser
Copy link

ThaDafinser commented Jul 21, 2016

Just also wanted to switch "easily", but got trapped by this issue.

Since there is already a Mysql57Platform...what about changing the default only there?(i know it's sadly not that easy for the mass)
https://github.com/doctrine/dbal/blob/master/lib/Doctrine/DBAL/Platforms/MySQL57Platform.php

//EDIT
For now i created this workaround:

Just register your custom platform in the config
'platform' => new \YOUR_SPACE\Doctrine\DBAL\Platforms\MySQL57PlatformCustom()

<?php
namespace YOUR_SPACE\Doctrine\DBAL\Platforms;

use Doctrine\DBAL\Platforms\MySQL57Platform;

/**
 * Custom Mysql57 Platform, to use utf8mb4 per default
 */
class MySQL57PlatformCustom extends MySQL57Platform
{

    protected function _getCreateTableSQL($tableName, array $columns, array $options = array())
    {
        // Charset
        if (! isset($options['charset'])) {
            $options['charset'] = 'utf8mb4';
        }

        // Collate
        if (! isset($options['collate'])) {
            $options['collate'] = 'utf8mb4_unicode_ci';
        }

        return parent::_getCreateTableSQL($tableName, $columns, $options);
    }
}

j0k3r added a commit to wallabag/wallabag that referenced this pull request Oct 8, 2016
Switch to utf8mb4 instead of utf8 because f*** MySQL
See doctrine/dbal#851
j0k3r added a commit to wallabag/wallabag that referenced this pull request Oct 11, 2016
Switch to utf8mb4 instead of utf8 because f*** MySQL
See doctrine/dbal#851
j0k3r added a commit to wallabag/wallabag that referenced this pull request Oct 11, 2016
Switch to utf8mb4 instead of utf8 because f*** MySQL
See doctrine/dbal#851
j0k3r added a commit to wallabag/wallabag that referenced this pull request Oct 22, 2016
Switch to utf8mb4 instead of utf8 because f*** MySQL
See doctrine/dbal#851
@mnapoli
Copy link
Contributor

mnapoli commented Jul 2, 2017

@ThaDafinser thank you for that piece of code, that's perfect!

And for those using MariaDB you will need a version > 10.2.

@Ocramius
Copy link
Member

Ocramius commented Jul 2, 2017

FWIW, this would be an awesome addition if it wasn't for mysql's awful index size limits :-\

If we provide it out of the box, then the varchar size should go down by default.

@mcfedr
Copy link

mcfedr commented Jul 2, 2017

@Ocramius Or generate index declarations with size specified, although that might be another can of worms

@mcfedr
Copy link

mcfedr commented Jul 17, 2017

Also adding to this conversation the 'correct' collate to use with utf8mb4 is utf8mb4_unicode_520_ci

@afoeder
Copy link
Contributor

afoeder commented Aug 9, 2017

@mcfedr do you have a source for the utf8mb4_unicode_520_ci statement?

@mcfedr
Copy link

mcfedr commented Aug 9, 2017

Unfortunately there isn't good information on the MySQL docs, but the best source I have found is the Wordpress bug tracker: https://core.trac.wordpress.org/ticket/32105#comment:3

Wordpress recently changed to using utf8mb4_unicode_520_ci by default. The key reason

The problem stems from MySQL's collation behaviour - it treats all Unicode Supplementary Characters (which emoji fall under) as being equivalent. It's not until MySQL 5.6, wich the addition of the utf8mb4_unicode_520_ci collation that this changes.

@ghost
Copy link

ghost commented Jan 25, 2018

What's the status on this 'defaults' merge, and utf8mb4 support in general?

Currently, in a new Symfony4 project

bin/console doctrine:phpcr:repository:init

returns

	Successfully registered system node types.
	Executing initializer: CmfRoutingBundle

	In ObjectManager.php line 859:
	  Error inside the transport layer: An exception occurred while executing 'SELECT id FROM phpcr_nodes WHERE path COLLATE utf8mb4_bin = ? AND workspace_name = ?' with params ["\/", "default"]:
	  SQLSTATE[42000]: Syntax error or access violation: 1253 COLLATION 'utf8mb4_bin' is not valid for CHARACTER SET 'utf8'

	In AbstractMySQLDriver.php line 121:
	  An exception occurred while executing 'SELECT id FROM phpcr_nodes WHERE path COLLATE utf8mb4_bin = ? AND workspace_name = ?' with params ["\/", "default"]:
	  SQLSTATE[42000]: Syntax error or access violation: 1253 COLLATION 'utf8mb4_bin' is not valid for CHARACTER SET 'utf8'

	In PDOStatement.php line 107:
	  SQLSTATE[42000]: Syntax error or access violation: 1253 COLLATION 'utf8mb4_bin' is not valid for CHARACTER SET 'utf8'

	In PDOStatement.php line 105:
	  SQLSTATE[42000]: Syntax error or access violation: 1253 COLLATION 'utf8mb4_bin' is not valid for CHARACTER SET 'utf8'

also noting

vendor/doctrine/dbal/UPGRADE.md

	## Creating MySQL Tables now defaults to UTF-8

	  If you are creating a new MySQL Table through the Doctrine API, charset/collate are now set to 'utf8'/'utf8_unicode_ci' by default. Previously the MySQL server defaults were used.

the current config includes,

config/packages/doctrine.yaml 
	...
	doctrine:
	    dbal:
	        default_connection: dev
	        connections:
	            dev:
	                ...
	                driver: 'pdo_mysql'
	                server_version: '5.7'
	                charset: utf8mb4
	                default_table_options:
	                    charset: utf8mb4
	                    collate: utf8mb4_unicode_ci
	            phpcr_dev:
	                ...
	                driver: 'pdo_mysql'
	                server_version: '5.7'
	                charset: utf8mb4
	                default_table_options:
	                    charset: utf8mb4
	                    collate: utf8mb4_unicode_ci                                                                                                                                              
	...

	doctrine_phpcr:
	    session:
	        backend:
	            type: doctrinedbal
	            connection: phpcr_dev
	...

@mcfedr
Copy link

mcfedr commented Jan 25, 2018 via email

@ghost
Copy link

ghost commented Jan 25, 2018

@mcfedr
mixing how/where?
both DBs have

	                charset: utf8mb4
	                default_table_options:
	                    charset: utf8mb4
	                    collate: utf8mb4_unicode_ci

consistent with MariaDB defaults ...

@mcfedr
Copy link

mcfedr commented Jan 25, 2018 via email

@ghost
Copy link

ghost commented Jan 25, 2018

@mcfedr
These are 'clean installed' DBs ... so far nothing other than install & (attempts to) initialize, etc.

EDIT: I think my issue better belongs here:
doctrine/DoctrinePHPCRBundle#310

Base automatically changed from master to 4.0.x January 22, 2021 07:43
@morozov
Copy link
Member

morozov commented Jul 9, 2021

The change is obsolete as of #4644.

@morozov morozov closed this Jul 9, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 20, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.