Skip to content

Undocumented mb_substr behaviour in PHP 8.3? #14703

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jahyvari opened this issue Jun 28, 2024 · 13 comments
Closed

Undocumented mb_substr behaviour in PHP 8.3? #14703

jahyvari opened this issue Jun 28, 2024 · 13 comments

Comments

@jahyvari
Copy link

Description

The following code:

<?php
var_dump(
    mb_convert_encoding(
        mb_substr(
            mb_convert_encoding(
            	"test ääöö",
            	"ISO-8859-15",
            	"UTF-8"
            ),
            0,
            null,
            "UTF-8"
        ),
        "UTF-8",
        "ISO-8859-15"
    )
);

Resulted this output in PHP 8.2:

string(13) "test ääöö"

But resulted this output in PHP 8.3:

string(9) "test ????"

Based on this, did the pre PHP 8.3 mb_substr return the string always in the "raw" binary form, but in PHP 8.3 it will return the string using the passed encoding, and therefore could lead to an encoding conversion?

PHP Version

PHP 8.3.8

Operating System

No response

@youkidearitai
Copy link
Contributor

Yes, many behavior changes in mb_substr in PHP 8.3. Invalid UTF-8 character (0xE4 0xE4 0xF6 0xF6) is convert to ???? (0x3F 0x3F 0x3F 0x3F). Therefore, mb_substr returns ????.
https://3v4l.org/9U5tW

@alexdowad Could you please confirm?

@damianwadley
Copy link
Member

Well, that code is clearly fundamentally incorrect. It tries to

  1. Convert a (UTF-8?) string from UTF-8 to ISO-8859-15
  2. Do a substr on that string treating it as if it was UTF-8, which it is not
  3. Convert it back from ISO-8859-15 to UTF-8

Look at the intermediate steps: https://3v4l.org/dRaZH

I don't see anything in the changelog for 8.3.2 so I suspect there was some refactoring done in mbstring that unintentionally fixed this - perhaps the earlier implementation didn't do any processing in the case of a substr over [0, null) but it does now.

Current behavior is correct, though: fix the mb_substr step to use the right encoding and the output is as expected.

@jahyvari
Copy link
Author

Yes, the code is intentionally incorrect for this example, hence was asking about documentation, was not arguing against the change, which makes sense :)

From our codebase found a function which was accidentally working from PHP 5.3 to 8.2, because it was relaying on this old behaviour while manipulating legacy encodings.

@youkidearitai
Copy link
Contributor

mb_substr is performance improvement at 0c0774f. Perhaps it is a side effect.
Anyway, let's wait Alex's reply.

@alexdowad
Copy link
Contributor

mb_substr is performance improvement at 0c0774f. Perhaps it is a side effect. Anyway, let's wait Alex's reply.

@youkidearitai is right; the updated mb_substr is stricter when given strings with invalid encoding.

@jahyvari Thanks for letting us know about your experience. Maintaining mbstring has given me a deep appreciation for Hyrum's Law. It is incredibly difficult to tell which minor bug fixes may accidentally break someone's code, which relied on the buggy behavior. In this case, I probably should have known that the bug fix was big enough that it should have been called out in the release notes. I am sorry about that.

@DAdq26
Copy link

DAdq26 commented Sep 4, 2024

Hi team,

Up on this topic,
With simply trema replacement like that https://3v4l.org/aiIvs

<?php 
var_dump(
    mb_convert_encoding(
        mb_substr(
            mb_convert_encoding(
            	"¨l",
            	"ISO-8859-1",
            	"UTF-8"
            ),
            0,
            null,
            "UTF-8"
        ),
        "UTF-8",
        "ISO-8859-1"
    )
);

Output were different,

Output for 8.3.2 - 8.3.11
string(2) "?l"
Output for 8.1.0 - 8.1.29, 8.2.0 - 8.2.23, 8.3.0 - 8.3.1
string(3) "¨l"

It's really an issue for us and make us in trouble with mb_substr usages...
Result is not consistent here

Can you inestigate on your side?

Is it related to this changelog 8.3.2...?

image
https://php.watch/versions/8.3/releases/8.3.2

What does it mean ?

Best

We use PHP 8.3.10 for now

For more example

https://3v4l.org/ZOvX2

<?php
$mystring = mb_convert_encoding("¨l", 'windows-1252', 'UTF-8');
var_dump($mystring);
var_dump(substr($mystring, 0, 1));
var_dump(mb_substr($mystring, 0, 1, "UTF-8"));

Output for 8.3.2 - 8.3.11
string(2) "�l"
string(1) "�"
string(1) "?"
Output for 8.1.0 - 8.1.29, 8.2.0 - 8.2.23, 8.3.0 - 8.3.1
string(2) "�l"
string(1) "�"
string(1) "�"

On the other way

<?php
$mystring = mb_convert_encoding("¨l", 'UTF-8', 'windows-1252');
var_dump($mystring);
var_dump(substr($mystring, 0, 1));
var_dump(mb_substr($mystring, 0, 1, "UTF-8"));

Output for 8.1.0 - 8.1.29, 8.2.0 - 8.2.23, 8.3.0 - 8.3.11
string(5) "¨l"
string(1) "�"
string(2) "Â"

Without string conversion

<?php
$mystring = "¨l";
var_dump($mystring);
var_dump(substr($mystring, 0, 1));
var_dump(mb_substr($mystring, 0, 1, "UTF-8"));

Output for 8.1.0 - 8.1.29, 8.2.0 - 8.2.23, 8.3.0 - 8.3.11
string(3) "¨l"
string(1) "�"
string(2) "¨"

It's really not consistent when the string is manipulated before, even if it did not make the job that i want here, the mb_substr behaviour did not make his job ...

@youkidearitai
Copy link
Contributor

@DAdq26
This change is bugfix in ec348a1 . Anyway, "¨l" of character encoding is UTF-8, not ISO-8859-1, mb_substr does not handle wrong encoding binary correctly.

@DAdq26
Copy link

DAdq26 commented Sep 5, 2024

Hi @youkidearitai
Thank you for your quick answer.
I understand the buggy thing that started this discussion but the point i noticed is the change of behavior between PHP 8.x and 8.3 in feedback of these methods
Note: In french language trema (¨) is common to both windows and iso encoding not utf-8 (sure in utf-8 this char is a dual octet encoding) so is not a wrong encoding in this case
image
http://www.jchr.be/html/caracteres.htm
https://www.compart.com/fr/unicode/charsets/containing/U+00A8

This change is really problematic https://3v4l.org/ZOvX2 between 8.3.1 and next because our unittests failed and we are not confident on mb_functions feedback

Sorry if i did not understand perfectly what you mean...

Best

@alexdowad
Copy link
Contributor

@DAdq26 Thanks for letting the PHP team know about the problem you are having. It is also good that you shared the PHP code which is not behaving as you expected.

The code you shared is incorrect, and if you fix it, you will likely find that it works better. If not, always feel free to discuss further.

The problem with your code is that you are taking a UTF-8 string, converting it to ISO-8859-15, and then telling mb_substr that it is still a UTF-8 string, which is wrong. After conversion, the string is not UTF-8 any more. It is ISO-8859-15. So the correct code is:

<?php 
var_dump(
    mb_convert_encoding(
        mb_substr(
            mb_convert_encoding(
            	"¨l",
            	"ISO-8859-1",
            	"UTF-8"
            ),
            0,
            null,
            "ISO-8859-1"
        ),
        "UTF-8",
        "ISO-8859-1"
    )
);

@DAdq26
Copy link

DAdq26 commented Sep 5, 2024

Thanks @alexdowad right, you are on the right way but really, the difference on the final output is not stable for us.
I think you could understand the behavior of mb_functions were on a specific way till 8.3.1 so since many php versions, our app is oriented on this feature.
Now this particular issue (lost in github for us) force us to arrange our code without a proper answer.

You said mb_substr does not handle wrong encoding binary correctly. as developer we can not say that to our customer...

Best

@alexdowad
Copy link
Contributor

alexdowad commented Sep 5, 2024

You said mb_substr does not handle wrong encoding binary correctly. as developer we can not say that to our customer...

I'm sorry that my answer has led to a misunderstanding. It is not true that "mb_substr does not handle wrong binary encodings correctly".

Over the last couple of years, mb_substr (and other mbstring functions) have moved in the direction of handling wrong text encodings more consistently and logically. Also, in some cases (especially for UTF-8), the handling of wrong text encoding now complies with relevant standards. So in a number of ways, mbstring has improved in this area.

However, it occasionally happens that a user prefers the old behavior. Often this is a sign that their code may have bugs, and when the bugs are fixed, they will not have any problem. However, in extremely unusual cases, there might actually be good reasons to prefer the old behavior.

The maintainers are very willing to look at such rare cases and advise what the best workaround is; but the above code sample is not such a case. The code sample that you shared is simply buggy.

In conclusion:

  1. Some changes have occurred in how mbstring handles faulty, unintended input. This is intentional, and the changes are better for most users.
  2. I understand that right now you are experiencing a problem due to the change, and I sympathize.
  3. The maintainers are willing to help you figure out what to do, but you need to make the effort to analyze the issue and present clear information helping the maintainers understand your problem. It is not enough to simply show that the output of an mbstring function is different between PHP 8.x and 8.y (due to point 1).
  4. While you are doing that analysis, it is very reasonable to remain on PHP 8.3.10 as a temporary measure.

@DAdq26
Copy link

DAdq26 commented Sep 5, 2024

Many thanks for your time @alexdowad
Be sure we understand your point of view, simply take into account the fact of this small issue could generate trouble for further implementation in the future.
We are conscient php was a bit too wide opened by the past and now you need to make it stronger version after version.

Thanks for your job!

best

@ramsey-breakaway6
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants
@youkidearitai @alexdowad @jahyvari @damianwadley @DAdq26 @ramsey-breakaway6 and others