Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#432 is highly incompatible => add additional flag #574

Open
cohomology opened this issue Nov 22, 2024 · 10 comments
Open

#432 is highly incompatible => add additional flag #574

cohomology opened this issue Nov 22, 2024 · 10 comments

Comments

@cohomology
Copy link

cohomology commented Nov 22, 2024

Hi,

You introduced an incompatible change in PCRE2 with #432, which caused many of our unit tests to fail.

Would it be possible to introduce a compile flag to restore the old behavior? If so, I could supply a patch as well.

Couldn't this affect also many other software products and trigger subtle bugs?

Thanks
Kilian

@carenas
Copy link
Contributor

carenas commented Nov 22, 2024

Couldn't this affect also many other software products and trigger subtle bugs?

PCRE2 aims to be compatible with Perl, so technically that change was a bugfix.

I think preserving the old behaviour might be possible by using an EXTRA flag, but could you elaborate on why that behaviour was preferred for your use case?

@cohomology
Copy link
Author

cohomology commented Nov 22, 2024

We are authors of a programming language. We don't write the regex or the subject, but the users of the programming language do. We don't know what regular expressions they write.

This programming language is compatible along releases. So the same program should give the same result with the new compiler. This is the standard in all programming languages as well (with minor exceptions like Python 2 => 3).

Do you know how other programming languages like PHP consume PCRE2? I guess they also try to be compatible along releases and do not invalidate user code. Do they fork PCRE or do they stay on the same version forever, ignoring all bugfixes?

The best and most modern approach would be to use feature flags throughout. Is that an option?

The same problems of course also hold for new features, as they also change the behavior of existing programs.

We would be very unhappy to actively monitor all changelogs of PCRE, check whether the change was incompatible or not or if it introduced a new feature or not, or if it is just a bugfix. We would than have to fork our own version of PCRE2, cherry-picking or feature-flagging all changes. Over time, this version of PCRE will be quite drastically different from upstream. Really unsure what the correct approach would be here.

In this particular case, I also feel that the old behavior was more consistent. But that's just personal taste.

@cohomology
Copy link
Author

cohomology commented Nov 22, 2024

I have seen PHP doesn't seem to mind invalidating user code, see: php/php-src#13413

@carenas
Copy link
Contributor

carenas commented Nov 23, 2024

We are authors of a programming language

Just curious, but who is we and which language?

PCRE2 is used in several and some have their own per language compatibility flag that might come handy on that upcoming patch of yours.

I also feel that the old behavior was more consistent. But that's just personal taste.

Philip seemed to agree with you, so I am sure something implementing PCRE2_EXTRA_ALT_CASELESS_BSP or something more properly named is likely to have no problem getting merged

@PhilipHazel
Copy link
Collaborator

Indeed, a new option seems like the best way out of this, and that seems like the appropriate name (similar to other option names). As you can see from the comments in #432, the change was made in response to user complaints. We try to please everybody, but sometimes an option is the only way to do it. Awaiting your patch....

@PhilipHazel
Copy link
Collaborator

Note that the change also fixes a bug in POSIX classes. Previously, [:upper:] (for example) matched only upper case when PCRE2_UCP was set, but matched both cases when PCRE2_UCP was not set. This is because [:upper:] is translated to \p{Lu} when PCRE2_UCP is set. I suppose for consistency [:upper:] should always behave in the same way as \p{Lu}.

@NWilson
Copy link
Contributor

NWilson commented Nov 24, 2024

We are authors of a programming language. We don't write the regex or the subject, but the users of the programming language do. We don't know what regular expressions they write.

So, what you actually have is a hypothetical problem: you're worried that some user's regex might be broken when you release the new behaviour.

That's quite a sensible worry: but we should still calibrate our mental "risk model" to see this as a fairly low-impact bugfix (brining PCRE2 into alignment with Perl and other regex engines). Keeping the old idiosyncratic behaviour has its own compatibility risks for new applications (whereas moving to the new behaviour has compatibility risks for old existing applications).

Do you know how other programming languages like PHP consume PCRE2? I guess they also try to be compatible along releases and do not invalidate user code. Do they fork PCRE or do they stay on the same version forever, ignoring all bugfixes?

I believe most downstream consumers just move to the latest releases, and rely on PCRE2 to be sufficiently stable that they don't need to worry. In this case... it's borderline whether the /\p{Lc}/i bugfix is an acceptable risk.

The same problems of course also hold for new features, as they also change the behavior of existing programs.

That's out of scope for PCRE2's backwards compatibility.

  • We will never ever break the library ABI.
  • We will change the behaviour of existing regexes with caution (namely, only bugfixes for incorrect match behaviour). Some bugfixes may have feature flags to keep the old behaviour. Others are too minor to care. All such changes should be flagged in release notes.
  • However, we absolutely reserve the right to extend PCRE2 with new features in future releases. A string which is not a valid regex in old PCRE2 may become accepted as valid in a future release. I do not think we regard this as a breaking change (nor do other programming languages).

(Philip and Zoltan, correct me if I'm wrong!)

@zherczeg
Copy link
Collaborator

Please understand, that PCRE2 is not funded in any way, so developers spend their free time on it. There are things we do, but spending a lot of effort on minor things is not considered a valuable effort (at least from my side). Adding flags to everything increases the maintenance costs, and I would avoid that. I know people often consider regex as something simple, and doing these requests are simple tasks, but I would be curious how Nick's opinion changed about the project before and after he joined it (from a newcomers perspective).

@cohomology
Copy link
Author

@carenas The ABAP programming language. It is used to run large enterprise software and has very strict compatibility requirements.
@zherczeg I totally understand. This is why I proposed to supply a patch myself. I would only add feature flags if the behavior of existing regular expressions is changed totally in an incompatible way. This happens very seldom, I guess. Being compatible is a value of itss own and makes PCRE more useful in real-world applications. I agree that adding flags for every new feature is not acceptable for such a project.

We are still discussing, whether we can change the behavior also in ABAP. Searching \p{Lu} with "i" flag is quite obscure. Probably no one did that.

@carenas
Copy link
Contributor

carenas commented Nov 25, 2024

Searching \p{Lu} with "i" flag is quite obscure.

don't forget also the effect it has in other related properties and even potencial aliases, as pointed out by Philip in the case of [:upper:] under UCP mode.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants