Non-latin symbols in slug does not work anymore? #288

Closed
mord0d opened this issue Jan 23, 2025 · 5 comments

@mord0d

mord0d commented Jan 23, 2025

I'm not sure if this is a bug or feature…

Some time ago it worked fine, I have a lot of articles with non-latin symbols in their slugs, but today I edited an article and Chyrp Lite doesn't allow me to save/publish it.

Is this how it was intended?

@xenocrat
Owner

Hello there,

This is an intended behaviour change. It was changed in the last release (2024.03) and was mentioned in the release notes. Slug rules for posts and pages were always intended to be a-z, 0-9, and hyphen only, but were for a long time not being enforced properly.

As mentioned in the release notes, a workaround when editing existing posts and pages is to empty the slug field to preserve the existing slug on update. If you want the slugs to be relaxed globally for all posts, pages, categories, and tags, you can change a constant in includes/common.php - but think carefully before doing this, because it has side-effects.
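
The strict rule and the empty-field workaround described above can be sketched in a few lines. This is a minimal Python illustration, not Chyrp Lite's PHP code: the names `STRICT_SLUG` and `slug_on_update` are hypothetical, and the pattern is only an assumed reading of "a-z, 0-9, and hyphen only".

```python
import re

# Assumed reading of the rule "a-z, 0-9, and hyphen only".
STRICT_SLUG = re.compile(r"[a-z0-9-]+")


def slug_on_update(existing: str, submitted: str) -> str:
    """Return the slug to store when a post is updated (hypothetical helper)."""
    if submitted == "":
        # Workaround: an empty field keeps the old slug, even if that
        # slug would no longer pass the strict check.
        return existing
    if STRICT_SLUG.fullmatch(submitted) is None:
        raise ValueError(f"invalid slug: {submitted!r}")
    return submitted
```

Under these assumptions, `slug_on_update("привет", "")` keeps the existing non-Latin slug, while resubmitting `"привет"` in the field is rejected.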

@mord0d
Author

mord0d commented Jan 25, 2025

> mentioned in the release notes

Ugh, sorry, I've skipped this release.

> Slug rules for posts and pages were always intended to be a-z, 0-9, and hyphen only

Are there any reasons behind this decision?

I've seen #287 and it's a good start (this PR adds only the Russian subset of Cyrillic, but there are many more letters in Ukrainian, Serbian, Kazakh, and many other languages). It will work fine for many languages… but if someone adds Chinese, Vietnamese and/or other Asian scripts, which have a lot of (multibyte!) symbols, helpers.php will grow drastically and everything will slow down.

@cuixiping has already suggested using URL encoding. It's a much cheaper operation, and browsers have long been able to convert it back to human-readable text (but I'm not sure whether it works in the opposite direction).

@xenocrat
Owner

xenocrat commented Jan 25, 2025

Chyrp Lite has followed the philosophy that URLs should be able to survive transit through multiple potentially misbehaving systems, by eliminating all chars that a system might naively try to escape or convert. URL encoding breaks that, because a naive system might try to escape the percent signs and you'll end up with everything double-encoded. This philosophy has always applied to categories and tags. The help documentation for posts and pages states that it applies to them too - but I realised last year that it was not being properly enforced, hence the change to align the behaviour with the docs and to standardise the behaviour across posts, pages, tags and categories.
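
The double-encoding failure mentioned above is easy to reproduce. The following is a small Python sketch using the standard library's `urllib.parse`; the example slugs are made up for illustration and nothing here is Chyrp Lite code.

```python
from urllib.parse import quote, unquote

slug = "привет-мир"  # made-up non-Latin slug

once = quote(slug)   # what a well-behaved system would emit
twice = quote(once)  # a naive system escaping the percent signs again

print(once)
print(twice)

# Decoding the double-encoded form once no longer recovers the slug.
assert unquote(once) == slug
assert unquote(twice) == once
assert unquote(twice) != slug

# A slug restricted to a-z, 0-9 and hyphen passes through unchanged,
# no matter how many times it is escaped.
ascii_slug = "privet-mir"
assert quote(quote(ascii_slug)) == ascii_slug
```

Each percent sign becomes `%25` on the second pass, so a single decode yields the still-encoded string rather than the original text.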

The sanitize() helper does have some internal conversion tables that transliterate multibyte chars to commonly accepted single-byte alternatives with a close visual counterpart. I recently accepted a pull request that pushes things a bit further, transliterating 64 Cyrillic chars that have a close visual or phonetic equivalent. I'm open to considering a PR for additional Cyrillic letters.
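
As a rough picture of table-based transliteration, here is a minimal Python sketch. It is not the sanitize() helper itself: the function name `slugify`, the tiny `TRANSLIT` table, and the choice to turn unmapped characters into hyphens are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny table. The real helper's tables are
# larger and their exact mappings are not reproduced here.
TRANSLIT = {
    "é": "e", "ü": "u", "ñ": "n",
    "п": "p", "р": "r", "и": "i", "в": "v", "е": "e", "т": "t",
    "м": "m",
}


def slugify(text: str) -> str:
    """Transliterate via the table, then keep only a-z, 0-9 and hyphens."""
    lowered = text.lower()
    mapped = "".join(TRANSLIT.get(ch, ch) for ch in lowered)
    # Any run of characters still outside a-z / 0-9 becomes one hyphen.
    hyphenated = re.sub(r"[^a-z0-9]+", "-", mapped)
    return hyphenated.strip("-")
```

With this table, `slugify("Привет, мир!")` gives `privet-mir`, while a string made only of characters that have no table entry collapses to an empty slug.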

Certainly I will never accept a pull request that, for example, attempts to transliterate the 8000+ multibyte chars of Simplified Chinese into Latin, firstly because this would be unwieldy and secondly because the transliteration of Chinese characters to Latin is acknowledged to be highly imperfect.

You can get exactly the behaviour you are requesting by setting the constant SLUG_STRICT to false in common.php. As explained in that post, this change could cause side-effects for tags that contain multi byte chars, requiring administrator action. But in all other respects, it is safe to do and will continue to be supported in future.

In summary, I've done things the way I think they should be done, but I've retained the option for you to do things differently if you disagree. I hope that makes sense!

xenocrat pinned this issue Jan 25, 2025
@mord0d
Author

mord0d commented Jan 25, 2025

> if you disagree

I don't disagree, I'm just curious. ☺

> Are there any reasons behind this decision?

↑ This was my question, and everything below it was just my thoughts about a problem I didn't understand at the time.

> Chyrp Lite has followed the philosophy that URLs should be able to survive transit through multiple potentially misbehaving systems, by eliminating all chars that a system might naively try to escape or convert. URL encoding breaks that, because a naive system might try to escape the percent signs and you'll end up with everything double-encoded. This philosophy has always applied to categories and tags. The help documentation for posts and pages states that it applies to them too - but I realised last year that it was not being properly enforced, hence the change to align the behaviour with the docs and to standardise the behaviour across posts, pages, tags and categories.

It makes sense. Thanks for the detailed answer.

mord0d closed this as completed Jan 25, 2025
@xenocrat
Owner

Thank you for asking! I'm always happy to explain the reasoning for my choices. ^_^
