Metadata length validation is buggy for unicode strings #158
Comments
Another alternative (ChatGPT again), which seems a bit more maintained, is the
We've used the regent approach and an ICU-based one in the past. I think it's mandatory to clarify the spec requirement first. It matches one of the topics of the hackathon as well!
Yesterday's discussions at the hackathon were quite clear, but I can only agree that the requirement must be clear first ^^
Then that's OK. It would be good to add clarity to the spec wiki as well.
The spec was updated during the hackathon to specify that we need to count graphemes: https://wiki.openzim.org/wiki/Metadata
The specification specifically says that we must validate the number of characters (graphemes looks like an even more correct term).
Currently scraperlib is using the `len` function, which counts not the number of graphemes (what we want to validate, because they are what is visually perceived) but the number of code points (which are not what is visually perceived).
It looks like (according to ChatGPT, let's be honest) we could use the `grapheme` library. I'm not sure this is the right choice, since this library seems barely maintained and not released in a proper manner.
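To make the code point versus grapheme mismatch concrete, here is a minimal standard-library sketch. The helper `approx_grapheme_count` is hypothetical and is not scraperlib code: it only folds combining marks into the preceding character, which is a rough approximation and not full Unicode grapheme cluster segmentation (ZWJ emoji sequences, flags, and similar cases are not handled). A real validator would presumably rely on a complete implementation such as one of the libraries discussed in this issue.

```python
import unicodedata


def approx_grapheme_count(text: str) -> int:
    """Roughly count user-perceived characters (hypothetical helper).

    Assumption: a combining mark (category Mn, Mc or Me) belongs to the
    preceding character. This is NOT full grapheme cluster segmentation.
    """
    count = 0
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc", "Me")
        if is_mark and count > 0:
            # Attach the mark to the cluster that is already counted.
            continue
        count += 1
    return count


# "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT
sample = "e\u0301"
code_points = len(sample)                 # what len() measures
clusters = approx_grapheme_count(sample)  # closer to what a reader sees
```

With this sample, `len` reports two code points while the reader perceives a single character, which is the kind of discrepancy a metadata length check based on `len` would run into.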