Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metadata length validation is buggy for unicode strings #158

Closed
benoit74 opened this issue Apr 30, 2024 · 5 comments · Fixed by #176
Closed

Metadata length validation is buggy for unicode strings #158

benoit74 opened this issue Apr 30, 2024 · 5 comments · Fixed by #176
Assignees
Labels
bug Something isn't working
Milestone

Comments

@benoit74
Copy link
Collaborator

The specification specifically says that we must validate the number of characters (looks like graphemes would be even a more correct term).

Currently scraperlib is using the len function which is not counting the number of graphemes (what we want to validate because they are the visually perceived thing) but the number of code points (which is not what is visually perceived).

Looks like (according to ChatGPT, let's be honest) we could use the grapheme library. Not sure this is the appropriate idea since this lib seems barely maintained / released in a proper manner.

import grapheme

print(len("विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में"))  # Outputs: 41 => Wrong
print(grapheme.length("विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में"))  # Outputs: 25 => Correct
@benoit74 benoit74 added bug Something isn't working question Further information is requested labels Apr 30, 2024
@benoit74 benoit74 changed the title Metadata length validation is buggy for unicode strings and escap Metadata length validation is buggy for unicode strings Apr 30, 2024
@benoit74
Copy link
Collaborator Author

Another alternative (ChatGPT again) which seem a bit more maintained is the regex library (not the re standard library)

import regex
len(regex.findall(r'\X', text))   # Outputs: 25 => Correct

https://pypi.org/project/regex/

@rgaudin
Copy link
Member

rgaudin commented Apr 30, 2024

We've used the regent approach and an icu based one in the past. I think it's mandatory to clarify spec requirement first. Matches one of the topic of the hackathon as well!

@benoit74
Copy link
Collaborator Author

benoit74 commented May 1, 2024

Yesterday discussions in the Hackathon were quite clear, but I can only agree that requirement must be clear first ^^

@rgaudin
Copy link
Member

rgaudin commented May 1, 2024

Then that's OK. Would be good to add clarity to the spec Wiki as well.

@benoit74 benoit74 added this to the 3.4.0 milestone Jun 11, 2024
@benoit74 benoit74 modified the milestones: 3.4.0, 3.5.0 Jun 20, 2024
@benoit74 benoit74 self-assigned this Jul 2, 2024
@benoit74 benoit74 removed the question Further information is requested label Jul 2, 2024
@benoit74
Copy link
Collaborator Author

benoit74 commented Jul 2, 2024

Spec has been updated during the hackathon to specify that we need to count graphemes: https://wiki.openzim.org/wiki/Metadata

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants