Metadata length validation is buggy for unicode strings #158

Closed
@benoit74

The specification explicitly says that we must validate the number of characters ("graphemes" looks like the more accurate term).

Currently scraperlib uses the built-in len function, which counts code points rather than graphemes. Graphemes are what we want to validate, because they are what the reader visually perceives; code points are not.

Looks like (according to ChatGPT, let's be honest) we could use the grapheme library. I'm not sure this is the right choice, though, since the library seems barely maintained and not released in a proper manner.

import grapheme

print(len("विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में"))  # Outputs: 41 => Wrong
print(grapheme.length("विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में"))  # Outputs: 25 => Correct
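
If adding a dependency turns out to be a problem, a rough standard-library approximation might be enough to illustrate the difference. The sketch below is not full Unicode grapheme segmentation: it assumes every code point in a "Mark" category (Mn, Mc, Me) attaches to the preceding base character, so emoji ZWJ sequences, flags and Hangul jamo are not handled. The names approx_grapheme_length and is_within_limit, as well as the limit of 30, are hypothetical and only there for illustration; they are not scraperlib functions or values from the specification.

```python
import unicodedata


def approx_grapheme_length(text: str) -> int:
    """Approximate the number of visually perceived characters in text.

    Assumption: code points in a Unicode "Mark" category (Mn, Mc, Me)
    combine with the preceding base character and are not counted.
    This is NOT full UAX #29 segmentation.
    """
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))


def is_within_limit(text: str, max_length: int) -> bool:
    """Hypothetical helper: compare the approximate count to a limit."""
    return approx_grapheme_length(text) <= max_length


title = "विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में"
print(len(title))                     # code points: 41
print(approx_grapheme_length(title))  # approximate graphemes: 25
print(is_within_limit(title, 30))     # True with the illustrative limit of 30
```

On this particular string the approximation happens to agree with the grapheme library, but that should not be expected in general.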

Labels: bug (Something isn't working)