Metadata length validation is buggy for Unicode strings #158

The specification explicitly says that we must validate the number of characters (graphemes would probably be the more accurate term).

Currently scraperlib uses the `len` function, which counts code points rather than graphemes. Graphemes are what we want to validate, because they are what readers visually perceive; code points are not.

It looks like (according to ChatGPT, let's be honest) we could use the `grapheme` library. I'm not sure this is the right choice, since the library seems barely maintained and is not released in a proper manner.

```
import grapheme  # third-party package: pip install grapheme

title = "विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में"

print(len(title))  # Outputs: 41 (code points) => Wrong
print(grapheme.length(title))  # Outputs: 25 (graphemes) => Correct
```
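If adding a dependency is a concern, a rough standard-library approximation might be enough to see the difference. The sketch below is only an assumption-laden illustration, not scraperlib code: the names `approx_grapheme_count` and `is_short_enough` and the limit of 30 are hypothetical, and skipping combining marks is not full Unicode grapheme segmentation (it ignores emoji ZWJ sequences, flags, Hangul jamo, and so on).

```python
import unicodedata


def approx_grapheme_count(text: str) -> int:
    """Approximate the grapheme count by not counting combining marks.

    Assumption: any code point whose general category starts with "M"
    (Mn, Mc, Me) extends the previous grapheme. This is a heuristic,
    not the full segmentation algorithm a dedicated library implements.
    """
    return sum(
        1 for ch in text if not unicodedata.category(ch).startswith("M")
    )


def is_short_enough(text: str, max_graphemes: int) -> bool:
    """Hypothetical validator: True if text fits within max_graphemes."""
    return approx_grapheme_count(text) <= max_graphemes


title = "विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में"

# Code points versus approximate graphemes for the same string
print(len(title), approx_grapheme_count(title))

# 30 is an illustrative limit only, not a value taken from the specification
print(is_short_enough(title, 30))
```

A dedicated segmentation library would still be the safer choice for real validation; the point here is only that the check should count something closer to graphemes than `len` does.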
