Extending to Non-ASCII characters with corpora loading and saving #93

IssacXid · 2024-12-22T16:50:29Z

Observation: After saving the indices along with corpus.json, loading failed if the corpus contained non-ASCII characters, and the following error was raised:

doc = json_functions.loads(line)
orjson.JSONDecodeError: invalid escaped character in string: line 1 column 24 (char 23)

A similar PR, "Added support for saving and loading non ASCII chars in corpus and vocab" (#86), handled this for the vocab, but corpus saving was missed.

Changes made in the PR:

-> Set ensure_ascii=False in the json_functions call used for saving when a corpus is also supplied.
-> Added encoding='utf-8' to ensure uniform behavior across different systems.
-> Updated the unit tests to cover the above scenario as well.
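
The idea behind these changes can be sketched roughly as follows. This is a minimal illustration using the standard-library `json` module, not the actual code from this PR: the helper names `save_corpus` and `load_corpus`, the file name, and the one-JSON-object-per-line layout are assumptions made for the example.

```python
import json
import os
import tempfile


def save_corpus(corpus, path):
    # Hypothetical helper: write one JSON document per line.
    # ensure_ascii=False keeps non-ASCII characters as-is instead of \uXXXX escapes,
    # and encoding="utf-8" avoids depending on the platform's default encoding.
    with open(path, "w", encoding="utf-8") as f:
        for doc in corpus:
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")


def load_corpus(path):
    # Hypothetical helper: read the file back with the same explicit encoding.
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


corpus = [
    {"id": 0, "text": "café au lait"},
    {"id": 1, "text": "日本語のテキスト"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.jsonl")
    save_corpus(corpus, path)
    loaded = load_corpus(path)

# The round trip should preserve the non-ASCII text exactly.
assert loaded == corpus
```

Using the same explicit encoding on both the write and the read side is what makes the round trip independent of the system locale.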


xhluca · 2024-12-23T23:01:19Z

I have not run this locally; however, the updated tests look good and all the checks have passed, so merging this now!

Added changes to load/save corpora with non-ascii character with unit…

bbc6a38

… test case

xhluca approved these changes Dec 23, 2024


xhluca merged commit ce8f886 into xhluca:main Dec 23, 2024
2 checks passed
