Extending to Non-ASCII characters with corpora loading and saving #93

Merged

Conversation

IssacXid
Contributor

@IssacXid IssacXid commented Dec 22, 2024

Observation: After saving the indices along with corpus.json, loading failed when the corpus contained non-ASCII characters, raising the following error:

doc = json_functions.loads(line)
orjson.JSONDecodeError: invalid escaped character in string: line 1 column 24 (char 23)

A similar PR, "Added support for saving and loading non ASCII chars in corpus and vocab" (#86), handled this for the vocab, but corpus saving was missed.

Changes made in the PR:

-> Set ensure_ascii=False in the json_functions call used while saving, for the case where a corpus is also supplied.
-> Added encoding='utf-8' to ensure uniform behavior across different systems.
-> Modified the unit tests to cover the above scenario as well.
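
For readers unfamiliar with the two settings named above, here is a minimal sketch of the general idea using only the standard-library json module. It is not the code from this PR: the helper names save_corpus_jsonl and load_corpus_jsonl, the file name corpus.jsonl, and the one-JSON-object-per-line layout are assumptions made for illustration.

```python
import json
import os
import tempfile


def save_corpus_jsonl(corpus, path):
    """Write one JSON object per line (hypothetical helper, not the PR's code)."""
    # encoding="utf-8" avoids depending on the platform's default encoding.
    with open(path, "w", encoding="utf-8") as f:
        for doc in corpus:
            # ensure_ascii=False writes non-ASCII characters as-is
            # instead of as \uXXXX escape sequences.
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")


def load_corpus_jsonl(path):
    """Read the file back with the same explicit encoding."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


corpus = [
    {"id": 0, "text": "café au lait"},
    {"id": 1, "text": "東京タワー"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.jsonl")
    save_corpus_jsonl(corpus, path)
    restored = load_corpus_jsonl(path)
    # Keep the raw file contents to inspect what was actually written.
    with open(path, encoding="utf-8") as f:
        raw = f.read()
```

Writing and reading with the same explicit encoding is what makes the round trip independent of the operating system's locale settings.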

@xhluca
Owner

xhluca commented Dec 23, 2024

I have not run this locally; however, the updated tests look good and all the checks have passed, so merging this now!

@xhluca xhluca merged commit ce8f886 into xhluca:main Dec 23, 2024
2 checks passed