Sanitize json with control characters #473
Open
Hi!
I recently saw that VK seems to be really bad at sanitizing some fields such as usernames.
Loading a bunch of members with vk_api yielded an ugly
JSONDecodeError: Invalid control character at: line 1 column 68248 (char 68247)
Inspecting this by hand, I found that the response contains this abomination:
Of course, <0x01> is not a codepoint that you'd want in a nickname, ever!
Since the data is nonsensical, not printable and potentially dangerous, my suggestion would be to catch JSONDecodeErrors and try to sanitize the raw content before parsing it with json.loads.

There are many options for replacing the problematic characters, but regex should be reasonably fast. As a bonus, it allows us to catch control characters as a category (\p{C}) rather than listing them by hand.
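
To illustrate the idea, here is a minimal sketch of the catch-and-retry approach. It is not the code from this PR: the helper name loads_sanitized is hypothetical, and it uses the standard-library re module, which does not support \p{C}. As a stand-in it strips only the ASCII control characters (U+0000 to U+001F and U+007F), a narrower set than the full \p{C} category.

```python
import json
import re

# Assumption: only ASCII control characters are removed. This is narrower
# than the Unicode category \p{C}, which the stdlib `re` module cannot match.
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")


def loads_sanitized(raw):
    """Parse JSON text; on failure, retry once with control characters removed.

    `loads_sanitized` is a hypothetical helper name used for illustration.
    """
    try:
        # Fast path: well-formed responses are parsed untouched.
        return json.loads(raw)
    except json.JSONDecodeError:
        # Slow path: drop the offending characters and parse again.
        # If the text is still invalid, the second error propagates.
        return json.loads(_CONTROL_CHARS.sub("", raw))


if __name__ == "__main__":
    broken = '{"nickname": "foo\x01bar"}'
    print(loads_sanitized(broken))
```

Sanitizing only after a failed parse keeps the common case free of any regex cost. Stripping tabs and newlines along with the other control characters should be harmless here, since outside of strings they are only insignificant whitespace in JSON.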