Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

json: Output not guaranteed to be valid UTF-8 #6151

Closed
sophie-h opened this issue Jan 22, 2022 · 8 comments
Closed

json: Output not guaranteed to be valid UTF-8 #6151

sophie-h opened this issue Jan 22, 2022 · 8 comments
Assignees
Milestone

Comments

@sophie-h
Copy link
Contributor

The JSON standard requires strings to be valid UTF-8. There are a bunch of JSON API fields that can contain invalid strings.

  • The b-prefixed fields "intentionally" output invalid strings
  • borg info --json seems to output everything without removing invalid strings parts (for example name or command_line)

Maybe it would be good if borg info and borg list are based on the same metadata source to not do the escape etc. at two points in code.

My current suggestion for 1.2:

  • All field names not prefixed use remove surrogates for their contents
  • We deprecate the use of b-prefixed fields and do not add new ones to borg info. Instead, the borg info --json output has a slightly incompatible change to remove surrogates.
  • As an incompatible change, barchive is no longer contained in borg list for archives per default. Not using the field but having it in the output could already break JSON parsers. It can be added by --fomat {barchive}.
  • We add something like base64_ prefixed fields, probably for everything string. I'm not sure, but I wouldn't be surprised if fields like hostname can also be invalid UTF-8 in theory.

I'm also fine with dropping the b-prefixed fields. Not sure if that breaks anything.

@ThomasWaldmann
Copy link
Member

related: #2273

@enkore
Copy link
Contributor

enkore commented Jan 24, 2022

Docs say

--json: Note: JSON can only represent text. A “barchive” key is therefore not available.

Even though it appears by default

We add something like base64_ prefixed fields, probably for everything string. I'm not sure, but I wouldn't be surprised if fields like hostname can also be invalid UTF-8 in theory.

iirc Borg uses the resolver to reverse 127.0.0.1 or something like that, so hypothetically it should be ASCII, though some other ways to determine the hostname involve parsing /etc/hosts or sticking /etc/hostname in front of the search domain from /etc/resolv.conf -- the latter two don't cause hangs during network problems.

So... yeah... just offering a b64 field for everything that could be not-strictly-text is a good idea.

The b-fields would only work with a receiver that also supports Python-style surrogate escaping. Likely never used.

@sophie-h
Copy link
Contributor Author

@ThomasWaldmann I would suggest removing the b-fields from 2.0 because they can deliver invalid JSON. Potentially add base64 or something like that instead.

@ThomasWaldmann ThomasWaldmann modified the milestones: 1.2.x, 2.0.0rc1 Jul 17, 2022
@sophie-h
Copy link
Contributor Author

sophie-h commented Oct 7, 2022

Actually, for 2.0, maybe it makes sense to just not allow binary values for things like comments, etc.

  • Reject command execution if the value is not valid Unicode
  • Strip invalid utf8 from output if it is caused by bit flip or whatever

@ThomasWaldmann ThomasWaldmann modified the milestones: 2.0.0rc1, 2.0.0b5 Dec 6, 2022
@ThomasWaldmann
Copy link
Member

#7197 barchive, bcomment, bpath were all removed.

  • barchive and bcomment is not needed any more because borg2 will validate these more strictly, so we won't have binary stuff in there. this includes borg transfer rejecting to transfer such archives.
  • bpath was just removed for now and a better solution needs to get implemented. as unix allows non-unicode / binary stuff in paths, we can not avoid that.

@ThomasWaldmann
Copy link
Member

ThomasWaldmann commented Dec 15, 2022

List of affected Item attributes:

  • path is surrogate-escaped str (s-e-str). paths can have all sorts of weird binary stuff on unix, sometimes old files with non-unicode encodings, samba shares especially.
  • source is s-e-str (same issue as with path, just for symlink targets ehrm "sources")
  • user is s-e-str
  • group is s-e-str
  • acl_* are all bytes
  • hlid is bytes
  • chunks / chunks_healthy is [(bytes, int), ...]
  • xattrs is {bytes: bytes, ...}

List of affected Archive attributes:

  • name is str (solved in borg2, validated unicode)
  • comment is str (solved in borg2, validated unicode)
  • hostname is s-e-str
  • username is s-e-str
  • cmdline, recreate_cmdline is [s-e-str, ...]

ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this issue Dec 28, 2022
binary bytes:
- json_key = <key>_b64
- json_value == base64(value)

text (potentially with surrogate escapes):
- json_key1 = <key>
- json_value1 = value_text (s-e replaced by ?)
- json_key2 = <key>_b64
- json_value2 = base64(value_binary)

json_key2/_value2 is only present if value_text required
replacement of surrogate escapes (and thus does not represent
the original value, but just an approximation).
value_binary then gives the original bytes value (e.g. a
non-utf8 bytes sequence).
ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this issue Dec 28, 2022
binary bytes:
- json_key = <key>_b64
- json_value == base64(value)

text (potentially with surrogate escapes):
- json_key1 = <key>
- json_value1 = value_text (s-e replaced by ?)
- json_key2 = <key>_b64
- json_value2 = base64(value_binary)

json_key2/_value2 is only present if value_text required
replacement of surrogate escapes (and thus does not represent
the original value, but just an approximation).
value_binary then gives the original bytes value (e.g. a
non-utf8 bytes sequence).
@ThomasWaldmann
Copy link
Member

ThomasWaldmann commented Dec 29, 2022

@sophie-h please check PR #7232. ^^^

@ThomasWaldmann ThomasWaldmann self-assigned this Dec 29, 2022
ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this issue Dec 29, 2022
item: path, source, user, group

for non-unicode stuff borg 1.2 had "bpath".

now we have:
path - unicode approximation (invalid stuff replaced by ?)
path_b64 - base64(path_bytes)  # only if needed

source has the same issue as path and is now covered also.

user and group are usually unicode or even pure ASCII,
but we rather are cautious and cover them also.
ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this issue Jan 16, 2023
binary bytes:
- json_key = <key>_b64
- json_value == base64(value)

text (potentially with surrogate escapes):
- json_key1 = <key>
- json_value1 = value_text (s-e replaced by ?)
- json_key2 = <key>_b64
- json_value2 = base64(value_binary)

json_key2/_value2 is only present if value_text required
replacement of surrogate escapes (and thus does not represent
the original value, but just an approximation).
value_binary then gives the original bytes value (e.g. a
non-utf8 bytes sequence).
@ThomasWaldmann
Copy link
Member

considering there was no feedback, i hope everyone is happy with it. will merge the PR soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants