(recreate_)cmdline encoding / binary requirement #7246

Closed
ThomasWaldmann opened this issue Jan 8, 2023 · 4 comments
@ThomasWaldmann
Member

ThomasWaldmann commented Jan 8, 2023

borg1: archive.cmdline is basically a copy of the sys.argv list, and when accessing it, borg1 even expects surrogate escapes in each list element. When outputting the cmdline to screen or to JSON, borg1 removes the surrogate escapes. It also applies shlex.quote to each list element and joins the elements with blanks in between.
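
A minimal sketch of the borg1-style output described above, not borg's actual code. The helper names `remove_surrogates` and `format_cmdline` are illustrative assumptions; the sketch quotes each element, joins with blanks, and replaces surrogate escapes with `?` so the result is valid Unicode:

```python
import shlex


def remove_surrogates(s: str, errors: str = "replace") -> str:
    """Replace lone surrogates (from surrogateescape decoding) so s is valid Unicode.

    Hypothetical helper for illustration; with errors="replace" each
    surrogate becomes "?", so the original byte is lost.
    """
    return s.encode("utf-8", errors).decode("utf-8")


def format_cmdline(argv: list[str]) -> str:
    """Quote each element, join with blanks, then drop the surrogate escapes."""
    return remove_surrogates(" ".join(shlex.quote(arg) for arg in argv))


# An argv element that came from a non-UTF-8 byte on the command line.
bad_path = b"foo/\x9a".decode("utf-8", "surrogateescape")
argv = ["borg", "create", "repo::arch", bad_path]
displayed = format_cmdline(argv)
```

Here `displayed` is `borg create repo::arch 'foo/?'`: readable, but the byte `0x9a` can no longer be recovered from it.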

borg2: I'd like to simplify that, just not sure how.

The reason I want to simplify it: if we really must keep a surrogate-escaped (s-e) string, then the corresponding JSON would need both cmdline and cmdline_b64 keys (with lists as values) to represent this information without data loss. That seems like overkill. And I could not even base64-encode the whole cmdline in one go; it would have to be a list of base64-encoded elements...
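
To make the "overkill" concrete, here is a rough sketch of what the two-key JSON for a list-valued cmdline could look like. The function name `cmdline_json_fields` is an assumption for illustration; only the key names `cmdline` and `cmdline_b64` come from the text above:

```python
import base64
import json


def cmdline_json_fields(argv: list[str]) -> dict:
    """Build a lossy, displayable list plus a lossless per-element base64 list."""
    return {
        # Valid Unicode for display; surrogate escapes become "?".
        "cmdline": [a.encode("utf-8", "replace").decode("utf-8") for a in argv],
        # Original bytes of each element, base64-encoded separately so the
        # element boundaries of the argv list are preserved.
        "cmdline_b64": [
            base64.b64encode(a.encode("utf-8", "surrogateescape")).decode("ascii")
            for a in argv
        ],
    }


argv = ["borg", "create", b"foo/\x9a".decode("utf-8", "surrogateescape")]
fields = cmdline_json_fields(argv)
as_json = json.dumps(fields)
```

Each element needs its own base64 string because encoding the joined list as one blob would lose where one argument ends and the next begins.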

So:

  • borg2: do we have to expect surrogate escapes in sys.argv anyway?
  • are there borg1 archives with surrogate escapes in archive.cmdline and if yes, could we just remove or replace them at borg transfer time?
  • could cmdline just be one simple, valid unicode string (without s-e) and not a sys.argv list copy?

The same issues apply to archive.recreate_cmdline (borg recreate uses this).

This is related to #7232 and #6151.

ThomasWaldmann added this to the 2.0.0b5 milestone Jan 8, 2023
ThomasWaldmann changed the title from "(recreate_)cmdline in archive metadata" to "(recreate_)cmdline encoding / binary requirement" Jan 8, 2023
ThomasWaldmann self-assigned this Jan 19, 2023
@horazont
Contributor

> borg2: do we have to expect surrogate escapes in sys.argv anyway?

As much as you expect surrogate-escaped filenames, right? The source paths for the data to put into an archive are part of the cmdline (also with borg2, right?), so it could theoretically contain non-UTF-8 bytes and thus surrogate escapes.

> could cmdline just be one simple, valid unicode string (without s-e) and not a sys.argv list copy?

Couldn't you have a middle ground, with shlex.quote'd strings? You could represent the surrogates as \xNN or so (and then also encode \ as \\, I guess).
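
One way this middle ground could look, as a sketch only: `escape_arg` and `unescape_arg` are hypothetical names, and the exact escape syntax is an assumption based on the `\xNN` / `\\` idea above. Surrogate escapes in the range U+DC80..U+DCFF map back to the bytes 0x80..0xFF:

```python
import re
import shlex


def escape_arg(arg: str) -> str:
    """Turn a surrogate-escaped str into valid Unicode using \\xNN and \\\\."""
    out = []
    for ch in arg:
        cp = ord(ch)
        if ch == "\\":
            out.append("\\\\")  # double literal backslashes first
        elif 0xDC80 <= cp <= 0xDCFF:
            out.append("\\x%02x" % (cp - 0xDC00))  # surrogate -> original byte
        else:
            out.append(ch)
    return "".join(out)


def unescape_arg(s: str) -> str:
    """Inverse of escape_arg: restore backslashes and surrogate escapes."""

    def repl(m: re.Match) -> str:
        tok = m.group(1)
        return "\\" if tok == "\\" else chr(0xDC00 + int(tok[1:], 16))

    return re.sub(r"\\(\\|x[89a-f][0-9a-f])", repl, s)


argv = ["borg", "create", b"foo/\x9a".decode("utf-8", "surrogateescape"), "a\\x9a"]
command_line = shlex.join(escape_arg(a) for a in argv)
restored = [unescape_arg(a) for a in shlex.split(command_line)]
```

The resulting `command_line` is a plain, encodable string, and splitting plus unescaping restores the original list, including the argument that contains a literal `\x9a`.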

@ThomasWaldmann
Member Author

@horazont Usually we only have recursion roots and excludes there, and they need to get typed in somehow. Would shell completion complete to a path that has surrogate escapes? I guess it could also come in via copy & paste.

In PR #7289 I used shlex.join(sys.argv) (and I am not removing the s-e, at least not yet).

@horazont
Contributor

horazont commented Jan 20, 2023

Yes, you can end up with a path which contains non-UTF-8:

$ ls foo/$'\232'
bar

I can imagine this happening when interacting with the storage backing samba shares for instance.

To be clear: I have no use-case for that, just being exact here.
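
For reference, a small sketch of how the byte from the `ls` example above appears on the Python side. It assumes UTF-8 with the `surrogateescape` error handler, which is how Python typically decodes command-line arguments and filenames on POSIX systems; `is_encodable` is a hypothetical helper:

```python
# $'\232' in the shell is an octal escape; Python bytes literals accept the same form.
raw = b"foo/\232"

# Byte 0x9a is not valid UTF-8, so surrogateescape maps it to the lone surrogate U+DC9A.
name = raw.decode("utf-8", "surrogateescape")


def is_encodable(s: str) -> bool:
    """Return True if s can be encoded as strict UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


strict_ok = is_encodable(name)                         # strict encoding fails
round_trip = name.encode("utf-8", "surrogateescape")   # original bytes are recoverable
```

So the string is usable for file access but cannot be written out as strict UTF-8 without first removing or escaping the surrogate.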

@ThomasWaldmann
Member Author

@horazont OK, so I guess the code in the PR is fine now. It stores shlex.join(sys.argv) as an s-e str. Later, when generating screen or JSON output, it removes the s-e and adds a _b64 key to the dict that holds the base64 encoding of the bytestring.
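
A hedged sketch of the approach described above, not the code from the PR. The function name `command_line_fields` is an assumption, the key name follows the `command_line` attribute mentioned in the referencing commit, and always emitting the `_b64` key (rather than only when surrogates are present) is also an assumption:

```python
import base64
import json
import shlex


def command_line_fields(argv: list[str]) -> dict:
    """Store one joined s-e string; emit a clean text key plus a lossless _b64 key."""
    command_line = shlex.join(argv)  # a single str, may contain surrogate escapes
    raw = command_line.encode("utf-8", "surrogateescape")  # the original bytestring
    return {
        # Invalid bytes become U+FFFD here, so this value is valid Unicode.
        "command_line": raw.decode("utf-8", "replace"),
        "command_line_b64": base64.b64encode(raw).decode("ascii"),
    }


argv = ["borg", "create", b"foo/\x9a".decode("utf-8", "surrogateescape")]
fields = command_line_fields(argv)
as_json = json.dumps(fields)

# The lossless form can be decoded and split back into the original argv.
recovered = shlex.split(
    base64.b64decode(fields["command_line_b64"]).decode("utf-8", "surrogateescape")
)
```

Because the whole command line is now one string, a single base64 value is enough, which avoids the per-element list that the list-valued cmdline would have required.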

ThomasWaldmann added a commit that referenced this issue Jan 22, 2023
ArchiveItem.cmdline list-of-str -> .command_line str, fixes #7246