Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Base64 UTF-8 for image-layout ref names? #174

Closed
wking opened this issue Jul 21, 2016 · 36 comments
Closed

Base64 UTF-8 for image-layout ref names? #174

wking opened this issue Jul 21, 2016 · 36 comments

Comments

@wking
Copy link
Contributor

wking commented Jul 21, 2016

Spun off from #173.

We currently have a charset restriction on ref names, but folks may want to use colons, etc. I don't think folks should care about human-readability of the refs filenames, so I'd like to change the mapping from refs to filenames from “they're the same thing” to “the filename is the NFC Unicode that is UTF-8 encoded and then filesystem-safe Base 64 encoded”. Then folks can put whatever they want in the ref name.

As an example in Python 3:

>>> import base64
>>> import unicodedata
>>> name = 'Ζεύς'
>>> name_nfc = unicodedata.normalize('NFC', name)
>>> name_utf8 = name_nfc.encode('UTF-8')
>>> name_base64 = base64.b64encode(name_utf8, altchars=b'-_')
>>> print((name, name_nfc, name_utf8, name_base64))
('Ζεύς', 'Ζεύς', b'\xce\x96\xce\xb5\xcf\x8d\xcf\x82', b'zpbOtc-Nz4I=')

So the Ζεύς ref would be stored in refs/zpbOtc-Nz4I=.

There's previous Base 64 discussion (in the context of container IDs) in opencontainers/runc#675 and opencontainers/runc#676.

@runcom
Copy link
Member

runcom commented Jul 21, 2016

I don't think folks should care about human-readability of the refs filenames

I think naming should be expressed in some way - how to you expect people to talk about a given image? "HEY I'm using ./refs/busybox:latest" over "HEY I'm using ./refs/djEuMA=="
Also, I'm not sure every tool out there is ready to change to UTF8 for something like image names (but that can be just me)

I'm not saying I'm against changing the set of allowed chars in refs to introduce colons - I'm saying going to human readable to robot-only isn't really what I expect cause I think I'm losing the way to communicate an image name. Is there any other way to say this ref is "busybox:latest" w/o having to name the ref like that?

@wking
Copy link
Contributor Author

wking commented Jul 21, 2016

On Thu, Jul 21, 2016 at 11:35:51AM -0700, Antonio Murdaca wrote:

I don't think folks should care about human-readability of the refs filenames

I think naming should be expressed in some way - how to you expect
people to talk about a given image? "HEY I'm using
./refs/busybox:latest" over "HEY I'm using ./refs/djEuMA=="

I expect people to say “Hey, I'm using ‘Ζεύς’” or “I'm using
‘busybox:latest’”. Why talk about ./refs/… at all? That's an
internal detail of the image-layout 1.0.0 spec.

Also, I'm not sure every tool out there is ready to change to UTF8
for something like image names (but that can be just me)

Support for Unicode refs in image-layout does not mean that all
consumers need to support all of Unicode. You just say “our service
only supports {your charset}” and error out if someone tries to push a
ref that doesn't comply.

@runcom
Copy link
Member

runcom commented Jul 21, 2016

I'm all for this if solves this char issue and there aren't any downsides in doing this

@wking
Copy link
Contributor Author

wking commented Jul 21, 2016

On Thu, Jul 21, 2016 at 11:45:48AM -0700, Antonio Murdaca wrote:

I'm all for this if solves this char issue…

I'm pretty sure this solves the “I want to use {char}” issue.

and there aren't any downsides in doing this

This is harder to get a handle on ;). There is clearly some increased
computational cost associated with going from Unicode → NFC → UTF-8 →
Base 64, but I doubt that will be a significant.

For services / tooling that does not want to support Unicode refs,
this also puts that decision on them. They have to defend that choice
directly, and can't say “we don't do that because the image-layout
format doesn't support it. Go bother the image-spec maintainers if
you really want it”. If you're a maintainer of such a service /
tooling, that may give this change a political downside ;).

I can't think of any other issues, but that doesn't mean they don't
exist ;).

@vbatts
Copy link
Member

vbatts commented Jul 21, 2016

base64 for filenames isn't great. It would work if the name were in a file, but then mapping them is a function of indexing. hrm

@wking
Copy link
Contributor Author

wking commented Jul 21, 2016

On Thu, Jul 21, 2016 at 12:34:35PM -0700, Vincent Batts wrote:

base64 for filenames isn't great. It would work if the name were in
a file, but then mapping them is a function of indexing. hrm

Can you unpack that a bit more for me? RFC 3548 specifically calls
this alphabet “Filename Safe” 1.

@stevvooe
Copy link
Contributor

Packing naming into refs is a really bad idea. I'm going to comment more on #173, but these act more like tags, in the way that git works.

Indeed, making these base64 is probably not such a good idea, either. The character set has been chosen for wide artifact and filesystem compatibility.

This conversation really needs to be held in the context of naming.

@wking
Copy link
Contributor Author

wking commented Jul 21, 2016

On Thu, Jul 21, 2016 at 04:20:32PM -0700, Stephen Day wrote:

Packing naming into refs is a really bad idea.

Agreed. This is just “what do we gain by restricting the ref
charset, let's remove that restriction”?

Indeed, making these base64 is probably not such a good idea,
either. The character set has been chosen for wide artifact and
filesystem compatibility.

“Base 64 Encoding with URL and Filename Safe Alphabet” 1 seems like
it would have wide filesystem compatability as well ;). Do you have a
specific concern on that front?

For other artifact stores, I expect they'll handle this in their own
fashion and won't be impacted by image-layout decisions.

This conversation really needs to be held in the context of naming.

I think naming as in “Trevor signed image sha256:5b… as debian:7.0” is
completely orthogonal to “what charset do we allow for the ref
nicknames?”. Refs are like Git's lightweight tags, with no security.
Signed names will be like Git's tag objects, and the crypto bits will
live in CAS (like they do in Git).

@stevvooe
Copy link
Contributor

“Base 64 Encoding with URL and Filename Safe Alphabet” [1] seems like
it would have wide filesystem compatability as well ;). Do you have a
specific concern on that front?

It is not human-readable and it makes debugging harder. Simple as that.

I think naming as in “Trevor signed image sha256:5b… as debian:7.0” is
completely orthogonal to “what charset do we allow for the ref
nicknames?”. Refs are like Git's lightweight tags, with no security.
Signed names will be like Git's tag objects, and the crypto bits will
live in CAS (like they do in Git).

This is an apt analogy.

@wking
Copy link
Contributor Author

wking commented Jul 22, 2016

On Thu, Jul 21, 2016 at 04:41:02PM -0700, Stephen Day wrote:

It is not human-readable and it makes debugging harder. Simple as
that.

These are not compatibility concerns. I'm generally in favor of
human-readable representations (e.g. JSON over protobuf 1), but for
an image-layout that must be compatible with all sorts of filesystems
I think the cost of filename-safe Base 64 + UTF-8 is worth paying to
avoid “we've chosen not to support your requested {character}”
arguments.

How frequently do you expect to want to debug an image-layout instance
when you won't have oci-image-tool available for (using the #159
tooling)?

$ oci-image-tool refs list image.tar
Ζεύς
busybox:latest

Or are you concerned that that image-spec tooling will be so difficult
to test that you'll be constantly debugging oci-image-tool itself?
Neither of those seem very likely to me.

@stevvooe
Copy link
Contributor

@wking I'll say it again, packing names into refs is probably a really bad idea.

@wking
Copy link
Contributor Author

wking commented Jul 22, 2016

On Fri, Jul 22, 2016 at 01:17:53PM -0700, Stephen Day wrote:

@wking I'll say it again, packing names into refs is probably a
really bad idea.

This proposal is not about packing names into refs. It's about
lifting the restriction on allowed ref-name characters. If you feel
those concepts are linked, can you spell out the link in detail?

@runcom
Copy link
Member

runcom commented Jul 22, 2016

I believe the restriction on characters can stay the way it is now w/o over-complicating it - since this came up as part of allowing colons to express "name:ref"

@wking
Copy link
Contributor Author

wking commented Jul 22, 2016

On Fri, Jul 22, 2016 at 01:32:37PM -0700, Antonio Murdaca wrote:

I believe the restriction on characters can stay the way it is now
w/o over-complicating it - since this came up as part of allowing
colons to express "name:ref"

This would also help simplify portability with ref names that include
slashes 1.

If there is nobody actively pushing for a particular new character,
then I'm fine tabling this proposal. However, I expect we will see
“why can't I use {char} in my ref name” issues like [2,3], and when we
do, I think this proposal is a good way to handle all of them at once,
instead of adding case-by-case extensions and workarounds.

@runcom
Copy link
Member

runcom commented Jul 22, 2016

This would also help simplify portability with ref names that include
slashes [1].

that's still tied to solving the naming problem.

If there is nobody actively pushing for a particular new character,
then I'm fine tabling this proposal. However, I expect we will see
“why can't I use {char} in my ref name” issues like [2,3], and when we
do, I think this proposal is a good way to handle all of them at once,
instead of adding case-by-case extensions and workarounds.

having name restriction is something we can just document as said in [3] from your comment. Btw, if you guys really feel this should be extended I'm all for this but this is a non-issue right now to me.

@stevvooe
Copy link
Contributor

Here is an example of a recent breakage: http://latkin.org/blog/2016/07/20/git-for-windows-accidentally-creates-ntfs-alternate-data-streams/.

I am very familiar with the "I want another character problem". The solution to this is usually not to add more characters, but remove the contention over character usage by making provisions to store arbitrary image names outside of an area that has more strict requirements. For example, only put names in fully escaped contexts (not a url path).

There is also the consideration that a simplified naming scheme is usually more secure. When you start introducing unicode, there are many code points that look the same but aren't, making copypasta a risky business.

There is also the consideration that these names may want to be mapped to DNS, which creates further restrictions.

@wking
Copy link
Contributor Author

wking commented Jul 22, 2016

On Fri, Jul 22, 2016 at 01:44:29PM -0700, Antonio Murdaca wrote:

This would also help simplify portability with ref names that
include slashes 1.

that's still tied to solving the naming problem.

For your use-case ;). But who knows what other people will want to
use refs for? Base 64 is straightforward enough that restricting the
charset seems unnecessary.

If we want to wait until someone gives us a meaningful use case and
then judge the validity of those use-cases before deciding to allow
those characters, you have to make a lot of decisions. For example,
you need to reject “I want a ref named alpine:1.0” 1 as not good
enough 2 because you'll probably trust it too much 3.

I'd rather have the image-layout layer say “use whatever ref names you
want, I don't care”, because deciding whether a particular use-case is
good enough to deserve an allowed-character extension doesn't seem
like a good use of time. But it's not going to be my time, and I'm
not an image-spec maintainer, so it's clearly not my call ;).

@stevvooe
Copy link
Contributor

This would also help simplify portability with ref names that include
slashes [1].

This isn't really a problem. All platforms have some concept of a file system separator that can be mapped 1:1.

If you feel
those concepts are linked, can you spell out the link in detail?

Human readability and debugging are important here. If you remove the incentive to complicate the on-disk filename mapping, this problem goes away at the cost of obfuscation. Restricting the character set is one solution, while encoding is another. One must also take into account the security ramifications.

You can read about the considerations made in mapping names to paths and urls in docker/distribution for more info.

@wking
Copy link
Contributor Author

wking commented Jul 22, 2016

On Fri, Jul 22, 2016 at 01:47:51PM -0700, Stephen Day wrote:

The solution to this is… to store arbitrary image names outside of
an area that has more strict requirements.

Like a filename-safe-base-64-encoded filename ;).

There is also the consideration that a simplified naming scheme is
usually more secure. When you start introducing unicode, there are
many code points that look the same but aren't, making copypasta a
risky business.

Sure, and this is a concern for domain names too. But the IETF
solution to that is to allow Unicode in the format and warn registries
and clients to take care [1,2], not to ban Unicode from domain names.

There is also the consideration that these names may want to be
mapped to DNS, which creates further restrictions.

I think this is #175.

@wking
Copy link
Contributor Author

wking commented Jul 22, 2016

On Fri, Jul 22, 2016 at 02:03:45PM -0700, Stephen Day wrote:

This would also help simplify portability with ref names that
include slashes [1].

This isn't really a problem. All platforms have some concept of a
file system separator that can be mapped 1:1.

I'm not saying you couldn't work around any missing character, I'm
saying that this proposal is a simple, general fix to all such
problems (as long as the char you want is in Unicode ;).

You can read about the considerations made in mapping names to paths
and urls in docker/distribution for more info.

Can you post a link? I can't turn up something with those keywords
using grep:

docker/distribution$ git describe
docs-v2.4.1-2016-06-28-51-g4abae2a
docker/distribution$ git grep -6 URL | grep -i6 name | grep -i6 secur
…no meaningful hits…

@stevvooe
Copy link
Contributor

Hopefully, this will get you started:

https://github.com/docker/distribution/blob/master/registry/storage/paths.go
distribution/distribution#208
distribution/distribution#130
distribution/distribution#1201

This is mostly an intractable problem and a balance needs to be struck.

@wking
Copy link
Contributor Author

wking commented Jul 23, 2016

On Fri, Jul 22, 2016 at 03:29:20PM -0700, Stephen Day wrote:

Hopefully, this will get you started:

Thanks for hunting those down for me :).

https://github.com/docker/distribution/blob/master/registry/storage/paths.go

This is a good overview of how docker/distribution organizes refs, but
I couldn't find references to security concerns there.

distribution/distribution#208

This leads to docs about path traversal attacks 1, which we avoid
entirely with a filename safe base 64 encoding.

distribution/distribution#130

This references a proposal to base 64 (or 32) encode names in
distribution/distribution#124 and distribution/distribution#128. That ended up
where you are now, suggesting keeping some illegal characters (in
particular name positions) and using that to avoid encoding 2. I
agree with folks advocating future-proofing 3 over the temporary
convenience of “does not require base 64 decoding to be
human-readable”.

distribution/distribution#1201

This looks like “please add support for an additional character”,
which seems like an argument in favor of this sort of generic encoding
approach.

This is mostly an intractable problem and a balance needs to be
struck.

It's only intractable if you make human readability a hard requirement
and can't read base 64 by eye, while only the latter is true for me
;).

 Linked from docker/docker-registry#727

@stevvooe
Copy link
Contributor

@wking We have tests for path traversal. Most of this is thematic, but there is a long history of discussion in this area.

human readability a hard requirement

Humans use this system.

I actually think the refs should probably take on the role of sections in ELF.

@wking
Copy link
Contributor Author

wking commented Jul 23, 2016

On Fri, Jul 22, 2016 at 05:22:50PM -0700, Stephen Day wrote:

@wking We have tests for path traversal.

Again 1, not encoding isn't impossible, it's just that always
encoding lets you neatly sidestep all of these problems.

human readability a hard requirement

Humans use this system.

But not without tools. Use the tools to see the decoded refs. They
don't have to sit on the disk or cross the wire in that format.

I actually think the refs should probably take on the role of
sections in ELF.

I'll read up on ELF sections, but ELF is certainly not a format I can
read with just ‘cat’, so invoking it seems like another argument for
encodings ;).

@stevvooe
Copy link
Contributor

@wking The comparison with ELF means that OCI controls the names of refs. That doesn't give license to obscure everything unnecessarily.

@wking
Copy link
Contributor Author

wking commented Jul 25, 2016

On Mon, Jul 25, 2016 at 02:15:29PM -0700, Stephen Day wrote:

@wking The comparison with ELF means that OCI controls the names of
refs.

I think you mean “we are specifying the ref name ↔ path mapping”, and
I agree with that. You may mean, “we are limiting the ref-name
charset”, and I agree that we are currently doing that, but think that
this proposal is a good way to stop doing it (because it allows us to
avoid a lot of decisions 1).

That doesn't give license to obscure everything unnecessarily.

I'm not suggesting we use Base 64 to obscure ref names. I'm
suggesting we use it to get a safe filename without limiting the
ref-name charset.

@runcom
Copy link
Member

runcom commented Jul 25, 2016

I still believing restricting the charset is useful, right now this won't have happened if there wasn't a restriction and that's good because we can advance step by step here. With your base64 refs people could have just done what I was proposing (multiple images in the image-layout) but that could have been wrong and by opening the charset to whatever also open the spec to something not wanted (or better, not specified)

@runcom
Copy link
Member

runcom commented Jul 25, 2016

That said, I still think this might be useful. But it doesn't open the doors just to my use case. And it can probably make other decisions and design harder here.

@wking
Copy link
Contributor Author

wking commented Jul 25, 2016

On Mon, Jul 25, 2016 at 02:41:35PM -0700, Antonio Murdaca wrote:

With your base64 refs people could have just done what I was
proposing (multiple images in the image-layout) but that could have
been wrong and by opening the charset to whatever also open the spec
to something not wanted (or better, not specified)

That's “wrong” / “not wanted” according to this judgement call 1. I
don't think image-layout is the right place to be making those policy
decisions.

@runcom
Copy link
Member

runcom commented Jul 25, 2016

Wrong/not wanted == probably mis-used

@wking
Copy link
Contributor Author

wking commented Jul 25, 2016

On Mon, Jul 25, 2016 at 03:02:57PM -0700, Antonio Murdaca wrote:

Wrong/not wanted == probably mis-used

This is still based on pushing a higher-level abstraction (verified
names) down into image-layout, when I think all image-layout should
care about is “these are mutable refs”.

@philips
Copy link
Contributor

philips commented Aug 24, 2016

I will address this when I close out #173 . The idea is to only have refs be things that are classicly "tags". So, v1.0.0 not tons of crazy characters.

@wking
Copy link
Contributor Author

wking commented Aug 24, 2016

On Wed, Aug 24, 2016 at 07:18:22AM -0700, Brandon Philips wrote:

The idea is to only have refs be things that are classicly
"tags". So, v1.0.0 not tons of crazy characters.

This will make unrelated images in the same ref store more difficult
1, requiring mappings from names to default-refs that do not work in
all cases 2.

And I still think we'll make our future lives easier by making refs
generic instead of pushing the current consensus policies
(e.g. “tag-like names ref”) down into the ref-engine layer 3.
Consensus policies can change, but landing something like base 64 refs
would require a major bump to image-layout 2.0.0.

I still don't see why image-spec needs to have an opinion on ref
format at all. Do we have reasons beside “this may be a foot gun” 4
and “I can't read it without tools” 5?

@vbatts
Copy link
Member

vbatts commented Aug 29, 2016

well, given that we could adopt the git-style of ./refs/tags/$tag, then adding a non-version-breaking piece like ./refs/base64/$value could be an option, but for now that is not the route we're going to get a v1.0.0 released

@philips
Copy link
Contributor

philips commented Sep 14, 2016

Conclusion on this one based on the discussion on the call today. Essentially everyone feels we have selected the correct subset of characters to support most filesystems. And we have not attached any semantic meaning to these refs in this directory.

So, if someone wants fancy UTF-8 refs they can create a higher-level tooling expectation that says, org.prefix.base64-aGVsbG8K is actually displayed as "hello" tag to the user.

@wking
Copy link
Contributor Author

wking commented Sep 14, 2016

On Wed, Sep 14, 2016 at 02:56:30PM -0700, Brandon Philips wrote:

Essentially everyone feels we have selected the correct subset of
characters to support most filesystems.

I would still rather have gone with this approach, allowing safe,
consistent UTF-8 for everyone. But I'm am ok with an out-of-spec
base64-… convention for folks who need additional characters.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants