IPFS permits undesirable paths #1710
Agreed. Perhaps we should follow what Linux/BSD do.
Agreed.
I'm not so sure how sold I am on limiting what we can put in links. Applications (like unixfs) can and should restrict them, but it seems like we may lose out on some flexibility by doing so globally.
POSIX filenames can contain everything except `/` and `NUL`. Linux likewise permits all characters except those two. Besides stuff like newline and backspace and escape, the fact that Linux (and POSIX) does not require filenames to use any particular character encoding is a source of continuing struggle. See e.g. Python and Rust hammering out how to map OS-provided byte sequences to encoding-aware strings in the least terrible way possible. Requiring IPFS link names (and thus paths) to be well-formed UTF-8 would be a win, even if it means some filenames representable by the OS are not permitted in IPFS.

Speaking of length restrictions, link names should probably also be constrained to be non-empty, and have a maximum length constrained on purpose, rather than being incidentally constrained by the network transport. (Linux, for comparison, caps individual filenames at 255 bytes.)

@whyrusleeping Would you permit arbitrary byte sequences in link names? That would have interesting properties (i.e. permitting protobuf messages to be used as link names directly), but it would also mean a sequence of links could not be combined into a path using `/`.
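To make the suggestion concrete, here is a minimal Go sketch of that kind of link-name check. The helper name `validLinkName` and the 255-byte cap are illustrative assumptions, not the go-ipfs API; the actual limit would be a spec decision.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// maxLinkNameBytes is a hypothetical cap; the thread argues the limit
// should be chosen on purpose rather than inherited from the transport.
const maxLinkNameBytes = 255

// validLinkName sketches the checks suggested above: link names must be
// non-empty, bounded in length, and well-formed UTF-8.
func validLinkName(name string) error {
	if len(name) == 0 {
		return errors.New("link name is empty")
	}
	if len(name) > maxLinkNameBytes {
		return errors.New("link name too long")
	}
	if !utf8.ValidString(name) {
		return errors.New("link name is not valid UTF-8")
	}
	return nil
}

func main() {
	fmt.Println(validLinkName("docs"))     // <nil>
	fmt.Println(validLinkName(""))         // link name is empty
	fmt.Println(validLinkName("\xc0\x80")) // link name is not valid UTF-8
}
```

Note that the overlong byte sequence `0xC0 0x80` fails the UTF-8 check outright, so no separate overlong-encoding rule is needed.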
(thanks @willglynn for bringing this up, making the case, and links, etc)
This is exactly the discussion I was hoping to have. If link names need to be binary-friendly, then let's make them so; the catch is that every component would need to treat paths as raw byte sequences.
nods
In that case, we might need more restrictive rules. Leading and/or trailing whitespace might have to go. Unicode canonicalization might need to be specified – i.e. whether names must be in NFC or NFD form.
The basic problem is that major platforms permit filenames that are outside of Unicode. Declaring "IPFS paths shall be UTF-8 without pointy edges" is good for IPFS users and developers, even if that means some things would need a translation layer.

As for platforms: Linux is entirely laissez-faire, permitting any byte string that excludes `/` and `NUL`. IPFS paths should permit every filename that a reasonable person would use, while avoiding the complexities introduced by being too permissive. This is a balancing act and there's definitely room for discussion. (And again, POSIX filename portability demands only `A`-`Z`, `a`-`z`, `0`-`9`, `.`, `_`, and `-`.)
Agreed. I was advocating a maximum length per path component, rather than restricting the path in its totality, for exactly the reason that a node might be found via many arbitrarily long paths. I picked a limit of 4 KB above for comparison with Linux's `PATH_MAX`.
@willglynn agreed throughout.
Another option would be to just disallow anything in the big list of naughty strings. I'm not sure that's right, but it hasn't been mentioned yet and we don't know what people will want to do with IPFS paths. If we allow a subset of this list, we should have a public sanitizer module that allows people to check their proposed strings easily.
That list appears to be intended to stress-test input validation, and is suitable for e.g. testing tools that automatically enter data that might cause undesirable behavior; incorporating it into an IPFS test suite somewhere might have value. However, using it as a blacklist would prohibit strings that are perfectly reasonable names. Most of the rules I proposed are readily enforced by e.g. Go's `unicode/utf8` package.
Cool. Yeah, yours is simpler. Let's go with that.
The experience of writing some path validation code and corresponding tests prompts me to raise three new topics:

**Length limits**

POSIX imposes limits here (`NAME_MAX` per filename, `PATH_MAX` per path) that are worth considering when picking ours.

**Two special noncharacters**

Do the noncharacters `U+FFFE` and `U+FFFF` belong in link names?
Some sources state that these two noncharacter codepoints in particular can cause problems. These two codepoints are specifically excluded from XML. I note that XML dates to the period when noncharacters "should never be interchanged", but I note also that only these two particular noncharacters are excluded, despite the opportunity to exclude other noncharacters. (Edit: HTML5 prohibits these and all other noncharacters, along with most C0 and C1 control characters.) Mostly there's a lot of discussion and no clear consensus: D, SpiderMonkey, Perl. The official word is:
"Just be careful out there!" isn't really great guidance. I'm inclined not to exclude these values, since they are valid Unicode codepoints with a valid encoding and are permitted for interchange, even if they're not characters. Typical tools handling UTF-8 won't explode when handed these codepoints, so… that's okay, right?

**UTF-8 has multiple definitions**

UTF-8 as originally specified in ISO/IEC 10646 and in RFC 2044 can encode characters up to 31 bits over 6 bytes. UTF-8 as specified in RFC 3629 limits it to 21 bits and 4 bytes, matching the current definition of Unicode (having planes 0-16) and the range of codepoints encodable with UTF-16. Go's `unicode/utf8` package implements the RFC 3629 definition.
I think forcing a Unicode semantic on IPFS paths is not a good idea. The reason is:
Byte streams are much simpler to handle. Parsing UTF-8 is of no use, especially if we must encode it back for consumption in filesystem paths or HTTP URLs.
UNIX paths are arbitrary byte sequences, and that's confusing, because programs – and programmers – usually treat them as if they're encoded strings.

Byte sequences are simple to handle until you need to do something with them, like display them in a user interface or print them to a log file, since essentially every context where text occurs outside a UNIX filesystem itself has a defined encoding that relates to Unicode in some specific way. UTF-8 strings are exactly as simple to handle as byte sequences, and they're much simpler than byte sequences to handle when you need a UTF-8 string, or a UTF-16 string, or a string in some other encoding. Processing potentially-malformed UTF-8 strings adds one additional step – you have to validate that your string is UTF-8 (which is not hard), but once you've done that, you're left with a UTF-8 string that's valid everywhere, including contexts that blindly accept bytes like FUSE.

URLs are US-ASCII with a percent encoding system allowing arbitrary bytes. However, programs and programmers tend to work in terms of encoded strings, so just like UNIX filenames, many things assume that percent-encoded URLs correspond to valid UTF-8 and fail when they aren't. In the case of HTTP, this specifically includes ECMAScript.

Looping back, yes, UNIX paths are arbitrary byte sequences… but these bytes are to be interpreted according to the user's C locale. Typical Linux users today run UTF-8 locales, so typical Linux filenames are already UTF-8. If a user is not running UTF-8, they'll have problems: common things like glib assume filenames are UTF-8 regardless of locale, while others like KDE do honor the locale by default. Worse, even this behavior isn't consistent, because both behaviors can be overridden by user-level configuration. This is a mess. Paths should not be subject to re-interpretation depending on a matrix of each user's environment variables, least of all paths that are shared in one global namespace.
Filesystems on major non-Linux platforms – including Windows and at least one UNIX – avoid this mess by requiring their filenames to contain Unicode strings. NFS avoids this mess by declaring that paths are UTF-8. IPFS should do the same. I think it's better to add a small amount of complexity to IPFS (validate that paths are UTF-8 and prohibit the most problematic characters) than to ignore the issue and shift complexity to everything else (forcing every other application to deal with IPFS paths that are not necessarily Unicode, that contain control characters, and so on).
Yes, I agree with UTF-8. I think it's time for the Path System to be native UTF-8. After all, the focus of paths is print (screens, humans, paper, etc), including societies which have different alphabets, and would want different characters in their paths. — On Mon, Sep 21, 2015 at 2:05 PM, Will Glynn notifications@github.com
Per my proposal in ipfs#1710:

- Paths must be valid UTF-8 per RFC 3629.
- Paths may not contain ASCII/Unicode C0 control characters (U+0000-U+001F).
- Paths may not contain ASCII DEL (U+007F).
- Paths are delimited by `/` (U+002F), and therefore path segments may not contain it.
- Path segments may contain up to 255 Unicode codepoints. Total path length remains unbounded.
- Path segments may not be empty, so that `foo//bar` can mean `foo/bar`, as in POSIX.
- Path segments must not be `.` or `..`, so that these can mean what they do in POSIX.

Paths may contain any sequence of Unicode codepoints that are not otherwise prohibited. This includes many things that could prove problematic; see path/validation_test.go +121 for some examples.
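A sketch of those rules as a standalone Go validator; the function names and error messages here are illustrative, not the actual go-ipfs API:

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// validateSegment checks one path segment against the proposed rules.
func validateSegment(seg string) error {
	if seg == "" {
		return fmt.Errorf("empty segment")
	}
	if seg == "." || seg == ".." {
		return fmt.Errorf("reserved segment %q", seg)
	}
	if !utf8.ValidString(seg) {
		return fmt.Errorf("not valid UTF-8")
	}
	if utf8.RuneCountInString(seg) > 255 {
		return fmt.Errorf("more than 255 codepoints")
	}
	for _, r := range seg {
		if r <= 0x1F || r == 0x7F { // C0 controls and DEL
			return fmt.Errorf("control character U+%04X", r)
		}
	}
	return nil
}

// validatePath splits on the '/' delimiter and checks each segment,
// so a segment can never contain '/' by construction.
func validatePath(p string) error {
	for _, seg := range strings.Split(p, "/") {
		if err := validateSegment(seg); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	fmt.Println(validatePath("foo/bar"))  // <nil>
	fmt.Println(validatePath("foo//bar")) // empty segment
	fmt.Println(validatePath("foo/../x")) // reserved segment ".."
}
```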
If paths are UTF-8, I think there might be issues with normalizing or not normalizing the paths. If we normalize before hashing, people might not understand that what they get out will not be binary-identical with what they put in. If we don't normalize, people might not understand that Unicode-equivalent paths produce different hashes.
I agree that we should consider the paths to be UTF-8 if they are to be interpreted as Unicode, or even require that the string be valid UTF-8, but the primary representation of the path should be a byte string and not a Unicode string. Else we have all the normalisation problems to deal with, as @chriscool points out. As such, IPFS should require path names to be valid UTF-8 but should also ensure that the byte string is left unmodified and can be byte-compared for identity.

Also, this is true for unixfs, but can't we imagine applications that would like to have binary keys? Or will applications that want to use binary strings have to take a similar approach as lz-string: convert binary data to Unicode? I can think for example that an application would want to store a public key as a link name. Would this application have to convert it using Base64 or similar? Isn't it ironic that we can't store binary data when the wire format is binary and would let us do that easily?
Yep – damned if you do, damned if you don't.

Normalizing means the "same" path gets the same hash even if it's entered differently, but yep, it has the disadvantage of making the path you put in different from the path you get out. This can cause problems which seem more serious than the gains from normalizing.

Not normalizing means that the same character can be encoded using different sequences of codepoints, which probably will cause problems where text as entered won't match text as encoded in IPFS. On the other hand, round-tripping paths through a human is a problem even with most canonicalization schemes, since many different characters are rendered using identical glyphs, e.g. U+2126 OHM SIGN versus U+03A9 GREEK CAPITAL LETTER OMEGA.

HFS+ rewrites Unicode filenames in a way that supports its desired case-insensitivity behaviors. Specifically, it performs Unicode decomposition using a fixed table that originally dated to Unicode 2.1. Apple updated this table in 10.3 to Unicode 3.2, requiring special handling for volumes written by earlier versions.

I favor validating names but leaving them alone.
Agreed. This is my approach in #1740.
The wire format would permit it, but enabling it would push that complexity onto everything else that consumes paths.
For unixfs paths, I understand fully why we want them human-readable, but there are many applications that could use IPFS besides unixfs. I'm in favor of only allowing UTF-8 paths for unixfs, but not restricting paths for other data models, unless there is a good reason to.
👍
@mildred the path namespace is already limited, and not arbitrary-byte. FWIW:
On normalizing: yeah, this is a difficult decision. I'm not yet sure what to do. Changing the user's data is -- to me, in general -- a source of problems. But as @willglynn mentions, damned if you do, damned if you don't. It may be that in some implementation or mode, objects which are not normalized -- i.e. where normalizing them changes the bytes -- are not silently changed but instead rejected with a meaningful error. This would bypass the uncertainty. (It sounds like that's what you suggest, @willglynn?)
I usually see HFS+ as a source of examples of what not to do. Though indeed useful to study.
@willglynn @jbenet
Is this still an issue?
IPFS path components (that is, IPFS merkledag node link names) can contain undesirable things:

- null bytes (`0x00`, ASCII `NUL`)
- newlines (`0x0A`, `"\n"`, ASCII `LF`)
- tabs (`0x09`, `"\t"`)
- escapes (`0x1B`, ASCII `ESC`)
- control characters generally (`0x00`…`0x1F`, `0x7F`)
- path separators (`0x2F`, `"/"`)
- bytes that never occur in UTF-8 (`0xFE`, `0xFF`)
- malformed UTF-8 sequences (e.g. a continuation byte `0x80` in isolation)
- UTF-16 surrogate halves encoded as UTF-8 (`U+D800`…`U+D8FF`)
- overlong UTF-8 encodings (`0xC0 80`, `0xE0 80 80`, `F0 80 80 80` all decode as `U+0000`, which is `NUL`)

Path components can also be strings that are commonly understood as path components with a special meaning, namely `"."` and `".."`.

I propose that path components (link names) be restricted both by specification and implementation, preferably by defining "valid" path components in a way that excludes the above. `go-ipfs` should refuse to create invalid links. `go-ipfs` should also refuse to process any invalid links it sees, either by discarding the link or discarding the node, and that behavior should be specified as well.