
CID Decoding Algorithm does not define "string", and allows false positives #44

JDLH opened this issue Mar 21, 2021 · 4 comments

JDLH commented Mar 21, 2021

The CID Decoding Algorithm contains a step, "1. If it's a string (ASCII/UTF-8):". This is not sufficiently specified, and is subject to false positives.

The first problem is the consequences of the term "UTF-8", and the lack of an algorithm for determining whether the target byte sequence "is a string (in UTF-8)". UTF-8 strings may consist of byte values less than, greater than, or equal to 0x80, but not every sequence of byte values is valid UTF-8. The Unicode Standard, Section 3.9, Unicode Encoding Forms, goes into the details of what is valid and what is not, and some of the ins and outs are non-obvious and probably undesirable in a CID decoding algorithm. If the intention is that the test should be, "1. if the target byte sequence consists of byte values from 0x21 to 0x7F inclusive", then say that. If the intention is that the test should be, "1. if the target byte sequence is a valid base58btc encoding", then say that instead.
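
To make the difference between those two candidate tests concrete, here is a minimal Python sketch (the sample bytes are purely illustrative and are not a real CID): a byte sequence can be valid UTF-8 and still contain bytes outside the 0x21 to 0x7F range, so the two tests accept different inputs.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Candidate test: the sequence decodes as UTF-8 (per Unicode Section 3.9)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def is_in_0x21_to_0x7f(data: bytes) -> bool:
    """Candidate test: every byte value is in the range 0x21..0x7F inclusive."""
    return all(0x21 <= b <= 0x7F for b in data)

sample = b"Qm\xc3\xa9"             # decodes as "Qmé": valid UTF-8, but contains bytes >= 0x80
print(is_valid_utf8(sample))       # True  -- passes the UTF-8 test
print(is_in_0x21_to_0x7f(sample))  # False -- fails the byte-range test
```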

The second problem is that any byte sequence which can be interpreted as a valid ASCII string, or a valid UTF-8 string, is still also a valid binary byte sequence. Thus the decoding algorithm does not rule out the case where someone constructs a CID which happens to consist of a sequence of bytes with values from 0x21 to 0x7F inclusive. This might accidentally get parsed according to case 1 of the algorithm, when it should get processed according to case 2. If there is something about the CID structure which means that a binary CID will never pass the test of case 1, it would be clearer for the documentation to say so.

A third problem is ambiguity about what a decoder should do if it attempts to decode a string according to case 1, but the contents are not valid according to the base58btc or multibase specifications. Should the decoder treat the data as a binary byte sequence and attempt to decode it that way, or should the decoding attempt fail with an error?

I am new to the CID design, but I am a software engineer with experience working with text-based formats, including UTF-8. These sorts of ambiguities can cause real problems. It may be that they are made clear elsewhere; this issue reflects my naive reading of the CID Decoding Algorithm in isolation. I suggest that the algorithm should not be ambiguous even when read in isolation.

Suggested rewording of the spec to address the first problem above:

The algorithm to decode a byte sequence into a CID is as follows.

  1. If the bytes in the sequence, when interpreted as a UTF-8 string, consist only of the characters of the Base58 alphabet ("123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"), then …
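
In code, the reworded test might look something like this rough sketch (the alphabet string is copied from the wording above; the function name is just for illustration):

```python
# base58btc alphabet as quoted above (no 0, O, I or l).
BASE58BTC_ALPHABET = frozenset(
    b"123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
)

def matches_reworded_step_1(data: bytes) -> bool:
    # Every byte, read as an ASCII/UTF-8 character, must be a member of
    # the Base58 alphabet.  All of those characters are single-byte in
    # UTF-8, so a per-byte check is equivalent to a per-character check.
    return all(byte in BASE58BTC_ALPHABET for byte in data)
```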

vmx commented Mar 23, 2021

Thanks @JDLH for such a thorough review of the spec. This spec really needs an update to be more precise, and some related specs/tech need to change as well, which is why I opened #43.

> The CID Decoding Algorithm contains a step, "1. If it's a string (ASCII/UTF-8):". This is not sufficiently specified, and is subject to false positives.

You are right. If you only have raw bytes as input, you won't know whether it's binary or a string.

Currently there is an issue open (#42) that aims to clarify the relation to Multibase. I think that would help solve these decoding ambiguities. If it is Multibase encoded, you would first decode it and then follow the steps of the binary CID decoding.
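
As a rough sketch of that flow, with `multibase_decode` and `decode_binary_cid` as hypothetical placeholders for the Multibase decoder and the binary CID decoding steps:

```python
def decode_string_cid(text: str, multibase_decode, decode_binary_cid):
    # Multibase-decode the string form to raw bytes, then hand those
    # bytes to the existing binary CID decoding steps.  Both callables
    # are hypothetical placeholders, not real APIs.
    raw = multibase_decode(text)
    return decode_binary_cid(raw)
```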

That said, in your application you might already know whether the CID is a string or binary, depending on where you got it from. So it might be easier to keep the distinction between string and binary decoding the way it is currently outlined.

> Suggested rewording of the spec to address the first problem above:
>
> The algorithm to decode a byte sequence into a CID is as follows.
>
>   1. If the bytes in the sequence, when interpreted as a UTF-8 string, consist only of the characters of the Base58 alphabet ("123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"), then …

I agree that this is way clearer than "1. If it's a string (ASCII/UTF-8)". Would you mind opening a PR with that change?

JDLH commented Mar 24, 2021

So, it sounds like the decoding algorithm should account for three cases:

  1. The decoder believes from context that the data is a binary representation of a CID. Decode according to current part 2.
  2. The decoder believes from context that the data is a text representation of a CID. Decode according to current part 1 (disregarding the "If it's a string" condition).
  3. The decoder has the data, and no context to tell what kind of representation it is. In that case, attempt to decode as if it were a text representation. If errors occur, then stop, and attempt to decode as if it were a binary representation. If that fails, the overall decode attempt fails.

That eliminates the entire "if it's a string" concept. It delegates the question of what characters are valid in a text representation to the base58btc and multibase specifications, to which it refers. Note that in new case 3, if the data is a binary representation, the text representation decode will fail because some byte values will be outside the base 58 alphabet.
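
A rough sketch of case 3, with `decode_text` and `decode_binary` as hypothetical placeholders for the algorithm's current parts 1 and 2 (each assumed to raise an error on invalid input):

```python
class CIDDecodeError(ValueError):
    """Neither the text nor the binary decode attempt succeeded."""

def decode_cid_without_context(data: bytes, decode_text, decode_binary):
    # Case 3: no context about which representation `data` is.
    try:
        return decode_text(data)    # first, attempt the text representation
    except ValueError:
        pass
    try:
        return decode_binary(data)  # then, attempt the binary representation
    except ValueError as err:
        raise CIDDecodeError("not a valid text or binary CID") from err
```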

This still raises the question about whether a binary representation of a CID could happen to consist entirely of base 58 characters in a valid sequence.

Also, does it matter if current software does not behave this way? Suppose a current decoder in fact checks whether the data consists only of base 58 characters, instead of trying to decode according to multibase and failing. Is that a problem?

rvagg commented Mar 24, 2021

I'm not sure we want to care too deeply about case (3); it's very unlikely that there will exist cases where we have disembodied strings of bytes that we want to probe to figure out "is this a CID? and if so, should it be a string or bytes?". CIDs are almost always embedded with context that defines this, and maybe the spec should be clearer on that. It does read as if it's accounting for the most general case of "I have some bytes .. what do I do with them?", but I don't believe that's the intention. I think the intention is that there are two cases, each with its own path, for decoding a CID. The case is determined by the use and context, and that's outside the scope of the CID spec.

But clarity on any of this would be helpful of course. I'd just lean to turning (3) into something along the lines of "it's up to the use-case to determine whether to decode as string or bytes, it's not the responsibility of the algorithm and disambiguation of the two is not in scope of the spec".

lidel commented Apr 6, 2021

> it's very unlikely that there will exist cases where we have disembodied strings of bytes that we want to probe to figure out "is this a CID? and if so, should it be a string or bytes?".

[Based on anecdotal evidence] I've seen people in the web3 space do all kinds of weird things when storing CIDs on chain, and any third-party analytics tool that wants to inspect contracts will have to do (3) if it wants to correctly detect all CIDs and avoid false negatives. We should account for this, even if it is a one-sentence heuristic: "if you are unsure, try the binary path; if that fails, retry as a string" (iiuc binary-first is less expensive due to the lack of multibase).
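
A minimal sketch of that one-sentence heuristic, again with hypothetical `decode_binary` / `decode_text` placeholders standing in for the two decoding paths:

```python
def detect_cid(data: bytes, decode_binary, decode_text):
    # Try the (cheaper, multibase-free) binary path first, and only
    # retry as a string if it fails.  Returns None if neither path
    # yields a CID.
    for decode in (decode_binary, decode_text):
        try:
            return decode(data)
        except ValueError:
            continue
    return None
```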
