Too many Code duplicates #89

fluency03 · 2018-11-15T13:04:24Z

Such as:

raw 0x55 - base64urlpad 'U', which is 0x55
bencode 0x63 - base32pad 'c', which is 0x63
dbl-sha2-256 0x56 - base32hex-upper 'V', which is 0x56

And,

multihash 0x31 - base1 '1', which is 0x31
multicodec 0x30 - base2 '0', which is 0x30
dns6 0x37 - base8 '7', which is 0x37
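
A quick way to confirm the overlaps listed above is to compare each multicodec code with the ASCII value of the corresponding multibase prefix. This is only a sketch: the `pairs` array is copied by hand from the lists above, not read from the real table, and the variable names are made up for illustration.

```javascript
// Pairs copied from the lists above:
// [multicodec name, multicodec code, multibase name, multibase prefix].
const pairs = [
  ["raw", 0x55, "base64urlpad", "U"],
  ["bencode", 0x63, "base32pad", "c"],
  ["dbl-sha2-256", 0x56, "base32hex-upper", "V"],
  ["multihash", 0x31, "base1", "1"],
  ["multicodec", 0x30, "base2", "0"],
  ["dns6", 0x37, "base8", "7"],
];

const toHex = (n) => "0x" + n.toString(16);

// For each pair, compare the multicodec code with the prefix's ASCII value.
const report = pairs.map(([codec, code, base, prefix]) => {
  const ascii = prefix.charCodeAt(0);
  return { codec, base, code: toHex(code), ascii: toHex(ascii), collides: code === ascii };
});

// True only if every listed pair shares the same numeric value.
const allCollide = report.every((row) => row.collides);
console.log(report, allCollide);
```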


Stebalien · 2018-11-15T18:25:05Z

See: #59 and the followup #68.

Basically, 'U' happens to encode to 0x55 in ASCII/UTF-8 but 'U', itself, is a symbol. Multibase only really makes sense in a text context where we have a string of character symbols.

Note: after some followup discussions, we realized that these really don't belong in the same table. Technically, bytes are also symbols but I'm not aware of any text encoding that allows for both character symbols and byte symbols. The current setup causes more confusion than it's worth.

fluency03 · 2018-11-15T18:33:28Z

@Stebalien

The term "symbol" is really confusing because, as far as I know, it doesn't correspond to a data type in any programming language. At least most languages have a Byte data type.

From an implementation point of view, I still don't understand how to represent a symbol when the other codes have a Byte field in the same place.

fluency03 · 2018-11-15T18:35:20Z

According to this implementation https://github.com/multiformats/js-multicodec/blob/master/src/base-table.js, there are only the following bases implemented:

// bases encodings
exports['base1'] = Buffer.from('01', 'hex')
exports['base2'] = Buffer.from('00', 'hex')
exports['base8'] = Buffer.from('07', 'hex')
exports['base10'] = Buffer.from('09', 'hex')

And here, the so-called symbols are actually treated as hex-encoded bytes.

Can I do this if I want to implement it in another language?

fluency03 · 2018-11-15T18:55:02Z

Update from #76:

Both js-multicodec and py-multicodec are wrong.

Stebalien · 2018-11-15T18:56:06Z

"The terminology "symbol" is really confusing, because it does not belong to any data type in, as far as I know, any programming language."

Copied from #76 (comment) to keep everything in this thread:

For example, binary is composed of two symbols 0 and 1 (or true and false). Bytes are defined to each be a string of 8 binary symbols but are also, themselves, symbols (there are 256 of them).

Every character is also a symbol. On a computer, these symbols may be encoded into bits/bytes, but there are often several ways to encode a single symbol into bits/bytes, and the symbol exists apart from these encodings (a '1' on paper is a '1', not 0x31).

When I say symbol, I'm talking about these: https://en.wikipedia.org/wiki/Turing_machine

(I agree this is confusing. It's "technically" correct but I can't think of a better explanation that's still correct.)

fluency03 · 2018-11-15T19:03:48Z

So that means, in an actual implementation, a symbol has to be implemented as a special data structure, maybe called Symbol. Some properties of this data type Symbol could be something like this:

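// Hypothetical sketch: a code that is either a raw byte value or a text symbol.
final case class Symbol(
  isByte: Boolean,    // true if this code is a raw byte rather than a character
  value: Array[Byte]  // the encoded bytes of the code
)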


Stebalien · 2018-11-15T19:31:38Z

It's probably best to just have two tables:

- Multicodecs: these use varint byte sequences.
- Multibases: these use text symbols.

Combining them under a single abstraction probably isn't worth it.
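
For the multicodec side, a minimal sketch of an unsigned varint (7 bits of the value per byte, least-significant group first, high bit set on every byte except the last) could look like this. The function names `encodeVarint` and `decodeVarint` are hypothetical and not taken from any multiformats library, and the sketch assumes values small enough to be exact JavaScript numbers.

```javascript
// Encode a non-negative integer as an unsigned varint.
function encodeVarint(value) {
  const bytes = [];
  while (value >= 0x80) {
    bytes.push((value % 0x80) | 0x80); // low 7 bits, continuation bit set
    value = Math.floor(value / 0x80);
  }
  bytes.push(value); // final byte, continuation bit clear
  return Uint8Array.from(bytes);
}

// Decode an unsigned varint from the start of a byte array.
// Returns the value and how many bytes it occupied.
function decodeVarint(bytes) {
  let value = 0;
  let multiplier = 1;
  for (let i = 0; i < bytes.length; i++) {
    value += (bytes[i] & 0x7f) * multiplier;
    if (bytes[i] < 0x80) return { value, length: i + 1 };
    multiplier *= 0x80;
  }
  throw new Error("truncated varint");
}

console.log(encodeVarint(0x55)); // codes below 0x80 fit in a single byte
console.log(decodeVarint(encodeVarint(300)));
```

Codes below 0x80, such as 0x55 for raw, encode to a single byte, which is why they look identical to ASCII characters when printed as hex.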

For multibase, you'd just use whatever encoding your language supports. For example, the symbol 👍 has one encoding in UTF-8, another in UTF-16, and another in UTF-32. At the end of the day, that doesn't really matter. The important part is whether or not some string starts with the symbol 👍 (regardless of encoding).
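
To make that concrete, here is a small Node.js sketch that encodes the same symbol three ways. Node's `Buffer` has no built-in UTF-32 encoding, so the UTF-32 bytes are written by hand from the code point (little-endian, no byte-order mark); that part is an assumption of this sketch rather than a library feature.

```javascript
const symbol = "👍";

// The same symbol, three different byte sequences.
const utf8Hex = Buffer.from(symbol, "utf8").toString("hex");
const utf16leHex = Buffer.from(symbol, "utf16le").toString("hex");

// UTF-32LE by hand: one 4-byte little-endian integer per code point.
const utf32le = Buffer.alloc(4);
utf32le.writeUInt32LE(symbol.codePointAt(0), 0);
const utf32leHex = utf32le.toString("hex");

// At the string level, the check is independent of any byte encoding.
const startsWithSymbol = (text) => text.startsWith(symbol);

console.log(utf8Hex, utf16leHex, utf32leHex, startsWithSymbol("👍abc"));
```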

fluency03 · 2018-11-15T20:50:23Z

However, a new question is: if we treat so many different things (such as protobuf, md4, murmur3, even ip4, udp and http) as different types of codec, why should we exclude BaseN from the codecs?

Stebalien · 2018-11-15T23:54:46Z

Those all occur in a binary context. That is, they all answer the question "what does this series of bytes mean". However, multibase occurs in a text context. It answers the question "how do I convert this sequence of characters to a sequence of bytes".
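
A rough sketch of the two contexts as two separate lookups: one keyed by a leading byte, the other keyed by a leading character. The table contents are the handful of entries mentioned in this thread, the function names are hypothetical, and the binary lookup assumes a single-byte varint (a code below 0x80).

```javascript
// Binary context: keyed by the leading code (a number read from bytes).
const multicodecTable = new Map([
  [0x55, "raw"],
  [0x63, "bencode"],
  [0x56, "dbl-sha2-256"],
]);

// Text context: keyed by the leading character (a symbol read from a string).
const multibaseTable = new Map([
  ["U", "base64urlpad"],
  ["c", "base32pad"],
  ["f", "base16"],
]);

// "What does this series of bytes mean?"
function identifyBinary(bytes) {
  return multicodecTable.get(bytes[0]); // assumes a one-byte varint
}

// "How do I convert this sequence of characters to bytes?"
function identifyText(text) {
  return multibaseTable.get(Array.from(text)[0]); // first character, not first byte
}

console.log(identifyBinary(Uint8Array.of(0x55, 0x01, 0x02))); // looked up by byte
console.log(identifyText("UTXVsdGliYXNl")); // looked up by character
```

The byte 0x55 and the character 'U' never meet in the same lookup, so the shared numeric value causes no ambiguity.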

fluency03 · 2018-11-16T00:00:10Z

import com.github.fluency03.multibase.Multibase
import com.github.fluency03.multibase.Base._

val str = "Multibase is awesome! \\o/"

Multibase.encodeString(Base32Upper, str)              // BJV2WY5DJMJQXGZJANFZSAYLXMVZW63LFEEQFY3ZP
Multibase.encodeString(Base32Pad, str)                // cjv2wy5djmjqxgzjanfzsaylxmvzw63lfeeqfy3zp
Multibase.encodeString(Base32PadUpper, str)           // CJV2WY5DJMJQXGZJANFZSAYLXMVZW63LFEEQFY3ZP

Multibase.encodeString(Base32Z, str)                  // hji4sa7djcjozg3jypf31yamzci3s65mfrrofa53x
Multibase.encodeString(Base58Flickr, str)             // ZxaJjNnAzU5jHQLhoLrXxcVM66Ca1VkLWAT
Multibase.encodeString(Base58BTC, str)                // zYAjKoNbau5KiqmHPmSxYCvn66dA1vLmwbt

Multibase.encodeString(Base64, str)                   // mTXVsdGliYXNlIGlzIGF3ZXNvbWUhIFxvLw
Multibase.encodeString(Base64Pad, str)                // MTXVsdGliYXNlIGlzIGF3ZXNvbWUhIFxvLw==
Multibase.encodeString(Base64URL, str)                // uTXVsdGliYXNlIGlzIGF3ZXNvbWUhIFxvLw
Multibase.encodeString(Base64URLPad, str)             // UTXVsdGliYXNlIGlzIGF3ZXNvbWUhIFxvLw==


val encodedStr: String = Multibase.encode(Base16, str.getBytes)
// encodedStr: String = f4d756c74696261736520697320617765736f6d6521205c6f2f

val decodedBytes: Array[Byte] = Multibase.decode(encodedStr)
// decodedBytes: Array[Byte] = Array(77, 117, 108, 116, 105, 98, 97, 115, 101, 32, 105, 115, 32, 97, 119, 101, 115, 111, 109, 101, 33, 32, 92, 111, 47)

val decodedStr = new String(decodedBytes)
// decodedStr: String = Multibase is awesome! \o/

If you take this as an example, you can also ask: what does this series of bytes mean?

That is, for f4d756c74696261736520697320617765736f6d6521205c6f2f:

- f indicates that the codec is base16
- based on the codec base16, we can answer the question "what does this series of bytes mean?": it means "Multibase is awesome! \o/".
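
The two steps above can be sketched in Node.js as follows. This handles only the base16 prefix f and assumes the decoded bytes are UTF-8 text; it is an illustration, not the API of any multibase library.

```javascript
const encoded = "f4d756c74696261736520697320617765736f6d6521205c6f2f";

// Step 1: the first character names the base.
const prefix = encoded[0];
const payload = encoded.slice(1);
if (prefix !== "f") {
  throw new Error("this sketch only understands base16 (prefix 'f')");
}

// Step 2: interpret the remaining characters as base16 (hex) digits.
const decodedBytes = Buffer.from(payload, "hex");

// Assumption: the original data was UTF-8 text.
const decodedText = decodedBytes.toString("utf8");
console.log(prefix, decodedBytes.length, decodedText);
```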

Stebalien · 2018-11-16T00:35:41Z

That's a series of characters, which could encode to entirely different sequences of bytes depending on the underlying text encoding. For example, "f4d756c74696261736520697320617765736f6d6521205c6f2f" encoded in UTF-32 (little-endian, with a byte-order mark) is [255, 254, 0, 0, 102, 0, 0, 0, 52, 0, 0, 0, 100, 0, 0, 0, 55, 0, 0, 0, 53, 0, 0, 0, 54, 0, 0, 0, 99, 0, 0, 0, 55, 0, 0, 0, 52, 0, 0, 0, 54, 0, 0, 0, 57, 0, 0, 0, 54, 0, 0, 0, 50, 0, 0, 0, 54, 0, 0, 0, 49, 0, 0, 0, 55, 0, 0, 0, 51, 0, 0, 0, 54, 0, 0, 0, 53, 0, 0, 0, 50, 0, 0, 0, 48, 0, 0, 0, 54, 0, 0, 0, 57, 0, 0, 0, 55, 0, 0, 0, 51, 0, 0, 0, 50, 0, 0, 0, 48, 0, 0, 0, 54, 0, 0, 0, 49, 0, 0, 0, 55, 0, 0, 0, 55, 0, 0, 0, 54, 0, 0, 0, 53, 0, 0, 0, 55, 0, 0, 0, 51, 0, 0, 0, 54, 0, 0, 0, 102, 0, 0, 0, 54, 0, 0, 0, 100, 0, 0, 0, 54, 0, 0, 0, 53, 0, 0, 0, 50, 0, 0, 0, 49, 0, 0, 0, 50, 0, 0, 0, 48, 0, 0, 0, 53, 0, 0, 0, 99, 0, 0, 0, 54, 0, 0, 0, 102, 0, 0, 0, 50, 0, 0, 0, 102, 0, 0, 0] (bytes/binary).
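
As a sketch of why this matters: the same multibase string stored as UTF-8 and as UTF-16LE gives different bytes on disk, yet once each is decoded back to characters the multibase payload is identical. UTF-16LE stands in for UTF-32 here because Node's `Buffer` supports it directly; `decodeBase16Multibase` is a hypothetical helper written for this example.

```javascript
const text = "f4d756c74696261736520697320617765736f6d6521205c6f2f";

// Two different byte-level representations of the same characters.
const asUtf8 = Buffer.from(text, "utf8");
const asUtf16le = Buffer.from(text, "utf16le");

// Hypothetical helper: operates on characters, never on raw bytes.
function decodeBase16Multibase(str) {
  if (str[0] !== "f") throw new Error("not a base16 multibase string");
  return Buffer.from(str.slice(1), "hex");
}

// Text decoding happens first; multibase decoding happens on the resulting string.
const fromUtf8 = decodeBase16Multibase(asUtf8.toString("utf8"));
const fromUtf16le = decodeBase16Multibase(asUtf16le.toString("utf16le"));

const differentOnDisk = !asUtf8.equals(asUtf16le);
const samePayload = fromUtf8.equals(fromUtf16le);
console.log(asUtf8.length, asUtf16le.length, differentOnDisk, samePayload);
```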

fluency03 · 2018-11-16T01:00:19Z

According to this Protocol Description - How does the protocol work?:

multicodec is a self-describing multiformat, it wraps other formats with a tiny bit of self-description. A multicodec identifier may either be a varint (in a byte string) or a symbol (in a text string).

A chunk of data identified by multicodec will look like this:
<multicodec><encoded-data>
# To reduce the cognitive load, we sometimes might write the same line as:
<mc><data>

So, in this example, f4d756c74696261736520697320617765736f6d6521205c6f2f (following the format <multicodec><encoded-data>) describes itself with the leading <multicodec>, f, followed by the <encoded-data>, 4d756c74696261736520697320617765736f6d6521205c6f2f.

Because it starts with f, it describes itself as base16, so all of the following <encoded-data> should be treated as base16.

Stebalien · 2018-11-16T20:23:27Z

I added the "symbols" concept in #68 in an attempt to address this exact issue. I'm now proposing that we remove it in #90 because it's clear that it's still confusing.

Really, multibase is a multicodec (of sorts). However, our other multicodecs all show up in a binary context while multibase shows up in a text context, and this distinction (and why it matters) is confusing.

Nit: other multicodecs usually use <mc><length><value> where multibase is always <mc><value>.

fluency03 · 2018-11-16T21:08:34Z

"Nit: other multicodecs usually use <mc><length><value> where multibase is always <mc><value>."

I think this is also inaccurate.

"other multicodecs usually use": how I understand this is:
- multicodecs are just a collection of codecs with names and codes.
- the format <mc><length><value> that you mentioned is only for multihash, not for all codecs in multicodec, right?
"multibase is always <mc><value>"

Therefore, the difference <mc><length><value> vs <mc><value> is

multihash vs multibase

instead of

multicodec vs multibase
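
To illustrate the two layouts side by side, here is a sketch that builds and parses a multihash (<mc><length><value>) and splits a multibase string (<mc><value>). It assumes sha2-256 has multicodec code 0x12 and that both the code and the length fit in one-byte varints; `parseMultihash` and `splitMultibase` are hypothetical names.

```javascript
const crypto = require("crypto");

// Build a multihash: <code><digest length><digest>.
const SHA2_256 = 0x12; // assumed multicodec code for sha2-256
const digest = crypto.createHash("sha256").update("hello").digest();
const multihash = Buffer.concat([Buffer.from([SHA2_256, digest.length]), digest]);

// Binary context: the length field says where the value ends.
function parseMultihash(bytes) {
  const code = bytes[0];
  const length = bytes[1];
  const value = bytes.subarray(2, 2 + length);
  if (value.length !== length) throw new Error("truncated digest");
  return { code, length, value };
}

// Text context: no length field, the value is simply the rest of the string.
function splitMultibase(text) {
  return { prefix: text[0], value: text.slice(1) };
}

const parsedHash = parseMultihash(multihash);
const parsedBase = splitMultibase("f4d756c74696261736520697320617765736f6d6521205c6f2f");
console.log(parsedHash.code, parsedHash.length, parsedBase.prefix);
```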


fluency03 changed the title ~~Code duplicates~~ Too many Code duplicates Nov 15, 2018

fluency03 mentioned this issue Nov 15, 2018

Text, raw varint, bytecode distinction #76

Closed

Stebalien added a commit that referenced this issue Nov 15, 2018

move multibase prefixes out of this table

b17ccaa

Resolution from a discussion with Juan and the discussion on the following issues: fixes #89 fixes #76

Stebalien mentioned this issue Nov 15, 2018

move multibase prefixes out of this table #90

Merged

ghost assigned Stebalien Nov 15, 2018

ghost added the in progress label Nov 15, 2018

vmx mentioned this issue Nov 28, 2018

Clarifying the varint nature of multibase codecs multiformats/multibase#43

Closed

vmx closed this as completed in #90 Dec 18, 2018

vmx pushed a commit that referenced this issue Dec 18, 2018

move multibase prefixes out of this table

1ec0e97

Resolution from a discussion with Juan and the discussion on the following issues: fixes #89 fixes #76

ghost removed the in progress label Dec 18, 2018
