Libbitcoin support for mnemonic wallet seed encoding began with Electrum v1. Later came BIP39, driven by the fine folks behind Trezor, which we believed Electrum was adopting. When the fine folks behind Electrum decided against BIP39, we found ourselves with three implementations. We had dropped Electrum v1 in the expectation that BIP39 would become sufficient. Later we added Electrum but found it necessary to also restore Electrum v1. It is not possible to properly implement Electrum mnemonic support without also implementing Electrum v1 and BIP39 mnemonics.
An overhaul of our mnemonic implementations was well overdue. What was anticipated to require one week required over a month of full-time work. Test coverage is nearly complete and it will be merged soon. Before I forget the various lessons learned, I decided to write them down here. The information is all out there, somewhere, but ultimately it required digging through a lot of Python and C code. Wallet seeds are not something for a developer to take lightly, and code is always authoritative. Eventually I found myself sifting through Python internals, a deeper rabbit hole than I expected.
I will state for the record that I truly appreciate both Electrum and Trezor. Otherwise I would not have spent the time to provide comprehensive support for all three of these encodings. These observations are provided for my own record and to possibly aid others who may at some point find themselves in that same rabbit hole. When one goes this deep into implementation, interesting discoveries abound.
A language is a universally-unique natural language.
Libbitcoin refers to a language by the IANA subtag standard.
In linguistics a token is an "individual occurrence of a linguistic unit in speech or writing".
Tokens contain no whitespace code points.
Tokens may or may not be normal form.
Electrum allows seed generation from tokens (i.e. non-dictionary words) in normal form.
A dictionary is a standard ordered set of distinct reference tokens of a single language.
There may be more than one dictionary per language.
Dictionaries of the same or distinct languages may intersect.
A dictionary defines its word order, which may or may not be a lexicographic sort.
An interpreter is a set of same-length (word-count) dictionaries of distinct languages, each identified by language.
An interpreter maps between entropy and mnemonic forms, given a specified or detected language.
There is no necessary standard defining the set of interpreter dictionaries.
A word is a dictionary token.
A mnemonic is an ordered set of words from a common dictionary, conforming to standard size and checksum constraints.
Electrum v1 does not implement checksum constraints.
A mnemonic may be fully contained by multiple dictionaries.
A mnemonic may be referred to as a "recovery seed" by some implementations.
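The relationship between dictionaries, an interpreter, and a mnemonic can be sketched in a few lines of Python. The dictionaries below are toy word lists invented for illustration (real dictionaries are far larger), and the function name is hypothetical rather than part of Libbitcoin, Electrum, or BIP39. The point is only that containment is tested per dictionary, so one mnemonic may be claimed by several languages.

```python
# Toy dictionaries keyed by language subtag. These word lists are
# illustrative assumptions, not excerpts of any real dictionary.
DICTIONARIES = {
    "en": ["abandon", "ability", "able", "piano", "zoo"],
    "fr": ["abaisser", "abandon", "piano", "zoo"],
}


def containing_languages(words, dictionaries=DICTIONARIES):
    """Return every language whose dictionary contains all of the words."""
    return [
        language
        for language, dictionary in dictionaries.items()
        if all(word in dictionary for word in words)
    ]


print(containing_languages(["abandon", "piano"]))  # both toy dictionaries
print(containing_languages(["ability", "zoo"]))    # only the "en" toy dictionary
```

When more than one language is returned, detection is ambiguous and the caller has to specify the language explicitly.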
A whitespace character is a standard character with a glyph of no visible pixels.
A sentence is a mnemonic serialized as a sinistrodextral string of its words with whitespace delimiters.
An encoding is a standard bidirectional map between any mnemonic and its numeric representation.
The Electrum v1 encoding is (inadvertently) not fully bidirectional.
A normal form is a standard word, sentence or passphrase character representation.
A single glyph may have multiple distinct code points, and many distinct glyphs may be rendered similarly or identically.
Word containment by a dictionary is determined by normal form equality.
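As a small illustration of why normal form matters for containment, the sketch below compares two spellings of the same visible word. It assumes NFKD, the normalization that BIP39 specifies for words and passphrases. The helper names are hypothetical.

```python
import unicodedata


def normal_form(text):
    """Return the NFKD normal form of the text (assumed normal form)."""
    return unicodedata.normalize("NFKD", text)


def same_word(left, right):
    """Compare two tokens by normal form equality."""
    return normal_form(left) == normal_form(right)


composed = "caf\u00e9"     # 'e with acute' as a single code point
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent

print(composed == decomposed)           # False: the code points differ
print(same_word(composed, decomposed))  # True: the normal forms are equal
```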
The entropy of a mnemonic is its numeric representation.
Both a mnemonic and its entropy represent the same entropic value.
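To make the mapping between entropy and mnemonic concrete, here is a minimal sketch of the BIP39 encoding as described in the BIP39 specification: the entropy is extended with the leading bits of its SHA-256 hash as a checksum, and the result is cut into 11-bit dictionary indexes. The function names are hypothetical, and this is not Libbitcoin's implementation. Electrum and Electrum v1 use different encodings.

```python
import hashlib


def entropy_to_indexes(entropy):
    """Map BIP39 entropy bytes to a list of 11-bit dictionary indexes."""
    if len(entropy) not in (16, 20, 24, 28, 32):
        raise ValueError("entropy must be 16, 20, 24, 28 or 32 bytes")
    entropy_bits = len(entropy) * 8
    checksum_bits = entropy_bits // 32
    # The checksum is the leading bits of the SHA-256 hash of the entropy.
    checksum = hashlib.sha256(entropy).digest()[0] >> (8 - checksum_bits)
    value = (int.from_bytes(entropy, "big") << checksum_bits) | checksum
    count = (entropy_bits + checksum_bits) // 11
    return [(value >> (11 * (count - 1 - i))) & 0x7FF for i in range(count)]


def indexes_to_entropy(indexes):
    """Inverse map; raises ValueError if the checksum does not match."""
    if len(indexes) not in (12, 15, 18, 21, 24):
        raise ValueError("mnemonic must be 12, 15, 18, 21 or 24 words")
    total_bits = len(indexes) * 11
    checksum_bits = total_bits // 33
    value = 0
    for index in indexes:
        value = (value << 11) | index
    entropy = (value >> checksum_bits).to_bytes(
        (total_bits - checksum_bits) // 8, "big")
    if entropy_to_indexes(entropy) != list(indexes):
        raise ValueError("checksum mismatch")
    return entropy


# Sixteen zero bytes map to eleven zero indexes followed by index 3.
print(entropy_to_indexes(bytes(16)))
```

A word list then maps each index to a word, and the reverse lookup recovers the indexes from the words.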
A passphrase is arbitrary text that may be combined with a mnemonic in the formation of a seed.
Electrum v1 does not implement a passphrase.
A seed is a secret number, derived using a standard one-way hash from a mnemonic.
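For BIP39 specifically, the one-way derivation is PBKDF2 with HMAC-SHA512, 2048 iterations, and a salt formed from the string "mnemonic" followed by the passphrase, with both sentence and passphrase in NFKD form. The sketch below follows the BIP39 specification using only the Python standard library. The function name is hypothetical, and Electrum and Electrum v1 derive their seeds differently.

```python
import hashlib
import unicodedata


def bip39_seed(sentence, passphrase=""):
    """Derive the 64 byte BIP39 seed from a sentence and optional passphrase."""
    password = unicodedata.normalize("NFKD", sentence).encode("utf-8")
    salt = ("mnemonic" + unicodedata.normalize("NFKD", passphrase)).encode("utf-8")
    return hashlib.pbkdf2_hmac("sha512", password, salt, 2048, dklen=64)


sentence = " ".join(["abandon"] * 11 + ["about"])
print(bip39_seed(sentence).hex())
print(bip39_seed(sentence, "TREZOR").hex())
```

Note that this derivation does not consult a dictionary or verify the checksum, so any text produces a seed. Validating the mnemonic is a separate step.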
A master private key is a secp256k1 private key, obtained from a seed in a standard manner, allowing spending.
BIP39 wallets typically derive this (and a chain code) from the seed in accordance with BIP32.
Electrum serializes the seed as a secret and chain code in accordance with BIP32 serialization.
Electrum v1 maintains this as a 32 byte value.
A master public key is a secp256k1 public key, derived in the standard one-way manner from the master private key, allowing receiving.
Electrum and typical BIP39 wallets derive this in accordance with BIP32.
Electrum v1 maintains this as a 64 byte value (uncompressed, without prefix).
The following standards are implied by the above terminology.
- Language (identification)
- Dictionary (words and order)
- Mnemonic (length and checksum)
- Whitespace (delimiters)
- Normal Form (word, sentence, and passphrase)
- Encoding (entropy mapping)
- Seed (derivation)
- Master Private Key (derivation)
- Master Public Key (derivation)
The reliance of Electrum and BIP39 on Unicode word and passphrase normalization is an inherent risk. Unicode implementations are large and complex. Trivial conversions in ASCII, such as lower-casing, become treacherous in Unicode.
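The lower-casing hazard is easy to reproduce with Python's built-in string methods. In this short illustration the helper name is hypothetical. It shows two code points for which case conversion changes the length of the string or cannot be undone.

```python
def case_round_trips(text):
    """Return True if upper-casing then lower-casing restores the text."""
    return text.upper().lower() == text


sharp_s = "\u00df"   # LATIN SMALL LETTER SHARP S
dotted_i = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE

print(case_round_trips("abc"))    # True: ASCII letters survive the round trip
print(case_round_trips(sharp_s))  # False: upper-casing expands it to "SS"
print(len(dotted_i.lower()))      # 2: 'i' plus a combining dot above
```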
"When two applications share Unicode data, but normalize them differently, errors and data loss can result. In one specific instance, OS X normalized Unicode filenames sent from the Samba file and printer sharing software. Samba did not recognize the altered filenames as equivalent to the original, leading to data loss. Resolving such an issue is non-trivial, as normalization is not losslessly invertible."
For this reason we have implemented Libbitcoin mnemonics without a hard dependency on Unicode normalization. The Electrum v1, Electrum, and BIP39 classes do not require Unicode normalization unless a non-ASCII passphrase is provided. If the library is compiled with WITH_ICU undefined, all features remain available, with the exception that seed passphrases are ASCII-limited.
For the same reason Libbitcoin does not support Electrum token-based seeding. All words must correspond to a dictionary. When WITH_ICU is defined, words are Unicode normalized before comparison, to improve the chance of matching. Ideally an implementation provides a dictionary-based word selector, making this unnecessary. If WITH_ICU is undefined then word normalizations are ASCII-limited, though pre-normalized non-ASCII words will match the dictionary.
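The ASCII-limited behavior can be pictured with the following sketch. It is an assumption about what "ASCII-limited" normalization looks like, not Libbitcoin's code: only the letters A to Z are lower-cased, and every other code point is compared exactly as given. The toy dictionary and function names are hypothetical.

```python
def ascii_lower(word):
    """Lower-case only A-Z, leaving every other code point untouched."""
    return "".join(
        chr(ord(char) + 32) if "A" <= char <= "Z" else char for char in word
    )


def contains(dictionary, word):
    """Dictionary containment using ASCII-limited normalization only."""
    return ascii_lower(word) in dictionary


# Toy dictionary; the second entry is stored in decomposed form.
dictionary = ["abandon", "e\u0301poca"]

print(contains(dictionary, "ABANDON"))      # True: ASCII case is normalized
print(contains(dictionary, "e\u0301poca"))  # True: pre-normalized word matches
print(contains(dictionary, "\u00e9poca"))   # False: composed form needs Unicode normalization
```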
A mnemonic sentence must be parsed into a list of words for dictionary matching and seed generation. Similarly a mnemonic is often emitted in sentence form for portability.
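A minimal sketch of both directions follows, with hypothetical function names. Python's `str.split()` with no arguments splits on any run of Unicode whitespace, which covers the ideographic space (U+3000) that BIP39 uses as the delimiter for Japanese sentences. Joining uses a single ASCII space by default, which is an assumption suitable for most languages.

```python
def split_sentence(sentence):
    """Split a sentence into words on any run of Unicode whitespace."""
    return sentence.split()


def join_words(words, delimiter=" "):
    """Serialize a list of words as a sentence with a single delimiter."""
    return delimiter.join(words)


words = split_sentence("  abandon\u3000ability \t about\n")
print(words)
print(join_words(words))
```

Splitting alone does not normalize the words, so dictionary matching still depends on the normal form of each token.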