
New protocol #404

Closed
readcash opened this issue Aug 15, 2020 · 17 comments
readcash commented Aug 15, 2020

Hi, guys.

It's going to be a bit long: while thinking about HD wallets, I came to the conclusion that current solutions are not optimal for HD wallets with thousands of addresses and become progressively slower.

I think I have a great solution not only for HD wallets, but for TONS of applications that want to interact with BCH blockchain (like, for example a PoS terminal, that waits for a transaction to an address and wants to be notified of a double-spend)...

...alas it's basically a completely new protocol. So the issue is a bit long.

My questions to you would be:

  1. does it make any sense to you? (I'm not a protocol developer, so I might say something really stupid here);
  2. If it makes any sense would you be able to implement it? (if money is an issue - I can probably ask mainnet funders whether we can spend some part of this on your work or I can try to find a private investment);
  3. If you can't implement it, but I can find a competent Go developer (I can develop in Go, but I wouldn't call myself competent) who does the work - would you accept it into BCHD?

Thank you!

Frankly, unless this protocol has some deficiency I can't see, I'd personally pay to see it implemented (especially using WebSockets and JSON).

Problems I want to solve:

  1. HD wallets become slower as more and more HD addresses are added, since you need to ask for the UTXOs of each and every address in the DB each time you start your HD wallet; or you need to monitor every transaction (which becomes a problem during bull runs or with gigabyte blocks);
  2. read.cash has a fund, and we need to monitor every address in it. We subscribe to transactions to these addresses, but each time we do a restart/deploy we risk losing some information about transactions. The same problem occurs if you listen to transactions from the browser: you lose the connection for a second, you reconnect, and you have possibly missed an important transaction;
  3. Neutrino offers some solutions, but it doesn't address how to listen to the mempool (especially if we are talking about gigabyte blocks in the future);
  4. None of the current solutions help with the situation where a transaction was evicted from the mempool because a block included a conflicting transaction;
  5. No current solution offers notifications about double-spend proofs;
  6. Current solutions don't work with SLP tokens;
  7. The current approach of asking for UTXOs is neither private nor efficient.

I think I have a solution to all of these problems, which also has a slider that goes from "go fast" to "go privately".

Proposed solution:

We need a new protocol that resembles Apache Kafka or an append-only log file. I imagine a WebSocket-style protocol, where both client and server can send messages back and forth. I'll use something like JSON in the examples, but it doesn't necessarily have to be JSON over WebSocket; it could be bi-directional gRPC or something else. (Though for me personally gRPC is painful, because read.cash is written in PHP, and PHP has very bad support for gRPC: I need to compile a gRPC module on each deploy, which would slow our deploys from 1 minute to 20+ minutes. :( So personally I would be very, very happy if this were done as a WebSocket, not gRPC. :) )

It all starts with the client telling the server that it's new.

> {"offset": null, "address_filter": "*", "event_filter": "*"}

This says: "I have no data and I want all events you see next; don't filter anything." I put *, but that's more of an idea than a concrete protocol; it could be null or anything more suitable.

The server then proceeds to output events as it (BCHD) recorded them in its own append-only file.

< {"next_offset": 7, "type": "confirmed_transaction", "transaction": { .... }}

^ Satoshi doing the first transaction ever

< {"next_offset": 16, "type": "new_block", "block": {"hash": ..., "transactions": ["txid1", "txid2"]}}

^ Satoshi mined the first block

< {"next_offset": 16, "type": "done"}

So, the server tells the client that it first saw a confirmed transaction (because it got it as part of a block from another node), then it saw a block, and then it has nothing more (because no new events happened after this; imagine we're in 2009).

I think this "done" event might be useful, but I'm not 100% sure. It's not recorded into the append-only file; rather, it says "that's it, folks, I don't have anything else", which tells the client that it can finally stop the spinning "wallet is syncing" animation.

What is an "offset"? The client shouldn't really care, besides just remembering it. The only thing that matters to it is that if it supplies offset: 16 now, it will get only events that happened after the things it has already seen.

As for BCHD, I think the offset could be a physical file offset in the append-only log. That way it would be easy to fseek to any place a new client requests.

Now if a client disconnects, crashes, or whatever, it has an "offset" at which it knows "up to this point I'm in sync with BCHD". (Unlike BitSocket or SubscribeTransactionsRequest, where a disconnect means the client loses data.)

So, now the client connects back again and sends:

> {"offset": 16, "address_filter": "*", "event_filter": "*"}

The server sends back some more events:

< {"next_offset": 27, "type": "transaction_enters_mempool", "transaction": {"id": "...", .... }}

So now we are at the point where the BCHD instance is fully synced with the network and has started receiving unconfirmed transactions into its mempool.

< {"next_offset": 37, "type": "confirmed_transaction", "transaction": {"id": "...", .... }}
< {"next_offset": 47, "type": "new_block", "block": {"hash": ..., "transactions": ["txid3"]}}

The same transaction was confirmed and put into a block. Why do we need to send the full tx every time? Well, it adds overhead to the protocol, but if you subscribed with a filter for only confirmed transactions, you would otherwise have no idea what this transaction does.

Now, comes the interesting part:

< {"next_offset": ..., "type": "transaction_enters_mempool", "transaction": {"id": "...", .... }}
< {"next_offset": ..., "type": "transaction_evicted_from_mempool_due_to_conflict", "transaction": {"id": "...", .... }}

Basically, a new block came and a transaction that was in the mempool is no longer valid.

This is a very important event, as no protocol today helps you "hear" it: you need to constantly poll addresses/transactions to find out what was evicted and is no longer valid.

< {"next_offset": .., "type": "double_spend_proof", "transactions": [{"id": "...", .... }]}

Well, ain't that nice! We now KNOW that somebody is trying something nasty!

Ok, let's dream! Ideally... ideally.. we can add even this:

< {"next_offset": .., "type": "slp_genesis", "transaction": {actually decoded SLP transaction, so that you know which token was created and its parameters}}

< {"next_offset": .., "type": "slp_transaction_enters_mempool", "transaction": {SLP transaction}}

... SLP transaction = actually decoded SLP transaction, so that you know which token was sent where, ideally with explicit information about burned tokens, so that you don't need to calculate it client-side

What does this achieve? Well, you can EASILY build all kinds of cool light applications, from HD wallets to full nodes, that sync to this BCHD node's state while restarting, disconnecting, or failing at any time.

The best part is that if a client uses some kind of transactional storage, it doesn't care if it is rebooted in the middle: it commits data to its local DB together with the latest offset. (That's how working with Apache Kafka goes.)

Now, to filters.

The event_filter is easy - it should be either * (or all or null) or something like transaction_enters_mempool, slp_genesis - so, a list of events I'm interested in.

The address_filter is either * (or all, or null) or a Bloom filter of all the addresses I'm interested in. Basically, this means I can build a Bloom filter for the thousands of addresses I care about and ask BCHD to go through all of its append-only file searching for stuff that might be of use to me. That creates some load, but Bloom filters are generally pretty fast. Why Bloom filters and not Golomb Coded Sets? I don't know whether GCS works for mempool transactions; I don't think so.
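For illustration, a tiny self-contained Bloom filter over address strings; a sketch, not BCHD's implementation (it uses FNV hashing with the Kirsch-Mitzenmacher double-hashing trick):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Bloom is a minimal Bloom filter: m bits, k probe positions derived
// from a single 64-bit FNV hash (Kirsch-Mitzenmacher double hashing).
type Bloom struct {
	bits []uint64
	m, k uint64
}

func NewBloom(m, k uint64) *Bloom {
	return &Bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

func (b *Bloom) hashes(s string) (h1, h2 uint64) {
	h := fnv.New64a()
	h.Write([]byte(s))
	sum := h.Sum64()
	return sum, (sum >> 33) | 1 // h2 odd so probes cover the filter
}

// Add sets the k bit positions for s.
func (b *Bloom) Add(s string) {
	h1, h2 := b.hashes(s)
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % b.m
		b.bits[pos/64] |= 1 << (pos % 64)
	}
}

// MayContain reports a possible match: false is definite, true may
// be a false positive (that's the Bloom filter trade-off).
func (b *Bloom) MayContain(s string) bool {
	h1, h2 := b.hashes(s)
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % b.m
		if b.bits[pos/64]&(1<<(pos%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	f := NewBloom(1<<16, 7)
	f.Add("bitcoincash:qq123")
	fmt.Println(f.MayContain("bitcoincash:qq123")) // true
	fmt.Println(f.MayContain("bitcoincash:qq999")) // almost certainly false
}
```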

But now, we can add one more operation:

> {"offset": -1}

which will give us the latest current offset back and do nothing else:

< {"next_offset": 1726389712}

So now, when we are creating a new HD wallet, we can be sure we don't need to know anything before this offset, because the wallet didn't exist prior to it. (This doesn't work for wallets restored from backup; those would have to re-scan the chain, sadly.)

Now I can create a Lightning(tm) fast in-browser HD wallet in less than 18 months. Just build a Bloom filter and ask for stream of transactions after the last known offset - most of the time I will get no responses other than a transaction that interests me and "done".

EDIT: A late thought: when we start from the beginning and a filter is applied, it probably makes sense to emit a "noop" periodically (once every 30 seconds, maybe?) with the current offset as we go through the history, because most likely we're going to see nothing for a long period of time, and unless offsets are emitted we'll have to start over each time. Cold start is also a problem that might require some thinking. Maybe we could provide the first of the HD addresses so that BCHD can quickly jump to the first offset that uses it? EDIT2: Actually, with a Bloom filter index like the one I described, we might possibly scan the whole blockchain in less than a second, so no need for the noop or the first address.

One thing that might be needed in the request is whether to stop sending after the "done" event, so maybe something like "stop_after_done": false (the default). Or maybe we shouldn't stop at all.

I think it would be super-cool, especially if it is WebSockets+JSON: that would mean ANY developer, without any libraries, could easily get access to the latest and greatest in BCH: SLP tokens and double-spend proofs (which could be as simple as showing two signed transactions; that's enough proof). Merchants would have an EASY way to protect customers on PoS terminals: just listen for transaction_enters_mempool and double_spend_proof events on the address that interests you, wait 5 seconds, and let the customer go!

I totally understand why gRPC is cool, but in some languages (PHP :( ) it's a pain to work with; I even asked the Fountainhead guys to install an HTTP proxy that translates calls to gRPC, and that's what I use for read.cash. WebSocket+JSON support, on the other hand, is universal: every language has it, and it opens the floodgates for developers.

Let's dream: slp_double_spend_proof :) That would be amazing!

Side note: more potential uses:

> {"broadcast": {"tx_hex": "A3DEF47..."}}

> {"broadcast": {"transaction": {"inputs": [], "outputs": [], "signatures": ...}}}

> {"broadcast": {"slp_transaction": {"token_id": "...", "inputs": [], "outputs": [], "signatures": ...}}}

If we also had a "broadcast" command in this protocol (especially if we could send both a hex-encoded transaction AND a struct), users could easily build wallets and applications without having to implement anything except signing. This command would need something like a "request_id", though, so that the server can send back the response (sent + txid) along with that "request_id".

But I think the protocol would be amazing even without broadcasting.

Client-side privacy (solving problem 7 above).

Ok, this won't give the best privacy ever, but we can get some privacy.

The easiest way: we can add some random bits to the Bloom filter. The more random bits we add, the more spurious transactions we will get, which we'll have to filter out client-side.

The hard way: connect to one BCHD node, get some transaction history, find some trees among the transactions (transactions that end up in the same UTXO), and add bits for them; that way you are adding a plausible set of bits that looks like it IS your wallet, not some random txes you added. For the paranoid: connect to ANOTHER BCHD node and use a Bloom filter that includes your addresses plus the spurious bits you got from the trees on the first node.

Additional protection: you can add a few random bits to your last filter each time you start a new connection, thereby cutting the link to your previous connection. And unlike the UTXO approach, you don't need to ask about specific UTXOs, so a compromised server would have to log all requests and do some bitwise distance search to connect your two sessions.


Ok, an important thing. While most of this would be pretty easy to implement (just record the events into the append-only file as they happen), it's important to do two things to avoid failing the customers.

  1. You need to write to the AOF file first, before you broadcast the event to the subscribed clients, because if BCHD dies mid-event, the clients would have information that you don't have; BCHD and the client would then have very different opinions about what happened at the last offset.

  2. When starting BCHD, you need to read your own AOF to recover the latest real state (at least the mempool transactions). The problem is those annoying "disappearing" transactions we keep seeing on read.cash - the "bubbles" from Add fee/standardness checks to gRPC submitTransaction #382 and Mempool out of sync #400. We still see these bubbles daily. If we start using this protocol to get our information, then say there's a transaction A that we think is in the mempool (according to the data you sent us); a restart comes, and the transaction is now NOT in the mempool, but we have no idea :( there was no event about an eviction, or about a block arriving that erased it due to a conflict, etc.

So, what are your thoughts?

  1. Makes sense? Good or stupid?
  2. Can you do it?
  3. Need money to do it?
  4. Would you merge if it was implemented by another Go dev?

Thanks for your time!

CC @2qx

Notes: needs cross-browser locking on client probably: https://github.com/tejacques/crosstab https://github.com/slimjack/IWC https://github.com/nodeca/tabex http://balpha.de/2012/03/javascript-concurrency-and-locking-the-html5-localstorage/ https://github.com/jitbit/TabUtils

@readcash
Author

Just to clarify: wallets like Electron Cash or even Bitcoin.com could benefit from this protocol too, since they would have a more efficient way to watch HD wallet state than the current option of re-requesting UTXOs. Though I might be mistaken about how they work.

@readcash
Author

readcash commented Aug 15, 2020

Also, to be clear: offsets will differ between BCHD nodes, and that is totally expected, since each node has its own order in which events happened to it. So if you switch to another node, you need to start from the beginning.

@readcash
Author

@monsterbitar @fyookball @markblundeberg Sorry guys to disturb you, but as you work on Electron Cash, I think your input on this might be very valuable. Thank you!

@readcash
Author

readcash commented Aug 15, 2020

To clarify one more thing: by "decoded SLP transaction" I mean that it would look like a normal transaction, rather than its internal protocol representation, so something like this:

{
  "txid": "123abc",
  "token_id": "123abcd",
  "token_name": "SPICE",
  "inputs": [
    {"txid": "1234", "vout": 1}
  ],
  "outputs": [
    {"address": "bitcoincash:qq123", "amount": "123.10"}, <-- 123.10 SPICE tokens sent
    {"address": null, "amount": "0.10"} <-- burned
  ]
}

Because, frankly, the SLP protocol is quite hard to parse. This, on the other hand, would allow anyone communicating with the BCH blockchain to understand SLP transactions as easily as BCH transactions.

But, yeah, it means we'll need to send both a transaction_enters_mempool event and an slp_transaction_enters_mempool event for the same txid (one is the low-level movement of BCH sats, the other a high-level decoding of the SLP movements).

@readcash
Author

readcash commented Aug 15, 2020

Also, it might be a good idea to have offsets like abc123,7890, where the first part is a random string/number generated the first time BCHD starts from a clean state. That way, if a node was erased and started over, it won't be able to respond to an abc123,7890 offset, because it now expects something like 456ef,7890. That tells the client to start over from a blank slate, since the data on the server was lost. (7890 could be the byte offset in the append-only file, for example.)
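A sketch of that two-part offset; the comma format and the epoch naming are just this comment's convention:

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// parseOffset splits "abc123,7890" into the node's epoch id and the
// byte offset within that epoch's append-only file.
func parseOffset(s string) (epoch string, off int64, err error) {
	parts := strings.SplitN(s, ",", 2)
	if len(parts) != 2 {
		return "", 0, errors.New("malformed offset")
	}
	off, err = strconv.ParseInt(parts[1], 10, 64)
	return parts[0], off, err
}

// resume decides where a reconnecting client starts: if the epoch
// doesn't match (the node was wiped and restarted), the client's
// offset is meaningless and it must replay from scratch.
func resume(serverEpoch, clientOffset string) (int64, error) {
	epoch, off, err := parseOffset(clientOffset)
	if err != nil {
		return 0, err
	}
	if epoch != serverEpoch {
		return 0, nil // blank slate: replay everything
	}
	return off, nil
}

func main() {
	off, _ := resume("abc123", "abc123,7890")
	fmt.Println(off) // 7890: same epoch, resume where we left off
	off, _ = resume("456ef", "abc123,7890")
	fmt.Println(off) // 0: epoch mismatch, start over
}
```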

@readcash
Author

readcash commented Aug 15, 2020

One optimization that comes to mind, to avoid a full scan of the AOF file when matching Bloom filters, is to create another index file into which we dump a Bloom filter of addresses for, say, every 100,000 events. Then it's enough to scan this index file, checking the client's Bloom filter against each index filter with a bitwise AND to see whether they have any intersection; if not, skip those 100,000 events entirely (basically doing only a scan of the index file). Or 10,000, or 1,000 - I don't know; it depends on the size of the Bloom filter, I guess.
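The skip test described here can be sketched directly. Note it is one-sided: an all-zero AND proves no element is in both filters (a shared element would set the same k bits in each), while a nonzero AND only means the chunk might be relevant:

```go
package main

import "fmt"

// chunksToScan returns the indices of event chunks that MAY contain
// something matching the client's filter. If the bitwise AND of the
// client filter and a chunk's index filter is all zeros, the whole
// chunk is skipped; nonzero means "scan it transaction by transaction".
func chunksToScan(client []uint64, chunkFilters [][]uint64) []int {
	var hits []int
	for i, cf := range chunkFilters {
		for w := range cf {
			if client[w]&cf[w] != 0 {
				hits = append(hits, i)
				break
			}
		}
	}
	return hits
}

func main() {
	client := []uint64{0b1010, 0}
	chunks := [][]uint64{
		{0b0100, 0}, // no shared bits: skip these 100,000 events
		{0b0010, 0}, // shared bit: scan this chunk in full
	}
	fmt.Println(chunksToScan(client, chunks)) // [1]
}
```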

@monsterbitar

I've read through this, and from my perspective any SPV client today would function block-by-block plus some mempool data, and having to "replay" one extra block (the block we're uncertain we have all the info about?) isn't much of a problem. There are some nuggets in here, though, that are on the "we might want to do this" list for electrum.

Overall, I don't see anything wrong or clearly bad with the protocol outlined here; I just think the existing protocols are good enough with a few missing bits and pieces.

I'll leave it to BCHD to take a closer look and determine if this more granular setup is something they'd want to look into.

@readcash
Author

@monsterbitar Thank you for reading it!

Here's the problem I see with the SPV approach. Imagine you are a developer for a website and you need something like HD wallet functionality. With the current protocols you need to request all the SPV blocks (in the browser) - that's a ton of data - parse them, and figure out whether any of 640,000 blocks contain anything relevant to you. Imagine how long that would take.

If you want to be real-time, you need to monitor the mempool in parallel (because you might miss some transactions). If the user navigates between pages while you were doing this, you are out of luck and your real-time view no longer works, because you missed what happened in the mempool while the tab was navigating. (The same happens when read.cash does a deploy: the restart causes us to miss some transactions.)

We get no notifications about mempool evictions (because a block was mined), so we have to manage that manually. No notifications about conflicting transactions either. And even setting all that aside, imagine how long it would take to parse 640,000 blocks and their GCS: it's just not an acceptable approach for building in-browser wallets (or any wallets that want to start fast). Having a server-side index could mean starting in less than a second (especially if we can figure out some database-like index (BTree?) for Bloom filters).

So, yeah, you could tie together some stuff that resembles this protocol, but 1) you will need to juggle a lot of state client-side; 2) it will be much slower; 3) you won't get all the relevant information you need.

Oh, and to be clear: this approach is definitely for building non-SPV wallets, i.e. wallets that assume they can trust the server. (Which is basically what read.cash does... I don't think it makes sense to do real SPV wallets in the browser; they would probably be very slow.)

@cpacia
Contributor

cpacia commented Aug 17, 2020

So if I understand the idea right it seems like this would be very heavy on the server to support many wallets using this protocol.

To some extent bchd already has two different protocols that work something like this but they are very resource intensive.

First is the SPV you mentioned (classic SPV, not Neutrino). To serve an SPV wallet, the wallet gives the node a bloom filter, and the server starts loading blocks from genesis, scanning through every transaction to see if it matches the filter. This is very resource intensive, especially if many wallets connect to the same node and request block filtering.

Fortunately not many wallets use this functionality so nodes don't really get slammed with these type of requests.

The second thing we have in bchd is a websocket protocol for filtering blocks. It's similar to the SPV protocol, but it doesn't use bloom filters, and it returns matched transactions rather than merkleBlocks. But it was really designed with the node operator himself, or at least his wallet(s), as the primary consumer. It wasn't intended as an open protocol where a bunch of wallets connect to the same server, for the same reason: scanning through all transactions this way can be heavyweight.

@readcash
Author

If your concern is that this would be resource intensive, then note that I propose a Bloom filter index.

(I might be mistaken here, correct me if I misunderstand how Bloom filters work) You can take 10,000 transactions and create a Bloom filter for all participating addresses (let's call it BloomFilter0). You can do the same for next 10,000 transactions (let's call it BloomFilter1). Then you can store these filters as an "index".

Now, when a client connects, we take its Bloom filter (call it BloomFilterClient) and compute BloomFilterClient & BloomFilter0 (where & is bitwise AND). If the result is non-zero, the first 10,000 transactions may contain something of interest to this client, and we do a full scan of those 10,000 transactions, applying the client's filter to each one. If the result is zero, we know there's nothing of interest to the client in the first 10,000 and move on. That way you can scan the whole blockchain very efficiently.

And of course, you can then take BloomFilter0 | BloomFilter1 | .... | BloomFilter9 (where by | I mean bitwise OR) and create a BloomFilter for first 100,000 transactions (let's call it BloomFilter0-0), forming kind of like a search tree.

So, now when a client comes we can do only (BloomFilterClient & BloomFilter0-0) to see if anything in first 100,000 match. If there's no match - we skip all 100,000 and move to BloomFilter0-1 (or whatever it is called... for next 100,000). If there is a match - we check each of BloomFilter0, BloomFilter1, .... , BloomFilter9 & BloomFilterClient (again skipping up to 10,000 records at a time).

Of course this would require periodic re-balancing, probably... or not.

That should solve the performance issue (hopefully). But now we have a second problem, specifically the mempool.

You have an interface for filtering blocks, but not the mempool. That means that on each page reload, and before each send, the wallet has only one option: ask for the WHOLE mempool and see what's in there of interest (think 20 MB blocks! gigabyte blocks?). That's very inefficient. And since there's no ordering in the mempool, we can't even tell BCHD "give us everything that happened after this tx we last saw".

BTW, I would be totally OK with an answer of "this seems OK as a protocol, but we don't have the time or the will to do it". In that case I'll go find a Go dev to implement it, because the existing solutions don't seem effective enough for web-based wallets. (And they still lack notifications about mempool evictions due to double-spends or other reasons, so the client must somehow figure out that some transactions didn't get into a block and never will.)

The only question is whether you will accept this into the codebase later or is it doomed to be an unofficial extension?

I do understand that having 3 protocols is worse than having 2 and that every line of code is an additional maintenance burden. But so far I really don't see a way to build a real-time web wallet that is HD and supports 0-conf transactions; using either of the described approaches, you are guaranteed to eventually miss something.

(NB: It just occurred to me that something missing from this protocol is re-orgs: if a re-org happens, there probably need to be some additional events. An approach based on filtering blocks would totally miss a re-org: you already have information about block 1234, and you have no idea it was re-orged. That's why I still think both existing approaches are not robust and fast enough.)

@markblundeberg

Regarding wallets with thousands of addresses and huge histories, @cculianu might have some insight on this problem, he's familiar with the benefits and flaws of the electrum protocol (having reimplemented the whole server).

@2qx
Contributor

2qx commented Aug 17, 2020

@readcash a cfindex for the mempool, seeded with the hash of the previous block, and the ability to query mempool transactions by siphash value may be useful. A subscription should also be possible.

I'm imagining the situation of a tube rider on memo (there are cell transmitters in stations but not in tunnels). It would be typical to drop the streaming connection 12 times in a 40-minute commute. Syncing would involve getting the cfindex for the mempool and querying any matching values.

A subscription service would also greatly reduce bandwidth over listening to all transactions.

It would require the client to recompute the relevant siphash values when a new block is mined. But I think that is necessary for security.

EDIT: For the server, it would require building the golomb coded set on every transaction, (or every request, caching and invalidating the cache on each transaction). Perhaps there is a clever way to incrementally update a GCS.

@readcash
Author

@2qx A tube rider on memo is a bit different, because there you need to listen for a specific format of transactions to any address, whereas an HD wallet needs to listen to a specific set of addresses. I think that's where something like bitdb is better suited to the task.

The problem with cfindex is that you need to get the GCSes for all 640,000 blocks before you start anything. Then (assuming very big blocks) you need to download 20 MB / 100 MB / 1 GB for every block that contains even one relevant transaction. Imagine downloading a 100 MB block to find 1 transaction (and then filtering 100 MB client-side, on a mobile device). If you want the guarantees that Neutrino provides, be prepared to suffer. My approach, however, gives you a very efficient way to "subscribe" to the addresses of an HD wallet, optionally adding some client-side privacy if you wish.

Then there's the protocol itself. Sewing all the current solutions together (plus new functionality, like the ability to query the mempool) is a pretty hard task. The protocol I describe would allow any mediocre programmer like me to build something like an HD wallet, or a PoS terminal with double-spend detection and tokens.

More developers = more adoption.

Hard things like cfindex, filtering whole blocks, and tying multiple protocols together into something cohesive are a sure-fire way to end up with very few active developers. Which is what we have in BCH right now: very few developers can build an HD wallet with the current state of affairs. As proof: we found close to zero HD wallets on the web; the only ones we found are abandoned projects.

It really saddens me that nobody cares about getting an average developer into BCH ecosystem (which is the scope of Mainnet project).

It saddens me deeply.

@cpacia
Contributor

cpacia commented Aug 17, 2020

@2qx we do already have a mempool cf filter in the wire protocol. But it does rebuild it on every request.

@2qx
Contributor

2qx commented Aug 17, 2020

@cpacia This is just seeded with an empty hash right?

bchd/server.go, line 667 (commit 7929e82):

zeroHash := &chainhash.Hash{}

In bchd, since BIP-158 uses a list of differences, it should be possible to walk along the list and replace old values, inserting new values and carrying the remainder to the next value, without rebuilding the GCS. Or walk along the list and insert a list of values the same way for each new transaction.

What about the idea of making new transactions available as a GCS stream for limited bandwidth uses?

@cpacia
Contributor

cpacia commented Aug 19, 2020

How would that work as a stream?

@2qx
Contributor

2qx commented Aug 19, 2020

@cpacia, I was thinking that individual transactions could be compressed one time per BIP-158 and broadcast to all transaction subscribers that created the stream with compressed:true.

Would this allow Neutrino clients to watch addresses and UTXOs, or would they still need full transaction data to watch addresses?
