Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Save and load block data using IPFS #374

Merged
merged 7 commits into from
Jun 23, 2021
Merged

Save and load block data using IPFS #374

merged 7 commits into from
Jun 23, 2021

Conversation

evan-forbes
Copy link
Member

@evan-forbes evan-forbes commented May 28, 2021

Description

This PR builds off of #372 and #356 to save and load block data via IPFS. It should be noted that it does not completely remove the use of block parts or the PartSetHeader. It should also be noted that it is not yet explicit or configurable in when it retrieves data from IPFS per the request of #373.

Closes: #373

@evan-forbes evan-forbes requested review from liamsi and Wondertan May 28, 2021 03:24
@evan-forbes evan-forbes self-assigned this May 28, 2021
@codecov-commenter
Copy link

codecov-commenter commented May 28, 2021

Codecov Report

Merging #374 (21a18b2) into master (a7326d5) will decrease coverage by 0.15%.
The diff coverage is 49.27%.

@@            Coverage Diff             @@
##           master     #374      +/-   ##
==========================================
- Coverage   61.81%   61.66%   -0.16%     
==========================================
  Files         262      263       +1     
  Lines       22978    23003      +25     
==========================================
- Hits        14204    14184      -20     
- Misses       7277     7303      +26     
- Partials     1497     1516      +19     
Impacted Files Coverage Δ
consensus/replay_file.go 0.00% <0.00%> (ø)
rpc/core/blocks.go 27.90% <0.00%> (-2.87%) ⬇️
rpc/core/tx.go 0.00% <0.00%> (ø)
state/services.go 40.00% <ø> (ø)
blockchain/v0/reactor.go 61.77% <33.33%> (-2.19%) ⬇️
consensus/replay.go 70.08% <33.33%> (-1.29%) ⬇️
consensus/state.go 67.48% <60.00%> (-0.28%) ⬇️
store/store.go 65.08% <73.07%> (+0.48%) ⬆️
node/node.go 55.92% <100.00%> (ø)
store/mock.go 100.00% <100.00%> (ø)
... and 18 more

@evan-forbes
Copy link
Member Author

evan-forbes commented Jun 3, 2021

I wrote a quick test to figure out if we could use the offline/online versions of the ipfs API object interchangeably, and it looks like we can, at least with the mock ipfs api object, as it passes. https://gist.github.com/evan-forbes/ad8fd510874f2a8c7d990ab080087da9

Although, after discussing with @liamsi, it seems that making API changes to incorporate offline and online usage is outside the scope of this PR, and will be handled in a separate ADR/issue. There also needs to be more testing and investigation into the exact impact of using the offline vs online version of the API.

Therefore, I'm making this PR ready for review.

@evan-forbes evan-forbes marked this pull request as ready for review June 3, 2021 18:18
@evan-forbes evan-forbes requested a review from tac0turtle as a code owner June 3, 2021 18:18
@evan-forbes evan-forbes removed the request for review from tac0turtle June 3, 2021 18:21
store/store_test.go Outdated Show resolved Hide resolved
// NOTE: The existence of meta should imply the existence of the
// block. So, make sure meta is only saved after blocks are saved.
panic(fmt.Sprintf("Error reading block: %v", err))
return nil, err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to have an error check here to return a meaningful error to the caller in case a block is not found locally

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my understanding: if the dag passed in dag has the offline mode enabled, is that sufficient to ensure that we failed to find the block locally? Or is there some sentinel error or error type to check against?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, resolved here 507f04d

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my understanding: if the dag passed in dag has the offline mode enabled, is that sufficient to ensure that we failed to find the block locally? Or is there some sentinel error or error type to check against?

That's a great point and I think we should have a test that ensures that the correct error is thrown and the timeout is not reached. Added here 51c7f44

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the dag passed in dag has the offline mode enabled, is that sufficient to ensure that we failed to find the block locally?
I guess so, but there could be other errors as well, e.g. related to erasure coding. Also, retrieval code has its own error type, but usually, for DAG-related not found errors I check for ipld.ErrNotFound.

// If all the nodes restart after committing a block,
// we need this to reload the precommits to catch-up nodes to the
// most recent height. Otherwise they'd stall at H-1.
func (bs *BlockStore) SaveBlock(block *types.Block, blockParts *types.PartSet, seenCommit *types.Commit) {
func (bs *BlockStore) SaveBlock(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am kinda concerned about this method as we have PutBlock already being called on consensus and I don't think relying just on Blockstore.SaveBlock instead for networking intensely operation is a good idea. In case we save Block two times(overheads) just to persist Meta, it sounds reasonable for me now to change blockstore into MetaStore or HeaderStore and no more handle block data read/writes at all.

Copy link
Member

@liamsi liamsi Jun 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it sounds reasonable for me now to change blockstore into MetaStore or HeaderStore and no more handle block data read/writes at all.

You are right about this. But we currently already store the block twice. We need to change this indeed.
The question is if in this PR or in a separate one. Swapping the blockstore with a header store is basically reviving and slightly rethinking @marbar3778's PR: #218.

My suggestion was to tackle this in a separate PR but if you feel strongly about this and @evan-forbes agrees, I'm OK with doing this in this PR as well.

Copy link
Member

@liamsi liamsi Jun 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually might require more thoughts. Also the API @renaynay started which will enable retrieving the state-relevant portion might be necessary for this (prioritized https://github.com/lazyledger/lazyledger-core/issues/221 to be included in the next larger milestone).
See: #218 (comment)

Copy link
Member

@liamsi liamsi Jun 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But to not be blocked on that API (https://github.com/lazyledger/lazyledger-core/issues/221), we could make it that wherever the Txs are needed for replay, the full block (instead of the reserved nid Txs) is downloaded via IPFS (if not locally available anyways).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, Ismail also shared your concern over the this, particularly when loading block data. I think it's improved slightly by using the offline version of the dag api 8619f27 . Unfortunately, until we add some caching mechanism I don't think we have another way to ensure that the data in the block being saved is added to IPFS other than PutBlock.

Out of curiosity, is saving data to IPFS that network intensive? I know there's certainly overhead, as opposed to saving to local only DB, but unlike fetching data there's no blocking involved, right?

I really like the idea of breaking up the Blockstore struct to two different entities, one that strictly uses the local store, and another focused on retrieving block data from IPFS. I think that makes a lot of sense for LazyLedger. However, I do feel that is out of the scope of this PR. This PR's main purpose is to get us to the next step of removing BlockID and stop saving block data twice.

Copy link
Member

@Wondertan Wondertan Jun 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok to tackle this in a separate PR.

Out of curiosity, is saving data to IPFS that network-intensive?

It is, as down the road Bitswap provides data to DHT and that's kinda network-intensive. Further, after #375 is merged, PutBlock may even become blocking for providing. The providing itself is also a concern as it definitely should be out of Blockstore and currently there is no way to disable it. Luckily, offline dag thin should solve that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I am fully ok to work over this in another PR, as a proper solution for this is not obvious and would probably require some level of refactoring over multiple components

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let's try to tackle the story behind this in another PR. That PR will require some clearly outlined reasoning and an implementation plan in an issue or maybe even (as part of) a full-blown ADR (specifically around storage, i.e. storage for validator nodes, storage for partial nodes that passively participate in the consensus–those that will replace the current tendermint fullnodes).

I think on master, only block-proposers call PutBlock. The other nodes currently only explicitly store the block data via the tendermint store (#218 tries to change that). I think for this PR it's OK to skip the online/offline story.
BTW, I could imagine that if PutBlock is called twice and there is no update on the data, the network IO would not really be triggered (not sure though, would need to look into the interplay between the ipfs block store and bitswap).

store/store.go Outdated
}

// NewBlockStore returns a new BlockStore with the given DB,
// initialized to the last height that was committed to the DB.
func NewBlockStore(db dbm.DB, dagAPI ipld.DAGService) *BlockStore {
func NewBlockStore(db dbm.DB, dagAPI format.DAGService) *BlockStore {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to ensure we only passing here Offline Dag

Copy link
Member Author

@evan-forbes evan-forbes Jun 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, changed the API provider to have an online or offline option, and then started only using the offline version for blockstore 8619f27

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

although, I couldn't find a way to quickly check inside the function that the DAGService is offline

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually is not trivial and requires passing IPFS's blockstore and just DAG 😞

This is how the func for offline DAG would look like:

func OfflineDAG(bstore blockstore.Blockstore) format.DAGService {
	return dag.NewDAGService(bsrv.New(bstore, offline.Exchange(bstore)))
}

ipfs/provider.go Outdated
Comment on lines 11 to 47
type APIProvider func(online bool) (coreiface.APIDagService, io.Closer, error)

type onlineOfflineProvider struct {
online, offline coreiface.APIDagService
closer io.Closer
}

func NewOnlineOfflineProvider(online coreiface.CoreAPI, closer io.Closer) (APIProvider, error) {

offline, err := online.WithOptions(OfflineOption)
if err != nil {
return nil, err
}

provider := onlineOfflineProvider{
online: online.Dag(),
offline: offline.Dag(),
}

return provider.Provide, nil
}

func (o onlineOfflineProvider) Provide(online bool) (coreiface.APIDagService, io.Closer, error) {
switch online {
case true:
return o.online, o.closer, nil
case false:
return o.offline, o.closer, nil
}
// the compiler does't know this code is unreachable
return nil, nil, nil
}

func OfflineOption(settings *options.ApiSettings) error {
settings.Offline = true
return nil
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this allows us to still be able to use a dag only mock, and have an offline APIDagService for the blockstore

Copy link
Member

@Wondertan Wondertan Jun 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry that I am responding to this quite late, but I am against this way of having offline DAG. Its usage is confusing, it feels like we have two IPFSs offline and online instead of one and provide two of them independently in the node constructor.

Instead, I would suggest going this way:

func OfflineDAG(bstore blockstore.Blockstore) format.DAGService {
	return dag.NewDAGService(bsrv.New(bstore, offline.Exchange(bstore)))
}

or straightly the constructor

func NewBlockStore(db dbm.DB, bstore blockstore.Blockstore) *BlockStore {
	bs := LoadBlockStoreState(db)
	return &BlockStore{
		base:   bs.Base,
		height: bs.Height,
		db:     db,
		dag:    dag.NewDAGService(bsrv.New(bstore, offline.Exchange(bstore))),
	}
}

It just ensures 'offliness' of our Blockstore by receiving or wrapping offline blockstore from IPFS. Blockstores wrapping each other also makes more logical sense to me than passing dags or even full APIs. And, importantly this continues our road of decomposing IPFS into pieces, specifically, here we would already rely on IPFS's blockstore instead of very high-level API. Besides this, we would also simply some testability with easily mockable Blocksktore and this is just less code solution. The blockstore could be taken from the IpfsNode object. Hope that also makes sense for you.

Sorry again for the late response, I should've suggested this earlier.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all bueno, it's better that we get it right the first time! 🙂 I like this better too, it's less hacky and ensures that all store.Blockstores are offline. The thing that I think can be improved is readability. I can see this being incredibly confusing to someone unfamiliar with what we're doing. The word blockstore is used way too much, and the purpose isn't directly obvious. It will get even more confusing when we have to modify the APIProvider to also return a blockstore.Blockstore, which we pass into the store.Blockstore to make a dag, which is also returned by the APIProvider, but we only use that dag api for consensus... We know it's because we want to enforce the online and offline usage of the DAG API, but I think we could make that more readable by doing something like this

// Maybe name this something else, like Gateway or Connector
// APIProvider allows customizable IPFS core APIs.
type APIProvider func() (DAGProvider, io.Closer, error)

// DAGProvider returns an online or offline DAG API
type DAGProvider func(online bool) coreiface.APIDagService

and

func NewBlockStore(db dbm.DB, dagProvider ipfs.DAGProvider) *BlockStore {
	bs := LoadBlockStoreState(db)
	return &BlockStore{
		base:   bs.Base,
		height: bs.Height,
		db:     db,
		dag:    dagProvider(false),
	}
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should for sure use the mechanism you mentioned

func OfflineDAG(bstore blockstore.Blockstore) format.DAGService {
	return dag.NewDAGService(bsrv.New(bstore, offline.Exchange(bstore)))
}

in the DAGProvider instead of the one used in onlineOfflineProvider

Copy link
Member Author

@evan-forbes evan-forbes Jun 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could also do something like this too

// maybe name this something different
// APIProvider allows customizable IPFS core APIs.
type APIProvider func() (exchange.Interface, blockstore.Blockstore, io.Closer, error)

and just construct the DAGService wherever needed

Copy link
Member

@Wondertan Wondertan Jun 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue with this approach:

type APIProvider func() (exchange.Interface, blockstore.Blockstore, io.Closer, error)

Is that I am also started to use another component from IPFs in another PR, so passing full Node instead of io.Closer above is inevitable. Furthermore, I expect other components to be used by us as well.

Copy link
Member

@Wondertan Wondertan Jun 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take a look at the new PuBlock semantics here. It now accepts ContentRouting from IPFS/libp2p and I need to pass it down from IpfsNode.

Copy link
Member

@Wondertan Wondertan Jun 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So overall, to keep readability, we can also rename APIProvider to NodeProvider

Copy link
Member Author

@evan-forbes evan-forbes Jun 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think #400 can solve our problems

@liamsi liamsi requested a review from Wondertan June 4, 2021 13:09
ipfs/provider.go Outdated
Comment on lines 11 to 47
type APIProvider func(online bool) (coreiface.APIDagService, io.Closer, error)

type onlineOfflineProvider struct {
online, offline coreiface.APIDagService
closer io.Closer
}

func NewOnlineOfflineProvider(online coreiface.CoreAPI, closer io.Closer) (APIProvider, error) {

offline, err := online.WithOptions(OfflineOption)
if err != nil {
return nil, err
}

provider := onlineOfflineProvider{
online: online.Dag(),
offline: offline.Dag(),
}

return provider.Provide, nil
}

func (o onlineOfflineProvider) Provide(online bool) (coreiface.APIDagService, io.Closer, error) {
switch online {
case true:
return o.online, o.closer, nil
case false:
return o.offline, o.closer, nil
}
// the compiler does't know this code is unreachable
return nil, nil, nil
}

func OfflineOption(settings *options.ApiSettings) error {
settings.Offline = true
return nil
}
Copy link
Member

@Wondertan Wondertan Jun 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry that I am responding to this quite late, but I am against this way of having offline DAG. Its usage is confusing, it feels like we have two IPFSs offline and online instead of one and provide two of them independently in the node constructor.

Instead, I would suggest going this way:

func OfflineDAG(bstore blockstore.Blockstore) format.DAGService {
	return dag.NewDAGService(bsrv.New(bstore, offline.Exchange(bstore)))
}

or straightly the constructor

func NewBlockStore(db dbm.DB, bstore blockstore.Blockstore) *BlockStore {
	bs := LoadBlockStoreState(db)
	return &BlockStore{
		base:   bs.Base,
		height: bs.Height,
		db:     db,
		dag:    dag.NewDAGService(bsrv.New(bstore, offline.Exchange(bstore))),
	}
}

It just ensures 'offliness' of our Blockstore by receiving or wrapping offline blockstore from IPFS. Blockstores wrapping each other also makes more logical sense to me than passing dags or even full APIs. And, importantly this continues our road of decomposing IPFS into pieces, specifically, here we would already rely on IPFS's blockstore instead of very high-level API. Besides this, we would also simply some testability with easily mockable Blocksktore and this is just less code solution. The blockstore could be taken from the IpfsNode object. Hope that also makes sense for you.

Sorry again for the late response, I should've suggested this earlier.

ipfs/provider.go Outdated Show resolved Hide resolved
@tac0turtle
Copy link
Collaborator

any chance this could be merged instead of cherry-picking it into my PR? Would like to avoid scope creep in my PR, there are already quite a few changes

@liamsi
Copy link
Member

liamsi commented Jun 15, 2021

any chance this could be merged instead of cherry-picking it into my PR?

I think it can. The current justified concern is that calling PutBlock in store.SaveBlock could be potentially very network IO intensive. Because PutBlock currently also "provides" to the DHT (think of it as announcing to a bunch of peers to update their distributed lookup table).
A clean solution to this would mean to decouple the local storing and the providing or any network-IO related parts. Storing locally should be blocking while the "providing" should happen async. This is non as easy as it sounds because of the way we are using ipfs currently.

Potential workarounds to unblock this faster:

  1. Always use an "offline dag" when calling PutBlock inside of the store package. This PR already does something similar. And although not the cleanest solution (basically the object keeps around two versions of the dag), I'd be OK to merge it as is and open a followup issue to decide how to refactor this afterwards.
  2. Just accept the resource intensive network IO part. Then it is important that even proposers do not call PutBlock twice for the same data. This is achievable by looking at the latest height in block meta for instance (if the block with that height was already stored we don't call PutBlock).

IMO, 1. is clearly preferable. @Wondertan Can you re-visit your PR from the perspective that even this change would be temporary. If there are not further consequence than it being (temporarily) confusing, we should just capture this in an issue and move on with this and Marko's PR.
@evan-forbes can you update your PR to the lastest master?

@tac0turtle
Copy link
Collaborator

How resource intensive are we talking? If its a couple kb or even mb per block that gets garbage collected after a few seconds, that should be fine IMO.

@liamsi
Copy link
Member

liamsi commented Jun 15, 2021

The networking IO can take up to several minutes (depending on block size on to how many ipfs peers you are connected to). Especially after @Wondertan's change that makes providing more explicit: #375 (previously it was implicitly handled by IPFS: async and also throttled and we couldn't be sure when this would be done). Providing and ideally any IPFS networking IO needs to be decoupled from storing.

@evan-forbes
Copy link
Member Author

I'm updating this PR to the latest master and quickly making the changes @Wondertan suggested. If everything goes smoothly, it should be ready to go within the hour

@liamsi liamsi requested a review from Wondertan June 15, 2021 14:56
@evan-forbes
Copy link
Member Author

The e2e test is failing locally for me, even when temporarily reverting the most recent change in #413, and I'm still trying to debug why

@Wondertan
Copy link
Member

Wondertan commented Jun 15, 2021

Can you re-visit your PR from the perspective that even this change would be temporary.

@liamsi, it is not clear to me now what long-term approach we should take here. It is clear now that IPFS brought to us architectural change that combines downloading, saving, and sharing into one operation. Previously, all of them were separated and independent. In general, we should answer if we want to fully adopt that approach or not. If not - then workarounds like with OfflineDAG are inevitable, if yes - additional architectural/designing investigation should be made to adopt properly.

Personally, I am in favor of 'yes', as I am already going this way in DA interface, but still how to make that work for Blockstore is not clear. Ideally, if 'yes', we should eliminate tm's Blockstore and make it Headerstore instead, but the path to that is not clear for me now.

@liamsi
Copy link
Member

liamsi commented Jun 15, 2021

Ideally, if 'yes', we should eliminate tm's Blockstore and make it Headerstore instead, but the path to that is not clear for me now.

That is what @marbar3778' PR (#218) that is kind blocked a bit on this change is trying to achieve.

Regarding the long-term–which again, is a bit orthogonal to this PR–it depends a bit on how we decide about the tasks and relations of node types (celestiaorg/celestia-specs#178). For example: the fact that we even do call PutBlock on block proposing stems from the (now invalidated) idea that validators will do DAS (see: celestiaorg/celestia-specs#3 (comment)). Tendermint does not even store the block at that point.

In fact now we could instead fall back to calling PutBlock just everywhere where we (or rather the existing tendermint code base) used to call store.SaveBlock. Note, that providing can happen completely async again! It is not consensus critical anymore that DAS can happen as quickly as possible: block production won't depend on DAS anymore. That said your manual providing is still important: we need to have some control and some clarity over when we write to the DHT (and when it is done).

I don't have a full answer to how the exact long-term solution should look, at least not today ;) We must first decide on which node type cares about the consensus messages and which care about after the block is finalized. Until we decide on this it is probably safe to assume only validators and fullnodes (actual full nodes that download all data) care about consensus messages. As the alternative is just an optimization that (we should pursue IMO) that allows DAS (partial) nodes to start sampling after they see a proposal (instead of waiting until the commit).

It is clear now that IPFS brought to us architectural change that combines downloading, saving, and sharing into one operation.

If we stick to this we still need more control over resources here. E.g. as said above, storing should be blocking but other aspects like providing to the content routing can surely be async. For triggering rebroadcasting it would also be good to give node operators some control over general resource consumption too. While I like the abstractions of IPFS I don't like the fact that the networking IO parts are extremely implicit (only after your change providing became more explicit).

So the question for you is: are you OK with merging this PR as it is?

Copy link
Member

@Wondertan Wondertan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes!

@evan-forbes
Copy link
Member Author

Thanks to a lot of help from @liamsi , we managed to make a lot of progress debugging the e2e test. Unfortunately, there are still some issues we have to work out/decide on before moving on. In short, the tests pass if we stop manually providing, but continue to fail if we keep manually providing and revert saving and loading block data via IPFS. That's really weird, given that both master and this branch passed the e2e tests before merging. We should expect reverting the changes to store.LoadBlock and store.SaveBlock would cause the e2e tests to pass again. This could mean that the issue is a bit more complex and lies in one of the other changes introduced in this PR.

The second issue we at least found a solution to, although the underlying cause is still not yet clear. The solution was to not attempt to load the block during rpc requests. Instead, we loaded the block only if the blockmeta could be found first. While this is hacky, it will be fixed in #218 when only the header is searched for returned via rpc.

More importantly, I still don't understand why this happens. The blockstore is using an offline DAG, and, at least during other tests, the offline dag is working as intended in that it will not hang if the data is not stored locally, it returns an error relatively quickly.

As far as options go, I'm planning on continuing to debug tmrw. I might try to bootstrap each of the validator's embedded IPFS nodes. I am also going to try using the old hacky solution of offline dags, to hopefully find precisely what is stopping block data from being transferred over IPFS.

@Wondertan
Copy link
Member

@evan-forbes, regarding e2e tests stalling on unknown height. I had a similar issue in #375 and it was reproducible in 100% until I reworked context handling, as I mentioned here. Not sure if this helpful anyhow right now, but I guess it might be related.

@Wondertan
Copy link
Member

Wondertan commented Jun 16, 2021

panic: failure to retrieve block data from local ipfs store: failure to retrieve data square: merkledag: not found

I guess that this is because PutBlock is async and the same for another test.

I think PutBlock should still be blocking as before and instead of passing nil for content routing, we need to hijack ipfs.MockRouting() into there

@evan-forbes
Copy link
Member Author

evan-forbes commented Jun 17, 2021

So, I thought I fixed it, but that was a fluke. The change does seem to make the e2e tests fail later, but they still fail 98% of the time locally. I still don't know which change introduced in this PR is causing the test to stall.

If a dedicated DHT bootstrap node is used as the only bootstrap peer, or all bootstrap peers are removed, then the e2e test passes. Both might be similar to just turning off providing, which also makes the e2e test to pass.

typical logs
 I[2021-06-17|03:32:08.271] Starting SignerServer service                impl=SignerServer
I[2021-06-17|03:32:08.272] Remote signer connecting to tcp://0.0.0.0:27559 
D[2021-06-17|03:32:08.272] SignerDialer: Reconnection failed            retries=1 max=100 err="dial tcp 0.0.0.0:27559: connect: connection refused"
D[2021-06-17|03:32:09.272] SignerDialer: Reconnection failed            retries=2 max=100 err="dial tcp 0.0.0.0:27559: connect: connection refused"
D[2021-06-17|03:32:10.274] SignerDialer: Connection Ready               
generating ED25519 keypair...done
peer identity: 12D3KooWKPWXMJsnDUjS5wt3ugoqMmcDGKzeRAJbAVUf9TNJhJLE
I[2021-06-17|03:32:10.295] Successfully initialized IPFS repository     module=main ipfs-path=ipfs
I[2021-06-17|03:32:11.501] Successfully created embedded IPFS node      module=main ipfs-repo=ipfs
I[2021-06-17|03:32:11.502] Version info                                 module=main software= block=11 p2p=8
I[2021-06-17|03:32:11.512] Starting Node service                        module=main impl=Node
I[2021-06-17|03:32:11.512] Starting StateSyncShim service               module=statesync impl=StateSyncShim
I[2021-06-17|03:32:11.513] Starting StateSync service                   module=statesync impl=StateSync
E[2021-06-17|03:32:11.617] Stopping peer for error                      module=p2p peer="Peer{MConn{10.186.73.5:26656} 8773a83e2f9fa4bc7a3e94303eb4df33af294288 out}" err=EOF
E[2021-06-17|03:32:11.667] dialing failed (attempts: 1): auth failure: secret conn failed: read tcp 10.186.73.10:54656->10.186.73.9:26656: read: connection reset by peer module=pex addr=63d9d8350f13171350e95bd0019df33c5f26bcb2@10.186.73.9:26656
E[2021-06-17|03:32:11.821] Stopping peer for error                      module=p2p peer="Peer{MConn{10.186.73.4:26656} 0a9c99096b50a6d72a24d5aa5286d3f7022b3555 out}" err=EOF
I[2021-06-17|03:32:12.004] Started Providing to DHT                     module=main height=1000
I[2021-06-17|03:32:12.005] Finished providing to DHT                    module=main height=1000 took=228.811µs
I[2021-06-17|03:32:12.013] Executed block                               module=state height=1000 validTxs=14 invalidTxs=0
I[2021-06-17|03:32:12.015] Committed state                              module=state height=1000 txs=14 appHash=26D127B89D66D8FFDE754948E2431893CFD8455562DB5B60852F973BD1A99F75
I[2021-06-17|03:32:12.059] Started Providing to DHT                     module=main height=1001
I[2021-06-17|03:32:12.060] Finished providing to DHT                    module=main height=1001 took=282.262µs
I[2021-06-17|03:32:12.062] Executed block                               module=state height=1001 validTxs=7 invalidTxs=0
I[2021-06-17|03:32:12.063] Committed state                              module=state height=1001 txs=7 appHash=A17E1DFBBE5A9416C0524035061130632C353088EBBD78FE6E440E9F61C954D6
I[2021-06-17|03:32:12.103] Started Providing to DHT                     module=main height=1002
I[2021-06-17|03:32:12.103] Finished providing to DHT                    module=main height=1002 took=57.569µs
I[2021-06-17|03:32:12.105] Executed block                               module=state height=1002 validTxs=7 invalidTxs=0
I[2021-06-17|03:32:12.106] Committed state                              module=state height=1002 txs=7 appHash=C419CCD5CA79A8DF9261FC897D5634DD061D5BF0BC69E6CF3DA85CCB1583F1BF
I[2021-06-17|03:32:12.150] Started Providing to DHT                     module=main height=1003
I[2021-06-17|03:32:12.150] Finished providing to DHT                    module=main height=1003 took=54.503µs
I[2021-06-17|03:32:12.152] Executed block                               module=state height=1003 validTxs=7 invalidTxs=0
I[2021-06-17|03:32:12.155] Committed state                              module=state height=1003 txs=7 appHash=9144C585319102FCE2F195219715BA882B3C3F20E19A2830D5DCD1F8B16E3A90
I[2021-06-17|03:32:12.200] Started Providing to DHT                     module=main height=1004
I[2021-06-17|03:32:12.200] Finished providing to DHT                    module=main height=1004 took=67.327µs
I[2021-06-17|03:32:12.202] Executed block                               module=state height=1004 validTxs=7 invalidTxs=0
I[2021-06-17|03:32:12.204] Committed state                              module=state height=1004 txs=7 appHash=2EC31B522E9045236CCD863A72B20E515A54650327A44543AE4084017FC0C57F
I[2021-06-17|03:32:12.248] Started Providing to DHT                     module=main height=1005
I[2021-06-17|03:32:12.249] Finished providing to DHT                    module=main height=1005 took=382.4µs
I[2021-06-17|03:32:12.250] Executed block                               module=state height=1005 validTxs=7 invalidTxs=0
I[2021-06-17|03:32:12.252] Committed state                              module=state height=1005 txs=7 appHash=660CADA99FC1626E638808278D66E0FE8DFEDB1CC2FBBADEBA86DE9A0A865691
I[2021-06-17|03:32:12.293] Started Providing to DHT                     module=main height=1006
I[2021-06-17|03:32:12.294] Finished providing to DHT                    module=main height=1006 took=139.924µs
I[2021-06-17|03:32:12.296] Executed block                               module=state height=1006 validTxs=7 invalidTxs=0
I[2021-06-17|03:32:12.298] Committed state                              module=state height=1006 txs=7 appHash=F644500ABACDAF7B40D45863096710F8D6B59E823822D1A2C2E5B370803C5BF3
I[2021-06-17|03:32:12.358] Started Providing to DHT                     module=main height=1007
I[2021-06-17|03:32:12.358] Finished providing to DHT                    module=main height=1007 took=44.564µs
I[2021-06-17|03:32:12.360] Executed block                               module=state height=1007 validTxs=7 invalidTxs=0
I[2021-06-17|03:32:12.361] Committed state                              module=state height=1007 txs=7 appHash=FF1D60C246EAC3FBCAB83DF71274758185AC8F634E432DDF74B84FC0058115AC
I[2021-06-17|03:32:12.667] Started Providing to DHT                     module=main height=1008
I[2021-06-17|03:32:12.667] Finished providing to DHT                    module=main height=1008 took=251.234µs
I[2021-06-17|03:32:12.670] Executed block                               module=state height=1008 validTxs=7 invalidTxs=0
I[2021-06-17|03:32:12.673] Committed state                              module=state height=1008 txs=7 appHash=D11E4CEDD2638B58CD29F75372E1DFEBDEA29035C46903AAE68E9D9A6D69853F
E[2021-06-17|03:32:20.883] Stopping peer for error                      module=p2p peer="Peer{MConn{10.186.73.8:26656} 927c50a5e508c747830ce3ba64a3f70fdda58ef2 out}" err=EOF
E[2021-06-17|03:32:25.514] Stopping peer for error                      module=p2p peer="Peer{MConn{10.186.73.7:26656} 4b1068420ef739db63377250553562b9a978708a out}" err=EOF
E[2021-06-17|03:32:30.254] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:32:30.265] Stopping peer for error                      module=p2p peer="Peer{MConn{10.186.73.6:26656} 2551a13ed720101b271a5df4816d1e4b3d3bd133 out}" err=EOF
E[2021-06-17|03:32:30.368] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:32:30.422] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:32:30.896] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:32:35.528] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:32:40.235] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:32:40.300] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:32:40.321] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:32:40.401] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:32:40.423] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:32:40.930] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:32:42.517] dialing failed (attempts: 1): dial tcp 10.186.73.6:26656: i/o timeout module=pex addr=2551a13ed720101b271a5df4816d1e4b3d3bd133@10.186.73.6:26656
E[2021-06-17|03:32:42.517] dialing failed (attempts: 2): dial tcp 10.186.73.9:26656: i/o timeout module=pex addr=63d9d8350f13171350e95bd0019df33c5f26bcb2@10.186.73.9:26656
E[2021-06-17|03:32:42.517] dialing failed (attempts: 1): dial tcp 10.186.73.7:26656: i/o timeout module=pex addr=4b1068420ef739db63377250553562b9a978708a@10.186.73.7:26656
E[2021-06-17|03:32:42.517] dialing failed (attempts: 1): dial tcp 10.186.73.8:26656: i/o timeout module=pex addr=927c50a5e508c747830ce3ba64a3f70fdda58ef2@10.186.73.8:26656
E[2021-06-17|03:32:45.752] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:32:50.269] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:32:50.312] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:32:50.397] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:32:50.461] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:32:50.471] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:32:50.985] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:32:55.796] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:33:00.271] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:33:00.430] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:33:00.462] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:33:00.494] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:33:00.589] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:33:01.019] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:33:05.840] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:33:10.273] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:33:10.719] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:33:10.729] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:33:10.771] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:33:10.813] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:33:11.148] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:33:12.516] dialing failed (attempts: 2): dial tcp 10.186.73.6:26656: i/o timeout module=pex addr=2551a13ed720101b271a5df4816d1e4b3d3bd133@10.186.73.6:26656
E[2021-06-17|03:33:12.517] dialing failed (attempts: 3): dial tcp 10.186.73.9:26656: i/o timeout module=pex addr=63d9d8350f13171350e95bd0019df33c5f26bcb2@10.186.73.9:26656
E[2021-06-17|03:33:12.517] dialing failed (attempts: 2): dial tcp 10.186.73.8:26656: i/o timeout module=pex addr=927c50a5e508c747830ce3ba64a3f70fdda58ef2@10.186.73.8:26656
E[2021-06-17|03:33:12.517] dialing failed (attempts: 2): dial tcp 10.186.73.7:26656: i/o timeout module=pex addr=4b1068420ef739db63377250553562b9a978708a@10.186.73.7:26656
E[2021-06-17|03:33:15.874] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:33:20.338] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"
E[2021-06-17|03:33:42.517] dialing failed (attempts: 4): dial tcp 10.186.73.9:26656: i/o timeout module=pex addr=63d9d8350f13171350e95bd0019df33c5f26bcb2@10.186.73.9:26656
E[2021-06-17|03:33:42.517] dialing failed (attempts: 3): dial tcp 10.186.73.8:26656: i/o timeout module=pex addr=927c50a5e508c747830ce3ba64a3f70fdda58ef2@10.186.73.8:26656
E[2021-06-17|03:33:42.517] dialing failed (attempts: 3): dial tcp 10.186.73.7:26656: i/o timeout module=pex addr=4b1068420ef739db63377250553562b9a978708a@10.186.73.7:26656
E[2021-06-17|03:33:42.517] dialing failed (attempts: 3): dial tcp 10.186.73.6:26656: i/o timeout module=pex addr=2551a13ed720101b271a5df4816d1e4b3d3bd133@10.186.73.6:26656
E[2021-06-17|03:34:12.517] dialing failed (attempts: 4): dial tcp 10.186.73.6:26656: i/o timeout module=pex addr=2551a13ed720101b271a5df4816d1e4b3d3bd133@10.186.73.6:26656
E[2021-06-17|03:34:12.517] dialing failed (attempts: 4): dial tcp 10.186.73.8:26656: i/o timeout module=pex addr=927c50a5e508c747830ce3ba64a3f70fdda58ef2@10.186.73.8:26656
E[2021-06-17|03:34:12.517] dialing failed (attempts: 5): dial tcp 10.186.73.9:26656: i/o timeout module=pex addr=63d9d8350f13171350e95bd0019df33c5f26bcb2@10.186.73.9:26656
E[2021-06-17|03:34:12.517] dialing failed (attempts: 4): dial tcp 10.186.73.7:26656: i/o timeout module=pex addr=4b1068420ef739db63377250553562b9a978708a@10.186.73.7:26656
E[2021-06-17|03:34:42.517] dialing failed (attempts: 5): dial tcp 10.186.73.7:26656: i/o timeout module=pex addr=4b1068420ef739db63377250553562b9a978708a@10.186.73.7:26656
E[2021-06-17|03:34:42.517] dialing failed (attempts: 5): dial tcp 10.186.73.6:26656: i/o timeout module=pex addr=2551a13ed720101b271a5df4816d1e4b3d3bd133@10.186.73.6:26656
E[2021-06-17|03:34:42.517] dialing failed (attempts: 5): dial tcp 10.186.73.8:26656: i/o timeout module=pex addr=927c50a5e508c747830ce3ba64a3f70fdda58ef2@10.186.73.8:26656
E[2021-06-17|03:35:12.517] dialing failed (attempts: 6): dial tcp 10.186.73.9:26656: i/o timeout module=pex addr=63d9d8350f13171350e95bd0019df33c5f26bcb2@10.186.73.9:26656
E[2021-06-17|03:35:42.517] dialing failed (attempts: 6): dial tcp 10.186.73.6:26656: i/o timeout module=pex addr=2551a13ed720101b271a5df4816d1e4b3d3bd133@10.186.73.6:26656
E[2021-06-17|03:35:42.517] dialing failed (attempts: 6): dial tcp 10.186.73.7:26656: i/o timeout module=pex addr=4b1068420ef739db63377250553562b9a978708a@10.186.73.7:26656
E[2021-06-17|03:35:42.517] dialing failed (attempts: 6): dial tcp 10.186.73.8:26656: i/o timeout module=pex addr=927c50a5e508c747830ce3ba64a3f70fdda58ef2@10.186.73.8:26656
E[2021-06-17|03:36:42.517] dialing failed (attempts: 7): dial tcp 10.186.73.9:26656: i/o timeout module=pex addr=63d9d8350f13171350e95bd0019df33c5f26bcb2@10.186.73.9:26656

Comment on lines 58 to 74
rowRoots := eds.RowRoots()
colRoots := eds.ColumnRoots()

// commit the batch to ipfs
err = batchAdder.Commit()
if err != nil {
return err
}

prov := newProvider(ctx, croute, int32(squareSize*4), logger.With("height", block.Height))
for _, root := range rowRoots {
prov.Provide(plugin.MustCidFromNamespacedSha256(root))
}
for _, root := range colRoots {
prov.Provide(plugin.MustCidFromNamespacedSha256(root))
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Providing does not require nodes to be committed and this could potentially increase PutBlock execution time for multicore systems, as before starting providing we will wait while all the data is written. We should start providing as early as possible

@evan-forbes
Copy link
Member Author

evan-forbes commented Jun 20, 2021

sorry for the force push, I didn't find the bug until restarting from scratch, and the old branch was a little messy from debug attempts.

besides some minor cleaning up, I don't think I changed anything else besides the bug fix. Still, I'd feel better if this was given a quick glance over by someone else before merging, thanks!

@Wondertan
Copy link
Member

I'll take a look surely. Also, what's the problem with force pushing to sorry for it? IMO clean history is important

@Wondertan
Copy link
Member

besides the bug fix

But can you explain what were the bugs specifically? For TestTxSearch and e2e.

@evan-forbes
Copy link
Member Author

evan-forbes commented Jun 20, 2021

Also, what's the problem with force pushing to sorry for it? IMO clean history is important

You're right, clean history is important! In other scenarios, like strictly cleaning up the history before merging or asking for a review, I don't think it's that big of a deal. In this case, I restarted from scratch, so while the important changes are the same, the only way for a reviewer to know for sure is to basically re-review each line of code. If I would have made the changes via normal commits, then reviewers could view the precise changes and have confidence in the them without having to re-review each line of code.

typically, I squash merge, as that ensures that every commit added to master passes tests/is in a working state. I didn't mean to bring up very bike sheddy issue though. I'm just apologizing for basically asking for a re-review after it was already approved. 🤷

But can you explain what were the bugs specifically? For TestTxSearch and e2e.

for TestTxSearch, I had commented out adding proofs to rpc Tx responses because I was debugging. That is no longer the case, it passes without any changes.

for the e2e test, I'm still not 100% sure, as the logs returned are not deterministic and I can't manage to recreate it in a test. Sometimes, after waiting long enough to time out, we get logs that say

panic: cannot create context from nil parent [recovered]

other times, we get opaque peer disconnection logs that only say EOF. It has to do with canceling or resetting the context passed to SaveBlock. This is fixed by passing a new context to SaveBlock.

With the current design, every SaveBlock and LoadBlock is strictly offline, and we probably don't even need to pass contexts or return errors. As discussed in sync discussions, it has yet to be decided and documented as to precisely when we will be saving data to IPFS, but what this PR does show, is that it is possible to save block data to the local IPFS store.

Copy link
Member

@Wondertan Wondertan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Only left some questions

consensus/replay.go Outdated Show resolved Hide resolved
blockchain/v0/reactor.go Outdated Show resolved Hide resolved
consensus/state.go Show resolved Hide resolved
@evan-forbes
Copy link
Member Author

evan-forbes commented Jun 23, 2021

There are a lot of upcoming changes that diverge from the original needs which spawned this PR a month ago, and those changes are not yet fully decided upon. That being said, we will still be able to use the changes in this PR in at least a few if not all of the planned node types. The changes here are made significantly more useful after we incorporate the decoupling of Providing introduced in #427. For those reasons we're merging this now, but I expect that we will need further changes to both the mechanics and the API.

@evan-forbes evan-forbes merged commit 8da1644 into master Jun 23, 2021
@evan-forbes evan-forbes deleted the evan/save-dah branch June 23, 2021 17:42
evan-forbes added a commit that referenced this pull request Aug 25, 2021
evan-forbes added a commit that referenced this pull request Aug 25, 2021
* Revert "Remove the PartSetHeader from VoteSetMaj23, VoteSetBits, and state.State (#479)"

This reverts commit 3ba47c2.

* Revert "Stop Signing over the `PartSetHeader` for Votes and remove it from the `Header` (#457)"

This reverts commit 818be04.

* Revert "Decouple PartSetHeader from BlockID (#441)"

This reverts commit 9d4265d.

* Revert "Save and load block data using IPFS (#374)"

This reverts commit 8da1644.

* fix merge error

* Revert "Add the ipfs dag api object in Blockstore (#356)" and get rid of embedded ipfs objects

This reverts commit 40acb17.

* remove ipfs node provider from new node

* stop init ipfs repos

* remove IPFS node config

* clean up remaining ipfs stuff
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Save and Load Block Data via IPFS
5 participants