Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jsonrpc autoconnect, indexes client recovery, docker volumes, bugfixes #68

Merged
merged 5 commits into from
Jan 29, 2020

Conversation

jsign
Copy link
Contributor

@jsign jsign commented Jan 24, 2020

While running our reputation index, which is a long-running thing, the lotus client was kind of fatal when the websocket connection was lost. This was a known thing for a while in the jsonrpc from lotus, so I jumped to help us both in that (filecoin-project/lotus#1155).
The current solution that is in this PR, is more in line with a previous PR than the mentioned before... so it may change again with my new solution, but the current one is enough for our interests.

While installing and leaving running the service in the cloud instance, I discovered a set of issues or needed improvements to keep it more: reliable on lotus or self crashes, tested the persistence of indexed data and recovery after restarting, and some extra tunes to dockerized things that run on the cloud (and I also use locally to test things, since helps fast iteration with the many wires to connect).

I'll make some comments, but that's pretty much the summary of this.

Closes #41
Closes #67

@jsign jsign added this to the Sprint 28 milestone Jan 24, 2020
@jsign jsign self-assigned this Jan 24, 2020
@jsign jsign changed the title [WIP] jsonrpc autoconnect, indexes client recovery, docker volumes, bufixes [WIP] jsonrpc autoconnect, indexes client recovery, docker volumes, bugfixes Jan 24, 2020
@jsign jsign changed the title [WIP] jsonrpc autoconnect, indexes client recovery, docker volumes, bugfixes jsonrpc autoconnect, indexes client recovery, docker volumes, bugfixes Jan 25, 2020
@@ -78,3 +81,9 @@ func instrumentationSetup() {
}
}()
}

func pprofSetup() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enabled pprof. I wanted to inspect some things about a memprofile of the service.
All good. Left it here anyways, is binded to localhost so not public.

Comment on lines +18 to +19
volumes:
- textile-fc-data:/root/.texfc
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a change in the docker-compose used to launch the server, prometheus, grafana, etc.
The textile-fc (our service) uses a volume to map the repo folder to a Docker volume. Mostly for saving progress while restarting the compose environment, and tests index persistence.

@@ -523,7 +523,7 @@
],
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some improvements in clarity and new boxes that helped discover some things.

@@ -151,33 +150,3 @@ func (mi *MinerIndex) persistMetaIndex(index MetaIndex) error {
}
return nil
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to miners.go since loading from store is a general thing.

@@ -153,10 +154,11 @@ func (mi *MinerIndex) Close() error {
// miners.
func (mi *MinerIndex) start() {
defer func() { mi.finished <- struct{}{} }()
n, err := mi.api.ChainNotify(mi.ctx)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stop using ChainNotify for listening to the chain.
I've been thinking about this for quite some time since puts a lot of stress. It notifies about every possible forks that might happen in 10-20s time frame... we're only interested in the heaviest tipset known and let time (and lotus) decide which are the heaviest ones.

Switched to waiting the avg tipset time and just asking for the heaviest current tipset (which is what we need). Mini forks or possible future forks aren't interesting. We don't need to be at the bleeding-edge of that.

if len(info.Epochs) > 0 && info.Epochs[len(info.Epochs)-1] == patch[addr] {
continue
}
info.Epochs = append(info.Epochs, patch[addr])
index.Miners[addr] = info
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug fix. info was edited but not assigned to current index (so lost changes).

@@ -61,7 +61,7 @@ func New(maddr ma.Multiaddr, authToken string) (*API, func(), error) {
closer, err := jsonrpc.NewMergeClient("ws://"+addr+"/rpc/v0", "Filecoin",
[]interface{}{
&api.Internal,
}, headers)
}, headers, jsonrpc.WithReconnect(true, time.Second*3, 0))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Original PR that I did for lotus. You can configure a interval time to wait so that the websocket connection is recovered, adn the client can still use it as "nothing happened".

In the newer PR that will get merged, you can annotate with retry:"true" methods of the API. If that annotation is detected, the client retries itself the call until having success on a recovered connection. This makes the client even do less work since shouldn't do any kind of retry logic.

I may eventually consider switching to the new solution.

@@ -242,16 +242,10 @@ func monitorLotusSync(ctx context.Context, c *API) {
}

func refreshHeightMetric(ctx context.Context, c *API) {
var h uint64
state, err := c.SyncState(ctx)
heaviest, err := c.ChainHead(ctx)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplify curent height metric in grafana dashboard to the current heaviest tipset... not the newer potential new tipset that we discovered recently.

@@ -85,12 +85,21 @@ type client struct {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lotus/jsonrpc package is from Lotus. Has the changes proposed for autoconnect feature.
Can be dismissed for review... but if like crazy things might be fun.


ma "github.com/multiformats/go-multiaddr"
)

const (
AvgBlockTime = time.Second * 30
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In testnet is 45s. But in mainet will be 30s as mentioned in the spec.
Leaving 30s here since updating more often doesn't hurt; worst thing that can happen is doing nothing.

@jsign
Copy link
Contributor Author

jsign commented Jan 25, 2020

Will DCO after review just in case that might not invalidate review comments.

@jsign jsign marked this pull request as ready for review January 25, 2020 16:28
Copy link
Contributor Author

@jsign jsign left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some extra things added.

@@ -50,6 +49,7 @@ func (mi *MinerIndex) metaWorker() {
newIndex, err := updateMetaIndex(mi.ctx, mi.api, mi.h, mi.lr, addrs)
if err != nil {
log.Errorf("error when updating meta index: %s", err)
break
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If for some reason generating a new index fails, log the error and break here. If we continue we would override the current-good-state-index with an empty one and lost things. Better to keep it outdated than to lose it.

Comment on lines +104 to +105
Online: mi.index.Meta.Online,
Offline: mi.index.Meta.Offline,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These copies were missing so didn't appear correctly in the output.

Signed-off-by: jsign <jsign.uy@gmail.com>
Signed-off-by: jsign <jsign.uy@gmail.com>
Signed-off-by: jsign <jsign.uy@gmail.com>
Signed-off-by: jsign <jsign.uy@gmail.com>
@jsign
Copy link
Contributor Author

jsign commented Jan 28, 2020

Rebased.

Copy link
Member

@asutula asutula left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty unfamiliar with the details here, but didn't see anything that looks terribly wrong. Merge away!

@jsign jsign merged commit 2a5d3cb into master Jan 29, 2020
@jsign jsign deleted the jsign/clientretrier branch January 29, 2020 17:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

docker: user map repo to volume lotus/client: reconnect on failure
2 participants