
Syncing from scratch randomly slows down #352

Closed
ghost opened this issue Dec 11, 2016 · 13 comments

Comments

ghost commented Dec 11, 2016

This is an issue I have already reported in earlier versions of Lisk. Syncing from scratch works better than before, but it's still far from perfect.

It's understandable that syncing can be slow at the beginning, given the number of transactions and the CPU time needed to verify them, and after a while syncing does speed up to a reasonable rate. Then, suddenly and without any additional logging, the process becomes very slow again (in the disk usage charts below I have marked the areas with good speed in green and the slow areas in red). Restarting the Lisk process fixes the issue temporarily, but even so, with the current version of Lisk it took 24 hours to sync from the genesis block to block 3209245310885481431. I've described in #351 why only up to this block; #351 is a different issue than this one, as here there were no additional errors or log messages, as mentioned above.

CPU usage seems to be the same while it's syncing at a reasonable speed and while it's syncing very slowly. By slow I mean abnormally slow; sometimes fetching a new block takes longer than the network's block interval, which technically makes it impossible to sync.

Additionally, CPU usage hovers at roughly 25%, and the same can be read from the load average, which suggests that syncing could be up to 4x faster than it currently is with the same implementation of the cryptographic functions and the block/transaction verification logic. Three quarters of the CPU power is left idle.

Possible solutions:

  • Improve the current logic to fix the random slowdowns during syncing

  • Improve the current logic to use the additional 75% of CPU power that is currently idle

  • Implement a better cryptographic library than the commonly used official JS one, possibly written in C/C++ or another fast low-level language. Possibly also rewrite the transaction/block verification code as a C/C++ library for JS; this is a necessary step to achieve reasonable scalability.

  • Syncing speed could possibly also be improved by moving communication between nodes to WebSockets, as proposed in Network connectivity channels #347; this could be a big step forward, as it would positively affect block propagation times across the network (see the sketch after this list).
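
A minimal sketch of node-to-node block propagation over WebSockets using the ws package (the port, message format, and relay logic are assumptions for illustration, not the #347 design):

```js
const WebSocket = require('ws');

// Each peer runs a WebSocket server and relays new blocks to connected peers.
const wss = new WebSocket.Server({ port: 7000 }); // assumed port

wss.on('connection', (socket) => {
  socket.on('message', (raw) => {
    const msg = JSON.parse(raw);
    if (msg.type === 'newBlock') {
      // Re-broadcast the block announcement to every other connected peer.
      for (const peer of wss.clients) {
        if (peer !== socket && peer.readyState === WebSocket.OPEN) {
          peer.send(raw);
        }
      }
    }
  });
});

// Outbound side: announce a freshly processed block to a known peer.
const peer = new WebSocket('ws://127.0.0.1:7000');
peer.on('open', () => {
  peer.send(JSON.stringify({ type: 'newBlock', block: { height: 1, id: 'abc' } }));
});
```

Keeping the connection open avoids the per-request overhead of polling peers over HTTP, which is the main reason block propagation would be expected to improve.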

Another problem is that getting Lisk to sync without a snapshot from lisk.io is tricky and confusing enough that most users will ignore it and go with the centralised snapshot. I reported this in LiskArchive/lisk-build#57, but it has been ignored; moreover, the option in installLisk.sh to sync from scratch does not work. I managed to do it in a hacky way by creating a fake file.db.gz and running bash lisk.sh rebuild -f file.db.gz. I think this should be as easy as choosing the install location or choosing between the main network and the test network, so that more people are encouraged to sync from other peers, locate syncing issues, and so on. That is generally a good approach for a decentralised project.

[Screenshot 1: disk usage chart "test1"]
[Screenshot 2: disk usage chart "test2"]

Additional information about the hardware the tests were run on:

Hardware Class: cpu
Arch: X86-64
Vendor: "GenuineIntel"
Model: 6.63.2 "Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz"
RAM: 8 GB
CPU cores: 4
mrv777 (Contributor) commented Dec 11, 2016

It's only using 25% because you have 4 cores. Node is a single-threaded process, so it can only use one core. Multithreading would be great, but isn't easy. You could start tackling it yourself if you wanted, though :)

ghost (Author) commented Dec 11, 2016

I'm not familiar with Node, as I'm not a big fan of JS at all; it doesn't matter how many cores I have.
It needs to be fully multithreaded. I haven't worked with Node.js, but it must be possible; reasonable multithreading can even be implemented in PHP - I've done it and it worked flawlessly. Even if it's too much trouble to multithread Node.js, some of the syncing logic could be rewritten in a low-level language as an additional module that takes care of the things which don't work well in a one-thread-per-request model.

mrv777 (Contributor) commented Dec 11, 2016

Sorry if there is some confusion. Node.js can be used with multiple threads; it's just that Lisk's Node code is not currently written that way. I'm sure it will be rewritten at some point, but for now that is why you see it at 25% on a 4-core system.
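
For illustration, a minimal sketch of offloading CPU-bound work to a separate thread with Node's worker_threads module (only available in Node.js versions newer than what Lisk ran on at the time; the verifyBlock function here is a hypothetical placeholder, not Lisk's actual verification code):

```js
const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');

if (isMainThread) {
  // Main thread: delegate a CPU-heavy verification job to a worker thread.
  const worker = new Worker(__filename, { workerData: { blockId: 123 } });
  worker.on('message', (result) => console.log('verified:', result));
  worker.on('error', (err) => console.error('worker failed:', err));
} else {
  // Worker thread: run the expensive check without blocking the main event loop.
  const verifyBlock = (blockId) => {
    // Placeholder standing in for signature/transaction verification work.
    let acc = 0;
    for (let i = 0; i < 1e7; i++) acc += i % 7;
    return { blockId, ok: true, acc };
  };
  parentPort.postMessage(verifyBlock(workerData.blockId));
}
```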

ghost (Author) commented Dec 11, 2016

I have checked; it seems some versions of Node.js support reasonable multithreading and some don't. Let's wait for @karmacoma to take a position.

Isabello (Contributor) commented

There are plans to clusterize the process at some point.

maxkordek commented Dec 12, 2016

There are also plans to re-write time/performance critical functionalities into a low level language. This will probably be done later in the second part of the Ascent phase.

karmacoma (Contributor) commented

@karek314 Regarding your possible solutions:

Improve the current logic to fix the random slowdowns during syncing

I don't see a possible solution here.

Improve the current logic to use the additional 75% of CPU power that is currently idle

I assume you mean allowing the other CPU cores to be utilised. At the persistence layer, PostgreSQL is already utilising multiple cores, and that is where much of the heavy lifting is done. As already mentioned by @Isabello, we plan to clusterize the node.js application itself into several distinct processes. There is also ongoing work in #302 that will improve the efficiency with which "work" is actually delegated to the persistence layer.
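
As a rough illustration of what clusterizing into several processes could look like, a minimal sketch using Node's built-in cluster module (the worker body is a stand-in, not the actual Lisk entry point):

```js
const cluster = require('cluster');
const os = require('os');

if (cluster.isMaster) {
  // Master process: fork one worker per CPU core and restart any that die.
  const cores = os.cpus().length;
  for (let i = 0; i < cores; i++) cluster.fork();
  cluster.on('exit', (worker, code) => {
    console.log(`worker ${worker.process.pid} exited (${code}), restarting`);
    cluster.fork();
  });
} else {
  // Worker process: each one would run its own slice of the application logic.
  // Placeholder workload standing in for the real entry point:
  console.log(`worker ${process.pid} started`);
}
```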

Implement a better cryptographic library than the commonly used official JS one, possibly written in C/C++ or another fast low-level language. Possibly also rewrite the transaction/block verification code as a C/C++ library for JS; this is a necessary step to achieve reasonable scalability.

@4miners recently introduced a change from js-nacl to libsodium, which has improved the speed of cryptographic operations by approx. 3 times.
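
For context, a minimal sketch of timing detached signature verification with the libsodium-wrappers package (an illustrative micro-benchmark with an assumed payload and iteration count, not Lisk's actual verification path):

```js
const sodium = require('libsodium-wrappers');

(async () => {
  await sodium.ready;

  // Generate a throwaway Ed25519 key pair and sign a sample payload.
  const { publicKey, privateKey } = sodium.crypto_sign_keypair();
  const message = sodium.from_string('sample transaction payload');
  const signature = sodium.crypto_sign_detached(message, privateKey);

  // Verify the same signature repeatedly to get a rough ops/sec figure.
  const iterations = 10000;
  const start = Date.now();
  for (let i = 0; i < iterations; i++) {
    sodium.crypto_sign_verify_detached(signature, message, publicKey);
  }
  const elapsed = (Date.now() - start) / 1000;
  console.log(`${Math.round(iterations / elapsed)} verifications/sec`);
})();
```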

At this point, IMO the bottleneck is not the language or the level at which it is written. The inefficiencies are largely related to the way db connections/queries are being conducted. Once again, #302 should address this, especially in the area of block/transaction processing. We are also looking at ways to improve the apply and undo transaction operations, which are the most costly.
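
To illustrate the kind of query-level inefficiency being referred to, a hedged sketch comparing per-row inserts with a single batched insert using pg-promise (the connection string, transactions table, and columns are assumptions for illustration, not Lisk's actual schema):

```js
const pgp = require('pg-promise')();
const db = pgp('postgres://user:pass@localhost:5432/lisk_test'); // assumed connection

const rows = [
  { id: 't1', amount: 100 },
  { id: 't2', amount: 250 },
]; // assumed row shape, for illustration only

// Slow pattern: one database round trip per row.
async function insertOneByOne() {
  for (const row of rows) {
    await db.none('INSERT INTO transactions(id, amount) VALUES($1, $2)', [row.id, row.amount]);
  }
}

// Faster pattern: build a single multi-row INSERT and send it once.
async function insertBatched() {
  const cs = new pgp.helpers.ColumnSet(['id', 'amount'], { table: 'transactions' });
  const query = pgp.helpers.insert(rows, cs);
  await db.none(query);
}
```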

Syncing speed could possibly also be improved by moving communication between nodes to WebSockets, as proposed in #347; this could be a big step forward, as it would positively affect block propagation times across the network.

Yes, we are already in agreement on this. We can further discuss your proposal in #347.

Another problem is that getting Lisk to sync without a snapshot from lisk.io is tricky and confusing enough that most users will ignore it and go with the centralised snapshot. I reported this in LiskArchive/lisk-build#57, but it has been ignored; moreover, the option in installLisk.sh to sync from scratch does not work.

@Isabello has reopened the issue on lisk-build where installLisk.sh is maintained. We are not ignoring the issue.

ghost (Author) commented Dec 12, 2016

I don't see a possible solution here.

There is a solution; something can clearly be improved when a simple restart of Lisk fixes the issue and syncing then stays at a reasonable speed for some random length of time. If it's related to connectivity, rewriting node-to-node communication to use WebSockets should improve that. I haven't taken a look at the code; I don't have time to do this for free. I believe there is always a solution to every problem.

At this point, IMO the bottleneck is not the language or the level at which it is written. The inefficiencies are largely related to the way db connections/queries are being conducted. Once again, #302 should address this, especially in the area of block/transaction processing. We are also looking at ways to improve the apply and undo transaction operations, which are the most costly.

Good to know about the db inefficiencies caused by how the queries are made, but I think the language will become the next bottleneck sooner or later.

@Isabello has reopened the issue on lisk-build where installLisk.sh is maintained. We are not ignoring the issue.

Yes and no. I've been discussing this with her, but I couldn't get her to agree with me. Lisk is a decentralised project; we should encourage every user to build their database on top of data collected from peers found in the network, instead of defaulting to the snapshot. A snapshot is a great way to sync a node very quickly, much like syncing Ethereum in non-archive mode, which is simply fast but less secure.
There should be a clear question when installing / rebuilding a node, in installLisk.sh as well as in lisk.sh: a clear question whether the user would like to do a full sync from the network or sync from the centralised snapshots. Currently the only way to do a full sync is hacky and buried in the help output, hardly noticeable, and buggy anyway. This should be loud and clear.

Let me bring up some possible attacks when every user is encouraged to sync from the snapshot. I say encouraged because there is no question; going with the snapshot is the default option.

  • When Lisk was running with all 101 delegates solely owned by LiskHQ, it was effectively a LiskHQ-centralised network (no surprise the price was falling from the ICO, but anyway): at any time the Lisk blockchain could easily have been hijacked by forcing all users to upgrade at once. I'm not saying you ever did this or ever wanted to; I'm just saying it's a vulnerability. But having all delegates owned by one entity is a big threat to a decentralised project anyway, and I believe there is no need to elaborate on this.

  • What if someone takes over the lisk.io DNS records and distributes a fake blockchain copy with some amounts of money stolen from others, while LiskHQ publishes a new version of the network without backwards compatibility, forcing everyone to hard fork? This is possible, and there are a few possible scenarios for performing such attacks.

In summary, I believe that in every distributed-ledger, blockchain-based decentralised project, syncing the blockchain from the genesis block should be the primary option, with fast sync (snapshots, in the case of Lisk) as an opt-in, as it's obviously less secure. Moreover, forging delegates should be explicitly encouraged to sync from the genesis block, since the forging nodes are the ones writing data to the blockchain.

I know snapshots were mandatory in the first stages of Lisk, when a node couldn't possibly sync from the beginning, but now? It works stably enough.

4miners (Contributor) commented May 12, 2017

This issue is still valid; the slowdown during sync can be noticed on 0.9. Will investigate.

4miners (Contributor) commented May 14, 2017

After some investigation I found that the slowdown of sync is probably caused by transactions received by the node during sync. Each transaction received needs to be processed before and after processing a block (undo/redo of the unconfirmed balance). Over time they stack up and block processing becomes slower and slower.

Solution:
Don't allow the node to receive transactions while it is in the syncing state.
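
A minimal sketch of the kind of guard described above, assuming a hypothetical transport handler and a syncing() flag exposed by the loader (the names are illustrative, not Lisk's actual module API):

```js
// Hypothetical transport-layer handler for incoming transactions.
function onReceiveTransactions(transactions, loader, transactionPool) {
  // While syncing, drop incoming transactions instead of queueing undo/redo work.
  if (loader.syncing()) {
    return; // ignore peers' transactions until the node has caught up
  }
  for (const transaction of transactions) {
    transactionPool.queue(transaction); // normal path once synced
  }
}

// Usage sketch with stub dependencies:
const loader = { syncing: () => true };
const transactionPool = { queue: (tx) => console.log('queued', tx.id) };
onReceiveTransactions([{ id: 'tx1' }], loader, transactionPool);
```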

Isabello (Contributor) commented May 14, 2017 via email

ghost (Author) commented May 14, 2017

What may be important to add is that this issue has been occurring since the very early versions of Lisk (the first testnet release), up to and including the latest release.

diego-G commented Sep 10, 2018

Superseded by #2384

diego-G closed this as completed Sep 10, 2018