This repository has been archived by the owner on Nov 6, 2020. It is now read-only.

Linux OOM-killer kills Parity due to excessive I/O #1395

Closed
AlbieC opened this issue Jun 22, 2016 · 44 comments
Labels
F3-annoyance 💩 The client behaves within expectations, however this “expected behaviour” itself is at issue.
Milestone

Comments

@AlbieC

AlbieC commented Jun 22, 2016

Hi guys

Pretty new to this, but I have been mining successfully on a Geth node with multiple PCs (Windows and Linux) on a LAN using ethminer. Thought I'd switch to the Parity client to give it a go and support "client diversity" and all that, you know ... Anyway, I got it all working, but suddenly the issue below has come up, which I don't know how to resolve. Parity stops working and exits.

Any ideas ?
Many thanks.

albie@linux3 ~/Desktop $ parity -jw --jsonrpc-interface 192.168.127.2 --jsonrpc-port 8545 --author 0x3e089b6dF17ad25019488fA3665252Bd70E6292F --identity Eilandia

2016-06-22 17:59:52 Starting Parity/v1.1.0-beta/x86_64-linux-gnu/rustc1.8.0
2016-06-22 17:59:52 Configured for Frontier/Homestead using "Ethash" engine
2016-06-22 17:59:52 Public node URL: enode://a4d484e1146792580199a83c71efb7667683f3c09bbdedae9291ae0fd7220e1eddfc43ba0ebb02f843c658a35d8c761bec7b52a67fb61f61e7d9c6f0bed54c19@0.0.0.0:30303
2016-06-22 17:59:54 Using a conversion rate of Ξ1 = US$14.5 (16420361000 wei/gas)
[ #1751510 2af8…d546 ]---[ 0 blk/s | 0 tx/s | 0 gas/s  //··· 0/1 peers, #1751510, 0+0 queued ···// mem: 49 MiB db, 195 KiB chain, 2 KiB queue, 7 KiB sync ]
[ #1751510 2af8…d546 ]---[ 0 blk/s | 0 tx/s | 0 gas/s  //··· 0/2 peers, #1751510, 0+0 queued ···// mem: 49 MiB db, 195 KiB chain, 2 KiB queue, 7 KiB sync ]
thread 'IO Worker #1' panicked at 'Not found! 131a13167076d37f8fa56d04c67394b87ada5cd39d97693cdf38141a359beb57', util/src/trie/triedb.rs:215
note: Run with `RUST_BACKTRACE=1` for a backtrace.
2016-06-22 18:00:10 Finishing work, please wait...

Some other error messages after another try:
2016-06-22 18:12:14 Starting Parity/v1.1.0-beta/x86_64-linux-gnu/rustc1.8.0
2016-06-22 18:12:14 Configured for Frontier/Homestead using "Ethash" engine
2016-06-22 18:12:14 Public node URL: enode://a4d484e1146792580199a83c71efb7667683f3c09bbdedae9291ae0fd7220e1eddfc43ba0ebb02f843c658a35d8c761bec7b52a67fb61f61e7d9c6f0bed54c19@0.0.0.0:30303
2016-06-22 18:12:16 Using a conversion rate of Ξ1 = US$14.43 (16500015000 wei/gas)
[ #1751510 2af8…d546 ]---[ 0 blk/s | 0 tx/s | 0 gas/s  //··· 0/0 peers, #1751510, 0+0 queued ···// mem: 49 MiB db, 195 KiB chain, 2 KiB queue, 7 KiB sync ]
[ #1751510 2af8…d546 ]---[ 0 blk/s | 0 tx/s | 0 gas/s  //··· 1/1 peers, #1751510, 0+0 queued ···// mem: 49 MiB db, 195 KiB chain, 2 KiB queue, 29 KiB sync ]
thread 'IO Worker #3' panicked at 'Not found! 82bcfcb3d2c37878844f0af0c8b5bc39f707c84de6db08f37ef6f52d94070b41', util/src/trie/triedb.rs:215
note: Run with `RUST_BACKTRACE=1` for a backtrace.
2016-06-22 18:12:31 Stage 3 block verification failed for #1751512 (0f8c…c280)
Error: Block(UnknownParent(79a58d56253bcca3b8394e6e24e86dc4ec11e68ba4ad187656feb21e87cdf6c7))
2016-06-22 18:12:31 Finishing work, please wait...
thread 'Verifier #0' panicked at 'Error sending BlockVerified message: Mio(Error { repr: Custom(Custom { kind: ConnectionAborted, error: StringError("Network IO notification error") }) })', ../src/libcore/result.rs:746
thread '<main>' panicked at 'called `Result::unwrap()` on an `Err` value: "PoisonError { inner: .. }"', ../src/libcore/result.rs:746
@gavofyork
Contributor

did you run with any --pruning option at any point?

@AlbieC
Author

AlbieC commented Jun 22, 2016

Nope. It all went well in the beginning. I had it running for 2 days, syncing on its own on a newly installed Linux machine meant to serve as a dedicated client for all my miners on the LAN. Then today I started connecting my ethminer clients to it. I had a few issues at the beginning, apparently with the geth-compatible flags, but after I changed to the Parity flags, it started and ran smoothly. I went out for 2 hours and came back to this issue.

@arkpar
Collaborator

arkpar commented Jun 22, 2016

Looking at the log it panics right after start. Was there the same panic message when it happened for the first time? Could it be that it ran out of disk space before?

@AlbieC
Author

AlbieC commented Jun 22, 2016

As I said, no panic/issues the first time. Ethminer Clients just could not connect when I used some geth compatible flags at first on the parity start command line, but after I changed to parity flags, the miners connected and it ran smoothly , or at least until I left.

@AlbieC
Author

AlbieC commented Jun 22, 2016

Also, it is a newly formatted 500 GB hard drive. I also have 2 GB RAM on the machine; could that be too little?

@rphmeier
Contributor

What's causing the panic is that there is a node in the storage trie which references another node that should be in the DB, but for some reason isn't. This is a symptom of some other bug. A temporary fix would be to delete the chain's databases (stored in ~/.parity/<genesis-hash>) and re-sync.

@gavofyork
Contributor

500GB / 2GB should be plenty in terms of specs; if possible, please use the beta branch.

@AlbieC
Author

AlbieC commented Jun 23, 2016

@gavofyork Use beta branch without deleting chain database ? Do you know if the beta branch is free of this bug ?

@rphmeier If this is a common issue, should I not just wait for a fix before using Parity again? I'm just worried I do a resync and then have the bug appear again ....

@rphmeier
Contributor

Your key-value database is missing a key-value pair that other items in the database refer to. Unfortunately, it doesn't matter which version of the software you run it with, because that lookup will fail regardless. The bug here is not that the lookup is failing, it is that something earlier caused that database entry not to be written out. In order to get a coherent database, all transactions need to be re-run to regenerate the state database. You can do that by re-syncing; as far as I see it there is no other option. Maybe there is another way that I'm not considering.
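To illustrate why the lookup fails no matter which version runs it, here is a minimal Python sketch of a hash-keyed node store with a dangling reference. It is not Parity's trie code: the `put_node`, `get_node` and `read_leaf` helpers, the JSON encoding and the use of SHA-256 are all assumptions made for the illustration.

```python
import hashlib
import json


def put_node(db, node):
    """Store a node under the hash of its encoding and return that hash."""
    encoded = json.dumps(node, sort_keys=True).encode()
    key = hashlib.sha256(encoded).hexdigest()
    db[key] = encoded
    return key


def get_node(db, key):
    """Fetch a node by hash; a missing entry is fatal for the lookup."""
    if key not in db:
        raise KeyError("Not found! " + key)
    return json.loads(db[key])


def read_leaf(db, root_key):
    """Follow the single child reference from the root to its leaf."""
    root = get_node(db, root_key)
    return get_node(db, root["child"])["value"]


db = {}
leaf_key = put_node(db, {"value": 42})
root_key = put_node(db, {"child": leaf_key})

healthy = read_leaf(db, root_key)  # both nodes present, lookup succeeds

# Simulate a write that never reached disk: the parent still refers to the child.
del db[leaf_key]
try:
    read_leaf(db, root_key)
    lookup_failed = False
except KeyError:
    lookup_failed = True
```

The parent node is intact, so nothing looks wrong until a read walks through it. Only rebuilding the missing entry (here, re-running `put_node`; for a real node, re-syncing) makes the store coherent again.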

That said, there have been numerous improvements since the 1.1 release. Although I don't know for sure if the scenario which led to the database error here has been solved, the issue is certainly present in the 1.1 release. Trying with the 1.2 beta definitely can't hurt.

@rphmeier rphmeier added the F2-bug 🐞 The client fails to follow expected behavior. label Jun 23, 2016
@mista66

mista66 commented Jun 23, 2016

@AlbieC Could you give me the parity and miner flags you used, so that i can test this?

i see
parity -jw --jsonrpc-interface 192.168.127.2 --jsonrpc-port 8545 --author 0x3e089b6dF17ad25019488fA3665252Bd70E6292F --identity Eilandia

what are your miner flags and ethminer version?

ethminer --farm-recheck 500 -G -F 192.168.127.2 :8545

@AlbieC
Author

AlbieC commented Jun 23, 2016

@mista66 After installation (great easy 1 line installation BTW !) , and running just ( parity -j ) after, I left it running for 2 days to sync, not knowing how fast it will sync. After 2 days I got back to the machine and stopped Parity. I was then going to try to connect the miners to it, so I think I started the test with command line something like ( parity --geth --rpc --rpcaddr "192.168.127.2" -jw ) . I then tried to connect the miners by typing/(running bat file) the following : (ethminer -G -F 192.168.127.2 :8545 ) This did not work as I was getting rpc json connection errors if I remember correctly. I then changed the parity start commandline to pure parity flags like the following: ( parity -jw --jsonrpc-interface 192.168.127.2 --jsonrpc-port 8545 --author 0x3e089b6dF17ad25019488fA3665252Bd70E6292F --identity Eilandia )

This worked and the miners all connected and started to mine correctly. I think I had to delete and recreate 1 DAG file on only one of the miner PCs. I was feeling chuffed and left it all running while I went out, and when I came back 2 hours later all the clients were disconnected and Parity had exited. I could then not run Parity again, as I got the panic messages every time, a few seconds after it started.

@AlbieC
Author

AlbieC commented Jun 23, 2016

@rphmeier
Hi Robert. Am I understanding you correctly here ? By key-value pair and database, do you mean it has anything to do with my account key linked to the account I'm mining to in the Parity command line?
I ask because I synced the first time without any account created on the new PC. I also did not import or copy my account key into this machine yet as I did not think it was necessary for mining to any account to have the key on the machine in the Parity directory.

@mista66

mista66 commented Jun 23, 2016

@AlbieC ok with no account keys on the mining machine/node ! what if you create an account first?

@rphmeier
Contributor

rphmeier commented Jun 23, 2016

@AlbieC

This is a database which holds all storage values of all accounts on the blockchain, as well as account balances, nonces, and code. It is populated naturally through the process of syncing, executing transactions, and importing blocks. Under the hood, we use rocksdb to implement this. For some reason, one of the key-value pairs that is expected to be in this database is not. It could be a bug in parity or rocksdb. The symptom you are observing is the failed query, but the cause would be more interesting to reproduce.

Please try a fresh sync with the beta branch. At best, the problem resolves itself. At worst, we get another data point towards solving the issue. A full sync usually takes around 3-4 hours on my laptop, so it will almost definitely be faster than 2 days on your machine.

@AlbieC
Author

AlbieC commented Jun 23, 2016

@rphmeier Ok, I'll start it tonight, but I live in SA with not too fast broadband (5 Mbps) and not too many nodes nearby, I think. Anyway, let's see how it goes. Is the beta branch also so easy to install, or must I compile something from source first? Not too clued up with that! :)

@rphmeier
Contributor

rphmeier commented Jun 23, 2016

You will have to build from source, but it doesn't take an absurd amount of time.
There are some build instructions on the main page -- you need to install git if you don't have it already.

To build on the beta branch instead of master, run `git fetch origin beta && git checkout beta` after the `cd parity` command. Then proceed as normal with the instructions and it will build.

Once it is done building, it will output a parity binary in the target/release directory. Run that, and it will start syncing.

@AlbieC
Author

AlbieC commented Jun 23, 2016

@rphmeier Ok thanks, got the development version compiled and it's busy syncing. 23H00 here now so I'm going to sleep and see if it's finished tomorrow morning. :) 👍

@AlbieC
Author

AlbieC commented Jun 25, 2016

Just an update. Last night the Parity dev version had not finished syncing yet, and when I woke up this morning I saw the "killed" message in the console, followed by the command prompt.

so I tried again ...

albie@linux3 ~/parity/target/release $ parity
thread '<main>' panicked at 'Low-level database error.: "Corruption: Snappy not supported or corrupted Snappy compressed block contents"', ../src/libcore/result.rs:746
note: Run with RUST_BACKTRACE=1 for a backtrace.
albie@linux3 ~/parity/target/release $

but I get the same response every time now.

I have now gone back to the V1.2 one-line installer to try the latest version. It installed very fast and is now unresponsive after the "parity -j" command, but I can see it is frantically busy doing something, so I'll wait until it has finished to see whether V1.2 picks up syncing the database where the development version left off.

@gavofyork
Contributor

prior to the killed message, could you paste the console output?

parity was apparently killed by the system during a database commit - this likely means the database is corrupted and, you guessed it, a resync is required.

2GB shouldn't be too little memory - i've synced on a lot less - but one of the reasons a process can be killed is running out of memory. it may be worth adding a swap file. also check the amount of disk space free to ensure that there really is 500GB free.
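A quick way to check the three things mentioned here (RAM, swap, free disk space) is sketched below. This is only an illustration: `parse_meminfo` and `resource_summary` are hypothetical helper names, and the sample text is made up rather than taken from the reporter's machine.

```python
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of numeric values (kB for sizes)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def resource_summary(meminfo_text, path="."):
    """Return RAM, swap and free disk space in MiB for a quick sanity check."""
    info = parse_meminfo(meminfo_text)
    return {
        "ram_mib": info.get("MemTotal", 0) // 1024,
        "swap_mib": info.get("SwapTotal", 0) // 1024,
        "disk_free_mib": shutil.disk_usage(path).free // (1024 * 1024),
    }


# Illustrative sample only; the numbers are assumptions.
SAMPLE = (
    "MemTotal:        2048000 kB\n"
    "MemFree:          102400 kB\n"
    "SwapTotal:             0 kB\n"
)
summary = resource_summary(SAMPLE)
```

On a real Linux machine you would pass the contents of `/proc/meminfo` instead of `SAMPLE`, and point `path` at the directory holding the chain database. A `swap_mib` of 0 would mean no swap is configured at all.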

@AlbieC
Author

AlbieC commented Jun 25, 2016

@gavofyork nothing unusual above the "killed" message except the normal lines of previously imported blocks.

I expected another database corruption ... losing steam now with Parity.
I was thinking that for some reason Parity is suddenly using a lot more resources. Just my feeling. Or maybe it's my hardware? But then again, I read that other people are posting the same kind of error messages as I got the first time.

PS. I also got an error with the Parity V1.2 resync earlier. Have given up on that machine. Going to try installing Parity on another Linux machine. It's busy downloading, but it's going very slowly for some reason; maybe too many people are downloading Parity now? :)

@gavofyork gavofyork changed the title Problem with farm mining on a Parity Node Node gets "killed" and leaves corrupt database Jun 25, 2016
@gavofyork
Contributor

it's difficult to tell regarding hardware, but a simple killed message does point to a lack of resources of some sort.

it would be useful to know if a decently-sized swap partition fixes the problem.

@AlbieC
Author

AlbieC commented Jun 25, 2016

@gavofyork Is there any way to know where the database got corrupted? I ask because then one could probably export up to that point, reimport that part, and resync from there?
I'll try to read up on how to set up a swap partition on Linux. I only know Windows; Linux is very new to me.

@gavofyork
Contributor

gavofyork commented Jun 25, 2016

yes - you should be able to export more or less freely.

try `parity export dump.rlp`

also, you can try `dmesg | egrep -i -B100 'killed process'` to see why parity got killed. reading around the web, it very much looks as though your machine has too little memory, which is really strange since parity syncs fine on a 1GB Raspberry Pi 2.
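For anyone who prefers to post-process the kernel log, the following Python sketch does roughly what the `egrep -i -B` search above does: it finds "Killed process" lines and keeps a few lines of context before each. The `find_oom_kills` helper is hypothetical, and the sample log text is made up to resemble typical OOM-killer output; it is not from this thread.

```python
import re

# Matches kernel lines such as "Killed process 1234 (parity) total-vm:..."
KILLED = re.compile(r"killed process (\d+) \(([^)]+)\)", re.IGNORECASE)


def find_oom_kills(log_text, context=3):
    """Return (pid, name, preceding lines) for each 'Killed process' line."""
    lines = log_text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        match = KILLED.search(line)
        if match:
            before = lines[max(0, i - context):i]
            hits.append((int(match.group(1)), match.group(2), before))
    return hits


# Illustrative log text; the PIDs, scores and sizes are assumptions.
SAMPLE_LOG = """\
[1000.000001] parity invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[1000.000002] Out of memory: Kill process 4242 (parity) score 901 or sacrifice child
[1000.000003] Killed process 4242 (parity) total-vm:3000000kB, anon-rss:1800000kB, file-rss:0kB
"""
kills = find_oom_kills(SAMPLE_LOG)
```

To run it against a live system, the text could come from `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`; reading the kernel log may require elevated privileges on some distributions.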

@AlbieC
Author

AlbieC commented Jun 25, 2016

Thanks for the help and suggestions @gavofyork .
Busy with the export on that machine, but it is taking forever and the machine is unresponsive in the meantime.
I'll try that when it has finished or calmed down. In the meantime I'm trying to resync on another Linux Mint machine with Parity 1.2 just installed. Will probably take me another 2 days ...

@gavofyork
Contributor

do you know how far the syncing got before being killed?

@AlbieC
Author

AlbieC commented Jun 25, 2016

around the 1.4 mil blocks

@gavofyork
Contributor

according to one of the answers on http://stackoverflow.com/questions/726690/who-killed-my-process-and-why

If you are not the sysadmin, the sysadmin may have set up quotas on CPU, RAM, or disk usage and auto-kills processes that exceed them.

i've never used linux mint, but i guess it is possible that it is configured to kill something that uses as much resources while syncing as parity does. if possible i'd recommend ubuntu; it is well tested and no problems have been reported.

@AlbieC
Author

AlbieC commented Jun 25, 2016

As far as I know, Mint is very much based on Ubuntu. I normally follow Ubuntu tutorials on Mint and have never had a compatibility problem.

@gavofyork
Contributor

A recent commit that tweaked the database configuration substantially increased the I/O load for the database.

Such a large amount of I/O could perhaps have caused the kernel to kill parity; we'll re-tweak this now.

@AlbieC
Author

AlbieC commented Jun 25, 2016

That would explain it.
It also seems certain Linux distros (and maybe older hardware too?) can be more susceptible to the OOM killer:
http://stackoverflow.com/questions/726690/who-killed-my-process-and-why

As dwc and Adam Jaskiewicz have stated, the culprit is likely the OOM Killer. However, the next question that follows is: How do I prevent this?

There are several ways:

Give your system more RAM if you can (easy if its a VM)
Make sure the OOM killer chooses a different process.
Disable the OOM Killer
Choose a Linux distro which ships with the OOM Killer disabled.
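As a rough sketch of the "make sure the OOM killer chooses a different process" option: Linux exposes a per-process `oom_score_adj` file under `/proc/<pid>/`, accepting values from -1000 to 1000, where lower values make a process a less likely victim. The helper name `set_oom_score_adj` and the PID below are assumptions for illustration, and the example writes to a temporary directory rather than the real `/proc`.

```python
import os
import tempfile


def set_oom_score_adj(pid, value, proc_root="/proc"):
    """Write an OOM score adjustment (-1000..1000) for the given process."""
    if not -1000 <= value <= 1000:
        raise ValueError("oom_score_adj must be between -1000 and 1000")
    path = os.path.join(proc_root, str(pid), "oom_score_adj")
    with open(path, "w") as handle:
        handle.write(str(value))
    return path


# Dry run against a temporary directory standing in for /proc.
fake_proc = tempfile.mkdtemp()
os.makedirs(os.path.join(fake_proc, "4242"))  # 4242 is a made-up PID
written = set_oom_score_adj(4242, -500, proc_root=fake_proc)
with open(written) as handle:
    stored = handle.read()
```

On a real system, lowering the value normally needs root privileges, and it only changes which process gets killed; it does not reduce memory pressure itself.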

@AlbieC
Author

AlbieC commented Jun 25, 2016

I've also since read that people say increasing swap size doesn't help with the OOM problem.

Where can I find logs of OOMKiller?

Typically in /var/log directory. Either /var/log/kern.log or /var/log/dmesg

Hope this will help you.
Some typical solutions:

Increase memory (not swap)
Find the memory leaks in your program and fix them
Restrict memory any process can consume (for example JVM memory can be restricted using JAVA_OPTS)
See the logs and google :)

@AlbieC
Author

AlbieC commented Jun 25, 2016

Also good info here on OOM issues, http://lwn.net/Articles/317814/

@AlbieC
Author

AlbieC commented Jun 27, 2016

Just another update. Finished syncing on V1.2 installed on another Linux machine. Now I'll first copy/back up the database and then start testing the mining again. Holding thumbs.

Just one other thing. I am currently running 2 live nodes on different machines: one with Geth 1.4.8 (Windows 10) and now a second with Parity V1.2. I've noticed the Parity client seems to be a bit behind the Geth client in downloading new blocks from the network, and sometimes even "misses" a block (or doesn't show it in the console)? Anyone else seen this?

@arkpar
Collaborator

arkpar commented Jun 27, 2016

The console log gets updated every 5 seconds and not with every new block.
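A small simulation shows how a 5-second status line can appear to "miss" a block. Only the 5-second interval comes from this thread; the `blocks_shown` helper, the assumption that each status line prints the newest imported block, and the arrival times are all made up for illustration.

```python
def blocks_shown(arrivals, interval=5.0):
    """Return the block numbers a periodic status line would display.

    arrivals: list of (time_in_seconds, block_number), sorted by time.
    """
    if not arrivals:
        return []
    shown = []
    latest = None
    index = 0
    tick = interval
    end = arrivals[-1][0] + interval
    while tick <= end:
        # Advance to the newest block imported before this tick.
        while index < len(arrivals) and arrivals[index][0] <= tick:
            latest = arrivals[index][1]
            index += 1
        if latest is not None and (not shown or shown[-1] != latest):
            shown.append(latest)
        tick += interval
    return shown


# Made-up arrival times: blocks 101 and 102 land within the same 5 s window.
arrivals = [(2.0, 100), (11.0, 101), (13.5, 102), (27.0, 103)]
shown = blocks_shown(arrivals)
```

Here block 101 is imported but never displayed, because block 102 arrives before the next status line is printed. A client that logs every block as it arrives would show all four.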

@AlbieC
Author

AlbieC commented Jun 27, 2016

ok, thanks, that's probably where the difference is between Parity and Geth then.

@AlbieC
Author

AlbieC commented Jun 27, 2016

ok, hold thumbs, here goes testing my miners on my newly installed Parity V1.2 with synced (and backed up ) database ...:)

@AlbieC
Author

AlbieC commented Jun 27, 2016

I noticed very fast resyncing when setting --bootnodes to a live geth node I am running on my LAN :)
One question, though: does this only influence which nodes it connects to at startup, or will that be the only node it stays connected to?

@arkpar
Collaborator

arkpar commented Jun 27, 2016

Only affects startup. You can have it connected all the time with the --reserved-peers option.

@AlbieC
Author

AlbieC commented Jun 27, 2016

Thanks @arkpar . So if I set --bootnodes pointing to a local geth node it should stay connected to that node as long as the geth node stays up/online, but the parity node will also connect to other nodes on network and not stop working/communicating/syncing if my local geth node fails or drops connection ? Did I get it right ?

@arkpar
Collaborator

arkpar commented Jun 27, 2016

Right, except that after reaching the peer limit, connection with the local geth node might get replaced by another peer. --reserved-peers guarantees that this won't happen.

@AlbieC
Author

AlbieC commented Jun 28, 2016

Just another update. I was out for the whole day and came back to another instance of "killed" on the Parity node. I had 1 miner mining to it for 24 hours, and others on and off. No issues except the last "killed" message when I got back home. However, it seems the database was not corrupted this time, as I just started Parity up again and it resynced and the miner resumed mining.

It seems to be resources/memory related, as I had another console window open when I left this morning, busy compiling Genoil's ethminer on the same machine, and that process also ran into errors and stopped.

@gavofyork
Contributor

Are you running with an HDD or SSD?

@gavofyork gavofyork changed the title Node gets "killed" and leaves corrupt database Linux OOM kills Parity due to excessive I/O Jul 6, 2016
@gavofyork gavofyork changed the title Linux OOM kills Parity due to excessive I/O Linux OOM-killer kills Parity due to excessive I/O Jul 6, 2016
@gavofyork gavofyork added F3-annoyance 💩 The client behaves within expectations, however this “expected behaviour” itself is at issue. and removed F2-bug 🐞 The client fails to follow expected behavior. labels Jul 6, 2016
@gavofyork gavofyork modified the milestone: 1.3 Acuity Jul 6, 2016
@AlbieC
Author

AlbieC commented Jul 6, 2016

HDD, but it may be older hardware related too. Don't know. Although, everything else seems to work fine. Busy mining with it using Genoil's ethminer now after new clean Linux install. Haven't tried installing Parity on it again.

@gavofyork gavofyork modified the milestones: 1.3 Acuity, 1.4 Civility Jul 28, 2016
@gavofyork
Contributor

probably using --db-compaction=hdd would have alleviated the problem. closing for now since it seems there is little else we can do and nobody else can confirm. please reopen if it becomes an issue again.


5 participants