bitmonerod: segfaults on (probably) corrupt lmdb blockchain data #898

radfish · 2016-07-10T00:53:41Z

My box had some unexpected unclean hard shutdowns due to hardware problems.

Now bitmonerod fails to start due to this segfault. The blockchain data was probably not closed/written to disk cleanly.

Expected behavior: upon encountering corruption in the blockchain DB on disk, bitmonerod should report it without crashing.

I have the lmdb data and the core for this. @hyc if you want it.

Monero 'Hydrogen Helix' (v0.9.4.0-18dd507)

Core was generated by `bitmonerod --config-file /etc/bitmonerod.conf'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0xb6e17ff8 in memcpy () from /usr/lib/libc.so.6
[Current thread is 1 (Thread 0xb6f7b000 (LWP 2775))]
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0xb6f7b000 (LWP 2775) 0x00327f00 in mdb_cursor_put.part ()
  2    Thread 0xb2fff450 (LWP 2786) 0xb6eebb10 in pthread_cond_timedwait@@GLIBC_2.4 ()
   from /usr/lib/libpthread.so.0
(gdb) bt
#0  0xb6e17ff8 in memcpy () from /usr/lib/libc.so.6
#1  0x00324ab0 in mdb_node_add ()
#2  0x00327f00 in mdb_cursor_put.part ()
#3  0x00329620 in mdb_txn_commit ()
#4  0x00274bf8 in cryptonote::mdb_txn_safe::commit(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) ()
#5  0x00252554 in cryptonote::BlockchainLMDB::open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int) ()
#6  0x001cb818 in cryptonote::core::init(boost::program_options::variables_map const&, cryptonote::test_options const*) ()
#7  0x000caf14 in daemonize::t_daemon::run(bool) ()
#8  0x00156fc0 in daemonize::t_executor::run_interactive(boost::program_options::variables_map const&) ()
#9  0x0008cea8 in main ()


hyc · 2016-07-10T01:12:02Z

Yes, save a copy of the LMDB data file please. I probably won't get to look at it any time soon though. Your backtrace appears to be from a non-debug build; can you get a trace from a debug build?

radfish · 2016-07-10T04:08:10Z

Core was generated by `/home/redfish/dev/bitmonero/build/bin/bitmonerod --data-dir /mnt/flext/sys/bitmo'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0xb68f0000 in memcpy () from /usr/lib/libc.so.6
[Current thread is 1 (Thread 0xb6f82000 (LWP 20141))]
(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0xb6f82000 (LWP 20141) 0xb68f0000 in memcpy () from /usr/lib/libc.so.6
  2    Thread 0xb312f450 (LWP 20152) 0xb69c3b10 in pthread_cond_timedwait@@GLIBC_2.4 ()
   from /usr/lib/libpthread.so.0
(gdb) bt
#0  0xb68f0000 in memcpy () from /usr/lib/libc.so.6
#1  0x00493108 in mdb_node_add (mc=0xbe98f930, indx=124, key=0xbe98f920, data=0xbe98f918,
    pgno=0, flags=65536) at /home/redfish/dev/bitmonero/external/db_drivers/liblmdb/mdb.c:7890
#2  0x0049204c in mdb_cursor_put (mc=0xbe98f930, key=0xbe98f920, data=0xbe98f918, flags=65536)
    at /home/redfish/dev/bitmonero/external/db_drivers/liblmdb/mdb.c:7531
#3  0x00489f14 in mdb_freelist_save (txn=0x23fc030)
    at /home/redfish/dev/bitmonero/external/db_drivers/liblmdb/mdb.c:3363
#4  0x0048b534 in mdb_txn_commit (txn=0x23fc030)
    at /home/redfish/dev/bitmonero/external/db_drivers/liblmdb/mdb.c:3835
#5  0x00460880 in cryptonote::mdb_txn_safe::commit (this=0xbe98fd9c, this@entry=0xbe98fd94,
    message="Failed to commit a transaction to the db")
    at /home/redfish/dev/bitmonero/src/blockchain_db/lmdb/db_lmdb.cpp:325
#6  0x0047e5bc in cryptonote::BlockchainLMDB::open (this=0x23fbe80, filename=...,
    mdb_flags=<optimized out>)
    at /home/redfish/dev/bitmonero/src/blockchain_db/lmdb/db_lmdb.cpp:1190
#7  0x003aa2a8 in cryptonote::core::init (this=this@entry=0x23f2230, vm=...,
    test_options=0xbe9905aa, test_options@entry=0xbe9912f4)
    at /home/redfish/dev/bitmonero/src/cryptonote_core/cryptonote_core.cpp:387
#8  0x001a7b2c in daemonize::t_core::run (this=0x23f2230)
    at /home/redfish/dev/bitmonero/src/daemon/core.h:72
#9  daemonize::t_daemon::run (this=0xbe9912f4, this@entry=0xbe9912ec,
    interactive=interactive@entry=true)
    at /home/redfish/dev/bitmonero/src/daemon/daemon.cpp:119
#10 0x00313068 in daemonize::t_executor::run_interactive (this=this@entry=0xbe9923b8, vm=...)
    at /home/redfish/dev/bitmonero/src/daemon/executor.cpp:68
#11 0x0031badc in daemonizer::daemonize<daemonize::t_executor>(int, char const**, daemonize::t_executor&&, boost::program_options::variables_map const&) (argc=<optimized out>,
    argv=<optimized out>,
    executor=executor@entry=<unknown type in /home/redfish/dev/bitmonero/build/bin/bitmonerod, CU 0xbfaa1f, DIE 0xcc8630>, vm=...)
    at /home/redfish/dev/bitmonero/src/daemonizer/posix_daemonizer.inl:85
#12 0x00318598 in main (argc=<optimized out>, argv=<optimized out>) at /home/redfish/dev/bitmonero/src/daemon/main.cpp:280

hyc · 2016-07-10T10:04:45Z

Can you also go to frame #6 and print m_height?

radfish · 2016-07-10T17:28:01Z

(gdb) up
#6  0x0047e5bc in cryptonote::BlockchainLMDB::open (this=0x23f7998, filename=..., 
    mdb_flags=<optimized out>)
    at /home/redfish/dev/bitmonero/src/blockchain_db/lmdb/db_lmdb.cpp:1190
1190      txn.commit();
(gdb) p m_height
$1 = 0

hyc · 2016-07-12T13:39:58Z

That's kind of what I expected. This says that it never read any valid block count from the DB when first opening it. I think some earlier function must have failed, before reaching here, and we didn't catch the error code.
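As a rough illustration of the failure mode described here (an unchecked error code letting the open routine run on to a commit), a minimal C++ sketch follows. Every name in it (`check_rc`, `OpenStep`, `open_db_sketch`) is hypothetical and not taken from db_lmdb.cpp; the only thing it borrows from LMDB is the convention that calls return 0 on success and a non-zero code on failure.

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: turn a non-zero, LMDB-style return code into an exception.
inline void check_rc(int rc, const std::string& what) {
    if (rc != 0)
        throw std::runtime_error(what + " failed with code " + std::to_string(rc));
}

// Hypothetical stand-in for one step of opening the database.
struct OpenStep {
    std::string name;
    std::function<int()> run;  // returns 0 on success, non-zero on failure
};

// Run the steps in order and stop at the first failure, so that later steps
// (such as a commit) never run on a database that failed an earlier check.
// Returns an empty string on success, or a printable error message.
inline std::string open_db_sketch(const std::vector<OpenStep>& steps) {
    try {
        for (const auto& step : steps)
            check_rc(step.run(), step.name);
    } catch (const std::runtime_error& e) {
        return e.what();
    }
    return "";
}
```

With every return code checked like this, a failed read of the block count would surface as an error message at startup rather than as a later crash inside the commit.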

hyc · 2016-08-22T13:45:07Z

We should think about a way to toggle from the default "--db-sync-mode fastest:async:1000" back down to "--db-sync-mode safe" once the daemon is fully synced. After the daemon has caught up to the network, we know that new blocks will only commit at about one every 2 minutes, so running in fully synchronous mode won't generate many disk flushes.

iamsmooth · 2016-08-22T23:36:47Z

I definitely agree with switching to safe mode once synced, but there is another case to consider. You already have gigabytes of blockchain downloaded but are offline for a time. When you come back online you are syncing again, and corruption at that point means you lose your whole DB.

I think any unsafe DB modes should only be used on initial sync, or if specified as a non-default (can be used by advanced users to speed up later partial syncs).
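The policy suggested so far could be sketched roughly as below. This is only an illustration of the proposal, not daemon code: the `SyncState` enum and `choose_db_sync_mode` function are hypothetical names, and an empty `explicit_mode` string is an assumed way of saying "the operator did not pass a mode". The mode strings are the ones quoted earlier in this thread.

```cpp
#include <string>

// Hypothetical description of where the daemon is relative to the network.
enum class SyncState {
    InitialSync,  // first download of the chain, nothing valuable on disk yet
    CatchingUp,   // existing DB, daemon was offline and is syncing again
    Synced        // caught up; roughly one new block every 2 minutes
};

// Pick a db sync mode. An explicitly requested mode always wins; otherwise
// the unsafe mode is used only while there is no existing DB to lose.
inline std::string choose_db_sync_mode(SyncState state,
                                       const std::string& explicit_mode) {
    if (!explicit_mode.empty())
        return explicit_mode;
    if (state == SyncState::InitialSync)
        return "fastest:async:1000";
    return "safe";
}
```

Under this sketch a node that comes back online with gigabytes of data falls into `CatchingUp` and stays in safe mode unless the operator opts out.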

hyc · 2016-08-22T23:58:43Z

Yeah, definitely unsafe modes should only be used if specified explicitly.

For your intermediate case, I think we could use NOMETASYNC by itself. That is still synchronous, but unlike full sync mode which does 2 fsyncs per commit, it only does 1 fsync per commit. In this case, a crash cannot lose integrity, but it could lose the last committed txn. It's a compromise setting; faster than fully sync'd mode with a 1 txn possible loss.
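To put numbers on that trade-off, here is a small C++ sketch. The `SyncProfile` struct and helper functions are hypothetical; the fsync counts and the one-txn loss come from the comment above, and the rate of one commit per block at one block every 2 minutes is an assumption based on the figure mentioned earlier in the thread.

```cpp
#include <string>

// Hypothetical summary of a sync setting as described in this thread.
struct SyncProfile {
    std::string name;
    int fsyncs_per_commit;       // disk flushes issued for each commit
    int max_txns_lost_on_crash;  // committed txns that a crash may drop
};

inline SyncProfile full_sync()  { return {"full sync", 2, 0}; }
inline SyncProfile nometasync() { return {"NOMETASYNC", 1, 1}; }

// Assumption: one commit per block and one block every minutes_per_block
// minutes, i.e. a daemon that has already caught up to the network.
inline long fsyncs_per_day(const SyncProfile& p, int minutes_per_block = 2) {
    const long commits_per_day = 24L * 60 / minutes_per_block;
    return commits_per_day * p.fsyncs_per_commit;
}
```

Under those assumptions a caught-up daemon issues 1440 fsyncs a day in full sync mode and 720 with NOMETASYNC, which suggests neither setting is a heavy disk load once bulk syncing is over.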

iamsmooth · 2016-08-23T02:57:24Z

Losing any number of transactions is okay here, as long as there is no corruption. I guess if the failure case loses one, then we also want batching of blocks during a bulk sync to maximize performance safely (may already occur; I'm not sure).

hyc · 2016-08-25T23:19:08Z

Bulk syncing batches 200 blocks at a time.
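A quick C++ sketch of what that batch size implies. The function names are hypothetical, and the sketch assumes each 200-block batch is written as a single commit, which this thread does not spell out.

```cpp
#include <cstdint>

// Batch size quoted in the thread for bulk syncing.
constexpr std::uint64_t kBatchSize = 200;

// Number of commits needed to store `blocks` blocks, assuming one commit
// per batch (rounding up for a final partial batch).
inline std::uint64_t commits_needed(std::uint64_t blocks,
                                    std::uint64_t batch = kBatchSize) {
    return (blocks + batch - 1) / batch;
}

// Upper bound on blocks to re-download if a crash drops `txns_lost`
// committed batches.
inline std::uint64_t max_blocks_to_redo(std::uint64_t txns_lost,
                                        std::uint64_t batch = kBatchSize) {
    return txns_lost * batch;
}
```

If that assumption holds, losing the last committed txn during a bulk sync costs at most 200 blocks of re-download, which fits the point above that losing transactions is acceptable as long as nothing is corrupted.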

hyc mentioned this issue Aug 28, 2016

Change default db-sync-mode to fast, not fastest #999

Merged

radfish closed this as completed Sep 4, 2016
