bitmonerod: segfaults on (probably) corrupt lmdb blockchain data #898

Closed
radfish opened this issue Jul 10, 2016 · 10 comments

Comments

@radfish
Contributor

radfish commented Jul 10, 2016

My box had some unexpected unclean hard shutdowns due to hardware problems.

Now bitmonerod fails to start due to this segfault. The blockchain data was probably not closed/written to disk cleanly.

Expected behavior: upon encountering corruption in the blockchain DB on disk, bitmonerod should report it without crashing.

I have the LMDB data and the core dump for this, @hyc, if you want them.

Monero 'Hydrogen Helix' (v0.9.4.0-18dd507)

Core was generated by `bitmonerod --config-file /etc/bitmonerod.conf'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0xb6e17ff8 in memcpy () from /usr/lib/libc.so.6
[Current thread is 1 (Thread 0xb6f7b000 (LWP 2775))]
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0xb6f7b000 (LWP 2775) 0x00327f00 in mdb_cursor_put.part ()
  2    Thread 0xb2fff450 (LWP 2786) 0xb6eebb10 in pthread_cond_timedwait@@GLIBC_2.4 ()
   from /usr/lib/libpthread.so.0
(gdb) bt
#0  0xb6e17ff8 in memcpy () from /usr/lib/libc.so.6
#1  0x00324ab0 in mdb_node_add ()
#2  0x00327f00 in mdb_cursor_put.part ()
#3  0x00329620 in mdb_txn_commit ()
#4  0x00274bf8 in cryptonote::mdb_txn_safe::commit(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) ()
#5  0x00252554 in cryptonote::BlockchainLMDB::open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int) ()
#6  0x001cb818 in cryptonote::core::init(boost::program_options::variables_map const&, cryptonote::test_options const*) ()
#7  0x000caf14 in daemonize::t_daemon::run(bool) ()
#8  0x00156fc0 in daemonize::t_executor::run_interactive(boost::program_options::variables_map const&) ()
#9  0x0008cea8 in main ()
@hyc
Collaborator

hyc commented Jul 10, 2016

Yes, save a copy of the LMDB data file please. I probably won't get to look at it any time soon though. Your backtrace appears to be from a non-debug build; can you get a trace from a debug build?

@radfish
Contributor Author

radfish commented Jul 10, 2016

Core was generated by `/home/redfish/dev/bitmonero/build/bin/bitmonerod --data-dir /mnt/flext/sys/bitmo'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0xb68f0000 in memcpy () from /usr/lib/libc.so.6
[Current thread is 1 (Thread 0xb6f82000 (LWP 20141))]
(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0xb6f82000 (LWP 20141) 0xb68f0000 in memcpy () from /usr/lib/libc.so.6
  2    Thread 0xb312f450 (LWP 20152) 0xb69c3b10 in pthread_cond_timedwait@@GLIBC_2.4 ()
   from /usr/lib/libpthread.so.0
(gdb) bt
#0  0xb68f0000 in memcpy () from /usr/lib/libc.so.6
#1  0x00493108 in mdb_node_add (mc=0xbe98f930, indx=124, key=0xbe98f920, data=0xbe98f918,
    pgno=0, flags=65536) at /home/redfish/dev/bitmonero/external/db_drivers/liblmdb/mdb.c:7890
#2  0x0049204c in mdb_cursor_put (mc=0xbe98f930, key=0xbe98f920, data=0xbe98f918, flags=65536)
    at /home/redfish/dev/bitmonero/external/db_drivers/liblmdb/mdb.c:7531
#3  0x00489f14 in mdb_freelist_save (txn=0x23fc030)
    at /home/redfish/dev/bitmonero/external/db_drivers/liblmdb/mdb.c:3363
#4  0x0048b534 in mdb_txn_commit (txn=0x23fc030)
    at /home/redfish/dev/bitmonero/external/db_drivers/liblmdb/mdb.c:3835
#5  0x00460880 in cryptonote::mdb_txn_safe::commit (this=0xbe98fd9c, this@entry=0xbe98fd94,
    message="Failed to commit a transaction to the db")
    at /home/redfish/dev/bitmonero/src/blockchain_db/lmdb/db_lmdb.cpp:325
#6  0x0047e5bc in cryptonote::BlockchainLMDB::open (this=0x23fbe80, filename=...,
    mdb_flags=<optimized out>)
    at /home/redfish/dev/bitmonero/src/blockchain_db/lmdb/db_lmdb.cpp:1190
#7  0x003aa2a8 in cryptonote::core::init (this=this@entry=0x23f2230, vm=...,
    test_options=0xbe9905aa, test_options@entry=0xbe9912f4)
    at /home/redfish/dev/bitmonero/src/cryptonote_core/cryptonote_core.cpp:387
#8  0x001a7b2c in daemonize::t_core::run (this=0x23f2230)
    at /home/redfish/dev/bitmonero/src/daemon/core.h:72
#9  daemonize::t_daemon::run (this=0xbe9912f4, this@entry=0xbe9912ec,
    interactive=interactive@entry=true)
    at /home/redfish/dev/bitmonero/src/daemon/daemon.cpp:119
#10 0x00313068 in daemonize::t_executor::run_interactive (this=this@entry=0xbe9923b8, vm=...)
    at /home/redfish/dev/bitmonero/src/daemon/executor.cpp:68
#11 0x0031badc in daemonizer::daemonize<daemonize::t_executor>(int, char const**, daemonize::t_executor&&, boost::program_options::variables_map const&) (argc=<optimized out>,
    argv=<optimized out>,
    executor=executor@entry=<unknown type in /home/redfish/dev/bitmonero/build/bin/bitmonerod, CU 0xbfaa1f, DIE 0xcc8630>, vm=...)
    at /home/redfish/dev/bitmonero/src/daemonizer/posix_daemonizer.inl:85
---Type <return> to continue, or q <return> to quit---
#12 0x00318598 in main (argc=<optimized out>, argv=<optimized out>) at /home/redfish/dev/bitmonero/src/daemon/main.cpp:280

@hyc
Collaborator

hyc commented Jul 10, 2016

Can you also check m_height in frame #6 (print m_height)?

@radfish
Contributor Author

radfish commented Jul 10, 2016

(gdb) up
#6  0x0047e5bc in cryptonote::BlockchainLMDB::open (this=0x23f7998, filename=..., 
    mdb_flags=<optimized out>)
    at /home/redfish/dev/bitmonero/src/blockchain_db/lmdb/db_lmdb.cpp:1190
1190      txn.commit();
(gdb) p m_height
$1 = 0

@hyc
Collaborator

hyc commented Jul 12, 2016

That's kind of what I expected. This says that it never read any valid block count from the DB when first opening it. I think some earlier function must have failed, before reaching here, and we didn't catch the error code.
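A minimal sketch of the kind of check that seems to be missing, assuming each step of the open sequence reports a status code where 0 means success (as LMDB calls do). This is not the actual db_lmdb.cpp code: `check_rc`, `Step`, and `open_db_sketch` are hypothetical names, and the steps are plain callables standing in for the real database calls.

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: a non-zero status becomes an exception naming the failed step.
inline void check_rc(int rc, const std::string& step) {
    if (rc != 0)
        throw std::runtime_error(step + " failed with code " + std::to_string(rc));
}

// Hypothetical stand-in for one step of the open sequence: a label plus a
// callable that returns 0 on success.
using Step = std::pair<std::string, std::function<int()>>;

// Runs the steps in order. The first failure aborts the sequence, so later
// steps (such as the final commit) never run against a DB in an unknown state.
inline void open_db_sketch(const std::vector<Step>& steps) {
    for (const auto& step : steps)
        check_rc(step.second(), step.first);
}
```

With a check like this, a failed early read would surface as a reportable error at startup instead of letting the open continue to a commit with m_height still at 0.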

@hyc
Collaborator

hyc commented Aug 22, 2016

We should think about a way to toggle from the default "--db-sync-mode fastest:async:1000" back down to "--db-sync-mode safe" after the daemon gets fully sync'd. After the daemon has caught up to the network, we know that new blocks are committed only about once every 2 minutes, so running in fully synchronous mode won't generate a lot of disk flushes.

@iamsmooth
Contributor

I definitely agree with switching to safe mode once synced, but there is another case to consider. You already have gigabytes of blockchain downloaded but are offline for a time. When you come back online you are catching up again, and corruption at that point means you lose your whole DB.

I think any unsafe DB modes should only be used on initial sync, or if specified as a non-default (can be used by advanced users to speed up later partial syncs).

@hyc
Collaborator

hyc commented Aug 22, 2016

Yeah, definitely unsafe modes should only be used if specified explicitly.

For your intermediate case, I think we could use NOMETASYNC by itself. That is still synchronous, but unlike full sync mode which does 2 fsyncs per commit, it only does 1 fsync per commit. In this case, a crash cannot lose integrity, but it could lose the last committed txn. It's a compromise setting; faster than fully sync'd mode with a 1 txn possible loss.
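The trade-offs discussed so far can be summarized in a small sketch. Everything here is hypothetical and independent of the real daemon code: `SyncMode`, `CrashProfile`, `crash_profile`, and `choose_sync_mode` are illustrative names. The fsync counts for the safe and NOMETASYNC modes come from the comment above; the values for the fastest mode are an assumption. The policy function is only one possible reading of the proposals in this thread.

```cpp
#include <stdexcept>

// Hypothetical names for the three durability levels discussed in this thread.
enum class SyncMode { Safe, NoMetaSync, Fastest };

struct CrashProfile {
    int fsyncs_per_commit;   // disk flushes issued for each commit
    bool may_corrupt_db;     // can a crash leave the DB unusable?
    bool may_lose_last_txn;  // can a crash drop the last committed txn?
};

inline CrashProfile crash_profile(SyncMode m) {
    switch (m) {
        case SyncMode::Safe:       return {2, false, false};
        case SyncMode::NoMetaSync: return {1, false, true};
        // Assumption: the fastest mode does not flush on every commit.
        case SyncMode::Fastest:    return {0, true, true};
    }
    throw std::logic_error("unknown SyncMode");
}

// Policy sketch: an unsafe mode is used only when the user asked for it
// explicitly. Otherwise a daemon that is still catching up uses the
// single-fsync compromise, and a fully synced daemon uses the safe mode.
inline SyncMode choose_sync_mode(bool user_requested_fastest, bool fully_synced) {
    if (user_requested_fastest) return SyncMode::Fastest;
    return fully_synced ? SyncMode::Safe : SyncMode::NoMetaSync;
}
```

For example, `choose_sync_mode(false, false)` picks the NOMETASYNC-style compromise for a node that already holds a large DB and is catching up after being offline.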

@iamsmooth
Contributor

Losing any number of transactions is okay here, as long as there is no corruption. I guess if the failure case loses one, then we also want batching of blocks during a bulk sync to maximize performance safely (may already occur; I'm not sure).

@hyc
Collaborator

hyc commented Aug 25, 2016

Bulk syncing batches 200 blocks at a time.
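As a rough illustration of what batching means for the number of commits, here is a sketch that splits a run of blocks into per-transaction ranges. `batch_ranges` is a hypothetical helper, not a function from the codebase; only the batch size of 200 comes from the comment above.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Split `count` blocks starting at height `first` into half-open ranges
// [begin, end) of at most `batch_size` blocks, one range per write transaction.
inline std::vector<std::pair<std::uint64_t, std::uint64_t>>
batch_ranges(std::uint64_t first, std::uint64_t count, std::uint64_t batch_size = 200) {
    std::vector<std::pair<std::uint64_t, std::uint64_t>> ranges;
    if (batch_size == 0) return ranges;  // no valid batching possible
    const std::uint64_t last = first + count;
    for (std::uint64_t begin = first; begin < last; begin += batch_size)
        ranges.emplace_back(begin, std::min(begin + batch_size, last));
    return ranges;
}
```

Syncing 450 blocks from height 0 would then take three commits: [0, 200), [200, 400), and [400, 450). If a crash can drop only the last committed transaction, the cost is presumably at most one such batch that has to be fetched again.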
