Database corruption after SEGV_MAPERR #4128

Closed
bios-seiji opened this issue Jan 31, 2017 · 13 comments

@bios-seiji

bios-seiji commented Jan 31, 2017

Hello,

We're seeing a high incidence of SIGSEGV SEGV_MAPERR crashes on recent versions of Realm. We can reliably reproduce them by opening realms on multiple threads and running operations. After a few crashes (1-5), the database usually becomes corrupted. Most of our tests used encrypted realms; however, I was able to reproduce the same results with an unencrypted realm, with much more effort.

Observations

  • The probability of a crash increases with the number of global realm instances open (N > 2 usually leads to a crash).
  • copyFromRealm increases the probability of crashes.
  • Crashes usually occur during GC.
  • Crashes eventually lead to database corruption.
  • Corruption spreads: once a realm is corrupted, operations on that realm crash, and eventually the realm cannot be opened at all.
  • Write operations on the main thread increase crashes.
  • Encryption increases the probability but does not appear to be the cause.

Results

This is a small set of crashes from production. I have hundreds, but they are all much the same.

signal 11 (SIGSEGV), code 1, fault addr 0x674db176 in tid 27384 (m.messenger.app)
Revision: '0'
ABI: 'arm'
pid: 27384, tid: 27384, name: m.messenger.app  >>> com.messenger.app <<<
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x674db176
    r0 674db172  r1 b7989f78  r2 00000861  r3 0000004e
    r4 676e6172  r5 b7900f90  r6 00000068  r7 bef3c350
    r8 00000001  r9 bef3c37c  sl 9ec858c0  fp b7fd5e78
    ip b6cb05dc  sp bef3c348  lr b3b7152d  pc b3b5e6c2  cpsr 800e0030

Stack Trace:
  RELADDR   FUNCTION                                                                             FILE:LINE
  000846c2  realm::BpTreeNode::get_bptree_leaf(unsigned int) const+82                            unwind-c.c:?
  00085b0b  realm::BpTree<realm::util::Optional<long long> >::get(unsigned int) const+20         unwind-c.c:?
  0008da9f  realm::TimestampColumn::get(unsigned int) const+14                                   unwind-c.c:?
  000babe9  realm::TimestampNode<realm::Equal>::find_first_local(unsigned int, unsigned int)+64  unwind-c.c:?
  000c3215  realm::ParentNode::find_first(unsigned int, unsigned int)+44                         unwind-c.c:?
  000c940f  realm::Query::count(unsigned int, unsigned int, unsigned int) const+166              unwind-c.c:?
  0003cd51  Java_io_realm_internal_TableQuery_nativeCount+88                                     unwind-c.c:?
  010092f3  offset 0xd10000                                                                      /data/app/com.messenger.app-1/oat/arm/base.odex

-----------------------------------------------------

signal 11 (SIGSEGV), code 1, fault addr 0x4 in tid 6685 (IncomingMessage)
Revision: '0'
ABI: 'arm'
pid: 6606, tid: 6685, name: IncomingMessage  >>> com.messenger.app <<<
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x4
    r0 b80ccf60  r1 ffff97a0  r2 b80cd26c  r3 00000000
    r4 b80ccf60  r5 ffff97a0  r6 00000000  r7 00000037
    r8 00000001  r9 00000000  sl 000e0000  fp b3d62ab0
    ip 0000000c  sp a0427cb0  lr b3c89e2b  pc b3c40540  cpsr 600f0030

Stack Trace:
  RELADDR   FUNCTION                                                                                                               FILE:LINE
  00097540  realm::SlabAlloc::do_translate(unsigned int) const+448                                                                 unwind-c.c:?
  000e0e29  realm::ArrayStringLong::init_from_mem(realm::MemRef)+178                                                               unwind-c.c:?
  000e0f5d  realm::StringColumn::StringColumn(realm::Allocator&, unsigned int, bool, unsigned int)+288                             unwind-c.c:?
  000da4f1  realm::Table::refresh_column_accessors(unsigned int)+932                                                               unwind-c.c:?
  000a7f4b  realm::Group::do_get_table(unsigned int, bool (*)(realm::Spec const&))+698                                             unwind-c.c:?
  000da40b  realm::Table::refresh_column_accessors(unsigned int)+702                                                               unwind-c.c:?
  000a7f4b  realm::Group::do_get_table(unsigned int, bool (*)(realm::Spec const&))+698                                             unwind-c.c:?
  000a9cd5  realm::Group::do_get_or_add_table(realm::StringData, bool (*)(realm::Spec const&), void (*)(realm::Table&), bool*)+52  unwind-c.c:?
  000243ed  Java_io_realm_internal_SharedRealm_nativeGetTable+172                                                                  unwind-c.c:?
  00fff4e9  offset 0xd10000                                                                                                        /data/app/com.messenger.app-1/oat/arm/base.odex

-----------------------------------------------------

signal 11 (SIGSEGV), code 1, fault addr 0xb766dfbf in tid 6722 (IncomingMessage)
Revision: '0'
ABI: 'arm'
pid: 6695, tid: 6722, name: IncomingMessage  >>> com.messenger.app <<<
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0xb766dfbf
    r0 b766dfc0  r1 b766dfbf  r2 ffcbf0af  r3 0000001c
    r4 a027ad10  r5 00000114  r6 00000000  r7 00015b48
    r8 00015b48  r9 00000001  sl 00000001  fp 00000000
    ip e0000000  sp a027a998  lr b3c03539  pc b6cb24ac  cpsr a00d0010

Stack Trace:
  RELADDR   FUNCTION                                                                                                                                                                       FILE:LINE
  000184ac  memmove+444                                                                                                                                                                    bionic/libc/arch-arm/denver/bionic/memmove.S:210
  000e1535  realm::ArrayBlob::replace(unsigned int, unsigned int, char const*, unsigned int, bool)+640                                                                                     unwind-c.c:?
  000e2745  realm::ArrayStringLong::bptree_leaf_insert(unsigned int, realm::StringData, realm::TreeInsertBase&)+68                                                                         unwind-c.c:?
  000de4e5  realm::StringColumn::leaf_insert(realm::MemRef, realm::ArrayParent&, unsigned int, realm::Allocator&, unsigned int, realm::BpTreeNode::TreeInsert<realm::StringColumn>&)+1052  unwind-c.c:?
  000de897  unsigned int realm::BpTreeNode::bptree_append<realm::StringColumn>(realm::BpTreeNode::TreeInsert<realm::StringColumn>&)+94                                                     unwind-c.c:?
  000dea9d  realm::StringColumn::bptree_insert(unsigned int, realm::StringData, unsigned int)+240                                                                                          unwind-c.c:?
  000cb343  realm::StringColumn::insert_rows(unsigned int, unsigned int, unsigned int, bool)+54                                                                                            unwind-c.c:?
  000cc28d  realm::Table::insert_empty_row(unsigned int, unsigned int)+64                                                                                                                  unwind-c.c:?
  000269a9  Java_io_realm_internal_Table_nativeAddEmptyRow+164                                                                                                                             unwind-c.c:?
  010012c7  offset 0xd10000  

-----------------------------------------------------

signal 11 (SIGSEGV), code 1, fault addr 0x20 in tid 28460 (m.messenger.app)
Revision: '0'
ABI: 'arm'
pid: 28460, tid: 28460, name: m.messenger.app  >>> com.messenger.app <<<
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x20
    r0 b79431f0  r1 00000008  r2 b3818008  r3 00000000
    r4 b79431f0  r5 b77efef8  r6 00000008  r7 b79431f0
    r8 00000020  r9 b79431f0  sl 00000008  fp 00000000
    ip b6ca45dc  sp bedcfa40  lr b3b9bca3  pc b3b9a0fa  cpsr 40070030

Stack Trace:
  RELADDR   FUNCTION                                                                                     FILE:LINE
  000cc0fa  realm::Table::get_column_base(unsigned int)+18                                               unwind-c.c:?
  000cdc9f  realm::Table::connect_opposite_link_columns(unsigned int, realm::Table&, unsigned int)+10    unwind-c.c:?
  000da4dd  realm::Table::refresh_column_accessors(unsigned int)+912                                     unwind-c.c:?
  000a7f4b  realm::Group::do_get_table(unsigned int, bool (*)(realm::Spec const&))+698                   unwind-c.c:?
  0005bb9b  realm::ObjectSchema::ObjectSchema(realm::Group const&, realm::StringData, unsigned int)+810  unwind-c.c:?
  00061641  realm::ObjectStore::schema_from_group(realm::Group const&)+160                               unwind-c.c:?
  0006a445  realm::Realm::init(std::shared_ptr<realm::_impl::RealmCoordinator>)+268                      unwind-c.c:?
  0006d9fb  realm::_impl::RealmCoordinator::get_realm(realm::Realm::Config)+542                          unwind-c.c:?
  00068eb3  realm::Realm::get_shared_realm(realm::Realm::Config)+46                                      unwind-c.c:?
  00024e95  Java_io_realm_internal_SharedRealm_nativeGetSharedRealm+224                                  unwind-c.c:?
  00fff521  offset 0xd10000                                                                              /data/app/com.messenger.app-1/oat/arm/base.odex

-----------------------------------------------------

signal 11 (SIGSEGV), code 1, fault addr 0x41 in tid 6212 (m.messenger.app)
Revision: '0'
ABI: 'arm'
pid: 6212, tid: 6212, name: m.messenger.app  >>> com.messenger.app <<<
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x41
    r0 b8b885a4  r1 00000001  r2 00000001  r3 00000000
    r4 be9b6410  r5 b8b88598  r6 b3beeef7  r7 00000002
    r8 00000001  r9 b88439a0  sl 00000000  fp 9ee2b9d0
    ip b3bc76dd  sp be9b63d0  lr b3c44863  pc b3b8f9d0  cpsr 200e0030

Stack Trace:
  RELADDR   FUNCTION                                                                                     FILE:LINE
  0001f9d0  realm::BpTree<long long>::get(unsigned int) const+8                                          unwind-c.c:?
  000d485f  realm::StringData realm::Table::get<realm::StringData>(unsigned int, unsigned int) const+84  unwind-c.c:?
  000d4897  realm::Table::get_string(unsigned int, unsigned int) const+4                                 unwind-c.c:?
  00057701  Java_io_realm_internal_UncheckedRow_nativeGetString+36                                       unwind-c.c:?
  00ff98bb  offset 0xd10000                                                                              /data/app/com.messenger.app-2/oat/arm/base.odex

-----------------------------------------------------

signal 11 (SIGSEGV), code 1, fault addr 0xfffd7dfc in tid 11074 (RxComputationSc)
Revision: '0'
ABI: 'arm'
pid: 11041, tid: 11074, name: RxComputationSc  >>> com.messenger.app <<<
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0xfffd7dfc
    r0 b842d560  r1 9f3950d8  r2 00000001  r3 00000030
    r4 fffd7df8  r5 b842d560  r6 00024000  r7 b814c4e8
    r8 9f3950d8  r9 b83dca8c  sl 1303d8b0  fp 00000001
    ip 0000000c  sp 9f3950c0  lr b3bfb147  pc b3bf9e4c  cpsr 000f0030

Stack Trace:
  RELADDR   FUNCTION                                                                          FILE:LINE
  00086e4c  realm::Array::init_from_mem(realm::MemRef)+14                                     unwind-c.c:?
  00088143  realm::Array::update_from_parent(unsigned int)+62                                 unwind-c.c:?
  000d48eb  realm::Table::update_from_parent(unsigned int)+76                                 unwind-c.c:?
  000a6bcb  realm::SharedGroup::commit_and_continue_as_read()+218                             unwind-c.c:?
  00074af5  realm::_impl::transaction::commit(realm::SharedGroup&, realm::BindingContext*)+8  unwind-c.c:?
  000685cd  realm::Realm::commit_transaction()+32                                             unwind-c.c:?
  0002302b  Java_io_realm_internal_SharedRealm_nativeCommitTransaction+50                     unwind-c.c:?
  00fd6555  offset 0xd10000                                                                   /data/app/com.messenger.app-1/oat/arm/base.odex

-----------------------------------------------------
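
One way to read the fault addresses in the traces above is to group them by how close they are to zero. The sketch below is not part of the original report: it parses the `fault addr` field from a tombstone line and labels it "near-null" or "wild". The class name, the labels, and the 4 KiB threshold are assumptions made for illustration. A near-null address often points to a field read through a null pointer, while a high address often points to a stale or garbage reference, but this is only a heuristic and not a diagnosis.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for triaging tombstone lines; not part of Realm or Android.
final class FaultAddressClassifier {
    // Assumption: anything inside the first 4 KiB page counts as "near null".
    static final long NEAR_NULL_LIMIT = 0x1000L;

    private static final Pattern FAULT_ADDR =
            Pattern.compile("fault addr 0x([0-9a-fA-F]+)");

    /** Returns the fault address found in the line, or -1 if there is none. */
    static long parseFaultAddress(String line) {
        Matcher m = FAULT_ADDR.matcher(line);
        return m.find() ? Long.parseLong(m.group(1), 16) : -1L;
    }

    /** Labels a tombstone line by the magnitude of its fault address. */
    static String classify(String line) {
        long addr = parseFaultAddress(line);
        if (addr < 0) {
            return "no fault address";
        }
        return addr < NEAR_NULL_LIMIT ? "near-null" : "wild";
    }

    public static void main(String[] args) {
        // Signal lines copied from the six traces in this report.
        String[] lines = {
            "signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x674db176",
            "signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x4",
            "signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0xb766dfbf",
            "signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x20",
            "signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x41",
            "signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0xfffd7dfc",
        };
        for (String line : lines) {
            System.out.println(classify(line) + "  <-  " + line);
        }
    }
}
```

With the threshold assumed here, the addresses 0x4, 0x20 and 0x41 fall in the near-null group and the other three do not. Bucketing a larger set of production tombstones this way might show whether one group dominates.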

Steps & Code to Reproduce

Encryption: On. Minimum 1 contact, 1 message, with body.length ~= 1000.
At least 1 write thread, and 1 other thread. Global instance N=2 will eventually crash, N=3+ will crash faster.

This code will often produce a UTF-16 crash for us as well. We can provide the project if necessary, but this is the relevant code.

public class Contact extends RealmObject {
	@PrimaryKey
	private String id;

	@Index
	@Required
	private String username;

	private String number;

	@Nullable
	private RealmList<Message> mMessageList;
        ...
}
public class Message extends RealmObject {
	@PrimaryKey
	private String id;

	@Index
	private String contactId;

	@Required
	private String type;

	private String body;

	@Required
	private Date dateCreated;
}
Thread mWriteThread = new Thread(new Runnable() {
	@Override
	public void run() {
		final AtomicInteger i = new AtomicInteger(0);
		while(mRunThreads) {
			try (Realm realm = Realm.getInstance(REALM_CONFIG)) {
				realm.executeTransaction(new Realm.Transaction() {
					@Override
					public void execute(Realm realm) {
						Contact contact = realm.where(Contact.class).equalTo("id", CONTACT_ID).findFirst();
						contact.setUsername("User" + i.incrementAndGet());
					}
				});
			}
		}
	}
});
Thread mReadThread = new Thread(new Runnable() {
	@Override
	public void run() {
		while(mRunThreads) {
			// Open one instance per iteration; try-with-resources closes it.
			try (Realm realm = Realm.getInstance(REALM_CONFIG)) {
				realm.where(Contact.class).equalTo("id", CONTACT_ID).findAllSorted("username");
				realm.copyFromRealm(realm.where(Contact.class).equalTo("id", CONTACT_ID).findAllSorted("username"), 0);
			}
		}
	}
});

Mitigations

Reducing global realm instances to N<=2 and reducing the load on Realm as much as possible decreases the chance of crashes and corruption.
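
A minimal sketch of how such a cap could be enforced in application code, assuming "global realm instances" means instances that are open at the same time across threads. `BoundedInstances`, `withInstance`, and `demoPeak` are hypothetical names, not Realm APIs. The `Supplier` stands in for a block that opens a Realm, does its work, and closes it again.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical gate that limits how many units of work run concurrently.
final class BoundedInstances {
    private final Semaphore permits;
    private final AtomicInteger open = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    BoundedInstances(int maxOpen) {
        // Fair semaphore so that waiting threads are served in arrival order.
        this.permits = new Semaphore(maxOpen, true);
    }

    /** Runs the work while holding one permit; blocks if the cap is reached. */
    <T> T withInstance(Supplier<T> work) {
        permits.acquireUninterruptibly();
        try {
            // Track the highest number of concurrent holders for diagnostics.
            peak.accumulateAndGet(open.incrementAndGet(), Math::max);
            // In the app, the supplier would open, use, and close a Realm.
            return work.get();
        } finally {
            open.decrementAndGet();
            permits.release();
        }
    }

    int peakOpen() {
        return peak.get();
    }

    /** Runs a small multi-threaded load and returns the observed peak. */
    static int demoPeak(int maxOpen, int threads, int iterations) {
        BoundedInstances gate = new BoundedInstances(maxOpen);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < iterations; i++) {
                    gate.withInstance(() -> {
                        Thread.yield(); // placeholder for database work
                        return null;
                    });
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return gate.peakOpen();
    }

    public static void main(String[] args) {
        System.out.println("peak concurrent holders: " + demoPeak(2, 8, 1000));
    }
}
```

This only lowers the odds of hitting the crash, in line with the observation above; it does not address the underlying fault, and it serializes work that would otherwise run in parallel.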

Version of Realm and tooling

Realm version(s): 2.2.1, 2.2.2, 2.3.0

Realm sync feature enabled: no

Encryption: yes

Android Studio version: 2.2.2

Which Android version and device: Android 6.0.1 (CM13) / CAF 6.0.1, OnePlus One / OnePlus X

Next steps

This appears to be similar to the issue reported here: realm/realm-core#2383

How can we further debug and fix these issues?

Do you have any way to detect and/or recover from corruption? For example, could we open the realm read-only and verify its integrity?

@cmelchior
Contributor

Hi @bios-seiji, thank you for a very detailed bug report 🎉. If you can send the project to help@realm.io, we would be extremely grateful.

@Zhuinden
Contributor

Can you try changing the following to see if it helps stabilize things?

        Thread mWriteThread = new Thread(new Runnable() {
            @Override
            public void run() {
                final AtomicInteger i = new AtomicInteger(0);
                while(mRunThreads) {
                    try(Realm realm = Realm.getInstance(REALM_CONFIG)) {
                        realm.executeTransaction(new Realm.Transaction() {
                            @Override
                            public void execute(Realm realm) {
                                Contact contact = realm.where(Contact.class).equalTo("id", CONTACT_ID).findFirst();
                                contact.setUsername("User" + i.incrementAndGet());
                            }
                        });
                    }
                }
            }
        });
        Thread mReadThread = new Thread(new Runnable() {
            @Override
            public void run() {
                while(mRunThreads) {
                    try(Realm realm = Realm.getInstance(REALM_CONFIG)) {
                        while(mRunThreads) {
                            try(Realm realm = Realm.getInstance(REALM_CONFIG)) {
                                realm.where(Contact.class).equalTo("id", CONTACT_ID).findAllSorted("username");
                                realm.copyFromRealm(realm.where(Contact.class).equalTo("id", CONTACT_ID).findAllSorted("username"), 0);
                            }
                        }
                    }
                }
            }
        });

to

        Thread mWriteThread = new Thread(new Runnable() {
            @Override
            public void run() {
                final AtomicInteger i = new AtomicInteger(0);
                try(Realm realm = Realm.getInstance(REALM_CONFIG)) {
                    while(mRunThreads) {
                        try(Realm realm = Realm.getInstance(REALM_CONFIG)) {
                            realm.executeTransaction(new Realm.Transaction() {
                                @Override
                                public void execute(Realm realm) {
                                    Contact contact = realm.where(Contact.class).equalTo("id", CONTACT_ID).findFirst();
                                    contact.setUsername("User" + i.incrementAndGet());
                                }
                            });
                        }
                    }
                }
            }
        });
        Thread mReadThread = new Thread(new Runnable() {
            @Override
            public void run() {
                try(Realm realm = Realm.getInstance(REALM_CONFIG)) {
                    while(mRunThreads) {
                        try(Realm realm = Realm.getInstance(REALM_CONFIG)) {
                            realm.where(Contact.class).equalTo("id", CONTACT_ID).findAllSorted("username");
                            realm.copyFromRealm(realm.where(Contact.class).equalTo("id", CONTACT_ID).findAllSorted("username"), 0);
                        }
                    }
                }
            }
        });

@bios-seiji
Author

@Zhuinden apologies, that was a typo in my C&P (I corrected the original post). I have already been conducting tests with the change you suggest.

@bios-seiji
Author

Here's the code we've been using to test load (and generate crashes):
https://github.com/bios-seiji/realm-crash/

@kneth
Contributor

kneth commented Feb 1, 2017

@bios-seiji Thanks!

@bios-seiji
Author

@kneth were you able to reproduce the crashes with the project I provided? Do you have any suggestions for further debugging or mitigation?

Once a user has a corrupted database, is there any way we can detect it and triage? Currently, the application will try to run and then crash unexpectedly and uncontrollably. At the very least, we should be able to detect corruption on boot and give the user steps to set up again.

Thanks
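
One possible stopgap, sketched here as an assumption rather than anything Realm provides: because a native SIGSEGV kills the process and cannot be caught with try/catch in Java, the app could leave a marker file on disk before opening the realm and delete it once the open (and perhaps a small smoke query) has finished. If the marker is still present on the next boot, the previous attempt did not finish, and the app can offer a recovery path instead of crashing again. `OpenSentinel` and its method names are hypothetical.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical crash marker; it detects an unfinished open, not corruption itself.
final class OpenSentinel {
    private final File marker;

    OpenSentinel(File marker) {
        this.marker = marker;
    }

    /** True if an earlier run created the marker and never cleared it. */
    boolean previousOpenDidNotFinish() {
        return marker.exists();
    }

    /** Call right before opening the database. Returns false if the marker could not be written. */
    boolean beforeOpen() {
        try {
            marker.createNewFile(); // no-op if the file already exists
            return marker.exists();
        } catch (IOException e) {
            return false;
        }
    }

    /** Call after the open and any smoke query have completed. */
    void afterSuccessfulOpen() {
        marker.delete();
    }

    public static void main(String[] args) {
        File file = new File(System.getProperty("java.io.tmpdir"), "realm-open.marker");
        OpenSentinel sentinel = new OpenSentinel(file);
        if (sentinel.previousOpenDidNotFinish()) {
            System.out.println("Previous open did not finish; show recovery steps.");
        }
        sentinel.beforeOpen();
        // ... open the database and run a small query here ...
        sentinel.afterSuccessfulOpen();
    }
}
```

The marker cannot verify integrity, and it also fires when the process is killed for unrelated reasons during the guarded window, so it is probably best combined with a counter and a threshold before anything is reset.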

@kneth
Contributor

kneth commented Feb 3, 2017

@bios-seiji Sorry for not getting back to you earlier. The code described above is similar to the code found in #4114, which isn't surprising, as #4114 was written with realm/realm-core#2383 in mind. But the crash in #4114 might come from some threads being starved (22 threads and only a single core in my emulator), so that a thread holds an old version of the database and the device runs out of physical memory (the bus error indicates that).

Do you see the crash if you disable encryption?

By the way, does the test app require API 23 to fail? How many iterations does it typically take to crash?

@ironage You might wish to take a look at the test app while debugging realm/realm-core#2383.

@bios-seiji
Author

@kneth I can reliably reproduce these crashes on 4 and 8 core devices running as few as 3 or 4 threads (see example cases https://github.com/bios-seiji/realm-crash/blob/master/app/src/main/java/io/binarysolutions/realmtest/MainActivity.java#L29). The devices report ample physical memory available (dumpsys meminfo), and running the app with largeHeap enabled does not seem to help. I also dumped the heap and looked at it in MAP and did not see any obvious leaks. What are you using to monitor when the "device runs out of physical memory"? This was my initial thought also, but I have not seen it.

In our production app, I disabled encryption and was able to reproduce the issue with much more effort. I have not done extensive testing however, since encryption is necessary for us.

I recompiled for Android 5.1 (API 22) and was able to reproduce the issues. I did not get the immediate crash described in my test cases, but it reliably crashed after between 1k and 10k write iterations.

@kneth
Contributor

kneth commented Feb 6, 2017

@bios-seiji The investigations have so far led to realm/realm-core#2426. We need to test more.

@bios-seiji
Author

@kneth that's great. Do you have a testing build that I can run with our production tests to see if I can still trigger corruption?

@kneth
Contributor

kneth commented Feb 7, 2017

@bios-seiji Currently I am getting ready to test with your test app (no reason to involve you before I have done that).

@kneth
Contributor

kneth commented Feb 7, 2017

I am currently running your app in an x86 emulator. So far, thread 1 is at 34k, thread 2 is at 13k and duration at 1102k.

@bios-seiji If you wish to try my custom build, please send an email to help@realm.io.

@kneth
Contributor

kneth commented Feb 23, 2017

@bios-seiji Your detailed bug report made it possible for us to reproduce a bug in Realm Core. It has now been fixed (see realm/realm-core#2465), and we will release a fix very soon. I am closing the issue, and we are very thankful for your original report.
