
Decryption failed - page zero has wrong checksum #5810

Closed
BlueCobold opened this issue Apr 22, 2022 · 43 comments · Fixed by #5993

Comments

@BlueCobold

BlueCobold commented Apr 22, 2022

How frequently does the bug occur?

Seen once

Description

A customer reported suddenly being unable to launch my app. It terminates on the first access of the database, and it turns out the database file is broken for some reason.

It might have broken during a realm migration, but this is uncertain. Newly created files work just fine. I might be allowed to share the db file with a developer for analysis in private, but not in public. I also tried to open it with Realm Studio and tried upgrading to Realm 10.25.1, but the file still cannot be decrypted.

Stacktrace & log output

libc++abi: terminating with uncaught exception of type realm::util::DecryptionFailed: Decryption failed
Exception backtrace:
0   Realm          0x000000010b0d349b _ZN5realm4util16DecryptionFailedC2Ev + 107
1   Realm          0x000000010b0b9987 _ZN5realm4util10AESCryptor4readEixPcm + 519
2   Realm          0x000000010b0ba63e _ZN5realm4util20EncryptedFileMapping12refresh_pageEm + 110
3   Realm          0x000000010b0bafee _ZN5realm4util20EncryptedFileMapping12read_barrierEPKvmPFmPKcE + 126
4   Realm          0x000000010ab8e250 _ZN5realm4util26do_encryption_read_barrierEPKvmPFmPKcEPNS0_20EncryptedFileMappingE + 64
5   Realm          0x000000010b0a1822 _ZN5realm11StringIndexC2EmPNS_11ArrayParentEmRKNS_13ClusterColumnERNS_9AllocatorE + 338
6   Realm          0x000000010b08a6b0 _ZN5realm5Table23refresh_index_accessorsEv + 608
7   Realm          0x000000010af533c7 _ZN5realm5Group21create_table_accessorEm + 871
8   Realm          0x000000010af53006 _ZN5realm5Group12do_get_tableEm + 102
9   Realm          0x000000010b1e6287 _ZN5realm12ObjectSchemaC2ERKNS_5GroupENS_10StringDataENS_8TableKeyE + 391
10  Realm          0x000000010b1f0194 _ZN5realm11ObjectStore17schema_from_groupERKNS_5GroupE + 132
11  Realm          0x000000010b2594bb _ZN5realm5Realm32read_schema_from_group_if_neededEv + 187
12  Realm          0x000000010b259268 _ZN5realm5RealmC2ENS0_6ConfigENS_4util8OptionalINS_9VersionIDEEENSt3__110shared_ptrINS_5_impl16RealmCoordinatorEEENS0_13MakeSharedTagE + 456
13  Realm          0x000000010b1b7c2c _ZN5realm5Realm17make_shared_realmENS0_6ConfigENS_4util8OptionalINS_9VersionIDEEENSt3__110shared_ptrINS_5_impl16RealmCoordinatorEEE + 220
14  Realm          0x000000010b1b6294 _ZN5realm5_impl16RealmCoordinator12do_get_realmENS_5Realm6ConfigERNSt3__110shared_ptrIS2_EENS_4util8OptionalINS_9VersionIDEEERNS8_17CheckedUniqueLockE + 532
15  Realm          0x000000010b1b5eaf _ZN5realm5_impl16RealmCoordinator9get_realmENS_5Realm6ConfigENS_4util8OptionalINS_9VersionIDEEE + 495
16  Realm          0x000000010b259ce7 _ZN5realm5Realm16get_shared_realmENS0_6ConfigE + 135
17  Realm          0x000000010ae4d71a +[RLMRealm realmWithConfiguration:queue:error:] + 2314
18  RealmSwift     0x00000001085c3a72 $sSo8RLMRealmC13configuration5queueABSo0A13ConfigurationC_So012OS_dispatch_C0CSgtKcfCTO + 146
19  RealmSwift     0x000000010863fc2f $s10RealmSwift0A0V5queueACSo012OS_dispatch_C0CSg_tKcfC + 127

Can you reproduce the bug?

Yes, always

Reproduction Steps

The database file seems corrupted and cannot even be opened with Realm Studio. I cannot publicly share the file due to the user's privacy, but I might be able to send it to a dev in private.

Version

10.10.0 (also tried 10.25.1)

What SDK flavour are you using?

Local Database only

Are you using encryption?

Yes, using encryption

Platform OS and version(s)

iOS 15.4.0, 15.4.1, 15.2.0, 15.2.1

Build environment

ProductName: macOS
ProductVersion: 12.0.1
BuildVersion: 21A559

/Applications/Xcode.app/Contents/Developer
Xcode 13.3.1
Build version 13E500a

/usr/local/bin/pod
1.10.0
Realm (10.10.0)
RealmSwift (10.10.0)
RealmSwift (= 10.10.0)

/bin/bash
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin21)

(not in use here)

/usr/local/bin/git
git version 2.26.0

@leemaguire
Contributor

Hi @BlueCobold, can you send the Realm file to realm-help@mongodb.com so we can investigate? The latest version of Realm (10.25.1) contains a fix that should prevent this from happening again.

@sync-by-unito sync-by-unito bot added the Waiting-For-Reporter Waiting for more information from the reporter before we can proceed label Apr 22, 2022
@BlueCobold
Author

I submitted the file in question.

@github-actions github-actions bot added Needs-Attention Reporter has responded. Review comment. and removed Waiting-For-Reporter Waiting for more information from the reporter before we can proceed labels Apr 22, 2022
@leemaguire
Contributor

@jedelbo successfully recovered the Realm file, @BlueCobold I have sent it to you via email.

@BlueCobold
Author

Super awesome! The customer will be very happy and so am I. I'll upgrade all app versions out there to realm 10.25.1 and hope for the issue to never return. Thanks!

@sync-by-unito sync-by-unito bot removed the Needs-Attention Reporter has responded. Review comment. label Apr 22, 2022
@jedelbo
Contributor

jedelbo commented Apr 25, 2022

Forensic report:
When trying to decrypt the received file, the following showed up:

Checksum failed: 0x90000 0x90000 expected: 0x93 actual: 0x92
Checksum failed: 0x91000
Checksum failed: 0x92000
Checksum failed: 0x93000

Checksum failed: 0xa0000
Checksum failed: 0xa1000
Checksum failed: 0xa2000
Checksum failed: 0xa3000

Checksum failed: 0xa8000

Checksum failed: 0x138000
Checksum failed: 0x139000 0x13900 expected: 0xc0 actual: 0xc3
Checksum failed: 0x13a000
Checksum failed: 0x13b000

Restore old IV: 0x18c000
Restore old IV: 0x18d000
Restore old IV: 0x18e000
Restore old IV: 0x18f000

Restore old IV: 0x198000
Restore old IV: 0x199000
Restore old IV: 0x19a000
Restore old IV: 0x19b000

Restore old IV: 0x1a0000
Restore old IV: 0x1a1000
Restore old IV: 0x1a2000
Restore old IV: 0x1a3000

Restore old IV: 0x1a8000
Restore old IV: 0x1a9000
Restore old IV: 0x1aa000
Restore old IV: 0x1ab000

Restore old IV: 0x1ac000
Restore old IV: 0x1ad000
Restore old IV: 0x1ae000
Restore old IV: 0x1af000

Despite the checksum errors, the content seemed to be consistent except for the two cases where a byte value was not as expected. After changing those values back, the file was consistent.

@jedelbo
Contributor

jedelbo commented Apr 25, 2022

@tgoyne does the fact that it is the first byte in a 4k block that is modified make us any wiser? And why does the checksum differ if the content apparently is ok?

@tgoyne
Member

tgoyne commented Apr 25, 2022

Could possibly be an out-of-bounds write somewhere? The first byte in a buffer is the thing that'll be overwritten if some other piece of code has an off-by-one error when writing to something that happens to land immediately before that buffer in memory. The hmac and actual page data are stored in separate blocks of memory, so corrupting one but not the other could easily happen.

If that is actually the problem, I'm not sure what action we can really take. Reread all the encryption code and hope to spot something suspicious that could be writing one past the end? I think the use of MAP_ANONYMOUS for the decrypted buffers unfortunately means that ASan doesn't work for them, and it might not even be a bug in our code.
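For illustration, here is a minimal, self-contained C++ sketch of that failure mode - not Realm's actual buffer layout, just two adjacent byte ranges standing in for the checksum storage and the decrypted page - showing how a single off-by-one write can flip the first byte of the following page (0x93 becoming 0x92, as in the forensic report above):

```cpp
// Minimal sketch of the failure mode described above, NOT Realm's actual
// buffer layout: an off-by-one write past the end of one byte range can
// clobber the first byte of whatever happens to sit right after it.
#include <cstdio>
#include <cstring>

int main()
{
    // Pretend these two regions end up adjacent in memory (with malloc or
    // a custom allocator this is plausible, though not guaranteed).
    char block[2 * 4096];
    char* hmac_buffer = block;          // stands in for checksum/metadata storage
    char* page_buffer = block + 4096;   // stands in for a decrypted 4k page

    std::memset(page_buffer, 0x93, 4096);

    // Off-by-one bug: writing "one past the end" of hmac_buffer silently
    // overwrites the first byte of the page, turning 0x93 into 0x92.
    hmac_buffer[4096] = static_cast<char>(0x92);

    std::printf("first page byte: 0x%02x (expected 0x93)\n",
                (unsigned)(unsigned char)page_buffer[0]);
}
```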

@BlueCobold BlueCobold reopened this May 13, 2022
@BlueCobold
Author

BlueCobold commented May 13, 2022

The issue has returned. Again, I have a customer with a database that cannot be decrypted. Since this is on Android, I don't have a proper native stack trace and can only assume it is related to the same incorrect checksum in the native code both systems are based on. I can provide the realm-file, so you can check if it's the same problem. The customer's app version is using the latest Android-Realm implementation, which uses the same native code as Realm-Swift 10.25.1, from what I understand. No migration was involved when the realm file got corrupted.

@jedelbo
Contributor

jedelbo commented May 19, 2022

@BlueCobold it would be nice if we had the possibility to check the realm file to see whether the corruption is similar to the first one.

@BlueCobold
Author

The customer stopped replying and stopped using my app, so I'm afraid I cannot provide the file.

@BlueCobold
Author

@jedelbo I submitted another customer's realm file with the same symptoms to realm-help@mongodb.com for analysis.

@BlueCobold BlueCobold reopened this Jul 18, 2022
@BlueCobold
Author

BlueCobold commented Jul 23, 2022

Using the decrypt-tool in the exec directory, I'm getting the following output:
Checksum failed: 0x0
Block never written: 0x55e000
Block never written: 0x55f000

So it looks like the first block has issues. The resulting output file is unusable. I have no idea how to get the "actual" and "expected" values that @jedelbo printed in his report, or how to correct the possibly faulty bytes to see if the remaining file would be operational.
My customer is massively dependent on his data and currently can't access it.
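
For reference, once an offset and its expected byte value are known (as in the forensic report above, e.g. offset 0x90000 expecting 0x93), patching a decrypted copy of the file is straightforward; a minimal sketch, with a hypothetical file name and the offset/value taken from that report:

```cpp
#include <cstdio>
#include <fstream>

// Patch a single byte at a known offset in a decrypted copy of the file.
// "decrypted.realm", the offset and the value are placeholders - they must
// come from an analysis like the forensic report above.
int main()
{
    std::fstream f("decrypted.realm",
                   std::ios::in | std::ios::out | std::ios::binary);
    if (!f) {
        std::fprintf(stderr, "cannot open file\n");
        return 1;
    }
    f.seekp(0x90000);                   // hypothetical offset of the bad byte
    f.put(static_cast<char>(0x93));     // hypothetical expected value
    return f.good() ? 0 : 1;
}
```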

The ticket bot also no longer seems to flag this bug report accordingly. @leemaguire

@BlueCobold
Author

In the meantime, I checked the decrypted content with a hex editor. Even the damaged first block contains readable strings and thus seems to be decrypted correctly. I imagine there is some header metadata which is damaged and which makes the RealmBrowser/library believe the file is still encrypted or unreadable. All blocks after the first seem to be valid; there are a lot of blocks with readable strings and UUID tables. From what I can tell, the file can be recovered, but I have not yet gathered enough understanding of the internal data structure to do that myself.

@BlueCobold
Author

BlueCobold commented Aug 8, 2022

I have restored the header with a reference to the top_ref and table_names_ref, but it seems the data is partly scrambled. Some objects have invalid strings which crash Realm when trying to load these objects. Some have fields set to null that cannot be null (object UUIDs, for example), yet they seem to be ok if I only read that column/field in sequence for the entire table.
I wonder, could this potentially be the result of a parallel realm access which did an automatic compact-on-launch?

@BlueCobold
Author

During further analysis, I noticed that some realm object keys are huge, like '3,402,167,040,181,607,100'. How did they grow so large? Is it possible there's an issue with keys overflowing at some point? I'm still guessing at what could cause badly written pages and wrongly aligned arrays.

@jedelbo
Contributor

jedelbo commented Aug 9, 2022

@BlueCobold I have been away on holiday, and did not see this until now. I can see that you have sent another file for analysis, but I am not sure which key to use for decrypting.

@BlueCobold
Author

I thought so. I have replied via email to send you the decryption-key. Did you receive it?

@jedelbo
Contributor

jedelbo commented Aug 9, 2022

To which email address should the key have been sent? I have not received anything.

@BlueCobold
Author

BlueCobold commented Aug 9, 2022

To which email address should the key have been sent? I have not received anything.

Sorry, I thought there was a forward-reply feature for GitHub mails; apparently there isn't. I had sent the file and key to realm-help in my mail from 18.07., but I can send you another one, including some findings so far - including the partly restored file header.

@jedelbo
Contributor

jedelbo commented Aug 9, 2022

Great. To be sure that I receive it, you can also send it to jorgen.edelbo@mongodb.com

@BlueCobold
Author

The duplicated data originally starts at 0x98EC0 as a valid array, and it is then "duplicated" into the header, making the file unusable.

@jedelbo
Contributor

jedelbo commented Aug 10, 2022

Those are great findings. I am a bit embarrassed that I did not spot the zeroes. I hope it can help us further with this issue. It is very common to have duplicated data. Whenever some part of an array is modified, a new version of the array is created by copying the whole array. I will try to see if I can find the "true" top ref.

@BlueCobold
Author

BlueCobold commented Aug 10, 2022

It is very common to have duplicated data. Whenever some part of an array is modified, a new version of the array is created by copying the whole array.

Yea, I figured that much. It makes sense from a transaction perspective.

I will try to see if I can find the "true" top ref.

That would be great.

Also, if you don't mind: I pointed out above that many objects have very large object keys (a few objects have two-digit keys which seem to be auto-increment style, so the big ones make me wonder what's going on). Is it normal for objects to have such large keys, or does that indicate a problematic way of using Realm? Can keys accidentally overflow, or does Realm detect free keys during object creation when the max value is reached?

@BlueCobold
Author

I found the following cluster-tree, related to table #10, at offset 0x1192A0:
41414141 4700000C 40870800 00000000 00000000 00000000 D80C0000 00000000 15000000 00000000 A8481700 00000000 03000000 00000000 A1520000 00000000 38650200 00000000 90600200 00000000 6950CC0E 1FBFCF7A 00000000 00000000 01000408 05000000

It contains a lot of very suspicious refs like 03000000, 05000000 or 15000000.
These refs would mean they point into the header bytes of the Realm file when they get written! This makes me worry a lot about data consistency.

@jedelbo
Contributor

jedelbo commented Aug 10, 2022

What you have found here is the table top array. It contains both refs and numbers. If the entry has the LSB set (like 0x15) it is a number. You get the value by shifting down one bit so in this case it is 10, which matches table number 10.
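
A minimal sketch of that decoding rule (just the rule as stated here, not a full parser of Realm's array format):

```cpp
#include <cstdint>
#include <cstdio>

// Decode one 64-bit entry of the table top array following the rule above:
// LSB set -> tagged number (value >> 1); LSB clear -> ref (byte offset into
// the file).
void decode_entry(uint64_t entry)
{
    if (entry & 1)
        std::printf("number: %llu\n", (unsigned long long)(entry >> 1));
    else
        std::printf("ref:    0x%llx\n", (unsigned long long)entry);
}

int main()
{
    decode_entry(0x15);   // prints "number: 10" -> matches table #10
    decode_entry(0xCD8);  // prints "ref: 0xcd8" -> points into the file
}
```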

@jedelbo
Contributor

jedelbo commented Aug 10, 2022

I am somewhat convinced that the first 24 bytes of the file should be

00000000  80 6c 51 00 00 00 00 00  f0 53 51 00 00 00 00 00  |.lQ......SQ.....|
00000010  54 2d 44 42 16 16 00 00                           |T-DB....|

making the top ref 0x516c80
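
For illustration, a sketch of how those 24 bytes could be decoded, assuming (unverified against realm-core) that they hold two little-endian 64-bit top-ref slots, the "T-DB" magic, two file-format version bytes and a final flags byte whose lowest bit selects the active slot:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical parse of the 24-byte header shown above. The layout is an
// assumption, not verified against realm-core.
struct RealmHeaderGuess {
    uint64_t top_ref[2];
    char magic[4];       // "T-DB"
    uint8_t format[2];   // 0x16 0x16 in the dump above
    uint8_t reserved;
    uint8_t flags;       // bit 0: which top_ref slot is current (assumed)
};

int main()
{
    const unsigned char bytes[24] = {
        0x80, 0x6c, 0x51, 0x00, 0x00, 0x00, 0x00, 0x00,
        0xf0, 0x53, 0x51, 0x00, 0x00, 0x00, 0x00, 0x00,
        'T', '-', 'D', 'B', 0x16, 0x16, 0x00, 0x00};

    RealmHeaderGuess h;
    std::memcpy(&h, bytes, sizeof h);  // ok on a little-endian host; struct has no padding

    int slot = h.flags & 1;
    std::printf("magic: %.4s, active top ref: 0x%llx\n",
                h.magic, (unsigned long long)h.top_ref[slot]);
    // prints: magic: T-DB, active top ref: 0x516c80
}
```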

@jedelbo
Contributor

jedelbo commented Aug 10, 2022

I am pretty sure that the problem is that the first 0x1000 bytes have been overwritten with a page that should have been written somewhere else. Unfortunately a lot of refs point into this area, so recreating meaningful data there would be a major puzzle.

@BlueCobold
Author

BlueCobold commented Aug 10, 2022

I am already trying to solve this puzzle by skipping invalid data. Table #10 seems to be majorly affected by it, but I could probably skip it. I "fixed" some other table entries already by detecting invalid string offsets and nulling invalid references into the first 0x1000 bytes. It still means losing a lot of data that cannot be restored. My major concern now is to prevent this from happening again in the future by all means, because by now it affects more than one customer - I only have access to this one file, though, because the others didn't report to me; I just received their crash reports and bad customer feedback in the app/play store. I don't know if I could accidentally have caused this myself, but from a developer perspective, using the API should never result in a corrupt file like this.

@BlueCobold
Author

BlueCobold commented Aug 11, 2022

Table #10 is set as:
41414141 4700000C 40870800 00000000 00000000 00000000 D80C0000
However, the ref 0xCD8 is broken due to the overwritten first page.
But 0x088740 is also invalid, I think. It points to some structure, but it will never find a valid table-name ref, which is located at 0xAC790.

@jedelbo
Contributor

jedelbo commented Aug 12, 2022

0x088740 seems to be ok. 0xAC790 (the column names) is linked from index 1. It will be hard to guess what the cluster that should be at 0xcd8 would have looked like.

@BlueCobold
Author

Oh, my bad.

I think I found the array which contains the object keys for table #10 at offset 0x11D8:
41414141 07000008 D87850E2 0EBD5900 4B494198 246F8312 AC98E903 046B281B 6B8647F6 DEA6CA21 9AE058AC B3B86125 F496A0FD B0240B32 275D4BC8 12EA8532 33286687 8FDF673D
And I think this could be the array containing the uuid column values for table #10 at offset 0x787e8:
41414141 11000128 65613733 32653461 2D666466 392D3435 63342D39 6563612D 64393735 62633064 36353063 00663062 64323564 612D6532 30652D34 3332312D 39336531 2D333864 31353439 39363533 30006662 37303331 32632D32 3863332D 34623663 2D616466 312D3461 66663963 35643763 32360034 38376434 3963372D 33393463 2D346564 612D3933 64632D65 33393135 34313537 34346300 61323334 39636635 2D346566 372D3465 33352D39 3363622D 66383133 31633533 38386365 00323233 30343431 312D6435 30342D34 6532642D 39663133 2D646139 66646439 32373534 63006135 33653961 36612D62 3636322D 34343233 2D626536 612D3164 63636234 36396265 61640063 33333261 6462662D 31343236 2D343365 622D6264 61322D38 36336461 32623735 33373300
Looks like I can extract all other data to JSON at this point (except for a few minor strings and entries I skipped) - except table #10. It only contains 8 entries, of which only the "color" column would be important to restore.

@BlueCobold
Author

BlueCobold commented Aug 12, 2022

I believe I also found the array which contains the "color" column of table #10 at offset 0x107BB8:
41414141 06000008 1B6392FF 253E3AFF 528498FF AADDEEFF 03AEECFF 415052FF 27332DFF 9D6300FF

Edit: Nope, I think this one is related to table #14, sadly. So maybe the "color" column for table #10 is lost, because it should start with 41414141 06000008, the highest byte of each integer should be FF or 00, and this is the only array which adheres to that. But since it is referenced from a 14-column array, one column of which points to strings I recognise only from table #14, I assume it belongs to table #14 instead. Much sad.

@BlueCobold
Author

BlueCobold commented Aug 13, 2022

@jedelbo Another customer sent me a realm file which causes the following crash when trying to write or delete a specific value to/from it, and I worry it may be related:
```
Build fingerprint: 'google/sdk_gphone_x86/generic_x86:9/PSR1.180720.012/4923214:user/release-keys'
Revision: '0'
ABI: 'x86'
pid: 24949, tid: 25253, name: DefaultDispatch >>> de.game_coding.trackmytime <<<
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x4

Cause: null pointer dereference
eax 00000000 ebx cd561880 ecx 00000000 edx cd5618a0
edi c73d0468 esi 00000000
ebp cb681f28 esp cb681ef0 eip caeea57c

backtrace:
#00 pc 005db57c /data/app/de.game_coding.trackmytime-1v0SjzkI5T6jngcrjrqMqQ==/lib/x86/librealm-jni.so
#1 pc 005db905 /data/app/de.game_coding.trackmytime-1v0SjzkI5T6jngcrjrqMqQ==/lib/x86/librealm-jni.so
#2 pc 005dc124 /data/app/de.game_coding.trackmytime-1v0SjzkI5T6jngcrjrqMqQ==/lib/x86/librealm-jni.so
#3 pc 0065b985 /data/app/de.game_coding.trackmytime-1v0SjzkI5T6jngcrjrqMqQ==/lib/x86/librealm-jni.so
#4 pc 006b004c /data/app/de.game_coding.trackmytime-1v0SjzkI5T6jngcrjrqMqQ==/lib/x86/librealm-jni.so
#5 pc 005d399c /data/app/de.game_coding.trackmytime-1v0SjzkI5T6jngcrjrqMqQ==/lib/x86/librealm-jni.so
#6 pc 005d2af3 /data/app/de.game_coding.trackmytime-1v0SjzkI5T6jngcrjrqMqQ==/lib/x86/librealm-jni.so
#7 pc 006c6b6d /data/app/de.game_coding.trackmytime-1v0SjzkI5T6jngcrjrqMqQ==/lib/x86/librealm-jni.so
#8 pc 003db034 /data/app/de.game_coding.trackmytime-1v0SjzkI5T6jngcrjrqMqQ==/lib/x86/librealm-jni.so (Java_io_realm_internal_Table_nativeSetLong+372)
#9 pc 0006e100 /dev/ashmem/dalvik-jit-code-cache (deleted) (io.realm.internal.Table.nativeSetLong+224)
#10 pc 00061470 /dev/ashmem/dalvik-jit-code-cache (deleted) (io.realm.de_game_coding_trackmytime_storage_inventory_ProductDbRealmProxy.insertOrUpdate+2512)
#11 pc 00066507 /dev/ashmem/dalvik-jit-code-cache (deleted) (io.realm.de_game_coding_trackmytime_storage_inventory_ProductCategoryDbRealmProxy.insertOrUpdate+2103)
#12 pc 0006ed20 /dev/ashmem/dalvik-jit-code-cache (deleted) (io.realm.DefaultRealmModuleMediator.insertOrUpdate+2560)
```

@jedelbo
Contributor

jedelbo commented Aug 15, 2022

@BlueCobold It might be related, but the stack trace does not make us any wiser.

@BlueCobold
Author

@jedelbo Thought so, but I figured I'd provide what I can. Do you want that file for analysis? (It doesn't need recovery - I re-imported its data into a fresh file - but I can offer it to you if it may help to identify bugs.)

@jedelbo
Contributor

jedelbo commented Aug 18, 2022

@BlueCobold All files are welcome. Maybe it contains that piece of information that can help us further.

@jedelbo jedelbo transferred this issue from realm/realm-swift Sep 1, 2022
@jedelbo
Contributor

jedelbo commented Sep 1, 2022

There seem to be two kinds of problems related to this issue. One is that some refs are not updated correctly; this is probably happening above the encryption layer. The other is that an encrypted page is written to the wrong location, resulting in the first page of the decrypted file containing data that should have been somewhere else.

@BlueCobold
Author

Sounds like some serious issue with multithreading and/or with internal reference/pointer handling in realm-core then, doesn't it?

@sync-by-unito sync-by-unito bot changed the title terminating with uncaught exception of type realm::util::DecryptionFailed: Decryption failed Decryption failed - page zero has wrong checksum Sep 9, 2022
@sync-by-unito sync-by-unito bot assigned finnschiermer and unassigned jedelbo Oct 19, 2022
@BlueCobold
Author

@nicola-cab
The release notes in realm-core say the PR fixes an issue which has existed since v11.8.0. However, this bug report was already filed against v10.10.0. So either the commit doesn't fix it, or the release notes are incorrect.

@BlueCobold
Author

@nicola-cab @jedelbo
Just to update this a bit: since I queued all my Realm operations onto a single thread, the issue seems to be gone in both the Android and iOS versions of my apps in production (about 10k users in total). Since the issue existed on both platforms, I don't think the iOS-only fix #5993 can have solved it - first because it is iOS-only, and also because it seems to be related to crashes, which does not go hand in hand with my multi-thread/single-thread observation: the amount of crashes should have stayed the same, so the number of corruptions should not have been reduced by my change.
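
For completeness, a rough C++ sketch of the mitigation described above - funnelling every database operation through one worker thread so no two ever run concurrently. The real apps use a serial DispatchQueue / single-threaded dispatcher; this is only the generic pattern, not Realm API code:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Generic serial worker: every task posted here runs on the same single
// thread, one after another, so database operations never overlap.
class SerialWorker {
public:
    SerialWorker() : worker_([this] { run(); }) {}
    ~SerialWorker()
    {
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }
    void post(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void run()
    {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                if (tasks_.empty())
                    return; // done_ was set and no pending work remains
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task(); // e.g. open the database, write, close - all on this thread
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool done_ = false;
    std::thread worker_; // declared last so it starts after the other members
};
```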

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 21, 2024