Skip to content
This repository has been archived by the owner on Jun 20, 2023. It is now read-only.

PRIO - ENS Timeout? Outdated exposure checks: CWA uses the same diagnosis key package several times in ProvideDiagnosisKeys to ENS #2880

Closed
3 tasks done
vaubaehn opened this issue Apr 20, 2021 · 36 comments
Assignees
Labels
bug Something isn't working mirrored-to-jira This item is also tracked internally in JIRA

Comments

@vaubaehn
Copy link
Contributor

vaubaehn commented Apr 20, 2021

Avoid duplicates

  • Bug is not mentioned in the FAQ
  • Bug is specific for Android only, for general issues / questions that apply to iOS and Android please raise them in the documentation repository
  • Bug is not already reported in another issue

Technical details

  • Device name: LG-G4 (LGH-815)
  • Android version: 6.0
  • App version: 1.15.2 (but also other versions had been affected before)
  • ENS version: 18210915000 (but also other versions had been affected before)

Describe the bug

At some days, CWA is providing the same previously downloaded diagnosis key package to ENS, resulting in outdated exposure checks. This already happened to me last year, and is now occuring again.
My assumption is, that a timeout of ENS and an insufficient handling of the error in CWA is the reason, because I'm experiencing the problem only in times with high incidence rates, and at the moment the DK packages contain 500k+ diagnosis keys.
So, probably many people might be affected currently without knowing it (unless they don't check the logs in ENS UI carefully).
As you can see from my attached ENS logs (see below) this happened on 10 days since April 1st.
Also, on April 18th at around 9:00 am, CWA correctly indicated, that exposure matching failed (for more that 36 hours):
Screenshot_2021-04-18-08-57-47 Screenshot_2021-04-18-08-58-15
When tapping on "Erneut Starten", the spinning wheel of downloading was visible for <1s, then the green card appeared, but exposure matching didn't start actually (as you can see from the logs).

Steps to reproduce the issue

  1. Have an old, slow phone, preferrably Android 6, as there is only one single exposure check per day currently. Wait for rising incidence rates.
  2. Let CWA do its job for a couple of days.
  3. Go to ENS UI via CWA > Settings > App-Information > tap on ENF version number > select "Überprüfungen auf mögliche Begegenungen" from the 3-dot-menu up right in the screen [ > enter your screen lock pin/password if you have one]
  4. Check the checksums of the matched diagnosis keys packages. They will be the same often.

Expected behaviour

  1. When exposure matching fails, the user should be informed via a notification or via an error reporting card in main screen (like Incompatibility warning card on main screen (COMMUNITY) #2481 for incompatibility issues and like it was proposed here in a very raw draft: Feature/new error message #864 (comment))
  2. When exposure matching fails, CWA should always download/use the very latest DK package and try again after some time. Is it possible to register a worker for this?

Possible Fix

Re-evaluate and fix error handling and reporting for timeouts.

Additional context

Here are the latest logs of exposure matchings from ENS UI. I tried to separate the "double checks" and marked them with an asterisk for easier finding them. (The logs have been compiled from two exports on different dates!)
20210401-20210518_assembled_all-exposure-checks.txt
(Edit: updated exposure check logs 2021-05-18)


Internal Tracking-ID: EXPOSUREAPP-6654

@vaubaehn
Copy link
Contributor Author

Again no exposure checks since yesterday:
Screenshot_2021-04-21-14-17-11 Screenshot_2021-04-21-14-19-01
Tapping "Erneut Starten" -> download wheel for one second -> green card shows up -> but nothing happened.
2 hours later, a new exposure check with a new DK package was registered in the ENS exposure check log, but I don't know, if this was really successful...

There is an increasing number of reports from users in Google Play Store reviews with similar problems.
Would be good, if someone could have a look into it! Thank you in advance!

@heinezen
Copy link
Member

@vaubaehn

Thanks! We have reported it to the development team as EXPOSUREAPP-6654. Let's see what they can do about this.


Corona-Warn-App Open Source Team

@vaubaehn
Copy link
Contributor Author

Hi @rugk, inviting you to here from #2941 :
Could you do me a favor and look up, whether in your case also CWA uses outdated diagnosis keys packages? See 'steps to reproduce...' in OP above. If you find same checksums at least twice, I'd say it's the same problem.

@vaubaehn
Copy link
Contributor Author

Some additional information:
I did some logcatting from Thursday night to Friday. From the logs it looks like that ENS is actually doing its job right! It takes over the downloaded diagnosis keys from CWA and matches them with the recorded RPI scans. After it's finished, it sends an intent to CWA, that CWA is also receiving successfully, because it then starts a worker for risk calculation.
The formerly 'ENS timeout' hypothesis can most likely be discarded, as the matching of diagnosis keys with RPIs actually takes 8 to 9 seconds (sic!) with a package of 500k+ diagnosis keys and around 1300 stored RPIs (all information available in logcat). Actually, most time is spent by ENS in verifying the signature of the downloaded diagnosis keys package (up to one minute). The fast matching seems to be possible by using a two-step approach: first the diagnosis keys are quickly scanned if they could match with RPIs in a process, that is called 'native pre-filter' in logcat. After pre filtering, it looks like a second process is matching pre filtered diagnosis keys with RPIs (JNI matcher).

So it looks like, there is something going wrong with CWA itself. Obviously after ENS notified CWA of having finished, in the following flow the successful completion of diagnosis key matching or the result of risk calculation is not persisted correctly, so that CWA tries again with the previously downloaded diagnosis key package (instead of downloading and providing a new one) at a later time.
Interestingly, this obviously only becomes a problem, when the size of the diagnosis keys package exceeds a certain value (around 450-470k in my case).
So there may be a timing problem or a race condition with the tasks/workers after ProvideDiagnosisKeys.

@vaubaehn
Copy link
Contributor Author

And some more additional Information: after updating from CWA 1.15.2 to CWA 2.0.3 and an accidently nearly simultanous update of the ENS to v1.8 (211213000), the difference in the timestamps of CWA risk card and the ENS check log was 20 minutes last night! Before it usually was 1-2 minutes. I try to gather new information from logcat, as soon I find time.

@rugk
Copy link

rugk commented Apr 24, 2021

Could you do me a favor and look up, whether in your case also CWA uses outdated diagnosis keys packages? See 'steps to reproduce...' in OP above. If you find same checksums at least twice, I'd say it's the same problem.

As the apps now work correctly, I guess I cannot do that.
But thanks, I’ll keep these steps in mind, when I’ll see this later. 😄

@vaubaehn
Copy link
Contributor Author

vaubaehn commented Apr 24, 2021

@rugk The question is, if the app indeed runs correctly now. For r1.15.2 it was not doing for me since around April 1st, but the failing error card came just 2 or 3 times. The ENS log holds entries of the last 16 days, so it could really help if you looked up the checksums there to verify you have not been affected. Thank you in advance! :)

@vaubaehn
Copy link
Contributor Author

Short update:
New version 2.0 3 with refactored DiagnosisKeyRetrievalWorker didn't change anything for me: 2 exposure checks today, 1st 3:21am, 2nd 9:10pm - both with package that has hash macdc6RUhFnobBP1Zx3WchR6oJY6Q/5ajUu6uXniYms= .

@d4rken
Copy link
Member

d4rken commented Apr 25, 2021

I'm wondering whether this is related to your #2880 ticket (likely fixed by #2934) 🤔.

The UNMETERED check is not working, the hourly package sync is being skipped, there will only be one day package per day....
If we provide the ENS with the previous keys again, do we get a calculation with the same checksum?

Within the DownloadDiagnosisKeysTask we have:

private fun hasRecentDetectionAndNoNewFiles(
now: Instant,
keySyncResult: KeyPackageSyncTool.Result,
trackedDetections: Collection<TrackedExposureDetection>
): Boolean {
// One forced detection every 24h, ignoring the sync results
val lastSuccessfulDetection = trackedDetections.filter { it.isSuccessful }.maxByOrNull { it.startedAt }
val nextForcedDetectionAt = lastSuccessfulDetection?.startedAt?.plus(Duration.standardDays(1))
val hasRecentDetection = nextForcedDetectionAt != null && now.isBefore(nextForcedDetectionAt)
return (hasRecentDetection && keySyncResult.newKeys.isEmpty()).also {
if (it) Timber.tag(TAG).w("Aborting. Last detection is recent (<24h) and no new keyfiles.")
}
}

I would expect keySyncResult.newKeys.isEmpty() to be true, but according to your logs (https://github.com/corona-warn-app/cwa-app-android/files/6343389/20210401-20_assembled_all-exposure-checks.txt), this should also return true, as you have duplicate checks checksums within 24h:

 val nextForcedDetectionAt = lastSuccessfulDetection?.startedAt?.plus(Duration.standardDays(1))  
 val hasRecentDetection = nextForcedDetectionAt != null && now.isBefore(nextForcedDetectionAt) 

Only if the last provideDiagnosisKeys didn't end up with a successful notification (broadcast) from the ENS? So this will perform better after in 2.1 due to #2934, but the timestamps / notifications from the ENS after providing diagnosis keys still seems suspicious...

@rugk
Copy link

rugk commented Apr 25, 2021

Go to ENS UI via CWA > Settings > App-Information > tap on ENF version number > select "Überprüfungen auf mögliche Begegenungen"

Okay, gone there and checked some keys around the error time and the hashes are all different.
I could not compare those with the keys used by CCTG as CCTG does not have the same screen (with their built-in microg version), and I could only export the whole .db there, which would really be too much work.

So my guess is, no it is not related.

And, as said, I can only confirm so far that the issue is gone. I have never seen it again. If I should, I'll notice you.

@vaubaehn
Copy link
Contributor Author

@rugk
Great, thanks for your help!

Okay, gone there and checked some keys around the error time and the hashes are all different.
[...] So my guess is, no it is not related.

I now also think so, that the "Risiko-Überprüfung fehlgeschlagen" card is not related to the outdated diagnosis keys that are provided to ENS, and that's a different issue.
The reason why I thought it's related, is, because I found this card after the diagnosis key package was older than 36 hours and CWA might somehow checked that it's outdated without downloading a new package (I still suppose the message on the card is wrong and not necessarily related to a failed download. Anyway, the text of this card is wrong, as the RPIs ("Zufalls-IDs") are not matched with "the server" but with the diagnosis keys).
Also, today my diagnosis keys are outdated for more than 36 hours, but the card is not shown.

So with your help, we could track down the bug a little bit better.

@vaubaehn
Copy link
Contributor Author

Short update:
The issue is persisting for me. See latest exposure check logs:
20210401-27_assembled_all-exposure-checks.txt
I updated to 2.0.3 on April 23rd, and the problem is still the same.
There was one interesting point:
On April 25th, I pulled CWA to foreground at 2:21am, and downloading DKs and matching worked just fine.
On April 26th, CWA was downloading/checking in background, and it looked like it worked.
But on April 27th (today), the DKs were again checked too early (0:28am instead of after 2:00am), and it was again obviously the same package like on 26th.

@d4rken

I'm wondering whether this is related to your #2880 ticket (likely fixed by #2934) 🤔.

The UNMETERED check is not working, the hourly package sync is being skipped, there will only be one day package per day....

If I'm not wrong, until (including) CWA 1.15.2, there were two workers for downloading and provision of DKs to ENS: the DiagnosisKeyRetrievalOneTimeWorker and the DiagnosisKeyRetrievalPeriodicWorker . Because of #2934 , the DiagnosisKeyRetrievalPeriodicWorker should not have done anything, should it? And the DiagnosisKeyRetrievalOneTimeWorker should have been responsible for falsely providing an outdated DK package to ENS.
Also, it really seems to be related to the package size of DKs, as the problem only occurs with bigger packages. In times with less keys per package, everything worked fine. Or is this some mysterious coincidence?

If we provide the ENS with the previous keys again, do we get a calculation with the same checksum?

I'm quite sure that this is the case.

Within the DownloadDiagnosisKeysTask we have:

Valid from CWA 2.0.3 on

private fun hasRecentDetectionAndNoNewFiles(
now: Instant,
keySyncResult: KeyPackageSyncTool.Result,
trackedDetections: Collection<TrackedExposureDetection>
): Boolean {
// One forced detection every 24h, ignoring the sync results
val lastSuccessfulDetection = trackedDetections.filter { it.isSuccessful }.maxByOrNull { it.startedAt }
val nextForcedDetectionAt = lastSuccessfulDetection?.startedAt?.plus(Duration.standardDays(1))
val hasRecentDetection = nextForcedDetectionAt != null && now.isBefore(nextForcedDetectionAt)
return (hasRecentDetection && keySyncResult.newKeys.isEmpty()).also {
if (it) Timber.tag(TAG).w("Aborting. Last detection is recent (<24h) and no new keyfiles.")
}
}

I would expect keySyncResult.newKeys.isEmpty() to be true, but according to your logs

now: 20210401-27_assembled_all-exposure-checks.txt

this should also return true, as you have duplicate checks checksums within 24h:

 val nextForcedDetectionAt = lastSuccessfulDetection?.startedAt?.plus(Duration.standardDays(1))  
 val hasRecentDetection = nextForcedDetectionAt != null && now.isBefore(nextForcedDetectionAt) 

Only if the last provideDiagnosisKeys didn't end up with a successful notification (broadcast) from the ENS? So this will perform better after in 2.1 due to #2934, but the timestamps / notifications from the ENS after providing diagnosis keys still seems suspicious...

I guess there may be unexpected inconsistencies in these vals.
For understanding the bug, it's probably more easy, if we only take the (old) DiagnosisKeyRetrievalOneTimeWorker in combination with the 'hourly bug' into account:
Usually in this set up, only once per day DKs should be downloaded and checked.
But the DiagnosisKeyRetrievalOneTimeWorker was called several times per day in my case, e.g., at 23'rd of April it was called 3 times (whereas the third time, it was already the new DiagnosisKeyRetrievalWorker). This would mean, that the DiagnosisKeyRetrievalOneTimeWorker didn't run successful, because it set a new nextForcedDetectionAt for the same day, but using still the same DK package.

I wonder, if it could be, that the worker was interrupted/stopped unexpectedly, and the persisted vals used in the worker became inconsistent in that case (and should have been reset probably in such a case).
Are now ALL workers started with submitBlocking ? (Didn't check that) Maybe that can already help?

@vaubaehn
Copy link
Contributor Author

vaubaehn commented Apr 27, 2021

Here's also a logcat from April 23rd, 08:06am - 08:09am (checking exposures again with same keys) with CWA's pid:10658 and uid:10200, related CWA workers, network requests and ENS responses (I tried not to disclose personal data ;). Interestingly, there was obviously another DNS request from CWA that led to nothing, after everything was finished. Wonder, if the workers are a little bit confused.

04-23 08:06:42.054 10658 10658 D WM-SystemJobService: onStartJob for c23cb48f-107f-4a45-a85d-1c8b208d3656
04-23 08:06:42.058 10658 10658 D WM-SystemJobService: onStartJob for 35076de1-16ab-40bc-a5f6-ef78661a71f2
04-23 08:06:42.059 10658 10658 D WM-SystemJobService: Job is already being executed by SystemJobService: c23cb48f-107f-4a45-a85d-1c8b208d3656
04-23 08:06:42.063 10658 10658 D WM-SystemJobService: Job is already being executed by SystemJobService: 35076de1-16ab-40bc-a5f6-ef78661a71f2
04-23 08:06:42.071 10658 10720 D WM-Processor: Processor: processing c23cb48f-107f-4a45-a85d-1c8b208d3656
04-23 08:06:42.072 10658 10720 D WM-Processor: Processor: processing 35076de1-16ab-40bc-a5f6-ef78661a71f2
04-23 08:06:42.113 10658 10658 D WM-WorkerWrapper: Starting work for de.rki.coronawarnapp.worker.DiagnosisKeyRetrievalPeriodicWorker
04-23 08:06:42.133 10658 10658 D WM-WorkerWrapper: Starting work for de.rki.coronawarnapp.deadman.DeadmanNotificationPeriodicWorker
04-23 08:06:42.174 10658 10697 D WM-PackageManagerHelper: androidx.work.impl.background.systemalarm.RescheduleReceiver enabled

=> here our DiagnosisKeyRetrievalOneTimeWorker gets scheduled, AND IT GETS 2 JOBS in one single worker-ID. It will do its work at 04-23 08:07:47.045
04-23 08:06:42.191 10658 10697 D WM-SystemJobScheduler: Scheduling work ID 4e3da071-a1cb-4261-83ab-02077ca6a365 Job ID 49601
04-23 08:06:42.203 10658 10697 D WM-SystemJobScheduler: Scheduling work ID 4e3da071-a1cb-4261-83ab-02077ca6a365 Job ID 49602
04-23 08:06:42.216 10658 10697 D WM-Processor: Processor cancelling ece59a84-d99b-4e5f-a2ca-ef8228ca5003
04-23 08:06:42.216 10658 10697 D WM-Processor: WorkerWrapper could not be found for ece59a84-d99b-4e5f-a2ca-ef8228ca5003
04-23 08:06:42.232 10658 10697 D WM-GreedyScheduler: Cancelling work ID ece59a84-d99b-4e5f-a2ca-ef8228ca5003
04-23 08:06:42.249 10658 10697 D WM-PackageManagerHelper: androidx.work.impl.background.systemalarm.RescheduleReceiver enabled
04-23 08:06:42.258 10658 10697 D WM-SystemJobScheduler: Scheduling work ID b5c72ab9-8fa6-4e2d-94e1-58860de85ab0 Job ID 49603
04-23 08:06:42.264 10658 10697 D WM-SystemJobScheduler: Scheduling work ID b5c72ab9-8fa6-4e2d-94e1-58860de85ab0 Job ID 49604
04-23 08:06:42.273 10658 10697 D WM-WorkerWrapper: de.rki.coronawarnapp.deadman.DeadmanNotificationPeriodicWorker returned a Success {mOutputData=Data {}} result.
04-23 08:06:42.274 10658 10697 I WM-WorkerWrapper: Worker result SUCCESS for Work [ id=35076de1-16ab-40bc-a5f6-ef78661a71f2, tags={ de.rki.coronawarnapp.deadman.DeadmanNotificationPeriodicWorker } ]
04-23 08:06:42.276 10658 10658 D WM-Processor: Processor 35076de1-16ab-40bc-a5f6-ef78661a71f2 executed; reschedule = false
04-23 08:06:42.276 10658 10658 D WM-SystemJobService: 35076de1-16ab-40bc-a5f6-ef78661a71f2 executed on JobScheduler
04-23 08:06:42.282 10658 10697 D WM-GreedyScheduler: Cancelling work ID 35076de1-16ab-40bc-a5f6-ef78661a71f2
04-23 08:06:42.292 10658 10697 D WM-SystemJobScheduler: Scheduling work ID 35076de1-16ab-40bc-a5f6-ef78661a71f2 Job ID 49561
04-23 08:06:42.298 10658 10697 D WM-SystemJobScheduler: Scheduling work ID 35076de1-16ab-40bc-a5f6-ef78661a71f2 Job ID 49605
04-23 08:06:42.305 10658 10697 D WM-WorkerWrapper: de.rki.coronawarnapp.worker.DiagnosisKeyRetrievalPeriodicWorker returned a Success {mOutputData=Data {}} result.
04-23 08:06:42.307 10658 10697 I WM-WorkerWrapper: Worker result SUCCESS for Work [ id=c23cb48f-107f-4a45-a85d-1c8b208d3656, tags={ de.rki.coronawarnapp.worker.DiagnosisKeyRetrievalPeriodicWorker, DIAGNOSIS_KEY_PERIODIC_WORKER } ]
04-23 08:06:42.309 10658 10658 D WM-Processor: Processor c23cb48f-107f-4a45-a85d-1c8b208d3656 executed; reschedule = false
04-23 08:06:42.309 10658 10658 D WM-SystemJobService: c23cb48f-107f-4a45-a85d-1c8b208d3656 executed on JobScheduler
04-23 08:06:42.317 10658 10697 D WM-GreedyScheduler: Cancelling work ID c23cb48f-107f-4a45-a85d-1c8b208d3656
04-23 08:06:42.326 10658 10697 D WM-SystemJobScheduler: Scheduling work ID c23cb48f-107f-4a45-a85d-1c8b208d3656 Job ID 49599
04-23 08:06:42.333 10658 10697 D WM-SystemJobScheduler: Scheduling work ID c23cb48f-107f-4a45-a85d-1c8b208d3656 Job ID 49606
04-23 08:06:42.342 10658 10697 D WM-Processor: Processor stopping background work ece59a84-d99b-4e5f-a2ca-ef8228ca5003
04-23 08:06:42.342 10658 10697 D WM-Processor: WorkerWrapper could not be found for ece59a84-d99b-4e5f-a2ca-ef8228ca5003
04-23 08:06:42.342 10658 10697 D WM-StopWorkRunnable: StopWorkRunnable for ece59a84-d99b-4e5f-a2ca-ef8228ca5003; Processor.stopWork = false
04-23 08:06:42.343 10658 10697 D WM-Processor: Processor stopping background work 35076de1-16ab-40bc-a5f6-ef78661a71f2
04-23 08:06:42.343 10658 10697 D WM-Processor: WorkerWrapper could not be found for 35076de1-16ab-40bc-a5f6-ef78661a71f2
04-23 08:06:42.343 10658 10697 D WM-StopWorkRunnable: StopWorkRunnable for 35076de1-16ab-40bc-a5f6-ef78661a71f2; Processor.stopWork = false
04-23 08:06:42.344 10658 10697 D WM-Processor: Processor stopping background work c23cb48f-107f-4a45-a85d-1c8b208d3656
04-23 08:06:42.344 10658 10697 D WM-Processor: WorkerWrapper could not be found for c23cb48f-107f-4a45-a85d-1c8b208d3656
04-23 08:06:42.344 10658 10697 D WM-StopWorkRunnable: StopWorkRunnable for c23cb48f-107f-4a45-a85d-1c8b208d3656; Processor.stopWork = false
----
Wifi lost: 04-23 08:07:31
LTE connected: 04-23 08:07:32
----

=> here the DiagnosisKeyRetrievalOneTimeWorker starts its work.
=> looks like, first job is started
04-23 08:07:47.045 10658 10658 D WM-SystemJobService: onStartJob for 4e3da071-a1cb-4261-83ab-02077ca6a365

=> looks like, second job is started, but detected as already running. But does "detection" mean "stopping"?
04-23 08:07:47.046 10658 10658 D WM-SystemJobService: Job is already being executed by SystemJobService: 4e3da071-a1cb-4261-83ab-02077ca6a365

04-23 08:07:47.052 10658 10721 D WM-Processor: Processor: processing 4e3da071-a1cb-4261-83ab-02077ca6a365
04-23 08:07:47.087 10658 10658 D WM-WorkerWrapper: Starting work for de.rki.coronawarnapp.worker.DiagnosisKeyRetrievalOneTimeWorker

=> Looks like this could be CWA trying to download DKs uid:10200/CWA's pid:10658
04-23 08:07:47.206   437  1310 D BandwidthController: [LG DATA] No such appUid: 10200 /* This happens with all uids of all apps
04-23 08:07:47.206   437  1310 D DnsProxyListener: App 10200 tries DNS query. Accept family:0 protocol:0
04-23 08:07:51.415  1536  1585 D ConnectivityService: getActiveNetwork nai=NetworkAgentInfo{ ni{[type: MOBILE[LTE], state: CONNECTED/CONNECTED, reason: connected, extra: ####### blanked for privacy reasons ####### }
04-23 08:07:51.424  1536  3372 D ConnectivityService: listenForNetwork for Listen from uid/pid:10200/10658 for NetworkRequest [ id=128, legacyType=-1, [ Capabilities: INTERNET&NOT_RESTRICTED&TRUSTED&NOT_VPN] ]
04-23 08:07:51.425  1536  3015 D ConnectivityService: rematching NetworkAgentInfo [MOBILE (LTE) - 123]
04-23 08:07:51.426  1536  3015 D ConnectivityService: handleRegisterNetworkRequest checking NetworkAgentInfo [MOBILE (LTE) - 123]
04-23 08:07:51.426  1536  3015 D ConnectivityService: apparently satisfied.  currentScore=50

=> Network request ID 123 and ID 128 are both from/for CWA. This could point to a double job collusion, but I think it's not related, as I saw it later with other apps/processes in a similar manner.
04-23 08:07:51.426  1536  3015 E ConnectivityService: BUG: NetworkAgentInfo [MOBILE (LTE) - 123] already has NetworkRequest [ id=128, legacyType=-1, [ Capabilities: INTERNET&NOT_RESTRICTED&TRUSTED&NOT_VPN] ]
04-23 08:07:51.427  1536  1715 I am_wtf  : [0,1536,system_server,-1,ConnectivityService,BUG: NetworkAgentInfo [MOBILE (LTE) - 123] already has NetworkRequest [ id=128, legacyType=-1, [ Capabilities: INTERNET&NOT_RESTRICTED&TRUSTED&NOT_VPN] ]]
04-23 08:07:51.429  1536  3015 D ConnectivityService: releasing NetworkRequest NetworkRequest [ id=128, legacyType=-1, [ Capabilities: INTERNET&NOT_RESTRICTED&TRUSTED&NOT_VPN] ]
04-23 08:07:51.429 10658 26814 D ConnectivityManager.CallbackHandler: CM callback handler got msg 524290
04-23 08:07:51.446 10658 26814 D ConnectivityManager.CallbackHandler: CM callback handler got msg 524290
04-23 08:07:51.449  1536  4190 D ConnectivityService: getActiveNetwork nai=NetworkAgentInfo{ ni{[type: MOBILE[LTE], state: CONNECTED/CONNECTED, reason: connected, extra: ###### }
04-23 08:07:51.451 10658 26814 D ConnectivityManager.CallbackHandler: CM callback handler got msg 524296
04-23 08:07:51.451  1536  4189 D ConnectivityService: getActiveNetwork nai=NetworkAgentInfo{ ni{[type: MOBILE[LTE], state: CONNECTED/CONNECTED, reason: connected, extra: ###### }

===> HERE ENS got the keys provided via API by CWA - but the download didn't start yet! Is this the SECOND job, that got skipped due to collision detection in DownloadTask?
04-23 08:07:52.420  3606 26046 I ExposureNotification: Found flag-defined keys de.rki.coronawarnapp:262-v1,MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEc7DEstcUIRcyk35OYDJ95/hTg3UVhsaDXKT0zK7NhHPXoyzipEnOp3GyNXDVpaPi3cAfQmxeuFMZAIX2+6A5Xg== for package de.rki.coronawarnapp [CONTEXT service_id=236 ]

=> And a DELAYED DiagnosisKeyRetrievalOneTimeWorker gets already scheduled again?
04-23 08:08:13.722 10658 10658 D WM-DelayedWorkTracker: Scheduling work 4e3da071-a1cb-4261-83ab-02077ca6a365
04-23 08:08:13.722 10658 10658 D WM-GreedyScheduler: Starting tracking for [4e3da071-a1cb-4261-83ab-02077ca6a365]

=> This looks like, the DiagnosisKeyRetrievalOneTimeWorker is now validating the network constraints - BUT ENS is already runnung?! Could this be the FIRST job which does its work?
04-23 08:08:13.733 10658 10658 D WM-ConstraintTracker: NetworkStateTracker: initial state = [ Connected=true Validated=true Metered=true NotRoaming=true ]
04-23 08:08:13.733 10658 10658 D WM-NetworkStateTracker: Registering broadcast receiver
04-23 08:08:13.743 10658 10658 D WM-WorkConstraintsTrack: Constraints met for 4e3da071-a1cb-4261-83ab-02077ca6a365
04-23 08:08:13.744 10658 10658 D WM-GreedyScheduler: Constraints met: Scheduling work ID 4e3da071-a1cb-4261-83ab-02077ca6a365
04-23 08:08:13.744 10658 10658 D WM-NetworkStateTracker: Network broadcast received

=> This could also be a hint, that the first job just did its work, and the second one was already scheduled again:
04-23 08:08:13.744 10658 10697 D WM-Processor: Work 4e3da071-a1cb-4261-83ab-02077ca6a365 is already enqueued for processing
04-23 08:08:14.615  1536  1697 I am_pss  : [10658,10200,de.rki.coronawarnapp,75247616,74264576]

=> Now the ENS finished verification of the public key signature of the DK file! It took around 70 seconds!
04-23 08:09:02.034  3606 26046 I ExposureNotification: Signature verification succeeded [CONTEXT service_id=236 ]

=> ENS starts exposure matching now.
04-23 08:09:02.182  3606 26892 I ExposureNotification: [MatchingTracer] Starting matching job. [CONTEXT service_id=236 ]
04-23 08:09:02.210  3606 26923 I ExposureNotification: [MatchingTracer] Native pre-filter started. [CONTEXT service_id=236 ]

=> DiagnosisKeyRetrievalOneTimeWorker is now finished, and unregistered.
04-23 08:09:02.229 10658 10720 D WM-WorkerWrapper: de.rki.coronawarnapp.worker.DiagnosisKeyRetrievalOneTimeWorker returned a Success {mOutputData=Data {}} result.
04-23 08:09:02.231 10658 10720 I WM-WorkerWrapper: Worker result SUCCESS for Work [ id=4e3da071-a1cb-4261-83ab-02077ca6a365, tags={ DIAGNOSIS_KEY_ONE_TIME_WORKER, de.rki.coronawarnapp.worker.DiagnosisKeyRetrievalOneTimeWorker } ]
04-23 08:09:02.234 10658 10658 D WM-Processor: Processor 4e3da071-a1cb-4261-83ab-02077ca6a365 executed; reschedule = false
04-23 08:09:02.234 10658 10658 D WM-GreedyScheduler: Stopping tracking for 4e3da071-a1cb-4261-83ab-02077ca6a365
04-23 08:09:02.235 10658 10658 D WM-NetworkStateTracker: Unregistering broadcast receiver
04-23 08:09:02.236 10658 10658 D WM-SystemJobService: 4e3da071-a1cb-4261-83ab-02077ca6a365 executed on JobScheduler
04-23 08:09:02.237  3606 26923 I ExposureNotificationJni: PrefixIdMap loaded 1297 scan records
04-23 08:09:02.238  3606 26923 I ExposureNotification: MatchingJni get native ptr 546730083840 [CONTEXT service_id=236 ]
04-23 08:09:02.240  3606 26923 I ExposureNotificationJni: matchingNative get 14 files
04-23 08:09:02.241  3606 26923 I ExposureNotificationJni: Matching with /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export0.bin

=> but here again DiagnosisKeyRetrievalOneTimeWorker is stopped/cancelled?
04-23 08:09:02.246 10658 10720 D WM-GreedyScheduler: Cancelling work ID 4e3da071-a1cb-4261-83ab-02077ca6a365
04-23 08:09:02.250  3606 26923 I ExposureNotificationJni: Created iterator for /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export0.bin (process_revised_keys = 1)
04-23 08:09:02.251 10658 10721 D WM-Processor: Processor stopping background work 4e3da071-a1cb-4261-83ab-02077ca6a365

=> and DiagnosisKeyRetrievalOneTimeWorker could not be found, because it was already unregistered before?
04-23 08:09:02.251 10658 10721 D WM-Processor: WorkerWrapper could not be found for 4e3da071-a1cb-4261-83ab-02077ca6a365
04-23 08:09:02.251 10658 10721 D WM-StopWorkRunnable: StopWorkRunnable for 4e3da071-a1cb-4261-83ab-02077ca6a365; Processor.stopWork = false
04-23 08:09:03.029  3606 26923 I ExposureNotificationJni: Matching with /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export1.bin
04-23 08:09:03.030  3606 26923 I ExposureNotificationJni: Created iterator for /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export1.bin (process_revised_keys = 1)
04-23 08:09:03.540  3606 26923 I ExposureNotificationJni: Matching with /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export2.bin
04-23 08:09:03.540  3606 26923 I ExposureNotificationJni: Created iterator for /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export2.bin (process_revised_keys = 1)
04-23 08:09:03.817  3606 26923 I ExposureNotificationJni: Matching with /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export3.bin
04-23 08:09:03.817  3606 26923 I ExposureNotificationJni: Created iterator for /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export3.bin (process_revised_keys = 1)
04-23 08:09:04.207  3606 26923 I ExposureNotificationJni: Matching with /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export4.bin
04-23 08:09:04.207  3606 26923 I ExposureNotificationJni: Created iterator for /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export4.bin (process_revised_keys = 1)
04-23 08:09:04.935  3606 26923 I ExposureNotificationJni: Matching with /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export5.bin
04-23 08:09:04.936  3606 26923 I ExposureNotificationJni: Created iterator for /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export5.bin (process_revised_keys = 1)
04-23 08:09:05.624  3606 26923 I ExposureNotificationJni: Matching with /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export6.bin
04-23 08:09:05.625  3606 26923 I ExposureNotificationJni: Created iterator for /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export6.bin (process_revised_keys = 1)
04-23 08:09:06.228  3606 26923 I ExposureNotificationJni: Matching with /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export7.bin
04-23 08:09:06.229  3606 26923 I ExposureNotificationJni: Created iterator for /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export7.bin (process_revised_keys = 1)
04-23 08:09:06.829  3606 26923 I ExposureNotificationJni: Matching with /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export8.bin
04-23 08:09:06.829  3606 26923 I ExposureNotificationJni: Created iterator for /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export8.bin (process_revised_keys = 1)
04-23 08:09:07.347  3606 26923 I ExposureNotificationJni: Matching with /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export9.bin
04-23 08:09:07.348  3606 26923 I ExposureNotificationJni: Created iterator for /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export9.bin (process_revised_keys = 1)
04-23 08:09:07.635  3606 26923 I ExposureNotificationJni: Matching with /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export10.bin
04-23 08:09:07.635  3606 26923 I ExposureNotificationJni: Created iterator for /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export10.bin (process_revised_keys = 1)
04-23 08:09:08.088  3606 26923 I ExposureNotificationJni: Matching with /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export11.bin
04-23 08:09:08.089  3606 26923 I ExposureNotificationJni: Created iterator for /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export11.bin (process_revised_keys = 1)
04-23 08:09:08.844  3606 26923 I ExposureNotificationJni: Matching with /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export12.bin
04-23 08:09:08.845  3606 26923 I ExposureNotificationJni: Created iterator for /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export12.bin (process_revised_keys = 1)
04-23 08:09:09.612  3606 26923 I ExposureNotificationJni: Matching with /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export13.bin
04-23 08:09:09.613  3606 26923 I ExposureNotificationJni: Created iterator for /data/user/0/com.google.android.gms/app_en_diagnosis_keys/1619158071578/export13.bin (process_revised_keys = 1)
04-23 08:09:10.241  3606 26923 I ExposureNotificationJni: Matching done, total 529032 keys, no key matches
04-23 08:09:10.242  3606 26923 I ExposureNotification: MatchingJni get nullable key set from native. [CONTEXT service_id=236 ]
04-23 08:09:10.253  3606 26923 I ExposureNotification: [MatchingTracer] Native pre-filter found 0 (529032) keys with sightings out of 1297 scan records. Spent time: 8.042s [CONTEXT service_id=236 ]
04-23 08:09:10.259  3606 26923 I ExposureNotification: [MatchingTracer] Java tracing started. [CONTEXT service_id=236 ]
04-23 08:09:10.429  3606 26923 I ExposureNotification: [MatchingTracer] Traced 0 diagnosis keys and found 0 matches. [CONTEXT service_id=236 ]
04-23 08:09:10.429  3606 26923 I ExposureNotification: [MatchingTracer] Persisting matching request record. [CONTEXT service_id=236 ]
04-23 08:09:10.460  3606 26892 I ExposureNotification: [MatchingTracer] Matching job complete. Run time=8.262s. Matches not found. [CONTEXT service_id=236 ]

=> ENS finished exposure matching and is sending its intent to CWA.
04-23 08:09:10.460  3606 26892 I ExposureNotification: [MatchingTracer] Sending exposure status update with no new exposures to client. [CONTEXT service_id=236 ]
04-23 08:09:10.473  3606 26892 I ExposureNotification: [MatchingTracer] Successfully removed request. Matching job complete. [CONTEXT service_id=236 ]

=> This looks like, CWA received the intent from ENS, and is preparing to start ExposureStateUpdateWorker.
04-23 08:09:10.503 10658 10720 D WM-PackageManagerHelper: androidx.work.impl.background.systemalarm.RescheduleReceiver enabled

=> And again, 2 jobs are scheduled?
04-23 08:09:10.515 10658 10720 D WM-SystemJobScheduler: Scheduling work ID 730f60d8-f129-4941-a729-da05465f962e Job ID 49607
04-23 08:09:10.531 10658 10720 D WM-SystemJobScheduler: Scheduling work ID 730f60d8-f129-4941-a729-da05465f962e Job ID 49608
04-23 08:09:10.545 10658 10720 D WM-GreedyScheduler: Starting work for 730f60d8-f129-4941-a729-da05465f962e
04-23 08:09:10.547 10658 10720 D WM-Processor: Processor: processing 730f60d8-f129-4941-a729-da05465f962e
04-23 08:09:10.565 10658 10658 D WM-WorkerWrapper: Starting work for de.rki.coronawarnapp.nearby.ExposureStateUpdateWorker
04-23 08:09:10.569 10658 10704 D WM-WorkerWrapper: de.rki.coronawarnapp.nearby.ExposureStateUpdateWorker returned a Success {mOutputData=Data {}} result.

=> But the exposure state update was a success 🎉 
04-23 08:09:10.570 10658 10704 I WM-WorkerWrapper: Worker result SUCCESS for Work [ id=730f60d8-f129-4941-a729-da05465f962e, tags={ de.rki.coronawarnapp.nearby.ExposureStateUpdateWorker } ]
04-23 08:09:10.572 10658 10658 D WM-Processor: Processor 730f60d8-f129-4941-a729-da05465f962e executed; reschedule = false
04-23 08:09:10.583 10658 10704 D WM-GreedyScheduler: Cancelling work ID 730f60d8-f129-4941-a729-da05465f962e
04-23 08:09:10.589 10658 10704 D WM-Processor: Processor stopping background work 730f60d8-f129-4941-a729-da05465f962e
04-23 08:09:10.589 10658 10704 D WM-Processor: WorkerWrapper could not be found for 730f60d8-f129-4941-a729-da05465f962e
04-23 08:09:10.589 10658 10704 D WM-StopWorkRunnable: StopWorkRunnable for 730f60d8-f129-4941-a729-da05465f962e; Processor.stopWork = false

=> why is a new DNS coming from CWA, after the job was just done? 👻 
04-23 08:09:10.601   437  1310 D BandwidthController: [LG DATA] No such appUid: 10200
04-23 08:09:10.601   437  1310 D DnsProxyListener: App 10200 tries DNS query. Accept family:0 protocol:0

@harambasicluka
Copy link
Contributor

@vaubaehn please retest when 2.2 is released.

@vaubaehn
Copy link
Contributor Author

Dear all, looks like Google stopped logging from ENS to syslog (ExposureNotification) completely in recent days (logcat won't show any useful information anymore). Means, CWA's debug log feature seems to be the only thing we can rely on in the future. Triangulation (using CWA's logs together with ExposureNotification syslog) won't be possible anymore.
I hope, my phone only has a hickup after latest ENS update...

@rugk
Copy link

rugk commented May 12, 2021

ENS to syslog (ExposureNotification) completely in recent days (logcat won't show any useful information anymore).

Just as some unrelated background info: This could be related to a vulnerability they had, where they logged too much to that log, see https://www.cvedetails.com/cve/CVE-2021-31815/ (if I got the corect CVE here). It seems they decided to just drop the log completely.

@MikeMcC399
Copy link
Contributor

@rugk

It's reasonable to assume that the change in logging is due to the vulnerability response. Google doesn't seem to have published any direct public response to CVE-2021-31815. Google's release notes are quite high level and don't cover this. So we can only make assumptions.

Google seems to have informed COVID-19 app development partners though, as there is a CWA FAQ article about it under https://www.coronawarn.app/en/faq/#google_security_vulnerability.

@vaubaehn
Copy link
Contributor Author

vaubaehn commented May 18, 2021

UPDATE on this issue

The error is like Heisenberg's Uncertainty Principle: You can either observe the bug, or you log without any bug.

After CWA 2.2.1 arrived on my phone last Friday (🍔 time!), I immediately turned on debug logging. The error did not occur since.
After being on mobile only (wifi turned off) since 12th of May (on which day was the last occurence if the bug), now my quota is going to be exhausted, and I need to turn on wifi again.
This means, after the bug occurs with only one exposure check per day, and with the fix by #2881, it gets quite unlikely to log the error in the near future for me.
I'm adding my compiled exposureLogs here, with last occurence of the bug on May 12th, 14:58 CET with CWA 2.1.:
20210401-20210518_assembled_all-exposure-checks.txt

@d4rken and @harambasicluka I'm offering to upload my debug log so far, and provide the ticket number for further analysis. Please give me a hint, whether you want to look into it. If you don't want to, I'd reset the current log, due to tiny space on my internal storage.
From first self-inspection I found at least one point that might be correlated to the here reported issue:
Duration(now, nextDetectionAt) in DownloadDiagnosisKeysTask.kt is negative most of the time, as nextDetectionAt does not seem to be calculated correctly. NextDetectionAt is lastDetection plus 4 hours, but if only one detection per day takes place (because of metered connection), then this value would become negative, and I don't know what fun wasLastDetectionPerformedRecently is returning in that case...

In general it's the question whether it's worth to further inspect this bug, as it seems to

  • only appear with large diagnosis key files to be downloaded (during times with high incidence rates)
  • only appear with one exposure matching per day (for people on metered/mobile connections)

So we can (still) expect an impact of the bug for

  • people who most of the time need to be on a metered/mobile connection (e.g., refugees, farming helpers from abroad, people who can't afford landline internet with connected wifi router, etc.)
  • times, when incident rates increase again (i.e., 4th wave due to B.1.617 and B.1.618)

The impact would mean that affected people might get warned with "fresh" diagnosis keys with an unwanted long delay (of > 48 hours in my case, see April 5th to April 7th in my exposure logs, or 42 hours on April 17/18).

Please let me know how to proceed, thank you,
V.

@dsarkar
Copy link
Member

dsarkar commented May 19, 2021

@vaubaehn Thanks for detailed analysis. Forwarded your info to the internal ticket. (On a side note: Maybe this helps you to circumvent the uncertainty principle ;-) https://www.eurekalert.org/pub_releases/2021-05/au-etu050521.php and https://www.nist.gov/news-events/news/2021/05/nist-team-directs-and-measures-quantum-drum-duet )

@d4rken
Copy link
Member

d4rken commented May 25, 2021

@d4rken and @harambasicluka I'm offering to upload my debug log so far, and provide the ticket number for further analysis. Please give me a hint, whether you want to look into it. If you don't want to, I'd reset the current log, due to tiny space on my internal storage.

If you can reproduce the problematic behavior, please upload a log, i.e. unexpectedly skipped exposure checks or duplicate key submissions (according to ENS hashes).

I suspect a relationship with #3093 where the logs show that I checked show various timeouts when communicating with the ENS.
Some kind of combination of the timeout letting the CWA assume that the operation was not successful, but despite the timeout, the ENS still performed the calculation.

Not sure whether the hourly packages make the issue more likely (more provideDiagnosisKeys calls) or less likely (smaller packages) 🤔.

@vaubaehn
Copy link
Contributor Author

vaubaehn commented Jul 14, 2021

Hi @dsarkar ,

The error is like Heisenberg's Uncertainty Principle: You can either observe the bug, or you log without any bug.

On a side note: Maybe this helps you to circumvent the uncertainty principle ;-) https://www.eurekalert.org/pub_releases/2021-05/au-etu050521.php and https://www.nist.gov/news-events/news/2021/05/nist-team-directs-and-measures-quantum-drum-duet

thanks for providing these links, they were indeed rather helpful! I was able to turn my old phone into a quantum computing mobile by inserting aluminum membranes that were cut out from some old aluminium foil into the phone. I put the phone into a black box, vacuumed and supra-cooled it. To create quantum states in the aluminium membranes, I simply used our kitchen's microwave oven.
As you guess, this all took a while, that's why I'm repsonding late.
But I was successful: My phone is faster than ever, and I could produce

  • no error log with no bugs
  • one error log with no bugs
  • no error log with bugs
  • and one error log with bugs

all at the same time! That's fantastic!
But I still couldn't figure out yet, on how to export the (no) logs...
There are also other caveats: Doing phone calls is a bit hard, as I need to shout through the vacuumed box. Also the box is quite cold what hurts on my ears. Then there is my wife complaining about the high energy costs resulting from supra-cooling and that I took her microwave to my cellar hobby room... If you also have some ideas on how to circumvent theses problems, I'd be happy to hear from you!
😉


Hi @d4rken ,

If you can reproduce the problematic behavior, please upload a log, i.e. unexpectedly skipped exposure checks or duplicate key submissions (according to ENS hashes).

Despite from what I wrote above, I was not able to reproduce the issue under the current conditions (low incidence rate and "my" CWA now doing hourly updates also on non-metered connection).

Not sure whether the hourly packages make the issue more likely (more provideDiagnosisKeys calls) or less likely (smaller packages) 🤔.

I'm rather sure that the issue is more likely on bigger packages only, as I only could observe it when incidence rates were high.

I suspect a relationship with #3093 where the logs show that I checked show various timeouts when communicating with the ENS.
Some kind of combination of the timeout letting the CWA assume that the operation was not successful, but despite the timeout, the ENS still performed the calculation.

Yes, that's very likely and fits well together with my previous observations: Most of processor time ENS is consuming is for validating the DK packages against the signature (see logcat: #2880 (comment)). In case this blocks ENS for quite a bit of time, then communication timeouts are very likely. It's likely that the time ENS spends on validating signature is proportional to DK package size: higher incidence rates -> more keys -> more time needed for validating -> timeout more likely.
I'm also still not convinced that the issue does not arise because of 2 workers started in parallel, and the one, that could successfully do the job, is cancelled as a sibling.

But if that's the case, then there is still the question, why the times CWA retries on providing DKs again are so weird? It should be one hour later (next worker starting time), but sometimes it's just a couple of minutes later and sometimes 6 hours later...
I assume still some (additional) problems with timestamp calculations/conditions, see also #2880 (comment) (the paragraph with Duration(now, nextDetectionAt)...)

In the last two weeks the number of DKs doubled because of the delta variant in Europe, we're now around 100.000 keys again. Given r-value stays same, in about 5 weeks we will be back at the times, when this issue occured (DKs > 500.000).
Do you feel like waiting until then, or should we look into it again before?

@Ein-Tim
Copy link
Contributor

Ein-Tim commented Jul 14, 2021

@vaubaehn

Unbelievable what you all do (to your phone), just to improve the Corona-Warn-App 😉

Maybe the issues with your phone you mention in #3692 (comment) don't happen because you switch between languages but because you put aluminum foil into it? 🤔😉

@vaubaehn
Copy link
Contributor Author

vaubaehn commented Nov 3, 2021

The number of diagnosis keys is increasing much currently. Probably within a few days we'll reach >500k keys.
With hourly update schema there is seemingly no problem at the moment, though I found a 12-minute-difference between risk card timestamp and ENS log timestamp yesterday, which is surprising as the timeout limit is 8 minutes afair...

Somewhere in the next days I will run my phone on mobile data only, then let's see if the issue here is still present.

@ndegendogo
Copy link

a 12-minute-difference between risk card timestamp and ENS log timestamp yesterday, which is surprising as the timeout limit is 8 minutes afair...

... not knowing anything about the Google / Android / ENS specifications ...
... just asking stupid questions that come to my mind ...

@vaubaehn
Are you sure that this "8-minute timeout limit" is measured in true time?
An alternative could be CPU time?
So the clock would wait while the thread is suspended?

@thomasaugsten
Copy link
Member

Real time but if there are device above 8min we can increase the timeout

@vaubaehn
Copy link
Contributor Author

vaubaehn commented Nov 5, 2021

Hi @thomasaugsten

Real time but if there are device above 8min we can increase the timeout

Thanks for taking care, but after thinking a second time, I'm not sure if increasing the timeout is necessary.
My understanding of the process (simplified) is this:

  • CWA decides that it's time to download DKs, and does so (->1. relevant timeout: download process. If timeout occurs (e.g. poor connectivity), DKs need to be downloaded again).
  • CWA provides diagnosis keys to ENS and waits for a response (-> 2. relevant timeout: provisioning process. If timout occurs (e.g., heavy system load), CWA would need to provide DKs again).
  • Then CWA sits back and waits for an intent that ENS finished to check DKs for any risky encounter (-> no timeout involved, afair. If system process is killed while ENS is still doing its work, then updated ExposureWindows should be retrieved/risk calculation should be perfomed with the next instantiating of the app? Or upon next provisioning of DKs? This could be a vulnerable point, if CWA never gets an intent that ENS is finished nor actively requests ExposureWindows for risk calculation).
  • When CWA gets the intent that ENS is finished, it requests the ExposureWindows and performs risk calculation. (-> vulnerable point: see previous bullet point).

So, if CWA can reliably get the Exposure Windows at any time, nothing would need to be changed here. At least the 2 different timeouts that I know would not change anything in the case of my reported time difference of 12 minutes between providing DKs and ENS finishing check.

@vaubaehn
Copy link
Contributor Author

Hi everybody, a short intermediate result:
There's good news and bad news... Bad news first: we broke the 1M-keys boundary for the full-day diagnosis keys package last night: at 1:20 my phone matched 1.036.393 diagnosis keys.
For the good news: since November 4, I could not reproduce the issue reported anymore while my phone was running on 24/7 mobile data.
My quota is finished for now, I will repeat again with fresh quota in middle December. If the matching works as smooth as now at that time, I'll consider the issue to be solved.
See you then!

@Ein-Tim
Copy link
Contributor

Ein-Tim commented Mar 19, 2022

I guess #4929 will also drastically improve the situation here, or @vaubaehn?

@vaubaehn
Copy link
Contributor Author

vaubaehn commented Mar 21, 2022

@Ein-Tim

I guess #4929 will also drastically improve the situation here, or @vaubaehn?

Hard to guess, but an improvement may be possible. A proof of what actually caused CWA to provide exactly the same DK packages several times again has never been found.
There are good reasons to think about that syncing all packages failed due to problems in ENS to handle them properly and to send back a response to CWA in time.
When now only the ENF delta is provided after the fix with #4929, it means that on continous mobile connection only one day package will be downloaded and provided to ENF instead of 10 (end of last year: 14). When timeouts have been the root cause for the observed issue, then yes, the hot fix may mitigate it.

Unfortunately I currently can't let my phone run on mobile connection solely for at least a week to confirm whether the issue is fixed by #4929. I may try at a later time. For now I'm closing this issue.
Thanks to everyone involved here 👋 ❤️

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working mirrored-to-jira This item is also tracked internally in JIRA
Projects
None yet
Development

No branches or pull requests