engineresolver: bootstrap failure b/c of lookup w/ canceled context #2544

Closed
bassosimone opened this issue Sep 29, 2023 · 1 comment
Assignees: bassosimone
Labels: bug (Something isn't working), needs investigation (This issue needs extra data and investigation), ooni/probe-engine, priority/high, releaseBlocker (This issue blocks releasing)

Comments

bassosimone (Contributor) commented Sep 29, 2023

TL;DR This issue is about a bug where the engineresolver performs a lookup even when the context has been canceled. This leads the component to believe that a given resolver is broken when, in fact, a parent-enforced timeout has just occurred. It is the root cause preventing ooniprobe from bootstrapping in certain scenarios, including the one described below. There is also a secondary issue in the code organization (discussed at the end).
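To make the failure mode concrete, here is a minimal, self-contained Go sketch (not the actual engineresolver code) showing how a lookup performed with an already-expired parent context fails with a timeout-flavored error that is easy to misattribute to the resolver:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// A parent context whose deadline has already expired, as in the
	// bootstrap scenario described in this issue.
	ctx, cancel := context.WithTimeout(context.Background(), time.Nanosecond)
	defer cancel()
	time.Sleep(time.Millisecond) // make sure the deadline has passed

	// The lookup fails immediately, but not because the resolver is broken.
	_, err := net.DefaultResolver.LookupHost(ctx, "www.cloudflare.com")
	fmt.Println("lookup error:", err)

	// Naively scoring this failure against the resolver conflates a
	// parent-enforced timeout with a genuinely broken resolver.
	if err != nil && ctx.Err() == nil {
		fmt.Println("the resolver itself failed")
	} else {
		fmt.Println("parent context expired: do not blame the resolver")
	}
}
```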

Scenario Consider the case where OONI is doing a cold bootstrap. In such a case, the $HOME/.ooniprobe/engine/ directory is empty, so we have no information about which resolvers are working as intended. Let us also assume that all the DNS-over-HTTPS resolvers we may want to use are censored. In such a case, the bootstrap would fail during the iplookup.

In this fictional scenario, you would see the following logs, which I have reordered and reorganized for readability to explain what happens. I will guide you, as well as myself, through them.

We start by attempting a lookup using the cloudflare method. Below, I have reordered the logs (which appear out of order because different goroutines may be scheduled at different times) to group related lookups together, and I have removed irrelevant logs that would hurt readability and the ability to understand what happens:

[      0.279993] <info> iplookup: using cloudflare... started

[      0.279918] <info> sessionresolver: lookup#1 www.cloudflare.com using https://dns.quad9.net/dns-query... started
[      0.303753] <info> sessionresolver: lookup#1 www.cloudflare.com using https://dns.quad9.net/dns-query... connection_refused

[      0.304111] <info> sessionresolver: lookup#2 www.cloudflare.com using https://dns.google/dns-query... started
[      0.325122] <info> sessionresolver: lookup#2 www.cloudflare.com using https://dns.google/dns-query... connection_refused

[      0.325293] <info> sessionresolver: lookup#3 www.cloudflare.com using https://mozilla.cloudflare-dns.com/dns-query... started
[      0.352242] <info> sessionresolver: lookup#3 www.cloudflare.com using https://mozilla.cloudflare-dns.com/dns-query... connection_refused

[      0.352597] <info> sessionresolver: lookup#4 www.cloudflare.com using http3://cloudflare-dns.com/dns-query... started
[      0.853615] <info> sessionresolver: lookup#4 www.cloudflare.com using http3://cloudflare-dns.com/dns-query... in progress
[      4.353289] <info> sessionresolver: lookup#4 www.cloudflare.com using http3://cloudflare-dns.com/dns-query... generic_timeout_error

[      4.353522] <info> sessionresolver: lookup#5 www.cloudflare.com using http3://dns.google/dns-query... started
[      4.854381] <info> sessionresolver: lookup#5 www.cloudflare.com using http3://dns.google/dns-query... in progress
[      7.280025] <info> sessionresolver: lookup#5 www.cloudflare.com using http3://dns.google/dns-query... generic_timeout_error

[      7.280284] <info> sessionresolver: lookup#6 www.cloudflare.com using https://cloudflare-dns.com/dns-query... started
[      7.280443] <info> sessionresolver: lookup#6 www.cloudflare.com using https://cloudflare-dns.com/dns-query... generic_timeout_error

[      7.280683] <info> sessionresolver: lookup#8 www.cloudflare.com using http3://mozilla.cloudflare-dns.com/dns-query... started
[      7.281053] <info> sessionresolver: lookup#8 www.cloudflare.com using http3://mozilla.cloudflare-dns.com/dns-query... generic_timeout_error

[      7.281274] <info> sessionresolver: lookup#9 www.cloudflare.com using system:///... started
[      7.281305] <info> sessionresolver: lookup#9 www.cloudflare.com using system:///... generic_timeout_error

[      7.279903] <info> iplookup: using cloudflare... Get "https://www.cloudflare.com/cdn-cgi/trace": generic_timeout_error

So, overall, the lookup lasts seven seconds. This is because we have a seven-second timeout, described in a code comment as a reliability fix, which was introduced by ooni/probe-cli#678 to address issues #1988 and #2013.

You see that lookup#1, lookup#2, and lookup#3 all fail with connection_refused, while lookup#4 fails with a generic_timeout_error caused by the four-second per-lookup timeout.

But lookup#5 and lookup#6 fail before hitting the four-second mark because of the overall seven-second timeout that bounds the total runtime of the whole iplookup operation. Here we are clearly witnessing a conflict between different timeout policies. (See the end for more thoughts about the timeout conflict.)
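The interaction between the two deadlines can be reproduced with a small Go sketch. The names (`ipLookup`, `lookupWith`) and the blackhole simulation are hypothetical; only the four-second per-lookup and seven-second overall timeouts come from the logs above:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// lookupWith is a hypothetical stand-in for a single DNS-over-HTTPS lookup
// against a blackholed resolver: it never answers before the deadline.
func lookupWith(ctx context.Context, resolverURL string) error {
	select {
	case <-time.After(5 * time.Second):
		return errors.New("generic_timeout_error")
	case <-ctx.Done():
		return ctx.Err()
	}
}

func ipLookup(ctx context.Context) error {
	// Outer bound on the whole iplookup operation (the seven-second
	// "reliability fix" mentioned above).
	ctx, cancel := context.WithTimeout(ctx, 7*time.Second)
	defer cancel()
	resolvers := []string{
		"https://dns.quad9.net/dns-query",
		"https://dns.google/dns-query",
		"system:///",
	}
	for _, reso := range resolvers {
		// Inner per-lookup bound: four seconds. Once the outer deadline
		// has passed, this child context is already expired, so the
		// remaining lookups fail instantly with a timeout-looking error.
		lctx, lcancel := context.WithTimeout(ctx, 4*time.Second)
		err := lookupWith(lctx, reso)
		lcancel()
		fmt.Printf("%s: %v\n", reso, err)
		if err == nil {
			return nil
		}
	}
	return errors.New("all resolvers failed")
}

func main() {
	fmt.Println(ipLookup(context.Background()))
}
```

Running this, the first lookup consumes its full four seconds, the second is cut short by the outer deadline, and the third (the system resolver) fails instantly even though it was never really tried.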

Also, lookup#8 and lookup#9 have timestamps after the log line where the iplookup component tells us it has given up, which is another instance of logs being slightly out of order due to the goroutine scheduling mentioned above.

However, and this is particularly important, lookup#9 uses the system resolver, which I manually tested and know to be working on this fictional network. Unfortunately, because the engineresolver component schedules lookups even when its context deadline has expired, the engineresolver concludes that the system resolver is not working as intended and marks it as broken. Had that not happened, we would at least have been able to proceed with the bootstrap. This is what led me to label this issue as the root cause of the bootstrap failing in this scenario.
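A hedged sketch of the guard that would avoid this misattribution, using hypothetical names and illustrative score constants rather than the real engineresolver API:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// recordOutcome is a hypothetical stand-in for the engineresolver score
// update: it must not count a failure against a resolver when the caller's
// context was already canceled or past its deadline.
func recordOutcome(ctx context.Context, scores map[string]float64, resolverURL string, err error) {
	if err != nil && ctx.Err() != nil {
		// Parent-enforced timeout or cancellation: this says nothing
		// about the resolver's health, so leave its score alone.
		return
	}
	if err != nil {
		scores[resolverURL] *= 0.1 // demote (illustrative factor, not the real one)
		return
	}
	scores[resolverURL] = scores[resolverURL]*0.9 + 0.1 // promote (illustrative)
}

func main() {
	scores := map[string]float64{"system:///": 0.5}
	ctx, cancel := context.WithTimeout(context.Background(), time.Nanosecond)
	defer cancel()
	time.Sleep(time.Millisecond) // the overall deadline has already expired
	recordOutcome(ctx, scores, "system:///", errors.New("generic_timeout_error"))
	fmt.Println(scores["system:///"]) // still 0.5: the system resolver is not blamed
}
```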

If you're curious, the bootstrap ends like this:

[     28.288537] <info> iplookup: using stun_google... sessionresolver: LookupHost failed: [ <https://mozilla.cloudflare-dns.com/dns-query> connection_refused; <https://dns.google/dns-query> connection_refused; <https://dns.quad9.net/dns-query> connection_refused; <http3://cloudflare-dns.com/dns-query> generic_timeout_error; <http3://dns.google/dns-query> generic_timeout_error; <https://cloudflare-dns.com/dns-query> generic_timeout_error; <http3://mozilla.cloudflare-dns.com/dns-query> generic_timeout_error; <system:///> generic_timeout_error;]

Also, this is the content of sessionresolver.state after this run:

[
  {
    "URL": "https://mozilla.cloudflare-dns.com/dns-query",
    "Score": 7.424810779254613E-13
  },
  {
    "URL": "https://dns.google/dns-query",
    "Score": 8.105952955347611E-13
  },
  {
    "URL": "https://dns.quad9.net/dns-query",
    "Score": 8.624602747864999E-13
  },
  {
    "URL": "http3://cloudflare-dns.com/dns-query",
    "Score": 5.73807306573324E-13
  },
  {
    "URL": "http3://dns.google/dns-query",
    "Score": 4.0903976870662584E-13
  },
  {
    "URL": "https://cloudflare-dns.com/dns-query",
    "Score": 1.0603838730495632E-13
  },
  {
    "URL": "http3://mozilla.cloudflare-dns.com/dns-query",
    "Score": 2.1736065526426893E-14
  },
  {
    "URL": "system:///",
    "Score": 0
  }
]
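For illustration, here is a small Go sketch that reads an abridged copy of this state and orders the resolvers, assuming (as the behavior above suggests) that higher scores are preferred; with the system resolver at score 0, it ends up last even though it is the only resolver that works on this network:

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// resolverInfo mirrors the entries shown in sessionresolver.state above.
type resolverInfo struct {
	URL   string
	Score float64
}

func main() {
	// Abridged copy of the state dumped above.
	data := []byte(`[
	  {"URL": "https://dns.quad9.net/dns-query", "Score": 8.624602747864999E-13},
	  {"URL": "https://mozilla.cloudflare-dns.com/dns-query", "Score": 7.424810779254613E-13},
	  {"URL": "system:///", "Score": 0}
	]`)
	var entries []resolverInfo
	if err := json.Unmarshal(data, &entries); err != nil {
		panic(err)
	}
	// Assuming the engine prefers higher scores, the working system
	// resolver sorts last because its score was zeroed out.
	sort.Slice(entries, func(i, j int) bool { return entries[i].Score > entries[j].Score })
	for _, e := range entries {
		fmt.Printf("%-45s %g\n", e.URL, e.Score)
	}
}
```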

Code organization considerations We have several blocks of code that run synchronously and either return a result to the caller or time out; model.Resolver is a classic example. Because of this code organization (which, in my defense, looks natural in Go and works fine when the network is not working against you), we need to enforce timeouts inside several code blocks to keep the overall runtime bounded. It should be investigated whether an organization based on pipelines that return information to callers through channels would be more robust and require fewer timeouts. I am leaving this comment here to document my thoughts while analyzing this issue, but this is probably a separate task that deserves its own issue.
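A minimal sketch of the pipeline idea, with hypothetical names: the lookup runs in its own goroutine and reports back over a channel, so only the caller needs to enforce a deadline:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// lookupResult is what a pipeline stage sends back to its caller.
type lookupResult struct {
	addrs []string
	err   error
}

// asyncLookup is a sketch of the pipeline idea: the lookup runs in its own
// goroutine and delivers the result over a channel, so the caller decides
// how long to wait without the callee enforcing its own timeout.
func asyncLookup(ctx context.Context, host string) <-chan lookupResult {
	out := make(chan lookupResult, 1) // buffered: the goroutine never blocks
	go func() {
		// Stand-in for a real resolver call.
		time.Sleep(100 * time.Millisecond)
		out <- lookupResult{addrs: []string{"104.16.132.229"}, err: nil}
	}()
	return out
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	select {
	case res := <-asyncLookup(ctx, "www.cloudflare.com"):
		fmt.Println(res.addrs, res.err)
	case <-ctx.Done():
		fmt.Println("caller-enforced deadline:", ctx.Err())
	}
}
```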

I covered this issue in more detail in #2545.

@bassosimone bassosimone added bug Something isn't working priority/high needs investigation This issue needs extra data and investigation ooni/probe-engine releaseBlocker This issue blocks releasing labels Sep 29, 2023
@bassosimone bassosimone self-assigned this Sep 29, 2023
@bassosimone bassosimone changed the title engineresolver: don't lookup if context is canceled engineresolver: bootstrap failure because of lookup w/ canceled context Sep 29, 2023
@bassosimone bassosimone changed the title engineresolver: bootstrap failure because of lookup w/ canceled context engineresolver: bootstrap failure b/c of lookup w/ canceled context Sep 29, 2023
bassosimone added a commit to ooni/probe-cli that referenced this issue Oct 9, 2023
Not logging when an operation is starting complicates reasoning
about how much time it took to fail, which is something that caused
issues investigating ooni/probe#2544.
bassosimone added a commit to ooni/probe-cli that referenced this issue Oct 9, 2023
Not logging when an operation is starting complicates reasoning about
how much time it took to fail, which is something that caused issues
investigating ooni/probe#2544.
bassosimone added a commit to ooni/probe-cli that referenced this issue Oct 9, 2023
This diff ensures that the bootstrap is more robust in the scenario
described by ooni/probe#2544.

Because we're close to a release, I did not want to write a heavy
refactor of the bootstrap, and specifically of the iplookup part, but
I documented what is needed at ooni/probe#2551.

This diff has been tested in a problematic network under two
distinct conditions:

1. cold bootstrap (i.e., w/o prior knowledge), where the correct
fix is to give the system resolver intermediate priority;

2. hot bootstrap with fscked up config, where the correct fix
is to use a large coarse grained timeout.

The fix that prevents marking a resolver as failed when the context
has been canceled helps in both scenarios.

As you can see, I removed a "reliability fix", which actually was
more of an optimization. This removal means that the probe bootstrap
is now slower than it could be, hence my choice of documenting
what I'd do had I had more time to work on this topic.
bassosimone added a commit to ooni/probe-cli that referenced this issue Oct 9, 2023
This diff ensures that the bootstrap is more robust in the scenario
described by ooni/probe#2544.

Because we're close to a release, I did not want to write a heavy
refactor of the bootstrap, and specifically of the iplookup part, but I
documented what is needed at ooni/probe#2551.

This diff has been tested in a problematic network under two distinct
conditions:

1. cold bootstrap (i.e., w/o prior knowledge), where the correct fix is
to give the system resolver intermediate priority (i.e., 0.5);

2. hot bootstrap with very f***** up `sessionresolver.state`, where the
correct fix is to use a larger coarse grained timeout, such that
_eventually_ we try the system resolver.

The fix that prevents marking a resolver as failed when the external context
has been canceled is required in both of these scenarios.

As you can see, I removed a "reliability fix", which actually was more
of an optimization. This removal means that the probe bootstrap is now
slower than it could be, hence my choice of documenting in
ooni/probe#2551 what I'd do had I had more
time to work on this topic.

BTW, I had to create an overall 45-second timeout for IP lookups
because we have 7 DNS-over-HTTPS resolvers plus the system resolver.
This means that, in the worst case where the system resolver has the
lowest priority, we expect 7*4 = 28 seconds of timeouts before eventually
using the system resolver. The rest of the 45-second budget accounts for
operations happening after the DNS lookup has completed.
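The budget arithmetic, spelled out as a tiny Go sketch (the constants come from the commit message above; the code itself is only illustrative):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		dohResolvers     = 7               // DNS-over-HTTPS resolvers configured
		perLookupTimeout = 4 * time.Second // per-resolver budget
	)
	worstCaseDNS := dohResolvers * perLookupTimeout // 28s before the system resolver runs
	overall := 45 * time.Second                     // overall iplookup budget from the fix
	fmt.Println("worst-case DNS phase:", worstCaseDNS)
	fmt.Println("margin for the rest of the lookup:", overall-worstCaseDNS) // 17s
}
```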
bassosimone (Contributor, Author) commented Oct 9, 2023

I committed a hotfix in ooni/probe-cli#1351 and documented a more comprehensive fix in #2551.

We can now mark this issue as closed.

Murphy-OrangeMud pushed commits to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024 (the same commit messages quoted above).