
Regular windows test failures #2351

Closed

terriko opened this issue Nov 16, 2022 · 9 comments
Labels
CI Related to our continuous integration service (GitHub Actions)

Comments

terriko (Contributor) commented Nov 16, 2022

We're getting a lot of Windows test failures on PRs now, mostly with errors like the following:

ClientOSError: [WinError 10053] An established connection was aborted by the 
software in your host machine

That honestly looks similar to what happens when a job times out, but it's happening after 3 minutes, so that's not it. It doesn't look like our usual NVD problem where the rate limit gets exceeded, but it could be related to NVD in a different way; I don't really know yet.

Not sure what's up, but I'm filing the issue in case anyone else has any insights or recognizes the message, and to remind me that this needs further investigation.

terriko added the CI label Nov 16, 2022
BreadGenie (Contributor) commented:

Another error that I have seen multiple times was

ServerDisconnectedError: Server disconnected

From a quick Google search it seems like NVD might be rate-limiting us, but the fact that this occurs only on Windows is very strange.

terriko (Contributor, Author) commented Nov 16, 2022

My current theories about why Windows has NVD problems more often:

  1. There may be fewer windows instances available (so more scanning by others happens on the same node)
  2. Folk using windows CI may just scan for vulnerabilities more often (not unreasonable given that people may be preparing for the US executive order that requires government suppliers to prove that their code has no known vulnerabilities, and many suppliers to the US government would be windows-based)

Neither of those is fixable by us directly, and I'm probably not going to stand up non-shared runners, so I'm still thinking we either cache in a way that public PRs can use, or switch to synthetic test data.

I need to go chat with our licensing folk about public caches...

terriko (Contributor, Author) commented Nov 16, 2022

That said, if this is an NVD timeout, we should see if we can detect it, throw a better error message, and let the rest of the tests run. So there's still something for us to do short-term to improve the Windows test experience.

I think @anthonyharrison updated the code in there so I'm sort of surprised it's not firing here.

anthonyharrison (Contributor) commented:

@terriko I have been seeing a lot of NVD failures on Linux as well. I tried increasing the backoff (we currently back off for 3 seconds) to 10 seconds and then 30 seconds to see if it helps (not much!). Watching a full download of the NVD data showed that after about 20 requests, the failures increased. I was wondering if we should limit the number of parallel requests to a much smaller number to see if that helps when downloading a full copy of the database.
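
For illustration, here is a minimal sketch (not the actual cve-bin-tool code) of capping the parallel requests with a semaphore plus a longer backoff; MAX_PARALLEL, BACKOFF_SECONDS and the paging parameters are just guesses to experiment with:

import asyncio

import aiohttp

MAX_PARALLEL = 2      # keep very few requests in flight at once (assumed value)
BACKOFF_SECONDS = 10  # longer than the current 3-second backoff (assumed value)

async def fetch_page(session, semaphore, params):
    # Each page waits for a semaphore slot, then retries with a growing backoff.
    async with semaphore:
        for attempt in range(1, 4):
            try:
                async with session.get(
                    "https://services.nvd.nist.gov/rest/json/cves/2.0", params=params
                ) as response:
                    response.raise_for_status()
                    return await response.json()
            except aiohttp.ClientError:
                await asyncio.sleep(BACKOFF_SECONDS * attempt)
        return None

async def download_all(total_entries, page_size=2000):
    semaphore = asyncio.Semaphore(MAX_PARALLEL)
    async with aiohttp.ClientSession() as session:
        tasks = [
            fetch_page(session, semaphore,
                       {"startIndex": start, "resultsPerPage": page_size})
            for start in range(0, total_entries, page_size)
        ]
        return await asyncio.gather(*tasks)

With something like this the full download would issue at most two requests at a time, which should make it easier to see whether NVD's rate limiting is what is tripping the runners.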

terriko (Contributor, Author) commented Nov 16, 2022

That actually gives us a potential path forwards:

  1. Flag all NVD-related tests. Most of these are already marked as long tests, but they'd need an NVD_TEST flag.
  2. Move them into a separate run of pytest. For anything not required for regular test runs, put them into a separate name so this becomes linux-nvd and windows-nvd or somesuch separate from the other longtests. This will be especially good on windows because you wouldn't need the other special dependencies (conda is slow).
  3. Make sure those tests are run in series or with low parallel threads (e.g. n=2 or something; see the sketch after this list)
  4. Slowly replace all of those tests to use synthetic or cached data only, or remove if they're no longer needed.
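
A rough sketch of what steps 1-3 could look like; the nvd_test marker name and the CI commands below are assumptions, not something we already have:

import pytest

@pytest.mark.nvd_test  # hypothetical marker flagging tests that hit the live NVD API
def test_full_nvd_download():
    ...

# The marker would need to be registered in pytest's configuration, e.g.:
#   [pytest]
#   markers =
#       nvd_test: tests that talk to the live NVD API
#
# CI could then split the runs, keeping the NVD jobs serial or close to it:
#   pytest -m "not nvd_test"        # regular linux / windows jobs
#   pytest -m nvd_test -n 2         # separate linux-nvd / windows-nvd jobs (pytest-xdist)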

terriko (Contributor, Author) commented Nov 16, 2022

Worth noting: most of our test runners start with a single run of the tool that may hit NVD if the cache isn't available. We do in fact want to see that the tool can be installed and run on each platform, so if that turns out to be what's failing after 3 minutes on Windows, this may not help.

anthonyharrison (Contributor) commented Nov 17, 2022

@terriko Looking at the RateLimiter code and the copy on GitHub, I note that the RATE variable is 1, not 10.

I also note that just before RateLimiter is set up, the following code exists:

connector = aiohttp.TCPConnector(limit_per_host=19)

19 seems a strange number! Should this be aligned with the number of tokens in the RateLimiter?

Maybe we should introduce environment variables for RATE, MAX_TOKENS and LIMIT_PER_HOST, and then see if we can track down the issue.

I am sure having the 4 additional data sources, each with its own RateLimiter, will also be having some impact.
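
Something like this minimal sketch could make those knobs tunable; the environment variable names are made up here, and the RateLimiter constructor signature is an assumption since it's our own class:

import os

import aiohttp

RATE = int(os.getenv("NVD_RATE", "1"))                       # tokens refilled per period
MAX_TOKENS = int(os.getenv("NVD_MAX_TOKENS", "10"))          # bucket size
LIMIT_PER_HOST = int(os.getenv("NVD_LIMIT_PER_HOST", "19"))  # current hard-coded value

connector = aiohttp.TCPConnector(limit_per_host=LIMIT_PER_HOST)
# rate_limit = RateLimiter(rate=RATE, max_tokens=MAX_TOKENS)  # constructor signature assumed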

** UPDATE **

I have found an issue :-)

I deleted the database from the cache to force a reload of all of the data. I just got 404 errors from all of the requests using the API. Manually trying the URL, I got the error:

{
"message": "Date range cannot exceed 120 days."
}

So because we now default to incremental update, we look for the date of the database. If the database doesn't exist, the call to get_db_update_date() sets a default date of 1st January 2000 (I wasn't expecting it to get to this bit of the code...). This date is then used in the call to the NVD API, which produces a date range that is too large. So there needs to be a check in the nist_fetch_using_api() function to verify the database exists before passing on the incremental update flag (if it doesn't exist, don't do an incremental update).
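
A minimal sketch of that check as a helper (the name should_do_incremental_update is hypothetical; the real fix in nist_fetch_using_api() may be structured differently):

import os

def should_do_incremental_update(db_path: str, incremental_requested: bool) -> bool:
    # If the cached database file doesn't exist, the "last update" date defaults
    # to 2000-01-01 and the requested range exceeds NVD's 120-day limit, so fall
    # back to a full download instead.
    return incremental_requested and os.path.exists(db_path)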

** UPDATE **

The update to the nist_fetch_using_api() function works. However, with debug on, I see 96 requests issued to NVD (which represents the number of calls needed to get the 190000 entries at 2000 entries per request), followed by lots of 403 (Forbidden) errors and "Failed to connect to NVD: Server disconnected" messages. After 15 minutes, the progress bar reported 16% complete.

** UPDATE **

Looked at the NVD website. I think we should be backing off at least 6 seconds.

I am also seeing "payload is not completed" as an error. There is a GitHub issue for this, and it looks like it is still an active issue.

** FIX **

There was a little bug introduced when the NVD 2.0 API was added which prevented the API key being passed to the API. Fixing this gets a full download of the data in around 90 seconds.
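
For reference, a minimal sketch of passing the key to the 2.0 API, which expects it in the apiKey request header; exactly how the fix in #2355 wires this in may differ:

import aiohttp

async def nvd_request(session, params, api_key=None):
    # Include the API key header only when a key has been configured.
    headers = {"apiKey": api_key} if api_key else {}
    async with session.get(
        "https://services.nvd.nist.gov/rest/json/cves/2.0",
        params=params,
        headers=headers,
    ) as response:
        response.raise_for_status()
        return await response.json()

NVD allows a much higher request rate for keyed requests, which would explain the full download dropping to around 90 seconds once the key actually gets sent.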

#2355 contains the fixes.

There is now a new issue to be aware of: if we have a very old database that is more than 120 days out of date, incremental update won't work. (See #2356)

terriko (Contributor, Author) commented Nov 17, 2022

Phew, thanks for debugging this @anthonyharrison !

terriko (Contributor, Author) commented Nov 17, 2022

Oh, re: RateLimiter. I believe the 19 was empirically defined (which is research paper speak for "someone experimented and that was the number that worked"). Given how much NVD has changed about the rate limits, I will not be shocked if it is no longer the correct number.
