pagination has inconsistent results with both missing data and duplicated data #920
Comments
+1 for fixing this. Even trying to compile a list of all reserved CVE IDs that we published over all years produces wildly inconsistent lists:
$ cve list --state published | tail -n +2 | cut -d' ' -f1 > ~/temp/redhat_cves.txt
$ sort -u ~/temp/redhat_cves.txt | wc -l
6113
$ wc -l ~/temp/redhat_cves.txt
9612
slubar added a commit that referenced this issue (Nov 22, 2022)
jdaigneau5 added a commit that referenced this issue (Nov 22, 2022): #920 turn on debug mode for mongoose
brettp added a commit that referenced this issue (Nov 30, 2022)
brettp added a commit that referenced this issue (Dec 2, 2022)
brettp added a commit that referenced this issue (Dec 2, 2022)
brettp added a commit that referenced this issue (Dec 5, 2022)
brettp added a commit that referenced this issue (Dec 5, 2022)
brettp added a commit that referenced this issue (Dec 5, 2022)
brettp added a commit that referenced this issue (Dec 6, 2022)
brettp added a commit that referenced this issue (Dec 6, 2022)
slubar added a commit that referenced this issue (Dec 6, 2022): #920 chore: remove code no longer necessary to correctly sort
brettp added a commit that referenced this issue (Dec 6, 2022)
jdaigneau5 added a commit that referenced this issue (Dec 6, 2022): #920 chore: change sorting to use cveId instead of _id for /cve endpoint
brettp added a commit that referenced this issue (Dec 6, 2022)
brettp added a commit that referenced this issue (Dec 6, 2022)
brettp added a commit that referenced this issue (Dec 6, 2022)
brettp added a commit that referenced this issue (Dec 6, 2022)
jdaigneau5 added a commit that referenced this issue (Dec 6, 2022): #920 chore: correct sort field for cve endpoint
brettp added a commit that referenced this issue (Dec 7, 2022)
jdaigneau5 added a commit that referenced this issue (Dec 7, 2022): #920 chore: change order of aggregate query for better performance on /cve/
brettp added a commit that referenced this issue (Dec 9, 2022)
jdaigneau5 added a commit that referenced this issue (Dec 9, 2022): #920 chore: remove debugging settings for mongoose
Summary: The production CVE Services endpoints that use pagination, such as GET /cve, produce substantially incorrect results for many realistic API calls. The root cause of the problem is probably not yet understood. The problem has major consequences for multiple Secretariat use cases, and also may disrupt the ability of large CNAs to retrieve a list of their CVE IDs via the GET /cve-id endpoint.
Note that pagination anomalies can also be encountered by people who don't understand the time values for CVE Record pagination. That is a different issue; the issue being reported here is distinct and much worse. The time_modified.lt and time_modified.gt parameters for the GET /cve endpoint are intended to find CVE Records matching values in these data fields:
cve-services/src/model/cve.js
Lines 16 to 18 in 4eab157
These aren't necessarily the same as fields such as datePublished and dateUpdated within:
cve-services/src/model/cve.js
Line 20 in 4eab157
Accordingly, it is typically only useful to select date/time values after JSON 5.0 started to be used in production. If the CVE Record was created by mongoimport but not touched after that, then it does not have a useful time.modified date.
Because of this, one might expect that
and
should find largely the same set of CVE Records (i.e., all records from the Soft Deploy period). In other words, this "gt" series of GET requests:
should collect the same set of CVE Records as this "lt" series of GET requests:
At first glance, the results look approximately correct, because the last valid page in each series is page 8 and they both find exactly 3764 CVE IDs. The first problem is that neither the gt series nor the lt series finds 3764 unique CVE IDs. The number of unique CVE IDs varies on each attempt. For example, one time the lt series had 3539 unique CVE Records, which is 225 fewer than 3764. At that time, the page=7 response had 192 CVE Records that were also part of the page=8 response, the page=2 response had 33 CVE Records that were also part of the page=3 response, etc.
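The duplicate counts described above can be checked mechanically once the CVE IDs from each page have been saved. The following is a minimal sketch, assuming each page response has already been reduced to a list of CVE ID strings; the function name `summarize_pages` and the sample IDs are hypothetical and are not part of CVE Services.

```python
from collections import Counter


def summarize_pages(pages):
    """Summarize duplication across paginated results.

    `pages` is a list of lists of CVE ID strings, one inner list per
    page= value (an assumed, pre-extracted form of the API responses).
    """
    counts = Counter(cve_id for page in pages for cve_id in page)
    total = sum(counts.values())
    unique = len(counts)
    # Number of IDs shared between each page and the page that follows it.
    adjacent_overlap = [
        len(set(a) & set(b)) for a, b in zip(pages, pages[1:])
    ]
    return {
        "total": total,
        "unique": unique,
        "duplicates": total - unique,
        "adjacent_overlap": adjacent_overlap,
    }


# Made-up IDs for illustration only.
pages = [
    ["CVE-2022-0001", "CVE-2022-0002", "CVE-2022-0003"],
    ["CVE-2022-0003", "CVE-2022-0004", "CVE-2022-0005"],
    ["CVE-2022-0005", "CVE-2022-0006", "CVE-2022-0001"],
]
summary = summarize_pages(pages)
print(summary)
```

A traversal with correct pagination should report `duplicates == 0` and an `adjacent_overlap` list of all zeros; any other result indicates the kind of anomaly reported here.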
The second problem is that the two sets of 3764 CVE IDs aren't the same. Specifically, when the lt series had 3539 unique CVE IDs and the gt series had 3674 unique CVE IDs (i.e., 90 fewer than 3764), there were 63 CVE Records only found by the lt requests, and 198 CVE IDs only found by the gt requests. There does not seem to be a clear pattern. For example, one CVE Record was published on 2022-11-01 and then updated on 2022-11-10: it was found only by the gt series. Another CVE Record was published on 2022-11-08 and then updated on 2022-11-09: it was found only by the lt series.
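The comparison between the two series is a plain set difference. The sketch below assumes the IDs from each series are available as Python iterables; `compare_series` and the sample IDs are hypothetical names used only for illustration. It also checks that the counts quoted above are self-consistent: 3539 - 63 and 3674 - 198 both give 3476 CVE IDs found by both series.

```python
def compare_series(lt_ids, gt_ids):
    """Return the IDs found only by one series, and the IDs found by both."""
    lt, gt = set(lt_ids), set(gt_ids)
    return {
        "only_lt": lt - gt,
        "only_gt": gt - lt,
        "common": lt & gt,
    }


# Consistency check on the reported counts:
# unique(lt) - only_lt must equal unique(gt) - only_gt.
assert 3539 - 63 == 3674 - 198 == 3476

# Made-up IDs for illustration only; duplicates within a series are ignored.
lt_series = ["CVE-2022-0001", "CVE-2022-0002", "CVE-2022-0002", "CVE-2022-0003"]
gt_series = ["CVE-2022-0002", "CVE-2022-0003", "CVE-2022-0004"]
result = compare_series(lt_series, gt_series)
print(sorted(result["only_lt"]), sorted(result["only_gt"]), len(result["common"]))
```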
The extent of the problem varies across test runs. For example, it is possible to have 3764 CVE Records but fewer than 2000 unique ones.
This means that there is apparently no way to use https://cveawg.mitre.org/api/cve?time_modified.gt (accompanied by a later page=2 request) that will guarantee that all CVE Records after a certain date are captured.
Any set of found CVE Records may include a few with dateUpdated values before Soft Deploy. This occurs because clients with Secretariat privileges can use PUT /cve/{id} and modify data without bothering to supply correct dateUpdated values (e.g., the manual fix to CVE-2022-32170 during deployment because it didn't comply with the JSON 5 schema). Nobody is doing that routinely.
All of this data was collected at a time of low usage of production CVE Services, and it seems extremely unlikely that someone else created or modified a CVE Record at the moment that the traversal through the eight pages was occurring.
It appears that some or all of the problem also affects GET /cve-id pagination. (GET /org was not tested.) When testing GET /cve-id pagination, it may be necessary to add parameters to avoid a 500 Internal Server Error from CVE Services, such as:
For example, in one case, the same CVE ID was part of the response for seven different page= values. It is possible that, on average, effects on GET /cve-id pagination are less dramatic than effects on GET /cve pagination but this has not been confirmed. Of course, GET /cve-id pagination is very important in the sense that it is available to CNAs, whereas GET /cve pagination is only for the Secretariat (but the effects on Secretariat operations are substantial).
It is possible that GET requests (that cause transactions on the same database) are somehow responsible for the incorrect pagination behavior. There is a substantial volume of GET requests to the production server 24x7x365. Because of this, it may be difficult or impossible to reproduce the problem in a non-production environment (e.g., test or prod-staging) without generating similar fake traffic.
In any case, the current software for implementing pagination (as a way to split up large data requests) clearly does not work correctly, and no part of the CVE Program should be relying on it. It is possible that the problem is in the package mongoose-aggregate-paginate-v2 itself, in how mongoose-aggregate-paginate-v2 is used by CVE Services, in how these interact with DocumentDB (rather than MongoDB), or in another area (e.g., database corruption).
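One mechanism that can produce this signature (a constant total with a varying number of unique records) is skip/limit pagination over a sort key that is not unique, where each page is served by a separate query and rows with equal keys come back in arbitrary order. The sketch below is a simulation under that assumption only; it is not a diagnosis of CVE Services, mongoose-aggregate-paginate-v2, or DocumentDB, and the record layout and function names are hypothetical.

```python
import random


def query(records, sort_key, rng):
    """Simulate one database query.

    The backend is modelled as returning rows in arbitrary order and then
    applying a stable sort on `sort_key`, so rows with equal keys keep
    whatever arbitrary order they arrived in.
    """
    rows = list(records)
    rng.shuffle(rows)
    return sorted(rows, key=sort_key)


def paginate(records, sort_key, page_size, rng):
    """Fetch every page with skip/limit, running a fresh query per page."""
    pages = []
    for skip in range(0, len(records), page_size):
        pages.append(query(records, sort_key, rng)[skip:skip + page_size])
    return pages


def flatten_ids(pages):
    """Collect the ID (first tuple element) of every row on every page."""
    return [row[0] for page in pages for row in page]


# Hypothetical records: (unique ID, coarse timestamp shared by 25 rows).
records = [(f"CVE-2022-{n:04d}", n // 25) for n in range(100)]
rng = random.Random(920)

# Sort on the shared timestamp only: ties are ordered arbitrarily per query.
tied_ids = flatten_ids(paginate(records, lambda r: r[1], 10, rng))
# Sort on (timestamp, ID): the key is unique, so the order is deterministic.
unique_ids = flatten_ids(paginate(records, lambda r: (r[1], r[0]), 10, rng))

print("tied key:   total", len(tied_ids), "unique", len(set(tied_ids)))
print("unique key: total", len(unique_ids), "unique", len(set(unique_ids)))
```

In this model both traversals always return 100 rows in total, but only the traversal with a unique sort key is guaranteed to return each record exactly once.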
Temporary workarounds might include: