
Switch from NVD json feeds to API #328

Merged · 30 commits · Apr 24, 2023

Conversation

adamjanovsky (Collaborator) commented Mar 30, 2023

Closes #324

TODO

  • Resolve matching to CVEs with configurations -- currently, CVEDataset only contains matching criteria for such configurations. A special lookup dictionary must be built to address this (see the sketch below)
  • Check that the workflow for fetching the datasets, for CPE and CVE matching actually works
  • Start using cached CPEs again (makes no sense since CVEDataset no longer works with them)
  • Apply one full run for sanity check. Compare number of detected CPEs and CVEs
    • Just download certs, process auxiliary datasets, compute_cpe_heuristics, compute_related_cves
  • Unify logging during downloads and heuristics processing
  • Rewrite old tests to account for new fields in the objects
  • Write new tests to test the dataset builder
  • Run all notebooks -- update necessary methods
  • Update docs with the NVD API key description. Describe how the data is being pulled, etc.
  • Investigate RoCA CVEs and also other cases, see note below.
  • Write docstrings
  • Profile import times. Can we improve?

Also, it may be valuable to put up a list of expected CVEs and their matches. Maybe we could collect it on Trello. I don't think that we want to run these tests on each commit (so I'll disable them in CI/CD), but it may be a good idea to run them when touching CVE/CPE matching.
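As a rough illustration of the lookup dictionary mentioned in the first TODO item, the sketch below expands NVD match-criteria identifiers into concrete CPE URIs, following the layout of the CPE Match API 2.0 response. The function name is made up and this is not the actual sec-certs implementation, just a sketch of the idea:

from typing import Dict, Set

def build_criteria_lookup(cpe_match_feed: dict) -> Dict[str, Set[str]]:
    """Map each matchCriteriaId to the set of concrete CPE URIs it expands to."""
    lookup: Dict[str, Set[str]] = {}
    for entry in cpe_match_feed.get("matchStrings", []):
        match_string = entry["matchString"]
        # "matches" is absent when no deployed CPE satisfies the criteria
        cpe_uris = {m["cpeName"] for m in match_string.get("matches", [])}
        lookup[match_string["matchCriteriaId"]] = cpe_uris
    return lookup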

Endpoints to use:

New tests

  • Some requests with API handler
  • Matching of complex criteria
  • Pruning to CPEs of interest
  • Parsing dictionary of vulnerable configurations
  • CVEDataset correctly handles criteria configuration
  • Datasets can be downloaded from seccerts.org
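A test along the lines of the "parsing dictionary of vulnerable configurations" item could look like the hedged sketch below; the feed contents are illustrative and it reuses the hypothetical build_criteria_lookup helper from the earlier sketch, not the repository's actual test code:

def test_criteria_lookup_parses_vulnerable_configurations():
    feed = {
        "matchStrings": [
            {
                "matchString": {
                    "matchCriteriaId": "36FBCF0F-8CEE-474C-8A04-5075AF53FAF4",
                    "criteria": "cpe:2.3:o:vendor:product:*:*:*:*:*:*:*:*",
                    "matches": [{"cpeName": "cpe:2.3:o:vendor:product:1.0:*:*:*:*:*:*:*"}],
                }
            }
        ]
    }
    lookup = build_criteria_lookup(feed)
    assert lookup["36FBCF0F-8CEE-474C-8A04-5075AF53FAF4"] == {
        "cpe:2.3:o:vendor:product:1.0:*:*:*:*:*:*:*"
    }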

Notes:

  • This doesn't seem to hurt peak RAM usage; we still peak at ~8 GB during CVE matching
  • Serialized datasets can take up to 2 GB uncompressed. The compression ratio is approx. 10.
  • In total, we identified 21060 vulnerabilities in 348 vulnerable certificates.
  • The following snippet runs in approx. 26 minutes on my laptop (with everything already downloaded)
from sec_certs.dataset import CCDataset

cc_dset = CCDataset(root_dir="/Users/adam/phd/projects/certificates/sec-certs/datasets/cc")
cc_dset.get_certs_from_web()       # scrape the certificate pages
cc_dset._prepare_cpe_dataset()     # build the CPE auxiliary dataset
cc_dset._prepare_cve_dataset()     # build the CVE auxiliary dataset
cc_dset._prepare_cpe_match_dict()  # build the CPE match dictionary
cc_dset.compute_cpe_heuristics()   # match certificates to CPEs
cc_dset.compute_related_cves()     # match CPEs to CVEs

codecov bot commented Mar 30, 2023

Codecov Report

Patch coverage: 74.42% and project coverage change: +0.84% 🎉

Comparison is base (5893352) 76.61% compared to head (d4825d1) 77.44%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #328      +/-   ##
==========================================
+ Coverage   76.61%   77.44%   +0.84%     
==========================================
  Files          51       52       +1     
  Lines        6372     6572     +200     
==========================================
+ Hits         4881     5089     +208     
+ Misses       1491     1483       -8     
Impacted Files Coverage Δ
src/sec_certs/sample/fips.py 86.34% <0.00%> (-0.27%) ⬇️
src/sec_certs/utils/pandas.py 0.00% <ø> (ø)
src/sec_certs/dataset/dataset.py 52.22% <21.43%> (-9.34%) ⬇️
src/sec_certs/dataset/cpe.py 73.98% <64.11%> (+18.71%) ⬆️
src/sec_certs/utils/nvd_dataset_builder.py 82.68% <82.68%> (ø)
src/sec_certs/sample/cve.py 84.04% <85.30%> (+32.25%) ⬆️
src/sec_certs/dataset/cve.py 91.09% <85.49%> (+6.58%) ⬆️
src/sec_certs/sample/cpe.py 91.51% <89.84%> (-1.08%) ⬇️
src/sec_certs/serialization/json.py 84.91% <90.48%> (+0.70%) ⬆️
src/sec_certs/configuration.py 92.46% <100.00%> (+0.79%) ⬆️
... and 6 more

... and 6 files with indirect coverage changes


adamjanovsky (Collaborator, Author) commented Apr 14, 2023

@J08nY Could you please expose CPEDataset, CVEDataset, and the JSON of the CPE match feed somewhere on seccerts.org in compressed form?

The URLs are in the settings:

cpe_latest_snapshot: AnyHttpUrl = Field(
    "https://seccerts.org/cpe/cpe_dataset.json.gz", description="URL for the latest snapshot of CPEDataset."
)
cve_latest_snapshot: AnyHttpUrl = Field(
    "https://seccerts.org/cve/cve_dataset.json.gz", description="URL for the latest snapshot of CVEDataset."
)
cpe_match_latest_snapshot: AnyHttpUrl = Field(
    "https://seccerts.org/cpe/cpe_match_dataset.json.gz",
    description="URL for the latest snapshot of cpe match json.",
)

Feel free to change them as you find fitting.

The CPEDataset and CVEDataset instances can be compressed with to_json(compress=True). The CPE match feed is just a JSON file, so it has to be handled separately.

Basically, now we just have to decide the URLs. Can you do that and change settings keys accordingly?
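For concreteness, producing the compressed snapshots might look roughly like this. Whether the _prepare_* methods return the dataset instances and whether to_json() takes an output path are assumptions on my side, so treat this as a sketch:

from sec_certs.dataset import CCDataset

cc_dset = CCDataset(root_dir="/some/path")
cpe_dset = cc_dset._prepare_cpe_dataset()  # assumed to return the CPEDataset instance
cve_dset = cc_dset._prepare_cve_dataset()  # assumed to return the CVEDataset instance
cpe_dset.to_json("cpe_dataset.json.gz", compress=True)  # output-path argument is an assumption
cve_dset.to_json("cve_dataset.json.gz", compress=True)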

J08nY (Member) commented Apr 14, 2023

Where do I get the CPE match feed? What processing do I need to do to obtain it?

adamjanovsky self-assigned this Apr 15, 2023
adamjanovsky added the enhancement, fips, and cc labels Apr 15, 2023
adamjanovsky (Collaborator, Author) commented Apr 15, 2023

> Where do I get the CPE match feed? What processing do I need to do to obtain it?

If you have a processed dataset available, the JSON should sit in the auxiliary_datasets directory. Otherwise, you can obtain it with _prepare_cpe_match_dict():

def _prepare_cpe_match_dict(self, download_fresh: bool = False) -> dict:

You can either copy the contents of the method, or just create a new dataset at some path and call the method right away. E.g.,

import gzip
import json

from sec_certs.dataset import CCDataset

cc_dset = CCDataset(root_dir="/whatever/path")
cpe_match_dict = cc_dset._prepare_cpe_match_dict()

# gzip.open in "w" mode writes binary, hence the explicit encode
with gzip.open("/path/to/store/cpe_match_dict.json", "w") as handle:
    json_str = json.dumps(cpe_match_dict, indent=4)
    handle.write(json_str.encode("utf-8"))

To get the datasets from NVD, you need to obtain an NVD API key and set the following two keys in your YAML settings:

nvd_api_key: <actual-api-key>
preferred_source_nvd_datasets: "api"
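Independently of sec-certs internals, the key is sent as an HTTP header. A minimal sketch of an authenticated request against NVD's public CVE API 2.0, with endpoint and paging parameters as documented by NVD:

import requests

resp = requests.get(
    "https://services.nvd.nist.gov/rest/json/cves/2.0",
    headers={"apiKey": "<actual-api-key>"},
    params={"startIndex": 0, "resultsPerPage": 2000},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["totalResults"])  # total number of CVEs matching the query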

adamjanovsky (Collaborator, Author) commented Apr 16, 2023

@J08nY

Regarding import time optimization, this post has a nice summary of different approaches that you can use to address this: https://adamj.eu/tech/2023/03/02/django-profile-and-improve-import-time/

I did some profiling. As of now:

(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs.dataset'
python -c 'import sec_certs.dataset'  3.28s user 0.54s system 111% cpu 3.413 total
(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs.sample' 
python -c 'import sec_certs.sample'  1.79s user 0.34s system 125% cpu 1.700 total
(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs.model' 
python -c 'import sec_certs.model'  3.38s user 0.53s system 111% cpu 3.493 total
(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs.utils'
python -c 'import sec_certs.utils'  0.03s user 0.01s system 93% cpu 0.041 total
(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs'      
python -c 'import sec_certs'  0.03s user 0.01s system 93% cpu 0.043 total

I deferred a few imports, see: 88f4630
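The deferral pattern is roughly the following (a generic illustration, not the code from the commit): move the import from module level into the function that needs it, so the cost is paid on first call rather than at import sec_certs time.

def parse_page(html: str):
    # Imported on first call instead of at module import; later calls are
    # cheap because Python caches the module in sys.modules.
    from bs4 import BeautifulSoup

    return BeautifulSoup(html, "html.parser")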

Profiling after:

(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs.dataset'
python -c 'import sec_certs.dataset'  1.48s user 0.28s system 131% cpu 1.343 total
(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs.sample'
python -c 'import sec_certs.sample'  1.47s user 0.29s system 131% cpu 1.336 total
(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs.model'
python -c 'import sec_certs.model'  1.50s user 0.29s system 131% cpu 1.365 total
(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs.utils'
python -c 'import sec_certs.utils'  0.03s user 0.01s system 92% cpu 0.044 total
(venv) ~/phd/projects/certificates/sec-certs  $ time python  -c 'import sec_certs'    
python -c 'import sec_certs'  0.03s user 0.01s system 93% cpu 0.041 total

So, from 3.3 seconds we go to 1.5. Any further reduction would require:

  • Deferring numpy (doable, but needs some thinking, saves 0.1 s or 10%)
  • Deferring BS4 (doable, but needs some thinking, saves 0.1 s or 10%)
  • Deferring pandas (not doable, it is used in typing quite a bit; would save 0.4 s or 30%)
  • Ditching imports from __init__.py files; they eat approx. 35% on their own, i.e., without pandas etc.

I did the profiling with python -X importtime yourfile.py 2> import.log and https://pypi.org/project/tuna/.
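Spelled out for this repo, the measurement loop looks like this (tuna takes the log file directly and opens an interactive, browser-based view of the import tree):

python -X importtime -c 'import sec_certs.dataset' 2> import.log
tuna import.log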

I consider this to be an OK result and I will invest no more effort into this unless you promote the issue.

Edit: Also note that imports executed from within functions should only run once AFAIK; subsequent calls find the module already cached.

J08nY force-pushed the issue/324-Switch-from-NVD-data-feeds-to-API branch from fd241ea to 0c181f6 on April 21, 2023 12:59
J08nY marked this pull request as ready for review April 24, 2023 18:10
J08nY merged commit 8b0600e into main Apr 24, 2023
J08nY deleted the issue/324-Switch-from-NVD-data-feeds-to-API branch April 24, 2023 18:10