
Switch from NVD json feeds to API #328

Merged · 30 commits · Apr 24, 2023

Conversation

adamjanovsky (Collaborator) commented Mar 30, 2023

Closes #324

TODO

  • Resolve matching to CVEs with configurations -- currently, CVEDataset only contains matching criteria for such configurations. A special lookup dictionary must be built to address this (see the sketch below)
  • Check that the workflow for fetching the datasets, for CPE and CVE matching actually works
  • Start using cached CPEs again (makes no sense since CVEDataset no longer works with them)
  • Apply one full run for sanity check. Compare number of detected CPEs and CVEs
    • Just download certs, process auxiliary datasets, compute_cpe_heuristics, compute_related_cves
  • Unify logging during downloads and heuristics processing
  • Rewrite old tests to account for new fields in the objects
  • Write new tests to test the dataset builder
  • Run all notebooks -- update necessary methods
  • Update docs with the NVD API key description. Describe how the data is being pulled, etc.
  • Investigate RoCA CVEs and also other cases, see note below.
  • Write docstrings
  • Profile import times. Can we improve?

Also, it may be valuable to put up a list of expected CVEs and their matches. Maybe we could collect it on Trello. I don't think that we want to run these tests on each commit (so I'll disable them in CI/CD), but it may be a good idea to run them when touching CVE/CPE matching.
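As a rough illustration of the lookup dictionary mentioned in the first TODO item, the sketch below expands NVD match-criteria identifiers into concrete CPE URIs, following the layout of the CPE Match API 2.0 response. The function name is made up and this is not the actual sec-certs implementation, just a sketch of the idea:

from typing import Dict, Set

def build_criteria_lookup(cpe_match_feed: dict) -> Dict[str, Set[str]]:
    """Map each matchCriteriaId to the set of concrete CPE URIs it expands to."""
    lookup: Dict[str, Set[str]] = {}
    for entry in cpe_match_feed.get("matchStrings", []):
        match_string = entry["matchString"]
        # "matches" is absent when no deployed CPE satisfies the criteria
        cpe_uris = {m["cpeName"] for m in match_string.get("matches", [])}
        lookup[match_string["matchCriteriaId"]] = cpe_uris
    return lookup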

Endpoints to use:

New tests

  • Some requests with API handler
  • Matching of complex criteria
  • Pruning to CPEs of interest
  • Parsing dictionary of vulnerable configurations
  • CVEDataset correctly handles criteria configuration
  • Datasets can be downloaded from seccerts.org
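A test along the lines of the "parsing dictionary of vulnerable configurations" item could look like the hedged sketch below; the feed contents are illustrative and it reuses the hypothetical build_criteria_lookup helper from the earlier sketch, not the repository's actual test code:

def test_criteria_lookup_parses_vulnerable_configurations():
    feed = {
        "matchStrings": [
            {
                "matchString": {
                    "matchCriteriaId": "36FBCF0F-8CEE-474C-8A04-5075AF53FAF4",
                    "criteria": "cpe:2.3:o:vendor:product:*:*:*:*:*:*:*:*",
                    "matches": [{"cpeName": "cpe:2.3:o:vendor:product:1.0:*:*:*:*:*:*:*"}],
                }
            }
        ]
    }
    lookup = build_criteria_lookup(feed)
    assert lookup["36FBCF0F-8CEE-474C-8A04-5075AF53FAF4"] == {
        "cpe:2.3:o:vendor:product:1.0:*:*:*:*:*:*:*"
    }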

Notes:

  • This doesn't seem to hurt peak RAM usage; we still peak at ~8 GB during CVE matching
  • Serialized datasets can take up to 2 GB uncompressed. The compression ratio is approx. 10.
  • In total, we identified 21060 vulnerabilities in 348 vulnerable certificates.
  • The following snippet runs in approx. 26 minutes on my laptop (with everything already downloaded)
from sec_certs.dataset import CCDataset

cc_dset = CCDataset(root_dir="/Users/adam/phd/projects/certificates/sec-certs/datasets/cc")
cc_dset.get_certs_from_web()       # scrape the certificate pages
cc_dset._prepare_cpe_dataset()     # build the CPE auxiliary dataset
cc_dset._prepare_cve_dataset()     # build the CVE auxiliary dataset
cc_dset._prepare_cpe_match_dict()  # build the CPE match dictionary
cc_dset.compute_cpe_heuristics()   # match certificates to CPEs
cc_dset.compute_related_cves()     # match CPEs to CVEs

codecov bot commented Mar 30, 2023

Codecov Report

Patch coverage: 74.42% and project coverage change: +0.84% 🎉

Comparison is base (5893352) 76.61% compared to head (d4825d1) 77.44%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #328      +/-   ##
==========================================
+ Coverage   76.61%   77.44%   +0.84%     
==========================================
  Files          51       52       +1     
  Lines        6372     6572     +200     
==========================================
+ Hits         4881     5089     +208     
+ Misses       1491     1483       -8     
Impacted Files Coverage Δ
src/sec_certs/sample/fips.py 86.34% <0.00%> (-0.27%) ⬇️
src/sec_certs/utils/pandas.py 0.00% <ø> (ø)
src/sec_certs/dataset/dataset.py 52.22% <21.43%> (-9.34%) ⬇️
src/sec_certs/dataset/cpe.py 73.98% <64.11%> (+18.71%) ⬆️
src/sec_certs/utils/nvd_dataset_builder.py 82.68% <82.68%> (ø)
src/sec_certs/sample/cve.py 84.04% <85.30%> (+32.25%) ⬆️
src/sec_certs/dataset/cve.py 91.09% <85.49%> (+6.58%) ⬆️
src/sec_certs/sample/cpe.py 91.51% <89.84%> (-1.08%) ⬇️
src/sec_certs/serialization/json.py 84.91% <90.48%> (+0.70%) ⬆️
src/sec_certs/configuration.py 92.46% <100.00%> (+0.79%) ⬆️
... and 6 more

... and 6 files with indirect coverage changes


adamjanovsky (Collaborator, Author) commented Apr 14, 2023

@J08nY Could you please expose CPEDataset, CVEDataset, and the JSON of the CPE match feed somewhere on seccerts.org in compressed form?

The URLs are in the settings:

cpe_latest_snapshot: AnyHttpUrl = Field(
    "https://seccerts.org/cpe/cpe_dataset.json.gz", description="URL for the latest snapshot of CPEDataset."
)
cve_latest_snapshot: AnyHttpUrl = Field(
    "https://seccerts.org/cve/cve_dataset.json.gz", description="URL for the latest snapshot of CVEDataset."
)
cpe_match_latest_snapshot: AnyHttpUrl = Field(
    "https://seccerts.org/cpe/cpe_match_dataset.json.gz",
    description="URL for the latest snapshot of cpe match json.",
)

Feel free to change them as you find fitting.

The CPEDataset and CVEDataset instances can be compressed with to_json(compress=True). The CPE match feed is just a JSON file, so it has to be handled separately.

Basically, now we just have to decide the URLs. Can you do that and change settings keys accordingly?
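For concreteness, producing the compressed snapshots might look roughly like this. Whether the _prepare_* methods return the dataset instances and whether to_json() takes an output path are assumptions on my side, so treat this as a sketch:

from sec_certs.dataset import CCDataset

cc_dset = CCDataset(root_dir="/some/path")
cpe_dset = cc_dset._prepare_cpe_dataset()  # assumed to return the CPEDataset instance
cve_dset = cc_dset._prepare_cve_dataset()  # assumed to return the CVEDataset instance
cpe_dset.to_json("cpe_dataset.json.gz", compress=True)  # output-path argument is an assumption
cve_dset.to_json("cve_dataset.json.gz", compress=True)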

J08nY (Member) commented Apr 14, 2023

Where do I get the CPE match feed? What processing do I need to do to obtain it?

adamjanovsky self-assigned this Apr 15, 2023
adamjanovsky added the enhancement, fips, and cc labels Apr 15, 2023
adamjanovsky (Collaborator, Author) commented Apr 15, 2023

> Where do I get the CPE match feed? What processing do I need to do to obtain it?

If you have a processed dataset available, the JSON should sit in the auxiliary_datasets directory. Otherwise, you can obtain it with _prepare_cpe_match_dict():

def _prepare_cpe_match_dict(self, download_fresh: bool = False) -> dict:

You can either copy the contents of the method, or just create a new dataset at some path and call the method right away. E.g.,

import gzip
import json

from sec_certs.dataset import CCDataset

cc_dset = CCDataset(root_dir="/whatever/path")
cpe_match_dict = cc_dset._prepare_cpe_match_dict()

# gzip.open in "w" mode writes binary, hence the explicit encode
with gzip.open("/path/to/store/cpe_match_dict.json", "w") as handle:
    json_str = json.dumps(cpe_match_dict, indent=4)
    handle.write(json_str.encode("utf-8"))

To get the datasets from NVD, you need to obtain an NVD API key and set the following two keys in your YAML settings:

nvd_api_key: <actual-api-key>
preferred_source_nvd_datasets: "api"
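Independently of sec-certs internals, the key is sent as an HTTP header. A minimal sketch of an authenticated request against NVD's public CVE API 2.0, with endpoint and paging parameters as documented by NVD:

import requests

resp = requests.get(
    "https://services.nvd.nist.gov/rest/json/cves/2.0",
    headers={"apiKey": "<actual-api-key>"},
    params={"startIndex": 0, "resultsPerPage": 2000},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["totalResults"])  # total number of CVEs matching the query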

adamjanovsky (Collaborator, Author) commented Apr 16, 2023

@J08nY

Regarding import time optimization, this post has a nice summary of different approaches that you can use to address this: https://adamj.eu/tech/2023/03/02/django-profile-and-improve-import-time/

I did some profiling. As of now:

(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs.dataset'
python -c 'import sec_certs.dataset'  3.28s user 0.54s system 111% cpu 3.413 total
(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs.sample' 
python -c 'import sec_certs.sample'  1.79s user 0.34s system 125% cpu 1.700 total
(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs.model' 
python -c 'import sec_certs.model'  3.38s user 0.53s system 111% cpu 3.493 total
(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs.utils'
python -c 'import sec_certs.utils'  0.03s user 0.01s system 93% cpu 0.041 total
(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs'      
python -c 'import sec_certs'  0.03s user 0.01s system 93% cpu 0.043 total

I deferred a few imports, see: 88f4630
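The deferral pattern is roughly the following (a generic illustration, not the code from the commit): move the import from module level into the function that needs it, so the cost is paid on first call rather than at import sec_certs time.

def parse_page(html: str):
    # Imported on first call instead of at module import; later calls are
    # cheap because Python caches the module in sys.modules.
    from bs4 import BeautifulSoup

    return BeautifulSoup(html, "html.parser")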

Profiling after:

(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs.dataset'
python -c 'import sec_certs.dataset'  1.48s user 0.28s system 131% cpu 1.343 total
(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs.sample'
python -c 'import sec_certs.sample'  1.47s user 0.29s system 131% cpu 1.336 total
(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs.model'
python -c 'import sec_certs.model'  1.50s user 0.29s system 131% cpu 1.365 total
(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs.utils'
python -c 'import sec_certs.utils'  0.03s user 0.01s system 92% cpu 0.044 total
(venv) ~/phd/projects/certificates/sec-certs  $ time python  -c 'import sec_certs'    
python -c 'import sec_certs'  0.03s user 0.01s system 93% cpu 0.041 total

So, from 3.3 seconds we go to 1.5. Any further reduction would require:

  • Deferring numpy (doable, but needs some thinking, saves 0.1 s or 10%)
  • Deferring BS4 (doable, but needs some thinking, saves 0.1 s or 10%)
  • Deferring pandas (not doable, it is used in typing quite a bit; would save 0.4 s or 30%)
  • Ditching imports from __init__.py files; they eat approx. 35% on their own, i.e., without pandas etc.

I did the profiling with python -X importtime yourfile.py 2> import.log and https://pypi.org/project/tuna/.
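Spelled out for this repo, the measurement loop looks like this (tuna takes the log file directly and opens an interactive, browser-based view of the import tree):

python -X importtime -c 'import sec_certs.dataset' 2> import.log
tuna import.log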

I consider this to be an OK result and I will invest no more effort into this unless you promote the issue.

Edit: Also note that imports executed from within functions should only run once AFAIK; subsequent calls find the module already cached.

J08nY force-pushed the issue/324-Switch-from-NVD-data-feeds-to-API branch from fd241ea to 0c181f6 on April 21, 2023 12:59
J08nY marked this pull request as ready for review April 24, 2023 18:10
J08nY merged commit 8b0600e into main Apr 24, 2023
J08nY deleted the issue/324-Switch-from-NVD-data-feeds-to-API branch April 24, 2023 18:10