
Add DB v6 schema #2128

Open · 7 tasks done
wagoodman opened this issue Sep 17, 2024 · 2 comments · May be fixed by #2245

wagoodman commented Sep 17, 2024

DB v6 is meant to cover several use cases (you can safely ignore this link). The high level goals are:

  • Make the distributed DB smaller
    • Via removing unnecessary information: for instance, "namespace" is a holdover from earlier projects and is no longer strictly necessary.
    • Via minimizing redundant information: in order to optimize the matching process, the v5 Vulnerability table is filled with redundant information (rows are duplicated for every unique fixed-in version found in upstream data).
  • Make adding new use cases to the DB easier (e.g. add EPSS or KEV data sources for grype to functionally use them without needing to make structural changes to grype)

The high level design is as follows:

  • Add a table per search use case. The bare minimum information needed to search for records is in these tables, which are indexed. These tables allow you to identify a record ID. We'll refer to these tables as "Search" or "*Handle" tables.
  • Have a pseudo content-addressable (single) blobs table that maps record IDs to JSON blobs. When writing to the DB, these record IDs map 1:1 to JSON content digests, so that we can cheaply avoid writing duplicate information (a rough sketch of this follows after this list). These digests should be stripped away before distribution, since they are not strictly needed at runtime. The JSON blobs represent the business elements (e.g. Vulnerability, Severity, AffectedPackage, etc). All payloads are self-describing, with a named/versioned "kind" field used for deserialization concerns. We'll refer to these as "*Blob" models.
  • Have auxiliary tables related to the search tables to aid in refining / augmenting searches.

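To make the blob mechanics concrete, here is a minimal sketch of the write-side dedup idea in Go. The names (blobStore, writeBlob) and the "kind" value format are hypothetical illustrations of the design described above, not the actual implementation:

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// blobStore is a toy stand-in for the single Blobs table: JSON payloads are keyed
// by content digest while writing, so duplicate records are only stored once. The
// digests themselves would be stripped before distribution since they are not
// needed at runtime.
type blobStore struct {
	byDigest map[string]int64 // content digest -> blob ID (build-time only)
	values   []string         // blob ID -> JSON value
}

// writeBlob returns the ID of an existing identical payload, or stores a new one.
func (s *blobStore) writeBlob(payload any) (int64, error) {
	content, err := json.Marshal(payload)
	if err != nil {
		return 0, err
	}
	sum := sha256.Sum256(content)
	digest := hex.EncodeToString(sum[:])
	if id, ok := s.byDigest[digest]; ok {
		return id, nil // already written: reuse the existing blob ID
	}
	id := int64(len(s.values))
	s.values = append(s.values, string(content))
	s.byDigest[digest] = id
	return id, nil
}

func main() {
	s := &blobStore{byDigest: map[string]int64{}}
	// a self-describing payload with a named/versioned "kind" field (format assumed)
	a, _ := s.writeBlob(map[string]string{"kind": "vulnerability@v1", "id": "CVE-2024-0001"})
	b, _ := s.writeBlob(map[string]string{"kind": "vulnerability@v1", "id": "CVE-2024-0001"})
	fmt.Println(a == b) // true: identical content maps to a single blob
}
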
Here is a list of the “Handle” tables to search against:

  • AffectedPackageHandles: search by package name within an ecosystem
  • AffectedCPEHandles: search by CPE fields
  • VulnerabilityHandles: search by vulnerability ID

These are related to two other auxiliary tables:

  • Packages: listing of all unique pairings of package name and ecosystem
  • OperatingSystems: listing of all unique tuples of distro name, major version, minor version (with optional codename)

Here's how the AffectedPackageHandle table would relate to the auxiliary tables:

CREATE TABLE AffectedPackageHandle (
    id INTEGER PRIMARY KEY,
    vulnerability_id INTEGER NOT NULL,
    operating_system_id INTEGER,
    package_id INTEGER NOT NULL,
    blob_id INTEGER NOT NULL,
    FOREIGN KEY(vulnerability_id) REFERENCES VulnerabilityHandle(id),
    FOREIGN KEY(operating_system_id) REFERENCES OperatingSystem(id),
    FOREIGN KEY(package_id) REFERENCES Package(id),
    FOREIGN KEY(blob_id) REFERENCES Blobs(id)
);

-- supporting tables...

CREATE TABLE VulnerabilityHandle (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    blob_id INTEGER UNIQUE,
    FOREIGN KEY(blob_id) REFERENCES Blobs(id)
);

CREATE TABLE Package (
    id INTEGER PRIMARY KEY,
    ecosystem TEXT NOT NULL,
    name TEXT NOT NULL,
    UNIQUE(name, ecosystem)
);

CREATE TABLE OperatingSystem (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    major_version TEXT NOT NULL,
    minor_version TEXT NOT NULL,
    codename TEXT,
    UNIQUE(name, major_version, minor_version)
);

And the rather simple Blobs table:

CREATE TABLE Blobs (
    id INTEGER PRIMARY KEY,
    value TEXT NOT NULL UNIQUE
);

Here's how you might craft a search for an affected package for a specific OS:

SELECT
    affected.blob_id AS affected_blob_id,
    vuln.blob_id AS vulnerability_blob_id
FROM
    AffectedPackageHandle affected
JOIN
    OperatingSystem os ON affected.operating_system_id = os.id
JOIN
    Package pkg ON affected.package_id = pkg.id
JOIN
    VulnerabilityHandle vuln ON affected.vulnerability_id = vuln.id
WHERE
    os.name = 'ubuntu' AND
    os.major_version = '22' AND
    os.minor_version = '4' AND
    pkg.name = 'openssl';

At this point you can take these blob IDs and query the blob table for the JSON payload and deserialize. This has an advantage over the existing schemas: you can conditionally inflate DB objects based on what you need, rather than having to inflate entire records that you end up throwing away.
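
As a sketch of that last step (assuming the table names above and Go's standard database/sql package; fetchVulnerabilityBlob is an illustrative helper, not an existing function):

import (
	"database/sql"
	"encoding/json"
	"fmt"
)

// fetchVulnerabilityBlob takes a blob ID returned by the search above, loads the
// JSON payload from the Blobs table, and inflates only the vulnerability record.
func fetchVulnerabilityBlob(db *sql.DB, blobID int64) (*VulnerabilityBlob, error) {
	var value string
	if err := db.QueryRow("SELECT value FROM Blobs WHERE id = ?", blobID).Scan(&value); err != nil {
		return nil, fmt.Errorf("unable to read blob %d: %w", blobID, err)
	}

	var blob VulnerabilityBlob // defined below
	if err := json.Unmarshal([]byte(value), &blob); err != nil {
		return nil, fmt.Errorf("unable to decode vulnerability blob: %w", err)
	}
	return &blob, nil
}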

In v1-v5 you'd need to craft the correct namespace, which was a bespoke string; v6 shifts this to per-record relations.

Eventually we’d like to add additional handle tables (out of scope for v6 though).

The proposed blobs are as follows:

type VulnerabilityBlob struct {
	// ID is the unique identifier of the vulnerability (e.g. "CVE-2024-1234")
	ID string `json:"id"`

	// Name of the Vunnel provider (or sub processor responsible for data records from a single specific source, e.g. "ubuntu")
	ProviderName string `json:"provider"`

	// List of names, emails, or organizations who submitted the vulnerability
	Assigner []string `json:"assigner,omitempty"`

	// Description of the vulnerability as provided by the source
	Description string `json:"description"`

	// Date the vulnerability record was last modified
	ModifiedDate *time.Time `json:"modified,omitempty"`

	// Date the vulnerability record was first published
	PublishedDate *time.Time `json:"published,omitempty"`

	// Date the vulnerability record was withdrawn
	WithdrawnDate *time.Time `json:"withdrawn,omitempty"`

	// Conveys the current status of the vulnerability
	Status VulnerabilityStatus `json:"status"`

	// URLs to external resources that provide more information about the vulnerability
	References []struct {
		// External resource
		URL string `json:"url"`

		// Free-form organizational field to convey additional information about the reference
		Tags []string `json:"tags,omitempty"`
	} `json:"refs,omitempty"`

	// List of IDs of the same vulnerability in other databases, in the form of the ID field.
	Aliases []string `json:"aliases,omitempty"`

	// List of severity indications (quantitative or qualitative) for the vulnerability
	Severities []struct {
		// Describes the quantitative method used to determine the score, or indicates the qualitative value
		Scheme SeverityScheme `json:"scheme"`

		// Severity score (e.g., "7.5", "CVSS:4.0/AV:N/AC:L/AT:N/PR:H/UI:N/VC:L/VI:L/VA:N/SC:N/SI:N/SA:N", or "high")
		Value string `json:"value"`

		// Name of the source of the severity score (e.g., "nvd@nist.gov" or "security-advisories@github.com")
		Source string `json:"source"`

		// Free-form organizational field to convey priority over other severities
		Rank int `json:"rank"`
	} `json:"severities,omitempty"`
}
type AffectedBlob struct {
	// CVE IDs associated with this affected package record
	CVEs []string `json:"cves"`

	// RPM module stream the package belongs to, if any (e.g. "nodejs:16")
	RpmModularity string `json:"rpm_modularity,omitempty"`

	// CPEs describing the platform the affected package runs on
	PlatformCPEs []string `json:"platform_cpes,omitempty"`

	Ranges []struct {
		Version struct {
			// Type of version range, such as "semver", "rpm", "pypi", etc.
			Type string `json:"type"`

			// Allows for a version range expression, such as ">=1.0.0", "1.0.0", ">= 1.0, <2.0", etc.
			Constraint string `json:"constraint"`
		} `json:"version"`

		Fix *struct {
			// Version that first fixes the vulnerability
			Version string `json:"version"`

			// State of the fix (e.g. "fixed", "not-fixed", "wont-fix")
			State string `json:"state"`
			Detail  *struct {
				GitCommit  string `json:"git_commit"`
				Timestamp  time.Time `json:"timestamp"`
				References []struct {
					// External resource
					URL string `json:"url"`

					// Free-form organizational field to convey additional information about the reference
					Tags []string `json:"tags,omitempty"`
				} `json:"references,omitempty"`
			} `json:"detail,omitempty"`
		} `json:"fix,omitempty"`
	} `json:"ranges,omitempty"`
}

Implied changes from this:

  1. All Blob entities…
    a. may have fields added to them over time, but not removed or modified (within the same schema version)
    b. only one “version” of a blob may exist within a DB at a time
  2. When you have additional data that fits new use cases, add a new Handle table that points to new Blob types.
  3. Remove the concept of a “Namespace” from the DB. All searches should use the proper joins to auxiliary tables (or not), depending on what you’re searching for. In this way you are not searching within a namespace string pattern; instead, you enumerate the associated qualities you want to filter on.

Specific changes (see prototype models for reference):

@wagoodman wagoodman added the enhancement New feature or request label Sep 17, 2024
@wagoodman wagoodman added this to the DB v6 milestone Sep 17, 2024
@wagoodman wagoodman changed the title Add DB v6 reader/writers Add DB v6 schema foundation Sep 17, 2024
@wagoodman wagoodman changed the title Add DB v6 schema foundation Add DB v6 schema Sep 18, 2024

TimBrown1611 commented Sep 18, 2024

hi @wagoodman,

  • are we expecting the scan results to change from v5 to v6?
  • will the db be stateful (can it track what has changed / been added)? Meaning, in case a CVE was updated by a source which doesn't have a published / updated field, will that be reflected in the DB?
  • flexibility to add new data - how easy will it be to add new tables (for example, EOL)?
  • will it affect grype's schema (v6 = grype 1.0.0)?

thanks for your time!

wagoodman (Contributor, Author) commented:

are we expecting the scan results to change from v5 to v6?

No, as currently planned there is no functional change: the match results should be the same. This could change if we fix any matcher bugs while we're rewriting that portion of the search package, though. But we would not merge changes that made results worse, only better.

will the db be stateful?

It will be more stateful than it is today. Most information will be in the Blobs table and be content addressable. We won't be keeping the value digests in the DB, to save on space, but it would be possible to digest all values in that table and compare the digests computed from one DB with those from another DB to get a rough sense of the changes. We cannot use the blob IDs themselves since they cannot be considered stable between DB builds.
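
For illustration, that out-of-band comparison could look roughly like the following (a sketch only; blobDigests is not planned tooling, and the Blobs table name is taken from the schema above):

import (
	"crypto/sha256"
	"database/sql"
	"encoding/hex"
)

// blobDigests computes the set of content digests over every value in the Blobs
// table, so two DB builds can be compared without relying on unstable blob IDs.
func blobDigests(db *sql.DB) (map[string]struct{}, error) {
	rows, err := db.Query("SELECT value FROM Blobs")
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	digests := map[string]struct{}{}
	for rows.Next() {
		var value string
		if err := rows.Scan(&value); err != nil {
			return nil, err
		}
		sum := sha256.Sum256([]byte(value))
		digests[hex.EncodeToString(sum[:])] = struct{}{}
	}
	return digests, rows.Err()
}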

how easy will it be to add new tables?

Since there would be new Handle tables for searching and new Blob schemas for entries that are searchable by a handle table, at any point during the v6 lifetime we can add these new elements without it being a breaking change or a paradigm shift. Take your EOL example: I imagine there would be a new EolBlob that represents a JSON payload with cpe, purl, version, date, etc., as well as a new EolHandle table that is indexed by cpe or purl. It might be that we also add new entries into the Packages table for joining on (optionally, depending on what kind of data we pull in from that dataset).
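
Purely as a sketch of that hypothetical shape (none of these types exist yet, and the fields are guesses along the lines of the handles and blobs described above):

// EolHandle is a hypothetical search table: indexed by cpe/purl, optionally joined
// to the Packages table, pointing at an EolBlob payload.
type EolHandle struct {
	ID        int64
	PackageID *int64 // optional join to Packages
	CPE       string // indexed search key
	BlobID    int64
}

// EolBlob is a hypothetical JSON payload describing end-of-life information.
type EolBlob struct {
	Purl    string     `json:"purl,omitempty"`
	CPE     string     `json:"cpe,omitempty"`
	Version string     `json:"version"`
	EolDate *time.Time `json:"eol_date,omitempty"`
}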

will it affect grype's schema (v6 = grype 1.0.0)?

grype is still in v0, so we're still making breaking changes on minor releases. However, the v6 schema is a big step towards grype 1.0. We still need to determine whether we want to make the DB schema part of the grype version contract (e.g. should we allow for breaking schema changes without needing to bump the major version of grype?); that hasn't been decided yet.
