Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some UI package queries return duplicate copies of the same Package URL #1278

Closed
johnmhoran opened this issue Aug 23, 2023 · 13 comments
Closed

Comments

@johnmhoran
Copy link
Member

Some package queries using the UI return duplicate copies of the same Package URL. For example, a search for pkg:deb/debian/jackson-databind@2.12.1-1?distro=sid returns

pkg:deb/debian/jackson-databind@2.12.1-1?distro=stretch
pkg:deb/debian/jackson-databind@2.12.1-1?distro=sid
pkg:deb/debian/jackson-databind@2.12.1-1?distro=stretch
pkg:deb/debian/jackson-databind@2.12.1-1?distro=sid
pkg:deb/debian/jackson-databind@2.12.1-1%2Bdeb11u1?distro=sid
pkg:deb/debian/jackson-databind@2.12.1-1%2Bdeb11u1
pkg:deb/debian/jackson-databind@2.12.1-1%2Bdeb11u1?distro=trixie
pkg:deb/debian/jackson-databind@2.12.1-1?distro=trixie

image

@johnmhoran
Copy link
Member Author

See also, e.g., pkg:deb/debian/sqlite3@3.27.2-3%2Bdeb10u1 -- but note that unlike the prior example, the numbers of affected by/fixing vulnerabilities are not 100% identical across the duplicates.

image

@johnmhoran
Copy link
Member Author

johnmhoran commented Oct 5, 2023

@pombredanne @TG1999 I think I've identified at least one reason we have some duplicate Package records in the VCIO DB.

While creating a series of test output modifications and test API queries today for issue 1287 (combining the affected and fixed by tabs in the Vulnerability details UI), I noticed that when I use the API to explore duplicates, the difference seems to be in the structure of the qualifiers value.

And BTW, this duplication also seems to mean that when these Packages are affected by a vulnerability, the DB contents are such that there are also completely different fixed by packages for each of the duplicate affected packages.

An example. pkg:deb/debian/jackson-databind@2.12.1-1?distro=sid. I won't display the id values here, but there are 2 records for that PURL/Package, and the difference is here -- can you see it? 🙂

1 of 2:

    "purl": "pkg:deb/debian/jackson-databind@2.12.1-1?distro=sid",
    "type": "deb",
    "namespace": "debian",
    "name": "jackson-databind",
    "version": "2.12.1-1",
    "qualifiers": "distro=sid",
    "subpath": "",

2 of 2:

    "purl": "pkg:deb/debian/jackson-databind@2.12.1-1?distro=sid",
    "type": "deb",
    "namespace": "debian",
    "name": "jackson-databind",
    "version": "2.12.1-1",
    "qualifiers": {
        "distro": "sid"
    },
    "subpath": "",

And speaking of id values, an initial review of several sets of queries suggests that these duplicates (there are others) were created at different times -- the id values look like they comprise separate groups.

@johnmhoran
Copy link
Member Author

@pombredanne @TG1999 Last week Philippe suggested I learn how to use the Django shell and search the DB for PURLs groupedby type, namespace, name and version -- hoping that the output would include patterns that might shed light on how many duplicates/bad records the DB contains.

I started with a count of packages in my local DB instance -- 596,745, a subset of the public DB.

I initially ran the following to generate the result Philippe suggested.

from django.db.models import Count

from vulnerabilities.models import Package

duplicate_packages = Package.objects.values('type','namespace','name','version').annotate(type_count=Count('type'),namespace_count = Count('namespace'),name_count = Count('name'),version_count = Count('version')).filter(type_count__gt=1,namespace_count__gt=1,name_count__gt=1,version_count__gt=1)

duplicate_packages

This gave me a large output (36,428 records in the queryset) that began like this:

Out[4]: <PackageQuerySet [{'type': 'alpine', 'namespace': '', 'name': 'advancecomp', 'version': '2.1-r2', 'type_count': 44, 'namespace_count': 44, 'name_count': 44, 'version_count': 44}, {'type': 'alpine', 'na
mespace': '', 'name': 'alpine', 'version': '2.23-r0', 'type_count': 44, 'namespace_count': 44, 'name_count': 44, 'version_count': 44}, {'type': 'alpine', 'namespace': '', 'name': 'alpine', 'version': '2.25-r0'
, 'type_count': 21, 'namespace_count': 21, 'name_count': 21, 'version_count': 21},

Despite careful study I saw no noticeable patterns. Let me know if you'd like me to upload a copy of the output here. (I also reran this with eols making it easier to read individual records.)

@johnmhoran
Copy link
Member Author

It occurred to me that a more informative result might be to check only for duplicates of the package_url field, so I ran the following.

from django.db.models import Count
from vulnerabilities.models import Package

duplicate_package_url_fields = Package.objects.values('package_url').annotate(package_url_count = Count('package_url')).filter(package_url_count__gt=1)

f = open('/home/jmh/duplicate_package_url_fields-01.txt', 'w')

for x in duplicate_package_url_fields:
    f.write(str(x))
    f.write("\n")

f.close()

This gave me a useful output of 41,293 records that began like this:

{'package_url': 'pkg:alpine/busybox@1.31.1-r22?arch=aarch64&distroversion=v3.12&reponame=main', 'package_url_count': 2}
{'package_url': 'pkg:alpine/busybox@1.31.1-r22?arch=armhf&distroversion=v3.12&reponame=main', 'package_url_count': 2}
{'package_url': 'pkg:alpine/busybox@1.31.1-r22?arch=armv7&distroversion=v3.12&reponame=main', 'package_url_count': 2}
{'package_url': 'pkg:alpine/busybox@1.31.1-r22?arch=mips64&distroversion=v3.12&reponame=main', 'package_url_count': 2}
{'package_url': 'pkg:alpine/busybox@1.31.1-r22?arch=ppc64le&distroversion=v3.12&reponame=main', 'package_url_count': 2}

@johnmhoran
Copy link
Member Author

A modest visual review suggests all have a count of 2, and the few that I examined were duplicates where the package_url is identical but the ids differ and some other attribute differs as well. Two examples:

Different qualifiers ==> identical package_url values

            "url": "http://public.vulnerablecode.io/api/packages/368146",
            "purl": "pkg:deb/debian/linux@5.15.3-1?distro=sid",
            "type": "deb",
            "namespace": "debian",
            "name": "linux",
            "version": "5.15.3-1",
            "qualifiers": {
                "distro": "sid"
            },
            "subpath": "",

and

            "url": "http://public.vulnerablecode.io/api/packages/439348",
            "purl": "pkg:deb/debian/linux@5.15.3-1?distro=sid",
            "type": "deb",
            "namespace": "debian",
            "name": "linux",
            "version": "5.15.3-1",
            "qualifiers": "distro=sid",
            "subpath": "",

This is even stranger:

Different namespace/name combinations ==> identical package_url values

{
    "id": 394293,
    "type": "pypi",
    "namespace": "",
    "name": "microweber/microweber",
    "version": "1.2.20",
    "subpath": "",
    "qualifiers": {},
    "package_url": "pkg:pypi/microweber/microweber@1.2.20",
    "plain_package_url": "pkg:pypi/microweber/microweber@1.2.20"
}

and

{
    "id": 397816,
    "type": "pypi",
    "namespace": "microweber",
    "name": "microweber",
    "version": "1.2.20",
    "subpath": "",
    "qualifiers": {},
    "package_url": "pkg:pypi/microweber/microweber@1.2.20",
    "plain_package_url": "pkg:pypi/microweber/microweber@1.2.20"
}

I don't know how representative these examples are, but if my shell scripting is correct (and it might not be), we have nearly 42,000 pairs of records that share identical package_url values.

@johnmhoran
Copy link
Member Author

I'm uploading duplicate_package_url_fields-01.txt -- 41,293 records, 40,341 of which are pkg:deb/debian. A few pkg:alpine/ at the top and some pkg:nginx/ and others at the bottom of the file.

Generated with

from django.db.models import Count
from vulnerabilities.models import Package

duplicate_package_url_fields = Package.objects.values('package_url').annotate(package_url_count = Count('package_url')).filter(package_url_count__gt=1)

f = open('/home/jmh/duplicate_package_url_fields-01.txt', 'w')

for x in duplicate_package_url_fields:
    f.write(str(x))
    f.write("\n")

f.close()

duplicate_package_url_fields-01.txt

@johnmhoran
Copy link
Member Author

@TG1999 @pombredanne In addition to the issue of duplicate PURLs, I've also noticed that at least some of these pairs of duplicates have different Affected By vulnerabilities and, when they share the same vulnerability, have different Fixed By Packages. Yes, it's true. 🙁

An example is pkg:nginx/nginx@0.8.9?os=windows. This is from the public VCIO -- 4 Affected By vulns vs. 1 Affected By vuln:

image

And this is from an API query for that PURL -- 1 Affected By vuln vs. 4 Affected By vulns, and for the one vulnerability they share, 2 sets of 2 different Fixed By Packages. Seriously. 🤯 3 results total, 2 of which are the pair of duplicate PURLs.

Here's the 1st of the duplicate PURLs, with 1 vulnerability and 2 fixed-by Packages (unlike the UI, the API calls them 'fixed_packages'):


{
    "count": 3,
    "next": null,
    "previous": null,
    "results": [
        {
            "url": "http://public.vulnerablecode.io/api/packages/402866",
            "purl": "pkg:nginx/nginx@0.8.9?os=windows",
            "type": "nginx",
            "namespace": "",
            "name": "nginx",
            "version": "0.8.9",
            "qualifiers": "os=windows",
            "subpath": "",
            "affected_by_vulnerabilities": [
                {
                    "url": "http://public.vulnerablecode.io/api/vulnerabilities/28",
                    "vulnerability_id": "VCID-6qxq-zyzf-aaar",
                    "summary": "Vulnerabilities with Windows 8.3 filename pseudonyms",
                    "references": [],
                    "fixed_packages": [
                        {
                            "url": "http://public.vulnerablecode.io/api/packages/402844",
                            "purl": "pkg:nginx/nginx@0.7.65?os=windows",
                            "is_vulnerable": false
                        },
                        {
                            "url": "http://public.vulnerablecode.io/api/packages/402843",
                            "purl": "pkg:nginx/nginx@0.8.33?os=windows",
                            "is_vulnerable": false
                        }
                    ],
                    "aliases": [
                        "CORE-2010-0121"
                    ]
                }
            ],
            "fixing_vulnerabilities": []
        },

@johnmhoran
Copy link
Member Author

This is the 2nd of the duplicate PURLs, with 4 vulnerabilities, not 1 like the other duplicate PURL. The pair of duplicate PURLs share 1 of these vulnerabilities -- VCID-6qxq-zyzf-aaar (id=28) (you can see it up above as well)-- and each has 2 fixed-by Packages for that vuln but they are completely different!

       {
            "url": "http://public.vulnerablecode.io/api/packages/161168",
            "purl": "pkg:nginx/nginx@0.8.9?os=windows",
            "type": "nginx",
            "namespace": "",
            "name": "nginx",
            "version": "0.8.9",
            "qualifiers": {
                "os": "windows"
            },
            "subpath": "",
            "affected_by_vulnerabilities": [
                {
                    "url": "http://public.vulnerablecode.io/api/vulnerabilities/28",
                    "vulnerability_id": "VCID-6qxq-zyzf-aaar",
                    "summary": "Vulnerabilities with Windows 8.3 filename pseudonyms",
                    "references": [],
                    "fixed_packages": [
                        {
                            "url": "http://public.vulnerablecode.io/api/packages/73",
                            "purl": "pkg:nginx/nginx@0.7.65?os=windows",
                            "is_vulnerable": true
                        },
                        {
                            "url": "http://public.vulnerablecode.io/api/packages/72",
                            "purl": "pkg:nginx/nginx@0.8.33?os=windows",
                            "is_vulnerable": true
                        }
                    ],
                    "aliases": [
                        "CORE-2010-0121"
                    ]
                },
                {
                    "url": "http://public.vulnerablecode.io/api/vulnerabilities/26",
                    "vulnerability_id": "VCID-83uc-4ghj-aaaq",
                    "summary": "Vulnerabilities with invalid UTF-8 sequence on Windows",

. . .

                    "aliases": [
                        "CVE-2010-2266"
                    ]
                },
                {
                    "url": "http://public.vulnerablecode.io/api/vulnerabilities/27",
                    "vulnerability_id": "VCID-fqym-j77g-aaac",
                    "summary": "Vulnerabilities with Windows file default stream",

. . .

                    "aliases": [
                        "CVE-2010-2263"
                    ]
                },
                {
                    "url": "http://public.vulnerablecode.io/api/vulnerabilities/22",
                    "vulnerability_id": "VCID-sqx7-1q2d-aaak",
                    "summary": "Vulnerabilities with Windows directory aliases",

. . .

                    "aliases": [
                        "CVE-2011-4963"
                    ]
                }
            ],
            "fixing_vulnerabilities": []
        }
    ]
}

@johnmhoran
Copy link
Member Author

Note: when I describe the 2 sets of Fixed-By Packages as different, they have identical purl values but different id values -- just like the parent pair of duplicate PURLs. Turtles all the way down....

@pombredanne
Copy link
Member

"qualifiers" is stored as a JSON field: https://github.com/nexB/vulnerablecode/blob/f3d153190fda258ba76e0a08be21b376e67f505c/vulnerabilities/models.py#L531 but it was originally stored as a encoded query string as "name=value&name=value". In hindsight, I wonder if we should not switch back to a string in the future, as I cannot think of cases where we need fine grained queries on this field.

There must have been some accident in the past where we imported and saved data using a string and did convert these saved data to a JSONfield when the model evolved.

@pombredanne
Copy link
Member

https://public.vulnerablecode.io/vulnerabilities/VCID-6qxq-zyzf-aaar is fixed in two different ranges by two different versions which is typical of nginx.

@TG1999
Copy link
Contributor

TG1999 commented Jan 3, 2024

@johnmhoran please check the latest release of public instance and feel free to re-open the issue if needed

@TG1999 TG1999 closed this as completed Jan 3, 2024
@johnmhoran
Copy link
Member Author

Will do @TG1999 -- thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants