Enrich packages from OpenSSF Security Scorecard Data #134

AyanSinhaMahapatra · 2024-03-26T19:16:47Z

I was going through the above project and related issues.

There were some few doubts I had was like right now I see export options of SBOMs in SPDX and CycloneDX but how is this imported in SCIO ?

https://github.com/ossf/scorecard#public-data
https://api.securityscorecards.dev/

are we going to use the data from above 2 APIs to enhance the SBOM json data ?

If we will be integrating this in the SBOMs data then why there is a need to use it as a pipeline because it would fetch data anyway by default?

I see mostly if there is a integration to be done 90% of it is on SCIO side.

Please correct me if I have misunderstood anything.

AyanSinhaMahapatra · 2024-03-26T19:17:14Z

AyanSinhaMahapatra
Mar 26, 2024
Maintainer Author

but how is this imported in SCIO ?

See https://github.com/nexB/scancode.io/blob/main/scanpipe/pipelines/load_sbom.py#L27

are we going to use the data from above 2 APIs to enhance the SBOM json data ?

Yes, that is the plan, but this needs a bit more research.
Primarily the data is keyed by source repository information, like github/gitlab repo link, so as long as detected package objects in SCIO/Purldb has this source repo data, we can use this to query the APIs to get scorecard data.

We also need some structure to store this data also, but initially this can be in the extra_data attributes.

why there is a need to use it as a pipeline

Becuase this data won't be fetched by default, when we create packages in SCIO. So if we are implementing this in SCIO, other pipelines detect or create packages, then an additional pipeline would be run to get this data fro the APIs and store it. And then also modify the SBOM output creation to use this data appropriately. That's what I was thinking if we are implementing this in the SCIO side.

In the purldb side, we are fetching metadata for and scanning source/binary packages, given a purl (see for example support for debian which I added: aboutcode-org/purldb#300, we also had maven and npm support for this previously). The scans are essentially SCIO pipelines being run, and the data imported back. So here to enrich the packages with scorecard data we have two options: Implement this in the SCIo side as discussed above, or fetch this data when we are getting metadata for a purl, based on source repo information present in the metadata. In both cases, we'd be storing the data in purldb too.

0 replies

404-geek · 2024-03-27T20:05:38Z

404-geek
Mar 27, 2024

@AyanSinhaMahapatra @pombredanne I was exploring the API
OpenSSF API

I found that we get the score card results only for those repos, which are part of scorecard workflow and I saw that only scancode-toolkit
repo from nexB is having that workflow and the scorecard is available on

but for some weird reason I could not find the entry of publish_results:true in the workflow files under the develop branch. I wonder how it is published.

Score-Card

and its report was on this commit

I also got the idea that if we are creating a pipeline out of this, it is dependent on the other pipelines (eg; scan_codebase , scan_single_package, etc) which are supposed to create package entries in the discoveredpackage table first for the projects and then fetching the scorecard details using the code_view_url or vcs_url. Now the issue here is that if the package does not have scorecard workflow present in their action there the API would be returning 404. Are we expecting to have such cases ?

And related the creating a pipeline why cant we keep it as a step which can be integrated into the existing pipeline and will run after the packages are populated.

The below code snippet from (scancode.io/scanpipe/pipelines/__init__.py) from execute funcn executes a pipeline, adding some prerequisite steps.

Can we add fetching scorecard details as a step if we are going to integrate it?

        self.log(f"Pipeline [{self.pipeline_name}] starting")

        steps = self.get_steps(groups=self.run.selected_groups)

        if self.download_inputs:
            steps = (self.__class__.download_missing_inputs,) + steps

        steps_count = len(steps)
        pipeline_start_time = timer()

        for current_index, step in enumerate(steps, start=1):
            step_name = step.__name__

            self.run.set_current_step(f"{current_index}/{steps_count} {step_name}")
            self.log(f"Step [{step_name}] starting")
            step_start_time = timer()

Regarding enriching the SBOM, I saw the process that it currently being populated on demand using the bom or spdx package. So we can do it the same way by populating the score card details during the time of request.

On the side of purlDB, I think it will be a lot easier, as you have done for the debian packages. If following along the same line we can integrate the score card API details here also for every indexed package.

Note : I just noted that the crucial thing for this OpenSSF to work effectively we need to have VCS url given as input. But in most of the cases, the VCS url is empty (in SCIO and purlDB).

Let me know what you think on this approach ?

0 replies

AyanSinhaMahapatra · 2024-03-29T06:04:28Z

AyanSinhaMahapatra
Mar 29, 2024
Maintainer Author

I found that we get the score card results only for those repos, which are part of scorecard workflow

This is not true, see https://github.com/ossf/scorecard?tab=readme-ov-file#public-data:

We run a weekly Scorecard scan of the 1 million most critical open source projects judged by their direct dependencies

We are critical OSS ;)
The data is there if a project is critical, or if they have the scorecard github action enabled, and it's data published.
You might want to try to enable this for a nexB project to get familiar with the scorecard data/conditions.

I also am not sure if the data out of the API and the data in the BigQuery dataset are same or not, and whether the BigQuery one has more repo info in it. If the BigQuery dataset has more info, I would be inclined to see if we can get data out of it, even if this is a one-time operation on the purldb side (and this is also why I mentioned purldb-side implementation)
The SCIO-side implementation of a pipeline getting data from the API seems straightforward.

We can also create/look for a small library that does this in python (import and store the data in models), and use this as a library in both SCIO/purldb, but this is fine if we are only doing this in SCIO for now. Depends on whether the purldb side implementation is useful or not.

Note : I just noted that the crucial thing for this OpenSSF to work effectively we need to have VCS url given as input. But in most of the cases, the VCS url is empty (in SCIO and purlDB).

Yes! This would be very important as we need the vcs URL, which is basically the github/gitlab repo link, and scorecard data is keyed by these links. There are efforts in purldb with purl2git: aboutcode-org/purldb#258 trying to improve this, but we would need to ramp this up in SCTK and add support for detecting these in package manifests better wherever applicable and whenever data is present in manifests. We also have some code which can be used here: https://github.com/package-url/packageurl-python/blob/main/src/packageurl/contrib/purl2url.py

A good test for the project would be making sure we are scanning ~10-15 packages from different ecosystems, detecting their vcs_url in the metadata correctly in SCTK (and thus in SCIO) and then the scorecard pipeline is correctly fetching the scorecard data back to SCIO.

This would a seperate add-on pipeline like the https://github.com/nexB/scancode.io/blob/main/scanpipe/pipelines/find_vulnerabilities.py which is only run when requested specifically. so check all the implementation details there on how we get data about packages from a different data source, and then store and display these.

All of what I'm saying is food-for-thought of course, we are still very much open for discussion on exact implementation details, and this would likely be updated a couple times with community feedback as we go through this. And looking forward to seeing what you are researching and proposing on this.

1 reply

404-geek Mar 31, 2024

Hi @AyanSinhaMahapatra @pombredanne In the last 2 days. I went through all kinds of documentation and code breakdowns regarding this project.

I saw that the projects which are OSSF critical have been scanned by OSSF and have their results published in score card and REST API.

I also am not sure if the data out of the API and the data in the BigQuery dataset are same or not, and whether the BigQuery one has more repo info in it. If the BigQuery dataset has more info, I would be inclined to see if we can get data out of it, even if this is a one-time operation on the purldb side (and this is also why I mentioned purldb-side implementation)

Regarding this, I saw the Google Cloud BigQuery, and there is no such difference in data between querying an API and querying from thescorecard-v2_latest table. However, if you query table data which has their archived information,'scorecard--v2`, there we would get a lot of past data ( e.g., Just for reference, I got more than 1400 records for scancode-toolkit.). What I think is we can create a pipeline class that would fetch their current data from the API and if the user wants to see more past records we can query the bigquery tables using the python-sdk.

Now coming to the integration of OSSF, what I have thought is creating a pipeline similar to find_vulnerabilities, which is provided as an add-on extending the pipeline class. Here I can keep 3 steps

To check whether I can reach OSSF successfully or not using the API.
Scan the Github URLs for all discovered packages within the project and have a valid/reachable VCS or code_view URL containing a Github link.
Scan the Github URLs for all discovered dependencies with the same information as above.

In this, I have to use purl2url function (build_github_from_purl) from packageurl lib and
If I don't get a valid code URL with a github link, I won't be sending it for an OSSF scan.

Now, when it comes to saving data, we can have models in the same way they have been defined in the big data buckets but for the initial phases of developing and testing, I was hoping to keep them saved in a extra_data field as a JSON under their respective package entries.

Here during the 2nd and 3rd steps, there might be some github links that are not active parts of OSSF and they would be returning 404.

On the purlDB side I was actually going through the seed and run_visit on the way we fetch resourceuri objects from the upstream link. But it seems like there are some broken pieces there where the github mapper function was broken it was not able to parse any github url in the resourceurl db to make it visitable and all were failing in error.
Later I changed 'https://api\.github\.com/repos/([^/]+)/([^/]+)' to make it parse properly but it is having an issue to fetch github REST api docs

Visit error for URI: https://api.github.com/repos/mattetti/ruby-on-rails-tmbundle File "/home/ins-kali/PycharmProjects/purldb/venv/lib/python3.11/site-packages/github/Requester.py", line 378, in __check raise self.__createException(status, responseHeaders, output) github.GithubException.UnknownObjectException: 404 {"message": "Not Found", "documentation_url": "https://docs.github.com/rest"}
I am making progress to fix that.

But the reason I was trying to see the run_visit function is to know whether we are able to get the vcs url of the packages from the purldb itself then we can reuse the same code base or library in purlDB that we will be developing for SCIO and this can run a separate worker fetching scorecard data and saving it to purlDB.

At last, we are populating the SBOM using the scorecard, which is a straight forward process and can be invoked through the request call.

If a purlDB scannable uri object is sent for package scanning in SCIO and after the process gets over the results are sent back to purlDB then there is no need to run the scorecard pipeline once more for the same project in SCIO, we can instead use the package results which are sent back to us and query OSSF directly from PurlDB end.

Let me know what do you think on this high-level plan. I will be crafting it with more details in the proposal.
But I still think there will be some loose ends that needs to be tied up.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enrich packages from OpenSSF Security Scorecard Data #134

{{title}}

Replies: 3 comments 1 reply

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Enrich packages from OpenSSF Security Scorecard Data #134

AyanSinhaMahapatra Mar 26, 2024 Maintainer

Replies: 3 comments · 1 reply

AyanSinhaMahapatra Mar 26, 2024 Maintainer Author

404-geek Mar 27, 2024

AyanSinhaMahapatra Mar 29, 2024 Maintainer Author

404-geek Mar 31, 2024

AyanSinhaMahapatra
Mar 26, 2024
Maintainer

Replies: 3 comments 1 reply

AyanSinhaMahapatra
Mar 26, 2024
Maintainer Author

404-geek
Mar 27, 2024

AyanSinhaMahapatra
Mar 29, 2024
Maintainer Author