GITBOM_RECORD_HASH_MODE to support use case of not embedding gitoid into artifacts for reproducible build #22

Closed
yonhan3 opened this issue Sep 20, 2022 · 12 comments
Labels
c-spec Category: Improvements or additions to the OmniBOR specification s-needs-info Status: Further information is requested

Comments

@yonhan3

yonhan3 commented Sep 20, 2022

This GITBOM_RECORD_HASH_MODE is orthogonal and complementary to the existing gitBOM gitoid-embedding mode specified in the gitBOM spec.

In this GITBOM_RECORD_HASH_MODE, during the software build, we only compute and record the git-hashes of the input and output artifacts for all build steps; we neither create gitBOM documents nor compute the bom-id of output artifacts.

Later, after the software build, these recorded git-hashes are processed by the bomsh_create_bom.py script to generate all gitBOM docs and create the external artifact-to-bomid mapping database. This post-processing can occur on a remote host.
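As a rough illustration of the recording step described above, here is a minimal Python sketch. The `gitoid` function uses the standard git blob hashing scheme (hash of `blob <size>\0` followed by the content). The `record_step` function and its one-JSON-line-per-build-step record format are hypothetical, invented for illustration only; they are not the format used by the GCC prototype or by bomsh_create_bom.py.

```python
import hashlib
import json


def gitoid(data: bytes, algo: str = "sha1") -> str:
    """Return the git blob hash of `data`: H(b"blob <size>\\0" + data)."""
    h = hashlib.new(algo)
    h.update(b"blob %d\0" % len(data))
    h.update(data)
    return h.hexdigest()


def record_step(output: bytes, inputs: list, algo: str = "sha1") -> str:
    """Build one log record for a build step.

    Hypothetical format (an assumption, not from the prototype): a single
    JSON line with the output hash and the sorted input hashes.
    """
    record = {
        "algo": algo,
        "output": gitoid(output, algo),
        "inputs": sorted(gitoid(i, algo) for i in inputs),
    }
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    # No gitBOM doc is created and no bom-id is computed here; only the
    # hashes are recorded, to be post-processed later (possibly remotely).
    print(record_step(b"object code", [b"int main(){}", b"#define X 1"]))
```

Because only hashes are written to a side log, the output artifact itself is left byte-for-byte untouched, which is what makes this mode compatible with reproducible builds.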

This GITBOM_RECORD_HASH_MODE has many benefits:

  1. The artifact file size is smaller without the embedded gitoid (which adds 136 bytes of overhead for the ELF note section with SHA1).
  2. It works well with reproducible builds.
  3. Many potential uses of gitBOM become possible, such as truncating/grafting gitBOM sub-trees, supporting multiple versions of ADGs for the same artifact/package, different dependency criteria, etc.
  4. It works whether or not the gitoid is embedded into artifacts.
  5. There is no need to specify a gitoid embedding format for each file format (as issue #16, "Distinguishing between artifact ID types in elf file section names", does for ELF); the mapping database works for all file formats (ELF/JAVA/COFF, etc.).
  6. The recorded hash info for the relevant build steps can greatly help SPDX SBOM document generation.

I think our gitBOM spec should support this GITBOM_RECORD_HASH_MODE and provide some explicit guidelines for tool developers to follow. This GITBOM_RECORD_HASH_MODE may make it easier for the software industry to accept gitBOM, especially the reproducible-build community.

The purpose of this GITBOM_RECORD_HASH_MODE is to achieve the same functionality as bomsh, while eliminating the high performance overhead of bomsh.

I have modified binutils and the gcc/llvm compilers, with only a few dozen lines of code changes, to implement this GITBOM_RECORD_HASH_MODE feature. An end-to-end hello-world demo is available.

@edwarnicke
Contributor

I would recommend that we look at recommending build tools keep a map from artifacts to graphs as a symlink farm (see #20). This then becomes allowing an option to disable embedding in build tools.

@yonhan3
Author

yonhan3 commented Sep 28, 2022

> I would recommend that we look at recommending build tools keep a map from artifacts to graphs as a symlink farm (see #20). This then becomes allowing an option to disable embedding in build tools.

Thanks for the comments. I will take a look at #20.

@yonhan3
Author

yonhan3 commented Sep 29, 2022

I have implemented a prototype in GCC-11.3 to support gitBOM document generation in a very flexible way. I have used the environment variable below to control how GCC builds software:

GITBOM_BUILD_MODE=sha1,sha256,create_adg,embed_bomid,record_hash

Each comma-separated key or attribute is a flag that turns a specific gitBOM feature on or off.

  1. sha1 and sha256 are the hashing algorithms to use. If sha1 is present, GCC will create SHA1 gitoids; if it is absent, GCC will not generate SHA1 gitoids.
  2. create_adg/create_no_adg: if create_adg, then gitBOM docs will be created for each build step, and the symlink for the output artifact is also created, pointing to the created gitBOM doc.
  3. embed_bomid/embed_no_bomid: if embed_bomid, then GCC will insert the ELF section and put the bom_id of the generated gitBOM doc into the output ELF file. If embed_no_bomid, then there is no embedding.
    Note that embed_bomid implies create_adg, since the bom_id cannot be computed if we don't create gitBOM docs.
  4. record_hash/record_no_hash: if record_hash, then GCC will record the hashes of the output and input files for each build step. If record_no_hash, then GCC will not record any hashes.

The default value is 'create_no_adg,embed_no_bomid,record_no_hash', but we can discuss what the more appropriate default values for these flags are.

A few examples to illustrate the usage:

  1. If GITBOM_BUILD_MODE is unset, then GCC will not do any extra gitBOM work.
  2. GITBOM_BUILD_MODE=sha1,sha256,create_adg,embed_bomid,record_hash: GCC will create gitBOM docs for both SHA1 and SHA256, and create the output artifact symlinks for SHA1 and SHA256. GCC also embeds both the SHA1 bom_id and the SHA256 bom_id into the output ELF file. GCC will also record the SHA1 and SHA256 hashes of the output and input files.
  3. GITBOM_BUILD_MODE=sha1,create_adg,embed_bomid,record_no_hash: GCC will create gitBOM docs for SHA1 only, and create the output artifact symlinks for SHA1 only. GCC also embeds only the SHA1 bom_id into the output ELF file. GCC will not record the hashes of the output and input files.
  4. GITBOM_BUILD_MODE=sha256,embed_bomid,record_hash: GCC will create gitBOM docs for SHA256 only (embed_bomid implies create_adg), and create the output artifact symlinks for SHA256 only. GCC also embeds only the SHA256 bom_id into the output ELF file. GCC will also record the hashes of the output and input files for SHA256 only.
  5. GITBOM_BUILD_MODE=sha1,sha256,record_hash: GCC will not create gitBOM docs or the output artifact symlinks. GCC will not embed the bom_id into the output ELF file. GCC will record the SHA1 and SHA256 hashes of the output and input files.
  6. GITBOM_BUILD_MODE='': this also means GCC will not do any extra gitBOM work, since the default value is 'create_no_adg,embed_no_bomid,record_no_hash'.
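To make the flag semantics above concrete, here is a minimal Python sketch of how a tool might parse GITBOM_BUILD_MODE. It is not the GCC prototype's code. The `BuildMode` class and `parse_build_mode` function are hypothetical names, and two behaviors are assumptions of this sketch: unknown flags raise an error, and embed_bomid wins over an explicit create_no_adg.

```python
from dataclasses import dataclass


@dataclass
class BuildMode:
    hash_algos: tuple = ()      # subset of ("sha1", "sha256"), in given order
    create_adg: bool = False    # default: create_no_adg
    embed_bomid: bool = False   # default: embed_no_bomid
    record_hash: bool = False   # default: record_no_hash


# Map each on/off flag to the (field, value) it sets.
_SWITCHES = {
    "create_adg": ("create_adg", True),
    "create_no_adg": ("create_adg", False),
    "embed_bomid": ("embed_bomid", True),
    "embed_no_bomid": ("embed_bomid", False),
    "record_hash": ("record_hash", True),
    "record_no_hash": ("record_hash", False),
}


def parse_build_mode(value):
    """Parse a GITBOM_BUILD_MODE value; None stands for an unset variable."""
    mode = BuildMode()
    algos = []
    for flag in (value or "").split(","):
        flag = flag.strip()
        if not flag:
            continue
        if flag in ("sha1", "sha256"):
            if flag not in algos:
                algos.append(flag)
        elif flag in _SWITCHES:
            name, setting = _SWITCHES[flag]
            setattr(mode, name, setting)
        else:
            # Assumption: reject unknown flags rather than ignore them.
            raise ValueError(f"unknown GITBOM_BUILD_MODE flag: {flag!r}")
    if mode.embed_bomid:
        # embed_bomid implies create_adg: no bom_id without a gitBOM doc.
        mode.create_adg = True
    mode.hash_algos = tuple(algos)
    return mode
```

With this sketch, `parse_build_mode("sha256,embed_bomid,record_hash")` turns on create_adg even though it is not listed, matching example 4, while both an unset variable and an empty string yield the all-off defaults.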

Let me know if you have any questions/comments.

@yonhan3
Author

yonhan3 commented Sep 29, 2022

I have also used another GITBOM environment variable to specify the top-level directory in which to store the generated gitBOM docs, symlinks, and record_hash log files.

Using absolute path:
GITBOM_DOC_SAVE_DIR=/tmp/gitbom-doc-dir

or

Using relative path:
GITBOM_DOC_SAVE_DIR=.adg/gitbom-doc-dir

Of course, there are still some sub-directories under this GITBOM_DOC_SAVE_DIR, to store gitBOM docs, symlinks, or record_hash log files.

If GITBOM_DOC_SAVE_DIR is unset, then all the gitBOM docs will be saved in the default directory, which is specified or recommended in the gitBOM specification.
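A minimal Python sketch of how the save directory might be resolved from the environment. The function name `resolve_doc_dir` is hypothetical. Two things are assumptions of this sketch: relative paths are resolved against the build tool's working directory, and the fallback is `.gitbom` (the default directory mentioned later in this thread for the current implementation).

```python
from pathlib import PurePosixPath

# Assumption: fallback directory when GITBOM_DOC_SAVE_DIR is unset or empty.
DEFAULT_DOC_DIR = ".gitbom"


def resolve_doc_dir(env, cwd):
    """Return the top-level directory for gitBOM docs, symlinks and logs.

    `env` is a mapping such as os.environ; `cwd` is the working directory
    used to anchor relative paths (an assumption of this sketch).
    """
    raw = env.get("GITBOM_DOC_SAVE_DIR")
    base = PurePosixPath(raw) if raw else PurePosixPath(DEFAULT_DOC_DIR)
    if base.is_absolute():
        return base
    return PurePosixPath(cwd) / base
```

Pure paths are used so the sketch does no filesystem access; a real tool would then create its sub-directories under the returned path.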

@edwarnicke
Contributor

I'd suggest we reduce this down to:

GITBOM_DO_NOT_EMBED

which, if set and non-empty, instructs the build tool to not embed gitbom docs.

This does open some questions about the a2g link farm. Non-embedding opens the possibility for multiple GitBOM docs for the same artifact id.

In construction of the link farm, this can be solved by replacing the symlink with a directory of symlinks.

For consumption of the link farm, there is no one true solution. It cannot be known which GitBOM doc corresponds to the particular build the build tool is currently part of. The best we can manage is to use a simple heuristic like 'last modified'.
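A minimal Python sketch of the "directory of symlinks" layout and the 'last modified' heuristic described above. The function names and the on-disk layout (one directory per artifact id, one symlink per GitBOM doc, named after the doc file) are assumptions made for illustration, not something the spec defines.

```python
from pathlib import Path


def add_doc_link(farm: Path, artifact_id: str, doc: Path) -> Path:
    """Register `doc` as one of the GitBOM docs for `artifact_id`.

    Instead of a single symlink per artifact, each artifact id gets a
    directory that can hold several symlinks (hypothetical layout).
    """
    entry = farm / artifact_id
    entry.mkdir(parents=True, exist_ok=True)
    link = entry / doc.name
    if not link.is_symlink():
        link.symlink_to(doc.resolve())
    return link


def latest_doc(farm: Path, artifact_id: str):
    """Pick the most recently modified doc for an artifact, or None.

    This is only a heuristic: it cannot know which doc belongs to the
    build that the calling tool is currently part of.
    """
    entry = farm / artifact_id
    if not entry.is_dir():
        return None
    # Path.exists() follows symlinks, so dangling links are skipped.
    links = [p for p in entry.iterdir() if p.exists()]
    if not links:
        return None
    # stat() also follows symlinks, so this compares the docs' own mtimes.
    return max(links, key=lambda p: p.stat().st_mtime).resolve()
```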

@yonhan3
Author

yonhan3 commented Dec 5, 2022

Hi @edwarnicke Ed, how do we completely turn off the generating of gitBOM documents?
One single GITBOM_DO_NOT_EMBED is not sufficient.

Also, I think it is very valuable to have the flexibility to choose among all 3 combinations: 1. sha1 only; 2. sha256 only; 3. sha1+sha256. Because sha1+sha256 will at least double the number of gitBOM documents compared to sha1-only, most users will probably prefer either sha1-only or sha256-only. Not having a knob that provides this flexibility would be a real drawback.

Let me know your suggestions. Thanks.

@edwarnicke
Contributor

@yonhan3 How would this do for embedded mode: #24

@edwarnicke
Contributor

> Hi @edwarnicke Ed, how do we completely turn off the generating of gitBOM documents? One single GITBOM_DO_NOT_EMBED is not sufficient.

Would it make sense to simply not generate GITBOM Docs if GITBOM_DIR (or equivalent tool flag) is not set?
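Taken together with GITBOM_DO_NOT_EMBED, this suggestion could be sketched as follows in Python. The function name `gitbom_plan` and the returned dictionary are hypothetical, and treating an empty GITBOM_DIR the same as an unset one is an assumption of this sketch.

```python
def gitbom_plan(env):
    """Decide what a build tool should do, given a mapping like os.environ.

    - No GITBOM_DIR (assumption: unset or empty) -> generate no docs at all.
    - GITBOM_DO_NOT_EMBED set and non-empty -> generate docs, skip embedding.
    """
    generate = bool(env.get("GITBOM_DIR"))
    # Embedding needs a generated doc, since the embedded id is derived from it.
    embed = generate and not env.get("GITBOM_DO_NOT_EMBED")
    return {"generate_docs": generate, "embed": embed}
```

Under these two variables there are three reachable states: nothing at all, docs without embedding, and docs with embedding.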

@yonhan3
Author

yonhan3 commented Dec 14, 2022

> > Hi @edwarnicke Ed, how do we completely turn off the generating of gitBOM documents? One single GITBOM_DO_NOT_EMBED is not sufficient.
>
> Would it make sense to simply not generate GITBOM Docs if GITBOM_DIR (or equivalent tool flag) is not set?

That would work.
The current implementation will use a default .gitbom directory if GITBOM_DIR or GITBOM_DOC_SAVE_DIR is not set. I can change the implementation to not generate GITBOM docs if this GITBOM_DIR var is not set.
Thanks for the suggestion!

@yonhan3
Author

yonhan3 commented Dec 14, 2022

> @yonhan3 How would this do for embedded mode: #24

I guess you are talking about the 3 combinations: 1. sha1-only; 2. sha256-only; 3. sha1+sha256.
For embedded mode, do we have a way to specify one of the above 3 combinations?

In BOMSH, there is a command line option to specify it: --hashtype "sha1 | sha256 | sha1,sha256".
There is another command line option to specify whether to embed: "-n" disables embedding, while omitting "-n" enables embedding (which is the default).
Let me know if this answers your question.

@alilleybrinker
Member

Per the linked comment in #53, I am trying to clarify and move forward the no-embedding discussion to a point of resolution, as I think it's a little bit murky right now. Full context here: #53 (comment)

@alilleybrinker alilleybrinker added c-spec Category: Improvements or additions to the OmniBOR specification s-needs-info Status: Further information is requested labels Sep 20, 2023
@alilleybrinker
Member

Closing, per this comment: #53 (comment)
