Skip to content
This repository has been archived by the owner on Sep 9, 2020. It is now read-only.

vendor/ verification #121

Closed
sdboyer opened this issue Jan 23, 2017 · 31 comments · Fixed by #1912
Closed

vendor/ verification #121

sdboyer opened this issue Jan 23, 2017 · 31 comments · Fixed by #1912
Assignees

Comments

@sdboyer
Copy link
Member

sdboyer commented Jan 23, 2017

dep needs a way of verifying that the contents of vendor/ are what's expected. This serves three purposes:

  1. Performance: without a verification mechanism, the only safe choice dep can make is to drop and recreate vendor in its entirety on every operation. This is wasteful and slow; any operation that avoids unnecessary disk writes to determine if vendor needs updating will be faster, and would be a huge boon.
  2. Transparency: dep status is very limited in what it can report about vendor/ if it can't verify that vendor is actually what's expected.
  3. Security: without a verification mechanism, we can't know that the code we're building with is the code we expect it should be. We haven't articulated a security model yet, but this is obviously a significant part of it.

The simple, obvious solution is to deterministically hash (SHA256, I imagine) the file trees of dependencies, then store the resulting digest in the lock. However, the need to perform vendor stripping (#120) complicates this, as that creates perfectly valid situations in which the code under vendor/ might differ from the upstream source. The more choice around stripping modes we create, the more complex and error-prone this challenge becomes.

Or, we might simply rely on the hashing performed by the underlying VCS - indeed, we already record "revisions" in the lock, so we're already sorta doing this. Relying on it is triply disqualified, though:

  • It suffers from the same issues with vendor pruning as above.
  • On the a security front, git, hg, bzr, and svn all rely on the broken SHA1 for basic integrity checks. (Even when git was first released, they made it explicitly clear that their security model is not based on the internal use of SHA1)
  • It may bind us to VCS being the underlying source for code, which is not a corner we want to paint ourselves into.

The only path forward I've come up with so far is to compute and record hashes for individual files contained in dependencies (perhaps before stripping, perhaps after). This would allow us to be more adaptive about assessing package integrity - e.g., we can know that it's OK if certain files are missing.

The downside is that these lists of file hash digests would be potentially huge (e.g., if you import k8s), though that might be mitigated by placing them under vendor/ rather than directly in the lock - sdboyer/gps#69. Also, without the larger security model in place, I don't know if disaggregating hashing to the individual file level might compromise some broader notion of integrity that will prove important.

We might also record just the project unit-level hash against some future day where we've reinvented GOPATH and are no longer relying on vendor.

@calebhearth
Copy link

I've tackled some verification of binaries in the past for a tool I built using GPG and SHA256: https://github.com/thoughtbot/ftree/tree/master/dist#verifying

@sdboyer
Copy link
Member Author

sdboyer commented Jan 24, 2017

@calebthompson Thanks for the link! I'll have a look through your stuff as soon as I can. Quickly, though - our goal here is verifying a source tree, not the compiled binary (maybe that matters, maybe it doesn't). I suspect we also have some thinking to do about what we're verifying against.

@calebhearth
Copy link

Definitely. We could pipe the entire tree into either tool to generate the signatures.

$ git ls-files **/*go | xargs cat | gpg --detach-sign --armor -o tree.gpg
$ gpg --verify tree.gpg **/*.go
gpg: Signature made Tue Jan 24 10:00:37 2017 MST using RSA key ID A0ACE70A
gpg: Good signature from "Caleb Thompson <caleb@calebthompson.io>" [ultimate]
gpg:                 aka "Caleb Thompson <caleb@heroku.com>" [ultimate]
$ echo "foo" > foo.go
$ gpg --verify tree.gpg **/*.go
gpg: Signature made Tue Jan 24 10:00:37 2017 MST using RSA key ID A0ACE70A
gpg: BAD signature from "Caleb Thompson <caleb@calebthompson.io>" [ultimate]

@Dr-Terrible
Copy link

Dr-Terrible commented Jan 25, 2017

Performance: without a verification mechanism, the only safe choice dep can make is to drop and recreate vendor in its entirety on every operation. This is wasteful and slow; any operation that avoids unnecessary disk writes to determine if vendor needs updating will be faster, and would be a huge boon.
Security: without a verification mechanism, we can't know that the code we're building with is the code we expect it should be. We haven't articulated a security model yet, but this is obviously a significant part of it.

Some time ago the folks at Gentoo had to tackle the same exact problem for their package manager, Portage; they ended up with the generation of a digest for every first-level directory's content of their package manager's tree (which is file-based and quite huge).

The only path forward I've come up with so far is to compute and record hashes for individual files contained in dependencies (perhaps before stripping, perhaps after).

That is completely unnecessary. golang/dep can use the same solution of Gentoo's Portage by check-summing the vendor/ directories itself; yet hashing a directory's data poses not insignificant challenges: first things first, to guarantee the validity of the hash the order of the directory's data must be predictable across all the OS supported by golang/dep. For example on *BSD systems a find -s vendor/ -type f -exec md5sum {} \; | md5sum¹ is fine because "find -s" guarantees the order of the directory listings, but on Linux and Windows you need to use a hack in the form of find vendor/ -type f -exec md5sum {} \; | sort -k 2 | md5sum to obtain the same result. Linux and windows don't guarantee that file-systems can maintain the directory listings in a stable, predictable order.

Hence, listing the directory's data is a naive approach, and prone to errors; a better solution should be to hash the directory's data and its metadata altogether (stored as an archive): find vendor/ | cpio -o | md5sum. This way even changes to file permissions and ownership are detect, if this behaviour is required by golang/dep.

Yet even considering data+metadata doesn't solve the problem that content sorting differs from one file-system to another, even on the same OS. Therefor, listing the directory's data+metadata throughout external tools can't produce consistent hashes on different OSes, leading to undefined behaviours and false positives.

The only viable solution (which is the same used by Gentoo's Portage and Git's tree-ish ID, btw) is doing content sorting directly within golang/dep, filtering its content in a predictable order (such as alphabetic order, or other reliable heuristics), store it in a temporary archive and then check-summing it. It's far better and efficient than traverse and record hashes for individual files contained in vendor/, not to mention that you can safely ignore #120 because you have control over how you traverse the vendor/ directory.

Go standard library already offers everything that is required, from file-system abstraction to archive generation, in a cross-platform way, thus guaranteeing the uniqueness of the hashes across different platforms. As a final note, this solution would be marginally susceptible to length-extension attacks depending on the hash function used.

¹- md5sum is used only as an example for brevity, but any hash function should work here;

@sdboyer
Copy link
Member Author

sdboyer commented Jan 25, 2017

This is great info and context - thanks. Some quick thoughts:

Some time ago the folks at Gentoo had to tackle the same exact problem

I'm not sure it's exactly the same. This process is occurring on a user's system, probably when a lock.json is generated. The first sentence of that link reads:

MetaManifest provides a means of verifiable distribution from Gentoo Infrastructure to a user system

So, that's central-to-user; we have a user-to-user problem. That may not end up mattering, if only because there's no uniform, reliable way to verify a trusted upstream source. I'm just noting it while I absorb this information.

The only path forward I've come up with so far is to compute and record hashes for individual files contained in dependencies (perhaps before stripping, perhaps after).

That is completely unnecessary.

I think you may have missed the original reasoning for this: if we are to admit the possibility of vendor pruning (#120), and especially multiple/customizable types of pruning, then it would be valid to have a vendor/ directory with only a subset of the original files present in the upstream source. (I see you've commented over there, though, so this is definitely not news). Maybe that's OK - maybe changing any of those configuration options for what gets pruned necessarily changes the lock with the results of hashing the new vendored set.

But, there also may have to be additional files in vendor to allow for processing/code generation type tasks. That's much harder to integrate.

The only viable solution (which is the same used by Gentoo's Portage and Git's tree-ish ID, btw) is doing content sorting directly within golang/dep

The info about the issues with command-driven tooling is useful, but don't worry - we were never considering doing this any way other than directly within the tool, using the facilities provided in stdlib.

@mikijov
Copy link
Contributor

mikijov commented Jun 27, 2017

Security: without a verification mechanism, we can't know that the code we're building with is the code we expect it should be. We haven't articulated a security model yet, but this is obviously a significant part of it.

On the a security front, git, hg, bzr, and svn all rely on the broken SHA1 for basic integrity checks. (Even when git was first released, they made it explicitly clear that their security model is not based on the internal use of SHA1)

I wonder if we need to tackle security at all? We inherit security from the underlying vcs. We inherit both vulnerabilities and future improvements this way. But more specifically, if security is paramount, user should commit vendored code into own repo. In this situation the only two posibilities for the code to be comprimised is during initial clone or in users repo. If initial clone is a suspect, then no security of ours can detect that. If users repo was compromised, then the lock file could be changed as well. In other words, what is the use case of layering extra security on top of vcs?

@sdboyer
Copy link
Member Author

sdboyer commented Jul 18, 2017

@mikijov Security is its own issue, and yeah, we do have to seriously address it. However, this issue is about verification - related to, but independent of, security. Without this verification mechanism, it's much more difficult, for us to check that the contents of a vendor/ directory are what we expect them to be (according to the lock file). Without the ability to check that, we have to fully regenerate vendor/ on every dep ensure, which is very wasteful. There are other implications for further down the line, too.

@karrick was working on this during the Gophercon hack day, and I think we have an algorithm together to do the hashing: karrick@9449c59

@karrick
Copy link
Contributor

karrick commented Jul 28, 2017

@sdboyer,

While working on the algorithm to recursively hash the file system contents rooted at a specified file system object, I came across a possible gotcha when we follow symbolic links, which made me consider that the reason filepath.Walk actually elides following symbolic links might be to avoid circular loops when a symbolic link exists that causes an infinite loop.

The naive solution to preventing an infinite loop is quite simple, but does not scale.

We could punt on this issue for now, and pretend that there are no symbolic links that would cause an infinite loop, and we can continue on the solution we discussed at GopherCon, but I wanted you to be made aware of it.

Or if you have any suggestions, please let me know.

@karrick
Copy link
Contributor

karrick commented Jul 28, 2017

Maybe you have a suggestion for quickly detecting whether the file system node has been visited already.

My initial idea was to use a map and perform a quick key check, despite this method involves a constantly growing map on the heap. However, in order to insulate from the underlying os-specific data structures, there is no way I have found to grab the underlying fileStat (not FileInfo) information returned by the runtime in order to create a hash key from the node.

Therefore the only way I can think of to correctly determine whether the current node has been visited is to use the os.SameFile function with the current FileInfo and a list of FileInfo instances already visited.

package fs

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sort"
	"strconv"

	"github.com/pkg/errors"
)

var pathSeparator = string(os.PathSeparator)

// HashFromNode returns a deterministic hash of the file system node specified
// by pathname, performing a breadth-first traversal of directories, while
// ignoring any directory child named "vendor".
//
// While filepath.Walk could have been used, that standard library function
// skips symbolic links, and for now, it's a design requirement for this
// function to follow symbolic links.
//
// WARNING: This function loops indefinitely when it encounters a loop caused by
// poorly created symbolic links.
func HashFromNode(pathname string) (hash string, err error) {
	// bool argument: whether or not prevent file system loops due to symbolic
	// links
	return hashFromNode(pathname, false)
}

func hashFromNode(pathname string, preventLoops bool) (hash string, err error) {
	h := sha256.New()
	var fileInfos []os.FileInfo

	// Initialize a work queue with the os-agnostic cleaned up pathname.
	pathnameQueue := []string{filepath.Clean(pathname)}

	for len(pathnameQueue) > 0 {
		// NOTE: pop a pathname from the queue
		pathname, pathnameQueue = pathnameQueue[0], pathnameQueue[1:]

		fi, er := os.Stat(pathname)
		if er != nil {
			err = errors.Wrap(er, "cannot stat")
			return
		}

		fh, er := os.Open(pathname)
		if er != nil {
			err = errors.Wrap(er, "cannot open")
			return
		}

		// NOTE: Optionally disable checks to prevent infinite recursion when a
		// symbolic link causes an infinite loop, because this method does not
		// scale.
		if preventLoops {
			// Have we visited this node already?
			for _, seen := range fileInfos {
				if os.SameFile(fi, seen) {
					goto skipNode
				}
			}
			fileInfos = append(fileInfos, fi)
		}

		// NOTE: Write pathname to hash, because hash ought to be as much a
		// function of the names of the files and directories as their
		// contents. Added benefit is that empty directories effect final hash
		// value.
		//
		// Ignore return values from writing to the hash, because hash write
		// always returns nil error.
		_, _ = h.Write([]byte(pathname))

		if fi.IsDir() {
			childrenNames, er := fh.Readdirnames(0) // 0: read names of all children
			if er != nil {
				err = errors.Wrap(er, "cannot read directory")
				return
			}
			// NOTE: Sort children names to ensure deterministic ordering of
			// contents of each directory, so hash remains same even if
			// operating system returns same values in a different order on
			// subsequent invocation.
			sort.Strings(childrenNames)

			for _, childName := range childrenNames {
				switch childName {
				case ".", "..", "vendor":
					// skip
				default:
					pathnameQueue = append(pathnameQueue, pathname+pathSeparator+childName)
				}
			}
		} else {
			// NOTE: Format the file size as a base 10 integer, and ignore
			// return values from writing to the hash, because hash write always
			// returns a nil error.
			_, _ = h.Write([]byte(strconv.FormatInt(fi.Size(), 10)))
			_, er = io.Copy(h, fh)
			err = errors.Wrap(er, "cannot read file") // errors.Wrap only wraps non-nil, so elide checking here
		}

	skipNode:
		// NOTE: Close the file handle to the open directory or file.
		if er = fh.Close(); err == nil {
			err = errors.Wrap(er, "cannot close")
		}
		if err != nil {
			return // early termination if error
		}
	}

	hash = fmt.Sprintf("%x", h.Sum(nil))
	return
}

@sdboyer
Copy link
Member Author

sdboyer commented Jul 29, 2017

@karrick i've actually spent way, way too much time thinking about the roiling hellscape that is symlinks. i've spent some time investigating extant symlink cycle detection algorithms, and they...suck. just, categorically, suck. turns out, graphs are hard.

however, for once, i think i have a happy answer to a symlink-related question! 😄

basically: the hasher should never, ever traverse symlinks. if it encounters one, then it should instead hash the contents/address of the link itself (basically, the output of os.ReadLink(). this is essentially the behavior of pkgtree.ListPackages() right now, which matters because that function is the one that creates a virtual representation of the filesystem that, in turn, is used to decide what things actually exist.

i believe this is sufficient for all cases because of some key invariants:

  1. only the bits in files that are reachable as descendants of the project tree matter
  2. all of the bits in the tree are always incorporated by the hasher
  3. the ability to hash a tree does not guarantee that there are no broken symlinks in that tree

(it is not a coincidence that these invariants mirror those of snapshot-based version control systems)

i'm not entirely sure of this approach (i never am with symlinks, because symlinks 😱💥💀), so lemme walk it through.

symlinks can vary along a number of independent dimensions that are relevant for our purposes:

  1. referent is file or directory (or another symlink)
  2. symlink is absolute or relative
  3. symlink's referent escapes the tree that encompasses the original link (so, if relative, it contains at least one ..)

as long as do not embed a guarantee in the hashing process that symlinks reaching outside of a project tree must exist/be valid, then we're fine to simply record the target of the symlink. if the symlink escapes the tree (which, if we want to defend against, is something we should enforce elsewhere), then we're making no correctness guarantee anyway. if it does not, then the structure of the hashing algorithm guarantees that the bits contained in/under the referent will be reached through normal, non-symlink means. given that, there's no added value to hashing the bits again; it's sufficient to simply record that a link exists.

the only real question in my mind with this is how we record the fact - in a cross-platform way - that a given file is a symlink in the hashing inputs itself, so that the hasher disambiguates between a normal file containing an address and a symlink with that same address.

@karrick
Copy link
Contributor

karrick commented Jul 29, 2017

@sdboyer, thanks for the feedback.

I spent a bit of time today looking at the surrounding dep code looking for where I'd hook this into, but I did not run across it yet. But I did see that other parts of dep were ignoring symlinks, and was curious about the necessity of following them in the hashing function if we ignore them elsewhere.

I'll update my fork tomorrow based on your feedback, and send a PR your way for the hashing code.

@sdboyer
Copy link
Member Author

sdboyer commented Jul 29, 2017

sounds great!

I spent a bit of time today looking at the surrounding dep code looking for where I'd hook this into, but I did not run across it yet.

yeah, i hadn't previously carved out a place for it. i have some basic thoughts on where to fit it in:

first and simplest would be to just add a workhorse func to internal/gps/pkgtree: HashPackageTree(path string). i suspect that returning a []byte is probably best for future flexibility, but i could be convinced of a fixed-length array, too.

then, we'll want something like gps.VerifyDepTree(basedir string, l Lock) (map[ProjectRoot]MismatchType, error). the basedir would either be an actual vendor dir, or at least have the same semantics.MismatchType would basically be Go's anemic version of an enum:

type MismatchType uint8
const (
	NoMismatch MismatchType = iota
	EmptyDigestInLock
	DigestMismatchInLock
	NotInLock
	NotInTree
)

i feel like these signatures and names may be enough to suggest the desired behavior, but lmk if more detail would be helpful. (also, we need a better name than MismatchType, as that's too general for this special-purpose type)

eventually, i'd like to see this func called in a more flexible way from up in dep - e.g., we dispatch the hasher in the background, prior to running the solver. we should be able to get some parallelism benefits there; both exert CPU pressure, but to the extent they are I/O-bound, the hashing code is disk-intensive, whereas the solver is often more network-intensive.

but, let's have a first PR just focus on these two functions, and the concomitant additions to gps.LockedProject (it needs the new hash digest). once we're happy with those, we can actually start making use of the verification mechanism during normal operations.

@karrick
Copy link
Contributor

karrick commented Jul 30, 2017

Someone feel free to correct me if I'm wrong, but IIRC one of the reasons that file size prefixed its contents when git creates a hash for the file was because SHA-1 was chosen because of its speed and "good enough" quality, despite not being a very long hash. However, in dep we're free to use SHA-256, which is both fast and long, and it seems a bit cargo-cultish to include file sizes when creating a hash for its contents.

This comment doesn't really change the algorithm much, but I thought I'd point that out. We can always remove the file size from the hash in the future, and worst case people will re-download their dependencies another time.

@karrick
Copy link
Contributor

karrick commented Aug 21, 2017

@sdboyer,

I've got some interesting preliminary benchmarks from a new custom directory walking function.

First we establish a baseline using the find utility.

[karrick@linux walk-stdlib]$ time find ~ >/dev/null

real    0m17.218s
user    0m2.437s
sys     0m5.279s

Then we walk the directory tree using the standard library filepath.Walk function.

[karrick@linux walk-stdlib]$ cd ..
[karrick@linux examples]$ time walk-stdlib/walk-stdlib ~ >/dev/null

real    0m20.882s
user    0m7.407s
sys     0m13.117s

Then we walk the directory tree using a new faster walk function.

[karrick@linux examples]$ time walk-fast/walk-fast ~ >/dev/null

real    0m10.332s
user    0m10.557s
sys     0m2.827s

I know some of that is due to virtual file system caching, but it's at least a promising start.

@karrick
Copy link
Contributor

karrick commented Aug 21, 2017

After many more benchmarks, the find utility finishes the same directory structure in less than 4 seconds. So what I posted above was not accurate because the OS caches were not primed.

However, the fast walking example consistently performs at 2x speed as compared to the version that uses the standard library filepath.Walk function.

@karrick
Copy link
Contributor

karrick commented Aug 22, 2017

The modified code works well on Linux and Mac, but not on Windows, because of different syscalls available, as expected. I think godirwalk ought to have two different code paths with correct build flags, so it naturally runs on different operating systems.

@karrick
Copy link
Contributor

karrick commented Aug 26, 2017

I have extracted the directory walking logic to https://github.com/karrick/godirwalk, which now works with unix and Windows. Incidentally it even corrects a subtle bug on Windows over the standard library. On Windows, a symlink to a directory has multiple type mode bits set. The logic in filepath.Walk fails to detect the situation and attempts to follow those symlinked directories. By reversing the bit checks, the godirwalk project correctly detects a symlinked directory and either ignores it or follows it, depending on how the walk function is invoked. I have updated my local copy of dep to use gordirwalk.Walk and all tests still pass. Weren't there a few tests that had to be deactivated on Windows?

@sdboyer
Copy link
Member Author

sdboyer commented Aug 27, 2017

sounds like a good time for a PR to switch out our internal impl and use godirwalk for the hasher, then 😄

@CrushedPixel
Copy link

Just wondering what the state of this issue is - it seems like @karrick came up with a cross-platform method of creating a hash of each vendored dependency. What steps have to be taken next to push resolution of this important issue forward?

@qknight
Copy link

qknight commented Feb 17, 2018

@sdboyer in nixos the problem for generating a hash of a GIT repository is already solved. i've added some notes here: https://github.com/nixcloud/dep2nix#golang-dep-git-repo-hash-extension

i love the dep effort and using dep2nix it is really easy now to finally nail down all the dependency hashes to package a go project. but here is the catch:

using nix for go deployment we have to build a Gopkg.lock -> deps.nix manually, each time the project dependencies change. manually means lots of effort for the maintainer of the package.

here is the difference to yarn:

  • yarn.lock adds a sha256 hash per dependency (tar.bz2) so we can generate a yarn.nix file on the fly
  • Gopkg.lock lacks a hash (GIT repo) as there is no way to serialize a GIT repository cross platform

that said, using the NAR file specification along with the store concept and the hash implementation we have with nix, we pretty much already have all the tools dep would need to make generate a hash. should we now use that c++ code and create a go library from it which can be used by dep?

libraries needed for building the hash (implemented as go library using cgo):

@qknight
Copy link

qknight commented Feb 25, 2018

@sdboyer do you have any license concerns using nix related code as a dependency?

@sdboyer sdboyer mentioned this issue Jun 28, 2018
4 tasks
@sdboyer
Copy link
Member Author

sdboyer commented Jul 11, 2018

Finally got this all wrapped up!

@qknight i appreciate the offer (and my apologies for not responding sooner), but we had a planned roadmap for all of this already, and have had a CRLF-normalizing hasher in place since last fall. That's what i utilized when implementing #1912.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants