-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Overhaul npm detector #2264
Overhaul npm detector #2264
Conversation
f583bf3
to
0080bf4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This went through several iterations so there may be some leftover cruft or confusing bits.
// Skip the first two indices: 1 is the entire string, 2 is the protocol. | ||
index, uri := firstNonEmptyMatch(matches, 2) | ||
info := ®istryInfo{ | ||
RegistryType: registryType(index - 1), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is spooky black magic that relies on knownRegistryPat
having the same number of capture groups as registryType
has values. Unfortunately, it doesn't seem possible to programmatically get enum values so this can't be verified at compile time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be feasible to create some sort of registry
struct that contains a pattern and a name and a const int, and then smoosh them all together into a list in one place? That way there wouldn't have to be these multiple lists maintained across multiple files. (If not, then I think some loud comments in all the places that are magically correlated are warranted.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered doing that but was concerned that looping through n patterns would be much less efficient. In the worst case, you'd need to loop through all the "known" patterns, find all "generic" registry matches, and do a secondary lookup to determine the scheme for each registry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I'm either misunderstanding something or being unclear. I'm just proposing merging the three lists of registries you have (const ints, string names, and patterns) into one list of structs, each of which contains three fields. Why would that add any additional loops?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps I'm misunderstanding. Can you provide an example of what you mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type registry struct {
// maybe an int too, i don't know if it's necessary
name string
pattern string
}
// somewhere, maybe at a top level
registries := []registry {
{
name: "npm",
pattern: `(registry\.(?:npmjs\.org|yarnpkg\.com))`
},
{
name: "artifactoryCloud",
pattern: `([a-z0-9-]+\.jfrog\.io/(?:artifactory|[a-z0-9._-]+)/api/npm/[a-z][a-z0-9._-]+)`
{,
// etc
}
(This could also be some kind of map. I see that some of the patterns are pretty long and decomposable, so maybe they get built out of a function somehow.) Building the big monster pattern is a matter of looping through the slice and just joining all the patterns together. This would eliminate the spooky magic that ties together the big pattern and the registry list.
a67a1c4
to
a62d713
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a ton of work, thanks for diving into it. I added some questions inline.
req.Header.Set("Authorization", fmt.Sprintf("Bearer %s", token)) | ||
res, err := client.Do(req) | ||
if err != nil { | ||
// A |tls.RecordHeaderError| likely means that the server is using HTTP, not HTTPS. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I paranoid to be concerned by an automatic fallback here? It seems like we should be trying to the scheme in the original URI and failing if https is expected but not present. Am I thinking about this right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In some circumstances it isn't possible to determine the scheme of the registry URI. For example, //registry.host.com/npm/:_authToken=...
does not contain a scheme but we can do a secondary lookup to tell that it should be HTTP.
//registry.host.com/npm/:_authToken=...
registry=http://registry.host.com/npm/
If that information (registry=http://...
) isn't present in the chunk then the scheme is unknown
. HTTPS is a reasonable default but may not necessarily be correct. Trying to connect to a HTTP server with HTTPS seems to reliably result in a tls.RecordHeaderError
; conversely, I haven't found a reliable indicator that you're to connect to a HTTPS port with HTTP.
We could limit the automation fallback to cases where the scheme is unknown
. Personally, I don't think there's any risk in having it.
_ = res2.Body.Close() | ||
|
||
if res2.StatusCode == http.StatusOK { | ||
return false, nil, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tell me if I've got this right:
- If the whoami user is anonymous,
- And the request succeeds even without the token sent at all,
- Then we conclude that the token isn't valid and the endpoint is just bad at communicating that fact
- But if the request fails after removing the token, then we conclude that the token legitimately has no user attached
If I got that correctly, then it took me a bit to understand it. (And if I didn't, then I definitely didn't understand it!) I think I would have figured it out faster if this logic were factored out into a named function, but maybe there's another approach that would clarify it as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I got that correctly, then it took me a bit to understand it. (And if I didn't, then I definitely didn't understand it!) I think I would have figured it out faster if this logic were factored out into a named function, but maybe there's another approach that would clarify it as well.
Your understanding is correct, but the verification logic is probably going to get even more convoluted, so it would make sense to extract this into its own function.
Some registries do not implement the whoami endpoint and return a 404. Others, like npm.fontawesome.com
, have a no-op whoami endpoint that always returns 200 {"username": null}
. That's not to mention that some registries don't implement either /-/whoami
or /-/all
but do implement /-/v1/search
or only support direct package references.
e.g., I have encountered dozens of Azure and Google registry URLs but they all return 404. I don't know if that means the tokens are invalid, the registry doesn't implement the endpoint, or the registry no longer exists.
func (s ScannerGeneric) Version() int { return 0 } | ||
|
||
// genericKeyPat should match all possible values for .npmrc auth tokens. | ||
// TODO: Ensure this works with Yarn and UPM configurations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this PR degrade any existing functionality?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done some further testing and having dedicated scanners for each pattern type (.npmrc
, .yarnrc
, .upmconfig.toml
, etc.) seems to produce fewer false positives. However, I'm not sure how to best organize them. Are "versions" meant to have any semantic meaning/order?
i.e., is it confusing to have one detector type with numerical versions that are completely unrelated (npm version 1 & 2 are siblings, version 3 is yarn, version 4 is rennovate, version 5 is a hypothetical update to the yarn config, etc.)
// Skip the first two indices: 1 is the entire string, 2 is the protocol. | ||
index, uri := firstNonEmptyMatch(matches, 2) | ||
info := ®istryInfo{ | ||
RegistryType: registryType(index - 1), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be feasible to create some sort of registry
struct that contains a pattern and a name and a const int, and then smoosh them all together into a list in one place? That way there wouldn't have to be these multiple lists maintained across multiple files. (If not, then I think some loud comments in all the places that are magically correlated are warranted.)
for _, registry := range registries { | ||
verified, extraData, err = doVerification(ctx, s.client, registry, token) | ||
if verified { | ||
return true, extraData, err | ||
} | ||
if err != nil { | ||
errs = append(errs, err) | ||
} | ||
} | ||
return false, nil, errors.Join(errs...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tell me if I'm reading this right: Any verification errors returned from any registry will propagate to the final result, even if any (other) individual registry returns a determinate failure (i.e. no error.) This means that if any registry reports indeterminate verification, then either the token will be verified or indeterminately unverified.
If I'm right, this is an issue, because it means that tokens can plausibly be impossible to determinately revoke. (A similar problem exists with the JDBC detector.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sort of. The 'happy path' is that a token is associated with a single known registry(1) and we get either a determinate verification or failure. If the token is associated with a single registry but it isn't a 'known' pattern(2), it's possible to encounter an indeterminate failure caused by implementation-specific differences.
# 1
//artifactory.corp.com/artifactory/api/npm/npm/:_authToken=*abcdef123456789*
# 2 .npmrc
//packages.corp.net/custom_layout/:_authToken=...
# 2 CLI flag
npm config set registry https://packages.corp.net/custom_layout/
The worst-case scenario is that a token cannot be correlated with a single registry, and there are multiple potential registries1. In that case, it can be difficult to tell whether something determinately or indeterminately failed. Some implementations are extremely sensitive, with things as small as a trailing slash being the difference between a 200 API response and a generic 400 HTML error page.
So you can either report failures for all attempted URLs, or report none of them and just mark the token as unverified2. Realistically, this is a problem when any multi-part secret has multiple potential matches. For example, Azure is a combination of secret (distinct) + clientId (uuid) + tenantId (uuid); if it fails and you have multiple UUIDs is it better to report ALL possible combinations or none of them?
Footnotes
-
Honestly, the detector that's most likely to trigger this right now is the existing npm v1 one that just searches for
npm.+{uuid}
. It's prone to false positives — even GitHub Advanced Security flags random UUIDs as npm tokens. ↩ -
prior versions had more logic to handle invalid registries (e.g., being redirected to https://landing.jfrog.com/reactivate-server/test or a HTML 404 page). That was a bit complicated to refactor so I ended up dropping it. ↩
// language=regexp | ||
`(?i)(//%s(?:/[a-z0-9._-]+)*)/:(?:_auth(?:Token)?|_password).{1,20}%s`, hostPat, token)) | ||
matches := registryAuthPat.FindStringSubmatch(data) | ||
if len(matches) == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be len(matches) < 2
since we're checking matches[1]
next?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe so. If a regex pattern matches, it always returns a fixed number of match groups. If you have mutually exclusive match groups like (?:(foo)|(bar))
then the non-matching group will just be an empty string.
func main() {
pat := regexp.MustCompile(`(?i)hello (?:(john)|(anne)|(world)|(.+))`)
matches := pat.FindStringSubmatch("Hello Anne")
fmt.Printf("Matches return %d groups: %s\n", len(matches), strings.Join(matches, ", "))
}
// Matches return 5 groups: Hello Anne, , Anne, ,
https://go.dev/play/p/cdPwcqZwDcT
TL;DR len(matches) should always be 0 or > 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose you could make the argument that it prevents failure, but if your pattern changes I think it's better to explicitly fail than silently break.
e.g., stuff like this will silently break if the pattern is updated to add another match group.
trufflehog/pkg/detectors/aha/aha.go
Lines 40 to 42 in 5e78ad4
if len(match) != 2 { | |
continue | |
} |
Description:
This fixes #1455. It matches new npm tokens (
npm_xxx...
), old npm tokens (NpmToken.0000-...
,0000-...
), and "non-standard" tokens such as Artifactory using a JWT or GitHub packages using a PAT. Basic auth is a separate can of worms, so is out of scope for this PR.The high-level flow for all detectors is:
Checklist:
make test-community
)?make lint
this requires golangci-lint)?