
Heuristics for "random" values to help with base-encoded typo false positives #484

Open
repi opened this issue May 7, 2022 · 7 comments
Labels
bug Not as expected

Comments

repi commented May 7, 2022

This is a base58-encoded string from our codebase. Is there some heuristic the typo checker could use to recognize such a long "random" string as not being a word, and to not suggest anything for it? This was part of a larger JSON string in a test.

error: `Wew` should be `We`
  --> ./desc.rs:200:49
    |
200 |             "bytes_cid": "z177xERgbqgBdC97Y5GYXZWew1cFgkttqr5ipF2b8iCN17",
    |                                                 ^^^
    |

Here is another similar one, also from an embedded JSON string:

error: `nd` should be `and`
  --> ./test.rs:31:221
   |
31 | pub const JWK: &str = r#"{"alg":"sig","n":"wnI2iD6F7qAg0qKGpFQ6L7qYdGbPkHSUHzigaW3p89fWBbZRT-WawqdU4vu3vANL9whlXMGlzLsPNUwXsoDKu6CnzAUUO9pr7E6CukN9A1UN13L-ZRKHAGv33NkdygDpTsYXUVAoQLykPnjToNVDKA0ohy96kzPkT4vql9n_5ev7Dhy69nd79mI09QhHo62RGzZDDanjdjXRBLBFA3Hm-CKiu"]}"#;
   |                                                                                                                                                                                                                             ^^
   |

These are the last two major false positives we've been seeing in our codebase with typos; it works really well otherwise!

repi changed the title from "Long base-encoded typo false positive" to "Long base-encoded typo false positives" on May 7, 2022
epage (Collaborator) commented May 8, 2022

Yes, we have several issues related to hashes / base encodings of some sort.

Having some kind of heuristic to discard hashes / base encodings beyond a strict syntax check would be a big help. What that would look like is the question, though. To start off brainstorming:

  • X numbers (groups of digits) in string
  • X "words" (groups of letters) shorter than Y characters
  • We can probably treat base encodings the same as hashes (i.e., no special heuristic for how "much" of a word exists between `-` characters, which would protect against math between variables), since identifiers used in math will show up somewhere else in the code and get flagged there

Any other ideas for heuristics and for what the Xs and Ys should be?
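A rough sketch of how such a heuristic could look; this is not the actual typos implementation, and the thresholds stand in for the X and Y above as purely illustrative placeholders:

```rust
/// Illustrative sketch of the brainstormed heuristic: treat a token as a
/// likely hash/base-encoding when it mixes several digit groups with
/// several very short letter groups. Thresholds are assumptions, not
/// decided values.
fn looks_like_encoded(token: &str) -> bool {
    const MIN_DIGIT_GROUPS: usize = 3; // "X numbers (groups of digits)"
    const MIN_WORD_LEN: usize = 4; // "Y characters"
    const MIN_SHORT_WORDS: usize = 3; // "X words shorter than Y"

    let mut digit_groups = 0;
    let mut short_words = 0;
    let mut chars = token.chars().peekable();
    while let Some(c) = chars.next() {
        if c.is_ascii_digit() {
            // Consume one contiguous group of digits.
            digit_groups += 1;
            while chars.peek().map_or(false, |c| c.is_ascii_digit()) {
                chars.next();
            }
        } else if c.is_ascii_alphabetic() {
            // Consume one contiguous group of letters and measure it.
            let mut len = 1;
            while chars.peek().map_or(false, |c| c.is_ascii_alphabetic()) {
                chars.next();
                len += 1;
            }
            if len < MIN_WORD_LEN {
                short_words += 1;
            }
        }
    }
    digit_groups >= MIN_DIGIT_GROUPS && short_words >= MIN_SHORT_WORDS
}

fn main() {
    // The base58 CID from the original report trips both counters...
    assert!(looks_like_encoded(
        "z177xERgbqgBdC97Y5GYXZWew1cFgkttqr5ipF2b8iCN17"
    ));
    // ...while an ordinary word does not.
    assert!(!looks_like_encoded("configuration"));
}
```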

#316 has a list of alternative approaches. Feel free to share how useful or not those approaches would be in that issue.

repi (Author) commented Aug 5, 2022

I think the most important thing would be a way to opt out of tricky situations. For example, you may want to include a text string that has typos in it in test code or similar, and there will be cases like these base-encoded values, JSON strings, and other content that are difficult to detect properly.

So having a solution like the ones in #316 to opt out would be a great, robust fallback. For our particular use cases, a way to disable checking through comments that enable/disable the spell check (something like the sketch below) would work.
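For reference, comment-based toggling can be approximated with the `extend-ignore-re` setting that comes up later in this thread. A sketch, assuming `#`/`//` comment syntax; the `spellchecker:off`/`spellchecker:on` marker names are an arbitrary convention, not built-in keywords:

```toml
# _typos.toml — everything between an "off" and an "on" comment is skipped;
# (?s) lets .*? match across lines.
[default]
extend-ignore-re = [
    "(?s)(#|//)\\s*spellchecker:off.*?\\n\\s*(#|//)\\s*spellchecker:on",
]
```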

boris-smidt-klarrio commented

Maybe this will help: there is 'ripsecrets', a tool written in Rust that uses ripgrep to find secrets in an existing project.

So these regexes could be reused for detecting such "random" values:
https://github.com/sirwart/ripsecrets/blob/main/src/lib.rs

epage (Collaborator) commented Aug 22, 2024

@boris-smidt-klarrio thanks! For now, I've at least linked to that in the docs in 8b729e1

boris-smidt-klarrio commented Aug 22, 2024

I'm not sure it will work with the ignores because of the way the tokenizer works. I had a look at it, and I assumed it keeps splitting tokens until it finds UUIDs, words, or numbers. So is there a setting to add other entries to the tokenizer with these regexes?

epage (Collaborator) commented Aug 22, 2024

@boris-smidt-klarrio `extend-ignore-re` is independent of the tokenizer. If we see a typo, we run `extend-ignore-re` against it and see if the typo is within the range. This is different from `extend-ignore-identifiers-re` and `extend-ignore-words-re`, which work on tokenized values.
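For example, a minimal `_typos.toml` sketch; the base58-style pattern is illustrative, written for the CID in the original report, not a recommended ruleset:

```toml
# _typos.toml — illustrative sketch, not a recommended ruleset.
[default]
# Checked against the raw text around a reported typo, independent of
# the tokenizer, so one pattern can span the whole encoded string.
extend-ignore-re = ["\\bz[1-9A-HJ-NP-Za-km-z]{30,}\\b"]

# By contrast, this is matched against whole tokenized identifiers.
extend-ignore-identifiers-re = ["z[1-9A-HJ-NP-Za-km-z]{30,}"]
```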

boris-smidt-klarrio commented

@epage Thank you, it works!
