Skip to content

[dev.fuzz] internal/fuzz: mutator should generate valid UTF-8 for strings #46874

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jayconrod opened this issue Jun 22, 2021 · 7 comments
Closed
Labels
FrozenDueToAge fuzz Issues related to native fuzzing support NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Milestone

Comments

@jayconrod
Copy link
Contributor

Currently, we use the same mutation engine for string and []byte. This tends to generate a lot of invalid UTF-8 strings that aren't usable for many use cases. While invalid UTF-8 is likely to turn up many shallow parser bugs, it may make the mutator less effective at finding more subtle, deeper bugs.

We should have an option to make the mutator only generate UTF-8. Some ideas:

  • Create a UTF8String defined type. A fuzz function that accepts that as a parameter would only get valid UTF-8 strings.
  • Only provide valid UTF-8 strings for string parameters. A function could request []byte for random bytes, and that can still be converted to string.

cc @golang/fuzzing @findleyr

@jayconrod jayconrod added NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. fuzz Issues related to native fuzzing support labels Jun 22, 2021
@jayconrod jayconrod added this to the Unreleased milestone Jun 22, 2021
@dsnet
Copy link
Member

dsnet commented Jun 22, 2021

Some ideas ...

IIUC, one of the eventual goals is to be able to fuzz structs. If so, I'm not sure either idea works as well for higher-order types, where the field type is already a string and can't be easily changed.

Perhaps there can be a method hung off of testing.F that configures parameters about the mutator?

@findleyr
Copy link
Member

findleyr commented Jun 22, 2021

Create a UTF8String defined type. A fuzz function that accepts that as a parameter would only get valid UTF-8 strings.

I like this idea. We could also have ASCIIString. I think it would be convenient to have such a mechanism, for exactly the reasons you list, but I'm wary of making it the default. Despite my comment about the readability of non-utf8 string examples, consuming arbitrary bytes DID turn up a bug that wouldn't have been found with valid characters alone (#46855).

IIUC, one of the eventual goals is to be able to fuzz structs. If so, I'm not sure either idea works as well for higher-order types, where the field type is already a string and can't be easily changed.

I wonder if fields could be explicitly ignored by the mutator, perhaps with a struct tag, so that they may be substituted by other means. I.e.

type S struct {
Count int
Input string fuzz:"ignore"
}

func f(t *testing.T, s S, input fuzz.UTF8String) {
s.Input = string(input)
...
}

Edit: I already dislike this last suggestion, at least as I manifested it. Fuzzing, being testing, should not add struct tags to non-test objects.

@CAFxX
Copy link
Contributor

CAFxX commented Jun 22, 2021

Perhaps there can be a method hung off of testing.F that configures parameters about the mutator?

Or, we could start with trying to have the mutator automatically balance exploration and exploitation based on which kind of string (e.g. all valid utf-8 runes, mix of valid and invalid utf-8 runes, binary data) yield the best results, in terms of defects discovered, on that specific target.

So e.g. if for that target the fuzzer detects that all-valid utf-8 runes strings unearth more defects, it will progressively bias the mutator to generate more all-valid utf-8 runes strings, and less of the others types of strings. The benefit of this approach being that it requires no changes to the code being fuzzed and no additional knobs (although it may be slightly slower).

Later on, if needed, we could easily add a way to disable this auto-tuning by passing an explicit configuration as suggested by @dsnet.

@rolandshoemaker
Copy link
Member

Or, we could start with trying to have the mutator automatically balance exploration and exploitation based on which kind of string (e.g. all valid utf-8 runes, mix of valid and invalid utf-8 runes, binary data) yield the best results, in terms of defects discovered, on that specific target.

Essentially this is the concept behind some of the advanced fuzzing strategies discussed in #46507. Implementing input prioritization methods which focus on inputs which produce more coverage (among other characteristics) naturally biases towards mutations that produce inputs that the program understands, without actually having to alter the mutator at all.

That said there have been some discussions previously about biasing certain mutators, i.e. in order to prioritize mutators which reduce the size of inputs over those that increase the size of inputs etc. In a similar vein we may want to bias the mutators for strings, to produce inputs with valid UTF-8, over invalid UTF-8. I'd like to do some evaluation with a string based target to see how much of an effect of coverage this produces.

@katiehockman
Copy link
Contributor

Only provide valid UTF-8 strings for string parameters

I don't think we should do this. There is no property in the language which states that strings must be UTF-8, and we shouldn't special case this for fuzzing.

Rob made an important point:

Despite my comment about the readability of non-utf8 string examples, consuming arbitrary bytes DID turn up a bug that wouldn't have been found with valid characters alone (#46855).

This stuck out to me because it demonstrates one of the biggest benefits of fuzzing: the mutator is able to generate inputs to f.Fuzz that are vastly different from what you may have assumed is "valid input". Limiting to UTF-8 by default would take away from this.

A few general thoughts:

  • https://go.googlesource.com/proposal/+/master/design/draft-fuzzing.md#custom-generators talks a bit about what a custom generator might look like with the current design (once structs are supported). We could provide our own set of custom generators in the testing package in the future, including one for UTF-8-only strings
  • I also wonder if testing/quick would make more sense for inputs like this which have very specific generator needs.

@katiehockman
Copy link
Contributor

Is there anything else to discuss on this issue, or are we okay with closing it?

@jayconrod
Copy link
Contributor Author

Let's close. I think we're in agreement that UTF-8 by default is not a good idea. Custom mutators or defined types are appealing, but let's open an issue for those when we have a design.

@golang golang locked and limited conversation to collaborators Jun 29, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge fuzz Issues related to native fuzzing support NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Projects
Status: No status
Development

No branches or pull requests

7 participants