Quadratic behaviour on pathological html #299
Comments
Thanks!
This has to do with parsing of CDATA elements as inline HTML, not HTML blocks. This is handled entirely by a scanner defined using re2c, which works in linear time. However, in this case the linear-time scanner gets applied repeatedly to the tail of the input string. It's tricky because, while with regular tags we can quit parsing when we hit […]. I guess we'll need to special-case CDATA somehow, keeping track of the last position we've failed to find […]
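To make the cost concrete, here is a minimal sketch (not cmark's actual scanner) that counts how many bytes get examined when every `<![CDATA[` opener triggers a fresh scan of the remaining input for `]]>`. The names `scan_for_cdata_end` and `naive_cdata_work` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: scan forward from `pos` for the terminator "]]>",
 * adding one to *work for every position examined. Returns 1 if found. */
static int scan_for_cdata_end(const char *buf, size_t len, size_t pos,
                              size_t *work) {
  for (; pos + 3 <= len; pos++) {
    (*work)++;
    if (memcmp(buf + pos, "]]>", 3) == 0)
      return 1;
  }
  return 0;
}

/* Count positions examined when each "<![CDATA[" opener starts a new scan
 * of the tail of the input, which is the behaviour described above. */
size_t naive_cdata_work(const char *buf, size_t len) {
  static const char opener[] = "<![CDATA[";
  const size_t opener_len = sizeof(opener) - 1;
  size_t work = 0;
  assert(buf != NULL);
  for (size_t i = 0; i + opener_len <= len; i++) {
    if (memcmp(buf + i, opener, opener_len) == 0)
      (void)scan_for_cdata_end(buf, len, i + opener_len, &work);
  }
  return work;
}
```

On an input made only of repeated `<![CDATA[` openers, doubling the number of openers roughly quadruples the value returned by `naive_cdata_work`, even though each individual scan is linear.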
I came to a similar conclusion. Maybe simply setting a flag would be enough? If we fail to find […]
It's likely the same story as it was in MD4C. EDIT: No. Copied something else. cmark has a problem with the one in the next comment. Does cmark handle HTML processing instructions differently than other inline raw HTML kinds?
And
Also
I think that, instead of handling these with the regex, we need a manual scanner that can keep track of where it failed.
Also found by OSS-Fuzz: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=32344 Possible fix: Don't try to reparse these constructs if the (fixed) character sequence required to terminate them wasn't found during the first scan.
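The "don't reparse after a failed scan" idea could look roughly like the sketch below. It is an illustration under assumptions, not the code that went into cmark: `scan_state`, `no_cdata_end` and `find_cdata_end` are hypothetical names, and the sketch assumes the inline parser only ever calls the function with non-decreasing positions, so a scan that reached the end of the input proves the terminator is absent for all later calls too.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-parse state. */
typedef struct {
  int no_cdata_end; /* set once a scan for "]]>" has run to end of input */
  size_t work;      /* positions examined, kept only to show the cost */
} scan_state;

/* Returns the offset just past "]]>" if it occurs at or after `pos`,
 * or 0 if it is absent (a real match always returns at least 3). */
size_t find_cdata_end(scan_state *st, const char *buf, size_t len,
                      size_t pos) {
  assert(st != NULL && buf != NULL);
  if (st->no_cdata_end)
    return 0; /* an earlier scan already covered the rest of the input */
  for (; pos + 3 <= len; pos++) {
    st->work++;
    if (memcmp(buf + pos, "]]>", 3) == 0)
      return pos + 3;
  }
  st->no_cdata_end = 1; /* no terminator from here on; skip future scans */
  return 0;
}
```

With the flag in place, an input of repeated unterminated openers costs one full scan in total instead of one scan per opener.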
Repeated starting sequences like `<?`, `<!DECL ` or `<![CDATA` could lead to quadratic behavior if no matching ending sequence was found. Separate the inline HTML scanners. Remember if scanning the whole input for a specific ending sequence failed and skip subsequent scans. Fixes commonmark#299.
Proposed fix: nwellnhof@1944fe7
@nwellnhof I haven't had a chance to really look in detail, but the idea looks good to me on first glance; do you want to submit a PR?
Repeated starting sequences like `<?`, `<!DECL ` or `<![CDATA[` could lead to quadratic behavior if no matching ending sequence was found. Separate the inline HTML scanners. Remember if scanning the whole input for a specific ending sequence failed and skip subsequent scans.

The basic idea is to remove the suffixes `>`, `?>` and `]]>` from the respective regex. Since these regexes are already constructed to match lazily, they will stop before an ending sequence. To check whether an ending sequence was found, we can simply check whether the input buffer is large enough to hold the match plus a potential suffix. If the regex doesn't find the ending sequence, it will match so many characters that this test is guaranteed to fail. In this case, we set a flag to avoid further attempts to execute the regex.

To check which inline HTML regex to use, we inspect the start of the text buffer. This allows some fixed characters to be removed from the start of some regexes. `matchlen` is adjusted with a single addition that accounts for both the relevant prefix and suffix.

Fixes commonmark#299.
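The mechanism in this commit message can be sketched as follows. This is a simplified illustration, not the patch itself: `scan_body` is a hand-written stand-in for the re2c-generated lazy regex with its suffix removed, the declaration rule is reduced to "`<!` followed by an ASCII letter", and `inline_state`, `html_kind` and `match_inline_html` are hypothetical names. Only `matchlen` comes from the commit message.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum { KIND_PI, KIND_DECL, KIND_CDATA, KIND_COUNT };

typedef struct {
  const char *prefix; /* fixed characters checked before scanning */
  const char *suffix; /* fixed ending sequence, not part of the scan */
} html_kind;

static const html_kind kinds[KIND_COUNT] = {
    {"<?", "?>"}, {"<!", ">"}, {"<![CDATA[", "]]>"}};

/* Hypothetical parser state: one "ending sequence not found" flag per kind. */
typedef struct {
  int failed[KIND_COUNT];
} inline_state;

/* Stand-in for the lazy regex without its suffix: consume bytes until
 * `suffix` starts at the current position or the input ends. */
static size_t scan_body(const char *p, size_t len, const char *suffix) {
  size_t slen = strlen(suffix), n = 0;
  while (n < len && !(n + slen <= len && memcmp(p + n, suffix, slen) == 0))
    n++;
  return n;
}

/* `buf` points at '<'; `len` is the number of bytes left in the input.
 * Returns the length of the matched construct, or 0 for no match. */
size_t match_inline_html(inline_state *st, const char *buf, size_t len) {
  int kind;
  assert(st != NULL && buf != NULL);
  /* Pick the scanner by inspecting the start of the buffer. */
  if (len >= 9 && memcmp(buf, "<![CDATA[", 9) == 0)
    kind = KIND_CDATA;
  else if (len >= 2 && memcmp(buf, "<?", 2) == 0)
    kind = KIND_PI;
  else if (len >= 3 && memcmp(buf, "<!", 2) == 0 &&
           isalpha((unsigned char)buf[2]))
    kind = KIND_DECL;
  else
    return 0;
  if (st->failed[kind])
    return 0; /* an earlier scan already ran to the end of the input */

  size_t plen = strlen(kinds[kind].prefix);
  size_t slen = strlen(kinds[kind].suffix);
  size_t matchlen = scan_body(buf + plen, len - plen, kinds[kind].suffix);
  /* If the body ran to the end, prefix + body + suffix cannot fit. */
  if (plen + matchlen + slen > len) {
    st->failed[kind] = 1;
    return 0;
  }
  matchlen += plen + slen; /* one addition covers prefix and suffix */
  return matchlen;
}
```

For example, with a zero-initialized `inline_state`, `match_inline_html(&st, "<?php ?> x", 10)` returns 8, while an unterminated `<![CDATA[` sets `st.failed[KIND_CDATA]` so that later CDATA openers return immediately without scanning.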
Found this vulnerability in pulldown-cmark and md4c. It appears cmark is also vulnerable.