Quadratic behaviour on pathological html #299

marcusklaas · 2019-04-29T14:50:03Z

Found this vulnerability in pulldown-cmark and md4c. It appears cmark is also vulnerable.

python -c 'print("a <![CDATA[" * 10000)' | time cmark > /dev/null
0.40user 0.00system 0:00.42elapsed 95%CPU (0avgtext+0avgdata 9720maxresident)k

python -c 'print("a <![CDATA[" * 20000)' | time cmark > /dev/null
1.60user 0.00system 0:01.62elapsed 98%CPU (0avgtext+0avgdata 17760maxresident)k

python -c 'print("a <![CDATA[" * 40000)' | time cmark > /dev/null
6.20user 0.02system 0:06.25elapsed 99%CPU (0avgtext+0avgdata 34372maxresident)k
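
The three timings above fit quadratic growth: each doubling of the input roughly quadruples the user time. A small check using only the numbers quoted above (the helper name `growth_exponents` is made up for this sketch):

```python
import math

# (repetitions, user seconds), copied from the timings above
timings = [(10000, 0.40), (20000, 1.60), (40000, 6.20)]


def growth_exponents(samples):
    """Estimate k in time ~ n**k from each pair of consecutive samples."""
    return [
        math.log(t2 / t1) / math.log(n2 / n1)
        for (n1, t1), (n2, t2) in zip(samples, samples[1:])
    ]


exponents = growth_exponents(timings)
print([round(k, 2) for k in exponents])
```

An exponent close to 2 for both pairs is what a quadratic algorithm would produce; a linear one would give values close to 1.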

jgm · 2019-04-29T15:18:55Z

Thanks!

jgm · 2019-04-29T15:28:04Z

This has to do with parsing of CDATA elements as inline HTML, not HTML blocks. And this is handled entirely by a scanner defined using re2c, which works in linear time. However, in this case the linear-time parser gets applied repeatedly to the tail of the input string.

It's tricky because, while with regular tags we can quit parsing when we hit <, with CDATA, we need to keep parsing when we hit <![CDATA[, because this can occur within a CDATA element.

I guess we'll need to special-case CDATA somehow, keeping track of the last position we've failed to find ]]>.
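
To make the repeated tail scan concrete, here is a toy model of it in Python. It is only a sketch of the behaviour described above, not cmark's scanner: the function name and the use of `str.find` as the "linear-time scanner" are assumptions for illustration.

```python
def chars_rescanned(text, opener="<![CDATA[", closer="]]>"):
    """Count characters examined when every opener triggers a fresh search for closer."""
    examined = 0
    pos = text.find(opener)
    while pos != -1:
        body_start = pos + len(opener)
        end = text.find(closer, body_start)
        if end == -1:
            # No terminator: the search ran over the whole remaining tail.
            examined += len(text) - body_start
            pos = text.find(opener, pos + 1)
        else:
            examined += end + len(closer) - body_start
            pos = text.find(opener, end + len(closer))
    return examined


small = chars_rescanned("a <![CDATA[" * 1000)
large = chars_rescanned("a <![CDATA[" * 2000)
print(small, large, large / small)
```

Each individual search is linear, but with n unterminated openers the tails add up to roughly n²/2 units of work, so doubling the input about quadruples the count.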

marcusklaas · 2019-04-29T15:37:17Z

I came to a similar conclusion. Maybe simply setting a flag would be enough? If we fail to find ]]> once, I think we will never find it after.
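
A minimal sketch of the flag idea, assuming a single input buffer that is scanned left to right. The function and variable names are hypothetical. Once one search for `]]>` has run to the end of the buffer, every later search would cover a suffix of that same tail, so it can be skipped.

```python
def find_cdata_sections(text):
    """Return the (start, end) spans of complete CDATA sections and the
    number of searches for "]]>" that were actually performed."""
    opener, closer = "<![CDATA[", "]]>"
    spans = []
    searches = 0
    closer_missing = False  # the flag: a search for "]]>" has already failed
    pos = text.find(opener)
    while pos != -1:
        next_from = pos + 1
        if not closer_missing:
            searches += 1
            end = text.find(closer, pos + len(opener))
            if end == -1:
                closer_missing = True  # never search this buffer for "]]>" again
            else:
                spans.append((pos, end + len(closer)))
                next_from = end + len(closer)
        pos = text.find(opener, next_from)
    return spans, searches


spans, searches = find_cdata_sections("x <![CDATA[ok]]> " + "a <![CDATA[" * 1000)
print(spans, searches)
```

The flag is only valid for the buffer it was set on; a new buffer needs a fresh flag.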

mity · 2019-04-29T17:28:05Z

~~cmark is also vulnerable to~~

time python -c 'print("a" + "<!A" * 40000)' | ./src/cmark >/dev/null

It's likely the same story as it was in MD4C.

EDIT: No, I copied the wrong command. cmark has a problem with the one in the next comment. Does cmark handle HTML processing instructions differently than other kinds of inline raw HTML?

mity · 2019-04-29T19:13:02Z

And ~~also~~ to

time python -c 'print("a" + "<?" * 40000)' | ./src/cmark >/dev/null

jgm · 2020-02-16T16:14:05Z

Also

time python -c 'print("a <!A " * 40000)' | time cmark > /dev/null

jgm · 2020-02-16T16:14:54Z

I think that, instead of handling these with the regex, we need a manual scanner that can keep track of where it failed.

nwellnhof · 2021-03-22T17:56:12Z

Also found by OSS-Fuzz: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=32344

Possible fix: Don't try to reparse these constructs if the (fixed) character sequence required to terminate them wasn't found during the first scan.

Repeated starting sequences like `<?`, `<!DECL ` or `<![CDATA[` could lead to quadratic behavior if no matching ending sequence was found. Separate the inline HTML scanners. Remember if scanning the whole input for a specific ending sequence failed and skip subsequent scans. Fixes commonmark#299.
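
One way to picture "separate scanners, one memory per ending sequence" is the sketch below. It is a simplified model under stated assumptions, not the code from the commit: `CONSTRUCTS`, `InlineHtmlScanner`, and `scan_at` are hypothetical names, declarations are reduced to `<!` ... `>`, and comments are ignored. It also assumes `scan_at` is called with non-decreasing positions.

```python
# Hypothetical table of (starting sequence, ending sequence); longer prefixes first.
CONSTRUCTS = [
    ("<![CDATA[", "]]>"),
    ("<?", "?>"),
    ("<!", ">"),
]


class InlineHtmlScanner:
    def __init__(self, text):
        self.text = text
        self.gave_up = set()  # ending sequences known to be absent from the rest of the text
        self.searches = 0     # number of full searches performed

    def scan_at(self, pos):
        """Return the end offset of the construct starting at pos, or None."""
        for opener, closer in CONSTRUCTS:
            if self.text.startswith(opener, pos):
                if closer in self.gave_up:
                    return None  # an earlier scan already failed to find it
                self.searches += 1
                end = self.text.find(closer, pos + len(opener))
                if end == -1:
                    self.gave_up.add(closer)
                    return None
                return end + len(closer)
        return None


scanner = InlineHtmlScanner("<?" * 1000 + "<![CDATA[x]]>")
ends = [scanner.scan_at(i) for i in range(len(scanner.text))]
print(scanner.searches, sorted(scanner.gave_up), ends[2000])
```

Keeping one entry per ending sequence matters: failing to find `?>` says nothing about whether `]]>` occurs later, so a single shared flag would reject valid constructs.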

nwellnhof · 2021-03-22T21:29:36Z

Proposed fix: nwellnhof@1944fe7

jgm · 2021-03-25T20:35:05Z

@nwellnhof I haven't had a chance to really look in detail, but the idea looks good to me on first glance; do you want to submit a PR?

Repeated starting sequences like `<?`, `<!DECL ` or `<![CDATA[` could lead to quadratic behavior if no matching ending sequence was found. Separate the inline HTML scanners. Remember if scanning the whole input for a specific ending sequence failed and skip subsequent scans.

The basic idea is to remove the suffixes `>`, `?>` and `]]>` from the respective regex. Since these regexes are already constructed to match lazily, they will stop before an ending sequence. To check whether an ending sequence was found, we can simply check whether the input buffer is large enough to hold the match plus a potential suffix. If the regex doesn't find the ending sequence, it will match so many characters that this test is guaranteed to fail. In this case, we set a flag to avoid further attempts to execute the regex.

To check which inline HTML regex to use, we inspect the start of the text buffer. This allows some fixed characters to be removed from the start of some regexes. `matchlen` is adjusted with a single addition that accounts for both the relevant prefix and suffix.

Fixes commonmark#299.
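
The suffix-length test can be illustrated with Python's `re` module. This is a sketch under assumptions, not the re2c rules from the commit: the pattern `CDATA_BODY`, the function `scan_cdata`, and the `state` dictionary with its `no_cdata_end` key are all invented for the example. The pattern matches a CDATA body and stops right before `]]>`, or at the end of the buffer if there is none.

```python
import re

# Stand-in for a scanner that matches a CDATA body but stops before "]]>".
# This pattern is an assumption for the sketch, not cmark's re2c definition.
CDATA_BODY = re.compile(r"(?:(?!\]\]>).)*", re.DOTALL)
PREFIX = "<![CDATA["
SUFFIX = "]]>"


def scan_cdata(buf, pos, state):
    """Return the length of the CDATA section at buf[pos:], or 0 if there is none.

    state["no_cdata_end"] is a hypothetical flag remembering an earlier failed scan.
    """
    if state.get("no_cdata_end") or not buf.startswith(PREFIX, pos):
        return 0
    body_start = pos + len(PREFIX)
    matchlen = CDATA_BODY.match(buf, body_start).end() - body_start
    # The body match ends right before "]]>" or at the end of the buffer, so the
    # ending sequence is present exactly when there is still room for it.
    if body_start + matchlen + len(SUFFIX) > len(buf):
        state["no_cdata_end"] = True  # skip this scanner for the rest of the buffer
        return 0
    # A single addition accounts for both the prefix and the suffix.
    return matchlen + len(PREFIX) + len(SUFFIX)


ok_state, bad_state = {}, {}
print(scan_cdata("<![CDATA[x]]> tail", 0, ok_state))
print(scan_cdata("a <![CDATA[ b", 2, bad_state), bad_state)
```

When no ending sequence exists the body match consumes the whole tail, so the room-for-suffix test fails without a second pass over the input.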

jgm mentioned this issue Jan 1, 2020

pathological case parsing inline CDATA tag jgm/commonmark-hs#7

Closed

nwellnhof mentioned this issue Mar 22, 2021

Quadratic behavior with inline HTML #379

Closed

nwellnhof mentioned this issue Mar 26, 2021

Fix quadratic behavior with inline HTML #380

Merged

jgm closed this as completed in #380 Jun 15, 2021
