Large inline images use a lot of memory #10075
This does seem to have to do with inline (base64-encoded) images specifically. I tried the same file but with a linked image, and it only used 30 MB.
Odd that even
If we do, I'd need to do some profiling to track this down further.
Profiling, first three entries for
It seems that [...] We could also think about whether we could avoid calling it.
Yeah, here's the problem (in Network.URI). The URI parser parses multiple segments:

```haskell
segment :: URIParser String
segment =
    do  { ps <- many pchar
        ; return $ concat ps
        }
```

This parses many small strings, one for each `pchar`, and then concatenates them. This should be pretty easy to optimize. I'm surprised nobody has run into this before, as this is a widely used package! For reference, here is pchar:
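To illustrate the allocation difference, here is a minimal sketch using hypothetical parsers, not network-uri's actual combinators: building a `String` one character at a time and concatenating allocates a small list per character, while a `Text`-slicing `takeWhile` consumes the whole run in one pass.

```haskell
-- Sketch only: hypothetical parsers illustrating the allocation pattern,
-- not network-uri's actual code.
import Data.Char (isAlphaNum)
import qualified Data.Attoparsec.Text as A
import qualified Data.Text as T
import Text.Parsec (many, satisfy)
import Text.Parsec.String (Parser)

isAllowed :: Char -> Bool
isAllowed c = isAlphaNum c || c `elem` "-._~"

-- Mirrors the shape of the problem: each step yields a tiny String,
-- and the pieces are concatenated at the end (lots of short-lived lists).
slowSegment :: Parser String
slowSegment = concat <$> many ((:[]) <$> satisfy isAllowed)

-- Consumes the longest matching run in one step, slicing the input Text
-- without per-character intermediate structures.
fastSegment :: A.Parser T.Text
fastSegment = A.takeWhile isAllowed
```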
I've made the patch to parseURI, so you'll notice a difference once a new version of network-uri is released; but it's not going to make a HUGE difference, because that function is still fairly inefficient. We could think about trying a different URI library, maybe uri-bytestring.
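For context, a minimal sketch of what using uri-bytestring might look like (module and function names are per that package's documented API; treat the exact signatures as an assumption to verify against its docs):

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch only: uri-bytestring parses strict ByteStrings and returns a
-- structured URIRef, avoiding network-uri's String-based parsing.
import URI.ByteString (parseURI, strictURIParserOptions)

main :: IO ()
main =
  case parseURI strictURIParserOptions "https://example.com/path?q=1" of
    Left err  -> print err
    Right uri -> print uri
```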
Thanks @jgm, really nice explanation.
I have a not-yet-polished diff that uses an attoparsec parser to identify valid base64-encoded data: URIs without extra allocation, skipping Network.URI as a fast path. That saves hundreds of megabytes of allocation and a lot of runtime and GC in the HTML reader case. There's still multiple gigabytes of allocation, and profiling indicates that the next allocation cost centers are in TagSoup:
I don't really know how to play "spot the allocation", but here's
Interestingly, the Markdown reader case seems to have a completely different bottleneck once the URL parsing is removed as a problem, but something really odd is going on with the cost-center nesting in the profile that I can't make sense of. (Note that the unreasonable memory usage is visible even with images that are several KB, as opposed to several MB.) Profile behind the cut:
edit: I tried out "late profiling" and got a cost-center tree that led me more reasonably to the [...] The lesson learned here is that there's more than one thing where allocation blows up when the length of a URI is ~kilobytes, and possibly more than three things.
This looks like progress, anyway! It's worth integrating your shortcutting code, probably, and then we can look into the next bottleneck. I usually use
(in some places) Very long data: URIs in source documents are causing outsized memory usage due to various parsing inefficiencies, for instance in Network.URI, TagSoup, and T.P.R.Markdown.source. See e.g. jgm#10075. This change improves the situation in a couple places we can control relatively easily by using an attoparsec text-specialized parser to consume base64-encoded strings. Attoparsec's takeWhile + inClass functions are designed to chew through long strings like this without doing unnecessary allocation, and the improvements in peak heap allocation are significant. One of the observations here is that if you parse something as a valid data: uri it shouldn't need any further escaping so we can short-circuit various processing steps that may unpack/iterate over the chars in the URI.
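A minimal sketch of the kind of parser this commit message describes (illustrative only, not pandoc's actual `pBase64DataURI`; real data: URIs also allow an optional mime type and parameters, which this ignores):

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch of a base64 data: URI recognizer built on attoparsec's Text parser.
-- takeWhile1/inClass consume the long base64 payload as one Text slice
-- instead of allocating per character.
import Data.Attoparsec.Text (Parser)
import qualified Data.Attoparsec.Text as A
import qualified Data.Text as T

base64DataURI :: Parser T.Text
base64DataURI = do
  _    <- A.string "data:"
  mime <- A.takeWhile1 (A.inClass "a-zA-Z0-9/+.-")  -- e.g. "image/jpeg"
  _    <- A.string ";base64,"
  body <- A.takeWhile1 (A.inClass "A-Za-z0-9+/")    -- the long payload, one slice
  pad  <- A.takeWhile (== '=')                      -- optional padding
  pure (T.concat ["data:", mime, ";base64,", body, pad])
```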
Regarding the tagsoup issue: I see that there is this "fast" tagsoup parser: [...] It has some major drawbacks: (a) not maintained since 2017, (b) depends on text-icu, which has a huge C library dependency that is hard to install. But it is actually only one 500-line module which replaces the tagsoup parser, and if this worked we could simply vendor it. It would be easy to remove the text-icu usage, since we will already have decoded the document in an earlier stage. The question is whether it parses as accurately as tagsoup, but it could be worth trying.

[EDIT:] There is also a version that does not depend on text-icu: [...] But there's a problem with both of these: they don't support parsing with source positions, which we use in our code.

[EDIT2:] This is an alternative fast HTML5 tokenizer: [...]
It can be very expensive to call network-uri's URI parser on these. See #10075.
Effect of ca4ad3b on html -> json: before:
after:
Effect of 3da4d41 on md -> json: before:
after:
This patch borrows some code from @silby's PR #10434 and should be regarded as co-authored. This is a lighter-weight patch that only touches the Markdown reader. The basic idea is to speed up parsing of base64 URIs by parsing them with a special path. This should improve the problem noted at #10075.

Benchmarks (optimized compilation), converting the large test.md from #10075 (7.6Mb embedded image) from markdown to json:

before: 6182 GCs, 1578M in use, 5.471 MUT (5.316 elapsed), 1.473 GC (1.656 elapsed)
after: 951 GCs, 80M in use, 0.247 MUT (1.205 elapsed), 0.035 GC (0.242 elapsed)

For now we leave #10075 open to investigate improvements in HTML rendering with these large data URIs.

Co-authored-by: Evan Silberman <evan@jklol.net>
Text.Pandoc.URI: export `pBase64DataURI`. Modify `isURI` to use this and avoid calling network-uri's inefficient `parseURI` for data URIs. Markdown reader: use T.P.URI's `pBase64DataURI` in parsing data URIs. Partially addresses #10075. Obsoletes #10434 (borrowing most of its ideas). Co-authored-by: Evan Silberman <evan@jklol.net>
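The shape of that short-circuit, roughly (a sketch under assumptions, not the actual Text.Pandoc.URI code; the data:-URI parser is passed in as a parameter and stands in for something like `pBase64DataURI`):

```haskell
-- Sketch of the dispatch only: try a cheap data:-URI parser first and fall
-- back to network-uri's full parser for everything else (including
-- malformed or non-base64 data: URIs).
import Data.Attoparsec.Text (Parser)
import qualified Data.Attoparsec.Text as A
import qualified Data.Text as T
import qualified Network.URI as NU

isURIWith :: Parser a -> T.Text -> Bool
isURIWith pDataURI t
  | T.pack "data:" `T.isPrefixOf` t
  , Right _ <- A.parseOnly (pDataURI <* A.endOfInput) t = True
  | otherwise = NU.isURI (T.unpack t)
```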
This was done to determine the "media category," but we can get that directly from the mime component of data: URIs. Profiling revealed that a significant amount of time was spent in this function when a file contained images with large data URIs. Contributes to addressing #10075.
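For example (a sketch; "media category" here meaning the major type before the slash in a MIME type):

```haskell
import qualified Data.Text as T

-- The media category is just the MIME type's major type, so it can be read
-- straight off the mime field of a data: URI without any URI parsing.
mediaCategory :: T.Text -> T.Text
mediaCategory = T.takeWhile (/= '/')

-- e.g. mediaCategory (T.pack "image/jpeg") == T.pack "image"
```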
This avoids calling the slow URI parser from network-uri on data URIs, instead calling our own parser. Benchmarks on an html -> docx conversion with large base64 image: GCs from 7942 to 6695, memory in use from 3781M to 2351M, GC time from 7.5 to 5.6. See #10075.
Avoid the slow URI parser from network-uri on large data URIs. See #10075. In a benchmark with a large base64 image in HTML -> docx, this patch causes us to go from 7942 GCs to 3654, and from 3781M in use to 1396M in use. (Note that before the last few commits, this was running 9099 GCs and 4350M in use.)
With the latest changes, the md -> docx benchmark above has gone down to:
HTML reader still uses more memory than it should, but this seems to be a separate issue with tagsoup.
Here is an example HTML file (10Mb) with one embedded JPEG image (7.6Mb).
Memory usage:
- html to md uses 2985M
- md to docx uses 3435M
- html to docx uses 4350M

Test examples:
OS: macOS 14.14.1, m3/arm