-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SINGLE_FILE option to embed all subresources into emitted JS #5296
Conversation
As discussed in emscripten-core#5279, subresource paths are converted into base64 data URIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks like it's in a good direction, nice.
Some thoughts:
- We had code to embed the mem init file as a JS array. It seems like that could be replaced with calling this new utility method?
- Same for
--embed-file
which embeds data. However, maybe that's a separate issue because of how the file packager stuff works, not sure. - A more serious issue is URI size. First, we should confirm that the new method is at least as good as the old on the mem init. (If not, maybe there's something we can do.)
- Second, and this is my biggest concern here, in the old method we split up the array into chunks because JS engines had issues with very large arrays - parsing them was horribly slow. I hope that isn't an issue with very large strings, but we should confirm.
- Finally, these data URIs will only work in a browser environment, I think? We do need things to work in node.js and other shell environments as well. (Perhaps it's possible to add a utility function for such environments that converts the URI into a typed array?)
As for testing, yes, we will need to add tests for this. We should test both browser and shell, so tests/test_browser.py
and tests/test_core.py
are the relevant files. Look at some of the tests there to get an idea of how to write a new one, and let me know if something isn't clear.
emcc.py
Outdated
@@ -513,6 +513,16 @@ def filter_emscripten_options(argv): | |||
|
|||
# ---------------- Utilities --------------- | |||
|
|||
# Returns the subresource location for run-time access |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about moving this to tools/shared.py
, perhaps a static method on the JS
class?
src/settings.js
Outdated
@@ -856,6 +856,13 @@ var FETCH = 0; // If nonzero, enables emscripten_fetch API. | |||
|
|||
var ASMFS = 0; // If set to 1, uses the multithreaded filesystem that is implemented within the asm.js module, using emscripten_fetch. Implies -s FETCH=1. | |||
|
|||
var SINGLE_FILE = 0; // If set to 1, embeds all subresources in the emitted JS file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add examples of those files (data, asm.js, wasm etc.), i think it would help clarify things
Well, it does accomplish basically the same thing as
Not 100% sure, but based on the ctrl+F results for
From a quick test on a binary file I had lying around, this seems like the much more efficient encoding:
I also made the suggestion to Google recently to specifically optimise for base64-encoded strings in a future release of Brotli, which wasn't immediately shot down, so maybe it'll work out and even further improve this situation long-term.
Huh, that's interesting. Yeah, we should definitely test that and maybe split up the strings if necessary.
Ah, I hadn't thought of that. Would updating Edit: What if we also updated the web version of re: your comments on the code, sure, I'll take care of those. |
emcc.py
Outdated
shared.Settings.WASM_TEXT_FILE = os.path.basename(wasm_text_target) | ||
shared.Settings.WASM_BINARY_FILE = os.path.basename(wasm_binary_target) | ||
shared.Settings.ASMJS_CODE_FILE = os.path.basename(asm_target) | ||
shared.Settings.WASM_TEXT_FILE = get_subresource_location(wasm_text_target) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just went ahead and tested my change. (I was expecting a complicated and/or long setup process based on past experience installing from source with the SDK, but in retrospect I think it was just compiling clang that would take a long time.)
Anyway, it fails right here, apparently because the wasm, wast, and asm.js don't exist yet. How should I handle this?
@kripken, the changes from your comments and as discussed in the main thread are done. Remaining issues:
As far as string parsing performance, from a quick test in my JS consoles, setting one long (~1 MB) string to a variable seems to be noticeably faster than concatenating it together in 1 KB chunks in Chrome (single-to-double-digit µs vs single-digit ms) and about the same in Firefox (single-to-double-digit µs in either case). |
Rather than embedding Data URIs, why not just embed the base64 data as a normal string and decode it to an array in all environments, rather than just Node? Another array that is already embedded is the emterpreter data. I couldn't tell if you were handling that yet or not. |
That is what's being done now. The data URI prefix at this point is only used to tell It seems a bit arbitrary now with this being the case, but I think it probably makes sense to continue using this format since it forces us to to avoid collisions with actual file names (with no prefix or some other prefixes, we could easily parse an actual file name as base64 when it isn't meant to be). To save a few bytes, we could use a shorter mime type than As far as emterpreter data, thanks, I'll look around for that and see if it can also be easily migrated to this pattern. |
src/jsifier.js
Outdated
@@ -45,7 +47,8 @@ function JSify(data, functionsOnly) { | |||
// else. This lets us not hold any strings in memory, we simply print | |||
// things out as they are ready. | |||
|
|||
var shellParts = read(shellFile).split('{{BODY}}'); | |||
var thirdParty = thirdPartyFiles.map(function(f) { return read(f) }).join('\n'); | |||
var shellParts = read(shellFile).replace('{{THIRD_PARTY}}', thirdParty).split('{{BODY}}'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to add a third party library sodium to all generated builds? That is awkward for build sizes, plus also it will trigger a legal review for a number of partner companies, which we would prefer to avoid. It looks like sodium is used to base64 decode? Could https://developer.mozilla.org/en-US/docs/Web/API/WindowBase64/Base64_encoding_and_decoding be used instead?
This looks like a good feature, but it should definitely be implemented as a build option so that builds which don't use it don't need to compromise on generated build sizes. That is, gate the generated code on SINGLE_FILE
option to not impact builds which don't utilize SINGLE_FILE
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to add a third party library sodium to all generated builds? That is awkward for build sizes ... gate the generated code on
SINGLE_FILE
option to not impact builds which don't utilizeSINGLE_FILE
Yep, as mentioned above, that was exactly how I wanted to handle this in another commit. It's only 3 KB, which isn't much compared to the size of typical emscripten-generated code, but there's still no reason to include it when it isn't used.
Could https://developer.mozilla.org/en-US/docs/Web/API/WindowBase64/Base64_encoding_and_decoding be used instead?
As mentioned, I have simple implementations working (which basically do this, although the base64 one is a little different because because it's converting to bytes, not a string):
function base64ToBytes (s) {
var decoded = atob(s);
var bytes = new Uint8Array(decoded.length);
for (i = 0 ; i < decoded.length ; ++i) {
bytes[i] = decoded.charCodeAt(i);
}
return bytes;
}
function bytesToString (bytes) {
if (typeof TextDecoder !== 'undefined') {
return new TextDecoder('utf-8', {fatal: true}).decode(bytes);
}
var s = '';
for (var i = 0 ; i < bytes.length ; ++i) {
s += String.fromCharCode(bytes[i]);
}
return s;
}
I just don't know offhand how well these handle corner cases. Maybe there's a test suite out there that we could use.
plus also it will trigger a legal review for a number of partner companies, which we would prefer to avoid
There's always my earlier proposed solution of asking the libsodium maintainers to provide from_base64
and to_string
as MIT, or we could go further and ask them to fully donate them to emscripten.
(The base64 implementation as I understand is nothing special aside from being constant time, which doesn't matter here, but I believe their string decoding has been extended to support relevant edge cases like very large byte arrays.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @juj, we should avoid adding a dependency here if we can. It should be easy enough to write code for this, the code you quoted looks like it could be a basis. For testing, I think there is a test for MEM_INIT_METHOD (can search under tests/) which checks a large variety of patterns, which we can maybe repurpose. If not, we should add something like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, makes sense. Well, as an alternative, if we wanted to just steal the specific functions instead of using a dependency, this is what that would look like.
The original version of libsodium.js's to_string
function was written by me and would've been similar to the above, but based on jedisct1/libsodium.js#57 it seems that some sort of issue with handling large byte arrays was discovered and fixed, which would probably also become a problem for some emscripten users if we were to start from what I have above.
I don't mind working on making those implementations more robust and/or searching around for well tested public domain implementations if you'd prefer not to do this, but so we have the option: @jedisct1, would it be possible for those two functions from libsodium.js to be gifted to emscripten?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@juj and @kripken, I just committed my functions from above in place of sodiumutil, which I think we'll be okay with (particularly if we get that MEM_INIT_METHOD test working here).
As far as the large byte array problem with decoding UTF-8: upon further investigation, apparently the issue was that libsodium's original to_string
used decodeURIComponent(escape(String.fromCharCode.apply(null, bytes)))
instead of a String.fromCharCode(bytes[i])
for loop as I had here. Using the former method, an exception would be thrown with byte arrays larger than a few hundred KB, but the latter is limited only by the heap size as far as I can tell.
The reason it was implemented that way was for better error handling; the latter method will happily return weird strings if you give it an array of bytes that aren't valid UTF-8. That being said, as far as I can tell, that really doesn't matter here as much as it does in a crypto library, so there's really no reason not to just use the latter method.
Some of these files can be hundreds of megabytes in size (asm.js, wasm). This should make sure that these files are not loaded in to memory when |
Nice, glad to know there's a test suite that can properly validate the performance difference. Even if there are a lot of good reasons to expect this to be faster (on top of the improved space-efficiency), I'd hate for something to be rolled into production across the web based on a few minutes of unscientific testing in my JS console. |
@@ -2394,7 +2386,7 @@ def generate_html(target, options, js_target, target_basename, | |||
%s | |||
}; | |||
emterpretXHR.send(null); | |||
''' % (shared.Settings.EMTERPRETIFY_FILE, script.inline) | |||
''' % (shared.JS.get_subresource_location(shared.Settings.EMTERPRETIFY_FILE), script.inline) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took care of this as suggested by @curiousdannii. However, while doing this, I noticed that this part of emcc.py injects a number of one-off implementations of basically the same logic from Module.read*
. Is there a reason that I shouldn't change them to use Module.read*
where possible instead of reimplementing the data URI parsing in each one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think those places are emitting code for the HTML file, where we normally don't depend on JS contents. However, we do create Module in the HTML, and we define Module.read there - so this might work. But if the JS is required to run to set up things, it might not. The safest thing might be to leave it as it is, or to create a utility function in the HTML to avoid duplication in there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah okay, got it. In that case, it sounds like the best solution right now would be to just copy and paste my tryParseAsDataURI
function from shell.js into emcc.py to be inserted into the HTML? (Another candidate for future refactoring, but it would be consistent with the Module.read*
logic already being duplicated here.)
Edit: Alternatively, we could either:
-
Not support
SINGLE_FILE
with HTML output. -
Document CSP requirements for combining either
SINGLE_FILE
or--memory-init-file 0
with HTML output. Slightly longer-term, maybe go back and resolve this requirement after the duplication of theModule.read*
logic is factored out.
I like option 2, if that's all right with you.
I'm biased since I don't personally use the HTML output for anything, but my thinking is that:
-
Whereas someone could easily rebuild and publish a JS library without knowing that they'll have just broken some of their users' applications, it's pretty obvious and easy to fix when your own page blows up on load after throwing a CSP error
-
Whereas a JS lib needs to work in every environment (web, node, shell), an HTML page only needs to run in a browser, where we know that loading a data URI through XHR works (CSP issues aside)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused. If we emit html, then we are on the web and should be able to load the data uri directly? So there is no need for special code in the html to handle parsing it, that's just a problem for shell environments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, requesting a data URI through XHR/fetch works as expected; it's just in the node and shell environments where it wouldn't work since they would try to read it from the filesystem.
The drawback to relying on XHR for this in a web page is the CSP thing — that a Content-Security-Policy
header would need to whitelist data:
— which I documented as a limitation of combining SINGLE_FILE
with HTML output. This would eventually fix itself when/if the HTML generation is refactored to use the same implementations from shell.js, and until then probably wouldn't be a common issue.
As a demonstration, you can compare running (async () => console.log(await (await fetch('data:text/plain;base64,YmFsbHMK')).text()))()
in a JS console on GitHub (which has a strict CSP) and almost any other site.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thanks. Seems fine to document the CSP issue. The one concern I have is does a situation exist where a user needs to use a single file, but can't set the CSP (like maybe they are an embedded game in a game website). For that case, could we detect that the xhr fails and use the parsing code we use for shell environments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, well, in that case, I can just copy and paste the new functions into the HTML output and update all the one-off Module.read*
-like to behave like Module.read*
(parse data URIs without relying on XHR), in which case there'd be no reason to document the CSP connect-src
concern. This seemed like more of a hassle when there was a whole other library involved, but not as much now.
That said, script-src
will be an issue with SINGLE_FILE
in that kind of scenario no matter how we handle it. Right now, script-src
(if defined at all) must whitelist data:
, which I've also documented, but if we parse the data URI to a string then we'll still need either unsafe-inline
or unsafe-eval
to be able to execute it. I believe script-src
is most commonly used to whitelist just 'self'
(the current origin) and maybe specific CDNs, which is fine for regular non-SINGLE_FILE
usage but would be a problem here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, if this has any significant risk at all of not working, then we shouldn't do it in -O0, as that's the first thing people try and we should work our hardest to not break. So falling back to parsing the string seems like the right thing do do. As for other concerns, for -O0 we just use the mem init file, which doesn't require eval or such, so it should be ok (with the string parsing)?
Yeah, if we do that, seems like we need the code in the html too. Might put those methods in src/ and include them in both places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, sounds good; I just pushed another commit to handle all the parsing in the HTML output and update the documentation. The only remaining CSP quirks are:
-
Embedding asm.js requires
script-src 'unsafe-inline'
(out ofdata:
,'unsafe-eval'
, and'unsafe-inline'
, that one seemed like the least unusual to require) -
Using Workers requires
child-src blob:
(something new that I caught — passing a data URI into the Worker constructor wouldn't have worked in most browsers, so I had to parse it and convert it into a Blob URL)
As far as mem init 0, yep, there's no risk of CSP breaking that now that it's always directly parsed without XHR.
I don't fully understand the CSP issue, can you please elaborate? About the wasm etc. not existing on line 1119 in emcc.py, yeah, that looks like a problem. We only generate those files later on, all we know is their path at that point. So we can only create a data uri for them later down. This might require refactoring, or perhaps you can do something like embed a special string (like |
Originally, the data URI was going to be requested via XHR, which would fail on any page with a It's no longer an issue now, though, since the base64 data is being parsed directly in JS.
Perfect, thanks, that works for me. Sounds like a good candidate for future refactoring, but right now I'll move forward with the special string solution. |
src/settings.js
Outdated
var WASM_TEXT_FILE = ''; // name of the file containing wasm text, if relevant | ||
var WASM_BINARY_FILE = ''; // name of the file containing wasm binary, if relevant | ||
var ASMJS_CODE_FILE = ''; // name of the file containing asm.js, if relevant | ||
|
||
var INCLUDE_THIRD_PARTY_SODIUMUTIL = 0; // If set to 1, sodiumutil will be included in the bundle. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the best way to communicate between shared.JS.get_subresource_location
and jsifier.js?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, yes.
Remaining issues after my latest push:
|
Thanks @kripken, the function scope issue was the biggest issue that I saw as well, and that last option is what I had in mind; just committed it. Earlier you had brought up a concern about using Anyway, the SINGLE_FILE tests are all still passing, and I can't imagine this last change would've broken anything else, so I think we're in a pretty good state here now. |
If you're going to give it a separate name you might as well do what I said before: function decodeBase64(input) {
if (typeof atob === 'function') {
return atob(input);
}
... This way it doesn't matter at all where in the file the function is defined and it makes it impossible that decodeBase64 is still undefined as the issue I had before. |
Good catch. However, we're also already using that pattern in arrayUtils.js, which isn't avoidable until Python preprocessor support is implemented, so we're going to be depending on the ordering of at least one of these functions either way (for the time being). There's also the (trivial) performance implication of checking Might make more sense to just ensure that arrayUtils and base64Utils are both emitted as early as they possibly can be without breaking anything, but it doesn't matter much to me. I'll leave this up to @kripken before touching anything. |
Merged this PR locally on top of latest incoming + PR #5664 for testing, and noticed that the test My merge point is at https://github.com/juj/emscripten/tree/buu_incoming (it did merge cleanly without conflicts) The error message is
|
Nice catch @juj, thanks! It looks like the issue is that we were preferring to parse the embedded mem init base64 asynchronously with |
Ok, assuming tests pass (running locally now), I think this is good to merge. Thanks for all the work here @buu700! And for your patience, I know this was a long process. I think we can improve this further after it lands. There are a few places where code is not behind an ifdef where it could be. For example when building and testing ammo.js now, I see that removeRunDependency is included, when it wasn't before. Looks like it's because we used to have Might be a good idea to compare code outputs from before and after this PR lands, in various optimization modes, and look for differences. Also in the longer term we should improve things with a better preprocessor approach, which would let us remove the duplicated code this PR introduces. |
That sounds great, thanks! I definitely wasn't expecting it to take over four months to land, but ultimately this approach was still much faster and easier than building my own separate bundling implementation and maintaining it long-term, so thanks for your patience working with me to shake out the bugs and inefficiencies in this. As far as the post-merge improvements, sounds like a good plan to me, and will probably be faster not to have to test all those things by proxy through me. One other thing we may want to consider doing before merging, as suggested by @erikdubbelboer: change |
Ok, tests passed over here (except for a failure that seems to be an existing unrelated regression). I think it's ok to merge this before the atob improvements, I'll take a look at that later today. |
Hmm, actually, I'm not sure what the problem is with the current Anyhow, a PR with improvements there is welcome, maybe I'm just not seeing the problem yet. |
Awesome! As far as the @erikdubbelboer's point was that this entire class of bug could be avoided going forward if the functions were hoisted. I decided to leave this up to you because of 1) the minor performance hit we'd take by running |
I see, thanks. Yeah, I think it's ok to leave things as they are, but I don't feel strongly either way. |
Is |
I think it might not exist in web worker and shell contexts. |
Specifically, it doesn't exist in versions of WebKit from before last summer or the SpiderMonkey shell. |
According to MDN it is supported in WebKit since about forever (also Safari is evergreen browser, so doesn't really matter): https://developer.mozilla.org/en-US/docs/Web/API/WindowOrWorkerGlobalScope/atob#Browser_compatibility |
Oops, sorry, that was a typo. I meant that it isn't supported in Workers in older versions of WebKit: https://bugs.webkit.org/show_bug.cgi?id=158576 |
It seems like the error is strictly worse than just letting the exception propagate out. For example, we have a report of someone hitting this error but I can't tell what the root cause was: emscripten-core#20349 This try/catch and this error message was part of the original function back emscripten-core#5296 but I don't see any reason for them.
It seems like the error is strictly worse than just letting the exception propagate out. For example, we have a report of someone hitting this error but I can't tell what the root cause was: emscripten-core#20349 This try/catch and this error message was part of the original function back emscripten-core#5296 but I don't see any reason for them.
It seems like the error is strictly worse than just letting the exception propagate out. For example, we have a report of someone hitting this error but I can't tell what the root cause was: #20349 This try/catch and this error message was part of the original function back #5296 but I don't see any reason for them.
Resolves #5279.
I haven't tested this (don't currently have a local development environment set up), but wanted to submit it early for initial feedback. (Alternatively, if anyone else who's already set up to do so has time to test it, that would be really helpful.)
Concerns:
Is there anything I'm missing that could potentially end up requested by the emitted JS at run-time? In this commit I covered the memory initialiser, wasm, wast, and asm.js.
Are there any other
os.path.basename
references (or any other places in general) that I should have updated to use the new utility function, and/or is there any reference I changed that was incorrect? (I just ctrl+F'd foros.path.basename
and changed anything that looked like it would probably be inserted into the JS somewhere along the line.)Am I expected to add an automated test before this is merged, and if so are there any guidelines for that?