Memory initializer in string literal #3326
Conversation
I just found out about #2188 and the comparisons done there, and I think there are better approaches than the one I used here.
Since base 88 is not as memory-efficient as hoped, and in particular compresses really poorly, here is a forced push using a lookup table instead. This follows the findings reported in #2188. By now I've also included a setting, and only emit the mem file or the initializer string, not both. So in my opinion this is now ready for review and merge.
The first issues here are as discussed in that other pull - we need to make sure this works with all other features. In particular we had problems there with minification breaking the string, though I don't remember the details. If the full test suite passes, then we are probably ok. Aside from that, I saw some numbers in the other pull, but can you please summarize how this works, the measured effect on code size, and the measured effect on startup time?
Do you mean the string minification in this code, or the one applied by e.g. closure afterwards? The former was discussed and addressed in #2188 (comment) if that is what you have in mind. Using the tool from https://github.com/gagern/Web-Benchmarks/tree/tinylut/meminit I get the following figures for memory efficiency, sorted by gzipped size, and augmented by some load time experiments:
The current mem init file based approach depends not only on parse time but also on network latency, so it's hard to include that in the table. The load times above were the median of 16 runs each. I'm currently running the test suite on the plain incoming branch, starting not with the sanity component but with the default component. Once that succeeds I'll run tests for my pull requests.
As for how this works: I compute how often each byte occurs, and then assign printable values to the common bytes, leaving the values which require escape sequences for the less common bytes. Small character codes are encoded in octal notation where possible, since that's shorter. To reconstruct, I provide a translation table with 256 elements. Each char code from the string is looked up in that table to translate it into a byte, which is then stored in a typed array.
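Purely as an illustration of that scheme (names and details are mine, not the actual emcc code), a minimal Python sketch might look like this:

```python
# Minimal sketch: frequent bytes get cheap printable characters, rare bytes
# get the characters that would need escaping, and a 256-entry table maps
# each character code back to the original byte value.
from collections import Counter

# Printable ASCII, minus quote characters and backslash, which would need escaping.
CHEAP = [c for c in range(0x20, 0x7f) if chr(c) not in '"\'\\']

def build_tables(data):
    ranked = [b for b, _ in Counter(data).most_common()]
    ranked += [b for b in range(256) if b not in ranked]     # bytes never seen
    codes = CHEAP + [c for c in range(256) if c not in CHEAP]
    byte_to_char = {b: codes[i] for i, b in enumerate(ranked)}
    lut = [0] * 256                        # lut[charCode] -> original byte
    for b, c in byte_to_char.items():
        lut[c] = b
    return byte_to_char, lut

def encode(data, byte_to_char):
    return ''.join(chr(byte_to_char[b]) for b in data)

def decode(s, lut):
    return bytes(lut[ord(ch)] for ch in s)

data = b'\x00\x00\x00hello\x00world\x00\x00'
byte_to_char, lut = build_tables(data)
assert decode(encode(data, byte_to_char), lut) == data
```

The decode function here plays the role of the loop the generated JavaScript would run, indexing the 256-entry table with each character code while filling the typed array.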
What does "lut" mean in "tinylut"?
@@ -708,7 +708,7 @@ try:
   if js_opts is None: js_opts = opt_level >= 2
   if llvm_opts is None: llvm_opts = LLVM_OPT_LEVEL[opt_level]
   if opt_level == 0: debug_level = max(3, debug_level)
-  if memory_init_file is None: memory_init_file = opt_level >= 2
+  if memory_init_file is None: memory_init_file = 2 if opt_level >= 2 else 0
wouldn't this change make this new mode the default in -O2+? But it's less optimized than a binary mem init?
Yes, it would make this the default. I would have assumed that for most applications, the overhead of a second request would be bigger than the performance cost of parsing that initializer. But I guess you know your customers better than I do, so if you say that for most of their setups mem files are preferable, then I won't argue with that. I know that I personally will try to use the string initializer simply because I prefer to have a single artifact containing the whole program, but that's just a matter of taste.
It is possible to do the requests in parallel. But I hadn't really considered until now that, if this is efficient enough, it could be even better than that. I wonder how to measure it, though.
Regardless, until we have such measurements, I don't think we should change the default - we shouldn't confuse users with changes unless they have proven benefits.
“LUT” = lookup table, the one which translates characters back to byte values.
@@ -1385,7 +1386,7 @@ try:
   if memory_init_file:
-    if shared.Settings.USE_TYPED_ARRAYS != 2:
+    if type(memory_init_file) == int: logging.warning('memory init file requires typed arrays mode 2')
Dropping this type check means that I'll now have greater chances of hitting this message. I guess I should introduce some other variable to distinguish a user-requested setting from an automatically chosen one. Or should I use extra values for that single variable?
we have actually deprecated non-typed-arrays-mode-2 code, so that should never be hit
Hi @gagern. I am very happy that you are looking into this. We have wanted similar behavior for a long time. However, I would very much prefer that the string be an as-is JavaScript string, such that the decoding loop is literally for (var i = 0; i < memInitStr.length; ++i) { ... }. Large regions of contiguous zeroes could be elided. This way the data will be trivially inspectable with human eyes, and my understanding (though prove me wrong) is that the code size hit after gzip is rather small.
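To illustrate the zero-elision part of that suggestion (a sketch of mine, not anything from the pull request): emcc could emit the initializer as (offset, chunk) segments and simply skip long runs of zeros, since the heap starts zeroed anyway.

```python
# Sketch: split the memory initializer into (offset, bytes) segments,
# omitting zero runs of at least min_gap bytes; the JS side would copy each
# chunk to its offset and leave the elided regions at their default of zero.
def elide_zero_runs(membytes, min_gap=32):
    segments, start, i = [], None, 0
    while i < len(membytes):
        if membytes[i] == 0:
            j = i
            while j < len(membytes) and membytes[j] == 0:
                j += 1
            if j - i >= min_gap or j == len(membytes):
                if start is not None:
                    segments.append((start, membytes[start:i]))
                    start = None
                i = j
                continue
        if start is None:
            start = i
        i += 1
    if start is not None:
        segments.append((start, membytes[start:]))
    return segments

print(elide_zero_runs([7, 8] + [0] * 100 + [9]))   # [(0, [7, 8]), (102, [9])]
```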
@kripken: The default for @chadaustin: The string is now literal, with no table lookup step, so you can read the memory content. (By the way, I notice that the libc++ content contains a lot of full paths to source files, thus exposing the developer's directory structure. I wonder what to do about that.) So far my changes are untested, because I have trouble getting a recent fastcomp installed. Apart from emscripten-core/emscripten-fastcomp#85, it's trying to link against libedit, which I don't have installed. I'm working on this and may file a separate ticket if I can't resolve it. I had to resolve quite a number of conflicts to keep up with recent development. Should I rebase my changes?
Between @kripken and me, it looks like you've probably had a lot of changes to keep up with. It looks like this change would be a lot simpler if you:
+1 to @waywardmonkeys' suggestions. Also: yes, this needs a rebase. For tests, we should probably use this on a whole test mode, like asm2f (see end of ). Now that this is a string without a table lookup, where is it in the numbers listed before (which line is it there)?
See memory init testing in the test suite. Btw, we have a few settings that disable the memory init file (like shared libraries). We might want to default them to this new mode, perhaps.
I rebased my changes, and I hope the way I did things makes sense. Running the tests will take some time yet, since I need to update my whole toolchain first, where I'm encountering a bunch of problems, and it's pretty late in the day here. I noticed that at some later point we should probably attempt to exempt the string literal from the influence of the closure compiler, since it will re-encode the string to something which is longer again. The semantics stay the same, though, so this is not a crucial change. And I haven't yet figured out how exactly the asm.js output gets protected from closure, to see whether we could duplicate that.
I've got some failing tests in a core asm2m run: test_python, test_the_bullet, test_files, test_fnmatch, test_strings and test_webidl error or fail. I'm investigating the first of these. Something appears to be post-processing the string literal, turning byte escapes into unicode escapes but also turning octal escapes into their binary equivalent. I haven't figured out yet who's doing this. But it caused me to read section 7.8.4 of the ECMA-262 spec and find out that indeed almost any unicode character may be contained in a string literal, while on the other hand support for octal escapes is optional and even forbidden in strict mode. Does that mean that we should avoid escaping anything except for newlines and quotes? Even a two-byte UTF-8 sequence for U+0080 through U+00FF is shorter than the four-byte hex escape. In any case, the problem why the test fails is that the hex-to-octal rewrite would also affect things like an escape that is followed by a digit. I think I'll push a change where I rely on the source being parsed as UTF-8. Let me know whether you think such an assumption is acceptable. With that change applied, some of the tests that used to fail now pass, but some still fail.
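To make that size comparison concrete (my own illustration, not a measurement from the thread): a character in the U+0080 through U+00FF range costs two bytes on the wire if the .js file is encoded as UTF-8, versus four characters for the corresponding escape.

```python
# Cost of emitting the single byte value 0xC3 inside a JS string literal:
ch = u'\u00c3'                    # the raw character U+00C3 in the literal
print(len(ch.encode('utf-8')))    # 2 bytes when the file is served as UTF-8
print(len('\\xc3'))               # 4 characters as the hex escape \xc3
print(len('\\303'))               # 4 characters as an octal escape, too
```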
While you were away, I had tried implementing this as well. My implementation ended up going into a different part of the code, but it kept running into problems as well. There's something about this that is hard to crack! |
@waywardmonkeys: Do you have your approach in some branch here on github? What kinds of problems did you encounter? |
OK, I think I fixed the previously failing test cases. I dropped some test functionality in c333383, please have a look at whether that's acceptable. I also had to merge 1.33.1 to avoid conflicts with current incoming due to #3266 (very interesting development there, by the way). Now I'm updating my emsdk to current incoming, in order to run tests once more. Again. One thing I've been thinking about, regarding the fact that the string literal now depends on the character encoding used for the source file: perhaps we should include some kind of checksum with the string literal, in order to at least detect any corruption due to a wrong encoding there. Should I add some CRC32 or Adler32 or similar? Should the verification be made optional inside an
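To sketch the checksum idea (an illustration of mine, not the code that eventually landed): the Python side could append a CRC32 of the payload bytes, and the generated JavaScript would recompute it over the payload and compare before copying the data into the heap.

```python
import zlib

def append_crc32(membytes):
    # membytes: the list of byte values making up the memory initializer.
    crc = zlib.crc32(bytes(bytearray(membytes))) & 0xffffffff
    # Append the checksum as four little-endian bytes; the decoder would
    # recompute the CRC over everything but these four bytes and compare.
    return membytes + [(crc >> (8 * i)) & 0xff for i in range(4)]

print(append_crc32([1, 2, 3])[-4:])   # the four appended checksum bytes
```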
A checksum sounds like a very good idea, and yes, make the verification optional. When you say UTF-8 above, is the requirement that the entire JS file - including
What I mean is that the browser must decode it as UTF-8, probably because the HTTP server delivered it as such. If that doesn't sound good, we should go with
I finally found the uglify call which explicitly passes it. Since mishoo/UglifyJS@81f5efe, uglify knows how to produce
memfile = target + '.mem'
shared.try_delete(memfile)
def repl(m):
  # handle chunking of the memory initializer
  s = m.groups(0)[0]
  if len(s) == 0 and not shared.Settings.EMTERPRETIFY: return m.group(0) # emterpreter must have a mem init file; otherwise, don't emit 0-size ones
  open(memfile, 'wb').write(''.join(map(lambda x: chr(int(x or '0')), s.split(','))))
  membytes = [int(x or '0') for x in s.split(',')]
  if not shared.Settings.EMTERPRETIFY:
was this needed because the emterpreter used to append its data to the normal memory init file? that is no longer true, so this should be removable.
This was mostly because if the memory is all null bytes, then I do a return '', which I considered semantically equivalent to return m.group(0). Come to think of it, if it indeed is, then perhaps I should make the two lines agree. Can I use return '' in both cases? Anyway, the return m.group(0) is not used for EMTERPRETIFY, presumably because we need a mem file there, even if it's empty. If it is OK for the emterpreter to have trailing zeros stripped, then I can move that condition to the if not membytes case.
actually, all the emterpreter stuff here (and around it) is no longer needed (there is no interaction between the emterpreter and the mem init file anymore). removing any special-casing for it should be fine, if that makes your changes simpler.
This makes the resulting literals more independent of the character encoding the environment assumes for the resulting file. It requires slightly more memory, but large bytes are far less common than small bytes (zero in particular), so the cost should not be too high. If we want to, we can still make this optional later on.
* MEM_INIT_METHOD != 1 with --with-memory-file 1 now triggers an assertion
* Consistently return '' instead of m.group(0) if there is no initializer
* Strip trailing zeros for emterpreter as well
* Include crc32 in literal only if it gets verified
* Enable assertions for the asm2m test run in general
* Disable assertions for one test case, fnmatch, to cover that as well
* Include the asm2m run name in two lists of run modes
* Add browser test to verify all pairs of bytes get encoded correctly
* Add browser test to verify that a >32M initializer works without chunking
* Omit duplicate var declaration for the memoryInitializer variable
* Minor comments and syntax improvements
* Capture the memory_init_file setting by its MEM_INIT_METHOD value.
* Drop special handling for emterpreter, which shouldn't be needed any more.
I've squashed some commits, but kept others apart where I considered the extra history valuable. And I verified that my final result is equivalent to the 4db4ee6 you reviewed. I'd say everything should be ready for merge now.
I am so so so happy this is happening. Thanks for all the hard work.
Merged. Thanks! I also pushed a commit to refactor the crc code into
Thanks for the cooperation, I'm looking forward to using emscripten with this new improvement. Do you have any idea how to gather feedback on whether this should become the default for some settings?
Opening a thread on the mailing list might be a good start, asking people to test it out. Meanwhile I ran some fuzzing on this last night, and found no issues.
It looks like there is Windows-only breakage on the bots due to this. For example
On OS X, I see that as well when looking at the output of the test. I think the issue of the test crashing optimizer.exe on Windows is not in the null bytes, though, but that somehow Python is doing the file write incorrectly, and it's outputting/truncating the rest of the file somehow, since nothing gets printed to the file after the memory initializer is output. Was the design of the string memory initializer to depend on outputting bytes in the range 0-31 raw to the generated .js file? Or should those have been escaped somehow?
Originally I had intended to escape control bytes, then I noticed that our version of uglify would unescape those, read in the spec that almost any code point is valid in a string literal, including nulls, and decided to go with that myself. Should I write a pull request to always escape control bytes, and we investigate relaxing that later on? Or should we investigate now until we understand the problem, then fix just that and hope all other platforms are fine?
Perhaps Python should open the file in binary mode, i.e. 'wb' instead of 'w'. To see which byte values a text-mode write mangles, one could run:

open('foo.bin', 'w').write(''.join(map(chr, range(256))))

Unfortunately I don't have a Windows machine at hand just now, so can someone on Windows please give the above a try and report back? From reading Wikipedia I'd guess 26 (0x1A) or 4 (0x04) to be the most likely bytes to cause problems here.
I'm now seeing one cross-platform difference that causes the breakage. If one runs the following Python program
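A minimal sketch of such a program (assumed here, under the Python 2 that emcc used at the time): write all 256 byte values, then read the file back once in text mode and once in binary mode.

```python
# Assumed reconstruction, Python 2: compare text-mode and binary-mode reads
# of a file containing every possible byte value.
data = ''.join(map(chr, range(256)))
open('foo.bin', 'wb').write(data)

text_read = open('foo.bin', 'r').read()     # text mode
binary_read = open('foo.bin', 'rb').read()  # binary mode
print len(text_read), len(binary_read)
# On Windows the text-mode read stops early (0x1A acts as an end-of-file
# marker), while the binary-mode read returns all 256 bytes.
```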
on Windows, the second read will be truncated at what looks like the byte 0x19 (end of medium) or 0x1A (substitute), and the application prints
but on OS X, the read will complete past each byte, and the application prints
The effect is that writing the memory string initializer succeeds correctly, but afterwards reading it back with 'r' truncates it. However, even if I go in and replace 'r' with 'rb' in various places, I am still getting corrupted output in the final executable. I am not sure if I found all the relevant places to transform, or if I missed a place. The output does change from optimizer.exe crashing to another kind of corruption where the end of the memory initializer reads
I don't think there is a rush to fix this, in that the breakage is only on the string initializer mode, which is off by default. The only thing this blocks is us moving to that mode by default. But since that would be good to do, this is important to fix, I'm just saying I think we can wait for a proper fix, no need for an interim workaround. It does sound like binary/text issues are in play here. Possibly 'rb'/'wb' in the right places would fix it, as you two suggest. I don't have a windows machine either, but perhaps someone else reading this does. We could also ask on irc or the mailing list. I'd also be ok to merge a pull that has those fixes, and we'll see what happens on the windows bot. Side note, I've been wondering why the new
OK, what we are seeing here is apparently some raw bytes decoded according to code page 437 or something very similar. We have bytes 00 through 04 in the first row. Not sure where the rest of that line went, but the next line starts with 0E, so I'd say 0D got interpreted as a carriage return without line feed, so stuff between the beginning of the file and that position got overwritten. Then we have bytes 0E through 19 in the second row, but no 1A. This is consistent with 1A denoting end of file in text mode. So my hunch there was correct.
I have no idea yet where that may come from.
I do have a VM with Windows. It's painfully slow, and I have nearly no useful tools installed there, but if everything else fails, I guess I can set up an emsdk there.
That is strange indeed. As I said, I guess it's not the null bytes, but the 1A bytes instead. But
I'm going to disable the
If you want to, you can try the following change instead:

-s = re.sub('[\x80-\xff]', escape, s)
+s = re.sub('\x00(?![0-9])', '\\0', s)
+s = re.sub('[\x00-\x1f\x7f-\xff]', escape, s)

Can't write a full pull request just now. The idea is to turn any null byte which doesn't precede a digit into \0 (a null followed by a digit would be read as a longer octal escape), and to escape every remaining control and high byte rather than only the high ones.
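For context, a self-contained sketch of such an escaping pass (the body of the escape helper is my guess at what the function referenced in the diff does, not the actual emscripten code):

```python
import re

def escape(m):
    # Assumed body for the escape() helper referenced above: rewrite one raw
    # byte as a JS hex escape such as \x1a.
    return '\\x%02x' % ord(m.group(0))

def escape_mem_init_string(s):
    # Null bytes become \0, except when a digit follows (JS would read \0
    # plus the digit as a longer octal escape); those nulls fall through to
    # the generic hex escaping below. A function replacement is used so that
    # re.sub does not interpret "\0" as its own octal template escape.
    s = re.sub('\x00(?![0-9])', lambda m: '\\0', s)
    # Escape every remaining control byte and every high byte.
    s = re.sub('[\x00-\x1f\x7f-\xff]', escape, s)
    return s

print(escape_mem_init_string('\x00abc\x001\x1a\xff'))   # \0abc\x001\x1a\xff
```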
Thanks, I don't have time either currently, though (no windows machine, and busy with the syscalls landing).
Emscripten really should use 'rb' everywhere anyway. |
👍 |
If we can confirm that fixes this issue, and has no downsides (I don't know windows enough to tell...), sounds good to me. |
…e#3326" This reverts commit b256af6. Since emscripten-core#3854 started escaping \x1a, this should no longer be needed. Conflicts due to code reformatting had to be resolved manually.
Having the memory initializer as a separate file can become a huge problem in some environments. It would be nice if we could offer more alternatives. I see how big array literals would be kind of bulky. But how about encoding the memory initializer into one or more strings? Here is some code to do that, based on a base 88 encoding scheme. If you like the idea, it would be nice if you could introduce some setting to switch between file and string.
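The code attached to the original report is not reproduced above. Purely to illustrate the idea: since 88^5 exceeds 2^32, a base-88 scheme can pack every 4 bytes into 5 characters taken from an 88-symbol alphabet. A sketch with an arbitrary choice of alphabet (not the encoding from the report):

```python
import string

# 88 printable characters that are safe inside a quoted JS string literal
# (no quotes, no backslash); the exact choice is arbitrary for this sketch.
ALPHABET = (string.digits + string.ascii_letters +
            "!#$%&()*+,-./:;<=>?@[]^_{|}~")[:88]
assert len(ALPHABET) == 88

def encode88(data):
    out = []
    for i in range(0, len(data), 4):
        n = int.from_bytes(data[i:i+4].ljust(4, b'\0'), 'big')
        digits = []
        for _ in range(5):                 # 88**5 > 2**32, so 5 digits suffice
            n, r = divmod(n, 88)
            digits.append(ALPHABET[r])
        out.append(''.join(reversed(digits)))
    return ''.join(out), len(data)         # keep the length to undo the padding

def decode88(s, length):
    out = bytearray()
    for i in range(0, len(s), 5):
        n = 0
        for ch in s[i:i+5]:
            n = n * 88 + ALPHABET.index(ch)
        out += n.to_bytes(4, 'big')
    return bytes(out[:length])

data = b'\x00\x01hello world\xff'
encoded, length = encode88(data)
assert decode88(encoded, length) == data
```

Base 88 leaves enough headroom to avoid the characters that need escaping inside a string literal; as discussed above, though, it compressed worse than the later lookup-table and literal approaches.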