-
Notifications
You must be signed in to change notification settings - Fork 156
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Possible to pare down code size? #2786
Comments
cc @msaboff if JSC has any numbers to share re: Apple's resource-constrained devices. |
I have some good things to report on this topic. I've spent the last 2 weeks trying out some experiments in V8 to see what is most effective in reducing the binary size. I've been measuring as follows. I use
Edit: also make sure to add To try to reproduce the size measurements of the object files above, I used the following: for file in builtins-temporal js-temporal-objects temporal-parser; do \
size -A -d out/android_arm.release/obj/v8_base_without_compiler/$file.o | \
grep -E '\.(text|rodata|data)' | cut -c20-25 | jq -s add; \
done Here's what I get:
More to come in following comments! |
A short overview of where the size goes to. JS builtins are the JS function objects, approximately equivalent to intrinsics using the %-notation in the spec text, e.g. %Temporal.PlainDate.prototype.add%. There is one per public API and this is pretty much fixed. Temporal has ~300 of them. C++ builtins are C++ functions linked in such a way that when you call a JS builtin from JS, the C++ builtin is executed. Currently, Temporal has about ~300 of them, one per public API, but unlike JS builtins that number is not fixed. There is also C++ code called by the C++ builtins, but that isn't a builtin itself. Of course the machine code generated by the compiler also takes up space in the binary. The only way you could reduce the number of JS builtins is by removing public API from the proposal. (Is it really the only way? You could try sharing the same JS builtin between more than one API entry point, but then you'd have e.g. You can reduce the number of C++ builtins by reusing the same C++ builtin for multiple JS builtins. I'll call this "multiplexing". I've found that C++ builtins take up the most space in the binary, You can also reduce the size of the C++ code, which hopefully leads to the compiled machine code being smaller. You really have to delete a lot of code for this to have any noticeable effect, I found. |
Multiplexing. There seem to be many opportunities to combine similar C++ builtins into one. For example, the getters for the various types' function CalendarIdGetter() {
let slotValue;
switch (arguments.callee[PRIVATE_SYMBOL_TEMPORAL_CONSTRUCTOR]) {
case PLAIN_DATE:
if (!IsTemporalDate(this)) throw new TypeError();
slotValue = this[PLAIN_DATE_CALENDAR_SLOT];
break;
case PLAIN_DATE_TIME:
if (!IsTemporalDateTime(this)) throw new TypeError();
slotValue = this[PLAIN_DATE_TIME_CALENDAR_SLOT];
break;
// ...
}
return ToTemporalCalendarIdentifier(slotValue);
} Then each getter is a separate JS function object, but each calls the same C++ builtin as its body. We can stash the needed info to control the C++ builtin's behaviour on private properties of that JS function object, for example here we'd stash the numeric constant We can do similar things for the The runtime overhead of this is basically 1 Note that there is precedent in the V8 codebase for this: There isn't a separate builtin for each TypedArray constructor, instead they all share the same one. (Although it's a TurboFan builtin, not a C++ builtin.) I have a branch at ptomato/v8@3cb3a6f that implements some of this multiplexing. We could do more, or less of it, I just picked a few of the ones that seemed like obvious gains. |
You can also take the multiplexing idea to its extreme conclusion and multiplex everything into one giant C++ builtin. That essentially looks like this, in JS-like pseudocode: function TemporalMethodMultiplexer() {
switch (arguments.callee[PRIVATE_SYMBOL_TEMPORAL_CONSTRUCTOR]) {
case PLAIN_DATE:
if (!IsTemporalDate(this)) throw new TypeError();
switch (arguments.callee[PRIVATE_SYMBOL_METHOD_NAME]) {
case 'add':
return TemporalPlainDatePrototypeAdd(this,
arguments[0], arguments[1]);
case 'subtract':
return TemporalPlainDatePrototypeSubtract(this,
arguments[0], arguments[1]);
// ...
}
case PLAIN_DATE_TIME:
if (!IsTemporalDateTime(this)) throw new TypeError();
switch (arguments.callee[PRIVATE_SYMBOL_METHOD_NAME]) {
// ...
}
// ...
}
} This makes the runtime overhead basically 2 This approach would probably need some fine-tuning based on usage data, because the order in which methods are considered, does matter for performance... I also have a branch implementing this approach, at v8/v8@fab0ae9. (For reasons, it multiplexes everything into 3 builtins: one for all constructors, one for all static methods, and one for all prototype methods.) (This is what suggests to me that the most effective gains come from reducing the number of C++ builtins. This branch removes builtins but almost no actual implementation.) |
Interesting, multiplexing is a much bigger reduction than I expected. Might be worth revisiting the decision. Multiplexing was discussed and rejected in the past as a general technique because it muddies performance timers and tracing. One can argue that Temporal performance isn't so important, but the idea is if you're tracing a whole application, you'd want to know that X% of the app time is spent in, say, Date subtraction vs DateTime subtraction. That would be harder to tease out if there's just a multiplexed C++ frame for both. |
Is there any interest in API simplification itself? It is extremely batteries-included. |
I've been focusing primarily on whether we can achieve gains without touching the current state of the proposal and delaying it even more. But I did do a build where I simply ripped out a bunch of the APIs and all the code associated with them, probably about 30–40% of the total proposal, applied on top of the fully multiplexed build. That build was So I think it's not a question of ripping out a few of the included batteries. For comparison, I also did a build where I implemented #2789 and #2798, which are PRs with potential observable optimizations that we were not planning to pursue as normative changes unless they yielded significant size reductions. |
Of course only after I wrote up everything here did I figure out why my baseline number was so much higher than the one in the OP — my build had runtime callstats turned on, while Shu mentioned:
(So I assumed that the release builds I'd been making also had runtime callstats disabled, but that was not the case.) I'll go back through the thread and edit in the updated numbers, leaving the old ones visible with The size reduction from multiplexing is not as dramatic as I originally thought, but it is still large and in my opinion worth doing. Additionally, taking the full multiplexing approach could allow V8 to switch runtime callstats back on by default, if Temporal was the only reason it was disabled; once all the Temporal builtins have been removed by multiplexing, runtime callstats actually make The partial multiplex approach saves -36k, the full multiplex approach saves -92k, and ripping out 30–40% of the proposal still saves about -56k although this time, due to page alignment, we lose several k of padding additionally. |
See #2834 for a different approach to reducing the number of builtins. |
I've been asked to comment on this. Summary: size is not free, it does make sense to trim bloat when possible. Multiplexing addresses one aspect of the cost of having many functions, by avoiding duplication of machine code compiled from C++ (or whatever language the engine is written in), and it's nice to see those savings. (FWIW, I'd use the Multiplexing doesn't address another aspect of the cost of having many functions though, and that's simply having the JavaScript object hierarchy. Each object property increases the size of that object. Each closure (which all have observable identity: As a related data point, V8 uses a "snapshotting" technique (deserializing a heap snapshot that's embedded in the shipped binary) because creating all of Since you're already experimenting with custom V8 builds to measure things, you could relatively simply measure what dropping all or part of |
I haven't measured the runtime memory consumption. Those experiments could be done, but I focused on the static code size here because that's what @syg requested at the top of the thread. If we need to reverse course on that, I'd appreciate having some guidelines as to what kind of runtime size increase Google would consider acceptable and what would not. Is the snapshot size not included in the stats that Bloaty gives when comparing d8 binaries? |
I didn't mean to suggest reversing course. Binary size is an important metric. My previous comment came from a more general "why proposal size reductions are worth considering" perspective: in addition to binary size, memory consumption matters too. Sorry that that was a bit of a deviation from the original topic of this thread; feel free to discuss it elsewhere if you prefer. I don't know how to determine a number X such that "X bytes is acceptable, X+1 bytes is not" (for either binary size or memory consumption). The less, the better. If you can cover 98% of the use cases for 70% of the cost, go for it. There may be a few developers who will complain that some niche feature they liked is gone, but on the flip side there will be a few billion users who will be happy that the software on their devices is a little less bloated. |
@jakobkummerow Thanks for the explanation. This is really helpful. I share your goal of reducing complexity and surface area where possible. I'm a little curious what your thoughts are on a different kind of multiplexing, more like TypedArray methods: what if we reused the same function identity, but then multiplexed internally among function bodies, as proposed for toJSON #2858 ? Not that this weakens your argument, but I'm just curious about the impact on making the snapshot bigger, and therefore more snapshot restoration work: Has the V8 team recently considered implementing lazy allocation of the objects backing some JS builtins, as other browsers do? I wonder if that could help mitigate the runtime memory usage/startup time cost. Is such a technique too complex/costly? |
I haven't really formed an opinion on which of the size reduction ideas under discussion are most worthwhile/impactful/acceptable.
I haven't heard of any such explorations recently, fwiw. |
I'm assuming the slate of removals that was approved in the TC39 meeting on 2024-06-12 is sufficient to close this issue. Please let us know ASAP if that's not the case. |
As discussed elsewhere, Chrome and V8 would ideally like to see the size of the proposal reduced to make shipping on low-resource devices like lower end Android phones more palatable.
This issue is only for focusing on the static code size, the size that's in the shipped binary of Chrome.
The bulk of Temporal in V8 is implemented across 3 files:
builtins-temporal.cc
,js-temporal-objects.cc
, andtemporal-parser.cc
. IIUC this isn't even complete yet. @FrankYFTang, is that the case?The numbers we have as of V8 12.4 on 32bit release Android APK:
.text
+.data
+.rodata
+.data.rel.ro
builtins-temporal.cc
js-temporal-objects.cc
temporal-parser.cc
V8 contributes a total of 9539.62 KiB, which makes Temporal at least 2.6% of all of V8.
This is after size optimizations. V8 has already disabled runtime callstats (a profiling functionality that bloats code size) on release channels due to Temporal.
The guidelines from the Android folks is that a feature that adds 100-250 KiB is considered Medium, with 250+ KiB considered Large. "Features" in that context is not strictly language features. A new JIT tier is also a feature, and V8 uses a lot of size budget for our compilers since we believe better performance has outsized impact.
So, ideally, any reduction in code size here would be welcome. The closer we get to 100 KiB the better.
The text was updated successfully, but these errors were encountered: