Add intrinsicsize attributes to images #144

Merged

merged 1 commit into from Feb 8, 2019

4 changes: 2 additions & 2 deletions src/blog/10-years.md
@@ -54,7 +54,7 @@ Recently, we shipped a baseline compiler for WebAssembly named [Liftoff](/blog/l
Chrome’s V8 Bench score over the years shows the performance impact of V8’s changes. (We’re using the V8 Bench because it’s one of the few benchmarks that can still run in the original Chrome beta.)

<figure>
<img src="/_img/10-years/v8-bench.png" srcset="/_img/10-years/v8-bench@2x.png 2x" alt="">
<img src="/_img/10-years/v8-bench.png" srcset="/_img/10-years/v8-bench@2x.png 2x" intrinsicsize="1247x572" alt="">
<figcaption>Chrome’s <a href="http://www.netchain.com/Tools/v8/">V8 Bench</a> score from 2008 to 2018</figcaption>
</figure>

@@ -65,7 +65,7 @@ However, you might notice two performance dips over the years. Both are interest
Another take-away from this chart is that it starts to level off around 2013. Does that mean V8 gave up and stopped investing in performance? Quite the opposite! The flattening of the graphs represents the V8 team’s pivot from synthetic micro-benchmarks (such as V8 Bench and Octane) to optimizing for [real-world performance](/blog/real-world-performance). V8 Bench is an old benchmark that doesn’t use any modern JavaScript features, nor does it approximate actual real-world production code. Contrast this with the more recent Speedometer benchmark suite:

<figure>
<img src="/_img/10-years/speedometer-1.png" srcset="/_img/10-years/speedometer-1@2x.png 2x" alt="">
<img src="/_img/10-years/speedometer-1.png" srcset="/_img/10-years/speedometer-1@2x.png 2x" intrinsicsize="1247x572" alt="">
<figcaption>Chrome’s <a href="https://browserbench.org/Speedometer/">Speedometer 1</a> score from 2013 to 2018</figcaption>
</figure>

14 changes: 7 additions & 7 deletions src/blog/array-sort.md
@@ -158,14 +158,14 @@ The output shows the `object` after it’s sorted. Again, there is no right answ
V8 has two pre-processing steps before it actually sorts anything. First, if the object to sort has holes and elements on the prototype chain, they are copied from the prototype chain to the object itself. This frees us from caring about the prototype chain during all remaining steps. This is currently only done for non-`JSArray`s but other engines do it for `JSArray`s as well.

<figure>
<img src="/_img/array-sort/copy-prototype.svg" alt="">
<img src="/_img/array-sort/copy-prototype.svg" intrinsicsize="641x182" alt="">
<figcaption>Copying from the prototype chain</figcaption>
</figure>
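
For illustration, here is a minimal JavaScript sketch (hypothetical values, not V8 internals) of the behavior this step supports: `sort` observes elements inherited from the prototype chain.

```js
// A sparse array-like object with a hole at index 1…
const obj = { 0: 'c', 2: 'a', length: 3 };
// …whose prototype supplies a value for that hole.
Object.setPrototypeOf(obj, { 1: 'b' });

// sort() reads the inherited element, which is why the engine can
// first copy it onto the object itself and then ignore the
// prototype chain for the rest of the algorithm.
Array.prototype.sort.call(obj);
console.log(obj.length, obj[0], obj[1], obj[2]);  // → 3 a b c
```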

The second pre-processing step is the removal of holes. All elements in the sort range are moved to the beginning of the object, and `undefined`s are moved after that. This is even required by the spec to some degree, as it requires us to *always* sort `undefined`s to the end. The result is that a user-provided comparison function will never get called with an `undefined` argument. After the second pre-processing step, the sorting algorithm only needs to consider non-`undefined`s, potentially reducing the number of elements it actually has to sort.

<figure>
<img src="/_img/array-sort/remove-array-holes.svg" alt="">
<img src="/_img/array-sort/remove-array-holes.svg" intrinsicsize="815x297" alt="">
<figcaption>Removing holes and moving <code>undefined</code>s to the end</figcaption>
</figure>
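
The observable contract described above can be sketched in a few lines of JavaScript: `undefined`s always end up at the end, and the comparison function never sees them.

```js
const arr = [undefined, 'b', , 'a'];  // one undefined, one hole

arr.sort((x, y) => {
  // The user-provided comparator is never called with undefined:
  console.assert(x !== undefined && y !== undefined);
  return x < y ? -1 : x > y ? 1 : 0;
});

console.log(arr);  // [ 'a', 'b', undefined, <1 empty item> ]
```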

@@ -208,7 +208,7 @@ Runs that are found this way are tracked using a stack that remembers a starting
- `|B| > |A|`

<figure>
<img src="/_img/array-sort/runs-stack.svg" alt="">
<img src="/_img/array-sort/runs-stack.svg" intrinsicsize="770x427" alt="">
<figcaption>Runs stack before and after merging <code>A</code> with <code>B</code></figcaption>
</figure>
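
As a simplified sketch (plain JavaScript, not V8's actual Torque code), checking the invariants on the topmost three runs might look like this, where `A` is the run on top of the stack:

```js
// Each run is { start, length }. A is the topmost run, B is below
// A, and C is below B. Timsort merges until both invariants hold.
function invariantsHold(runStack) {
  const n = runStack.length;
  const A = runStack[n - 1];
  const B = n >= 2 ? runStack[n - 2] : null;
  const C = n >= 3 ? runStack[n - 3] : null;
  if (B !== null && !(B.length > A.length)) return false;            // |B| > |A|
  if (C !== null && !(C.length > B.length + A.length)) return false; // |C| > |B| + |A|
  return true;
}
```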

@@ -263,7 +263,7 @@ A fast-path then simply becomes a set of function pointers. This means we only n
### Sort state

<figure>
<img src="/_img/array-sort/sort-state.svg" alt="">
<img src="/_img/array-sort/sort-state.svg" intrinsicsize="570x710" alt="">
</figure>

The picture above shows the “sort state”. It’s a `FixedArray` that keeps track of all the things needed while sorting. Each time `Array#sort` is called, such a sort state is allocated. Entries 4 to 7 are the set of function pointers discussed above that comprise a fast-path.
@@ -287,21 +287,21 @@ Before we started with `Array#sort`, we added a lot of different micro-benchmark
Keep in mind that in these cases the JIT compiler can do a lot of work, since sorting is nearly all we do. This also allows the optimizing compiler to inline the comparison function in the JavaScript version, while we have the call overhead from the builtin to JavaScript in the Torque case. Still, we perform better in nearly all cases.

<figure>
<img src="/_img/array-sort/micro-bench-basic.svg" alt="">
<img src="/_img/array-sort/micro-bench-basic.svg" intrinsicsize="616x371" alt="">
</figure>
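
A hypothetical micro-benchmark in the spirit of the ones charted above, sorting `PACKED_SMI_ELEMENTS` arrays with a user-supplied comparator (names and constants are illustrative):

```js
function bench(size, iterations) {
  // Small integers keep the array in the PACKED_SMI_ELEMENTS kind.
  const template = Array.from({ length: size }, () => (Math.random() * 1000) | 0);
  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    template.slice().sort((a, b) => a - b);  // copy, then sort the copy
  }
  return performance.now() - start;
}

console.log(`${bench(1000, 1000).toFixed(1)} ms`);
```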

The next chart shows the impact of Timsort when processing arrays that are already sorted completely, or have sub-sequences that are already sorted one way or another. The chart uses Quicksort as a baseline and shows the speedup of Timsort (up to 17× in the case of “DownDown” where the array consists of two reverse-sorted sequences). As can be seen, except in the case of random data, Timsort performs better in all other cases, even though we are sorting `PACKED_SMI_ELEMENTS`, where Quicksort outperformed Timsort in the microbenchmark above.

<figure>
<img src="/_img/array-sort/micro-bench-presorted.svg" alt="">
<img src="/_img/array-sort/micro-bench-presorted.svg" intrinsicsize="600x371" alt="">
</figure>
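
For instance, a hypothetical generator for the “DownDown” input shape mentioned above (two reverse-sorted sequences concatenated):

```js
function downDown(n) {
  const half = n >> 1;
  const down = (len, offset) =>
      Array.from({ length: len }, (_, i) => offset + len - i);
  return [...down(half, 0), ...down(n - half, half)];
}

console.log(downDown(8));  // [ 4, 3, 2, 1, 8, 7, 6, 5 ]
```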

### Web Tooling Benchmark

The [Web Tooling Benchmark](https://github.com/v8/web-tooling-benchmark) is a collection of workloads of tools usually used by web developers such as Babel and TypeScript. The chart uses JavaScript Quicksort as a baseline and compares the speedup of Timsort against it. In almost all benchmarks we retain the same performance with the exception of chai.

<figure>
<img src="/_img/array-sort/web-tooling-benchmark.svg" alt="">
<img src="/_img/array-sort/web-tooling-benchmark.svg" intrinsicsize="990x612" alt="">
</figure>

The chai benchmark spends *a third* of its time inside a single comparison function (a string distance calculation). The benchmark is the test suite of chai itself. Due to the data, Timsort needs some more comparisons in this case, which has a bigger impact on the overall runtime, as such a big portion of time is spent inside that particular comparison function.
8 changes: 4 additions & 4 deletions src/blog/background-compilation.md
@@ -19,7 +19,7 @@ However, due to limitations in V8’s original baseline compiler, V8 still neede
V8’s Ignition bytecode compiler takes the [abstract syntax tree (AST)](https://en.wikipedia.org/wiki/Abstract_syntax_tree) produced by the parser as input and produces a stream of bytecode (`BytecodeArray`) along with associated meta-data which enables the Ignition interpreter to execute the JavaScript source.

<figure>
<img src="/_img/background-compilation/bytecode.png" alt="">
<img src="/_img/background-compilation/bytecode.png" intrinsicsize="1162x523" alt="">
</figure>
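
The generated bytecode can be inspected with V8’s `--print-bytecode` flag (available in `d8`, and in Node.js via `node --print-bytecode`); a minimal example, with the filter flag assumed to behave as in current V8 builds:

```js
// Run with: node --print-bytecode --print-bytecode-filter=add add.js
function add(a, b) {
  return a + b;
}

// Ignition compiles the function lazily, so call it to trigger
// bytecode generation and the corresponding debug output.
add(1, 2);
```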

Ignition’s bytecode compiler was built with multi-threading in mind; however, a number of changes were required throughout the compilation pipeline to enable background compilation. One of the main changes was to prevent the compilation pipeline from accessing objects in V8’s JavaScript heap while running on the background thread. Objects in V8’s heap are not thread-safe, since JavaScript is single-threaded, and might be modified by the main thread or V8’s garbage collector during background compilation.
@@ -31,7 +31,7 @@ Bytecode finalization involves building the final `BytecodeArray` object, used t
With these changes, almost all of the script’s compilation can be moved to a background thread, with only the short AST internalization and bytecode finalization steps happening on the main thread just before script execution.

<figure>
<img src="/_img/background-compilation/threads.png" alt="">
<img src="/_img/background-compilation/threads.png" intrinsicsize="1211x307" alt="">
</figure>

Currently, only top-level script code and immediately invoked function expressions (IIFEs) are compiled on a background thread — inner functions are still compiled lazily (when first executed) on the main thread. We are hoping to extend background compilation to more situations in the future. However, even with these restrictions, background compilation leaves the main thread free for longer, enabling it to do other work such as reacting to user interaction, rendering animations, or otherwise producing a smoother, more responsive experience.
@@ -41,11 +41,11 @@ Currently, only top-level script code and immediately invoked function expressio
We evaluated the performance of background compilation using our [real-world benchmarking framework](/blog/real-world-performance) across a set of popular webpages.

<figure>
<img src="/_img/background-compilation/desktop.png" alt="">
<img src="/_img/background-compilation/desktop.png" intrinsicsize="1424x880" alt="">
</figure>

<figure>
<img src="/_img/background-compilation/mobile.png" alt="">
<img src="/_img/background-compilation/mobile.png" intrinsicsize="1616x1290" alt="">
</figure>

The proportion of compilation that can happen on a background thread varies depending on the proportion of bytecode compiled during top-level streaming-script compilation versus being lazily compiled as inner functions are invoked (which must still occur on the main thread). As such, the proportion of time saved on the main thread varies, with most pages seeing a 5% to 20% reduction in main-thread compilation time.
26 changes: 13 additions & 13 deletions src/blog/concurrent-marking.md
@@ -20,24 +20,24 @@ Marking is a phase of V8’s [Mark-Compact](https://en.wikipedia.org/wiki/Tracin
We can think of marking as a [graph traversal](https://en.wikipedia.org/wiki/Graph_traversal). The objects on the heap are nodes of the graph. Pointers from one object to another are edges of the graph. Given a node in the graph we can find all out-going edges of that node using the [hidden class](/blog/fast-properties) of the object.

<figure>
<img src="/_img/concurrent-marking/00.svg" alt="">
<img src="/_img/concurrent-marking/00.svg" intrinsicsize="508x293" alt="">
<figcaption>Figure 1. Object graph</figcaption>
</figure>

V8 implements marking using two mark-bits per object and a marking worklist. Two mark-bits encode three colors: white (`00`), grey (`10`), and black (`11`). Initially all objects are white, which means that the collector has not discovered them yet. A white object becomes grey when the collector discovers it and pushes it onto the marking worklist. A grey object becomes black when the collector pops it from the marking worklist and visits all its fields. This scheme is called tri-color marking. Marking finishes when there are no more grey objects. All the remaining white objects are unreachable and can be safely reclaimed.

<figure>
<img src="/_img/concurrent-marking/01.svg" alt="">
<img src="/_img/concurrent-marking/01.svg" intrinsicsize="380x290" alt="">
<figcaption>Figure 2. Marking starts from the roots</figcaption>
</figure>

<figure>
<img src="/_img/concurrent-marking/02.svg" alt="">
<img src="/_img/concurrent-marking/02.svg" intrinsicsize="380x290" alt="">
<figcaption>Figure 3. The collector turns a grey object into black by processing its pointers</figcaption>
</figure>

<figure>
<img src="/_img/concurrent-marking/03.svg" alt="">
<img src="/_img/concurrent-marking/03.svg" intrinsicsize="380x290" alt="">
<figcaption>Figure 4. The final state after marking is finished</figcaption>
</figure>
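
A compact single-threaded sketch of this tri-color scheme (illustrative JavaScript, not V8’s C++ implementation):

```js
const WHITE = 0, GREY = 1, BLACK = 2;

function mark(roots) {
  const worklist = [];
  for (const root of roots) {
    if (root.color === WHITE) {  // discover: white → grey
      root.color = GREY;
      worklist.push(root);
    }
  }
  while (worklist.length > 0) {
    const object = worklist.pop();
    object.color = BLACK;                 // visited: grey → black
    for (const field of object.fields) {  // out-going edges
      if (field.color === WHITE) {
        field.color = GREY;
        worklist.push(field);
      }
    }
  }
  // Every object still WHITE here is unreachable and can be reclaimed.
}
```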

@@ -48,13 +48,13 @@ Note that the marking algorithm described above works only if the application is
Marking performed all at once can take several hundred milliseconds for large heaps.

<figure>
<img src="/_img/concurrent-marking/04.svg" alt="">
<img src="/_img/concurrent-marking/04.svg" intrinsicsize="580x50" alt="">
</figure>

Such long pauses can make applications unresponsive and result in poor user experience. In 2011 V8 switched from the stop-the-world marking to incremental marking. During incremental marking the garbage collector splits up the marking work into smaller chunks and allows the application to run between the chunks:

<figure>
<img src="/_img/concurrent-marking/05.svg" alt="">
<img src="/_img/concurrent-marking/05.svg" intrinsicsize="595x50" alt="">
</figure>

The garbage collector chooses how much incremental marking work to perform in each chunk to match the rate of allocations by the application. In common cases this greatly improves the responsiveness of the application. For large heaps under memory pressure there can still be long pauses as the collector tries to keep up with the allocations.
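
One way to picture that pacing decision (a deliberately simplified sketch; the real heuristic is more involved):

```js
// Each incremental step marks enough bytes to keep up with what the
// application allocated since the previous step.
function computeStepSize(bytesAllocatedSinceLastStep, markingSpeedFactor) {
  const MIN_STEP_BYTES = 64 * 1024;  // illustrative floor so progress is made
  return Math.max(MIN_STEP_BYTES,
                  bytesAllocatedSinceLastStep * markingSpeedFactor);
}
```
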
@@ -80,13 +80,13 @@ Because of the write-barrier cost, incremental marking may reduce throughput of
**Parallel** marking happens on the main thread and the worker threads. The application is paused throughout the parallel marking phase. It is the multi-threaded version of the stop-the-world marking.
<figure>
<img src="/_img/concurrent-marking/06.svg" alt="">
<img src="/_img/concurrent-marking/06.svg" intrinsicsize="595x120" alt="">
</figure>
**Concurrent** marking happens mostly on the worker threads. The application can continue running while concurrent marking is in progress.
<figure>
<img src="/_img/concurrent-marking/07.svg" alt="">
<img src="/_img/concurrent-marking/07.svg" intrinsicsize="595x120" alt="">
</figure>
The following two sections describe how we added support for parallel and concurrent marking in V8.
@@ -96,7 +96,7 @@ The following two sections describe how we added support for parallel and concur
During parallel marking we can assume that the application is not running concurrently. This substantially simplifies the implementation because we can assume that the object graph is static and does not change. In order to mark the object graph in parallel, we need to make the garbage collector data structures thread-safe and find a way to efficiently share marking work between threads. The following diagram shows the data-structures involved in parallel marking. The arrows indicate the direction of data flow. For simplicity, the diagram omits data-structures that are needed for heap defragmentation.
<figure>
<img src="/_img/concurrent-marking/08.svg" alt="">
<img src="/_img/concurrent-marking/08.svg" intrinsicsize="655x250" alt="">
<figcaption>Figure 5. Data structures for parallel marking</figcaption>
</figure>
@@ -109,7 +109,7 @@ The implementation of the marking worklist is critical for performance and balan
The extreme sides in that trade-off space are (a) using a completely concurrent data structure for best sharing as all objects can potentially be shared and (b) using a completely thread-local data structure where no objects can be shared, optimizing for thread-local throughput. Figure 6 shows how V8 balances these needs by using a marking worklist that is based on segments for thread-local insertion and removal. Once a segment becomes full, it is published to a shared global pool where it is available for stealing. This way V8 allows marking threads to operate locally without any synchronization as long as possible and still handle cases where a single thread reaches a new sub-graph of objects while another thread starves because it has completely drained its local segments.
<figure>
<img src="/_img/concurrent-marking/09.svg" alt="">
<img src="/_img/concurrent-marking/09.svg" intrinsicsize="593x213" alt="">
<figcaption>Figure 6. Marking worklist</figcaption>
</figure>
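
In JavaScript-flavored pseudocode (thread interleaving and synchronization elided; `publish` and `steal` are hypothetical names), the segment-based worklist behaves roughly like this:

```js
const SEGMENT_CAPACITY = 64;  // illustrative constant

class MarkingWorklist {
  constructor(globalPool) {
    this.globalPool = globalPool;  // shared pool; accesses are synchronized
    this.local = [];               // thread-local segment; no locking needed
  }
  push(object) {
    this.local.push(object);
    if (this.local.length === SEGMENT_CAPACITY) {
      this.globalPool.publish(this.local);  // full segment goes to the pool
      this.local = [];
    }
  }
  pop() {
    if (this.local.length === 0) {
      // Local segment drained: try to steal a published segment.
      this.local = this.globalPool.steal() ?? [];
    }
    return this.local.pop();
  }
}
```
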
@@ -174,7 +174,7 @@ Without the memory fence the object color load operation can be reordered before
Some operations, for example code patching, require exclusive access to the object. Early on we decided to avoid per-object locks because they can lead to the priority inversion problem, where the main thread has to wait for a worker thread that is descheduled while holding an object lock. Instead of locking an object, we allow the worker thread to bailout from visiting the object. The worker thread does that by pushing the object into the bailout worklist, which is processed only by the main thread:

<figure>
<img src="/_img/concurrent-marking/10.svg" alt="">
<img src="/_img/concurrent-marking/10.svg" intrinsicsize="655x336" alt="">
<figcaption>Figure 7. The bailout worklist</figcaption>
</figure>
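
A sketch of the bailout protocol (hypothetical names such as `visitFields`; synchronization elided):

```js
// On a worker thread: never lock, bail out instead.
function visitOnWorker(object, markingWorklist, bailoutWorklist) {
  if (object.requiresExclusiveAccess) {
    bailoutWorklist.push(object);  // the main thread will visit it later
    return;
  }
  visitFields(object, markingWorklist);
}

// On the main thread, during its marking steps:
function drainBailouts(bailoutWorklist, markingWorklist) {
  let object;
  while ((object = bailoutWorklist.pop()) !== undefined) {
    visitFields(object, markingWorklist);  // safe: only the main thread runs this
  }
}
```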

@@ -219,15 +219,15 @@ Note that a white object that undergoes an unsafe layout change has to be marked
We integrated concurrent marking into the existing incremental marking infrastructure. The main thread initiates marking by scanning the roots and filling the marking worklist. After that it posts concurrent marking tasks on the worker threads. The worker threads help the main thread to make faster marking progress by cooperatively draining the marking worklist. Once in a while the main thread participates in marking by processing the bailout worklist and the marking worklist. Once the marking worklists become empty, the main thread finalizes garbage collection. During finalization the main thread re-scans the roots and may discover more white objects. Those objects are marked in parallel with the help of worker threads.

<figure>
<img src="/_img/concurrent-marking/11.svg" alt="">
<img src="/_img/concurrent-marking/11.svg" intrinsicsize="594x212" alt="">
</figure>
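
The lifecycle described above, condensed into hedged pseudocode (all function names are illustrative; the actual control flow lives in V8’s C++):

```js
function markConcurrently(heap) {
  scanRoots(heap, heap.markingWorklist);     // main thread fills the worklist
  postConcurrentMarkingTasks(heap.workers);  // workers start draining it

  while (!heap.markingWorklist.isEmpty() || !heap.bailoutWorklist.isEmpty()) {
    drainBailouts(heap.bailoutWorklist, heap.markingWorklist);  // main thread only
    drainSome(heap.markingWorklist);         // main thread helps once in a while
  }

  // Finalization: re-scan roots; newly discovered white objects are
  // marked in parallel with the help of the worker threads.
  rescanRootsAndFinalize(heap);
}
```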

## Results

Our [real-world benchmarking framework](/blog/real-world-performance) shows about 65% and 70% reduction in main thread marking time per garbage collection cycle on mobile and desktop respectively.

<figure>
<img src="/_img/concurrent-marking/12.png" alt="">
<img src="/_img/concurrent-marking/12.png" intrinsicsize="2280x1453" alt="">
</figure>

Concurrent marking also reduces garbage collection jank in Node.js. This is particularly important since Node.js never implemented idle time garbage collection scheduling and therefore was never able to hide marking time in non-jank-critical phases. Concurrent marking shipped in Node.js v10.
2 changes: 1 addition & 1 deletion src/blog/csa.md
Original file line number Diff line number Diff line change
@@ -31,7 +31,7 @@ With the advent of TurboFan the answer to this question is finally “yes”. Tu
This combination of functionality made a robust and maintainable alternative to hand-written assembly builtins feasible for the first time. The team built a new V8 component—dubbed the CodeStubAssembler or CSA—that defines a portable assembly language built on top of TurboFan’s backend. The CSA adds an API to generate TurboFan machine-level IR directly without having to write and parse JavaScript or apply TurboFan’s JavaScript-specific optimizations. Although this fast-path to code generation is something that only V8 developers can use to speed up the V8 engine internally, this efficient path for generating optimized assembly code in a cross-platform way directly benefits all developers’ JavaScript code in the builtins constructed with the CSA, including the performance-critical bytecode handlers for V8’s interpreter, [Ignition](/docs/ignition).

<figure>
<img src="/_img/csa/csa.png" alt="">
<img src="/_img/csa/csa.png" intrinsicsize="414x496" alt="">
<figcaption>The CSA and JavaScript compilation pipelines</figcaption>
</figure>
