
Research Discussion: Faster startup with dynamically generated custom v8 snapshots #9473

Closed
ghost opened this issue Nov 5, 2016 · 16 comments
Labels
discuss Issues opened for discussions and feedbacks. feature request Issues that request new features to be added to Node.js.

Comments

@ghost

ghost commented Nov 5, 2016

If you know about v8 snapshots, and node's 'default' snapshot, you might be able to provide insight into this discussion.

I seem to be using more and more 'commands' that are node based. When I run one, like npm, my understanding is that node has to reload and recompile entry.js and all of its require()d modules every time.

I'm wondering if, just before node exited, it could take a v8 snap and save it to a file named to correlate with the full path of entry.js (optimized, obviously, to only snap when something changed, not on every exit).

Then, when node was launched again, it could look for a saved snap based on entry.js and just create a new context from that snap, effectively getting to an executing state much faster.

At a high level I'm wondering if my understanding is correct: that a v8 snap is a heap dump that can be reloaded into an isolate, and that when a new context is created in the isolate it starts with modules already jitted, having come from the snap. But wouldn't there be some state 'left over' from the original snap that could potentially 'infect' the new context in some non-deterministic way? Say a module set a flag within itself, then the snap was taken; when reloaded, the flag would still be set?

Is there any way to save something from run to run, so that entry.js and all require()d modules don't have to be parsed/jitted every time?

@ghost
Author

ghost commented Nov 5, 2016

Thinking some more, I can't see how state from a previous snap could be prevented from 'infecting' a later context. So maybe entry.js would just have to tell node that it's 'agreeing' to be snapped (i.e. explicit opt-in to the behavior described above), and will deal with 'resetting' its global state as necessary. Maybe it could just call a method like vm.snapable() at any point, activating the snap-on-exit behavior, etc.

Any thoughts?

@mscdex
Contributor

mscdex commented Nov 5, 2016

Well, vm already has a way to produce/use binary code cache data. Is that what you're after, or something else?

@mscdex mscdex added question Issues that look for answers. feature request Issues that request new features to be added to Node.js. and removed question Issues that look for answers. labels Nov 5, 2016
@ghost
Author

ghost commented Nov 5, 2016

Are you referring to this vm in node?

There's nothing there that lets you generate a v8 snapshot, nor load one. Am I missing something?

@ghost
Author

ghost commented Nov 5, 2016

@mscdex Could you also mark this as a discussion? I think I could implement something like this, but I'm looking to get others' thoughts first.

For example, it could very well speed startup times, but it creates the potential for what's loaded from the snap to become out of sync with the actual source .js, e.g. if a dependency was updated.

There could/should be a way to 'clear' the snap. Or maybe require() calls could be tracked and saved with the snap, then checked for changed modification times on startup; but things like npm might have hundreds or thousands of dependency modules, so checking each one may work against the whole point of the thing.

There would also need to be a way to detect whether a new snap must be made on exit; for example, npm dynamically require()s modules based on command line args (most likely also to speed startup), which would mean tracking require() calls, adding overhead, and again working against the whole point of the thing.

I guess the first thing I can try is a POC to see if the time savings on startup are even worth pursuing further, and collect some metrics. Just let it snap on exit every time, and add a '--clear-snap' switch. I can try it with npm -v, which seems to take quite a long time just to spit out the version.

@mscdex
Contributor

mscdex commented Nov 5, 2016

@phestermcs It's probably not a V8 snapshot proper. The vm options I was referring to specifically were cachedData and produceCachedData. Also, see this V8 blog post about it.
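
For reference, a minimal sketch of the cachedData round-trip being described here (file names are illustrative; note that V8 still needs the original source even when a cache is supplied):

```js
const fs = require('fs');
const vm = require('vm');

const source = fs.readFileSync('entry.js', 'utf8');

// First run: compile and ask V8 to emit its code cache alongside the source.
const script = new vm.Script(source, {
  filename: 'entry.js',
  produceCachedData: true,
});
if (script.cachedDataProduced) {
  fs.writeFileSync('entry.js.bin', script.cachedData);
}

// Later run: hand the cache back; V8 silently falls back to a full parse
// if the cache is stale or was built by a different V8 version.
const script2 = new vm.Script(source, {
  filename: 'entry.js',
  cachedData: fs.readFileSync('entry.js.bin'),
});
console.log('cache rejected:', script2.cachedDataRejected);
```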

@ghost
Author

ghost commented Nov 5, 2016

I see now; I didn't catch that! That would be another way to go as well, maybe even a better, simpler one. In fact I think I can tweak node's module.js to POC it and see if there's a big difference. It would address the issue of reloading state, and therefore could in theory be applied automatically rather than an entry.js having to opt into snapping.

I'm still curious just how much of a boost a snap would have though.

Thanks for the clarification!

@ghost ghost changed the title Discussion: Faster startup with auto custom v8 snapshots Discussion: Faster startup with dynamically generated custom v8 snapshots Nov 5, 2016
@ghost
Author

ghost commented Nov 5, 2016

@mscdex Would you mind adding the 'discussion' label to this issue?

I tweaked node and made a POC that creates and loads cachedData files for each '.js' file that gets require()d; the first require() generates a '.js.bin' file adjacent to the '.js' file, and subsequent require()s from separate executions then load the cachedData from the '.js.bin' file.
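
A userland approximation of that tweak might look like the following (a sketch only; the actual POC patched node's internal module.js, and the Module internals used here are undocumented):

```js
const fs = require('fs');
const path = require('path');
const vm = require('vm');
const Module = require('module');

// Replace the default .js loader with one that persists V8's code cache
// next to each source file. Relies on undocumented Module internals.
Module._extensions['.js'] = function (module, filename) {
  const source = fs.readFileSync(filename, 'utf8');
  const binFile = filename + '.bin';

  let cachedData;
  try { cachedData = fs.readFileSync(binFile); } catch (e) { /* first run */ }

  const script = new vm.Script(Module.wrap(source), {
    filename,
    cachedData,
    produceCachedData: !cachedData,
  });
  if (script.cachedDataProduced) fs.writeFileSync(binFile, script.cachedData);

  // Run the module wrapper the same way Module._compile would.
  const wrapper = script.runInThisContext();
  wrapper.call(module.exports, module.exports, module.require.bind(module),
               module, filename, path.dirname(filename));
};
```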

Unfortunately, and surprisingly, I did not notice any improvement in startup times running things like npm, tsc, and a couple of others; there might have been some, but it was not detectable to the human eye. (I did discover npm launches itself twice on every run! Or, more precisely, the shell script that launches 'node npm-cli.js' does: once to get the 'prefix' config, and then a second time to re-run the npm-cli.js located relative to 'prefix'. Yuck. But I digress.)

Is there anyone out there who might have insight as to why, in node's case, code caching in the above way didn't provide any benefit?

The next experiment is to dynamically generate snapshots, which is a bit more complex to do.

@ghost
Author

ghost commented Nov 5, 2016

@rvagg I see from some commits that you may have insight into snapshots. This issue wasn't intended to be a feature request, but a research discussion around a potential way of improving startup times for node-based shell commands. Just wondering if you have any thoughts as to viability, potential issues, etc.

You also made a comment on another issue I had in the Modules subsystem. I posted a PR showing what I was talking about. I'm sure you're incredibly busy, but it would be a big favor if you took a peek. I totally get that the node core team will never consider changes to Modules, so I'd much rather have any technical insight you might offer, without the politics if possible? That would be kind of amazing to me if you did that :)

@bnoordhuis
Member

At a high level I'm wondering if my understanding is correct, that a v8 snap is a heap dump that can be reloaded into an isolate, and that when a new context is created in the isolate it starts with modules already jitted, having come from the snap.

That's correct. A snapshot is essentially the serialized state of the process at the time the snapshot was taken. It's a bit like emacs's unexec or sbcl's save-lisp-and-die command.

cachedData does something different; it's just a precompile step. It takes source code and compiles it to baseline machine code, but it doesn't optimize or run it. Unless you are on a slow system (like raspberry pi 1 slow), the overhead of the baseline compiler is negligible for most applications.

The problem with snapshots is that there is no way to revive external state, like file descriptors, child processes, etc. That makes it useless for quickstarting most applications.

@ghost
Author

ghost commented Nov 5, 2016

@bnoordhuis Thanks for the insight!

...the overhead of the baseline compiler is negligible for most applications

That would explain why my simple POC using cachedData had no noticeable effect.

The problem with snapshots is that there is no way to revive external state, like file descriptors, child processes, etc. That makes it useless for quickstarting most applications.

Does this mean, then, that if an application did not have any state outside the v8 heap, a snapshot would in fact substantially improve startup time?

If so, then for the use case I have in mind it might be worth the effort to explore this further. Many node commands I'm using seem to follow the same basic pattern:

  1. Some amount of .js gets require()d from just starting up
  2. One or more config file(s) are opened, read to set flags, etc., then closed
  3. Command line args are processed to set/override flags, etc.
  4. Files specified on the command line are opened, read, and transformed into some kind of output.

So what if the application, between steps 1 and 2, could explicitly inform node that it was at a point where a snapshot could be taken, and that the snapshot would not contain any external state of any importance?
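
As a sketch of what that opt-in might look like (process.markSnapshotSafe() is hypothetical, no such API exists; the guard keeps the file runnable today):

```js
// Step 1: require() everything up front; only pure-JS state in the heap so far.
const path = require('path');

// Hypothetical opt-in point, between steps 1 and 2: the heap holds no
// file descriptors, sockets, or other external state yet.
if (typeof process.markSnapshotSafe === 'function') {
  process.markSnapshotSafe();
}

// Steps 2-4: from here on, external state appears and a snapshot is unsafe.
const config = { cwd: process.cwd() };
console.log('running', path.basename(process.argv[1] || 'node'), config);
```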

  1. Would actually taking the snapshot take some inordinate amount of time, like dozens of seconds or minutes? (I get that it depends on how much has been require()d; just trying to get a 'feel' for the overhead of actually taking a snapshot.)
  2. If the application did get to step 2 from a snapshot, would there be a substantial reduction in startup time, on the order of at least 4 times or more (i.e. from 400ms to 100ms or less)?

If startup time could be reduced by at least 4 times or more, it would be compelling to me. I've looked around and others have indicated startup times being reduced by up to 20 times in their particular use case, and the way they described using snapshots implied similar gains could be achieved here. Would you agree, assuming the issues you mentioned were addressed in some way?

@ghost
Author

ghost commented Nov 6, 2016

Did some more research. I'm hoping someone with experience using v8 snapshots could confirm the following.

But first, could someone validate that, assuming for the moment all the state issues could be addressed, a snapshot of an application like npm would offer a substantial improvement not just in startup time, but in overall execution time as well, since everything would already have been jitted? By substantial I'm thinking at least a 4x improvement, hopefully much more. This first, kinda 'eyeball-it' estimate of potential gain is a gate I'd like to get through before researching this issue further, so if you know what kind of gains are typically possible, please let me know :)

Reviewing the node source, it appears that dynamically creating a snapshot that included some portion of a user-land application's compiled .js and pure-js instances wouldn't work when reloaded, because even if the application ensured it had no references to any resources outside the v8 heap (OS handles, Buffers, add-ons), node would still have created, and held references to (via bootstrap_node.js), host instances like process that in some way hold references/pointers to the native implementation (i.e. FunctionTemplates), which obviously lives outside the v8 heap.

Now generally, what's important about creating a usable v8 snapshot is that when the snapshot is created, the v8 heap has no references to things outside it. However, this does not mean that everything that was executed to create the heap state couldn't at some point have held references to the outside. For example, the environment could have exposed a FunctionTemplate instance that just provided access to a single native function that could read a file. While creating the heap state, some pure js could have used that function to, say, load a json file which was then deserialized into a pure-js object in the heap, then nulled the reference to the FunctionTemplate instance. The heap would then have no references to outside state, and could be safely reloaded and used.

I know intuitively that is conceptually true, but I'm wondering if the special Isolate you get from SnapshotCreator in some way blocks creation of FunctionTemplates, etc., or if it just doesn't care and leaves it up to the consumer to ensure the state of the v8 heap at the point the snapshot is taken. I'll assume for the moment it doesn't care.

It appears that with a normal build of node, a snapshot is made of v8's core library. So the startup sequence is fundamentally:

  1. v8 creates an Isolate, and deserializes the v8 core snapshot into it
  2. v8 creates some host types/instances (FunctionTemplate)
  3. node creates an Environment that creates several host types/instances (process, fs, etc)
  4. node runs bootstrap_node.js that creates references to the host types/instances
  5. bootstrap_node runs the entry.js specified on command line

Definitely a challenge, but I think there's a way to make this work. It would require that an application explicitly inform node when it could be snapped, which is easy. The hard part would be rebinding all the host types to host instances, like process, after a snapshot was deserialized.

I think I have a way to make that happen, but I'm still researching a bit. First, though, it would be such a huge favor if someone with hands-on experience could kinda vet and validate the above, and even more importantly could make a statement like "If you can make a snapshot, the time savings would be much more than 4x and it would totally be worth it".

@ghost ghost changed the title Discussion: Faster startup with dynamically generated custom v8 snapshots Resarch Discussion: Faster startup with dynamically generated custom v8 snapshots Nov 6, 2016
@ghost ghost changed the title Resarch Discussion: Faster startup with dynamically generated custom v8 snapshots Research Discussion: Faster startup with dynamically generated custom v8 snapshots Nov 6, 2016
@bnoordhuis
Member

The embedder - i.e., node.js - can register FunctionTemplates and ObjectTemplates when the snapshot is created. There is probably no way to make that work for native add-ons, though.

Hard to say whether it's going to be significantly faster on average. Taking a snapshot right after start-up won't help much if your application doesn't do significant processing at start-up. Most of the main application will still be baseline code (or won't have been compiled at all if lazy compilation is enabled) because it hasn't run long enough to reach the optimizing tier.

@Fishrock123 Fishrock123 added the discuss Issues opened for discussions and feedbacks. label Nov 7, 2016
@ghost
Author

ghost commented Nov 7, 2016

You're being very helpful @bnoordhuis, and it's truly appreciated.

Regarding Templates, I'm thinking registering them wouldn't be the problem; rather, having a still-referenced instance of them in the heap would be. When the snap was deserialized in a new isolate, those instances would be pointing to who knows what. So not creating an instance, or creating an instance but dereferencing it before the snap, I'm reasoning might be OK.

Regarding add-ons: just because the bits aren't loaded with the main process doesn't mean they're living under different constraints regarding the above (v8 itself has no idea where Templates are defined from). An add-on could define Templates that no instance is ever made of, and nothing would be the wiser.

I was wondering exactly how much is compiled, and you've clarified that the level of compilation would be exactly that of having used vm.Script's cachedData, and only if lazy compilation is off. So other than disk I/O, which the OS has probably cached already, it wouldn't be any faster than my POC using cachedData.

My next main gating question, then, is:

  1. What is the difference in performance between just-baseline-compiled code and code that has been optimized as fully as possible?

Before I created this issue, my thought was very simple: v8 keeps doing all this work every time I run the same command; it would be nice to keep all of that for the next time I run it.

It seems V8 started its life running js in browsers, so it's been tuned to support that context best, where time-to-something-running is more important than running-as-fast-as-possible as-soon-as-possible, and where the source might be changing at a higher frequency than in other contexts. There's also GUI rendering and networking potentially going on, which adds to the overall 'feel' of how long things take to start happening and complete, from a user perspective.

Using V8 in the context of a server also turns out to be pretty good, because the server process is long lived, and lots of backend services have a kind of 'warmup' period anyway. So after a short while v8 has optimized all it can and the server is executing as fast as it can. And user experience is only indirectly influenced by the server.

But then there are 'command-lets' written in node, most often used only by developers, and it's this use case I'm trying to improve if possible. In this context, compared to the browser, code doesn't change much, there's often a TON more code coming from thousands of files, and no GUI or networking potentially intermixed. And compared to the server, the commands are launched, they do some stuff, and then terminate.

Now, I've written compilers and pcode-type VMs, so I believe I have a general sense of the characteristics of such things, and maybe ignorance is bliss, because even considering how amazing v8 is at what it does, I can't help but kinda 'know' there's just no way that going from utf8 text across thousands of files to a single chunk of optimized code is done so fast that it shouldn't matter.

So I'm wondering if anyone has actually done any real-world research really nailing down how much of an impact there is between a 'cold' launch-and-run and a 'hot' launch-and-run of a 'command-let'? Obviously the longer a command runs (more specifically, loops over the same code) the less of an impact v8's compilation & optimization has on the overall time to completion, but still... Maybe a POC can take something like npm and turn it into a server.... and...

Light Bulb!

Maybe that would be the simplest way to achieve what I'm trying to do, which fundamentally is just getting an optimized binary into memory without compilation/optimization.

I think I can write a module that offers the basics of this to those writing command-lets.

In fact I've now convinced myself that my original idea of taking a snapshot can't really offer anything better than having a server running (other than improved cold start), and it's obviously orders of magnitude more complex to implement. In both cases the command-let is still responsible for cleaning/resetting global state from run to run, and in the case of a server, issues of 'staleness' of bin vs. code are much simplified as well.

I'm gunna do that (if someone hasn't already), and I guess this issue could be closed, but I'll leave it open for a little while in case others are interested in the whole concept.

Thanks @bnoordhuis for letting me bend your ear!

@bnoordhuis
Member

Regarding Templates, I'm thinking registering them wouldn't be a problem, rather having an instance of them in the heap, that was still referenced, would be the problem; when the snap would be deserialized in a new isolate, those instances would be pointing to who knows what.

If I understand your concern correctly: there is a mechanism for saving and restoring FunctionTemplate function pointers and a similar (albeit not identical) mechanism for internal fields.

(Untested but I suspect there are actually several ways to restore function pointers. One is Isolate::CreateParams::external_references, another is FunctionTemplate::FromSnapshot() followed by FunctionTemplate::SetCallHandler().)

what is the difference in performance between just baseline compiled code vs. having been fully optimized as much as possible?

I'd say 1.5-2x on average. If you want to try it out, --nocrankshaft disables the optimizing tier.
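
(A throwaway micro-benchmark is enough to see the tiers diverge; a sketch only, with an arbitrary loop body and iteration count, and the flag name as shipped with the V8 in node at the time of this thread:)

```js
// bench.js -- run `node bench.js` and `node --nocrankshaft bench.js`
function sum(n) {
  let total = 0;
  for (let i = 0; i < n; i++) total += (i * i) % 7;
  return total;
}

const start = process.hrtime();
const result = sum(1e8);
const [sec, ns] = process.hrtime(start);
console.log(result, `${(sec * 1e3 + ns / 1e6).toFixed(0)} ms`);
```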

I can't help but kinda 'know' there's just no way that going from utf8 text across thousands of files, to a single chunk of optimized code, is done so fast that it shouldn't matter.

I wouldn't say it's so fast it never matters but the baseline compiler is really just a template JIT: it's not smart at all, it's just fast (but not fast enough on mobile systems, hence the Ignition interpreter that is being added.)

Lazy compilation is another factor (and is enabled by default): generating machine code for a function doesn't happen until the first time it's called.

@ghost
Author

ghost commented Nov 9, 2016

Regarding FunctionTemplates, my initial understanding came from this article on v8 snapshots; however, it's over a year old:

There is an important limitation to this: the snapshot can only capture V8’s heap. Any interaction from V8 with the outside is off-limits when creating the snapshot. Such interactions include:

  1. defining and calling API callbacks (i.e. functions created via v8::FunctionTemplate)
  2. creating typed arrays, since the backing store may be allocated outside of V8

From your comments it appears limitations around FunctionTemplates have been reduced since then.

Going from utf8 text across 1,000's of files to fully optimized binary code in a single chunk of memory (i.e. v8 has done all it would ever be able to, including optimizations resulting from continuous execution) has got to cost something. When we run node-based command line programs, that happens every single time. In that regard the comment...

but the baseline compiler is really just a template JIT: it's not smart at all, it's just fast

...was a little confusing, but I think that was my fault in not distinguishing between just a baseline compile and a fully optimized compile (which I understand v8 can, today, only achieve after having run and profiled the program).

In my mind, for command line programs, I wish I could say:

"Node & V8, please just take an extra few moments the first time I run this thing to compile and optimize the heck out of it. Then save that output to a single binary file, so the next time I run this command, you can just load that single, binary, executable"

That's really what I'm after.

As an 'eyeballing-it' measure:

I'd say 1.5-2x on average. If you want to try it out, --nocrankshaft disables the optimizing tier.

...is compelling, but I'm guessing it doesn't account for the time to actually optimize, i.e. going from baseline to fully optimized. I can see it being a bit hard to measure given the way v8 works, unless there are perf counters or the like that can aggregate how much time node/v8 spends doing things over the course of an execution: loading .js files, baseline compiling, optimized compiling, and then the difference in execution time of every function between baseline-compiled and fully optimized, and times called.

The nature of browsers and how users interact with them creates much more opportunity for V8 to amortize its optimizations over the duration of interaction, and server-based programs are long lived and in the background, so the time v8 takes to optimize is hardly noticed, and once done it's leveraged for the duration the service stays running.

Command line programs, mostly used by us developers, aren't really accounted for by v8, although at first glance snapshots seem like a path to eventually have that.

Anyways, today I think the simplest, quickest way to get what I'm after is a 'command-server' module that authors of command line programs can easily leverage. At a high level it will work like this:

  1. The command's first require() call is to a small-as-possible bootstrapper (for fastest startup).
  2. The bootstrapper looks for a background 'command-server' process, or launches one if not found.
  3. It passes the name of the entry.js file to the background service.
  4. The background service checks to see if it's already aware of entry.js, and if not calls vm.runInNewContext() to associate a separate context for the particular entry.js.
  5. For a newly created context, the bootstrapper will be loaded again, except it will know it's now running in the background, and instead of looking for a background process, will 'register' the real-entry-point-function() of the command.
  6. The foreground bootstrapper will then ipc call to the background process, connecting up stdio.
  7. The background process will then call the real-entry-point-function() to execute the command; that function will be responsible for ensuring any global state from a prior run is reset.

So, rather than a fully optimized binary being loaded from disk, there will be one already sitting and waiting in memory in the background process :)
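
A minimal sketch of the server half of that design (names like command-server.sock are illustrative; the full design would isolate each command in its own context via vm.runInNewContext(), which this sketch skips, and it assumes each request fits in one data event and each entry module exports its entry-point function):

```js
// server.js -- keeps each command's modules resident and jitted between runs.
// Unix socket path; Windows would need a named pipe instead.
const net = require('net');
const os = require('os');
const path = require('path');

const SOCKET = path.join(os.tmpdir(), 'command-server.sock');
const commands = new Map(); // absolute entry.js path -> entry-point function

net.createServer((conn) => {
  conn.once('data', (buf) => {
    const { entry, argv } = JSON.parse(buf.toString());
    if (!commands.has(entry)) {
      // First request for this command: load it once; it stays compiled
      // (and, over repeated runs, optimized) in this long-lived process.
      commands.set(entry, require(entry));
    }
    // Re-run the warm, already-jitted command. It is responsible for
    // resetting any global state left over from a previous run.
    Promise.resolve(commands.get(entry)(argv))
      .then((code) => conn.end(String(code || 0)),
            (err) => conn.end('error: ' + err.message));
  });
}).listen(SOCKET, () => console.log('command-server listening on', SOCKET));
```

A matching foreground bootstrapper would connect to that socket, send the entry path and argv, and pipe the response back to stdout.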

The first dev/POC version will be as simple as possible, just to see if there's a marked improvement in overall performance. If so, then I'll work on a full release that will be a bit more clever in how it launches and routes to background processes, so that every command across shells doesn't end up sharing a single Isolate/thread.

I'll let you know what I discover!

@ghost
Author

ghost commented Nov 12, 2016

@bnoordhuis Thanks for your insights. Closing this issue as it has served its purpose.

This issue was closed.