Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

APIs for libraries/frameworks/tools to control on-disk compilation cache (NODE_COMPILE_CACHE) #53639

Closed
joyeecheung opened this issue Jun 29, 2024 · 8 comments · Fixed by #54501

Comments

@joyeecheung
Copy link
Member

joyeecheung commented Jun 29, 2024

Spinning from #52535 (comment)

Currently, the built-in on-disk compilation cache can only be enabled by NODE_COMPILE_CACHE. It's possible for the end user to control where the NODE_COMPILE_CACHE is stored and so that it's also possible for them to find the cache and clean it up when necessary. That's the simplest enabling mechanism for sure, but from the use cases of v8-compile-cache (a package that monkey-patches the CJS loader, which is a capability that we want to sunset, see #47472). It's also common for library/framework authors to want to enable this in a more flexible manner. So this issue is opened to discuss what an API for this should look like and what the directory structure of the cache should look like.

With the global NODE_COMPILE_CACHE the current cache directory structure looks like this:

- Compile cache directory (from NODE_COMPILE_CACHE)
  - $version_hash1: CRC32 hash of CachedDataVersionTag + NODE_VERESION (maybe we need to add UID too)
  - $version_hash2:
    - $module_hash_1: CRC32 hash of filename + module type.      <--- cache files
    - $module_hash_2: ...
...

For reference v8-compile-cache's cache directory looks like this

- $tmpdir/v8-compile-cache-$uid-$arch-version
  - $main_name.BLOB: filename of the module that `require('v8-compile-cache')`, or process.cwd() if it's not required in-file
  - $main_name.MAP:
  - $main_name.LOCK

And inside the .BLOB files it maintains a module_filename + sha-1 checksum -> cache_data storage. In the documentation it explains:

The cache is entry module specific because it is faster to load the entire code cache into memory at once, than it is to read it from disk on a file-by-file basis.

In my investigation when implementing NODE_MODULE_CACHE though, there's actually not much performance difference in reading on a file-by-file basis, at least when it's implemented using native FS calls and when the file only gets loaded when the corresponding module is about to get compiled (so not all the cache is loaded into the process at once even though the module might not be needed by the application at all - which v8-compile-cache does).

For third-party tooling (e.g. transpilers, package managers) I think the layout that don't distinguish about entrypoints would still be beneficial - as long as the final resolved file path remains the same and its content matches the checksum, and it's still being loaded by the same Node.js version etc., then the cache is going to hit. Then if multiple dependencies in the same project try to enable it, we wouldn't be saving multiple caches on disk even though they are effectively caching the code for the same files (e.g. the end user code needs package foo that resolves to /path/to/foo.js, whose cache gets repeatedly stored in the cache enabled by a transpiler and then again in the cache enabled a package manager that executes a run command).

I wonder if we should just provide the following APIs:

const module = require('node:module');  // Or import it

/**
 * Enable on-disk compiled cache for all user modules being complied in the current Node.js instance
 * after this method is called.
 * If cacheDir is undefined, defaults to the NODE_MODULE_CACHE environment variable.
 * If NODE_MODULE_CACHE isn't set, default to `$TMPDIR/node_compile_cache`.
 * @param {string|undefined} cacheDir
 * @returns {string} The path to the resolved cache directory.
 */
module.enableCompileCache(cacheDir);

/**
 * @returns {string|undefined} The resolved cache directory, if on-disk compiled cache is configured.
 *   Otherwise return undefined.
 */
module.getCompileCacheDir();

process.getCompileCacheDir() would still allow end users to find and clean stale cache to release disk space. We could probably also add a file to the designated directory with a name that's easy to find (e.g. $CACHE_DIR/node_compile_cache_mark) to facilitate this too.

In most use cases, tooling and libraries should simply call module.enableCompileCache() without passing in an argument so that the cache is stored in tmpdir and can be shared with other dependencies by default, and end users can override the default cache directory location with NODE_COMPILE_CACHE. Some more advanced tooling/framework might want more advanced customizations and use their own cache directory, then they can specify it.

Some more powerful APIs are probably needed to allow advanced configuration of the cache storage, but at least the APIs mentioned above would address the use cases of existing v8-compile-cache users. For the more power API, it would be difficult to just think of one that works well without some collaboration with adopters, so ideas welcomed regarding how that should look like :)

@joyeecheung
Copy link
Member Author

cc @merceyz @jakebailey @H4ad from #47472

@benjamingr
Copy link
Member

It's also common for library/framework authors to want to enable this in a more flexible manner.

Why? What sort of flexibility would the library/framework need that the environment variable doesn't provide?

@jakebailey
Copy link

jakebailey commented Jun 29, 2024

This sounds great; supporting a default location in a reasonable location is super helpful.

Is node:module the right place for this? Or is node:v8 actually where it might be?

after this method is called.

All-in-all, I'm not sure how I feel about being unable to use this without using CJS or TLA; if an executable wants to enable caching of itself, it needs to have an extra entrypoint which only serves to enable the caching and then load the other code. Or, fork, which is slow. Not sure that one can do better, though. The call has to happen somewhere...

I guess this is exactly how v8-compile-cache works? (Not familar with its implementation but I guess it must have the same restriction...)

Why? What sort of flexibility would the library/framework need that the environment variable doesn't provide?

If you want to enable caching today, you have to set the environment variable. This means that applications which want to enable it for themselves have to fork a new process to enable it, defeating the speedup.

@joyeecheung
Copy link
Member Author

Is node:module the right place for this? Or is node:v8 actually where it might be?

I used node:module off the top of my head but node:v8 sounds reasonable too. I am slightly leaning towards node:module because this only applies to the user modules loaded by the usual module loading process (so if the user compiles some module differently via vm APIs, this won't apply, at least not automatically).

@joyeecheung
Copy link
Member Author

All-in-all, I'm not sure how I feel about being unable to use this without using CJS or TLA

You could also --import an ESM that does this call synchronously.

if an executable wants to enable caching of itself, it needs to have an extra entrypoint which only serves to enable the caching and then load the other code.

Yeah I think there is a general lack of way for libraries to "define something to be run before everything else, without the use of command line flags, or environment variables". It was also raised in the module loader hooks discussion (#52219 (comment)). IMO we need to figure out a way to allow developers/users to specify code that needs to be preloaded for every/some process/worker. But some configuration needs to happen - perhaps some magic field in package.json is a good place for it to be done, but that would probably be a separate topic.

@joyeecheung
Copy link
Member Author

joyeecheung commented Jul 29, 2024

I have a WIP at https://github.com/joyeecheung/node/tree/compile-cache-api - still needs to finish the tests and docs.

Locally with this wrapper

const { enableCompileCache } = require('module');
if (enableCompileCache) {
  enableCompileCache();
}
require('./test/fixtures/snapshot/typescript.js');

I get the following numbers:

❯ hyperfine  "./node_main ./test-tsc.js" "out/Release/node ./test-tsc.js"  --warmup 2
Benchmark 1: ./node_main ./test-tsc.js
  Time (mean ± σ):     131.7 ms ±   1.1 ms    [User: 118.2 ms, System: 12.3 ms]
  Range (min … max):   130.4 ms … 135.2 ms    21 runs

Benchmark 2: out/Release/node ./test-tsc.js
  Time (mean ± σ):      85.2 ms ±   1.4 ms    [User: 72.4 ms, System: 11.5 ms]
  Range (min … max):    83.4 ms …  90.7 ms    34 runs

Summary
  out/Release/node ./test-tsc.js ran
    1.55 ± 0.03 times faster than ./node_main ./test-tsc.js

I think for the use case of TypeScript, a trampoline entrypoint like this is still needed to 1. enable code cache and 2. load the actual lib. If the lib part is ESM, this trampoline entrypoint must either be CJS that does require(lib_esm), or ESM that import(lib_esm), it cannot load the lib via static import, because ESM compilation happens before the static importer's code gets evaluated, or there is no way for the static importer to modify how the module gets compiled currently. That would be the case until something like import.now or importSync is added to ECMA262, at least...

EDIT: actually you can do it in a ESM trampoline, just that the lib itself still cannot be imported statically, but you can have require() working as importSync() for you in an ESM via createRequire().

import { createRequire, enableCompileCache } from 'node:module';  // Or use process.getBuiltinModule()
const require = createRequire(import.meta.url);  // You can call this importSync if you want ;)
if (enableCompileCache) {
  enableCompileCache();
}
require('./test/fixtures/snapshot/typescript.js');

@joyeecheung
Copy link
Member Author

Something that I think we should add along with the API - an environment variable to disable caching, as an escape hatch for users running into bugs (it has helped some people using v8-compile-cache in #51555 - while the built-in cache would be a bit more robust, an escape hatch would still be useful in case there are bugs).

nodejs-github-bot pushed a commit that referenced this issue Aug 19, 2024
This refactors the compile cache handler in preparation for the
JS API, and updates the compile cache storage structure into:

- $NODE_COMPILE_CACHE_DIR
  - $NODE_VERION-$ARCH-$CACHE_DATA_VERSION_TAG-$UID
    - $FILENAME_AND_MODULE_TYPE_HASH.cache

This also adds a magic number to the beginning of the cache
files for verification, and returns the status, compile
cache directory and/or error message of enabling the
compile cache in a structure, which can be converted as
JS counterparts by the upcoming JS API.

PR-URL: #54291
Refs: #53639
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Ethan Arrowood <ethan@arrowood.dev>
Reviewed-By: Chengzhong Wu <legendecas@gmail.com>
RafaelGSS pushed a commit that referenced this issue Aug 19, 2024
This refactors the compile cache handler in preparation for the
JS API, and updates the compile cache storage structure into:

- $NODE_COMPILE_CACHE_DIR
  - $NODE_VERION-$ARCH-$CACHE_DATA_VERSION_TAG-$UID
    - $FILENAME_AND_MODULE_TYPE_HASH.cache

This also adds a magic number to the beginning of the cache
files for verification, and returns the status, compile
cache directory and/or error message of enabling the
compile cache in a structure, which can be converted as
JS counterparts by the upcoming JS API.

PR-URL: #54291
Refs: #53639
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Ethan Arrowood <ethan@arrowood.dev>
Reviewed-By: Chengzhong Wu <legendecas@gmail.com>
RafaelGSS pushed a commit that referenced this issue Aug 21, 2024
This refactors the compile cache handler in preparation for the
JS API, and updates the compile cache storage structure into:

- $NODE_COMPILE_CACHE_DIR
  - $NODE_VERION-$ARCH-$CACHE_DATA_VERSION_TAG-$UID
    - $FILENAME_AND_MODULE_TYPE_HASH.cache

This also adds a magic number to the beginning of the cache
files for verification, and returns the status, compile
cache directory and/or error message of enabling the
compile cache in a structure, which can be converted as
JS counterparts by the upcoming JS API.

PR-URL: #54291
Refs: #53639
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Ethan Arrowood <ethan@arrowood.dev>
Reviewed-By: Chengzhong Wu <legendecas@gmail.com>
@joyeecheung
Copy link
Member Author

Opened #54501

nodejs-github-bot pushed a commit that referenced this issue Aug 28, 2024
This patch adds the following API for tools to enable compile
cache dynamically and query its status.

- module.enableCompileCache(cacheDir)
- module.getCompileCacheDir()

In addition this adds a NODE_DISABLE_COMPILE_CACHE environment
variable to disable the code cache enabled by the APIs as
an escape hatch to avoid unexpected/undesired effects of
the compile cache (e.g. less precise test coverage).

When the module.enableCompileCache() method is invoked without
a specified directory, Node.js will use the value of
the NODE_COMPILE_CACHE environment variable if it's set, or
defaults to `path.join(os.tmpdir(), 'node-compile-cache')`
otherwise. Therefore it's recommended for tools to call this
method without specifying the directory to allow overrides.

PR-URL: #54501
Fixes: #53639
Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
RafaelGSS pushed a commit that referenced this issue Aug 30, 2024
This patch adds the following API for tools to enable compile
cache dynamically and query its status.

- module.enableCompileCache(cacheDir)
- module.getCompileCacheDir()

In addition this adds a NODE_DISABLE_COMPILE_CACHE environment
variable to disable the code cache enabled by the APIs as
an escape hatch to avoid unexpected/undesired effects of
the compile cache (e.g. less precise test coverage).

When the module.enableCompileCache() method is invoked without
a specified directory, Node.js will use the value of
the NODE_COMPILE_CACHE environment variable if it's set, or
defaults to `path.join(os.tmpdir(), 'node-compile-cache')`
otherwise. Therefore it's recommended for tools to call this
method without specifying the directory to allow overrides.

PR-URL: #54501
Fixes: #53639
Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
RafaelGSS pushed a commit that referenced this issue Aug 30, 2024
This patch adds the following API for tools to enable compile
cache dynamically and query its status.

- module.enableCompileCache(cacheDir)
- module.getCompileCacheDir()

In addition this adds a NODE_DISABLE_COMPILE_CACHE environment
variable to disable the code cache enabled by the APIs as
an escape hatch to avoid unexpected/undesired effects of
the compile cache (e.g. less precise test coverage).

When the module.enableCompileCache() method is invoked without
a specified directory, Node.js will use the value of
the NODE_COMPILE_CACHE environment variable if it's set, or
defaults to `path.join(os.tmpdir(), 'node-compile-cache')`
otherwise. Therefore it's recommended for tools to call this
method without specifying the directory to allow overrides.

PR-URL: #54501
Fixes: #53639
Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
RafaelGSS pushed a commit that referenced this issue Aug 30, 2024
This patch adds the following API for tools to enable compile
cache dynamically and query its status.

- module.enableCompileCache(cacheDir)
- module.getCompileCacheDir()

In addition this adds a NODE_DISABLE_COMPILE_CACHE environment
variable to disable the code cache enabled by the APIs as
an escape hatch to avoid unexpected/undesired effects of
the compile cache (e.g. less precise test coverage).

When the module.enableCompileCache() method is invoked without
a specified directory, Node.js will use the value of
the NODE_COMPILE_CACHE environment variable if it's set, or
defaults to `path.join(os.tmpdir(), 'node-compile-cache')`
otherwise. Therefore it's recommended for tools to call this
method without specifying the directory to allow overrides.

PR-URL: #54501
Fixes: #53639
Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
louwers pushed a commit to louwers/node that referenced this issue Nov 2, 2024
This patch adds the following API for tools to enable compile
cache dynamically and query its status.

- module.enableCompileCache(cacheDir)
- module.getCompileCacheDir()

In addition this adds a NODE_DISABLE_COMPILE_CACHE environment
variable to disable the code cache enabled by the APIs as
an escape hatch to avoid unexpected/undesired effects of
the compile cache (e.g. less precise test coverage).

When the module.enableCompileCache() method is invoked without
a specified directory, Node.js will use the value of
the NODE_COMPILE_CACHE environment variable if it's set, or
defaults to `path.join(os.tmpdir(), 'node-compile-cache')`
otherwise. Therefore it's recommended for tools to call this
method without specifying the directory to allow overrides.

PR-URL: nodejs#54501
Fixes: nodejs#53639
Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants