Environment variable and CLI argument handling in user-land startup snapshots #55603

joyeecheung · 2024-10-30T15:02:40Z

During the snapshot building process, the process.env and process.execArgv can be queried and captured in the snapshot, which may not match what the environment variables or CLI arguments actually are when the snapshot is deserialized.

For the environment variables queried by Node.js internals, we currently work around this problem by avoiding to cache these values in JS land and always run the pre-execution code to refresh whatever states that are dependent on environment variables or CLI arguments, so after the snapshot is deserialized, any code would get new values from process.env and process.execArgv that reflect the environment where the deserialized application is run. For user land access to these states in the builder script, the expectation is that users should do something similar - either avoid accessing them in the builder script to avoid capturing them in JS, or refreshing any cached states using the deserialize callbacks.

During discussions with @acutmore it seems that while this can be worked around using the solution described above, it would be better if Node.js provides more utilities for working with the environment variables. I think some APIs may be useful:

Some CLI flags to trace access to environment variables, so that users can have a better idea about what variables are used and potentially cached during the snapshot building process, and whether the access need to be delayed to run time or at least refreshed during deserialization. This seems useful in general, not necessarily limited to the startup snapshot feature.

An API that passes a list of environment variables accessed by the builder script to a user-provided serialization callback. Users can e.g. store a copy of those (if necessary they can encrypt sensitive values) as part of the snapshot, and specify a deserializer callback to check for mismatches using that copy or adjust accordingly. Or generate a dotenv file using that API to ensure that some important environment variables are always configured to be consistent with what's used by the builder script, etc. A rough sketch looks like this (more ideas welcomed!)

v8.startupSnapshot.getAccessedEnvironment((envVars) => {
  // envVars is an array containing the environment variables accessed
  // during snapshot building
  const valuesFromBuild = {};
  // Users can special-case certain keys and encrypt the values if necessary.
  for (const key of envVars) { valuesFromBuild[key] = process.env[key]; }
  v8.startupSnapshot.addDeserializeCallback(() => {
      if (process.env.IMPORTANT_KEY !== valuesFromBuild.IMPORTANT_KEY) {
          // Mismatch, refresh the states, or throw?
      }
   });
  // Users can also instruct the deserialized main function to go another route if
  // there are important mismatches.
});

Some field in the snapshot configurations that can be used to allow-list access to environment variables in the builder script, or some field that be used to run an alternative main script if certain environment variables mismatch as a fallback (this should be opt-in since serializing environment variables can store sensitive information into the blob) - this is also a rough sketch of what the general idea may look like:
```
{
   "envInBuilder": {
     "allowed": [...],   // once specified, if the builder script access any environment variable
              // that's not in the list, the building process exit with a non-zero exit code.
   }
}
```

The text was updated successfully, but these errors were encountered:

joyeecheung · 2024-10-30T15:44:44Z

Opened #55604 for 1, since I think it's a handy flag on its own (and for 2 or 3 we'll also need some changes already in there to track the accessed environment variables in node::Environment)

acutmore · 2024-12-02T17:50:39Z

v8.startupSnapshot.getAccessedEnvironment looks good to me. We'd be able to use that to both: have an assertion that we don't regress by starting to read certain environment variables too eagerly (#55604 could help there by checking the logs in CI tests) and also capture the set of environment variables that we wouldn't want the "host" script to forward on when launching node with the snapshot such as NODE_V8_COVERAGE; or alternatively we could detect when these "snapshot busting" variables are being set and either rebuild a new snapshot or skip using the snapshot altogether.

joyeecheung · 2024-12-17T15:16:43Z

I was implementing getAccessedEnvironment() and it feels like a utility that's useful beyond just snapshots. It seems useful to have:

process.accessedEnv.record(true);
// Do things..
process.accessedEnv.get();  // Returns a set of accessed environment variables
process.accessedEnv.record(false);

So users can do what the v8.startupSnapshot.getAccessedEnvironment proposed before provide by doing:

process.accessedEnv.record(true);
v8.startupSnapshot.addSerializeCallback(() => {
  // Get all the environment variables accessed by the script and do some checks.
  const envs = process.accessedEnv.get();
  process.accessedEnv.record(false);
});
// Run scripts

If it's necessary to start the recording as soon as the process is launched instead of waiting until some JavaScript is executed, the configuration can be passed into the process via command line flags or snapshot configuration JSON. That also allows users to leave environment variable accesses coming from Node.js internals to Node.js and not worry about them.

acutmore · 2024-12-17T18:57:18Z

I can imagine that being useful for asserting about ENV access in unit tests too.

How would you recommend getting the set of environment variables read by Node internals during the snapshot creation, such as NODE_V8_COVERAGE?

acutmore · 2024-12-18T10:13:52Z

Actually for that I think I can use --trace-env-native-stack and the 'host' process that is orchestrating the snapshotting can parse out the accesses from the logs. The simplest thing would be to assume all environment access that it sees should be treated as part of the 'key' it will use to decide if a snapshot can be re-used on the next execution.

joyeecheung · 2024-12-18T12:46:09Z

How would you recommend getting the set of environment variables read by Node internals during the snapshot creation, such as NODE_V8_COVERAGE?

I think we can support a configuration in the snapshot configuration json (--build-snapshot-config), maybe something like:

{
  "builder": "snapshot.js",
  "traceEnvInAPI": true
}

which instructs Node.js to track all the environment variables in a set as soon as the configuration is read (currently snapshot config is read after a few very global initializations so we'll need to do some refactoring to move the read of the snapshot configs earlier once we notice that --build-snapshot-config is present, but shouldn't be too difficult). And then in the JS API, say in snapshot.js specified in the config:

// Returns all the environment variables accessed up until the point
// process.accessedEnv.record(false) is called (if it is ever called)
// Set(*) { 'NODE_ICU_DATA', 'OPENSSL_CONF', 'NODE_EXTRA_CA_CERTS', ... }
process.accessedEnv.get();

The same can be implemented in a --trace-env-in-api command line flag for a non-snapshot startup scenario, just that the snapshot config route would keep the execArgv cleaner if it is actually building a snapshot.

joyeecheung · 2024-12-18T13:43:51Z

Maybe for a more dedicated API it's better to have a global record & local ones (e.g. for tests), so it can be something like:

process.traceEnv.global.start(); // can also be started by --trace-env-start-global or traceEnvStartGlobal in snapshot config
process.traceEnv.global.get();  // Set(13) {  'NODE_ICU_DATA', 'OPENSSL_CONF', 'NODE_EXTRA_CA_CERTS', ... }
process.traceEnv.global.stop(); // Generally no need to do this for application code controlling the process

We can implement the global one first and then look into a local version later, the local version would be a bit more complex to implement so should be in a follow up, though it's useful to think about what the local version looks like so the API looks more consistent.

const trace = process.traceEnv.createLocal();
trace.start();  // By default it's started as soon as it's created.
trace.get();
trace.stop();
// When records is unreachable and gc'ed, the tracking will be stopped as well, along with it's env var stores

Returning a set (lower memory usage) or an array seems adequate for the basic use cases though we can also think about more complex options (in createLocal and the CLI args etc.) that allows including more information in the traces if their needs come up.

acutmore · 2024-12-18T16:47:12Z

just that the snapshot config route would keep the execArgv cleaner if it is actually building a snapshot

Ah right, good point.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Environment variable and CLI argument handling in user-land startup snapshots #55603

Environment variable and CLI argument handling in user-land startup snapshots #55603

joyeecheung commented Oct 30, 2024 •

edited

Loading

joyeecheung commented Oct 30, 2024 •

edited

Loading

acutmore commented Dec 2, 2024

joyeecheung commented Dec 17, 2024 •

edited

Loading

acutmore commented Dec 17, 2024

acutmore commented Dec 18, 2024

joyeecheung commented Dec 18, 2024 •

edited

Loading

joyeecheung commented Dec 18, 2024 •

edited

Loading

acutmore commented Dec 18, 2024

Environment variable and CLI argument handling in user-land startup snapshots #55603

Environment variable and CLI argument handling in user-land startup snapshots #55603

Comments

joyeecheung commented Oct 30, 2024 • edited Loading

joyeecheung commented Oct 30, 2024 • edited Loading

acutmore commented Dec 2, 2024

joyeecheung commented Dec 17, 2024 • edited Loading

acutmore commented Dec 17, 2024

acutmore commented Dec 18, 2024

joyeecheung commented Dec 18, 2024 • edited Loading

joyeecheung commented Dec 18, 2024 • edited Loading

acutmore commented Dec 18, 2024

joyeecheung commented Oct 30, 2024 •

edited

Loading

joyeecheung commented Oct 30, 2024 •

edited

Loading

joyeecheung commented Dec 17, 2024 •

edited

Loading

joyeecheung commented Dec 18, 2024 •

edited

Loading

joyeecheung commented Dec 18, 2024 •

edited

Loading