Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Environment variable and CLI argument handling in user-land startup snapshots #55603

Open
joyeecheung opened this issue Oct 30, 2024 · 8 comments

Comments

@joyeecheung
Copy link
Member

joyeecheung commented Oct 30, 2024

During the snapshot building process, the process.env and process.execArgv can be queried and captured in the snapshot, which may not match what the environment variables or CLI arguments actually are when the snapshot is deserialized.

For the environment variables queried by Node.js internals, we currently work around this problem by avoiding to cache these values in JS land and always run the pre-execution code to refresh whatever states that are dependent on environment variables or CLI arguments, so after the snapshot is deserialized, any code would get new values from process.env and process.execArgv that reflect the environment where the deserialized application is run. For user land access to these states in the builder script, the expectation is that users should do something similar - either avoid accessing them in the builder script to avoid capturing them in JS, or refreshing any cached states using the deserialize callbacks.

During discussions with @acutmore it seems that while this can be worked around using the solution described above, it would be better if Node.js provides more utilities for working with the environment variables. I think some APIs may be useful:

  1. Some CLI flags to trace access to environment variables, so that users can have a better idea about what variables are used and potentially cached during the snapshot building process, and whether the access need to be delayed to run time or at least refreshed during deserialization. This seems useful in general, not necessarily limited to the startup snapshot feature.

  2. An API that passes a list of environment variables accessed by the builder script to a user-provided serialization callback. Users can e.g. store a copy of those (if necessary they can encrypt sensitive values) as part of the snapshot, and specify a deserializer callback to check for mismatches using that copy or adjust accordingly. Or generate a dotenv file using that API to ensure that some important environment variables are always configured to be consistent with what's used by the builder script, etc. A rough sketch looks like this (more ideas welcomed!)

    v8.startupSnapshot.getAccessedEnvironment((envVars) => {
      // envVars is an array containing the environment variables accessed
      // during snapshot building
      const valuesFromBuild = {};
      // Users can special-case certain keys and encrypt the values if necessary.
      for (const key of envVars) { valuesFromBuild[key] = process.env[key]; }
      v8.startupSnapshot.addDeserializeCallback(() => {
          if (process.env.IMPORTANT_KEY !== valuesFromBuild.IMPORTANT_KEY) {
              // Mismatch, refresh the states, or throw?
          }
       });
      // Users can also instruct the deserialized main function to go another route if
      // there are important mismatches.
    });
  3. Some field in the snapshot configurations that can be used to allow-list access to environment variables in the builder script, or some field that be used to run an alternative main script if certain environment variables mismatch as a fallback (this should be opt-in since serializing environment variables can store sensitive information into the blob) - this is also a rough sketch of what the general idea may look like:

    {
       "envInBuilder": {
         "allowed": [...],   // once specified, if the builder script access any environment variable
                  // that's not in the list, the building process exit with a non-zero exit code.
       }
    }
@joyeecheung
Copy link
Member Author

joyeecheung commented Oct 30, 2024

Opened #55604 for 1, since I think it's a handy flag on its own (and for 2 or 3 we'll also need some changes already in there to track the accessed environment variables in node::Environment)

@acutmore
Copy link

acutmore commented Dec 2, 2024

v8.startupSnapshot.getAccessedEnvironment looks good to me. We'd be able to use that to both: have an assertion that we don't regress by starting to read certain environment variables too eagerly (#55604 could help there by checking the logs in CI tests) and also capture the set of environment variables that we wouldn't want the "host" script to forward on when launching node with the snapshot such as NODE_V8_COVERAGE; or alternatively we could detect when these "snapshot busting" variables are being set and either rebuild a new snapshot or skip using the snapshot altogether.

@joyeecheung
Copy link
Member Author

joyeecheung commented Dec 17, 2024

I was implementing getAccessedEnvironment() and it feels like a utility that's useful beyond just snapshots. It seems useful to have:

process.accessedEnv.record(true);
// Do things..
process.accessedEnv.get();  // Returns a set of accessed environment variables
process.accessedEnv.record(false);

So users can do what the v8.startupSnapshot.getAccessedEnvironment proposed before provide by doing:

process.accessedEnv.record(true);
v8.startupSnapshot.addSerializeCallback(() => {
  // Get all the environment variables accessed by the script and do some checks.
  const envs = process.accessedEnv.get();
  process.accessedEnv.record(false);
});
// Run scripts

If it's necessary to start the recording as soon as the process is launched instead of waiting until some JavaScript is executed, the configuration can be passed into the process via command line flags or snapshot configuration JSON. That also allows users to leave environment variable accesses coming from Node.js internals to Node.js and not worry about them.

@acutmore
Copy link

I can imagine that being useful for asserting about ENV access in unit tests too.

How would you recommend getting the set of environment variables read by Node internals during the snapshot creation, such as NODE_V8_COVERAGE?

@acutmore
Copy link

Actually for that I think I can use --trace-env-native-stack and the 'host' process that is orchestrating the snapshotting can parse out the accesses from the logs. The simplest thing would be to assume all environment access that it sees should be treated as part of the 'key' it will use to decide if a snapshot can be re-used on the next execution.

@joyeecheung
Copy link
Member Author

joyeecheung commented Dec 18, 2024

How would you recommend getting the set of environment variables read by Node internals during the snapshot creation, such as NODE_V8_COVERAGE?

I think we can support a configuration in the snapshot configuration json (--build-snapshot-config), maybe something like:

{
  "builder": "snapshot.js",
  "traceEnvInAPI": true
}

which instructs Node.js to track all the environment variables in a set as soon as the configuration is read (currently snapshot config is read after a few very global initializations so we'll need to do some refactoring to move the read of the snapshot configs earlier once we notice that --build-snapshot-config is present, but shouldn't be too difficult). And then in the JS API, say in snapshot.js specified in the config:

// Returns all the environment variables accessed up until the point
// process.accessedEnv.record(false) is called (if it is ever called)
// Set(*) { 'NODE_ICU_DATA', 'OPENSSL_CONF', 'NODE_EXTRA_CA_CERTS', ... }
process.accessedEnv.get();

The same can be implemented in a --trace-env-in-api command line flag for a non-snapshot startup scenario, just that the snapshot config route would keep the execArgv cleaner if it is actually building a snapshot.

@joyeecheung
Copy link
Member Author

joyeecheung commented Dec 18, 2024

Maybe for a more dedicated API it's better to have a global record & local ones (e.g. for tests), so it can be something like:

process.traceEnv.global.start(); // can also be started by --trace-env-start-global or traceEnvStartGlobal in snapshot config
process.traceEnv.global.get();  // Set(13) {  'NODE_ICU_DATA', 'OPENSSL_CONF', 'NODE_EXTRA_CA_CERTS', ... }
process.traceEnv.global.stop(); // Generally no need to do this for application code controlling the process

We can implement the global one first and then look into a local version later, the local version would be a bit more complex to implement so should be in a follow up, though it's useful to think about what the local version looks like so the API looks more consistent.

const trace = process.traceEnv.createLocal();
trace.start();  // By default it's started as soon as it's created.
trace.get();
trace.stop();
// When records is unreachable and gc'ed, the tracking will be stopped as well, along with it's env var stores

Returning a set (lower memory usage) or an array seems adequate for the basic use cases though we can also think about more complex options (in createLocal and the CLI args etc.) that allows including more information in the traces if their needs come up.

@acutmore
Copy link

just that the snapshot config route would keep the execArgv cleaner if it is actually building a snapshot

Ah right, good point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants