Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI sometimes fails to read swingset config file #10092

Closed
gibson042 opened this issue Sep 15, 2024 · 5 comments · Fixed by #10128
Closed

CI sometimes fails to read swingset config file #10092

gibson042 opened this issue Sep 15, 2024 · 5 comments · Fixed by #10128
Labels
bug Something isn't working

Comments

@gibson042
Copy link
Member

gibson042 commented Sep 15, 2024

Describe the bug

As seen at https://github.com/Agoric/agoric-sdk/actions/runs/10873209314/job/30169121954?pr=10091#step:5:1

Run cd packages/boot && yarn test | $TEST_COLLECT
  cd packages/boot && yarn test | $TEST_COLLECT
  shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
  env:
    AGORIC_AVA_USE_TAP: true
    TEST_COLLECT: tee -a _testoutput.txt
    NODE_V8_COVERAGE: coverage
    GH_ENGINE: 18.x
    CI_NODE_INDEX: 1
    CI_NODE_TOTAL: 4
    ESM_DISABLE_CACHE: true
yarn run v1.22.22
$ ava
TAP version 13
failed to load bundles/decentral-itest-vaults-config.json
not ok 1 - net-ibc-upgrade › before hook %ava-dur=96ms
#   REJECTED from ava test.before(): (SyntaxError#1)
#   SyntaxError#1: Unexpected end of JSON input
#       at JSON.parse (<anonymous>)
#       at loadSwingsetConfigFile (/home/runner/work/agoric-sdk/agoric-sdk/packages/SwingSet/src/controller/initializeSwingset.js:239:25)
#       at async ensureSwingsetInitialized (file:///home/runner/work/agoric-sdk/agoric-sdk/packages/cosmic-swingset/src/launch-chain.js:146:18)
#       at async buildSwingset (file:///home/runner/work/agoric-sdk/agoric-sdk/packages/cosmic-swingset/src/launch-chain.js:218:32)
#       at makeSwingsetTestKit (/home/runner/work/agoric-sdk/agoric-sdk/packages/boot/tools/supports.ts:501:48)
#       at makeTestContext (/home/runner/work/agoric-sdk/agoric-sdk/packages/boot/test/bootstrapTests/net-ibc-upgrade.test.ts:37:27)
#       at <anonymous> (/home/runner/work/agoric-sdk/agoric-sdk/packages/boot/test/bootstrapTests/net-ibc-upgrade.test.ts:49:15)

I suspect this can be fixed by adding a flush: true option to the writeFile call in packages/boot/tools/supports.ts, but only after raising the Node.js version bar to at least v20.10.0 (cf. nodejs/node#50009 for background). And if so, then we may want to consider using it in even more writeFiles. A possible alternative would be reading back the expected size, but that seems too heavyweight.

Subsequent conversation on Slack identified a more likely culprit as concurrently running tests writing and reading against the same file name, a hypothesis which was locally reproducible:

node --input-type=module -e '
import fs from "node:fs";
import fsp from "node:fs/promises";
const sink = () => {};
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));
const waitFor = thunk => new Promise(async (resolve, reject) => {
  try {
    while (!(await thunk())) await delay(100);
    resolve();
  } catch (err) { reject(err); }
});

const path = "./dummy";
const data = Object.fromEntries(
  Array.from({ length: 10000 },
  (_, i) => [`key${i.toString().padStart(6, "0")}`, i]),
);
const fn = async worker => {
  for (let i = 0; i < 1000; i++) {
    await fsp.unlink(path).then(sink, sink);
    await waitFor(() => fs.statSync(path, { throwIfNoEntry: false }) === undefined);
    await fsp.writeFile(path, JSON.stringify(data), "utf-8");
    try {
      JSON.parse(fs.readFileSync(path, "utf-8"));
    } catch (err) {
      if (err.code === "ENOENT") continue;
      throw Error(`worker ${worker} failed on attempt ${i}`, { cause: err });
    }
  }      
  return;                       
};
for (let i = 0; i < 2; i++) fn(i);
'
file:///tmp/[eval1]:27
      throw Error(`worker ${worker} failed on attempt ${i}`, { cause: err });
            ^

Error: worker 0 failed on attempt 43
    at fn (file:///tmp/[eval1]:27:13) {
  [cause]: SyntaxError: Unexpected end of JSON input
      at JSON.parse (<anonymous>)
      at fn (file:///tmp/[eval1]:24:12)
}

Node.js v18.18.2

We should instead introduce a differentiating component in these Swingset config file names, e.g. random and/or otherwise unique per potentially-concurrent test.

@gibson042 gibson042 added the bug Something isn't working label Sep 15, 2024
@mhofman
Copy link
Member

mhofman commented Sep 16, 2024

I am very confused by what nodejs/node#50009 attempts to fix. A fsync should only be necessary if you attempt to read from another OS, e.g. network mounted system, or if your computer crashes. I believe the linux kernel guarantees that the same file being read by another process will use the most recently written data, even if it wasn't committed to disk.

@mhofman
Copy link
Member

mhofman commented Sep 16, 2024

From a conversation on Slack, it's a lot more likely that ava concurrency results in the same config file being written by concurrent tests, potentially causing tears since the file being written may take multiple write syscalls.

The solution is likely to add a random component to testConfigPath

@siarhei-agoric
Copy link
Contributor

siarhei-agoric commented Sep 16, 2024

I am very confused by what nodejs/node#50009 attempts to fix. A fsync should only be necessary if you attempt to read from another OS, e.g. network mounted system, or if your computer crashes. I believe the linux kernel guarantees that the same file being read by another process will use the most recently written data, even if it wasn't committed to disk.

There are several things to consider even when a single file is used by a single process on a single host:

  1. Linux kernel only guarantees consistent visibility of writes which are completed via kernel's syscall interface. However, there may be several other layers of abstraction, transformation, and caching between the fs interface in JS and the actual Kernel syscall. A flush from the highest-level interface would guarantee that all of the data actually makes it to its final backing store before the top-level execution proceeds further.
  2. Some applications (such as databases) often require strict ordering of separate writes to different places of the same file, or even groups of different files. This is typically required to guarantee atomicity of transactions (potentially across a crash/reboot).
  3. A set of different processes running on a same host may use a file as a mail box to send messages to each other. Just like in 1 and 2 above, an explicit flush would provide a mechanism to enforce completeness, ordering, and atomicity of these messages.
  4. An fsync on a file would force all of the other outstanding writes to the same file complete and return back with appropriate success/error codes.

With that in mind, only option 1 could be applicable to a simple config file write/read by a single process, and only in case of application crash/abort.

@mhofman
Copy link
Member

mhofman commented Sep 21, 2024

@siarhei-agoric
Copy link
Contributor

siarhei-agoric commented Sep 22, 2024

We should instead introduce a random component in these Swingset config file names.

A unique but predictable and meaningful filename could also be useful in post-mortem investigations. May be some combination of high resolution timestamp in a human-readable, naturally sortable format and a pid (or a run sequence number). Something to save an extra search/lookup when trying to match the file to a run. A random component can still be used just to be absolutely sure.

gibson042 added a commit that referenced this issue Sep 23, 2024
gibson042 added a commit that referenced this issue Sep 23, 2024
gibson042 added a commit that referenced this issue Sep 24, 2024
@mergify mergify bot closed this as completed in #10128 Sep 24, 2024
@mergify mergify bot closed this as completed in bb3f454 Sep 24, 2024
mergify bot added a commit that referenced this issue Sep 24, 2024
…in concurrent testing (#10128)

Fixes #10092

## Description
Rather than having concurrently-running tests fight over e.g. **bundles/decentral-itest-vaults-config.json**, add some entropy to the file name. In this PR, I've chosen **bundles/config.${optionalLabel}.${dateTime}.${randomDigits}.decentral-itest-vaults-config.json**, although no current test is taking advantage of the label option.

### Security Considerations
n/a

### Scaling Considerations
There's a chance that local `bundles` directories fill up with identical config files. They're easily cleaned up by hand, but if this becomes an actual problem then we can add package.json `clean` scripts for that, or even update `getNodeTestVaultsConfig` to check for and reuse such files.

### Documentation Considerations
n/a

### Testing Considerations
n/a

### Upgrade Considerations
n/a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
3 participants