Fix Concurrent Testing Scalability - Find Blocking Code, Make them Non-Blocking #264
It's also interesting that we are hitting testing-scalability problems. We may simply need to invest in bigger machines too, to be able to run the tests well.
We should get all tests that use the
It is no longer feasible to expect devs to always be doing full test runs. To ensure development tempo, we will need to solve this problem. The number of tests is only going to grow as the application gets larger. We may need a sort of hierarchy: invest in bigger CI/CD machines that can do full integration testing across many cores and many machines, while devs continue to focus on domain-specific testing. That way devs can work on weaker machines and we centralise the computational effort.
We should also identify slow tests and set a proper threshold. All tests should complete within 5 seconds; any test taking longer should be investigated and broken down. When a test cannot be broken down, it should be isolated into a separate section for longer-running tests. Certain domains are inherently "integrative".
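One way to enforce such a threshold is Jest's `testTimeout` config option, which sets the default per-test timeout. A sketch (the 5-second value mirrors the suggestion above; note 5000 ms happens to be Jest's built-in default):

```ts
// jest.config.ts -- sketch: fail any test that runs longer than 5 seconds
// by default, so slow tests surface immediately and can be broken down
// or moved into a separate long-running suite.
import type { Config } from '@jest/types';

const config: Config.InitialOptions = {
  // Default per-test timeout in milliseconds.
  testTimeout: 5000,
};

export default config;
```

Individual long-running tests can still override this by passing a timeout as the third argument to `test()`.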
Can we separate these tests? That would enable concurrent execution.
In addition to this, Jest has the ability to mock timers (https://jestjs.io/docs/timer-mocks), and we should be setting timeouts much shorter in our test cases compared to production.
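A complementary pattern to timer mocks is making timeouts injectable, so tests can pass values in the millisecond range while production keeps its long defaults. A minimal sketch; `pingWithTimeout` and the `sleep` helper are hypothetical names, not Polykey's actual API:

```ts
// Sketch: injectable timeout so tests don't wait for production delays.
// `pingWithTimeout` is a hypothetical helper, not Polykey's actual API.
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function pingWithTimeout(
  attempt: () => Promise<boolean>,
  timeoutMs: number = 20000, // stand-in for the 20 s production timeout
): Promise<boolean> {
  // Whichever settles first wins: the attempt, or a timeout yielding false.
  return Promise.race([attempt(), sleep(timeoutMs).then(() => false)]);
}

async function main() {
  // A test passes 10 ms instead of 20 000 ms, so the offline case
  // resolves almost instantly.
  const offline = await pingWithTimeout(() => sleep(1000).then(() => true), 10);
  console.log(offline); // false: timed out before the attempt finished

  const online = await pingWithTimeout(async () => true, 10);
  console.log(online); // true: attempt resolved before the timeout
}

main();
```

The same function then runs with its production default outside of tests.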
Right now, running domain-level tests takes too long, at least for nodes.
On 27 October 2021 9:33:33 am AEDT, Josh ***@***.***> wrote:
`pings node` has that crazy time mostly because it's checking the failure case too (where a node is offline). There are a few cases of this:
1. node is offline: this takes approximately 20 seconds for the connection attempt to timeout (and therefore, the ping to fail)
2. existing connection goes offline: I believe it requires approximately 30 seconds for an existing connection to be "dropped" in the networking domain (related to the keep-alive packets?)
In general though, `NodeManager.test.ts` is also slow because it hasn't been migrated to using only a single `PolykeyAgent` across the larger integration tests. This will need to be done at some stage.
Note for myself, when splitting the nodes tests and making these more efficient, the relevant timeouts are configurable when constructing the proxy:

```ts
static async createForwardProxy({
  authToken,
  connConnectTime = 20000,
  connTimeoutTime = 20000,
  connPingIntervalTime = 1000,
  logger,
}: {
  authToken: string;
  connConnectTime?: number;
  connTimeoutTime?: number;
  connPingIntervalTime?: number;
  logger?: Logger;
}): Promise<ForwardProxy> {
```
When refactoring the tests, our domain-level tests should ideally always aim to be unit tests (i.e. minimal tests of the functionality, mocking anything that needs to be injected or used). Timer mocks should also be looked into (see above, #264 (comment)), so that we don't have to wait for the full timeout expected in production (connection timeouts especially). Any longer/integration tests should be executed separately, so that they aren't run in the same test suite as these smaller unit tests. We should look into Jest's tagging solution (https://www.npmjs.com/package/jest-runner-groups), which would allow us to tag tests via comments and run tests across all domains while separately executing the longer integration tests. This way we wouldn't have to keep them in separate directories either.
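A sketch of how jest-runner-groups tagging might look (assumes the package is installed; the group names `unit`/`integration` are illustrative):

```ts
// In jest.config.ts, switch the runner so groups are honoured:
//   runner: 'groups'
//
// Then tag a test file with a docblock at the top:
/**
 * Long-running NodeManager integration tests.
 *
 * @group integration
 */
describe('NodeManager integration', () => {
  test.todo('pings a live remote node');
});
//
// And run only one group, without moving files into separate directories:
//   npm test -- --group=unit
//   npm test -- --group=integration
```

Untagged files can be given a default group, so existing unit tests need not all be annotated at once.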
This would most likely be a good issue for @emmacasolin to act in a supporting role on too, once she's back from leave.
I think I've worked out why this wasn't working: it looks like the timeout needs to be set on both the forward proxy of the node doing the pinging AND the reverse proxy of the node being pinged. Setting the timeout on both ends to one second reduces the time the nodes ping test takes to just under a minute (a reduction of 20-30 seconds); however, the need to start, stop, and restart the remote node during the test keeps the test time high. Moving the initial setup and destruction of the remote node to before/after blocks cuts off another 10 seconds, but if we move to a different method of mocking keynodes then this additional setup time shouldn't be a problem anyway. The field I'm setting is
Mocking the reverse proxy so that we don't even use a remote keynode brings the test time down to 10ms! The question is: does doing it this way test everything we need it to? If we assume all the connection-side logic is being tested in the network tests then maybe this is fine? It may also be possible to mock something a little further along in the call chain so that more of the actual functionality is tested.

```ts
import type { CertificatePem, KeyPairPem } from '../src/keys/types';
import type { Host, Port } from '../src/network/types';
import type { NodeAddress } from '../src/nodes/types';
import os from 'os';
import path from 'path';
import fs from 'fs';
import Logger, { LogLevel, StreamHandler } from '@matrixai/logger';
import { DB } from '@matrixai/db';
import { KeyManager, utils as keysUtils } from '../src/keys';
import { NodeManager } from '../src/nodes';
import { ForwardProxy, ReverseProxy } from '../src/network';
import { Sigchain } from '../src/sigchain';
import { makeNodeId } from '../src/nodes/utils';
import { makeCrypto } from './utils';
import { ErrorConnectionStart } from '@/errors';

const offline = new ErrorConnectionStart();
const mockValue = jest
  .fn()
  .mockRejectedValueOnce(offline)
  .mockResolvedValue(null);
jest.mock('../src/network/ForwardProxy', () => {
  return jest.fn().mockImplementationOnce(() => {
    return { openConnection: mockValue };
  });
});

describe('NodeManager', () => {
  const password = 'password';
  const logger = new Logger('NodeManagerTest', LogLevel.WARN, [
    new StreamHandler(),
  ]);
  let dataDir: string;
  let nodeManager: NodeManager;
  let fwdProxy: ForwardProxy;
  let revProxy: ReverseProxy;
  let keyManager: KeyManager;
  let keyPairPem: KeyPairPem;
  let certPem: CertificatePem;
  let db: DB;
  let sigchain: Sigchain;
  const serverHost = '::1' as Host;
  const serverPort = 1 as Port;
  const nodeId1 = makeNodeId(
    'vrsc24a1er424epq77dtoveo93meij0pc8ig4uvs9jbeld78n9nl0',
  );

  beforeAll(async () => {
    dataDir = await fs.promises.mkdtemp(
      path.join(os.tmpdir(), 'polykey-test-'),
    );
    const keysPath = `${dataDir}/keys`;
    keyManager = await KeyManager.createKeyManager({
      password,
      keysPath,
      logger,
    });
    const cert = keyManager.getRootCert();
    keyPairPem = keyManager.getRootKeyPairPem();
    certPem = keysUtils.certToPem(cert);
    fwdProxy = new ForwardProxy({
      authToken: 'abc',
      connTimeoutTime: 1000,
      logger: logger,
    });
    revProxy = new ReverseProxy({
      logger: logger,
    });
    await revProxy.start({
      serverHost,
      serverPort,
      tlsConfig: {
        keyPrivatePem: keyPairPem.privateKey,
        certChainPem: certPem,
      },
    });
    const dbPath = `${dataDir}/db`;
    db = await DB.createDB({ dbPath, logger, crypto: makeCrypto(keyManager) });
    sigchain = await Sigchain.createSigchain({ keyManager, db, logger });
    nodeManager = await NodeManager.createNodeManager({
      db,
      sigchain,
      keyManager,
      fwdProxy,
      revProxy,
      logger,
    });
    await nodeManager.start();
  });

  afterAll(async () => {
    await nodeManager.stop();
    await nodeManager.destroy();
    await sigchain.stop();
    await sigchain.destroy();
    await db.stop();
    await db.destroy();
    await keyManager.stop();
    await keyManager.destroy();
    await revProxy.stop();
    await fs.promises.rm(dataDir, {
      force: true,
      recursive: true,
    });
  });

  test('pings node', async () => {
    const serverNodeId = nodeId1;
    const serverNodeAddress: NodeAddress = {
      ip: serverHost,
      port: serverPort,
    };
    await nodeManager.setNode(serverNodeId, serverNodeAddress);
    // Check if active
    // Case 1: cannot establish new connection, so offline
    const active1 = await nodeManager.pingNode(serverNodeId);
    expect(active1).toBe(false);
    // Case 2: can establish new connection, so online
    const active2 = await nodeManager.pingNode(serverNodeId);
    expect(active2).toBe(true);
  });
});
```
Will take a look at this after seed nodes.
Tests involving a global agent have been done. The limiting factor is that creating a polykey agent is expensive, mostly due to the key generation process, among many other factors like creating network servers, etc. If the key generation process weren't that expensive, then we could just create new polykey agents each time we wanted to test something. Here are the most expensive things that happen when creating a polykey agent (probably in order of cost):
The key generation process might be improved with #168. For now though, other test domains also make use of potentially multiple polykey agents. Rather than having all tests synchronise on one global agent, each test domain can make its own decision here and create its own "scoped" global PK agent. This could take place in these forms:
For 3., the trick is to ensure that the same PK agent is shared across the Jest worker pool, which requires that all Jest workers agree on the same "directory" for the node path. But this also means that all relevant test domains need to be aware of which global we are using. So the priorities are:
Each domain needs to indicate their strategy here.
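For strategy 3, one way to get every Jest worker to agree on the same node path is to derive it deterministically from a fixed token rather than `mkdtemp`'s random suffix. A sketch; the `polykey-test-global` name is an illustrative assumption, and in practice the directory would be created and torn down in Jest's `globalSetup`/`globalTeardown`:

```ts
// Sketch: a deterministic global node path that every Jest worker computes
// identically, unlike fs.mkdtemp which gives each worker a random directory.
import os from 'os';
import path from 'path';
import fs from 'fs';

function globalAgentDir(token: string = 'polykey-test-global'): string {
  // Same inputs on every worker => same directory.
  const dir = path.join(os.tmpdir(), token);
  fs.mkdirSync(dir, { recursive: true }); // idempotent across workers
  return dir;
}

// Two "workers" computing the path independently agree on it.
const workerA = globalAgentDir();
const workerB = globalAgentDir();
console.log(workerA === workerB); // true
```

The trade-off is that a fixed path can leak state between runs, so the teardown step must clean it up reliably.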
Doing the above will also potentially refactor the code that primarily deals with remote keynodes. We would want to standardise on the logic we have set up. Things like using the test provider and ensuring that works too would be relevant, as per #278.
Note that by mocking the keypair generation, creating polykey agents is a lot faster. So much so that in many cases the global agent is not necessary.
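The idea can be sketched outside of Jest as simple memoisation: pay the key generation cost once and reuse the result for every agent a test creates. `generateTestKeyPair` and the 2048-bit RSA parameters below are illustrative assumptions, not Polykey's actual key generation API:

```ts
// Sketch: memoise expensive keypair generation so repeated agent creation
// in tests reuses one precomputed keypair. `generateTestKeyPair` is a
// hypothetical helper, not Polykey's actual API.
import { generateKeyPairSync } from 'crypto';

type PemKeyPair = { publicKey: string; privateKey: string };

let cachedKeyPair: PemKeyPair | undefined;

function generateTestKeyPair(): PemKeyPair {
  if (cachedKeyPair === undefined) {
    // Expensive step: only ever runs once per test process.
    cachedKeyPair = generateKeyPairSync('rsa', {
      modulusLength: 2048,
      publicKeyEncoding: { type: 'spki', format: 'pem' },
      privateKeyEncoding: { type: 'pkcs8', format: 'pem' },
    });
  }
  return cachedKeyPair;
}

// Every "agent" created in a test run now shares the same keypair.
const kp1 = generateTestKeyPair();
const kp2 = generateTestKeyPair();
console.log(kp1 === kp2); // true: generation cost paid once
```

In Jest this would typically be wired in via `jest.spyOn` or a module mock over the real key generation function, which is what makes agent creation cheap enough to skip the global agent.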
For sharing a global agent across a test directory, refer to the new
My tests show that with a mocked global keypair, starting and stopping the PK agent takes about 3 seconds. So indeed, for a lot of tests it would be sufficient to create their own PK agent, at least for sharing within a test module, without having to share the full global agent.
This has been achieved in a number of ways.
Specification
Right now, running all the tests multi-core results in tests timing out.
This is simply due to the sheer size of the test suite; asynchronous tests have timeouts applied to each test.
A quick solution would be to remove timeouts from some of our asynchronous tests, but then we wouldn't know whether the tests are making progress.
There is a subsection of tests that involve OS side effects, and these seem to be the source of the concurrent test timeouts (probably because they are blocked on an OS kernel/syscall and the OS is overloaded and thus cannot return in time):
- `tests/bootstrap/bootstrap.test.ts`
- `tests/agent/utils.test.ts`
- `tests/bin/agent.test.ts`
These could involve filesystem side effects, process lifecycle side effects, and locking side effects.
If the OS gets overloaded, these things can slow down because we rely on an external system; the OS is essentially slowed down, and therefore the tests time out.
Right now we are forced to do `npm test -- --runInBand`, which slows down testing considerably. Ideally we could use all the cores.
Additional context
Tasks
1. Use `beforeAll` to share resources like objects, but these can be conflicting with `beforeAll` used in other test files, so these need to be managed properly per `describe`.
2. `test.concurrent` can also be used within a single test file to do concurrent testing, but it should be compared with just using `Promise.all` in `jest` too.
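As a rough baseline for the `test.concurrent` vs `Promise.all` comparison above, `Promise.all` alone already gives concurrency within a single test body. A sketch in plain TypeScript, independent of Jest; the `sleep` helper, task names, and 50 ms delays are illustrative:

```ts
// Demonstrates that Promise.all runs async work concurrently:
// three 50 ms "tests" finish together in ~50 ms, not ~150 ms.
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function fakeTest(name: string): Promise<string> {
  await sleep(50); // stand-in for real async test work
  return name;
}

async function main() {
  const start = Date.now();
  const results = await Promise.all([
    fakeTest('ping'),
    fakeTest('connect'),
    fakeTest('teardown'),
  ]);
  const elapsed = Date.now() - start;
  console.log(results); // [ 'ping', 'connect', 'teardown' ]
  console.log(elapsed < 150); // true: concurrent, not sequential
}

main();
```

The difference with `test.concurrent` is that each case still gets its own timeout and reporting, whereas `Promise.all` collapses everything into one test.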