feat: Instrument Runner Service #602

Merged 18 commits into main on Mar 15, 2024
Conversation

@darunrs (Collaborator) commented Mar 13, 2024

Runner is lacking instrumentation. It is responsible for many things, and it has become hard to understand which tasks contribute to the overall latency of an indexer. In addition, we are now at a point where we need to drive down latencies to facilitate new indexer use cases such as access keys.

I've chosen to instrument Runner with OpenTelemetry. Tracing generally requires three components: an instrumented service, a trace collector, and a trace visualizer. The service collects trace data and transmits it to the collector. The collector should accept trace data with minimal overhead, so that it doesn't impact the performance of the instrumented service; it then processes the data and forwards it to the visualizer, which renders the traces and allows filtering on them.

The benefit of OpenTelemetry over other options like Zipkin and Jaeger is that GCP already supports ingesting OpenTelemetry data. As such, we don't need to provision a collector ourselves and can instead leverage GCP's existing Tracing service, which acts as both collector and visualizer. For local development, traces can be exported to the console, to a Zipkin all-in-one container, or to GCP (which requires the Cloud Trace Agent role and a project ID). This is done by simply initializing the NodeSDK differently.

In addition, we do not want to enable traces in prod yet, so no exporter is specified there. Omitting the exporter yields a no-op trace exporter which won't attempt to record traces.
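
A minimal sketch (not the exact Runner code) of how the same NodeSDK initialization covers both cases; the 'CONSOLE' value is illustrative:

import { NodeSDK } from '@opentelemetry/sdk-node';
import { ConsoleSpanExporter } from '@opentelemetry/sdk-trace-node';

// Local development: print spans to the console. In prod (for now) no exporter
// is passed, so the SDK falls back to a no-op exporter and records nothing.
const traceExporter = process.env.TRACING_EXPORTER === 'CONSOLE'
  ? new ConsoleSpanExporter()
  : undefined;

const sdk = new NodeSDK({ traceExporter });
sdk.start();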

No changes were made to the code execution path. All tests pass without modification, aside from snapshots that had to be replaced due to changes in the indentation of mutation strings. I manually verified the mutation strings are still the same by stripping whitespace and comparing against the originals.

@darunrs darunrs marked this pull request as ready for review March 14, 2024 17:47
@darunrs darunrs requested a review from a team as a code owner March 14, 2024 17:47
@darunrs darunrs linked an issue Mar 14, 2024 that may be closed by this pull request
try {
this.database_connection_parameters = this.database_connection_parameters ??
await this.deps.provisioner.getDatabaseConnectionParameters(hasuraRoleName);
} catch (e) {
const error = e as Error;
simultaneousPromises.push(this.writeLog(LogLevel.ERROR, functionName, blockHeight, 'Failed to get database connection parameters', error.message));
throw error;
} finally {
credentialsFetchSpan.end();

@darunrs (Collaborator, Author) commented:

We must end the spans ourselves; spans that aren't explicitly ended are only ended upon garbage collection. In places where a failure is possible but uncommon, I used finally to ensure these spans end. In some cases it's fine to let a span be garbage collected, since the trace itself isn't valuable.
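
A short sketch of the pattern being described (span and function names are illustrative, not the Runner's actual code):

import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('runner');

async function fetchCredentials (): Promise<void> {
  // Spans are not ended automatically; an un-ended span lingers until it is
  // garbage collected, so end it in a finally block even on the failure path.
  const credentialsFetchSpan = tracer.startSpan('fetch database connection parameters');
  try {
    // ... fetch connection parameters here; this may throw ...
  } finally {
    credentialsFetchSpan.end();
  }
}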

await this.writeLog(LogLevel.ERROR, functionName, blockHeight, 'Error running IndexerFunction', error.message);
throw e;
}
await this.tracer.startActiveSpan('run indexer code', async (runIndexerCodeSpan: Span) => {

@darunrs (Collaborator, Author) commented:

I use startActiveSpan when I anticipate more spans being created inside the code block and want them recorded as children. In this case, any context.db or Hasura calls made in the VM should be children of the 'run indexer code' span.
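
A minimal sketch of the parent/child behaviour described here (the child span name is illustrative):

import { trace, type Span } from '@opentelemetry/api';

const tracer = trace.getTracer('runner');

async function runIndexer (): Promise<void> {
  await tracer.startActiveSpan('run indexer code', async (runIndexerCodeSpan: Span) => {
    // Any span started while this callback runs (e.g. from context.db or Hasura
    // calls inside the VM) picks up 'run indexer code' as its parent, because it
    // is the active span in the current context.
    const childSpan = tracer.startSpan('hasura mutation'); // illustrative child span
    try {
      // ... indexer code runs here ...
    } finally {
      childSpan.end();
      runIndexerCodeSpan.end(); // startActiveSpan does not end the span for us
    }
  });
}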

traceExporter: new TraceExporter(),
spanProcessors: [new BatchSpanProcessor(new TraceExporter(
{
projectId: process.env.GCP_PROJECT_ID ?? ''

@darunrs (Collaborator, Author) commented:

An empty project ID will send the traces nowhere. I figured this was the safer default to use.
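
A sketch (assumed configuration, not the exact Runner code) of the GCP setup implied above: a BatchSpanProcessor wrapping the Cloud Trace exporter, with an empty project ID as the fallback:

import { NodeSDK } from '@opentelemetry/sdk-node';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-node';
import { TraceExporter } from '@google-cloud/opentelemetry-cloud-trace-exporter';

const sdk = new NodeSDK({
  spanProcessors: [
    new BatchSpanProcessor(
      // If GCP_PROJECT_ID is unset, fall back to '' so traces go nowhere rather
      // than to an unintended project.
      new TraceExporter({ projectId: process.env.GCP_PROJECT_ID ?? '' })
    )
  ]
});
sdk.start();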

import { TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-node';

export default function setUpTracerExport (): void {
switch (process.env.TRACING_EXPORTER) {

@darunrs (Collaborator, Author) commented Mar 14, 2024:

The only change here is what exporter we set below. The rest of the code is the same. So, swapping out the trace endpoint is actually quite cheap.
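
A sketch of the kind of switch being described; the exporter choices mirror the local-development options from the PR description, while the exact TRACING_EXPORTER values and the 10% sampling ratio (mentioned later in this thread) are assumptions:

import { NodeSDK } from '@opentelemetry/sdk-node';
import { ConsoleSpanExporter, TraceIdRatioBasedSampler, type SpanExporter } from '@opentelemetry/sdk-trace-node';
import { ZipkinExporter } from '@opentelemetry/exporter-zipkin';
import { TraceExporter } from '@google-cloud/opentelemetry-cloud-trace-exporter';

export default function setUpTracerExport (): void {
  let traceExporter: SpanExporter | undefined;
  switch (process.env.TRACING_EXPORTER) {
    case 'CONSOLE':
      traceExporter = new ConsoleSpanExporter();
      break;
    case 'ZIPKIN':
      traceExporter = new ZipkinExporter(); // defaults to a local Zipkin all-in-one container
      break;
    case 'GCP':
      traceExporter = new TraceExporter({ projectId: process.env.GCP_PROJECT_ID ?? '' });
      break;
    default:
      traceExporter = undefined; // no exporter: tracing becomes a no-op
  }

  // Only the exporter differs between environments; the rest of the setup is identical.
  const sdk = new NodeSDK({
    traceExporter,
    sampler: new TraceIdRatioBasedSampler(0.1) // sample roughly 10% of traces
  });
  sdk.start();
}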

const postRunSpan = tracer.startSpan('Delete redis message and shift queue', {}, context.active());
parentPort?.postMessage({ type: WorkerMessageType.STATUS, data: { status: Status.RUNNING } });
await workerContext.redisClient.deleteStreamMessage(streamKey, streamMessageId);
workerContext.queue.shift();

A reviewer (Collaborator) commented:

If this errors, will it escape this function's try/catch since it's not awaited?

@darunrs (Collaborator, Author) replied:

Are you referring to the queue shift? If so, yep, I spotted that and put an await on it. startSpan is sync, so it doesn't need an await. I'll be pushing the change up. I'm also reverting the change to QueueMessage: it means fewer changes to the worker, and putting the block height in the root span doesn't help much since we sample only 10% of traces anyway. The block height is listed in a child span later.
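
A small sketch of the point about startSpan being synchronous (the surrounding function is illustrative): the span is created immediately and parented to the currently active span via context.active(), so no await is needed for the span itself.

import { trace, context } from '@opentelemetry/api';

const tracer = trace.getTracer('runner');

async function finishStreamMessage (): Promise<void> {
  // startSpan returns synchronously; passing context.active() parents this span
  // under whatever span is currently active.
  const postRunSpan = tracer.startSpan('Delete redis message and shift queue', {}, context.active());
  try {
    // ... delete the Redis stream message and shift the queue (awaited work) ...
  } finally {
    postRunSpan.end();
  }
}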

@darunrs darunrs merged commit 10914d7 into main Mar 15, 2024
5 checks passed
@darunrs darunrs deleted the add-tracing-to-runner branch March 15, 2024 20:56
@darunrs darunrs mentioned this pull request Mar 28, 2024
darunrs added a commit that referenced this pull request Apr 3, 2024
- feat: Instrument Runner Service (#602)
- Support WHERE col IN (...) in context.db.table.select and delete (#606)
- feat: Include indexer name in context db build failure warning (#611)
- Cache provisioning status (#607)
- Fix ESLint on DmlHandler (#612)
- fix: Substitution 'node-sql-parser' with a forked version until April 1st (#597)
- feat: Add pgBouncer to QueryApi (#615)
- feat: Expose near-lake-primitives to VM (#613)

---------

Co-authored-by: Pavel Kudinov <mrkroz@gmail.com>
Co-authored-by: Pavel Kudinov <pavel@near.org>
Co-authored-by: Kevin Zhang <42101107+Kevin101Zhang@users.noreply.github.com>
Co-authored-by: Morgan McCauley <morgan@mccauley.co.nz>
This was referenced Apr 22, 2024
Development: successfully merging this pull request may close the issue "Add Instrumentation to Runner".
2 participants