feat: Instrument Runner Service #602

Merged 18 commits into main on Mar 15, 2024
Conversation

@darunrs (Collaborator) commented Mar 13, 2024

Runner is lacking instrumentation. It is responsible for many things, and it has become hard to understand which tasks contribute to the overall latency of an indexer. In addition, we are now at a point where we need to drive down latencies to facilitate new indexer use cases such as access keys.

I've chosen to instrument Runner with OpenTelemetry. Tracing generally requires three components: an instrumented service, a trace collector, and a trace visualizer. The service collects trace data and transmits it to the collector. The collector should accept trace data with minimal overhead, so that it doesn't impact the performance of the instrumented service; it then processes the data and forwards it to the visualizer, which renders the traces and allows filtering on them.

The benefit of OpenTelemetry over other options like Zipkin and Jaeger is that GCP already supports ingesting OpenTelemetry data. As such, we don't need to provision a collector ourselves and can instead leverage GCP's existing Tracing service, which acts as both collector and visualizer. For local development, traces can be exported to the console, to a Zipkin all-in-one container, or to GCP (which requires the Cloud Trace Agent role and a project ID). This is done by simply initializing the NodeSDK differently.

In addition, we do not want to enable traces in prod yet, so no exporter is specified there. Omitting the exporter yields a no-op trace exporter which won't attempt to record traces.
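
A minimal sketch (not the exact Runner code) of how the same NodeSDK initialization covers both cases; the 'CONSOLE' value is illustrative:

import { NodeSDK } from '@opentelemetry/sdk-node';
import { ConsoleSpanExporter } from '@opentelemetry/sdk-trace-node';

// Local development: print spans to the console. In prod (for now) no exporter
// is passed, so the SDK falls back to a no-op exporter and records nothing.
const traceExporter = process.env.TRACING_EXPORTER === 'CONSOLE'
  ? new ConsoleSpanExporter()
  : undefined;

const sdk = new NodeSDK({ traceExporter });
sdk.start();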

No changes were made to the code execution path. All tests pass without modification, aside from snapshots that had to be replaced due to changes in the indentation of mutation strings. I manually verified the mutation strings are still the same by stripping whitespace and comparing against the originals.

@darunrs darunrs marked this pull request as ready for review March 14, 2024 17:47
@darunrs darunrs requested a review from a team as a code owner March 14, 2024 17:47
@darunrs darunrs linked an issue Mar 14, 2024 that may be closed by this pull request
try {
this.database_connection_parameters = this.database_connection_parameters ??
await this.deps.provisioner.getDatabaseConnectionParameters(hasuraRoleName);
} catch (e) {
const error = e as Error;
simultaneousPromises.push(this.writeLog(LogLevel.ERROR, functionName, blockHeight, 'Failed to get database connection parameters', error.message));
throw error;
} finally {
credentialsFetchSpan.end();

@darunrs (Collaborator, Author) commented:

We must end the spans ourselves; spans that aren't explicitly ended are only ended upon garbage collection. In places where a failure is possible but uncommon, I used finally to ensure these spans end. In some cases it's fine to let a span be garbage collected, since the trace itself isn't valuable.
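
A short sketch of the pattern being described (span and function names are illustrative, not the Runner's actual code):

import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('runner');

async function fetchCredentials (): Promise<void> {
  // Spans are not ended automatically; an un-ended span lingers until it is
  // garbage collected, so end it in a finally block even on the failure path.
  const credentialsFetchSpan = tracer.startSpan('fetch database connection parameters');
  try {
    // ... fetch connection parameters here; this may throw ...
  } finally {
    credentialsFetchSpan.end();
  }
}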

await this.writeLog(LogLevel.ERROR, functionName, blockHeight, 'Error running IndexerFunction', error.message);
throw e;
}
await this.tracer.startActiveSpan('run indexer code', async (runIndexerCodeSpan: Span) => {

@darunrs (Collaborator, Author) commented:

I use startActiveSpan when I anticipate more spans being created inside the code block and want them recorded as children. In this case, any context.db or Hasura calls made in the VM should be children of the 'run indexer code' span.
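
A minimal sketch of the parent/child behaviour described here (the child span name is illustrative):

import { trace, type Span } from '@opentelemetry/api';

const tracer = trace.getTracer('runner');

async function runIndexer (): Promise<void> {
  await tracer.startActiveSpan('run indexer code', async (runIndexerCodeSpan: Span) => {
    // Any span started while this callback runs (e.g. from context.db or Hasura
    // calls inside the VM) picks up 'run indexer code' as its parent, because it
    // is the active span in the current context.
    const childSpan = tracer.startSpan('hasura mutation'); // illustrative child span
    try {
      // ... indexer code runs here ...
    } finally {
      childSpan.end();
      runIndexerCodeSpan.end(); // startActiveSpan does not end the span for us
    }
  });
}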

traceExporter: new TraceExporter(),
spanProcessors: [new BatchSpanProcessor(new TraceExporter(
{
projectId: process.env.GCP_PROJECT_ID ?? ''

@darunrs (Collaborator, Author) commented:

An empty project ID will send the traces nowhere. I figured this was the safer default to use.
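
A sketch (assumed configuration, not the exact Runner code) of the GCP setup implied above: a BatchSpanProcessor wrapping the Cloud Trace exporter, with an empty project ID as the fallback:

import { NodeSDK } from '@opentelemetry/sdk-node';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-node';
import { TraceExporter } from '@google-cloud/opentelemetry-cloud-trace-exporter';

const sdk = new NodeSDK({
  spanProcessors: [
    new BatchSpanProcessor(
      // If GCP_PROJECT_ID is unset, fall back to '' so traces go nowhere rather
      // than to an unintended project.
      new TraceExporter({ projectId: process.env.GCP_PROJECT_ID ?? '' })
    )
  ]
});
sdk.start();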

import { TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-node';

export default function setUpTracerExport (): void {
switch (process.env.TRACING_EXPORTER) {

@darunrs (Collaborator, Author) commented Mar 14, 2024:

The only change here is what exporter we set below. The rest of the code is the same. So, swapping out the trace endpoint is actually quite cheap.
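
A sketch of the kind of switch being described; the exporter choices mirror the local-development options from the PR description, while the exact TRACING_EXPORTER values and the 10% sampling ratio (mentioned later in this thread) are assumptions:

import { NodeSDK } from '@opentelemetry/sdk-node';
import { ConsoleSpanExporter, TraceIdRatioBasedSampler, type SpanExporter } from '@opentelemetry/sdk-trace-node';
import { ZipkinExporter } from '@opentelemetry/exporter-zipkin';
import { TraceExporter } from '@google-cloud/opentelemetry-cloud-trace-exporter';

export default function setUpTracerExport (): void {
  let traceExporter: SpanExporter | undefined;
  switch (process.env.TRACING_EXPORTER) {
    case 'CONSOLE':
      traceExporter = new ConsoleSpanExporter();
      break;
    case 'ZIPKIN':
      traceExporter = new ZipkinExporter(); // defaults to a local Zipkin all-in-one container
      break;
    case 'GCP':
      traceExporter = new TraceExporter({ projectId: process.env.GCP_PROJECT_ID ?? '' });
      break;
    default:
      traceExporter = undefined; // no exporter: tracing becomes a no-op
  }

  // Only the exporter differs between environments; the rest of the setup is identical.
  const sdk = new NodeSDK({
    traceExporter,
    sampler: new TraceIdRatioBasedSampler(0.1) // sample roughly 10% of traces
  });
  sdk.start();
}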

const postRunSpan = tracer.startSpan('Delete redis message and shift queue', {}, context.active());
parentPort?.postMessage({ type: WorkerMessageType.STATUS, data: { status: Status.RUNNING } });
await workerContext.redisClient.deleteStreamMessage(streamKey, streamMessageId);
workerContext.queue.shift();

A reviewer (Collaborator) commented:

If this errors, will it escape this function's try/catch since it's not awaited?

@darunrs (Collaborator, Author) replied:

Are you referring to the queue shift? If so, yep, I spotted that and put an await on it. startSpan is sync, so it doesn't need an await. I'll be pushing the change up. I'm also reverting the change to QueueMessage: it means fewer changes to the worker, and putting the block height in the root span doesn't help much since we sample only 10% of traces anyway. The block height is listed in a child span later.
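
A small sketch of the point about startSpan being synchronous (the surrounding function is illustrative): the span is created immediately and parented to the currently active span via context.active(), so no await is needed for the span itself.

import { trace, context } from '@opentelemetry/api';

const tracer = trace.getTracer('runner');

async function finishStreamMessage (): Promise<void> {
  // startSpan returns synchronously; passing context.active() parents this span
  // under whatever span is currently active.
  const postRunSpan = tracer.startSpan('Delete redis message and shift queue', {}, context.active());
  try {
    // ... delete the Redis stream message and shift the queue (awaited work) ...
  } finally {
    postRunSpan.end();
  }
}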

@darunrs darunrs merged commit 10914d7 into main Mar 15, 2024
5 checks passed
@darunrs darunrs deleted the add-tracing-to-runner branch March 15, 2024 20:56
@darunrs darunrs mentioned this pull request Mar 28, 2024
darunrs added a commit that referenced this pull request Apr 3, 2024
- feat: Instrument Runner Service (#602)
- Support WHERE col IN (...) in context.db.table.select and delete (#606)
- feat: Include indexer name in context db build failure warning (#611)
- Cache provisioning status (#607)
- Fix ESLint on DmlHandler (#612)
- fix: Substitution 'node-sql-parser' with a forked version until April 1st (#597)
- feat: Add pgBouncer to QueryApi (#615)
- feat: Expose near-lake-primitives to VM (#613)

---------

Co-authored-by: Pavel Kudinov <mrkroz@gmail.com>
Co-authored-by: Pavel Kudinov <pavel@near.org>
Co-authored-by: Kevin Zhang <42101107+Kevin101Zhang@users.noreply.github.com>
Co-authored-by: Morgan McCauley <morgan@mccauley.co.nz>
This was referenced Apr 22, 2024
Development: successfully merging this pull request may close the issue "Add Instrumentation to Runner".
2 participants