
Parallel processing with Jetstream #342

Closed
mwdomino opened this issue Jan 8, 2024 · 4 comments


mwdomino commented Jan 8, 2024

Hey team, apologies for the poor title on this issue but I wanted to start a conversation around some performance bottlenecks we've run into and how we could potentially work around them.

Background

We are using gnmic as a datasource for a network telemetry API that answers queries about the state of our internal network. Currently, a simple inventory message (the Arista EOS version, for example) is sent to the Jetstream subject telemetry.inventory, while specific state messages (BGP status, interface status, etc.) are sent to telemetry.gnmic.<device_name>.<subscription>. When a message is received on telemetry.inventory we spawn a worker goroutine that subscribes to telemetry.gnmic.<device_name>.> for the given device name.

We use a worker goroutine per device because it is important to us that message ordering is maintained: we don't want two messages sent in short succession to be received by gnmic in order but then arrive at our application out of order. For example, a flapping interface sends multiple up/down events, and in that case it is most important that we store the last status for querying.

This solution allows us to maintain ordering per-device, while also processing the total queue in parallel.

Current Issue

While working through some bottlenecks we've found what we believe to be the cause of some issues within gnmic itself. From our investigation we see that each Jetstream output has an unbuffered msgChan, which blocks until each message has been processed and successfully written to Jetstream.

We had been seeing some missing messages, which we had attributed to NATS itself dropping them, but after investigation we believe that gnmic was timing out while the channel was blocked, and those messages were being lost. By increasing our write-timeout to 30s, all messages are delivered with no issues.

Question

Do you have any recommendations for how we could utilize gnmic to process messages in parallel while also maintaining ordering?

One idea we had was to have an output per-target or per-subscription. Each output would have the same configuration and be created solely to increase the number of msgChan created. This would maintain our ordering requirements within the context of a target or subscription.
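As a sketch of that idea (the output names, address, and per-target outputs mapping below are ours for illustration only; the exact jetstream output options are documented in the gnmic user guide):

```yaml
outputs:
  js-device1:
    type: jetstream
    address: nats.example.com:4222
    subject: telemetry.gnmic.device1
  js-device2:
    type: jetstream
    address: nats.example.com:4222
    subject: telemetry.gnmic.device2

targets:
  device1:
    outputs:
      - js-device1
  device2:
    outputs:
      - js-device2
```

Each output carries one target's messages, so every device gets its own msgChan and its own connection while per-device ordering is preserved.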

We'd looked at increasing num-workers, as we currently only use 1, but are concerned that, while gnmic would still process the messages in order, the network writes to Jetstream could arrive out of order.

Have you seen other users build a similar pattern before? Or do you have any other ideas we could use?

karimra (Collaborator) commented Jan 9, 2024

We are using gnmic as a datasource for a network telemetry API to be used for queries about the state of our internal network. The current method is that we have a simple message (like Arista EOS version, for example) sent to a Jetstream subject telemetry.inventory. Specific state messages (BGP status, interface status, etc) are sent to telemetry.gnmic.<device_name>.<subscription>. When a message is received on telemetry.inventory we spawn a worker goroutine that subscribes to the telemetry.gnmic.<device_name>.> for the given device name.

Great use of the jetstream output!

I'm not sure I see where/when messages would get out of order per target with your setup. If you have a goroutine per target subscribed to the subject telemetry.gnmic.<device_name>.> the messages should be in order. Or am I missing something?

mwdomino (Author) commented Jan 9, 2024

Ah, I should have clarified what exactly I was asking for feedback on :)

We definitely do not have any ordering issues right now and it's all working great. What we are worried about is that as we increase the number of targets and subscriptions we may have to continue to increase the timeouts to keep from blocking messages.

We had discussed maybe using an output for each device or subscription, with each output just having an identical configuration and dropping the events in telemetry.gnmic.<device> as it currently does. Our hope is that this would allow us to maintain multiple connections from gnmic to Jetstream so that we could have multiple writes coming from gnmic at the same time. Basically our application manages a goroutine per device so we were thinking it may make sense for gnmic to also allow parallel writes and ensure ordering for each device (or subscription).

Also happy to hear any other solution you could think of for parallelizing that kind of work.

Due to our design and the data we are collecting, we aren't worried about maintaining a global order across every message. We only need to ensure that all messages from a given subscription or device are received in order.

karimra (Collaborator) commented Jan 9, 2024

What we are worried about is that as we increase the number of targets and subscriptions we may have to continue to increase the timeouts to keep from blocking messages.

We can add a buffer size setting (it would make the msgChan buffered) if that works better than timeouts. But the timeouts will stay as a protection mechanism.

Btw, you can set a buffer size per target as well (defaults to 100):

targets:
  target1:
    # ...
    buffer-size: 100

mwdomino (Author) commented
Thanks a ton for the input! We aren't having any issues at the moment but I'll keep the buffers in mind as a potential knob we can tune as our load increases some more.

wendall-robinson added a commit to wendall-robinson/gnmic that referenced this issue Nov 5, 2024
  * follow-up to Parallel processing with Jetstream openconfig#342
  * added option to provide a buffer size to go channel on output
  * updated user guides for Jetstream and NATS outputs
netixx pushed a commit to netixx/gnmic that referenced this issue Jan 29, 2025
  * follow-up to Parallel processing with Jetstream openconfig#342
  * added option to provide a buffer size to go channel on output
  * updated user guides for Jetstream and NATS outputs