This is applicable when `BufferInstanceWrites == true`.
I recently added some counters to monitor the number of times `InstancePollSeconds` gets exceeded during discovery. The number seen should normally be quite low, but I've seen that on a busy orchestrator server, especially when talking to an orchestrator backend in a different datacentre, the number of times this happens can jump significantly.
Consequently, better management and monitoring of this is needed.
Thoughts involve:
ensuring that the configuration parameters used are dynamically configurable via SIGHUP and thus do not require orchestrator to be restarted. This affects two variables: `InstanceFlushIntervalMilliseconds` and `InstanceWriteBufferSize` (see the reload sketch after this list).
adding extra monitoring of the time taken for `flushInstanceWriteBuffer` to run. A single metric every minute is useless, so I need to collect samples and then be able to provide aggregate data and percentile timings, in a similar way to how the discovery timings are handled (see the timing sketch below).
parallelising this function so that it runs against the backend orchestrator server over several connections. Completely serialising it, even though the writes are batched, is not fully efficient, but we should ensure that writes for the same instance are never done through different connections at the same time (see the fan-out sketch below).
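A minimal sketch of what the SIGHUP-driven reload could look like; the `Settings` struct and `reload()` helper here are hypothetical placeholders, not orchestrator's actual config API:

```go
// Sketch of SIGHUP-driven reloading of the two buffer-related settings.
package main

import (
	"log"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
)

// Settings holds the two values we want to be able to change at runtime.
type Settings struct {
	InstanceFlushIntervalMilliseconds int
	InstanceWriteBufferSize           int
}

// current is swapped atomically so readers never see a half-updated struct.
var current atomic.Value

func reload() {
	// In reality this would re-read the orchestrator JSON config file;
	// hard-coded values are used here just to keep the sketch short.
	current.Store(&Settings{
		InstanceFlushIntervalMilliseconds: 100,
		InstanceWriteBufferSize:           100,
	})
}

func main() {
	reload()
	sighup := make(chan os.Signal, 1)
	signal.Notify(sighup, syscall.SIGHUP)
	go func() {
		for range sighup {
			reload()
			log.Printf("config reloaded: %+v", current.Load().(*Settings))
		}
	}()
	// ... rest of the server; the flush loop reads current.Load() each pass.
	select {}
}
```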
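A rough sketch of the kind of timing collection meant here, assuming a hypothetical `FlushTimings` type rather than orchestrator's existing metrics code: each `flushInstanceWriteBuffer` call is timed and recorded with `Observe`, and `Report` is called once per reporting period to emit percentiles.

```go
// Sketch of collecting flushInstanceWriteBuffer timings and reporting
// aggregates/percentiles, in the spirit of the discovery-timing metrics.
package metrics

import (
	"sort"
	"sync"
	"time"
)

type FlushTimings struct {
	mu        sync.Mutex
	durations []time.Duration // samples since the last Report()
}

// Observe records the time taken by one flushInstanceWriteBuffer call.
func (f *FlushTimings) Observe(d time.Duration) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.durations = append(f.durations, d)
}

// Report returns selected percentiles and the max over the collected samples,
// then resets the collection, so it can be called once per reporting period.
func (f *FlushTimings) Report() (p50, p95, max time.Duration) {
	f.mu.Lock()
	samples := f.durations
	f.durations = nil
	f.mu.Unlock()

	if len(samples) == 0 {
		return 0, 0, 0
	}
	sort.Slice(samples, func(i, j int) bool { return samples[i] < samples[j] })
	p50 = samples[len(samples)*50/100]
	p95 = samples[len(samples)*95/100]
	max = samples[len(samples)-1]
	return p50, p95, max
}
```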
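One possible way to parallelise the flush while keeping per-instance writes serialised is to hash the instance key to a fixed worker, so writes for the same instance always go through the same worker and connection. The `instanceWrite` type and `writeInstance` callback below are placeholders, not orchestrator code:

```go
// Sketch of fanning batched writes out over several backend connections
// while guaranteeing per-instance serialisation via key hashing.
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

type instanceWrite struct {
	instanceKey string // e.g. "host:port"
	// ... the buffered row to write
}

// flushParallel routes writes to `workers` goroutines (workers must be > 0);
// identical instance keys always land on the same worker and stay serialized.
func flushParallel(writes []instanceWrite, workers int, writeInstance func(instanceWrite)) {
	queues := make([]chan instanceWrite, workers)
	var wg sync.WaitGroup
	for i := range queues {
		queues[i] = make(chan instanceWrite, len(writes))
		wg.Add(1)
		go func(q chan instanceWrite) {
			defer wg.Done()
			for w := range q {
				writeInstance(w) // one backend connection per worker
			}
		}(queues[i])
	}
	for _, w := range writes {
		h := fnv.New32a()
		h.Write([]byte(w.instanceKey))
		queues[h.Sum32()%uint32(workers)] <- w
	}
	for _, q := range queues {
		close(q)
	}
	wg.Wait()
}

func main() {
	writes := []instanceWrite{{instanceKey: "db1:3306"}, {instanceKey: "db2:3306"}, {instanceKey: "db1:3306"}}
	flushParallel(writes, 4, func(w instanceWrite) {
		fmt.Println("writing", w.instanceKey)
	})
}
```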
With these changes it should be easier to see where the bottleneck is and to be able to adjust the configuration "dynamically" to ensure the required performance is achieved.
The two graphs above show the issue seen, together with a normal situation. Changing the orchestrator configuration to talk to a local orchestrator backend resolves the problem, but any orchestrator server in the cluster should be able to write properly to the backend.
The solution is not yet fully clear, but dynamic adjustment of the parameters will make it much easier to monitor the effect of changes without restarting the active node to see which settings are better.