This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

Better handling of buffered writes needed #113

Open
sjmudd opened this issue Mar 24, 2017 · 2 comments

Comments

@sjmudd
Collaborator

sjmudd commented Mar 24, 2017

This is applicable when BufferInstanceWrites == true.

I recently added some counters to monitor the number of times InstancePollSeconds gets exceeded during discovery. The number seen should normally be quite low, but I've seen that on a busy orchestrator server, especially one talking to an orchestrator backend in a different datacentre, the number of times this happens can jump significantly.

Consequently, better management and monitoring of this are needed.

Thoughts involve:

  • ensuring that the configuration parameters used are dynamically configurable via SIGHUP and thus do not require orchestrator to be restarted. This affects the two variables: InstanceFlushIntervalMilliseconds and InstanceWriteBufferSize.
  • adding extra monitoring of the time taken for flushInstanceWriteBuffer to run. A single metric every minute is useless, so I need to collect metrics and then be able to provide aggregate data and percentile timings, in a similar way to how the discovery timings are handled.
  • parallelising this function so that several flushes can run against the backend orchestrator server concurrently. (Completely serialising this, even though the writes are batched, is not fully efficient, but we must ensure that writes for the same instance are never sent through different connections at the same time.)

With these changes it should be easier to see where the bottleneck is and to be able to adjust the configuration "dynamically" to ensure the required performance is achieved.

@sjmudd
Collaborator Author

sjmudd commented Mar 24, 2017

Below are some graphs from a test cluster.
[screenshot taken 2017-03-24 at 08:56:07]

[screenshot taken 2017-03-24 at 08:57:37]

The two graphs above show the issue seen, together with a normal situation. Changing the orchestrator configuration to talk to a local orchestrator backend resolves the problem, but any orchestrator server in the cluster should be able to write properly to the backend.

@sjmudd
Collaborator Author

sjmudd commented Mar 24, 2017

The solution is not yet fully clear, but dynamic adjustment of the parameters will make it much easier to monitor the effect of changes, without restarting the active node, to see which settings work better.
