
Let's understand how Oxidized creates threads #457

Closed
ElvinEfendi opened this issue Jun 2, 2016 · 17 comments

@ElvinEfendi
Contributor

I've realized that even if I have over 100 devices in router.db, Oxidized still uses a single thread and fetches the configs one after another. Looking at https://github.com/ytti/oxidized/blob/master/lib/oxidized/jobs.rb#L36, it seems Oxidized also considers the interval when calculating the number of threads to create. Does this mean that Oxidized will never create a parallel thread unless it thinks `Oxidized.config.interval` is not enough time to fetch all configs sequentially? What is the rationale behind this decision? Why not just `while @jobs.size < Oxidized.config.threads` at https://github.com/ytti/oxidized/blob/master/lib/oxidized/worker.rb#L16?

@ytti
Owner

ytti commented Jun 2, 2016

If a single thread is sufficient to meet the interval, Oxidized will never launch more threads. If the average config fetch time implies the interval cannot be met, more threads are started until it can be.

Essentially, the user decides how old a configuration backup may be, configures this as the interval, and lets it run.

Regarding jobs.rb#L36: if we have 100 nodes and the average fetch duration is 10s, it'll take 1000s to fetch them all sequentially. We then divide that aggregated time by the desired interval to arrive at how many threads we need to accomplish it.
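A minimal sketch of that sizing rule (the names and numbers are illustrative, not the actual code in jobs.rb):

```ruby
# Illustrative sketch of the thread-sizing rule described above;
# names and numbers are made up, this is not the code from jobs.rb.
node_count   = 100    # nodes in router.db
average_time = 10.0   # measured average seconds per config fetch
interval     = 300.0  # desired maximum age of any backup, in seconds

total_time = node_count * average_time     # 1000s if fetched sequentially
threads    = (total_time / interval).ceil  # => 4 threads to fit into 300s
threads    = 1 if threads < 1              # always keep at least one worker
puts "want #{threads} thread(s) to meet the #{interval.to_i}s interval"
```

With a long interval the quotient stays below 1, which is why a 100-node list can still be walked by a single thread.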

@danilopopeye
Contributor

> Why not just `while @jobs.size < Oxidized.config.threads` at https://github.com/ytti/oxidized/blob/master/lib/oxidized/worker.rb#L16?

Maybe we could introduce something like a `:min_threads` config, instead of just trying to hit the maximum number directly. I'm not sure which is better.

@ytti
Owner

ytti commented Jun 3, 2016

What problem are we solving?


@danilopopeye
Contributor

danilopopeye commented Jun 3, 2016

> What problem are we solving?

Take less time to fetch all nodes when you have loads of them?

@ytti
Owner

ytti commented Jun 3, 2016

Elaborate? Do you have a case where the configured interval is not being met? If you want to fetch nodes faster, that means you want the backed-up config to be younger, i.e. you want the interval to be smaller?

@yesbox
Contributor

yesbox commented Jun 8, 2016

The time a user wants to wait between complete backups of all devices and the time it takes to fetch all configs during one complete backup can be seen as two different wishes, whereas now they are assumed to be approximately the same.

As the fetching time becomes shorter, the backup as a whole becomes closer to a snapshot of the network at a point in time rather than a more continuous stream of configs spread across the interval. This could be desirable, as it may be easier to reason about a snapshot when checking configs' relation to each other, or when restoring an entire environment, or at least multiple devices, from backup.

This could be done by forcing a minimum number of threads, but perhaps what is really being asked for is two timers: the interval between complete backups, and a smaller time-to-fetch target within it. The latter could be used to calculate the number of threads, much like today. If time to fetch is not set, make it equal to the interval and the behavior stays the same as today. This doesn't preclude a minimum-threads setting, though.

@ytti
Owner

ytti commented Jun 8, 2016

If I understood correctly, we can't change time-to-fetch: we only use a single thread to talk to a single device, and we already do that as fast as we can, with something like 99% of the time obviously being I/O wait. So any improvement that could be made to time-to-fetch would fall within that remaining 1%.

What we can do is try to guarantee the config is no older than N, which is what we do.

Sometimes N might temporarily be too long; then you can use /next to move a device to the head of the queue and force an instant fetch.
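For example, with the oxidized-web REST API enabled, a fetch can be forced with a plain HTTP GET. The host, port, and node name below are assumptions for illustration:

```ruby
# Ask a running Oxidized instance to move one node to the head of the
# queue so it is fetched immediately. The 127.0.0.1:8888 listen address
# and the node name are assumptions, not defaults guaranteed everywhere.
require 'net/http'

uri = URI('http://127.0.0.1:8888/node/next/router1.example.com')
Net::HTTP.get_response(uri)  # the web UI exposes the same per-node action
```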

@yesbox
Contributor

yesbox commented Jun 8, 2016

Sorry, I think I wasn't totally clear. The suggestion (I didn't actually mean to write a feature request, but here we are; I think I would use this, though I'm not sure it's terribly important :)) is not to make the time to fetch faster by optimizing the code, but to have another configurable timer, which I referred to as "time to fetch". Like the interval, this timer describes what the user desires, and the thread algorithm adjusts the number of threads to finish on time. So what I really mean by it is "the desired total time to fetch from all devices during one iteration/interval".

The current algorithm ought to handle that without big changes (said without looking at the code...). Basically you'd allocate threads like you already do today, but using the "desired time to fetch everything" instead of the "interval" to find how many threads are needed to finish within that time; you'd then still wait until the next "interval" to begin another fetch.

That way you could not only tell Oxidized "get the configs from all devices every X minutes, and finish within that same interval, adjusting the number of concurrent fetches to make it so"; you could also say, if you wanted: "get the configs from all devices every X minutes, but finish getting them within Y minutes (where Y is lower than X), adjusting the number of concurrent fetches to make it so".

Again, this doesn't exclude the possibility of configuring a min/max number of threads, but I think it would cover the perceived need for that in perhaps a better way than plainly overriding the thread algorithm.
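A minimal sketch of that proposal, assuming a hypothetical fetch_time option alongside the existing interval (the option name and the numbers are made up; nothing like this exists in Oxidized today):

```ruby
# Hypothetical two-timer sizing: fetch_time bounds how long one complete
# rotation may take, while interval bounds how often a rotation starts.
node_count   = 100
average_time = 10.0         # measured seconds per fetch
interval     = 2 * 60 * 60  # start a rotation every 2 hours
fetch_time   = 60           # finish each rotation within 1 minute

threads   = (node_count * average_time / fetch_time).ceil  # => 17
idle_time = interval - fetch_time                          # => 7140s
puts "run #{threads} threads per rotation, then idle #{idle_time}s"
```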

@ytti
Owner

ytti commented Jun 8, 2016

Apologies, I'm being thick. I can't understand how your new use case, "get the configs from all devices every X minutes but I want you to finish getting them in Y minutes (which is lower than X)", differs from setting the current interval to Y.

Maybe another example, ideally a really concrete one with nodes and the exact fetch time for each under both scenarios, would help me wrap my head around the request.

@danilopopeye
Contributor

I'll try to use my case as an example of why I suggested a min_threads config.

We have (almost) finished configuring all of the ~2.6k elements that we need to back up. Because of connectivity issues we need 3 machines to cover all nodes: the first will have ~900 elements, the second ~600, and the last ~1100.

We set the interval to 12 hours (43200 seconds), since we don't touch most of the devices during the day, and restarted each Oxidized around midnight. Because our interval is really long, only 1 thread is used until we hit a firewall that takes longer than 300 seconds, at which point a second thread is started.

My problem is: it would be ideal to finish all fetches before 6 am, but I don't have that kind of control today, which is why I suggested a min_threads config. Nothing as fancy as what @yesbox suggested, but it could easily achieve a similar result.
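Rough arithmetic for why that setup stays single-threaded (the 30s average per device below is an assumption, not a measured figure):

```ruby
# Busiest of the three machines from the case above; the 30s average
# fetch time is assumed for illustration.
nodes        = 1100
average_time = 30.0
interval     = 43_200  # 12 hours

threads = (nodes * average_time / interval).ceil  # 33000/43200 => 1
# One thread meets the interval, even though the rotation then takes
# roughly 9 hours and cannot finish by 6 am.
```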

@ytti
Owner

ytti commented Jun 8, 2016

If you want to finish by 6 am and you start at midnight, shouldn't you set the interval to 6h?

Or is it a one-off? You want to get the boxes by X now, but subsequently within 12h?

@yesbox
Contributor

yesbox commented Jun 8, 2016

Let's try an extreme but perhaps not unreasonable example. Say you want all configs in a deployment of 100 devices backed up within a one-minute span, because you don't want the configs on the devices to diverge by more than one minute in any one iteration, and you'd like to do this every 2 hours.

You can make sure to get all configs in one minute by setting the interval to one minute, and that will attempt to get all configs in one minute, but it will also do so every minute. You just wanted to back up the configs every 2 hours; now you're hitting the devices much more often than you wished, because the time between getting the first and last config in your list of devices is tied to how often you fetch.

In that case you would instead set the interval to 2 hours and fetch_time/fetch_spread/time_to_fetch_all_configs to 1 minute. That would effectively set what is today the interval to one minute and then, once done, sleep for 1 hour and 59 minutes before starting the next interval, assuming it actually did finish in one minute.
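With made-up per-device numbers for this example (the 10s average fetch time is an assumption), the difference is in frequency, not thread count:

```ruby
devices, avg = 100, 10.0  # the 10s average fetch time is an assumption

# Workaround today: interval = 60s yields the one-minute snapshot,
# but also starts a new rotation every minute.
(devices * avg / 60).ceil  # => 17 threads per rotation
7200 / 60                  # => 120 rotations per 2 hours instead of 1

# With the proposed fetch_time = 60s and interval = 7200s, the same
# 17 threads run during the rotation, but only 1 rotation per 2 hours.
```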

Time axis ->
| new interval begins
- currently fetching using one thread
= currently fetching using many threads
^ fetch_time amount of time has passed since the new interval began
Always the same number of devices.

One type of user wants this behavior:

Short interval, set to stay up to date with changes quickly.
Needs many threads to keep up, busy all the time. Works as desired.
|=|=|=|=|=|=|=|=|=

Longer interval, set to regularly get all devices backed up.
The devices may be quite independent, or you don't make many changes, so we don't
care how much (measured in time) the configs may have diverged.
Finishes in time for the next interval, so one thread is used. Works as desired.
|---------      |---------      |---------

Another type of user wants this:

Short interval, set to get all configs without much time passing between the first and last.
We want something closer to a snapshot, or for some other reason want it to finish quickly.
Does what we asked, but does it all the time. Didn't do what we wanted.
|=|=|=|=|=|=|=|=|=

Longer interval, set to regularly get all devices backed up without hitting them all the time.
Does what we asked, but now with fewer threads or a single one, so it didn't finish quickly
like it did with the short interval. Didn't do what we wanted.
|---------      |---------      |---------

Suggestion: a longer interval (the timer that starts fetching from the first device), set to
regularly get all devices backed up, combined with a short fetch_time (the timer used to adjust
the number of threads so that the last device config finishes fetching within this amount of time
after the first device config started being fetched), to finish getting them quickly so that they
do not diverge (or whatever your reason might be).
This is what this type of user wanted: a combination of the two, achieved by decoupling the time
between beginning a new interval and the time to finish getting the last device config.
|=^             |=^             |=^

@danilopopeye
Contributor

> Or is it a one-off? You want to get the boxes by X now, but subsequently within 12h?

We can only run 2 times a day for now, but it shouldn't go past 6 am.
(We will probably change this to run only once every 24h, at midnight.)

@ytti
Owner

ytti commented Jun 8, 2016

So essentially what is wanted is bursty behaviour. My initial thought was that devices could be provisioned with predictable CPU time requirements, so that CPU use is constant over time.

But from @yesbox's example I hear that the crucial part is that all configs are from relatively near the same time, but need not be collected very often.
Thank you, now I understand the desire.

Could we satisfy both requirements by doing no periodic fetch at all, and instead having an API call to run one rotation at max_threads? I guess it could be a config option too.

@ElvinEfendi
Contributor Author

The purpose of this issue was to understand how Oxidized creates threads and the motivation behind it, and I think we achieved that, so I'm closing it.

@athompson-merlin

I also have the need to collect as much of a "snapshot" as possible, so I would prefer that, at the interval time, Oxidized spin up as many threads as possible in order to complete data collection as rapidly as possible.
(Oxidized is not resource-limited in this environment; CPU/RAM/IO usage is not a concern at all.)

I would also like this feature in order to troubleshoot my production instance more easily: when something doesn't work right and I have to restart Oxidized (e.g. after editing a custom model), it can take almost an hour before Oxidized finishes its initial single-threaded poll of all the devices and reaches steady state, at which point I can begin troubleshooting usefully.

Did anything ever get added to Oxidized to force multi-threaded operation? Ideal (for me) would be a new config option like `use_max_available_threads: [yes|no]` that could be toggled and reloaded at runtime, but I'm not seeing anything like that.

@davama
Contributor

davama commented Apr 27, 2022

@athompson-merlin

I would open a new issue and reference this one.

I would advise against necrobumping.
