Let's understand how Oxidized creates threads #457
Comments
If a single thread is sufficient to meet the interval, Oxidized will never launch more threads. If the average config fetch time implies the interval cannot be met, more threads are started until the interval is met. Essentially the user decides how old a configuration backup may be, configures that as the interval, and lets it run. Regarding jobs.rb#L36: if we have 100 nodes and the average duration is 10s, it will take 1000s to fetch them all sequentially. We then divide that aggregated time by the desired interval to arrive at how many threads are needed to accomplish it.
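Roughly, in code (illustrative only; the 100 nodes and 10 s average are the numbers from the example above, the 300 s interval is a made-up value, and this is not the actual jobs.rb implementation):

```ruby
nodes            = 100   # devices to back up
average_duration = 10    # measured seconds per sequential fetch
interval         = 300   # desired maximum age of any backup, in seconds (assumed here)

total_time     = nodes * average_duration          # 1000 s single-threaded
threads_needed = (total_time / interval.to_f).ceil # => 4
threads_needed = 1 if threads_needed < 1           # one thread always suffices otherwise
```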
Maybe we could introduce something like a
What problem are we solving? ++ytti
Take less time to fetch all nodes when you have loads of them?
Elaborate? Do you have a case where the configured interval is not being met? If you want to fetch nodes faster, that means you want the backed-up config to be younger, i.e. you want the interval to be smaller?
The time a user wants to wait between complete backups of all devices and the time it takes to fetch all configs when doing a complete backup can be seen as two different wishes, whereas now it is assumed they are approximately the same. As the fetching time becomes shorter, the backup as a whole becomes closer to a snapshot of the network at that point in time rather than a more continuous stream of configs spread across the interval. This could be desirable, as it may be easier to reason about a snapshot when checking how configs relate to each other, or when restoring an entire environment, or at least multiple devices, from backup.

This could be done by forcing a minimum number of threads, but perhaps what is really being asked for is two timers: the interval between complete backups, and the time to fetch, which is smaller than the interval, to aim for. That could be used to calculate the number of threads, much like today. If the time to fetch is not set, make it equal to the interval and the behavior is the same as today. This doesn't prevent there from being a minimum-threads setting, though.
If I understood correctly, we can't change time-to-fetch: we only use a single thread to talk to a single device, and we do it as fast as we can; something like 99% of the time is obviously I/O wait. So if some improvements could be made to time-to-fetch, they would fall within that 1%. What we can do is try to guarantee the config is no older than N, which is what we do. Sometimes N might temporarily be too long; then you can do /next to move a device to the head of the queue, to force an instant fetch.
Sorry, I think I wasn't totally clear. The suggestion (I didn't actually mean to write a feature request, but here we are; I think I would use this, though I'm not sure it's terribly important :)) is not to make the time to fetch faster by optimizing the code, but to have another configurable timer, which I referred to as "time to fetch". Like the interval, this timer describes what the user desires, and the thread algorithm adjusts the number of threads to make it finish on time; so really what I mean by it is "desired time to fetch from all devices in total during one iteration/interval". The current algorithm ought to be able to handle that without big changes (said without looking at the code...). Basically you'd allocate threads like you already do today, but using the "desired time to fetch everything" instead of the "interval" to find how many threads are needed to finish within that time, and then you'd still wait until the next "interval" to begin another fetch. That way you could ask Oxidized not only to "get the configs from all devices every X minutes and finish within that same interval, adjusting the number of concurrent fetches to make it so", you could also say, if you want to, "get the configs from all devices every X minutes but finish getting them in Y minutes (which is lower than X), adjusting the number of concurrent fetches to make it so". Again, this doesn't exclude the possibility of configuring a min/max number of threads, but I think it would cover the perceived need for that in perhaps a better way than plainly overriding the thread algorithm.
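Concretely, something like this minimal sketch, where `fetch_time` is the hypothetical new knob (none of these names or numbers are actual Oxidized settings; it is just the shape of the idea):

```ruby
interval         = 7200            # seconds between the starts of complete backup rounds
fetch_time       = 600             # desired duration of one complete round ("time to fetch")
nodes            = (1..100).to_a   # stand-ins for real devices
average_duration = 10              # measured average seconds per sequential fetch

loop do
  started = Time.now
  # Size the worker pool from fetch_time instead of interval, exactly as the
  # current algorithm sizes it from interval.
  threads = (nodes.size * average_duration / fetch_time.to_f).ceil  # => 2 here
  # ... one complete rotation of all nodes would run here, using `threads` workers ...
  elapsed = Time.now - started
  # Idle out the remainder of the interval before starting the next round.
  sleep(interval - elapsed) if elapsed < interval
end
```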
Apologies, I'm being thick; I cannot understand your new use case. Maybe some other example, or perhaps a really concrete one with nodes and the exact time each takes to fetch, in both scenarios, would help me wrap my head around the request.
I'll try to use my case as an example of why I made that suggestion. We have (almost) finished configuring all of the ~2.6k elements that we need to back up. Due to connectivity constraints we need 3 machines to cover all nodes: the first will have ~900 elements, the second ~600 and the last ~1100. We set the interval to 12 hours (43200 seconds), since we don't touch most of the devices during the day, then restarted each Oxidized around midnight. Since our interval is really long, only 1 thread is used until we hit a firewall that takes longer than 300 seconds, at which point a second thread is started. My problem here is: it would be ideal to finish all fetches before 6 am, but I don't have that kind of control today, hence my suggestion above.
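For illustration, assuming an average of roughly 30 seconds per fetch (a made-up figure; the real average is measured per device), the interval-based calculation asks for only a single thread on the busiest of those machines, which is why the run stretches well past 6 am:

```ruby
nodes            = 1100     # the largest of the three machines
average_duration = 30       # assumed seconds per fetch, purely illustrative
interval         = 43_200   # 12 hours

total_time = nodes * average_duration           # 33_000 s, roughly 9.2 hours
threads    = (total_time / interval.to_f).ceil  # => 1, so everything runs sequentially
```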
If you want to finish by 6 am and you start at midnight, shouldn't you set the interval to 6h? Or is it a one-off? You want to get the boxes by X now, but subsequently within 12h?
Let's try an extreme but perhaps not unreasonable example. Say you want all configs in a deployment of 100 devices backed up within a one-minute span, because you don't want the configs on the devices to diverge by more than one minute in any one iteration, and you'd like to do this every 2 hours. You can make sure to get all configs in one minute by setting the interval to one minute, and that will indeed attempt to get all configs in one minute, but it will also do so every minute. You just wanted to back up the configs every 2 hours; now you're hitting the devices much more often than you wished, since the time between getting the first and last config in your list of devices is tied to how often you fetch. In that case you would instead set the interval to 2 hours and fetch_time/fetch_spread/time_to_fetch_all_configs to 1 minute. That would effectively set what is today the interval to one minute and, once done, sleep for 1 hour and 59 minutes before starting the next interval, assuming it actually did finish in one minute.
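Running that scenario through the same arithmetic, with an assumed 10-second average fetch (a made-up figure, purely for illustration):

```ruby
nodes            = 100
average_duration = 10      # assumed seconds per fetch
interval         = 7200    # 2 hours between the starts of rounds
fetch_time       = 60      # desired one-minute span per round

threads = (nodes * average_duration / fetch_time.to_f).ceil  # => 17
idle    = interval - fetch_time                              # => 7140 s, i.e. 1 h 59 min
```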
We can only run 2 times a day for now, but it shouldn't go past 6 am.
So essentially what is wanted is bursty behaviour. My initial thought was that devices could be provisioned with predictable CPU time requirements, so that CPU use is constant over time. But from @yesbox's example I hear that the crucial part is that all configs are from roughly the same time, while they need not be collected very often. Could we satisfy both requirements by doing absolutely no periodic fetch at all, and instead have an API call to run one rotation at max_threads? I guess it could be a config option too.
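A rough sketch of what that one-shot rotation could look like; `run_one_rotation` and `fetch_config` are hypothetical names, not an existing Oxidized API:

```ruby
# Hypothetical helper: drain the whole node list once with up to max_threads
# concurrent workers, then return. No periodic scheduling at all -- the caller
# (an API endpoint, or even cron) decides when a burst should happen.
def run_one_rotation(nodes, max_threads)
  queue = Queue.new
  nodes.each { |node| queue << node }

  workers = Array.new([max_threads, nodes.size].min) do
    Thread.new do
      loop do
        node = begin
                 queue.pop(true)   # non-blocking pop; raises ThreadError when empty
               rescue ThreadError
                 break
               end
        node.fetch_config          # stand-in for whatever actually fetches the config
      end
    end
  end
  workers.each(&:join)
end
```

An API call or scheduled job could then trigger exactly one burst whenever a near-simultaneous snapshot is wanted, leaving the devices untouched the rest of the time.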
The purpose of the issue was to understand how Oxidized creates threads and the motivation behind it, and I think we achieved that. So closing this.
I also have the need to collect as much of a "snapshot" as possible, so I would prefer that. I would also like this feature in order to troubleshoot my production instance more easily: when something doesn't work right and I have to restart Oxidized (e.g. after editing a custom model), it can take almost an hour before Oxidized finishes its initial single-threaded poll of all the devices and reaches steady state, at which point I can begin troubleshooting usefully. Did anything ever get added to Oxidized to force multi-threaded operation? Ideal (for me) would be a new config option for it.
I would open a new issue and reference this. Would advise against necrobumping.
I've realized that even if I have over 100 devices in the router.db, Oxidized still uses a single thread and fetches the configs one after another. Looking at https://github.com/ytti/oxidized/blob/master/lib/oxidized/jobs.rb#L36, it seems Oxidized also considers the interval when calculating the number of threads to be created. Does this mean that Oxidized will never create a parallel thread unless it thinks that `Oxidized.config.interval` is not enough time to fetch all configs sequentially? What is the rationale behind this decision? Why not just `while @jobs.size < Oxidized.config.threads` at https://github.com/ytti/oxidized/blob/master/lib/oxidized/worker.rb#L16? @ytti