
Deduplicate Iptables Rules with Dynamic ASG's #101

Merged
merged 2 commits into cloudfoundry:develop on Jan 2, 2024

Conversation

Contributor

@klapkov klapkov commented Nov 14, 2023

No description provided.

@@ -59,6 +59,22 @@ func (c *RuleConverter) BulkConvert(ruleSpec []Rule, logChainName string, global
return iptablesRules
}

func (c *RuleConverter) DeduplicateRules(iptablesRules []rules.IPTablesRule) []rules.IPTablesRule {
keys := make(map[string]bool)
deduped_rules := []rules.IPTablesRule{}

Aren't we creating a copy of the rule set here? What would be the memory impact? We have seen cases with millions of rules.

Contributor Author

@klapkov klapkov Nov 14, 2023


We create a map in which the rules serve as keys and the values are booleans. On each iteration over the rules list, if the current rule is not yet a key in the map, we add it with the value true and append the rule to dedupedRules. If a duplicate of that rule appears later, its key is already present in the map, so we skip adding it to dedupedRules. As to the memory concern, this way of removing duplicates is the one I have seen used most often in Go. Honestly, I do not know how this will impact performance; open to suggestions here.
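The map-as-set approach described above can be sketched roughly as follows. This is illustrative only: `IPTablesRule` is simplified here to a string slice (the real type lives in the silk-release `rules` package), and `dedupeRules` is a stand-in name, not the actual PR code.

```go
package main

import (
	"fmt"
	"strings"
)

// IPTablesRule stands in for rules.IPTablesRule; a rule is a slice of
// iptables arguments, so the joined string serves as the map key.
type IPTablesRule []string

// dedupeRules keeps the first occurrence of each rule and drops later
// duplicates, using a map as a set of already-seen keys.
func dedupeRules(in []IPTablesRule) []IPTablesRule {
	seen := make(map[string]struct{}, len(in))
	out := make([]IPTablesRule, 0, len(in))
	for _, r := range in {
		key := strings.Join(r, " ")
		if _, ok := seen[key]; ok {
			continue // duplicate: this rule was already emitted
		}
		seen[key] = struct{}{}
		out = append(out, r)
	}
	return out
}

func main() {
	rules := []IPTablesRule{
		{"-p", "tcp", "--dport", "443", "-j", "ACCEPT"},
		{"-p", "tcp", "--dport", "80", "-j", "ACCEPT"},
		{"-p", "tcp", "--dport", "443", "-j", "ACCEPT"}, // duplicate
	}
	fmt.Println(len(dedupeRules(rules))) // prints 2
}
```

Using `map[string]struct{}` instead of `map[string]bool` makes the set intent explicit and saves a byte per entry, though the behavior is the same.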

@ameowlia ameowlia self-requested a review December 8, 2023 17:31
@ameowlia
Member

ameowlia commented Dec 8, 2023

In general, I am okay with this fix, but I am worried about performance implications.

❓ At the WG meeting someone mentioned that SAP has ~1 million iptables rules per cell. How long does this calculation take for all of those rules?

❓ Did you investigate a way to de-dupe the policies closer to the "root" in the internal-policy-server? I don't know if that would be any more performant, but it would be closer to the root issue.

@vlast3k
Contributor

vlast3k commented Dec 8, 2023

@ameowlia, currently we have implemented a Python script and a custom bosh-release which replaces iptables-restorer. So what happens is that the vxlan-policy-agent generates the list of rules for one or several LRPs. It is then fed through the Python script, deduplicated, and then sent to the original iptables-restorer. The performance impact of the Python script is negligible, e.g. less than a second. One reason is perhaps that the deduplication is not actually done for 1 million entries but for a few tens of thousands, which is how many we get per LRP. And since the rules are applied per LRP, they never accumulate too much in the iptables db. We have had this enabled on our worst (most garbage-polluted) landscape for a few months and it has been working quite stably.

We also checked whether we could do it closer to the problem. The reason we did not is that this is how it works for Static ASGs: there, the binary used to apply them already does deduplication. So, in order to make the smallest change and not alter other logic, we decided to do the same, as close as possible to the actual call to the binary.

Doing the deduplication in an earlier phase would slightly change the semantics for both cases, and would perhaps also come with a set of different side effects that we would need to analyze.

@klapkov
Contributor Author

klapkov commented Dec 13, 2023

Hello @maxmoehl,
Thanks for the suggestions. I have modified DeduplicateRules and I did a benchmark test. See: klapkov@2bd074b

The benchmark performs the de-duplication with input of 10000 rules. ~7000 of them are duplicates.

Results from the benchmark with the initial implementation (1s benchtime):

BenchmarkDeduplicateRules
BenchmarkDeduplicateRules/initial
BenchmarkDeduplicateRules/initial-12         	     594	   2147544 ns/op	 1621821 B/op	   10079 allocs/op
PASS
ok  	code.cloudfoundry.org/cni-wrapper-plugin/netrules	1.718s

Results with the optimised implementation (1s benchtime):

BenchmarkDeduplicateRules
BenchmarkDeduplicateRules/optimised
BenchmarkDeduplicateRules/optimised-12         	     662	   1736750 ns/op	 1279523 B/op	   10002 allocs/op
PASS
ok  	code.cloudfoundry.org/cni-wrapper-plugin/netrules	1.685s 

There is an improvement of ~0.4 ms per run. Not sure how significant that is, though.

So the deduplication does not significantly affect the overall execution time of https://github.com/klapkov/silk-release/blob/develop/src/code.cloudfoundry.org/cni-wrapper-plugin/netrules/netout_chain.go#L57, but it greatly improves overall performance by removing duplicate rules that are not needed.
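For anyone wanting to reproduce numbers of this shape locally, a self-contained micro-benchmark along these lines can be run with `go run`. Note that `makeRules` and `dedupe` are illustrative stand-ins, not the code from the linked commit:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"testing"
)

// makeRules builds n rules of which only `distinct` are unique,
// mimicking the heavy-duplication pattern described in this thread
// (10000 rules, ~7000 of them duplicates).
func makeRules(n, distinct int) [][]string {
	rules := make([][]string, n)
	for i := 0; i < n; i++ {
		port := strconv.Itoa(1024 + i%distinct)
		rules[i] = []string{"-p", "tcp", "--dport", port, "-j", "ACCEPT"}
	}
	return rules
}

// dedupe is a minimal map-as-set deduplication pass.
func dedupe(in [][]string) [][]string {
	seen := make(map[string]struct{}, len(in))
	out := make([][]string, 0, len(in))
	for _, r := range in {
		k := strings.Join(r, " ")
		if _, ok := seen[k]; ok {
			continue
		}
		seen[k] = struct{}{}
		out = append(out, r)
	}
	return out
}

func main() {
	input := makeRules(10000, 3000)
	// testing.Benchmark lets us reuse the stdlib benchmark harness
	// outside of a _test.go file.
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			dedupe(input)
		}
	})
	fmt.Println("distinct rules:", len(dedupe(input)))
	fmt.Println(res)
}
```

The absolute ns/op numbers will of course differ by machine; what matters is the per-call cost staying well under a millisecond at this input size.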

@ameowlia
Member

@klapkov thanks for adding the benchmark.

❓ can you run the benchmark with 1 million rules and try some of the optimizations that @maxmoehl mentioned here?

I want to make sure it works at the level that we expect it to run at.

@vlast3k
Contributor

vlast3k commented Dec 14, 2023

@ameowlia I will put the numbers above a bit in perspective, because it does not become evident from the example.
The test shows that the current code is able to deduplicate 6 million rules per second.
The only significant part of the code is the key generation, where several strings are joined.
Adding a string to a map and growing the map to a million entries is not significant, as it takes fewer than 20 resizes (the map doubles in size each time).

The proposal from Max above to reuse the string variable is rather a misunderstanding, as that is just an assignment; the new string is created as the result of the join operation.

The other proposal, to pre-allocate the result slice with the maximum possible length, in fact decreases performance slightly (e.g. by 1-2%), because most of the time we are talking about a 10x+ reduction in the number of rules.

A solution that we considered, but consciously did not implement because we did not need an improvement from 1 ms to 10 ns, is to give the type rules.IPTablesRule Equals- and Hashcode-style behaviour so that it can be used directly as the map key.
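Go has no Equals/Hashcode methods in the Java sense; the closest equivalent of the idea above is to use a comparable type (a struct or fixed-size array of comparable fields) directly as the map key, which removes the per-rule string join entirely. A minimal sketch, with an invented `ruleKey` struct that does not match the real silk-release rule layout:

```go
package main

import "fmt"

// ruleKey is a comparable struct, so it can serve directly as a map key
// without any string joining. The fields are illustrative only.
type ruleKey struct {
	protocol string
	dstPort  string
	action   string
}

// dedupeByKey drops duplicate rules using the struct value itself as the
// set key; Go compares the structs field by field.
func dedupeByKey(in []ruleKey) []ruleKey {
	seen := make(map[ruleKey]struct{}, len(in))
	out := make([]ruleKey, 0, len(in))
	for _, r := range in {
		if _, ok := seen[r]; ok {
			continue
		}
		seen[r] = struct{}{}
		out = append(out, r)
	}
	return out
}

func main() {
	in := []ruleKey{
		{"tcp", "443", "ACCEPT"},
		{"tcp", "443", "ACCEPT"}, // duplicate
		{"udp", "53", "ACCEPT"},
	}
	fmt.Println(len(dedupeByKey(in))) // prints 2
}
```

The catch is that `rules.IPTablesRule` is a slice, and slices are not comparable in Go, so this approach would require changing the rule representation (or copying into a fixed-size array), which is presumably part of why it was not pursued.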

In general, memory management was a problem in Java's infancy and has to a large extent been resolved since Java 8 (almost 10 years ago). Go allocates local variables on the stack where it can, and is thus much more efficient than any version of Java.


And again, I would like to put this into perspective: how the problem arises, and what numbers we have observed (knowing that deduplicating 6 million rules takes about 1 second on an old Mac core).

More than 100k iptables rules on a Diego Cell results in a measurable slowdown in container creation. Up to 500k it is bearable, with times of 20+ seconds to create a container (when Dynamic ASGs are used).

Above this, all iptables operations become too slow, and we were not able to enable Dynamic ASGs on those landscapes.

The problem stems from the fact that with Static ASGs duplicates were removed by the binary, so the iptables rules never grew beyond 20-30k in our case.

With Dynamic ASGs no deduplication was done by the binary, and on some landscapes, due to a special stakeholder usage pattern (using Azure DBs), we realized that the old binaries had been removing a lot of rules.

So this resulted in an explosion of the number of rules, e.g. from 20-30-50k to 500-1000-2000k.

We made a fix last year that significantly reduced the time, and this allowed us to enable Dynamic ASGs on the landscapes with <500k rules, but not on the larger ones.

On the affected Diego Cells we have up to 100 containers, each of them having 10-20k rules. As soon as the total number of rules on the VM crosses 500k, adding new ones becomes exponentially slower, taking tens of seconds.

So essentially, we do not care about the performance of deduplicating millions of rules. What we care about is that the total number of rules does not grow, so that iptables stays performant.

With the current proposal we achieve this without performance overhead (<1 ms per LRP).


Assuming our use case is special (many duplicates, etc.), what would happen in another case where there are indeed LRPs with millions of genuinely distinct rules? They would never have worked in the first place, as they would have broken iptables as Static ASGs too. So the scenario we are talking about really is: a low number of rules with Static ASGs and an explosion with Dynamic ASGs, i.e. many duplicates in one or many LRPs that eventually add up.

@PlamenDoychev

Hi @ameowlia, @maxmoehl,

CC @vlast3k, @klapkov

If there is no other feedback, shall we proceed with merging the PR?

@ameowlia ameowlia merged commit df5b0b1 into cloudfoundry:develop Jan 2, 2024
1 check passed
@PlamenDoychev

Thanks @ameowlia!

PlamenDoychev added a commit to PlamenDoychev/community that referenced this pull request Mar 9, 2024
[DIEGO-RELEASE] Add locket client keepalive time and timeout to jobs: cloudfoundry/diego-release#722
[DIEGO-RELEASE] Make max_containers prop configurable: cloudfoundry/diego-release#876
[BBS] Add request metrics for BBS - still open: cloudfoundry/bbs#80
[BBS] Use scheduling info instead of the whole desiredLRP: cloudfoundry/bbs#79
[BBS] Add routing info endpoint: cloudfoundry/bbs#66
[BBS] Remove cpu_weight limit: cloudfoundry/bbs#81
[EXECUTOR] Improve error handling on process start - still open: cloudfoundry/executor#91
[LOCKET]  Add a keepalive timeout on the locket client: cloudfoundry/locket#12
[GROOTFS] UsedVolumesSize and UsedStoreInBytes metrics: cloudfoundry/grootfs#155
[ROUTE-EMITTER] Use routing info bbs endpoint when syncing: cloudfoundry/route-emitter#23
[ROUTE-EMITTER] Use routing_info for desired_lrp's when there are missing actual_lrp's: cloudfoundry/route-emitter#26
[SILK-RELEASE] Deduplicate Iptables Rules with Dynamic ASG's: cloudfoundry/silk-release#101
[SILK-RELEASE] Make container_metadata_file_check_timeout on silk-shutdown configurable: cloudfoundry/silk-release#111