DNS round-robin based on SRV weight/priority #1088
This is an interesting idea, and is definitely possible. If you want to start a design document to sketch out the idea and how it would work more formally, that could be helpful. |
I think we may use rfc2782 as a starting point for the implementation:
The default priority value is 0. I think it would be useful to have a separate config option for it, something like "srv_priority" in the "dns_config" section; a sketch follows below. A nice side effect of srv_priority would be the ability to use it as a kind of grouping. Say we need a group of servers to test a new feature: we set srv_priority=1 for them. Now, when a server with srv_priority set asks Consul for the SRV records of a .service. group, and that group has servers with the same srv_priority, Consul will return only the servers with that srv_priority in the DNS reply. If there are no servers with the same srv_priority, or srv_priority isn't set, Consul will return all available servers in the DNS reply for the .service. request. For the weight value, it would be nice to have an additional check definition, because such a check doesn't indicate whether a node is alive or dead: it runs the same way as a script check, but the returned value is used as the weight for the SRV record, so it can take more values than just 0, 1, and ">1". Once we have all the values in one place, we can use the algorithm described in the RFC to calculate the order. We need the ability to set servers that are both more and less preferable by weight than servers with the default weight. Say the default is 100 (the RFC suggests 0, but then we couldn't lower a weight below the default); if we reduce a selected server's weight compared to the others at the default value, its weight becomes roughly the percentage chance of it being contacted. |
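As a rough illustration only, here is a minimal sketch of what such an agent configuration could look like; the `srv_priority` option is the proposal above, not an existing Consul setting:

```
{
  "dns_config": {
    "srv_priority": 1
  }
}
```

Services registered by an agent with this configuration would then advertise priority 1 in SRV replies, following the grouping semantics described in the comment above.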
Hi, I tested DNS SRV today with some phone integrations. Weight + priority is needed in the VoIP world. I'm using Consul to discover some voice servers, and it's awesome, but in the VoIP world we need to send the calls always to the same "endpoint". I mean, I'm using this for a voice conference, so the first caller joins the conference on server A, and the second caller should land on the same server. So we're using DNS-SRV for HA purposes. Our manually managed DNS-SRV records look like the example below:
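For readers unfamiliar with the format, a priority-based failover setup of this kind typically looks as follows; the hostnames, TTLs, and ports are illustrative, not the poster's actual records:

```
; every client prefers voice-a (priority 10); voice-b (priority 20)
; is only contacted when voice-a is unreachable
_sip._udp.voice.example.com. 300 IN SRV 10 100 5060 voice-a.example.com.
_sip._udp.voice.example.com. 300 IN SRV 20 100 5060 voice-b.example.com.
```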
Could we add here a priority based on when the node joins the service? Does that work for you? Do you have any other approach? I can spend time on this; it's a priority to get this project done, so if you point me in the right direction I can start to code ;-) Regards |
@eloycoto I think unfortunately to do this the right way, we need to support arbitrary K/V attributes on nodes and services. This lets us much more cleanly support something like "dns_weight=2" and then have that parsed and respected. Anything built on the existing API would be a huge hack like "dns-weight-2" tag, which I'd oppose. So unfortunately, I think to do it right requires a lot more rethinking of things outside the scope of this one feature. We want to get there, but it will take a little more time for us to firm up the foundation. |
Hi @armon, makes sense, it's a big change. Ping me if you need help; I can spend time on this + QA time. Regards |
@eloycoto Out of curiosity how did you do the |
+1 Or is this use case in scope of the existing configuration? For example: keep the "master" tag on only one host; an algorithm to choose the proper one; migrate the tag, instantly, when the current "master" fails. |
+1. I would like to use SRV based RR for XMPP. |
+1 for this. Is there any timeline on when this is going to come out? |
@epcim about HA environments, nowadays I'm doing like this https://www.youtube.com/watch?v=t3O5b2sweYs Regards |
Consul's DNS SRV is a great idea. How are the weight and priority determined? When a new node registers a service, how can it set the weight/priority? Can those be modified later on the Consul server? |
@gfrankliu weights are currently not supported. Right now Consul randomizes the results of DNS queries for load balancing, and removes nodes with failing health checks, but does not allow you to set the weights and priorities. |
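For anyone who wants to observe that behavior, SRV records can be queried against the agent's DNS interface; the service name `web` and the default DNS port 8600 are assumptions about a typical local setup:

```sh
# Ask the local Consul agent for the SRV records of the "web" service;
# repeated queries return the healthy instances in randomized order.
dig @127.0.0.1 -p 8600 web.service.consul SRV
```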
@slackpad that's too bad; it makes DNS SRV less useful. Is the support on the roadmap? I guess the workaround is not to use DNS but to use HTTP, and create tags to store "weight" etc. |
Yes, I'm not sure of the current timeframe, but we'd like to add this. Will keep this issue updated! |
+1 |
+1 |
@slackpad any update on this? I'm guessing there hasn't been any progress in this area. |
+1 |
+1 |
+1 |
+1 |
+1 |
+1 |
+1 |
+1 |
+2 |
Please everyone stop using +1 comments; use GitHub reactions instead. |
@majormoses I've come to the conclusion that these kinds of requests don't work. People will use whatever they want to register their interest, and personally I don't care, since it is the engagement that counts. I find it more important to see that there is still demand. Also, one difference is that we do get notified on tickets when someone adds a +1 comment, whereas reactions don't trigger notifications. |
+1 - We would like to use it with our DB cluster (1 master, 3 slaves). |
+1 |
* Implementation of Weights data structures. Adding this data structure will allow us to resolve issues #1088 and #4198. The new structure defaults to the values:

```
{
  Passing: 1,
  Warning: 0
}
```

which means: use a weight of 0 for a service in warning state and a weight of 1 for a healthy service. Thus it remains compatible with previous Consul versions.
* Implemented weights for DNS SRV records
* DNS properly supports agents with weight support while the server does not (backwards compatibility)
* Use a default Warning weight of 1. When using the DNS interface with only_passing = false, all nodes with a non-critical health check used to have a weight of 1. Keeping weights.warning = 0 as the default would break backward compatibility, so we use a default of 1 to stay consistent with the existing behaviour.
* Added documentation for the new weight field in the service description
* Better documentation about weights, as suggested by @banks
* Return weight = 1 for unknown check states, as suggested by @banks
* Fixed typo (of -> or) in error message, as requested by @mkeeler
* Fixed unstable unit test TestRetryJoin
* Fixed unstable tests
* Fixed wrong Fatalf format in `testrpc/wait.go`
* Added notes regarding DNS SRV lookup limitations on the number of instances
* Documentation fixes and clarification regarding SRV records with weights, as requested by @banks
* Rephrased docs
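For context, a minimal service definition using the weights field this PR introduces might look like this; the service name, port, and values are illustrative:

```
{
  "service": {
    "name": "web",
    "port": 80,
    "weights": {
      "passing": 10,
      "warning": 1
    }
  }
}
```

With these values, a healthy instance is ten times as likely to be picked from an SRV reply as one whose checks are in warning state, while critical instances are excluded entirely.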
I was expecting to see the weights reflected in the DNS reply. p.s.: I understand that the script can produce a WARNING when, for instance, the CPU usage is too high, and I can give the warning a lower weight. |
@maxadamo this is the intended behavior: you can write a script that computes a passing or warning state based on metrics. SRV records expose this, as does the HTTP catalog. However, it is true that DNS A queries don't. But if your LB respects SRV weights, it already works. |
The caveat I'd add is that updating the weights requires Raft commits on the servers, so you should be careful how often that can happen; otherwise it could kill the Consul servers with load once things get busy and frequently update their weights. For example, if every application instance has a script that checks the CPU every 5 seconds and updates the weights, then you might be fine with 50 instances (10 writes/second), maybe even 500 (100 writes/second), but you very quickly get into Consul server scaling issues you may never have seen when you only made changes minutes or hours apart. In general Consul does its best to leave the server state unchanged as long as possible - that's why we only sync the output of a script check periodically (every few minutes) rather than every time the output changes, for example. So dynamic is OK, but watch your update frequency to give your servers a chance! |
Thank you both, and thanks for the comprehensive explanation. The use case is quite interesting: the Puppet server's JRuby can start eating all the resources, and an agent that picks a server during a CPU peak will be slow to complete its run. But since Puppet supports SRV, I can point the agents to the server that has less than 90% CPU usage. |
@maxadamo On our side, what we are doing is setting weights according to the performance of each machine, for example (see the sketch below):
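As a hypothetical illustration of that idea (the service name and numbers are made up), a powerful machine might register with a high passing weight:

```
{
  "service": {
    "name": "api",
    "weights": {
      "passing": 100,
      "warning": 1
    }
  }
}
```

A less powerful machine class would register the same service with a smaller value, e.g. "passing": 30, so it receives proportionally less traffic.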
Then, some scripts at node level check CPU usage; when CPU is above 95% for more than 5 minutes, they switch the check to warning. The big advantage is that load can auto-regulate: an instance that is too heavily loaded receives less traffic and can recover much more quickly, and disruption is lower (fewer requests go to a saturated node, so the service returns more correct answers). Not using critical (which means weight := 0) ensures that even under heavy load on all instances of a service, all instances keep trying to serve it; at worst, all instances are in warning state (think of the case where a DC is saturated by requests). |
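A minimal sketch of such a node-level script check, relying on Consul's script-check convention (exit 0 = passing, exit 1 = warning, any other code = critical); the 95% threshold and the load-average heuristic are assumptions, not the poster's actual script:

```sh
#!/bin/sh
# Approximate sustained CPU usage from the 5-minute load average,
# normalized by core count, to capture "above 95% for more than 5 min".
cores=$(nproc)
load5=$(cut -d ' ' -f 2 /proc/loadavg)
usage=$(awk -v l="$load5" -v c="$cores" 'BEGIN { printf "%d", l / c * 100 }')

if [ "$usage" -ge 95 ]; then
    echo "CPU usage ~${usage}% over the last 5 min: degrading to warning"
    exit 1   # warning: node keeps serving, but with its lower warning weight
fi

echo "CPU usage ~${usage}% OK"
exit 0       # passing: full weight
```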
@pierresouchay about the p.s.: I can't set it to critical; I wrote something wrong there. If both servers go critical, the service becomes unavailable. |
Priority is not weight; Consul lets you set the weight, not the priority. |
@pierresouchay I copy-pasted without checking. You're right. |
from RFC 2782: "Priority: The priority of this target host. A client MUST attempt to contact the target host with the lowest-numbered priority it can reach; target hosts with the same priority SHOULD be tried in an order defined by the weight field." |
It really depends on the implementation. Weights might be used by DNS SRV, but also by systems interacting directly with the Consul HTTP API (this is what we do). |
Since Consul already has SRV record support, is it possible to implement the following RR scheme: