panic: interface conversion: interface is nil, not balancer.SubConn goroutine 326 [running] #6453
Comments
|
This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed. |
We also encountered this issue during use, using gRPC version 1.45
|
Line 30 in a758b62
grpc-go/balancer/base/balancer.go Line 124 in a758b62
|
@dfawley PTAL. I can provide more information |
Can you provide a minimal reproduction for this? |
We're seeing a similar panic coming from
|
I looked at the stack trace from @dylan-bourque, but I still can't see how it's possible. The only thing I could see is either If there is a way I can reproduce this, then I can figure out what is going on and fix it. Otherwise, I just don't see what the problem might be. |
Actually, looking at this closer, I do see how 1.58.* can be affected with a very small race between creating a new connection and shutting down or entering idle. I'll send a fix for that soon. I don't know about the reports from 1.45 or 1.46 however. |
Thanks @dfawley. I can have our teams test/validate a fix whenever you have one ready. |
@dylan-bourque - actually, now that I've looked even further, the race that I saw isn't actually possible. What I thought was happening was that the So, back to square one. We will definitely need a repro case to debug this any further, sorry. |
@dfawley sorry for the delay; it was China's National Day and I was on vacation. We have only encountered this issue once when using the gRPC balancer, and we are currently unsure how to reproduce it. But in fact, we would only need to check the result of the type assertion to avoid this problem, don't you think?
|
I searched our logs and did not find any similar error messages in either the testing or production environments. The problem has occurred only once, in production, so I am currently unsure how to reproduce it. But it caused the program to panic, which made some of our requests fail. |
Adding in defensive programming will prevent a panic, but it also covers up an issue that should be impossible to occur unless there is some other problem which could lead to outcomes even worse than a panic (e.g. a dead client not connected to any addresses). I'd rather leave in the potential panic to help us uncover and fix any bugs like that, but from code inspection, I don't see how anything like that is possible here. |
Yes, but I searched past logs and did not find any similar issues. I'm not sure how to reproduce this problem at the moment. I have only encountered this issue once since using gRPC, and other colleagues have not reported this issue. If it is an obvious code issue, we will quickly discover it, but depending on the frequency of the problem, it may require an extreme environment to occur. 😞 If we don't use some means to avoid it, program exit will cause some requests to fail, which is a situation we don't want to see. |
Due to the lack of relevant help logs, it is difficult for us to troubleshoot the issue solely through the stack information of the panic. Can we consider adding defensive programming and increasing log printing in abnormal situations, such as printing the current |
We have another team that's reported the same panic in |
Possibly relevant detail: we do have a custom, client-side LB implementation that's broken several times over the last few releases as the internals of the LB code have changed. I see quite a few changes inside of the |
Is it possible it's your custom LB code that is causing the panics here? Note that our custom LB support is currently still marked as experimental, meaning breaking changes and deprecation are not unexpected.
The problem is the failure mode can be even worse if we do this. If a server crashes, it's usually just restarted automatically. If a bug happens that makes it not have any open connections, then it will be stuck. |
I've thought about that but our code is not in any stack trace for these crashes.
I'm well aware. We've been bitten by this 3 or 4 times already, including one where we had to do an almost complete rewrite. To be fair, there's very little documentation around the LB stuff and nothing that says "the behavior of SubConn is changing from X to Y" so I can only take guesses about what might be going on based on digging through the gRPC code. For now we have service owners using |
If you have use cases that aren't served without using the experimental APIs then feel free to file another issue to discuss, and we'll see what we can do. |
@dfawley I have no objection to exiting the program to ensure correct logic, but can you provide users with some hooks to run before exiting, or print detailed debug information before exiting for troubleshooting purposes? |
I'm not sure what you have in mind here. Can you give an example? Are you using a custom LB policy? I do see https://github.com/go-kratos/kratos/blob/main/transport/grpc/balancer.go near the code from your stack trace, and it's possible it's the source of the problem. |
@dfawley we got a slightly different error today which may help pin this down (at least for us).
Our custom LB is calling The read is here and there are two writes to that map here and here. My guess, based on this latest panic, is that there's a concurrent call to |
If that's true then your custom LB policy is violating the requirements of the API: https://pkg.go.dev/google.golang.org/grpc/balancer#Balancer
You should only call into this LB policy from your LB policy in response to the above calls (since they are guaranteed to be synchronous), or you must ensure you have other synchronization in place to guarantee the above requirement is met. Note that I suspect this is also what's happening in the kratos LB policy, though I don't have the time to fully understand that system to determine whether that's true for sure. |
Thanks for the info. I'll see what updates we can make to our code. I have to mention, though, that our code has been running as-is in dozens of services and 100s of nodes without issues for years, including after I personally did a major overhaul in 2020 to make it adhere to the "new" client-side LB framework. The panics are new since we've upgraded to grpc@v1.58.x. It's certainly possible that we've just never run into this crash before by sheer luck, but I doubt it. 😞 Before I start digging, what is the "proper" way for my LB to tell gRPC to release a |
Looks like the only way to use
Not sure what the right path forward should be. |
The deprecation is support for multiple addresses per SubConn, not the method itself.
If you don't care about the state updates of that subchannel, then you can ignore all of that. If you do need the updates, then you'll need to wrap the For an example of this, you can look at where we do something similar here: grpc-go/internal/balancer/gracefulswitch/gracefulswitch.go Lines 341 to 343 in 7765221
|
@dfawley thanks for the additional info and pointers. I'm making some progress but there's still a gap. I can't figure out how to poke the base balancer to create a new sub-conn after I call The end result I need is for a sub-conn to be closed/discarded and a new one created. Is that not possible using the current API? |
You should not be calling
The current API allows you to do just about anything, but the base balancer is very limited. If you need more control than what the base balancer allows, you'll want to just do everything yourself from scratch and not use the base balancer. Basically, what the base balancer does is connect to every address it is given, and then it calls your picker builder to allow you to write the logic to determine which connection to use for which RPCs. If you don't want to connect to all addresses, you could filter out the ones you don't want to connect to, but at some point you aren't gaining much by using the base balancer, and would be better off doing everything from scratch. I still would like to hear about your use cases and learn why the LB policies we provide out of the box are insufficient, as using these APIs at all is not recommended due to their experimental status and planned upcoming breakages. |
This would work perfectly if there was a way for me to pass a new list of addresses at runtime, but that doesn't seem to be possible.
The specific use case we have is redistributing load when the back-end is scaled up or down behind a load balancer (NLB or ALB). Client connections are "sticky" so that once a connection is established all requests from that client go to the same node on the back-end. That becomes a problem for us when the clients are themselves long-running services. Assume 6 clients, A, B, C, D, E, and F, connecting to service S that's running on 10 nodes behind a LB. If we scale S up to 20 nodes, none of the traffic from those existing clients gets spread to the new nodes unless a connection to one of the original nodes is broken. What we've done is implement a client-side LB policy that creates N connections by copying the address passed into Like I mentioned before, this code has been running without issue for a very long time. The original implementation pre-dates me and I refactored it to work with the V2 client-side balancer APIs in 2020. Other than that, though, this logic has worked for us since it was put in place 5+ years ago. It seems I'll have to build a from-scratch custom balancer unless you can suggest an out-of-the-box policy that will solve for redistributing load across new nodes after a scale-up behind a NLB/ALB. |
I'm not sure what you mean by "at runtime", but if you mean "asynchronous with the channel(/ As for your use case.. What name resolver are you using? Just DNS or something custom? It sounds like you have just one address that you actually connect to (the LB), and that creates connections to the backend servers? If I'm understanding you correctly, we have a very similar architecture for Google's public cloud gRPC services. The solution they are using today is to use a connection pool of gRPC channels instead of doing anything inside the channel itself, and doing round robin across the channels. You can find their implementation here (and potentially use it for yourself directly). The only thing this is missing is the "reconnect periodically" functionality you have; we typically recommend implementing that by setting a max connection age limit on the gRPC server instead: https://pkg.go.dev/google.golang.org/grpc/keepalive#ServerParameters. |
What version of gRPC are you using?
1.46.2
What version of Go are you using (go version)?
1.19
What operating system (Linux, Windows, …) and version?
Linux
What did you do?
A panic occasionally occurs in the production environment.
What did you expect to see?
The type assertion should be safe and not panic.
What did you see instead?
In the function regeneratePicker in /balancer/base/balancer.go, if b.subConns.Get(addr) returns nil, the type assertion that follows panics instead.
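For readers unfamiliar with this failure mode, the sketch below reproduces the general mechanism outside of gRPC: a single-value type assertion on a nil interface value panics, while the two-value ("comma ok") form does not. `SubConn`, `fakeSubConn`, and the helper functions are hypothetical stand-ins, not the grpc-go types. Note that, as discussed in the comments, the maintainers consider the checked form a way of hiding an underlying bug rather than a fix.

```go
package main

import "fmt"

// SubConn is a hypothetical stand-in for balancer.SubConn.
type SubConn interface {
	Connect()
}

// fakeSubConn is a trivial implementation used only for this example.
type fakeSubConn struct{}

func (fakeSubConn) Connect() {}

// uncheckedAssert uses the single-value form, which panics if v is nil or
// does not implement SubConn.
func uncheckedAssert(v any) SubConn {
	return v.(SubConn)
}

// checkedAssert uses the two-value form, which reports failure via ok
// instead of panicking.
func checkedAssert(v any) (SubConn, bool) {
	sc, ok := v.(SubConn)
	return sc, ok
}

// panicMessage runs f and returns the panic message, or "" if f did not panic.
func panicMessage(f func()) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	f()
	return ""
}

func main() {
	var missing any // nil interface, like a lookup that found nothing

	// Should print a message similar to the one in the issue title.
	fmt.Println(panicMessage(func() { uncheckedAssert(missing) }))

	_, ok := checkedAssert(missing)
	fmt.Println("checked assertion on nil succeeded:", ok)

	_, ok = checkedAssert(fakeSubConn{})
	fmt.Println("checked assertion on a real value succeeded:", ok)
}
```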