spanner: Root cause of "Error: Unable to release unknown resource" #716
Comments
Error origin analysis: perhaps one possible scenario that causes this error?
|
@codemart786 You could post the requested information directly here. |
@mf2199 I am using nodejs-spanner version 4.0.2. Is there any other info you need? @AVaksman Actually, this error comes up a lot in the prod environment (not exactly sure under what circumstances). I am not able to reproduce it in my test environment. I am not using any session pool configuration in the code; everything is left on the defaults.
Thanks :) |
Would you kindly provide some more details on your environment? Environment details
More specific details (if code snippet is not available)
Thanks |
Environment Details
PS: By mistake I earlier reported the Spanner version to be 4.0.2. We are not using spanner@4.0.2 anywhere in prod. My bad. More specific details
Thanks |
@AVaksman @mf2199 Any updates here? There is one additional question I need to ask: will updating the client library to the latest version help resolve this? Thanks |
@codemart786 Yes, please try updating to the latest version and verify if you are still experiencing the issue. Thanks |
I also had a quick look through the logic. As @AVaksman mentioned earlier, you run into this issue if the borrowed session was deleted before you attempt to release it, because it:
Without knowing much about your code, it's hard to know which of these situations you're running into. We can probably rule out (1), since you said that you're using the default options. By default we call keepAlive every 50 minutes. We should be pinging idle sessions periodically as part of the housekeeping logic, so keepAlive() would be called; it's possible that some of these calls fail for some reason. Can you check your logs or stats to see if this is happening? For (3), I can't say without seeing your code. |
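For reference, a minimal sketch of how the keep-alive and pool sizing options can be set when opening the database. The option names (min, max, keepAlive, idlesAfter) come from the library's session pool options; the project/instance/database IDs and the values are illustrative, not the defaults.

```js
// Sketch: overriding session pool housekeeping options when opening the database.
// IDs and numbers are illustrative, not recommendations or library defaults.
const {Spanner} = require('@google-cloud/spanner');

const spanner = new Spanner({projectId: 'my-project'}); // hypothetical project ID
const instance = spanner.instance('my-instance');       // hypothetical instance ID
const database = instance.database('my-database', {
  min: 25,        // sessions to keep ready in the pool
  max: 100,       // hard cap on sessions this client may hold
  keepAlive: 50,  // minutes between keep-alive pings of idle sessions
  idlesAfter: 10, // minutes of inactivity before a session counts as idle
});
```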
@AVaksman @mf2199 @callmehiphop @skuruppu So, there are certain updates on this issue which I would like to note here: there was a bug in our code related to r/w transactions. We were not calling transaction.end() at places where we decided to forego commit/rollback. But after fixing this too, we are still seeing the above-mentioned error logs. Having said that, another issue I would like to mention is that sometimes the number of Spanner database sessions just shoots up. This happens at times of peak traffic. So we decided to tweak the session pool configuration.
This configuration stabilised our application for a certain time period, but at very high load it also breaks and the API(s) start giving timeouts again. There is one more observation: when the Docker container shuts down, during the grace period our process tries to close the database connection by calling database.close().
Can someone here help explain this behaviour? I am not sure whether all these things are due to something we are missing or doing wrong in our code. What is going wrong here? PS: The types of operations we are performing with Spanner are already mentioned above; no new type of operation is being added. Any help in this regard is deeply appreciated. :) |
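To illustrate the fix described above, a hedged sketch of a read/write transaction that decides to forego commit/rollback and therefore calls transaction.end() to return the session to the pool. It assumes a `database` handle like the one in the earlier sketch; the table, SQL, and business rule are made up.

```js
// Sketch: releasing the session when a read/write transaction is abandoned
// without commit or rollback. SQL and the branching condition are illustrative.
database.runTransaction(async (err, transaction) => {
  if (err) {
    console.error(err);
    return;
  }
  try {
    const [rows] = await transaction.run('SELECT balance FROM Accounts WHERE id = 1');
    if (rows.length === 0) {
      // Foregoing commit/rollback: end() hands the session back to the pool.
      transaction.end();
      return;
    }
    await transaction.runUpdate({
      sql: 'UPDATE Accounts SET balance = balance - 10 WHERE id = 1',
    });
    await transaction.commit(); // commit (or rollback) also releases the session
  } catch (e) {
    await transaction.rollback();
  }
});
```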
@codemart786 For the session leak error, you can actually inspect the stack trace for each leak found:

```js
try {
  // Closing the database deletes the pooled sessions; if any sessions are
  // still checked out at that point, a SessionLeakError is thrown.
  await database.close();
} catch (e) {
  if (e.name === 'SessionLeakError') {
    // One stack trace per leaked session, captured when it was checked out.
    e.messages.forEach(stackTrace => console.log(stackTrace));
  }
}
```
|
@callmehiphop Thanks for the update here, it definitely helps. No, we are creating different instances from different services. For each Docker container there is a singleton pattern in our code: one instance per container is created while the server comes up. Having said that, roughly 60 instances are created against the database at a time.
I would also like to mention some details we noticed in our application. The CreateSession and DeleteSession API latencies shoot up whenever the session count increases. At the same time, our Spanner instance's CPU health looks perfectly fine (less than 20% CPU utilisation). We cannot find any reason for this behaviour. I will try out the SessionLeakError stack trace inspection suggested above.
One thing that is bugging me is that the session count is abnormally high for our application; our application goes down because of this at peak times. Is there something wrong we are doing, or are we missing something very crucial here? Based on your past experience, are there some obvious scenarios that could lead to Spanner session leaks, very high session counts, or very high CreateSession/DeleteSession API latencies? PS: I don't think we receive the kind of scale for which so many sessions are required. Sometimes the session count increased to 1.5 million. We have currently set up 2 Spanner nodes; we increased it from 1, hoping this might help tackle the API latency issue, but it was of no help. Any help is deeply appreciated :) Thanks.
Below is the metric chart for Spanner API latencies: in some cases the latencies reach 10 minutes; in the chart, the latency is around 4.5 minutes. Light blue line: DeleteSession API requests. |
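For context on the "one instance per container" setup described above, a hedged sketch of a module-level singleton Database handle; the file name, IDs, and pool values are made up.

```js
// db.js -- hypothetical module exporting a single Database handle per process,
// so each container holds at most one session pool.
const {Spanner} = require('@google-cloud/spanner');

let database; // module-level singleton, created lazily on first use

function getDatabase() {
  if (!database) {
    const spanner = new Spanner({projectId: 'my-project'}); // illustrative IDs
    database = spanner
      .instance('my-instance')
      .database('my-database', {min: 25, max: 100});
  }
  return database;
}

module.exports = {getDatabase};
```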
@AVaksman @skuruppu @callmehiphop @mf2199 Hey guys, a few things I want to ask here related to this issue. We did load testing in our staging environment and saw the session count increase from 3k to 100k within 30 minutes. Having said that, here are some of the doubts I would like to clarify:
Please help me understand when the @google-cloud/spanner node module throws this stack trace. If I know in which direction to proceed, it would be very helpful. PS: Our application has suffered several production outages, which I believe are mostly due to this issue. Please help me solve it. Any help is deeply appreciated :) Thanks |
We're looking into your issue based on our offline conversation. But to answer some of your questions:
No, you don't have to call transaction.end() if you have already called commit() or rollback().
Interesting, it seems that the stack trace is incomplete.
|
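As background for the question about when this stack trace is thrown: a simplified toy model (not the library's actual session-pool code) of why "Unable to release unknown resource" appears. The pool raises it when asked to release a session it no longer tracks as borrowed, for example one that was already released or deleted.

```js
// Toy model only -- NOT the real session pool implementation.
class ToySessionPool {
  constructor() {
    this.borrowed = new Set();
    this.available = [];
  }
  acquire() {
    const session = this.available.pop() || {id: Date.now()};
    this.borrowed.add(session);
    return session;
  }
  release(session) {
    // Releasing a session the pool does not consider borrowed (already
    // released, evicted, or deleted) is what triggers the error in question.
    if (!this.borrowed.has(session)) {
      throw new Error('Unable to release unknown resource.');
    }
    this.borrowed.delete(session);
    this.available.push(session);
  }
}

// Releasing the same session twice reproduces the error in this toy model:
const pool = new ToySessionPool();
const session = pool.acquire();
pool.release(session);
pool.release(session); // throws: Unable to release unknown resource.
```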
"We did load testing in our staging environment, and have seen session count increasing from 3k to 100k within 30 mins." What is the pool settings in your staging environment? How many instances are you running for your load test? The upper-bound number of sessions you will see in the GCP console = For example, if you see the number of total sessions reaches 1.5 million, because you set 150k for each instance and there are 10 running instances. When there are 1.5 million sessions with only 2 spanner nodes, I guess the latency would become much higher if you still try to create more sessions or delete sessions.
Can you try to lower the max value in your session pool configuration? |
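A small sketch of the arithmetic described above, using the pool's max option as the per-client cap; the numbers mirror the 150k × 10 example, and the lowered value at the end is illustrative (it assumes the `instance` handle from the earlier sketch).

```js
// Upper bound on sessions visible in the GCP console, per the comment above:
//   total sessions <= (max sessions per client instance) * (running client instances)
const maxSessionsPerClient = 150000; // the session pool's `max` option on each client
const runningInstances = 10;         // e.g. containers/pods each holding one client
console.log(maxSessionsPerClient * runningInstances); // 1500000 -- the 1.5 million observed

// Lowering `max` when opening the database caps the sessions each client can create:
const database = instance.database('my-database', {max: 400}); // illustrative value
```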
The reason for the above is the problem described in this document. The TL;DR version is that 'developers are currently facing the problem that the (non-standard) Error.stack property in V8 only provides a truncated stack trace up to the most recent await.' The stack trace for a checked-out session is captured here: nodejs-spanner/src/session-pool.ts, line 484 in b891a81.
Which means that the stacktraces will currently not contain the call stack of the application. See also nodejs/node#11865. |
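A tiny standalone sketch of the V8 behaviour quoted above: a stack captured after an await does not include the application frames that called into the async function (on runtimes without async stack traces), which is why the leak stack traces looked incomplete. The function names are made up.

```js
// Demonstrates the truncated stack described above (behaviour depends on the
// Node.js/V8 version and whether async stack traces are enabled).
async function checkoutSession() {
  await Promise.resolve();          // stand-in for awaiting a session from the pool
  return new Error('leak marker');  // stack is captured *after* the await
}

async function applicationHandler() {
  const err = await checkoutSession();
  // Without async stack traces, err.stack ends at checkoutSession()'s most
  // recent await and never mentions applicationHandler().
  console.log(err.stack);
}

applicationHandler();
```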
Hi @hengfengli, we did the load testing in the staging environment. Having said that, is there any possibility that the Node.js Spanner client is not reusing sessions? Thanks @olavloite for sharing the doc, will give it a read. Thanks |
Yes. If transaction.end() or transaction.commit()/.rollback() are not called, it certainly causes the session to leak. Also, transaction.end() needs to be called for read-only transactions. Originally, I thought it was an issue in the session pool (#750), but it turns out that my tests were not fair. I still believe that sessions are leaked somewhere. Another guess is that some transactions have very high latencies, which hold sessions for a long time, so that others do not get a chance to get an idle session.
This is because you set a very high max value for the session pool. What I can think of is to check and observe the status of the local session pool, like what I did in #750. I will continue to do some load tests to try to find out what the root cause is. |
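To illustrate the read-only case mentioned above, a hedged sketch of a snapshot (read-only transaction) that is explicitly ended so its session returns to the pool; the table and query are made up.

```js
// Sketch: a read-only transaction has no commit/rollback, so end() is what
// returns its session to the pool.
async function readAccounts(database) {
  const [snapshot] = await database.getSnapshot();
  try {
    const [rows] = await snapshot.run('SELECT id, balance FROM Accounts');
    return rows;
  } finally {
    snapshot.end(); // release the session even if the query throws
  }
}
```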
Thanks @olavloite for fixing #755. We published version v4.4.1 which should contain the stack trace printing fix. |
Hi @codemart786 If I understand you correctly, you are (sort of) able to reproduce this problem on your staging environment, right? If so, would it be possible for you to try the following:
If you could then inspect the stack traces (for example, the SessionLeakError messages shown earlier), that should help pinpoint where the sessions are being leaked. |
Thanks for the suggestion. I will give this a try and test it in the staging environment. Will keep you guys updated about the same. Thanks |
@codemart786 let us know if the issues are fixed. We should close this bug if there are no further related topics to discuss. For any new issues, please open a separate issue. |
@skuruppu @olavloite @hengfengli Hey guys, yes, we are good to close this issue. Thanks everyone for the suggestions. We made many changes in response to the abnormal session pool growth. As of now, the system is stable. |
Perfect, thanks for the update @codemart786. Glad to hear that the system is stable now. Please feel free to open issues in this repo if you come across any further problems or questions. |
From Getting “Error: Unable to release unknown resource.\n at SessionPool.release”: