-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
google-cloud-compute: ImagesClient does not terminate, and hangs instead #10220
Comments
I've also dug a bit into the library code and it might have something to do with gax? |
Probably not the core of the problem, but the code example is not using try-with-resources: |
I'm able to reproduce this. The regression was introduced in google-cloud-compute:1.27.0 in May, corresponding to libraries-bom:26.15.0. @lqiu96 I confirmed that this is the PR that caused the regression: googleapis/sdk-platform-java#1603 |
Hi @Martiul, thank you for reporting this. I followed the steps that you listed above and was able to reproduce this locally. With or without the client initialized in try-with resources, the client isn't able to fully close properly and the program isn't able to terminate. Thanks @meltsufin for pointing to my previous PR, I reverted gax-httpjson back to the version before the PR (v0.111.0) and was able to confirm that the program was able to terminate properly. My repro
Output:
It's a different error regarding dynamic routing headers, but I believe it shows that the program was able to terminate properly. I checked the RPC's settings and saw that it was configured with a default timeout value of 10 minutes (as defined in the RetrySettings here and then configured here). I believe this value determines when the program would actually terminate and the spring shell timeout is probably just a coincidence that matches. My previous PR introduced the concept of terminating the call if the call's duration has exceed the total timeout configured for the RPC. This is done via a "deadline schedule executor" which just schedules a timeout task once the configured timeout elapsed. The deadline scheduler's timeout tasks is scheduled for I believe the issue is with the deadlineScheduler's I ran a test locally which cancels the timeout task once a response has been received and was able to see the client terminate properly:
I believe the output above should be the intended result. I think there should be a simple fix for this, but I'll do some more investigating to confirm. |
yes. I split the example into try-catch-finally so it's easier to debug |
thanks @lqiu96 for the deepdive on the scheduler, i was also under the impression that there was some unfinished job somewhere. Looking forward to the fix |
I added a draft POC of a possible fix: googleapis/sdk-platform-java#2360. I'm taking a look into grpc-java (httpjson is mostly a mirror/ inspired from there) and other possible implementations. CC: @blakeli0 |
@Martiul @meltsufin @suztomo, I added |
Fixes: googleapis/google-cloud-java#10220 Currently, the executorService will wait for any previously submitted task to finish execution before being able to terminate. This PR attempts to fix the issue by cancelling the outstanding timeout task. If a response has been received prior to the timeout, the timeout task will be cancelled and the client should be able to terminate immediately afterwards. This fixes an issue for Clients that have RPCs with a large timeout value (i.e. 10 min). The client would wait 10 minutes before being able to terminate completely. ## Local Testing Running this with SNAPSHOT Gax version (2.41.1-SNAPSHOT) and latest Compute from Libraries-Bom v26.31.0 Logs: ``` } at com.google.api.client.http.HttpResponseException$Builder.build(HttpResponseException.java:293) at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1118) at com.google.api.gax.httpjson.HttpRequestRunnable.run(HttpRequestRunnable.java:115) ... 6 more Process finished with exit code 1 ``` Running this with latest Gax version (fix not included) and latest Compute from Libraries-Bom v26.31.0: ``` } } at com.google.api.client.http.HttpResponseException$Builder.build(HttpResponseException.java:293) at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1118) at com.google.api.gax.httpjson.HttpRequestRunnable.run(HttpRequestRunnable.java:115) ... 6 more ``` Missing the termination and exit code. It shows up after 10 minutes.
Fixes: googleapis/google-cloud-java#10220 Currently, the executorService will wait for any previously submitted task to finish execution before being able to terminate. This PR attempts to fix the issue by cancelling the outstanding timeout task. If a response has been received prior to the timeout, the timeout task will be cancelled and the client should be able to terminate immediately afterwards. This fixes an issue for Clients that have RPCs with a large timeout value (i.e. 10 min). The client would wait 10 minutes before being able to terminate completely. ## Local Testing Running this with SNAPSHOT Gax version (2.41.1-SNAPSHOT) and latest Compute from Libraries-Bom v26.31.0 Logs: ``` } at com.google.api.client.http.HttpResponseException$Builder.build(HttpResponseException.java:293) at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1118) at com.google.api.gax.httpjson.HttpRequestRunnable.run(HttpRequestRunnable.java:115) ... 6 more Process finished with exit code 1 ``` Running this with latest Gax version (fix not included) and latest Compute from Libraries-Bom v26.31.0: ``` } } at com.google.api.client.http.HttpResponseException$Builder.build(HttpResponseException.java:293) at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1118) at com.google.api.gax.httpjson.HttpRequestRunnable.run(HttpRequestRunnable.java:115) ... 6 more ``` Missing the termination and exit code. It shows up after 10 minutes.
Environment details
API: google-cloud-compute
OS type and version: MacOS Sonoma 14.2.1
Java version: 21 (temurin)
Version(s): 1.43.0
Steps to reproduce
ImagesClient
Expected behaviour: The
get()
fails because of a lack of permissions and the program terminates.Actual behaviour: The
get()
fails because of a lack of permissions and the program does NOT terminate. It just hangs. (Indicated by the IntelliJ Run toolbar's STOP button still being clickable)Code example
Stack trace
NA
External references such as API reference guides
NA
Any additional information below
For some context, I am developing a Spring Shell CLI tool, and one of the commands uses the GCE ImagesClient. Before, when a user ran the command successfully, the user would then be able to run another command. This was on
google-cloud-compute
1.13.0
. However, now when the user runs a command using theImagesClient
the command hangs, and the program ends after a 10minute idle timeout set by Spring Shell. I've been able to reproduce the issue on a barebones Maven project.Additionally, I've tried using
imagesClient.awaitTermination(10, TimeUnit.SECONDS);
but this command still hangs.Furthermore, I've used
imagesClient.isTerminated()
and have observed that in the previous versions, it would return true. But now it returns false.The text was updated successfully, but these errors were encountered: