Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

google-cloud-compute: ImagesClient does not terminate, and hangs instead #10220

Closed
Martiul opened this issue Jan 11, 2024 · 8 comments · Fixed by googleapis/sdk-platform-java#2360
Assignees
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@Martiul
Copy link

Martiul commented Jan 11, 2024

Environment details

API: google-cloud-compute

OS type and version: MacOS Sonoma 14.2.1
Java version: 21 (temurin)
Version(s): 1.43.0

Steps to reproduce

  1. Add the following maven dependency to the pom.xml in a barebones Maven project:
<dependency>
  <groupId>com.google.cloud</groupId>
  <artifactId>google-cloud-compute</artifactId>
  <version>1.43.0</version>
</dependency>
  1. Try to use ImagesClient
 try (ImagesClient imagesClient = ImagesClient.create(ImagesSettings.newBuilder().build())) {
    imagesClient.get(GetImageRequest.newBuilder()
        .setProject("foo")
        .setImage("bar").build());
  }
  1. Run Main in IntelliJ (no permissions are required)

Expected behaviour: The get() fails because of a lack of permissions and the program terminates.

	at com.google.api.gax.httpjson.HttpRequestRunnable.run(HttpRequestRunnable.java:118)
	... 6 more

Process finished with exit code 1

Actual behaviour: The get() fails because of a lack of permissions and the program does NOT terminate. It just hangs. (Indicated by the IntelliJ Run toolbar's STOP button still being clickable)

	at com.google.api.gax.httpjson.HttpRequestRunnable.run(HttpRequestRunnable.java:118)
	... 6 more

Code example

public class Main {
  public static void main(String[] args) throws Exception{
    testGce();
   }
   
   public static void testGce() {
    Image i = null;
    ImagesClient imagesClient = null;
    try {
      imagesClient = ImagesClient.create(ImagesSettings.newBuilder().build());
    } catch( Exception e) {
      System.out.println("Exception!");
    } finally {
      imagesClient.shutdown();
      imagesClient.shutdownNow();
      System.out.println("Is terminated: " + imagesClient.isTerminated());  // should be true
    }
  }   
 }

Stack trace

NA

External references such as API reference guides

NA

Any additional information below

For some context, I am developing a Spring Shell CLI tool, and one of the commands uses the GCE ImagesClient. Before, when a user ran the command successfully, the user would then be able to run another command. This was on google-cloud-compute 1.13.0. However, now when the user runs a command using the ImagesClient the command hangs, and the program ends after a 10minute idle timeout set by Spring Shell. I've been able to reproduce the issue on a barebones Maven project.

Additionally, I've tried using imagesClient.awaitTermination(10, TimeUnit.SECONDS); but this command still hangs.
Furthermore, I've used imagesClient.isTerminated() and have observed that in the previous versions, it would return true. But now it returns false.

@Martiul
Copy link
Author

Martiul commented Jan 11, 2024

I've also dug a bit into the library code and it might have something to do with gax?

@suztomo
Copy link
Member

suztomo commented Jan 11, 2024

Probably not the core of the problem, but the code example is not using try-with-resources: try (ImagesClient imagesClient = ImagesClient.create(ImagesSettings.newBuilder().build())) { .

@meltsufin
Copy link
Member

I'm able to reproduce this. The regression was introduced in google-cloud-compute:1.27.0 in May, corresponding to libraries-bom:26.15.0.

@lqiu96 I confirmed that this is the PR that caused the regression: googleapis/sdk-platform-java#1603

@lqiu96
Copy link
Contributor

lqiu96 commented Jan 12, 2024

Hi @Martiul, thank you for reporting this. I followed the steps that you listed above and was able to reproduce this locally. With or without the client initialized in try-with resources, the client isn't able to fully close properly and the program isn't able to terminate. Thanks @meltsufin for pointing to my previous PR, I reverted gax-httpjson back to the version before the PR (v0.111.0) and was able to confirm that the program was able to terminate properly.

My repro pom.xml that reverts a previously working version

    <dependency>
      <groupId>com.google.cloud</groupId>
      <artifactId>google-cloud-compute</artifactId>
      <version>1.43.0</version>
    </dependency>
    <dependency>
      <groupId>com.google.api</groupId>
      <artifactId>gax-httpjson</artifactId>
      <version>0.111.0</version>
    </dependency>

Output:

Exception in thread "main" java.lang.NoSuchMethodError: 'com.google.api.gax.httpjson.HttpJsonCallSettings$Builder com.google.api.gax.httpjson.HttpJsonCallSettings$Builder.setParamsExtractor(com.google.api.gax.rpc.RequestParamsExtractor)'
	at com.google.cloud.compute.v1.stub.HttpJsonGlobalOperationsStub.<init>(HttpJsonGlobalOperationsStub.java:368)
	at com.google.cloud.compute.v1.stub.HttpJsonGlobalOperationsStub.create(HttpJsonGlobalOperationsStub.java:337)
	at com.google.cloud.compute.v1.stub.HttpJsonImagesStub.<init>(HttpJsonImagesStub.java:629)
	at com.google.cloud.compute.v1.stub.HttpJsonImagesStub.<init>(HttpJsonImagesStub.java:614)
	at com.google.cloud.compute.v1.stub.HttpJsonImagesStub.create(HttpJsonImagesStub.java:594)
	at com.google.cloud.compute.v1.stub.ImagesStubSettings.createStub(ImagesStubSettings.java:277)
	at com.google.cloud.compute.v1.ImagesClient.<init>(ImagesClient.java:343)
	at com.google.cloud.compute.v1.ImagesClient.create(ImagesClient.java:326)
	at com.google.cloud.compute.v1.ImagesClient.create(ImagesClient.java:318)
	at org.example.Main.main(Main.java:21)

Process finished with exit code 1 <--- I see the exit code for termination

It's a different error regarding dynamic routing headers, but I believe it shows that the program was able to terminate properly.

I checked the RPC's settings and saw that it was configured with a default timeout value of 10 minutes (as defined in the RetrySettings here and then configured here). I believe this value determines when the program would actually terminate and the spring shell timeout is probably just a coincidence that matches.

My previous PR introduced the concept of terminating the call if the call's duration has exceed the total timeout configured for the RPC. This is done via a "deadline schedule executor" which just schedules a timeout task once the configured timeout elapsed. The deadline scheduler's timeout tasks is scheduled for current_time() + rpc_timeout(), which for this RPC is 10m from when the RPC was invoked.

I believe the issue is with the deadlineScheduler's shutdown() call. The javadocs for shutdown() suggests if the task has been added to the scheduler, the task will continue until it's completed. I think the intended behavior is that once we've received a result (either error or response) prior to the timeout, the timeout task should be cancelled and the client should return the response back to the user.

I ran a test locally which cancels the timeout task once a response has been received and was able to see the client terminate properly:

...
      }
    ],
    "status": "PERMISSION_DENIED",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.Help",
        "links": [
          {
            "description": "Google developers console API activation",
            "url": "https://console.developers.google.com/apis/api/compute.googleapis.com/overview?project=foo"
          }
        ]
      },
      {
        "@type": "type.googleapis.com/google.rpc.ErrorInfo",
        "reason": "SERVICE_DISABLED",
        "domain": "googleapis.com",
        "metadatas": {
          "service": "compute.googleapis.com",
          "consumer": "projects/foo"
        }
      }
    ]
  }
}

	at com.google.api.client.http.HttpResponseException$Builder.build(HttpResponseException.java:293)
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1118)
	at com.google.api.gax.httpjson.HttpRequestRunnable.run(HttpRequestRunnable.java:115)
	... 6 more

Process finished with exit code 1 <--- Expected to see this to show that the JVM has exited

I believe the output above should be the intended result. I think there should be a simple fix for this, but I'll do some more investigating to confirm.

@Martiul
Copy link
Author

Martiul commented Jan 12, 2024

Probably not the core of the problem, but the code example is not using try-with-resources: try (ImagesClient imagesClient = ImagesClient.create(ImagesSettings.newBuilder().build())) { .

yes. I split the example into try-catch-finally so it's easier to debug imagesClient.isTerminated()

@Martiul
Copy link
Author

Martiul commented Jan 12, 2024

thanks @lqiu96 for the deepdive on the scheduler, i was also under the impression that there was some unfinished job somewhere. Looking forward to the fix

@lqiu96
Copy link
Contributor

lqiu96 commented Jan 12, 2024

I added a draft POC of a possible fix: googleapis/sdk-platform-java#2360. I'm taking a look into grpc-java (httpjson is mostly a mirror/ inspired from there) and other possible implementations.

CC: @blakeli0

@blakeli0 blakeli0 added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Jan 12, 2024
@blakeli0
Copy link
Contributor

@Martiul @meltsufin @suztomo, I added priority:p2 to this bug, let me know if you feel differently.

lqiu96 added a commit to googleapis/sdk-platform-java that referenced this issue Feb 13, 2024
Fixes: googleapis/google-cloud-java#10220

Currently, the executorService will wait for any previously submitted
task to finish execution before being able to terminate. This PR
attempts to fix the issue by cancelling the outstanding timeout task. If
a response has been received prior to the timeout, the timeout task will
be cancelled and the client should be able to terminate immediately
afterwards.

This fixes an issue for Clients that have RPCs with a large timeout
value (i.e. 10 min). The client would wait 10 minutes before being able
to terminate completely.

## Local Testing
Running this with SNAPSHOT Gax version (2.41.1-SNAPSHOT) and latest
Compute from Libraries-Bom v26.31.0
Logs:
```
}

	at com.google.api.client.http.HttpResponseException$Builder.build(HttpResponseException.java:293)
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1118)
	at com.google.api.gax.httpjson.HttpRequestRunnable.run(HttpRequestRunnable.java:115)
	... 6 more

Process finished with exit code 1
```

Running this with latest Gax version (fix not included) and latest
Compute from Libraries-Bom v26.31.0:
```
  }
}

	at com.google.api.client.http.HttpResponseException$Builder.build(HttpResponseException.java:293)
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1118)
	at com.google.api.gax.httpjson.HttpRequestRunnable.run(HttpRequestRunnable.java:115)
	... 6 more
```

Missing the termination and exit code. It shows up after 10 minutes.
ddixit14 pushed a commit to googleapis/sdk-platform-java that referenced this issue Feb 15, 2024
Fixes: googleapis/google-cloud-java#10220

Currently, the executorService will wait for any previously submitted
task to finish execution before being able to terminate. This PR
attempts to fix the issue by cancelling the outstanding timeout task. If
a response has been received prior to the timeout, the timeout task will
be cancelled and the client should be able to terminate immediately
afterwards.

This fixes an issue for Clients that have RPCs with a large timeout
value (i.e. 10 min). The client would wait 10 minutes before being able
to terminate completely.

## Local Testing
Running this with SNAPSHOT Gax version (2.41.1-SNAPSHOT) and latest
Compute from Libraries-Bom v26.31.0
Logs:
```
}

	at com.google.api.client.http.HttpResponseException$Builder.build(HttpResponseException.java:293)
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1118)
	at com.google.api.gax.httpjson.HttpRequestRunnable.run(HttpRequestRunnable.java:115)
	... 6 more

Process finished with exit code 1
```

Running this with latest Gax version (fix not included) and latest
Compute from Libraries-Bom v26.31.0:
```
  }
}

	at com.google.api.client.http.HttpResponseException$Builder.build(HttpResponseException.java:293)
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1118)
	at com.google.api.gax.httpjson.HttpRequestRunnable.run(HttpRequestRunnable.java:115)
	... 6 more
```

Missing the termination and exit code. It shows up after 10 minutes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants