Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fhir-bucket process should exit after reindex failure #2752

Closed
lmsurpre opened this issue Sep 9, 2021 · 3 comments
Closed

fhir-bucket process should exit after reindex failure #2752

lmsurpre opened this issue Sep 9, 2021 · 3 comments
Assignees
Labels
bug Something isn't working

Comments

@lmsurpre
Copy link
Member

lmsurpre commented Sep 9, 2021

Is your feature request related to a problem? Please describe.
I have a kubernetes job for executing the fhir-bucket client-driven reindex process.
Under certain conditions, I see that failing as follows:

2021-09-09 14:01:06.145 00000001    INFO x.ClientDrivenReindexOperation Starting monitor thread
2021-09-09 14:01:07.476 00000018    INFO x.ClientDrivenReindexOperation called $retrieve-index (afterIndexId=2739214): 200 OK [took 0.401 s]
2021-09-09 14:01:07.479 00000018    INFO x.ClientDrivenReindexOperation Index IDs available for processing - filling worker pool
2021-09-09 14:01:11.124 0000001a    INFO x.ClientDrivenReindexOperation called $reindex (indexIds=2739215,...): 200 OK [took 3.642 s]
...
2021-09-09 14:01:24.818 0000001e    INFO x.ClientDrivenReindexOperation called $reindex (indexIds=2741015,...): 200 OK [took 6.425 s]
2021-09-09 14:02:07.521 00000018    INFO .http.impl.execchain.RetryExec I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {s}->https://fhir-internal:9443: The target server failed to respond
2021-09-09 14:02:07.522 00000018    INFO .http.impl.execchain.RetryExec Retrying request to {s}->https://fhir-internal:9443
2021-09-09 14:02:07.524 00000018  SEVERE bucket.client.FHIRBucketClient Error while executing the POST request. org.apache.http.NoHttpResponseException: fhir-internal:9443 failed to respond
2021-09-09 14:02:07.524 00000018 WARNING bucket.client.FHIRBucketClient POST URL: https://fhir-internal:9443/fhir-server/api/v4/$retrieve-index
Request Body: {"resourceType":"Parameters","parameter":[{"name":"_count","valueInteger":1000},{"name":"notModifiedAfter","valueString":"2021-06-13T12:00:00Z"},{"name":"afterIndexId","valueString":"2741213"}]}
2021-09-09 14:02:07.525 00000018    INFO x.ClientDrivenReindexOperation Waiting for 60 threads to complete before exiting
2021-09-09 14:02:12.527 00000018    INFO x.ClientDrivenReindexOperation Waiting for 60 threads to complete before exiting
2021-09-09 14:02:17.528 00000018    INFO x.ClientDrivenReindexOperation Waiting for 60 threads to complete before exiting
2021-09-09 14:02:22.529 00000018    INFO x.ClientDrivenReindexOperation Waiting for 60 threads to complete before exiting
2021-09-09 14:02:27.530 00000018    INFO x.ClientDrivenReindexOperation Waiting for 60 threads to complete before exiting
2021-09-09 14:02:32.531 00000018    INFO x.ClientDrivenReindexOperation Waiting for 60 threads to complete before exiting
2021-09-09 14:02:37.533 00000018    INFO x.ClientDrivenReindexOperation Waiting for 60 threads to complete before exiting
2021-09-09 14:02:42.534 00000018    INFO x.ClientDrivenReindexOperation Waiting for 60 threads to complete before exiting
2021-09-09 14:02:47.536 00000018    INFO x.ClientDrivenReindexOperation Waiting for 60 threads to complete before exiting
...
2021-09-09 14:01:24.818 0000001e    INFO x.ClientDrivenReindexOperation called $reindex (indexIds=2741015,...): 200 OK [took 6.425 s]
2021-09-09 14:07:07.661 00000018  SEVERE x.ClientDrivenReindexOperation Reindexing was not fully completed, restart reindex from index ID '2741213' to resume
Exception in thread "Thread-4" com.ibm.fhir.database.utils.api.DataAccessException: FHIR server connection failed
	at com.ibm.fhir.bucket.client.FHIRBucketClient.post(FHIRBucketClient.java:315)
	at com.ibm.fhir.bucket.reindex.ClientDrivenReindexOperation.callRetrieveIndex(ClientDrivenReindexOperation.java:315)
	at com.ibm.fhir.bucket.reindex.ClientDrivenReindexOperation.monitorLoop(ClientDrivenReindexOperation.java:204)
	at com.ibm.fhir.bucket.reindex.ClientDrivenReindexOperation.lambda$init$0(ClientDrivenReindexOperation.java:123)
	at com.ibm.fhir.bucket.reindex.ClientDrivenReindexOperation$$Lambda$50/0x0000000000449100.run(Unknown Source)
	at java.base/java.lang.Thread.run(Thread.java:836)

However, the job is still has a status of running which means that the process has never exited.

Describe the solution you'd like
A clear and concise description of what you want to happen.

Describe alternatives you've considered
A. System.exit after some time of waiting
B. Keep as-is and deal with outside of fhir-bucket (e.g. by add some kind of timeout to the job)

Acceptance Criteria

  1. GIVEN [a precondition]
    AND [another precondition]
    WHEN [test step]
    AND [test step]
    THEN [verification step]
    AND [verification step]

Additional context
Add any other context or screenshots about the feature request here.

@prb112 prb112 added the bug Something isn't working label Sep 9, 2021
@prb112 prb112 self-assigned this Sep 17, 2021
@prb112
Copy link
Contributor

prb112 commented Sep 17, 2021

Same issue as #2225 (a different code path)

prb112 added a commit that referenced this issue Sep 17, 2021
Signed-off-by: Paul Bastide <pbastide@us.ibm.com>
prb112 added a commit that referenced this issue Sep 20, 2021
fhir-bucket process should exit after reindex failure #2752
@lmsurpre
Copy link
Member Author

To verify, I ran a large reindex job with 60 threads.

After the first few requests, it hit a case where the server does not respond (due to an environmental issue I think).

I think this used to cause the reindex process to exit, but now it was able to recover and continue running, as seen in the last couple messages:

2021-09-20 19:47:30.149 00000012    INFO x.ClientDrivenReindexOperation called $reindex (indexIds=1901,...): 200 OK [took 1.370 s]
2021-09-20 19:48:22.116 0000000c    INFO .http.impl.execchain.RetryExec I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {s}->https://fhir-internal:9443: The target server failed to respond
2021-09-20 19:48:22.117 0000000c    INFO .http.impl.execchain.RetryExec Retrying request to {s}->https://fhir-internal:9443
2021-09-20 19:48:22.120 0000000c  SEVERE bucket.client.FHIRBucketClient Error while executing the POST request. org.apache.http.NoHttpResponseException: fhir-internal:9443 failed to respond
2021-09-20 19:48:22.121 0000000c WARNING bucket.client.FHIRBucketClient POST URL: https://fhir-internal:9443/fhir-server/api/v4/$retrieve-index
Request Body: {"resourceType":"Parameters","parameter":[{"name":"_count","valueInteger":1000},{"name":"notModifiedAfter","valueString":"2021-06-13T12:00:00Z"},{"name":"afterIndexId","valueString":"1999"}]}
2021-09-20 19:48:22.121 0000000c  SEVERE x.ClientDrivenReindexOperation Throwable caught. FHIR client thread will exit

FHIR server connection failed
com.ibm.fhir.database.utils.api.DataAccessException: FHIR server connection failed
	at com.ibm.fhir.bucket.client.FHIRBucketClient.post(FHIRBucketClient.java:315)
	at com.ibm.fhir.bucket.reindex.ClientDrivenReindexOperation.callRetrieveIndex(ClientDrivenReindexOperation.java:321)
	at com.ibm.fhir.bucket.reindex.ClientDrivenReindexOperation.monitorLoop(ClientDrivenReindexOperation.java:204)
	at com.ibm.fhir.bucket.reindex.ClientDrivenReindexOperation.lambda$init$0(ClientDrivenReindexOperation.java:123)
	at java.base/java.lang.Thread.run(Thread.java:829)
2021-09-20 19:48:22.125 0000000c    INFO .http.impl.execchain.RetryExec I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {s}->https://fhir-internal:9443: The target server failed to respond
2021-09-20 19:48:22.126 0000000c    INFO .http.impl.execchain.RetryExec Retrying request to {s}->https://fhir-internal:9443
2021-09-20 19:48:22.128 0000000c  SEVERE bucket.client.FHIRBucketClient Error while executing the POST request. org.apache.http.NoHttpResponseException: fhir-internal:9443 failed to respond
2021-09-20 19:48:22.129 0000000c WARNING bucket.client.FHIRBucketClient POST URL: https://fhir-internal:9443/fhir-server/api/v4/$retrieve-index
Request Body: {"resourceType":"Parameters","parameter":[{"name":"_count","valueInteger":1000},{"name":"notModifiedAfter","valueString":"2021-06-13T12:00:00Z"},{"name":"afterIndexId","valueString":"1999"}]}
2021-09-20 19:48:22.130 0000000c  SEVERE x.ClientDrivenReindexOperation Throwable caught. FHIR client thread will exit

FHIR server connection failed
com.ibm.fhir.database.utils.api.DataAccessException: FHIR server connection failed
	at com.ibm.fhir.bucket.client.FHIRBucketClient.post(FHIRBucketClient.java:315)
	at com.ibm.fhir.bucket.reindex.ClientDrivenReindexOperation.callRetrieveIndex(ClientDrivenReindexOperation.java:321)
	at com.ibm.fhir.bucket.reindex.ClientDrivenReindexOperation.monitorLoop(ClientDrivenReindexOperation.java:204)
	at com.ibm.fhir.bucket.reindex.ClientDrivenReindexOperation.lambda$init$0(ClientDrivenReindexOperation.java:123)
	at java.base/java.lang.Thread.run(Thread.java:829)
2021-09-20 19:48:22.133 0000000c    INFO .http.impl.execchain.RetryExec I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {s}->https://fhir-internal:9443: The target server failed to respond
2021-09-20 19:48:22.134 0000000c    INFO .http.impl.execchain.RetryExec Retrying request to {s}->https://fhir-internal:9443
2021-09-20 19:48:22.139 0000000c  SEVERE bucket.client.FHIRBucketClient Error while executing the POST request. org.apache.http.NoHttpResponseException: fhir-internal:9443 failed to respond
2021-09-20 19:48:22.140 0000000c WARNING bucket.client.FHIRBucketClient POST URL: https://fhir-internal:9443/fhir-server/api/v4/$retrieve-index
Request Body: {"resourceType":"Parameters","parameter":[{"name":"_count","valueInteger":1000},{"name":"notModifiedAfter","valueString":"2021-06-13T12:00:00Z"},{"name":"afterIndexId","valueString":"1999"}]}
2021-09-20 19:48:22.140 0000000c  SEVERE x.ClientDrivenReindexOperation Throwable caught. FHIR client thread will exit

FHIR server connection failed
com.ibm.fhir.database.utils.api.DataAccessException: FHIR server connection failed
	at com.ibm.fhir.bucket.client.FHIRBucketClient.post(FHIRBucketClient.java:315)
	at com.ibm.fhir.bucket.reindex.ClientDrivenReindexOperation.callRetrieveIndex(ClientDrivenReindexOperation.java:321)
	at com.ibm.fhir.bucket.reindex.ClientDrivenReindexOperation.monitorLoop(ClientDrivenReindexOperation.java:204)
	at com.ibm.fhir.bucket.reindex.ClientDrivenReindexOperation.lambda$init$0(ClientDrivenReindexOperation.java:123)
	at java.base/java.lang.Thread.run(Thread.java:829)
2021-09-20 19:48:22.143 0000000c    INFO .http.impl.execchain.RetryExec I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {s}->https://fhir-internal:9443: The target server failed to respond
2021-09-20 19:48:22.143 0000000c    INFO .http.impl.execchain.RetryExec Retrying request to {s}->https://fhir-internal:9443
2021-09-20 19:48:22.229 0000000c    INFO x.ClientDrivenReindexOperation called $retrieve-index (afterIndexId=1999): 200 OK [took 0.087 s]
2021-09-20 19:48:23.074 00000022    INFO x.ClientDrivenReindexOperation called $reindex (indexIds=2059,...): 200 OK [took 0.842 s]

If the reindex job completes, I think we can close this one as resolved.

@lmsurpre
Copy link
Member Author

2021-09-20 22:59:34.575 0000000c    INFO x.ClientDrivenReindexOperation Reindexing was completed
2021-09-20 23:00:34.545 0000000b    INFO   com.ibm.fhir.bucket.app.Main Stopping all services

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants