Requests time out when fetching pages of a search request in parallel ("HAPI-1163: Request timed out") #6109
Comments
@tadgh if you know how to route this ticket, please let us know. Many thanks!
Huh, interesting one. I'll read over the other thread and get back to you. Thanks Jing!
Thanks @tadgh for looking into this; I was wondering if there is any update. Please let me know if you need more inputs for reproducing this problem.
Sorry, we have been slammed with the release the last couple of weeks. Once that settles down I can get some focus on this.
Hey @bashir2 and @jingtang10, is there any chance this can be replicated in a HAPI-FHIR test? If you can submit a PR with a failing test, that would go a long way for us in terms of our ability to address this.
@tadgh can you please point me to an example/similar PR that you want me to create for reproducing this issue? The reproduction steps require a large number of resources to be uploaded to the HAPI server (as mentioned in my OP of this issue). Can I reproduce such an environment in the HAPI-FHIR test that you are suggesting?
Most certainly. Have a look at StressTestR4Test, which generates ~1k resources. Alternatively, and maybe better for your use case, is FhirResourceDaoR4SearchOptimizedTest, which seems to be doing roughly what you are. You'll note that the second test creates 200 patients and queries them in parallel via a thread pool. However, that test interacts directly with the DAOs, so that may hide the source of failure if it's upstream in the client. Wouldn't be much of a lift to use the IGenericClient in a thread pool though. Let me know if you need further guidance.
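For illustration only (this is not the failing test discussed in this thread), a minimal sketch of such a reproduction using `IGenericClient` in a thread pool; the server URL, page size, number of pages, and the rewriting of the `_getpagesoffset` parameter in the paging link are assumptions, not details taken from the original comments:

```java
import ca.uhn.fhir.context.FhirContext;
import ca.uhn.fhir.rest.client.api.IGenericClient;
import org.hl7.fhir.r4.model.Bundle;
import org.hl7.fhir.r4.model.Patient;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelPageFetchSketch {
  public static void main(String[] args) throws Exception {
    FhirContext ctx = FhirContext.forR4();
    // Hypothetical server base URL.
    IGenericClient client = ctx.newRestfulGenericClient("http://localhost:8080/fhir");

    // Run one search to obtain the paging link of the stored search.
    Bundle first = client.search()
        .forResource(Patient.class)
        .count(100)
        .returnBundle(Bundle.class)
        .execute();
    String nextUrl = first.getLink(Bundle.LINK_NEXT).getUrl();

    // Fetch many pages of the same search concurrently by rewriting the
    // _getpagesoffset parameter of the "next" link (an assumption about the link format).
    ExecutorService pool = Executors.newFixedThreadPool(4);
    List<Future<Bundle>> futures = new ArrayList<>();
    for (int offset = 100; offset <= 5000; offset += 100) {
      String pageUrl = nextUrl.replaceAll("_getpagesoffset=\\d+", "_getpagesoffset=" + offset);
      futures.add(pool.submit(
          () -> client.loadPage().byUrl(pageUrl).andReturnBundle(Bundle.class).execute()));
    }
    for (Future<Bundle> f : futures) {
      // A slow server-side pre-fetch would surface here as the
      // "HAPI-1163: Request timed out" error described in this issue.
      f.get();
    }
    pool.shutdown();
  }
}
```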
Thanks @tadgh for the pointer; it took me a while to get to this and then to set up my environment, but I think I have demonstrated the issue in this commit. Looking a little into the details, I think I understand the root cause now as well. TL;DR: … Details: I have demonstrated this in the WARN messages I added to the end of the test. Here are some relevant log messages for 3 consecutive page fetches; note the WARN messages for …
So I am not sure if there is actually anything to be fixed here, is there? I was looking at the …
Yes, adding that config to the jpaserver-starter helps; I am also trying to find other ways we can account for this (maybe from our pipeline's end). What exactly does pre-fetching mean? I remember that when doing a search, HAPI was storing the list of IDs for that particular search in the DB. Does pre-fetching mean creating that full list? Is it possible to make the creation of that list more gradual? I mean something similar to the idea of … If this is not easy to do, I think from our side we should initially fetch a page that is beyond the last index (e.g., …).
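As background on what the pre-fetch discussed here controls (an editorial illustration, not part of the original thread): the HAPI JPA server materializes the result IDs of a stored search in stages, and the stage sizes are governed by the search pre-fetch thresholds. A minimal sketch of raising them programmatically, assuming the setting in question is `setSearchPreFetchThresholds` on `JpaStorageSettings` (called `DaoConfig` in older HAPI versions); the threshold values below are arbitrary examples, not recommendations:

```java
import java.util.Arrays;

import ca.uhn.fhir.jpa.api.config.JpaStorageSettings;

public class PreFetchConfigSketch {
  // Each value is a pre-fetch stage for a stored search's result IDs;
  // -1 as the last stage means "pre-fetch all remaining results".
  public static JpaStorageSettings withLargerPreFetch() {
    JpaStorageSettings settings = new JpaStorageSettings();
    settings.setSearchPreFetchThresholds(Arrays.asList(20, 500, 2000, 100000, -1));
    return settings;
  }
}
```

With -1 as the final stage the server eventually pre-fetches everything, which is what the memory question further down in the thread is about.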
I'd rather defer to @michaelabuckley's thoughts on this, but I can certainly help in exposing the setting.
Could you not just use …?
The problem is that we do not control the source FHIR server. Our pipeline is supposed to read from any FHIR server. We have been facing this issue when our pipeline uses the FHIR Search API and the FHIR server is HAPI. We can recommend that setting to our partners once your PR is released, but we should also be clear about the other performance implications of it. BTW, what are the memory implications of pre-fetching everything, i.e., just use …?
On a large enough dataset? Not good! I didn't write this code and am not intimately familiar with it, but my read on it is that it would prefetch all possible results, which could exceed the memory limitations of the DB or the application server. For your use case, it may be better to fetch a known static amount per fetch, then have your pipeline adjust so that it runs like this:
Just spitballing here on the options; no clue if this would be suitable for your particular use case.
Another option would be to somehow make this prefetch configurable on a per-original-request basis, but that obviously opens up the server to DoS vulnerabilities if it's used haphazardly.
Describe the bug
TL;DR: With recent versions of the HAPI JPA server (not sure exactly since when) we cannot fetch pages of a search result in parallel if the number of resources to be fetched is too large.
Details:
In fhir-data-pipes we have different methods for fetching resources from a FHIR server in parallel. One of these is through the FHIR Search API: one search is done and then the different pages are downloaded in parallel, i.e., many queries like this:
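The concrete page query is elided from the report above; purely for illustration, a HAPI paged-search request generally has the following shape, where `_getpages` carries the ID of the stored search and `_getpagesoffset` selects the page (the host and values here are hypothetical):

```
GET http://example-hapi-server/fhir?_getpages=<search-uuid>&_getpagesoffset=3000&_count=100
```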
This used to work fine with HAPI FHIR JPA servers too, at least until two years ago as described in this issue. But we have recently discovered that if the number of resources to be fetched is large, then accessing the pages in parallel will fail with
HAPI-1163: Request timed out after 60492ms
(the time is around 60 seconds and I think comes from this line). Doing some investigation, it seems that after ~2K resources are fetched, one page request all of a sudden takes tens of seconds (other requests are instant). I have investigated this on our side and have reproduced this problem with a simple script (outside our pipeline code), as described in the same issue. If I run the pipeline with a single worker, it eventually succeeds. But having 3 or more workers usually fails (in my experiments/setup, it failed when the number of resources was more than ~5 million). I am guessing that the very slow page request blocks other parallel requests (but I am not sure).
Another input: Although I have set the number of DB connections of the HAPI JPA server to 40, there is usually only one or maybe two `postgres` processes doing some work. When we did extensive performance experiments two years ago, we could easily flood 40+ CPU cores with `postgres` processes.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The parallel fetch should be able to use all of the machine's resources and succeed.
Screenshots
See this issue for when we profiled the HAPI JPA server two years ago with a similar setup.
Environment (please complete the following information):
Additional context
Here is a sample stack trace from the HAPI JPA Docker image logs: