Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update search-engine resource allocations #137

Merged
merged 1 commit into from
Oct 25, 2023
Merged

Conversation

WadeBarnes
Copy link
Member

No description provided.

@WadeBarnes
Copy link
Member Author

I've been testing this configuration under load since 6:30am this morning. There have been a few instances of the OOM errors pop up. These typically occur during the time of high CPU usage. The CPU usage bumps into the pod's limit and causes the pod to be taken offline, in turn causing a service interruption.

image

image

image

@WadeBarnes
Copy link
Member Author

At first the high CPU usage seemed to be related to the times the BC Registries agent would process new credentials (BC Reg 0 and 30, LEAR 15 and 45), however that does not seem to be the case>

image

@WadeBarnes
Copy link
Member Author

Start parameters for Solr (to look into the other options):

/usr/local/openjdk-11/bin/java-server-Xms512m-Xmx2048m-XX:+UseG1GC-XX:+PerfDisableSharedMem-XX:+ParallelRefProcEnabled-XX:MaxGCPauseMillis=250-XX:+UseLargePages-XX:+AlwaysPreTouch-XX:+ExplicitGCInvokesConcurrent-Xlog:gc*:file=/var/solr/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M-Dsolr.jetty.inetaccess.includes=-Dsolr.jetty.inetaccess.excludes=-Dsolr.log.level=WARN-Dsolr.log.dir=/var/solr/logs-Djetty.port=8983-DSTOP.PORT=7983-DSTOP.KEY=solrrocks-Duser.timezone=UTC-XX:-OmitStackTraceInFastThrow-Djetty.home=/opt/solr/server-Dsolr.solr.home=/var/solr/data-Dsolr.data.home=-Dsolr.install.dir=/opt/solr-Dsolr.default.confdir=/opt/solr/server/solr/configsets/_default/conf-Dlog4j.configurationFile=/var/solr/log4j2.xml-Xss256k-jarstart.jar--module=http1002370000@search-engine-bc-59-498gs

@WadeBarnes
Copy link
Member Author

Stopped load test and will allow the current pod to run over night and review the metrics and logs in the morning.

@WadeBarnes
Copy link
Member Author

-XX:+UseLargePages might be something to look into. There are warnings related to this on pod startup; OpenJDK 64-Bit Server VM warning: Failed to reserve shared memory. (error = 12)

@WadeBarnes
Copy link
Member Author

Also need to review API logs. Seems like some sync/scanning is going on at times throughout the day.

@WadeBarnes
Copy link
Member Author

Increased CPU allocation to see if it would peek just about the 4CPU mark. CPU was pinned shortly after the pod was started:

image

image

image

@WadeBarnes
Copy link
Member Author

Compare previous image to the current image. We did not have this many issues with the previous image. On the surface they look the same (same Solr version), but the config may be slightly different.

@WadeBarnes
Copy link
Member Author

WadeBarnes commented Sep 13, 2023

Info for the last 14hours. Became completely unresponsive at 4:23 AM. The pod was running and passing health checks, but was not responding to any search queries.

image

image

image

image

@WadeBarnes
Copy link
Member Author

Rolled back to the previous search-engine image.

@WadeBarnes
Copy link
Member Author

WadeBarnes commented Sep 13, 2023

Image comparison:

  • search-engine@sha256:e6abd08ea2079377a6afc5b2754c2d2ad6129e0672078758a868b957f7a870d4

  • search-engine@sha256:c092c6b9a5a004e173b97af9e026c62557b121434cd2bf8459fcdba40f925cff

    • The "previous" image
    • Base image (search-engine-base-89), image-registry.openshift-image-registry.svc:5000/8ad0ea-tools/search-engine-base@sha256:7069b8e7bca1770a529f14701e26781b025e13af31ad7e1e45f9046cb2700249
      • This base image does not exist anymore.
    • Aries-VCR build commit, dd2a7fb08b46773ccfe139c69c7d3e132acacfc9

There are no changes to any of the Solr config between the Aries-VCR commits, and both images contain the Log4j mitigations, indicating the base images were built off the same version of the openshift-solr source. This means the two images should behave the same.

@WadeBarnes
Copy link
Member Author

WadeBarnes commented Sep 13, 2023

Upgraded the base image to Solr 8.11.2 (built off the solr-8.11.2 branch for testing); bcgov/openshift-solr#15

cmd line for Solr 8.11.2 container:

/opt/java/openjdk/bin/java-server-Xms512m-Xmx2048m-XX:+UseG1GC-XX:+PerfDisableSharedMem-XX:+ParallelRefProcEnabled-XX:MaxGCPauseMillis=250-XX:+UseLargePages-XX:+AlwaysPreTouch-XX:+ExplicitGCInvokesConcurrent-Xlog:gc*:file=/var/solr/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M-Dsolr.jetty.inetaccess.includes=-Dsolr.jetty.inetaccess.excludes=-Dsolr.log.level=WARN-Dsolr.log.dir=/var/solr/logs-Djetty.port=8983-DSTOP.PORT=7983-DSTOP.KEY=solrrocks-Duser.timezone=UTC-XX:-OmitStackTraceInFastThrow-XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs-Djetty.home=/opt/solr/server-Dsolr.solr.home=/var/solr/data-Dsolr.data.home=-Dsolr.install.dir=/opt/solr-Dsolr.default.confdir=/opt/solr/server/solr/configsets/_default/conf-Dlog4j.configurationFile=/var/solr/log4j2.xml-Xss256k-jarstart.jar--module=http--module=gzip10

@WadeBarnes
Copy link
Member Author

WadeBarnes commented Sep 13, 2023

New version still encounters OOM issues, but now has a script that runs to kill the process right away when OOM errors occur. Which means the pod gets restarted whenever it encounters an OOM error. So same problem, just a slightly different behavior.

@WadeBarnes
Copy link
Member Author

WadeBarnes commented Sep 13, 2023

Manually adjusted memory allocation and settings for testing:

- resources:
    limits:
      cpu: '5'
      memory: 8Gi
    requests:
      cpu: 500m
      memory: 3Gi
SOLR_JAVA_MEM=-Xms1024m -Xmx4096m
GC_TUNE=-XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250 -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent -XX:InitiatingHeapOccupancyPercent=75
  • Notable changes to GC_TUNE: Removed -XX:+UseLargePages which eliminates the OpenJDK 64-Bit Server VM warning: Failed to reserve shared memory. (error = 12) warning messages. Added -XX:InitiatingHeapOccupancyPercent=75 to control the frequency of garbage collection.

@WadeBarnes
Copy link
Member Author

WadeBarnes commented Sep 13, 2023

After updating all the settings I scanned through the logs to identify queries associated with high load conditions that were causing service interruptions and incorporated them into the loadTest scripts. The search-engine as been stable under load from 4 instances of the script running concurrently for the last hour.

image

@WadeBarnes
Copy link
Member Author

WadeBarnes commented Sep 14, 2023

The search-engine pod faired much better over the past 19 hours, than it did the previous few days.

image

image

The pod was restarted a total of 4 times between ~3:45pm and 5pm. This was due to activity after the load testing was completed (~1:30 - 2:30). I'll be digging into the logs for details. There is some fine tuning needed.

search-engine restart times (due to OOM issues, via OOM killer script):

  • September 13th 2023, 16:04:09.591
  • September 13th 2023, 16:20:32.341
  • September 13th 2023, 16:39:53.102
  • September 13th 2023, 17:05:03.713

Recovery time (restart time) is typically under a minute.

@WadeBarnes
Copy link
Member Author

WadeBarnes commented Sep 14, 2023

Bytespider Traffic - Not huge
Amazon Singapore
47.128.0.0/14 (47.128.0.0 - 47.131.255.255)

image

Other traffic:
image

Other top IPs:

  • 50.98.49.96
  • 141.95.45.36
  • 172.218.186.179

@WadeBarnes
Copy link
Member Author

We need a better way to parse the IP addresses out of the logs in Kibana

@WadeBarnes
Copy link
Member Author

@i5okie, @esune mentioned you might be able to help with the Kibana log parsing. What I'd like to do is parse the X-Forwarded-For field of the log message from logs like this:

image

@WadeBarnes
Copy link
Member Author

Increased the logging level on the search-engine to see if we can capture the events leading up to the OOM issues. The issue could be related to credential indexing as opposed to search queries. Or, perhaps a combination of the two at times.

@WadeBarnes
Copy link
Member Author

WadeBarnes commented Sep 15, 2023

search-engine pod metrics since Sep 14, 2023, 9:50 AM:

image

image

The pod has restarted 8 times due to OOM errors:

  • September 14th 2023, 15:15:55.697
  • September 14th 2023, 15:46:25.683
  • September 14th 2023, 17:34:22.525
  • September 14th 2023, 23:12:35.830
  • September 14th 2023, 23:15:28.230
  • September 14th 2023, 23:18:41.520
  • September 14th 2023, 23:26:18.286
  • September 14th 2023, 23:34:55.028

@WadeBarnes
Copy link
Member Author

WadeBarnes commented Sep 15, 2023

Captured a log containing an OOM triggered restart.
Solr-Error.log

  • Reviewed. Nothing obvious.

@WadeBarnes
Copy link
Member Author

Summary over the weekend:

Resource settings:

- resources:
    limits:
      cpu: '5'
      memory: 10Gi
    requests:
      cpu: 500m
      memory: 6Gi

  - name: SOLR_JAVA_MEM
    value: '-Xms2048m -Xmx5120m'
  - name: GC_TUNE
    value: >-
      -XX:+UseG1GC -XX:+PerfDisableSharedMem
      -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250
      -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent
      -XX:InitiatingHeapOccupancyPercent=75

image

image

Restarts (19):

  • September 15th 2023, 14:35:59.778
  • September 15th 2023, 14:39:42.382
  • September 15th 2023, 15:08:11.088
  • September 16th 2023, 17:46:15.851
  • September 16th 2023, 17:48:13.431
  • September 16th 2023, 17:52:19.325
  • September 16th 2023, 17:59:19.111
  • September 16th 2023, 18:13:22.786
  • September 16th 2023, 18:57:31.627
  • September 16th 2023, 21:41:00.354
  • September 17th 2023, 06:42:45.533
  • September 17th 2023, 06:45:57.840
  • September 17th 2023, 11:26:50.096
  • September 17th 2023, 11:28:08.882
  • September 17th 2023, 11:30:13.455
  • September 17th 2023, 11:38:28.395
  • September 17th 2023, 11:45:50.987
  • September 17th 2023, 12:17:43.509
  • September 17th 2023, 12:19:08.664

@WadeBarnes
Copy link
Member Author

Finally found some good articles on troubleshooting issues with Solr configurations. The main takeaway is testing various heap sizes. The articles also cover more information on common problems with queries and caching configurations that can cause memory issues. Solr docs go into how to collect metrics and visualize the JVM heap. Good resources if we continue to run into issues.

@WadeBarnes
Copy link
Member Author

I've refactored the load tests to, hopefully, exercise the search-engine more. I updated the search-engine configuration (increased memory and heap) and ran the load tests for a couple hours without issue. I'll let it run overnight at least to see how it handles the regular traffic loads.

@WadeBarnes
Copy link
Member Author

WadeBarnes commented Sep 19, 2023

Resource Settings:

- resources:
    limits:
      cpu: '5'
      memory: 16Gi
    requests:
      cpu: 500m
      memory: 8Gi

- name: SOLR_JAVA_MEM
  value: '-Xms4g -Xmx8g'
- name: GC_TUNE
  value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem
    -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250
    -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75

The period between 6am and 9am was were I was performing load testing on the updated settings. There was the one restart during that time were I was pushing things a bit too hard. Load testing was performed with rate limiting adjusted to allow higher query volumes, otherwise most of the queries were getting blocked. Rate limiting was reapplied following the load testing.

image

image

Restarts since yesterday:

  • September 18th 2023, 06:50:31.287
  • September 18th 2023, 13:00:43.415
  • September 18th 2023, 13:10:27.709
  • September 18th 2023, 14:21:41.840
  • September 18th 2023, 15:08:26.997
  • September 18th 2023, 15:38:31.013
  • September 18th 2023, 15:49:57.219
  • September 18th 2023, 16:05:05.748
  • September 18th 2023, 16:19:51.972
  • September 18th 2023, 16:38:30.411
  • September 18th 2023, 20:35:29.097

@WadeBarnes
Copy link
Member Author

WadeBarnes commented Sep 20, 2023

Resource Settings:

- resources:
    limits:
        cpu: '5'
        memory: 20Gi
    requests:
        cpu: 500m
        memory: 10Gi

- name: SOLR_JAVA_MEM
    value: '-Xms8g -Xmx10g'
- name: GC_TUNE
    value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled
    -XX:MaxGCPauseMillis=250 -XX:+AlwaysPreTouch
    -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75

image

image

image

image

image

image

image

Restarts since yesterday:

  • September 19th 2023, 12:56:24.264
  • September 19th 2023, 13:05:06.261
  • September 19th 2023, 13:31:00.405
  • September 19th 2023, 15:46:08.488

@WadeBarnes
Copy link
Member Author

WadeBarnes commented Sep 21, 2023

Resource Settings:

- resources:
    limits:
        cpu: '5'
        memory: 20Gi
    requests:
        cpu: 500m
        memory: 10Gi

- name: SOLR_JAVA_MEM
    value: '-Xms8g -Xmx10g'
- name: GC_TUNE
    value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled
    -XX:MaxGCPauseMillis=250 -XX:+AlwaysPreTouch
    -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75

image

image

image

image

Restarts since yesterday:

  • September 20th 2023, 11:49:08.473
  • September 20th 2023, 12:08:22.863
  • September 20th 2023, 12:14:43.854
  • September 20th 2023, 12:38:25.032
  • September 20th 2023, 13:06:03.496
  • September 20th 2023, 15:07:28.761

@WadeBarnes
Copy link
Member Author

WadeBarnes commented Sep 22, 2023

Resource Settings:

- resources:
    limits:
        cpu: '5'
        memory: 20Gi
    requests:
        cpu: 500m
        memory: 10Gi

- name: SOLR_JAVA_MEM
    value: '-Xms8g -Xmx10g'
- name: GC_TUNE
    value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled
    -XX:MaxGCPauseMillis=250 -XX:+AlwaysPreTouch
    -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75

Load testing between 8-10am using some queries added to the load test scripts. I was able to force the search-engine to restart once in that period, near the beginning of the run.

Note:

  • Typical multiple restart pattern did not occur in the 12:00 time block.
  • 3 restarts occurred in the 15:00 time block, when search queries were lower than normal.

image

image

image

image

Restarts since yesterday:

  • September 21st 2023, 08:38:59.721
  • September 21st 2023, 09:07:06.038
  • September 21st 2023, 16:48:09.813
  • September 21st 2023, 16:58:45.291
  • September 21st 2023, 17:06:56.915

@WadeBarnes
Copy link
Member Author

Resource Settings:

- resources:
    limits:
      cpu: '5'
      memory: 20Gi
    requests:
      cpu: 500m
      memory: 10Gi

- name: SOLR_JAVA_MEM
  value: '-Xms8g -Xmx10g'
- name: GC_TUNE
  value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem
    -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250
    -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75

image

image

image

Restarts since last report:

  • October 1st 2023, 14:04:24.287
  • October 1st 2023, 19:44:39.397
  • October 1st 2023, 22:19:47.885
  • October 2nd 2023, 04:22:02.830
  • October 2nd 2023, 06:27:20.274
  • October 2nd 2023, 07:12:07.662
  • October 2nd 2023, 07:16:47.664
  • October 2nd 2023, 14:05:32.630
  • October 4th 2023, 18:47:10.224
  • October 4th 2023, 18:54:13.308
  • October 4th 2023, 19:06:05.535
  • October 4th 2023, 19:14:47.824
  • October 4th 2023, 19:28:33.086
  • October 4th 2023, 19:51:07.088
  • October 4th 2023, 19:56:51.623
  • October 4th 2023, 20:11:00.963
  • October 4th 2023, 20:22:16.056
  • October 4th 2023, 20:28:56.536
  • October 4th 2023, 20:46:18.216
  • October 4th 2023, 20:55:36.567
  • October 4th 2023, 21:01:36.922
  • October 4th 2023, 21:09:06.920
  • October 4th 2023, 21:32:11.678
  • October 4th 2023, 21:50:30.785
  • October 4th 2023, 21:59:23.275
  • October 4th 2023, 22:10:16.848
  • October 4th 2023, 22:33:34.781
  • October 4th 2023, 22:50:00.354
  • October 4th 2023, 22:55:43.136
  • October 4th 2023, 23:06:57.429
  • October 4th 2023, 23:14:23.512
  • October 4th 2023, 23:21:32.571
  • October 4th 2023, 23:28:12.465
  • October 5th 2023, 00:01:34.585
  • October 5th 2023, 00:15:20.176
  • October 5th 2023, 00:25:50.091
  • October 5th 2023, 00:31:24.233
  • October 5th 2023, 01:02:56.832
  • October 5th 2023, 01:11:11.181
  • October 5th 2023, 01:20:57.755
  • October 5th 2023, 01:29:48.860
  • October 5th 2023, 01:54:59.748
  • October 5th 2023, 02:06:10.983
  • October 5th 2023, 02:15:55.757
  • October 5th 2023, 02:24:57.085
  • October 5th 2023, 02:46:06.866
  • October 5th 2023, 02:56:00.629
  • October 5th 2023, 03:11:29.147
  • October 5th 2023, 03:19:52.533
  • October 5th 2023, 03:30:10.058
  • October 5th 2023, 03:47:53.257
  • October 5th 2023, 03:55:20.794
  • October 5th 2023, 04:22:42.235
  • October 5th 2023, 04:29:18.952
  • October 5th 2023, 04:48:15.072
  • October 5th 2023, 04:56:06.118
  • October 5th 2023, 05:05:37.694
  • October 5th 2023, 05:13:36.939
  • October 5th 2023, 05:26:46.935
  • October 5th 2023, 05:52:43.276
  • October 5th 2023, 06:00:06.666
  • October 5th 2023, 06:10:39.611
  • October 5th 2023, 06:23:52.624
  • October 5th 2023, 06:30:18.834
  • October 5th 2023, 06:48:18.164
  • October 5th 2023, 06:58:28.304
  • October 5th 2023, 07:10:35.642
  • October 5th 2023, 07:17:50.913
  • October 5th 2023, 07:27:28.950

@WadeBarnes
Copy link
Member Author

Clearly there is no correlation between query volume and the OOM restarts.

@WadeBarnes
Copy link
Member Author

All environments have been updated with the latest configurations.

@WadeBarnes
Copy link
Member Author

Disabled the Solr cache on the search-engine; bcgov/aries-vcr#758

@WadeBarnes
Copy link
Member Author

WadeBarnes commented Oct 16, 2023

Resource Settings:

2023-10-06 to 2023-10-08

- resources:
    limits:
      cpu: '5'
      memory: 20Gi
    requests:
      cpu: 500m
      memory: 10Gi

- name: SOLR_JAVA_MEM
  value: '-Xms8g -Xmx10g'
- name: GC_TUNE
  value: >-
    -XX:+UseG1GC -XX:+PerfDisableSharedMem
    -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250
    -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent
    -XX:InitiatingHeapOccupancyPercent=75

2023-10-08 onward

- resources:
    limits:
      cpu: '6'
      memory: 30Gi
    requests:
      cpu: '6'
      memory: 20Gi

 - name: SOLR_JAVA_MEM
    value: '-Xms16g -Xmx24g'
  - name: GC_TUNE
    value: >-
      -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled
      -XX:MaxGCPauseMillis=250 -XX:+AlwaysPreTouch
      -XX:+ExplicitGCInvokesConcurrent
      -XX:InitiatingHeapOccupancyPercent=75

image

image

image

Search-engine container throttling:
image
image
image
image

Restarts since last report:

  • October 10th 2023, 03:31:07.930
  • October 12th 2023, 03:04:11.843
  • October 12th 2023, 03:56:30.013
  • October 12th 2023, 05:01:22.260
  • October 12th 2023, 05:55:11.003
  • October 12th 2023, 06:28:28.116
  • October 13th 2023, 12:13:39.361
  • October 13th 2023, 12:25:56.895
  • October 13th 2023, 12:40:45.049
  • October 13th 2023, 12:49:51.075
  • October 13th 2023, 13:24:17.387
  • October 13th 2023, 13:34:41.587
  • October 13th 2023, 13:44:11.618
  • October 13th 2023, 13:58:55.254
  • October 13th 2023, 14:27:28.602
  • October 14th 2023, 07:13:07.981
  • October 14th 2023, 07:39:07.003
  • October 14th 2023, 08:25:35.409
  • October 14th 2023, 09:26:22.372
  • October 14th 2023, 09:33:54.701
  • October 14th 2023, 09:39:10.708
  • October 14th 2023, 09:47:08.524
  • October 14th 2023, 10:20:28.245
  • October 14th 2023, 10:25:58.895
  • October 14th 2023, 11:09:55.271
  • October 14th 2023, 11:41:24.026
  • October 14th 2023, 12:28:00.140
  • October 14th 2023, 12:44:51.028
  • October 14th 2023, 13:20:45.687
  • October 14th 2023, 14:25:40.141
  • October 14th 2023, 16:43:50.757
  • October 14th 2023, 19:30:12.245
  • October 14th 2023, 21:59:35.091
  • October 14th 2023, 22:05:41.511

Note resources were significantly increased in response to a heavy query load on 2023-10-08. Resources on other services were further adjusted on 2023-10-13 in response to the significant spike in query volume. The number of frontend and API instances were decreased and the query rate limit was also decreased.

One thing to note is that the search-engine experiences a very high percentage of throttling during the periods when it's restarting the most, even when the container has been given a significant allocation of CPU. CPU usage also spikes during these periods to the point where the container goes beyond it's allocation for long enough to be taken offline by k8s.

@WadeBarnes
Copy link
Member Author

Reminder to monitor the active threads (top -H) during heavy load. I suspect the GC threads will be the most active.

top - 16:00:14 up 130 days, 22:34,  0 users,  load average: 9.78, 8.58, 9.21
Threads:  56 total,   3 running,  53 sleeping,   0 stopped,   0 zombie
%Cpu(s):  24.9/9.4    34[                                                                                                    ]
MiB Mem : 38.0/257605.4 [                                                                                                    ]
MiB Swap:  0.0/0.0      [                                                                                                    ]

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                 
      1 1002370+  20   0    2620    600    528 S   0.0   0.0   0:00.08 sh                                                                                                                                                      
      7 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:00.02  `- java                                                                                                                                                
     58 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:02.06      `- java                                                                                                                                            
     59 1002370+  20   0   35.5g  26.1g   1.2g S  11.6  10.4  37:13.99      `- GC Thread#0                                                                                                                                     
     60 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:01.46      `- G1 Main Marker                                                                                                                                  
     61 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4  12:20.16      `- G1 Conc#0                                                                                                                                       
     62 1002370+  20   0   35.5g  26.1g   1.2g S  11.3  10.4  37:13.85      `- GC Thread#1                                                                                                                                     
     63 1002370+  20   0   35.5g  26.1g   1.2g S  11.3  10.4  37:13.04      `- GC Thread#2                                                                                                                                     
     64 1002370+  20   0   35.5g  26.1g   1.2g S  11.6  10.4  37:13.75      `- GC Thread#3                                                                                                                                     
     65 1002370+  20   0   35.5g  26.1g   1.2g S  11.6  10.4  37:13.75      `- GC Thread#4                                                                                                                                     
     66 1002370+  20   0   35.5g  26.1g   1.2g S  12.0  10.4  37:13.48      `- GC Thread#5                                                                                                                                     
     67 1002370+  20   0   35.5g  26.1g   1.2g S   0.3  10.4  13:42.93      `- G1 Refine#0                                                                                                                                     
     68 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:34.28      `- G1 Young RemSet                                                                                                                                 
     69 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:19.12      `- VM Thread                                                                                                                                       
     70 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:00.02      `- Reference Handl                                                                                                                                 
     71 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:00.00      `- Finalizer                                                                                                                                       
     72 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:00.00      `- Signal Dispatch                                                                                                                                 
     73 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:00.00      `- Service Thread                                                                                                                                  
     74 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   1:15.37      `- C2 CompilerThre                                                                                                                                 
     75 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:07.41      `- C1 CompilerThre                                                                                                                                 
     76 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:00.22      `- Sweeper thread                                                                                                                                  
     77 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:45.94      `- VM Periodic Tas                                                                                                                                 
     78 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:00.08      `- Common-Cleaner                                                                                                                                  
     81 1002370+  20   0   35.5g  26.1g   1.2g S   0.3  10.4   3:20.85      `- Log4j2-TF-3-Asy                                                                                                                                 
     82 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:00.00      `- ShutdownMonitor                                                                                                                                 
     86 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:03.22      `- qtp1548081008-1                                                                                                                                 
     93 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:00.04      `- Session-HouseKe                                                                                                                                 
     94 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4  12:20.68      `- G1 Conc#1                                                                                                                                       
     95 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:00.00      `- h2sc-6-thread-1                                                                                                                                 
     96 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:00.14      `- Connection evic                                                                                                                                 
     97 1002370+  20   0   35.5g  26.1g   1.2g S   0.0  10.4   0:00.13      `- Connection evic                                                                                                                                 

@WadeBarnes
Copy link
Member Author

search-engine right before being OOM killed:

top - 15:18:30 up 19:56,  0 users,  load average: 16.74, 30.24, 26.22
Threads: 257 total,   7 running, 250 sleeping,   0 stopped,   0 zombie
%Cpu(s): 26.8 us,  4.1 sy,  0.0 ni, 68.5 id,  0.0 wa,  0.3 hi,  0.4 si,  0.0 st
MiB Mem : 257605.4 total, 121852.3 free,  64512.1 used,  71241.0 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used. 190113.6 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                 
     63 1002370+  20   0   46.8g  26.6g 301972 R  99.3  10.6   2:32.57 GC Thread#2                                                                                             
     59 1002370+  20   0   46.8g  26.6g 301972 R  99.0  10.6   2:32.81 GC Thread#0                                                                                             
     65 1002370+  20   0   46.8g  26.6g 301972 R  99.0  10.6   2:32.30 GC Thread#4                                                                                             
     66 1002370+  20   0   46.8g  26.6g 301972 R  99.0  10.6   2:32.71 GC Thread#5                                                                                             
     62 1002370+  20   0   46.8g  26.6g 301972 R  98.7  10.6   2:32.95 GC Thread#1                                                                                             
     64 1002370+  20   0   46.8g  26.6g 301972 R  98.7  10.6   2:32.60 GC Thread#3                                                                                             
     69 1002370+  20   0   46.8g  26.6g 301972 S   0.7  10.6   0:01.38 VM Thread                                                                                               
    407 1002370+  20   0    9356   3972   3188 R   0.3   0.0   0:00.66 top                                                                                                     
      1 1002370+  20   0    2620    604    528 S   0.0   0.0   0:00.06 sh                                                                                                      
      7 1002370+  20   0   46.8g  26.6g 301972 S   0.0  10.6   0:00.01 java                                                                                                    
     58 1002370+  20   0   46.8g  26.6g 301972 S   0.0  10.6   0:01.95 java                                                                                                    
     60 1002370+  20   0   46.8g  26.6g 301972 S   0.0  10.6   0:00.00 G1 Main Marker                                                                                          
     61 1002370+  20   0   46.8g  26.6g 301972 S   0.0  10.6   0:18.66 G1 Conc#0                                                                                               
     67 1002370+  20   0   46.8g  26.6g 301972 S   0.0  10.6   0:34.62 G1 Refine#0                                                                                             
     68 1002370+  20   0   46.8g  26.6g 301972 S   0.0  10.6   0:00.16 G1 Young RemSet                                                                                         
     70 1002370+  20   0   46.8g  26.6g 301972 S   0.0  10.6   0:00.00 Reference Handl                                                                                         
     71 1002370+  20   0   46.8g  26.6g 301972 S   0.0  10.6   0:00.00 Finalizer                                                                                               
     72 1002370+  20   0   46.8g  26.6g 301972 S   0.0  10.6   0:00.00 Signal Dispatch                                                                                         
     73 1002370+  20   0   46.8g  26.6g 301972 S   0.0  10.6   0:00.00 Service Thread                                                                                          
     74 1002370+  20   0   46.8g  26.6g 301972 S   0.0  10.6   0:16.11 C2 CompilerThre 

@WadeBarnes
Copy link
Member Author

WadeBarnes commented Oct 20, 2023

The implementation of the blocklist, #138, as allowed us to limit the unbounded queries (bcgov/aries-vcr#762) causing the pressure on the search-engine resources. Therefore we're going to try out a resource reduction.

@WadeBarnes
Copy link
Member Author

Since the introduction of the blocklist both the api and search-engine pods have seen far less activity and have been maintaining more sane resource levels. There have been no pod restarts, though the memory usage on the api pods is growing.

For example:
image

@WadeBarnes
Copy link
Member Author

Resource usage has been steady and under control for both the serach-engine and api pods. Memory use for the api pods has been stable. No pod restarts.

@WadeBarnes WadeBarnes force-pushed the main branch 2 times, most recently from 2e7d002 to 37cf707 Compare October 25, 2023 14:04
Search-engine:
- Upgrade to Solr 8.11.2.
- Update load tests with real world heavy load queries.
- Refactor load tests to exercise the search-engine more.
  - Ensure each search query requests a random page within a particular page range.  This helps to minimize the number of cache hits, and therefore exercise the heap and garbage collection settings of the search-engine.
- Solr uses SOLR_JAVA_MEM to set the JAVA memory options, which overrides JAVA_OPTS.
- Adjust CPU and memory allocation to minimize container throttling and thrashing under load.
- Adjust health checks to reduce loading.

API:
- Adjust API resources to minimize container throttling and OOM issues during heavier load.

Signed-off-by: Wade Barnes <wade@neoterictech.ca>
@WadeBarnes
Copy link
Member Author

Commit history has been cleaned up, and the configurations have been deployed and tested in dev, test, and prod.

@WadeBarnes WadeBarnes marked this pull request as ready for review October 25, 2023 14:22
@WadeBarnes
Copy link
Member Author

WadeBarnes commented Oct 25, 2023

Search-engine resource use:
image
image
image

@esune esune merged commit 75fb1e0 into bcgov:main Oct 25, 2023
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants