
AEOSVDC memory usage and container failures #157

Closed · FreddieMatherSmartDCSIT opened this issue Nov 27, 2023 · 5 comments

Comments

@FreddieMatherSmartDCSIT (Collaborator)

On longer endurance test runs for the Protocol Verifier, the PV maxes out CPU and falls behind on processing events. When this happens, the PV starts to consume memory holding onto the backed-up events, and usage grows as more events are added. Once all the memory of the host box is consumed, AEOSVDC containers start to fail periodically one by one (each time the box's memory fills up) and never restart.
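The "never restart" behaviour can be worked around at the deployment layer with a restart policy and a per-container memory cap while the leak itself is fixed in the PV. A minimal sketch using the Docker SDK for Python; the image tag, container name, and the 6 GB cap are hypothetical, since the actual deployment details are not in this issue:

```python
import docker

client = docker.from_env()

# mem_limit caps the container before it can exhaust the 32 GB host, and the
# restart policy brings a failed container back up instead of leaving it dead.
client.containers.run(
    "protocol-verifier/aeosvdc:1.1.1",  # hypothetical image tag
    name="aeosvdc-1",                   # hypothetical container name
    detach=True,
    mem_limit="6g",                     # hypothetical cap for a 32 GB box
    restart_policy={"Name": "on-failure", "MaximumRetryCount": 10},
)
```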

Evidence

The 24-hour run used the following container counts and event send rate:

  • 500 events per second sent for 24 hours
  • 1 AEReception container
  • 4 AEOSVDC containers
  • PV version 1.1.1
  • 32 GB memory box

The cumulative count of events processed diverges from the cumulative count of events sent at approximately the 8-hour mark into the test.

[image: cumulative events processed vs. cumulative events sent]

Response times start to increase non-linearly at this point, hitting a peak after which no more events are processed.

[image: response times over the course of the run]

After approximately 8 hours, the memory usage of all AEOSVDC containers starts to increase (claiming memory from Kafka) until the maximum memory of the box is hit. The AEOSVDC containers then fail progressively, one by one, as they consume all the memory of the box. Eventually only one of the 4 AEOSVDC containers is left, and its memory usage continues to climb towards the maximum memory of the box.

[image: container memory usage]

[images: CPU usage showing containers failing]
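Per-container memory curves of the kind plotted above can be collected with the Docker SDK for Python; a minimal sketch, assuming the AEOSVDC containers carry "aeosvdc" in their names (that prefix is an assumption, not from this issue):

```python
import time

import docker

client = docker.from_env()

# Sample memory usage for every running AEOSVDC container once a minute.
while True:
    for c in client.containers.list():
        if "aeosvdc" not in c.name:
            continue
        stats = c.stats(stream=False)  # one-shot stats snapshot
        used = stats["memory_stats"]["usage"]
        limit = stats["memory_stats"]["limit"]
        print(f"{c.name}: {used / 2**30:.2f} GiB of {limit / 2**30:.2f} GiB")
    time.sleep(60)
```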

@cortlandstarrett (Member)

This is likely due to the memory leak identified on 26 October.

@jt765487 (Collaborator)

@cortlandstarrett do you know when a version with the fix can be provided to @FreddieMatherSmartDCSIT?

@cortlandstarrett (Member)

@jt765487 , we could provide it today if desired. I can give @FreddieMatherSmartDCSIT the option of running now or waiting a day or two until we have run our own 24 hour test.

@FreddieMatherSmartDCSIT (Collaborator, Author)

@cortlandstarrett if you can provide us with a new version for defect retests, that would be great. We would need a bit of time to build the PV and get prepped for the tests, so that day or two would be useful. It's unlikely we would start the deployment process for the PV until tomorrow morning, as we are near the end of the day here, but all being well we would likely be ready to start retests by tomorrow afternoon or evening.

(@jt765487 for your info)

@cortlandstarrett (Member)

fixed in v1.1.3 (StoredJobId growth)
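The fix note points at unbounded growth of a stored-job-ID collection. The PV itself is not written in Python, but the failure pattern the note suggests looks like the sketch below (all names hypothetical): job IDs are recorded on every event and never pruned, so the structure grows for as long as events keep arriving.

```python
# Hypothetical illustration of a "StoredJobId growth" leak, not PV source code.
seen_job_ids: set[str] = set()

def process_event(job_id: str, event: dict) -> None:
    # Leak: an ID is stored for every job ever seen and never removed,
    # so memory grows without bound while events keep arriving.
    seen_job_ids.add(job_id)

def complete_job(job_id: str) -> None:
    # Shape of the fix: drop the ID once the job finishes so the set
    # tracks only in-flight jobs and memory stays flat.
    seen_job_ids.discard(job_id)
```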
