Investigate Production Issues and complete Prod Release #635

Closed
darunrs opened this issue Apr 3, 2024 · 7 comments
Collaborator

darunrs commented Apr 3, 2024

The recent prod release failed in production. It highlighted three issues:

  1. Coordinator logging is incomplete. Many actions are somehow not being logged.
  2. Block Streamer ran out of memory.
  3. Runner most likely also ran out of memory, though we couldn't see any explicit logs confirming that. However, the machine was inaccessible through SSH.

The prod release was directly responsible for the third one. It inexplicably increased Runner's memory usage by a lot. We need to investigate which change is causing this.

The first two are unrelated to the prod release. We've beefed up the Block Streamer machine for now.

darunrs self-assigned this on Apr 3, 2024
Collaborator Author

darunrs commented Apr 4, 2024

Issue presented itself in dev. Will do my testing in dev now.

Collaborator Author

darunrs commented Apr 4, 2024

Tried commit 6038fe6 in dev and it ran, although I still saw pretty high memory usage. You can see the behavior in the image below. The spike on the left is when dev failed due to the error we saw in prod. The spike on the right is the above commit.

Image

Collaborator Author

darunrs commented Apr 4, 2024

The issue is that, while reverting to that commit worked, the spike is still quite high. In production, which has many more Indexers, the spike only reached 11GB total. Both dev and prod ran on the same commit, although I did reduce prefetch from 10 to 3 for prod before the revert.
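As a rough back-of-the-envelope check on whether the prefetch setting alone could account for a difference like this, here is a minimal sketch. It assumes (and this is only an assumption, not something confirmed about Runner) that each Indexer may hold up to `prefetch` blocks in memory at once. The function name, the Indexer count, and the average block size are all hypothetical placeholders, not measured values.

```typescript
// Hypothetical helper, not part of Runner: upper bound on memory held by
// prefetched blocks, assuming every Indexer buffers `prefetch` blocks.
interface PrefetchEstimate {
  indexers: number;      // number of running Indexers (placeholder value)
  prefetch: number;      // blocks buffered ahead per Indexer
  avgBlockBytes: number; // assumed average in-memory size of one block
}

function estimatePrefetchBytes(e: PrefetchEstimate): number {
  return e.indexers * e.prefetch * e.avgBlockBytes;
}

function toGiB(bytes: number): number {
  return bytes / (1024 * 1024 * 1024);
}

// Placeholder inputs: 100 Indexers, 1 MiB per block.
const base = { indexers: 100, avgBlockBytes: 1024 * 1024 };
const atTen = estimatePrefetchBytes({ ...base, prefetch: 10 });
const atThree = estimatePrefetchBytes({ ...base, prefetch: 3 });

console.log(`prefetch=10: ${toGiB(atTen).toFixed(2)} GiB`);
console.log(`prefetch=3:  ${toGiB(atThree).toFixed(2)} GiB`);
```

Under this model the buffered memory scales linearly with prefetch, so going from 10 to 3 would cut that portion to 30%. Whether that portion is a meaningful share of the total is exactly what the dev experiment would need to show.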

Collaborator Author

darunrs commented Apr 4, 2024

Moving forward to the commit that implemented Instrumentation reduced the memory usage by 4GB. Not sure why. I am going to pause dev for a couple of minutes to let the messages back up and then start it again. I want to see whether it's the prefetch that makes the difference here or not.

Image

After some 10 minutes of Runner being off:

Image

Collaborator Author

darunrs commented Apr 5, 2024

Prod is suddenly working after I tried the prod release commit again. I don't know why. I will look into it more on Monday, but I want to leave prod up with the changes since the performance enhancements will help.

Collaborator

pkudinov commented Apr 5, 2024

We also had this ticket talking about memory leaks: #437

Collaborator

pkudinov commented Apr 9, 2024

Next step: do a prod release commit
