JobIDStore IO issue at the stroke of midnight #158

Closed
FreddieMatherSmartDCSIT opened this issue Nov 27, 2023 · 2 comments · Fixed by #166

@FreddieMatherSmartDCSIT
Collaborator

After running a 24-hour performance test with 8 AEOSVDC containers, an IO error was observed with the JobIDStore in 2 of the 8 AEOSVDC containers. These 2 containers failed and never restarted.

Evidence

The 24-hour run used the following configuration and event rate:

  • 500 events per second sent for 24 hours
  • 1 AEReception container
  • 8 AEOSVDC containers
  • PV version 1.1.1
  • 32 GB memory box

The plot of cumulative sent vs processed events shows the two containers failing about 17 hours and 30 minutes into the test.
[image: cumulative sent vs processed events]

CPU

The CPU plots below show the containers that failed.

[image: CPU for all containers]

[images: CPU of failed containers]

Docker compose logs

The logs from the failed containers report the following error. It occurs right at midnight (the last timestamped entries before it are 23:59:59Z), which may or may not be relevant.

deploy-aeo_svdc_proc_8-1  | 2023-11-25T23:59:59Z Debug : Checking received event data against definition
deploy-aeo_svdc_proc_8-1  | 2023-11-25T23:59:59Z Debug : AESequenceDC::HappyJob.JobInProgress : Follow on event (5f9113b9-790c-4bf1-bc96-cda9b46d1cd4) for existing Job received with jobId = 142d56a4-44e0-48dd-a3de-9727a6c422f4 with Job Name = XOR_constraint_16, current event type = Q with 1 previous event ids
deploy-aeo_svdc_proc_8-1  | 2023-11-25T23:59:59Z Debug : Checking received event data against definition
deploy-aeo_svdc_proc_8-1  | 2023-11-25T23:59:59Z Debug : AEOrdering::AcceptEvent : jobId = 183c001b-b69d-4598-a18e-ad64970af3b4, auditEventType = Q, auditEventId = 003157cb-6b6b-4e76-a899-6382d5a0b141
deploy-aeo_svdc_proc_8-1  | 2023-11-25T23:59:59Z Debug : AEOrdering::AcceptEvent : jobId = f3cc582e-db15-4fd7-b7b4-592b65d39205, auditEventType = Q, auditEventId = 6b14f620-8c77-4f42-874e-ac4f9e424e1e
deploy-aeo_svdc_proc_8-1  | 2023-11-25T23:59:59Z Debug : AEOrdering::AcceptEvent : jobId = 8b39a527-15bc-4517-9578-bd8b317e0d11, auditEventType = Q, auditEventId = 8edb6fd0-02e5-49a8-946d-85c110637bc9
deploy-aeo_svdc_proc_8-1  | IO Error :No such file or directory
deploy-aeo_svdc_proc_8-1  |   Stack:
deploy-aeo_svdc_proc_8-1  |   #1	AEOrdering::JobStore.JobStoreUpdated:15
deploy-aeo_svdc_proc_8-1  |
deploy-aeo_svdc_proc_2-1  | 2023-11-25T23:59:59Z Debug : AEOrdering::AcceptEvent : jobId = 5633b5e4-124a-46c1-adde-9af6d4de82b7, auditEventType = Q, auditEventId = e28dfdea-f54c-4e26-86d9-f7256789d64a
deploy-aeo_svdc_proc_2-1  | 2023-11-25T23:59:59Z Debug : AEOrdering::AcceptEvent : jobId = 1b74e806-a643-4896-8dd3-a54c1561a313, auditEventType = Q, auditEventId = e724d78d-ef86-444a-99f6-2d9a88b60328
deploy-aeo_svdc_proc_2-1  | 2023-11-25T23:59:59Z Debug : AEOrdering::AcceptEvent : jobId = 142d56a4-44e0-48dd-a3de-9727a6c422f4, auditEventType = Q, auditEventId = 5f9113b9-790c-4bf1-bc96-cda9b46d1cd4
deploy-aeo_svdc_proc_2-1  | 2023-11-25T23:59:59Z Debug : AEOrdering::AcceptEvent : jobId = 183c001b-b69d-4598-a18e-ad64970af3b4, auditEventType = Q, auditEventId = 003157cb-6b6b-4e76-a899-6382d5a0b141
deploy-aeo_svdc_proc_2-1  | 2023-11-25T23:59:59Z Debug : AEOrdering::AcceptEvent : jobId = f3cc582e-db15-4fd7-b7b4-592b65d39205, auditEventType = Q, auditEventId = 6b14f620-8c77-4f42-874e-ac4f9e424e1e
deploy-aeo_svdc_proc_2-1  | 2023-11-25T23:59:59Z Debug : AEOrdering::AcceptEvent : jobId = 8b39a527-15bc-4517-9578-bd8b317e0d11, auditEventType = Q, auditEventId = 8edb6fd0-02e5-49a8-946d-85c110637bc9
deploy-aeo_svdc_proc_2-1  | IO Error :No such file or directory
deploy-aeo_svdc_proc_2-1  |   Stack:
deploy-aeo_svdc_proc_2-1  |   #1	AEOrdering::JobStore.JobStoreUpdated:15
deploy-aeo_svdc_proc_2-1  |
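
To check whether other containers hit the same error, and at what time, the log lines above can be scanned mechanically. The following is a minimal sketch, assuming the `<container>  | <timestamp> <message>` layout shown in the excerpt; the function name `find_io_errors` is hypothetical and not part of the project.

```python
import re
from typing import Dict, Iterable, List, Optional, Tuple

# "<container>  | <rest of line>", as printed by docker compose in the excerpt above.
LINE_RE = re.compile(r"^(?P<container>\S+)\s+\|\s?(?P<body>.*)$")
# Timestamped entries start with an ISO-8601 UTC time such as 2023-11-25T23:59:59Z.
TS_RE = re.compile(r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)\s")


def find_io_errors(lines: Iterable[str]) -> List[Tuple[str, Optional[str], str]]:
    """Return (container, last timestamp seen before the error, error text)
    for every "IO Error" line. The error line itself carries no timestamp,
    so the most recent timestamped line of the same container is reported."""
    last_ts: Dict[str, str] = {}
    hits: List[Tuple[str, Optional[str], str]] = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        container, body = m.group("container"), m.group("body")
        ts = TS_RE.match(body)
        if ts:
            last_ts[container] = ts.group("ts")
        elif body.startswith("IO Error"):
            hits.append((container, last_ts.get(container), body.strip()))
    return hits


sample = [
    "deploy-aeo_svdc_proc_8-1  | 2023-11-25T23:59:59Z Debug : Checking received event data against definition",
    "deploy-aeo_svdc_proc_8-1  | IO Error :No such file or directory",
    "deploy-aeo_svdc_proc_8-1  |   Stack:",
]
for container, ts, text in find_io_errors(sample):
    print(container, ts, text)
```

Tracking the last timestamp per container matters because docker compose interleaves output from several containers in one stream.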
@cortlandstarrett
Member

I wonder if a cron job "cleaned" up our JobID store file?

@cortlandstarrett
Member

We have reproduced this issue. It is indeed a stroke of midnight issue (when we are archiving our job id files).
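
For readers unfamiliar with this class of failure, here is a minimal sketch of how an archive step can produce "No such file or directory" for a writer that still refers to the file by path. It is an illustration only, not the project's code and not the fix in #166: the file layout, the function names (`update_store`, `archive_store`, `update_store_tolerant`) and the file name `jobidstore.txt` are all assumptions.

```python
import errno
import os
import tempfile
from typing import Optional


def update_store(path: str, job_id: str) -> None:
    """Append a job id to an existing store file (hypothetical one-id-per-line layout).
    Mode "r+" requires the file to exist, so a missing file raises an error."""
    with open(path, "r+", encoding="utf-8") as f:
        f.seek(0, os.SEEK_END)
        f.write(job_id + "\n")


def archive_store(path: str, archive_dir: str) -> str:
    """Move the store file into an archive directory, as a daily archive step might."""
    os.makedirs(archive_dir, exist_ok=True)
    target = os.path.join(archive_dir, os.path.basename(path))
    os.replace(path, target)
    return target


def update_store_tolerant(path: str, job_id: str) -> None:
    """Variant that tolerates archiving: append mode recreates the file if it is gone."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(job_id + "\n")


def demo() -> Optional[int]:
    """Reproduce the failure in a temp directory and return the errno observed."""
    observed: Optional[int] = None
    with tempfile.TemporaryDirectory() as tmp:
        store = os.path.join(tmp, "jobidstore.txt")  # hypothetical file name
        open(store, "w", encoding="utf-8").close()
        update_store(store, "job-1")                         # works: file exists
        archive_store(store, os.path.join(tmp, "archive"))   # "midnight" archive
        try:
            update_store(store, "job-2")                     # file was moved away
        except FileNotFoundError as exc:
            observed = exc.errno                             # ENOENT
        update_store_tolerant(store, "job-2")                # recreates the store
        assert os.path.exists(store)
    return observed


if __name__ == "__main__":
    print(demo() == errno.ENOENT)
```

Whether recreating the file is the right recovery depends on how the store is read back; the sketch only shows where the error comes from.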
