JobIDStore IO issue at the stroke of midnight #158

FreddieMatherSmartDCSIT · 2023-11-27T11:57:48Z

After running a 24-hour performance test with 8 AEOSVDC containers, an IO error was observed in the JobIDStore in 2 of the 8 containers. These 2 containers failed and never restarted.

Evidence

The 24-hour run used the following configuration and event rate:

500 events per second sent for 24 hours
1 AEReception container
8 AEOSVDC containers
PV version 1.1.1
32 GB memory box

The plot of cumulative sent versus processed events shows the two containers failing about 17 hours and 30 minutes into the test.

CPU

The CPU plots below show the containers that failed.

CPU for all containers

CPU of failed containers

Docker compose logs

The logs from both failed containers report the following error. The last timestamped entries before it are at 23:59:59Z, so it occurs right at midnight (this may or may not be relevant):

deploy-aeo_svdc_proc_8-1  | 2023-11-25T23:59:59Z Debug : Checking received event data against definition
deploy-aeo_svdc_proc_8-1  | 2023-11-25T23:59:59Z Debug : AESequenceDC::HappyJob.JobInProgress : Follow on event (5f9113b9-790c-4bf1-bc96-cda9b46d1cd4) for existing Job received with jobId = 142d56a4-44e0-48dd-a3de-9727a6c422f4 with Job Name = XOR_constraint_16, current event type = Q with 1 previous event ids
deploy-aeo_svdc_proc_8-1  | 2023-11-25T23:59:59Z Debug : Checking received event data against definition
deploy-aeo_svdc_proc_8-1  | 2023-11-25T23:59:59Z Debug : AEOrdering::AcceptEvent : jobId = 183c001b-b69d-4598-a18e-ad64970af3b4, auditEventType = Q, auditEventId = 003157cb-6b6b-4e76-a899-6382d5a0b141
deploy-aeo_svdc_proc_8-1  | 2023-11-25T23:59:59Z Debug : AEOrdering::AcceptEvent : jobId = f3cc582e-db15-4fd7-b7b4-592b65d39205, auditEventType = Q, auditEventId = 6b14f620-8c77-4f42-874e-ac4f9e424e1e
deploy-aeo_svdc_proc_8-1  | 2023-11-25T23:59:59Z Debug : AEOrdering::AcceptEvent : jobId = 8b39a527-15bc-4517-9578-bd8b317e0d11, auditEventType = Q, auditEventId = 8edb6fd0-02e5-49a8-946d-85c110637bc9
deploy-aeo_svdc_proc_8-1  | IO Error :No such file or directory
deploy-aeo_svdc_proc_8-1  |   Stack:
deploy-aeo_svdc_proc_8-1  |   #1	AEOrdering::JobStore.JobStoreUpdated:15
deploy-aeo_svdc_proc_8-1  |

deploy-aeo_svdc_proc_2-1  | 2023-11-25T23:59:59Z Debug : AEOrdering::AcceptEvent : jobId = 5633b5e4-124a-46c1-adde-9af6d4de82b7, auditEventType = Q, auditEventId = e28dfdea-f54c-4e26-86d9-f7256789d64a
deploy-aeo_svdc_proc_2-1  | 2023-11-25T23:59:59Z Debug : AEOrdering::AcceptEvent : jobId = 1b74e806-a643-4896-8dd3-a54c1561a313, auditEventType = Q, auditEventId = e724d78d-ef86-444a-99f6-2d9a88b60328
deploy-aeo_svdc_proc_2-1  | 2023-11-25T23:59:59Z Debug : AEOrdering::AcceptEvent : jobId = 142d56a4-44e0-48dd-a3de-9727a6c422f4, auditEventType = Q, auditEventId = 5f9113b9-790c-4bf1-bc96-cda9b46d1cd4
deploy-aeo_svdc_proc_2-1  | 2023-11-25T23:59:59Z Debug : AEOrdering::AcceptEvent : jobId = 183c001b-b69d-4598-a18e-ad64970af3b4, auditEventType = Q, auditEventId = 003157cb-6b6b-4e76-a899-6382d5a0b141
deploy-aeo_svdc_proc_2-1  | 2023-11-25T23:59:59Z Debug : AEOrdering::AcceptEvent : jobId = f3cc582e-db15-4fd7-b7b4-592b65d39205, auditEventType = Q, auditEventId = 6b14f620-8c77-4f42-874e-ac4f9e424e1e
deploy-aeo_svdc_proc_2-1  | 2023-11-25T23:59:59Z Debug : AEOrdering::AcceptEvent : jobId = 8b39a527-15bc-4517-9578-bd8b317e0d11, auditEventType = Q, auditEventId = 8edb6fd0-02e5-49a8-946d-85c110637bc9
deploy-aeo_svdc_proc_2-1  | IO Error :No such file or directory
deploy-aeo_svdc_proc_2-1  |   Stack:
deploy-aeo_svdc_proc_2-1  |   #1	AEOrdering::JobStore.JobStoreUpdated:15
deploy-aeo_svdc_proc_2-1  |
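
For anyone checking a longer log capture for the same pattern, here is a minimal sketch that scans `docker compose` log lines and reports, per container, the last timestamp seen before the first `IO Error` line. It assumes the line layout shown above (`<container>  | <timestamp> <level> : <message>`). The function name `find_io_errors` is made up for illustration and is not part of the project.

```python
import re
from typing import Dict, Iterable, Optional

# Assumed layout, taken from the excerpt above: "<container>  | <rest of line>"
LINE_RE = re.compile(r"^(?P<container>\S+)\s+\|\s?(?P<rest>.*)$")
# Timestamps in the excerpt look like 2023-11-25T23:59:59Z
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)\b")


def find_io_errors(lines: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each container that logged 'IO Error' to its last prior timestamp."""
    last_ts: Dict[str, str] = {}
    failures: Dict[str, Optional[str]] = {}
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # blank lines or lines without a container prefix
        container, rest = match.group("container"), match.group("rest")
        ts = TS_RE.match(rest)
        if ts:
            last_ts[container] = ts.group(1)
        elif rest.startswith("IO Error") and container not in failures:
            # The error line itself carries no timestamp, so use the last one seen
            failures[container] = last_ts.get(container)
    return failures
```

Feeding it the saved output of `docker compose logs` (for example via `open("compose.log")`) would list each affected container alongside the timestamp of its last normal log entry.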


cortlandstarrett · 2023-11-27T18:34:37Z

I wonder if a cron job "cleaned" up our JobID store file?

cortlandstarrett · 2023-11-29T16:35:07Z

We have reproduced this issue. It is indeed a stroke of midnight issue (when we are archiving our job id files).
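
To illustrate the kind of race this describes, here is a minimal sketch, not the project's code and not necessarily the fix made in the linked pull request. It assumes an archiver that moves the store file away at midnight while another process still updates the file by path. All names (`archive`, `update_strict`, `update_tolerant`, the `jobidstore` file name) are made up for illustration.

```python
import os
import tempfile
from typing import Tuple


def archive(store_path: str, archive_dir: str) -> str:
    """Move the store file into archive_dir, as a nightly archiver might."""
    os.makedirs(archive_dir, exist_ok=True)
    target = os.path.join(archive_dir, os.path.basename(store_path))
    os.replace(store_path, target)
    return target


def update_strict(store_path: str, job_id: str) -> None:
    """Append a job id, assuming the file already exists ("r+" never creates it)."""
    with open(store_path, "r+") as f:
        f.seek(0, os.SEEK_END)
        f.write(job_id + "\n")


def update_tolerant(store_path: str, job_id: str) -> None:
    """Append a job id, creating the file if it is missing ("a" creates it)."""
    with open(store_path, "a") as f:
        f.write(job_id + "\n")


def demo() -> Tuple[bool, str]:
    with tempfile.TemporaryDirectory() as root:
        store = os.path.join(root, "jobidstore")
        update_tolerant(store, "job-1")                  # normal operation
        archive(store, os.path.join(root, "archive"))    # "midnight": file moves away
        try:
            update_strict(store, "job-2")                # path no longer exists
            strict_failed = False
        except FileNotFoundError:                        # errno ENOENT
            strict_failed = True
        update_tolerant(store, "job-3")                  # starts a fresh store file
        with open(store) as f:
            return strict_failed, f.read()
```

In this sketch the strict update fails with the same "No such file or directory" condition seen in the logs, while the tolerant update simply starts a new store file after the archive step. Whether the real store should recreate the file, or whether the archive step should coordinate with the writer, is a design decision for the project.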

cortlandstarrett assigned gregarnot Nov 27, 2023

gregarnot linked a pull request Dec 1, 2023 that will close this issue

158 jobidstore io issue at the stroke of midnight #166

Merged

cortlandstarrett closed this as completed in #166 Dec 1, 2023
