checksum-compute keeps timing out for rm242zg2446 #1005
Comments
The HB alert suggests that the checksum computation finished and that the robot then tried to post a giant blob of updated Cocina via dor-services-client (here: https://github.com/sul-dlss/common-accessioning/blob/main/lib/robots/dor_repo/assembly/checksum_compute.rb#L18). The resulting dor-services-app call then timed out.
One possibility is setting a longer timeout in dor-services-client to give this big update more time to complete. I'm not sure what the default is (the docs say Faraday does not impose one), but this suggests that the default timeout of the Net::HTTP adapter it uses is only 60 seconds: lostisland/faraday#708
And your suspicion about future steps failing is a good one: the next step is exif-collect, which will likely fail in the same way.
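For reference, the 60-second default can be checked directly against Ruby's standard-library Net::HTTP, without Faraday or dor-services-client in the picture. This is only a minimal sketch: the `build_client` helper and the host name are hypothetical and are not part of dor-services-client, and no connection is opened because no request is made.

```ruby
require "net/http"

# Hypothetical helper for illustration only: builds a Net::HTTP client and
# optionally overrides its read timeout (in seconds).
def build_client(host, port, read_timeout: nil)
  http = Net::HTTP.new(host, port)
  # Leave Net::HTTP's built-in default in place unless a value is given.
  http.read_timeout = read_timeout if read_timeout
  http
end

# Placeholder host; nothing is contacted until a request is sent.
default_client = build_client("dsa.example.org", 80)
patient_client = build_client("dsa.example.org", 80, read_timeout: 300)

puts "default read timeout: #{default_client.read_timeout}s"
puts "raised read timeout:  #{patient_client.read_timeout}s"
```

A Net::ReadTimeout like the one in this ticket is raised when the server takes longer than `read_timeout` to send a response, so raising that value on the client side is the relevant knob for a slow, very large update.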
Producing the file set checksums manually now on the common-accessioning box. I will then move the data that is posted to DSA and try to perform the update manually on the DSA server to see what happens. Create the checksum data: (the checksum-compute robot does this)
Now move data to DSA:
Now load up the data in DSA and try the update:
OK, after much experimentation, I am very confident that increasing the timeout for dor-services-client as shown in sul-dlss/dor-services-client#306 will allow this particular object to work. I manually completed the checksum-compute step on the common-accessioning box, then increased the timeout for the request to 10 minutes on the console and made the POST to DSA, and it succeeded. 10 minutes is probably overkill; we could try starting with 5 minutes. I'm going to mark the step as completed manually since the result of that step has now been applied.
This object has made it through assemblyWF and accessionWF. It may still encounter issues in preservation (we shall see), but I'm closing this ticket because any further issues are unlikely to be related to the problem I fixed here.
Describe the bug
One druid has been repeatedly hitting a timeout error at the checksum-compute step in the assemblyWF: https://argo.stanford.edu/view/druid:rm242zg2446
Error: checksum-compute : Net::ReadTimeout with #<TCPSocket:(closed)>
https://app.honeybadger.io/projects/52894/faults/88727290
I haven't been able to figure out where the problem is. This step can take 12-24 hours for large files (over 1 TB), and I've never seen it time out before. I've tried pausing google-books and resetting the step while the system is less busy, in the hope that the step would run faster, but the timeout still occurred.
My suspicion is that the problem may be related to the number of files. This object has 25,385 files (a roughly 6,300-page publication), which makes it one of the larger objects in the SDR in terms of file count.
User Impact
This is blocking accessioning as the step cannot be skipped or marked complete. This step creates the checksums in the Cocina metadata that are required for accessioning.
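To make the per-file work concrete, here is a minimal sketch of computing checksums for one file with Ruby's standard-library Digest. The `file_checksums` helper is hypothetical, and the choice of MD5 and SHA-1 is an assumption for illustration; this is not the robot's actual code.

```ruby
require "digest"

# Hypothetical helper for illustration only: returns hex digests for a
# single file. The algorithms (MD5 and SHA-1) are an assumption.
def file_checksums(path)
  {
    md5: Digest::MD5.file(path).hexdigest,   # reads the file once
    sha1: Digest::SHA1.file(path).hexdigest  # reads the file a second time
  }
end
```

With 25,385 files, the result is one such set of digests per file, all of which end up in a single Cocina update.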
To Reproduce
Steps to reproduce the behavior:
Additional context
It's possible that this object will hit timeouts at later steps too. Depending on the step, it may or may not work to reset the step or manually mark it completed.