checksum-compute keeps timing out for rm242zg2446 #1005

Closed
andrewjbtw opened this issue Sep 30, 2022 · 4 comments

@andrewjbtw

Describe the bug
One druid has been repeatedly hitting a timeout error at the checksum-compute step in the assemblyWF: https://argo.stanford.edu/view/druid:rm242zg2446

Error: checksum-compute : Net::ReadTimeout with #<TCPSocket:(closed)>
https://app.honeybadger.io/projects/52894/faults/88727290

I haven't been able to figure out where the problem is. This step can take 12-24 hours for large files (over 1 TB), and I've never seen it time out before. I've tried pausing google-books and resetting the step while the system is less busy, in the hope that the step would run faster, but the timeout still occurred.

My suspicion is that the problem may be related to the number of files. This object has 25,385 files (a roughly 6,300-page publication), which makes it one of the larger objects in the SDR in terms of number of files.

User Impact
This is blocking accessioning as the step cannot be skipped or marked complete. This step creates the checksums in the Cocina metadata that are required for accessioning.

To Reproduce
Steps to reproduce the behavior:

  1. Go to https://argo.stanford.edu/view/druid:rm242zg2446
  2. Reset the step
  3. Wait about 2 hours
  4. See the error

Additional context
It's possible that this object will hit timeouts at later steps too. Depending on the step, resetting it or manually marking it complete may or may not work.

andrewjbtw added the bug label Sep 30, 2022
peetucket self-assigned this Sep 30, 2022

peetucket commented Sep 30, 2022

The HB alert suggests that the checksum computation finished and the robot then tried to post a giant blob of updated cocina via dor-services-client (here: https://github.com/sul-dlss/common-accessioning/blob/main/lib/robots/dor_repo/assembly/checksum_compute.rb#L18). That resulted in a dor-services-app call (made by dor-services-client), which then timed out. One possibility is setting a longer timeout in dor-services-client to give this big update more time to complete.

Basically, add builder.options[:timeout] = 600 # 10-minute timeout to wait for a response

here: https://github.com/sul-dlss/dor-services-client/blob/main/lib/dor/services/client.rb#L155-L158

I'm not sure what the default is (docs say Faraday does not impose one), but this suggests the default timeout of the Net::HTTP adapter used here is only 60 seconds: lostisland/faraday#708
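
For illustration, a minimal sketch of that change, assuming the client builds an ordinary Faraday connection (the real connection block lives at the link above and may differ; dor_services_url is just a placeholder):

Faraday.new(url: dor_services_url) do |builder|
  # ... existing middleware and adapter setup ...
  builder.options[:timeout] = 600      # wait up to 10 minutes for a response
  builder.options[:open_timeout] = 60  # time allowed to establish the connection
end

Equivalently, Faraday accepts the value at construction time, e.g. Faraday.new(url: dor_services_url, request: { timeout: 600 }).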

And your suspicion about future steps failing is a good one -- the next step is exif-collect, which will likely fail in the same way.

peetucket commented Sep 30, 2022

Producing the file set checksums manually now on the common-accessioning box. I'll then move the data that would be posted to DSA and try to perform the update manually on the DSA server to see what happens.

Create the checksum data: (the checksum-compute robot does this)

ssh lyberadmin@common-accessioning-prod-b
ROBOT_ENVIRONMENT=production bin/console
druid='druid:rm242zg2446'

robot = Robots::DorRepo::Assembly::ChecksumCompute.new
assembly_item = robot.send(:item, druid)
cocina_model = assembly_item.cocina_model;nil # trailing ;nil keeps the console from echoing the huge model

# this took about 1 hour and produced data to save (which would have then been posted to DSA)
file_sets = robot.send(:compute_checksums, assembly_item, cocina_model);nil

# save data to a file
File.open("../../shared/tmp/#{druid}_file_sets.dump", 'wb') { |f| f.write(Marshal.dump(file_sets)) }

# or reload if already saved
file_sets = Marshal.load(File.read("../../shared/tmp/druid:rm242zg2446_file_sets.dump"));nil # load up the data

updated = cocina_model.new(structural: cocina_model.structural.new(contains: file_sets));nil

Now move data to DSA:

ssh dor_services@dor-services-app-prod-a.stanford.edu
cd dor_services/shared/tmp
sftp lyberadmin@common-accessioning-prod-b
cd common-accessioning/shared/tmp
get druid:rm242zg2446_file_sets.dump tmp/druid:rm242zg2446_file_sets.dump
exit

Now load up the data in DSA and try the update:

bundle exec rails c -e p

conf.echo = false # don't echo large return values in the console
druid='druid:rm242zg2446'

file_sets = Marshal.load(File.read("../../shared/tmp/druid:rm242zg2446_file_sets.dump")) # load up the data

# this is the first thing DSA does on the API call; it takes about 60 seconds by itself,
# which could be part of the reason for the timeout if the default really is 60 seconds
cocina_model = CocinaObjectStore.find(druid)

# this is what the checksum-compute robot creates to post to DSA (i.e. this bit happens in the robot; we're just redoing it on DSA so we can use the data):
# https://github.com/sul-dlss/common-accessioning/blob/main/lib/robots/dor_repo/assembly/checksum_compute.rb#L17
updated = cocina_model.new(structural: cocina_model.structural.new(contains: file_sets));

# stand-in for the request params that DSA receives from dor-services-client
params = Cocina::Models.without_metadata(updated).to_json

# this is what the update action in DSA does (when it gets called from dor-services-client):
# note: the params DSA receives are the "updated" cocina structure from above
# https://github.com/sul-dlss/dor-services-app/blob/main/app/controllers/objects_controller.rb#L33
cocina_object = Cocina::Models.build(params.except(:action, :controller, :id).to_unsafe_h)

# this should build the updated object, but just blows up...
# following what is happening in cocina models, it looks like it can't even figure out which type of object it is:
# the line above is building the model here:
# https://github.com/sul-dlss/cocina-models/blob/main/lib/cocina/models.rb#L84
# which tries to determine the type of model to initialize by calling this on the hash it was given:
params.except(:action, :controller, :id).to_unsafe_h.with_indifferent_access.fetch('type')

# but it looks like both .with_indifferent_access and .fetch are very inefficient with very large hashes, causing problems
# this spews a lot of stuff to the terminal for 10 minutes

# but we know it is a DRO:
params[:type]
 => "https://cocina.sul.stanford.edu/models/book"
Cocina::Models::DRO::TYPES.include? updated[:type]
 => true

# so we can try to build the DRO directly (which is what would happen once the type is detected as a DRO):
cocina_object = Cocina::Models::DRO.new(params.except!(:action, :controller, :id).to_unsafe_h, false, true)

# save the object
UpdateObjectService.update(cocina_object, skip_lock: true);

@peetucket

Ok, after much experimentation, I am very confident that increasing the timeout for dor-services-client as shown in sul-dlss/dor-services-client#306 will allow this particular object to work. I manually completed the checksum-compute step on the common-accessioning box, increased the timeout for the request to 10 minutes on the console, and made the POST to DSA, and it succeeded. 10 minutes is probably overkill; we could try starting with 5 minutes.

I'm going to mark the step as completed manually since the result of that step has now been applied.
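
For anyone retracing this, the manual completion amounted to re-posting the updated cocina the same way the robot does at checksum_compute.rb#L18, once the client timeout had been raised. A rough sketch, assuming the console environment configures dor-services-client the way the robots do and that "updated" is the model built in the earlier session:

# same call the checksum-compute robot makes, with the longer client timeout in place
object_client = Dor::Services::Client.object('druid:rm242zg2446')
object_client.update(params: updated)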

@peetucket

This object has made it through assemblyWF and accessionWF. It may still encounter issues in preservation (we shall see), but I'm closing this ticket; if there are further issues, they are unlikely to be related to the problem I fixed here.
