checksum-compute keeps timing out for rm242zg2446 #1005

Closed
andrewjbtw opened this issue Sep 30, 2022 · 4 comments

@andrewjbtw

Describe the bug
One druid has been repeatedly hitting a timeout error at the checksum-compute step in the assemblyWF: https://argo.stanford.edu/view/druid:rm242zg2446

Error: checksum-compute : Net::ReadTimeout with #<TCPSocket:(closed)>
https://app.honeybadger.io/projects/52894/faults/88727290

I haven't been able to figure out where the problem is. This step can take 12-24 hours for large files (over 1 TB), and I've never seen it time out before. I've tried pausing google-books and resetting the step while the system is less busy, in the hope that the step would run faster, but the timeout still occurred.

My suspicion is that the problem may be related to the number of files. This object has 25,385 files (a roughly 6,300-page publication), which makes it one of the larger objects in the SDR in terms of number of files.

User Impact
This is blocking accessioning as the step cannot be skipped or marked complete. This step creates the checksums in the Cocina metadata that are required for accessioning.

To Reproduce
Steps to reproduce the behavior:

  1. Go to https://argo.stanford.edu/view/druid:rm242zg2446
  2. Reset the step
  3. Wait about 2 hours
  4. See the error

Additional context
It's possible that this object will hit timeouts at later steps too. Depending on the step, resetting it or manually marking it complete may or may not work.

andrewjbtw added the bug label Sep 30, 2022
peetucket self-assigned this Sep 30, 2022

peetucket commented Sep 30, 2022

The HB alert suggests that the checksum computation finished and the robot then tried to post a giant blob of updated cocina via dor-services-client (here: https://github.com/sul-dlss/common-accessioning/blob/main/lib/robots/dor_repo/assembly/checksum_compute.rb#L18). That resulted in a dor-services-app call (made by dor-services-client), which then timed out. One possibility is setting a longer timeout in dor-services-client to give this big update more time to complete.

Basically, add builder.options[:timeout] = 600 # 10-minute timeout to wait for a response

here: https://github.com/sul-dlss/dor-services-client/blob/main/lib/dor/services/client.rb#L155-L158

I'm not sure what the default is (docs say Faraday does not impose one), but this suggests the default timeout of the Net::HTTP adapter used here is only 60 seconds: lostisland/faraday#708
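
For illustration, a minimal sketch of that change, assuming the client builds an ordinary Faraday connection (the real connection block lives at the link above and may differ; dor_services_url is just a placeholder):

Faraday.new(url: dor_services_url) do |builder|
  # ... existing middleware and adapter setup ...
  builder.options[:timeout] = 600      # wait up to 10 minutes for a response
  builder.options[:open_timeout] = 60  # time allowed to establish the connection
end

Equivalently, Faraday accepts the value at construction time, e.g. Faraday.new(url: dor_services_url, request: { timeout: 600 }).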

And your suspicion about future steps failing is a good one -- the next step is exif-collect, which will likely fail in the same way.

peetucket commented Sep 30, 2022

Producing the file set checksums manually now on the common-accessioning box. I'll then move the data that would be posted to DSA and try to perform the update manually on the DSA server to see what happens.

Create the checksum data: (the checksum-compute robot does this)

ssh lyberadmin@common-accessioning-prod-b
ROBOT_ENVIRONMENT=production bin/console
druid='druid:rm242zg2446'

robot = Robots::DorRepo::Assembly::ChecksumCompute.new
assembly_item = robot.send(:item, druid)
cocina_model = assembly_item.cocina_model;nil # trailing ;nil keeps the console from echoing the huge model

# this took about 1 hour and produced data to save (which would have then been posted to DSA)
file_sets = robot.send(:compute_checksums, assembly_item, cocina_model);nil

# save data to a file
File.open("../../shared/tmp/#{druid}_file_sets.dump", 'wb') { |f| f.write(Marshal.dump(file_sets)) }

# or reload if already saved
file_sets = Marshal.load(File.read("../../shared/tmp/druid:rm242zg2446_file_sets.dump"));nil # load up the data

updated = cocina_model.new(structural: cocina_model.structural.new(contains: file_sets));nil

Now move data to DSA:

ssh dor_services@dor-services-app-prod-a.stanford.edu
cd dor_services/shared/tmp
sftp lyberadmin@common-accessioning-prod-b
cd common-accessioning/shared/tmp
get druid:rm242zg2446_file_sets.dump tmp/druid:rm242zg2446_file_sets.dump
exit

Now load up the data in DSA and try the update:

bundle exec rails c -e p

conf.echo = false # don't echo large return values in the console
druid='druid:rm242zg2446'

file_sets = Marshal.load(File.read("../../shared/tmp/druid:rm242zg2446_file_sets.dump")) # load up the data

# this is the first thing DSA does on the API call; it takes about 60 seconds by itself,
# which could be part of the reason for the timeout if the default really is 60 seconds
cocina_model = CocinaObjectStore.find(druid)

# this is what the checksum-compute robot creates to post to DSA (i.e. this bit happens in the robot; we're just redoing it on DSA so we can use the data):
# https://github.com/sul-dlss/common-accessioning/blob/main/lib/robots/dor_repo/assembly/checksum_compute.rb#L17
updated = cocina_model.new(structural: cocina_model.structural.new(contains: file_sets));

# stand-in for the request params that DSA receives from dor-services-client
params = Cocina::Models.without_metadata(updated).to_json

# this is what the update action in DSA does (when it gets called from dor-services-client):
# note: the params DSA receives are the "updated" cocina structure from above
# https://github.com/sul-dlss/dor-services-app/blob/main/app/controllers/objects_controller.rb#L33
cocina_object = Cocina::Models.build(params.except(:action, :controller, :id).to_unsafe_h)

# this should build the updated object, but just blows up...
# following what is happening in cocina models, it looks like it can't even figure out which type of object it is:
# the line above is building the model here:
# https://github.com/sul-dlss/cocina-models/blob/main/lib/cocina/models.rb#L84
# which tries to determine the type of model to initialize by calling this on the hash it was given:
params.except(:action, :controller, :id).to_unsafe_h.with_indifferent_access.fetch('type')

# but it looks like both .with_indifferent_access and .fetch are very inefficient with very large hashes, causing problems
# this spews a lot of stuff to the terminal for 10 minutes

# but we know it is a DRO:
params[:type]
 => "https://cocina.sul.stanford.edu/models/book"
Cocina::Models::DRO::TYPES.include? updated[:type]
 => true

# so we can try to build the DRO directly (which is what would happen once the type is detected as a DRO):
cocina_object = Cocina::Models::DRO.new(params.except!(:action, :controller, :id).to_unsafe_h, false, true)

# save the object
UpdateObjectService.update(cocina_object, skip_lock: true);

@peetucket

Ok, after much experimentation, I am very confident that increasing the timeout for dor-services-client as shown in sul-dlss/dor-services-client#306 will allow this particular object to work. I manually completed the checksum-compute step on the common-accessioning box, increased the timeout for the request to 10 minutes on the console, and made the POST to DSA, and it succeeded. 10 minutes is probably overkill; we could try starting with 5 minutes.

I'm going to mark the step as completed manually since the result of that step has now been applied.
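
For anyone retracing this, the manual completion amounted to re-posting the updated cocina the same way the robot does at checksum_compute.rb#L18, once the client timeout had been raised. A rough sketch, assuming the console environment configures dor-services-client the way the robots do and that "updated" is the model built in the earlier session:

# same call the checksum-compute robot makes, with the longer client timeout in place
object_client = Dor::Services::Client.object('druid:rm242zg2446')
object_client.update(params: updated)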

@peetucket

This object has made it through assemblyWF and accessionWF. It may still encounter issues in preservation (we shall see), but I'm closing this ticket; if there are further issues, they are unlikely to be related to the problem I fixed here.
