
File inventory job error #1263

Closed
hectorcorrea opened this issue Jan 30, 2025 · 6 comments · Fixed by #1266

hectorcorrea commented Jan 30, 2025

Honeybadger reported an error when running the file inventory job in the background:

ActiveRecord::StatementInvalid: PG::ConnectionBad: PQconsumeInput() could not receive data from server: 
Connection timed out SSL SYSCALL error: Connection timed out

[PROJECT_ROOT]/app/jobs/file_inventory_job.rb:39:in `perform' 

The error information points to user_id 186 and project_id 44.

https://tigerdata-qa.princeton.edu/projects/44

Matt confirmed that this is indeed the project that he reported during the check-in meeting today and that the job has not finished (it looks like it crashed).

Honeybadger issue: https://app.honeybadger.io/projects/113559/faults/115546694

The original error was dated 1/30/2025 at 9:48 AM -05:00; I think this was when Matt ran the job. There is another error reported at 1/30/2025 at 10:53 AM -05:00.

I manually re-ran the job right before 11:20 AM; I don't see an error on Honeybadger and it looks like it generated the list. When I click "Download Complete List" it says the list was generated 3 minutes ago.

Matt kicked off the process again at 11:30 AM to see if the error was a hiccup or a permissions issue. Almost 20 minutes later Matt's job has not finished, but there is no Honeybadger error about it either. This is what the job looks like in the database:

#<FileInventoryRequest:0x00007f998ac64f78
 id: 13,
 user_id: 186,
 project_id: 44,
 job_id: "ecde7bff-e68d-440e-a44b-afcc2e4f4f86",
 completion_time: nil,
 state: "pending",
 type: "FileInventoryRequest",
 request_details: {"project_title"=>"Test 06-12-24"},
 created_at: Thu, 30 Jan 2025 16:30:04.976340000 UTC +00:00,
 updated_at: Thu, 30 Jan 2025 16:30:04.976340000 UTC +00:00>

hectorcorrea commented Jan 30, 2025

The Rails log shows that there is an asset.query.iterate being executed and so far it has fetched 85,000 files (out of the 189,051 files for this project).

Here is a snippet of the log:

     <asset id="102053" version="1">
        <name>1734-h.htm</name>
        <path>/princeton/tigerdata/projectg/1/7/3/1734/1734-h/1734-h.htm</path>
        <total-size>482625</total-size>
        <mtime>27-Jan-2021 17:40:06</mtime>
        <collection>false</collection>
      </asset>
      <iterated complete="false">88000</iterated>
    </result>
  </reply>
</response>

D, [2025-01-30T16:55:05.654240 #381594] DEBUG -- :

I am not sure why it is soooooooo slow. When I ran the export it finished in a few minutes, but Matt's job has been running for 20+ minutes by now.

hectorcorrea commented Jan 30, 2025

As a side note, there is a surprisingly large number of SQL SELECTs related to the flipflop feature in the log as this job is running. This might explain why the original error points to a PostgreSQL timeout:

D, [2025-01-30T17:06:28.504226 #381594] DEBUG -- : [ActiveJob] [FileInventoryJob] [fcb2d7a2-49d1-4347-9358-56e1a4cc5099] 
[dd.env=qa dd.service=tigerdata dd.version=1.0.0 dd.trace_id=0 dd.span_id=0 ddsource=ruby]   CACHE Flipflop::Feature Load (0.0ms)  
SELECT "flipflop_features".* FROM "flipflop_features" WHERE "flipflop_features"."key" = $1 ORDER BY "flipflop_features"."id" ASC LIMIT $2  [["key", "alternate_mediaflux"], ["LIMIT", 1]]

Almost 200,000 SQL statements:

$ grep "flipflop_features" log/qa.log | wc -l
199667

These requests might be hitting the cache rather than the DB; I am not sure about this. Even so, the number is way too large, even if they are hitting the cache.
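
If the per-file code path is what asks Flipflop for this flag, one option would be to look the flag up once before the iteration starts. A rough sketch of that idea (the method and variable names below are made up for illustration; this is not the actual FileInventoryJob code):

# Rough sketch, not the actual FileInventoryJob code: look the flag up once
# before iterating the ~190,000 assets, instead of asking Flipflop per file.
def write_inventory(assets, csv)
  use_alternate = Flipflop.enabled?(:alternate_mediaflux) # single lookup

  assets.each do |asset|
    # build the CSV row for this asset using the memoized value,
    # rather than re-checking the feature flag here
    csv << [asset.name, asset.path, use_alternate]
  end
end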

hectorcorrea commented Jan 30, 2025

It looks like TWO requests are on-going for the file list. The log output shows one at 42,000 files and the other at 158,000. Both show similar paths for the files (i.e. /princeton/tigerdata/projectg), which makes me think both are related to Matt's requests: his original one and the one at around 11:30 AM.

One of the requests finished at 12:24 PM, but none of the records in the user_request table were updated from "pending" to "complete", and an output_file was not set either.

Waiting to see what happens when the second job finishes (it has exported 119,000 file names).

As of right now there are two files generated on disk. I think the one at 17:23 is the one that should have been attached to Matt's job, and the one at 16:24 might be from when I ran it (these are UTC times):

-rw-rw-r-- 1 deploy deploy 22090419 Jan 30 17:23 ecde7bff-e68d-440e-a44b-afcc2e4f4f86.csv
-rw-rw-r-- 1 deploy deploy 22090419 Jan 30 16:24 f5069863-3c37-4ecf-8aff-864e0b3462f8.csv
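
For reference, a quick way to double-check the state of these requests from the Rails console (using only the fields visible in the record above):

FileInventoryRequest.where(project_id: 44).order(:created_at).each do |req|
  puts [req.id, req.job_id, req.state, req.completion_time].inspect
end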

carolyncole commented Jan 30, 2025

hectorcorrea commented Jan 30, 2025

It looks like the jobs are finishing, but they are failing when saving the update to the database (even though the file list was saved successfully to disk). And as @carolyncole noted, Sidekiq then considers that a failed job and retries it.
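
If the problem is that the PostgreSQL connection goes stale during the long Mediaflux iteration, one possible mitigation would be to verify the ActiveRecord connection right before the final save. Just a sketch of the idea, not the current job code (the update shown assumes the state and completion_time fields from the record above):

# Sketch only: after the long-running Mediaflux export, make sure the
# ActiveRecord connection is still usable before updating the request record.
ActiveRecord::Base.connection.verify!   # reconnects if the connection went stale

request = FileInventoryRequest.find(13) # id from the record above
request.update!(state: "complete", completion_time: Time.current)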

hectorcorrea commented Jan 30, 2025

I manually killed the two jobs in Sidekiq QA for this file inventory so that Sidekiq stops retrying them.
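
For the record, something along these lines should also work from a Rails console using the public Sidekiq API (the jid shown is the job_id from the FileInventoryRequest record above):

require "sidekiq/api"

# jid of a job to stop retrying; add the second job's jid to the list as well
jids = ["ecde7bff-e68d-440e-a44b-afcc2e4f4f86"]

Sidekiq::RetrySet.new.each do |entry|
  entry.delete if jids.include?(entry.jid)
end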
