
File inventory job error #1263

Closed
hectorcorrea opened this issue Jan 30, 2025 · 6 comments · Fixed by #1266

hectorcorrea commented Jan 30, 2025

Honeybadger reported an error when running the file inventory job in the background:

ActiveRecord::StatementInvalid: PG::ConnectionBad: PQconsumeInput() could not receive data from server: 
Connection timed out SSL SYSCALL error: Connection timed out

[PROJECT_ROOT]/app/jobs/file_inventory_job.rb:39:in `perform' 

The error information points to user_id 186 and project_id 44.

https://tigerdata-qa.princeton.edu/projects/44

Matt confirmed that this is indeed the project that he reported during the check-in meeting today and that the job has not finished (it looks like it crashed).

Honeybadger issue: https://app.honeybadger.io/projects/113559/faults/115546694

The original error was dated 1/30/2025 at 9:48 AM -05:00; I think this was when Matt ran the job. There is another error reported at 1/30/2025 at 10:53 AM -05:00.

I manually re-ran the job right before 11:20 AM; I don't see an error on Honeybadger and it looks like it generated the list. When I click "Download Complete List" it says the list was generated 3 minutes ago.

Matt kicked off the process again at 11:30 AM to see if the error was a hiccup or a permissions issue. Almost 20 minutes later Matt's job has not finished, but there is no Honeybadger error about it either. This is what the job looks like in the database:

#<FileInventoryRequest:0x00007f998ac64f78
 id: 13,
 user_id: 186,
 project_id: 44,
 job_id: "ecde7bff-e68d-440e-a44b-afcc2e4f4f86",
 completion_time: nil,
 state: "pending",
 type: "FileInventoryRequest",
 request_details: {"project_title"=>"Test 06-12-24"},
 created_at: Thu, 30 Jan 2025 16:30:04.976340000 UTC +00:00,
 updated_at: Thu, 30 Jan 2025 16:30:04.976340000 UTC +00:00>

hectorcorrea commented Jan 30, 2025

The Rails log shows that there is an asset.query.iterate being executed and so far it has fetched 85,000 files (out of the 189,051 files for this project).

Here is a snippet of the log:

     <asset id="102053" version="1">
        <name>1734-h.htm</name>
        <path>/princeton/tigerdata/projectg/1/7/3/1734/1734-h/1734-h.htm</path>
        <total-size>482625</total-size>
        <mtime>27-Jan-2021 17:40:06</mtime>
        <collection>false</collection>
      </asset>
      <iterated complete="false">88000</iterated>
    </result>
  </reply>
</response>

D, [2025-01-30T16:55:05.654240 #381594] DEBUG -- :

I am not sure why it is soooooooo slow. When I ran the export it finished in a few minutes, but Matt's job has been running for 20+ minutes by now.

hectorcorrea commented Jan 30, 2025

As a side note, there is a surprisingly large number of SQL SELECTs related to the flipflop feature in the log as this job is running. This might explain why the original error points to a PostgreSQL timeout:

D, [2025-01-30T17:06:28.504226 #381594] DEBUG -- : [ActiveJob] [FileInventoryJob] [fcb2d7a2-49d1-4347-9358-56e1a4cc5099] 
[dd.env=qa dd.service=tigerdata dd.version=1.0.0 dd.trace_id=0 dd.span_id=0 ddsource=ruby]   CACHE Flipflop::Feature Load (0.0ms)  
SELECT "flipflop_features".* FROM "flipflop_features" WHERE "flipflop_features"."key" = $1 ORDER BY "flipflop_features"."id" ASC LIMIT $2  [["key", "alternate_mediaflux"], ["LIMIT", 1]]

Almost 200,000 SQL statements:

$ grep "flipflop_features" log/qa.log | wc -l
199667

These requests might be hitting the cache rather than the DB; I am not sure about this. Even so, the number is way too large, even if they are hitting the cache.
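
If the per-file code path is what asks Flipflop for this flag, one option would be to look the flag up once before the iteration starts. A rough sketch of that idea (the method and variable names below are made up for illustration; this is not the actual FileInventoryJob code):

# Rough sketch, not the actual FileInventoryJob code: look the flag up once
# before iterating the ~190,000 assets, instead of asking Flipflop per file.
def write_inventory(assets, csv)
  use_alternate = Flipflop.enabled?(:alternate_mediaflux) # single lookup

  assets.each do |asset|
    # build the CSV row for this asset using the memoized value,
    # rather than re-checking the feature flag here
    csv << [asset.name, asset.path, use_alternate]
  end
end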

hectorcorrea commented Jan 30, 2025

It looks like TWO requests are on-going for the file list. The log output shows one at 42,000 files and the other at 158,000. Both show similar paths for the files (i.e. /princeton/tigerdata/projectg), which makes me think both are related to Matt's requests: his original one and the one at around 11:30 AM.

One of the requests finished at 12:24 PM, but none of the records in the user_request table were updated from "pending" to "complete", and an output_file was not set either.

Waiting to see what happens when the second job finishes (it has exported 119,000 file names).

As of right now there are two files generated on disk. I think the one at 17:23 is the one that should have been attached to Matt's job, and the one at 16:24 might be from when I ran it (these are UTC times):

-rw-rw-r-- 1 deploy deploy 22090419 Jan 30 17:23 ecde7bff-e68d-440e-a44b-afcc2e4f4f86.csv
-rw-rw-r-- 1 deploy deploy 22090419 Jan 30 16:24 f5069863-3c37-4ecf-8aff-864e0b3462f8.csv
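
For reference, a quick way to double-check the state of these requests from the Rails console (using only the fields visible in the record above):

FileInventoryRequest.where(project_id: 44).order(:created_at).each do |req|
  puts [req.id, req.job_id, req.state, req.completion_time].inspect
end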

carolyncole commented Jan 30, 2025

hectorcorrea commented Jan 30, 2025

It looks like the jobs are finishing, but they are failing when saving the update to the database (even though the file list was saved successfully to disk). And as @carolyncole noted, Sidekiq then considers that a failed job and retries it.
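
If the problem is that the PostgreSQL connection goes stale during the long Mediaflux iteration, one possible mitigation would be to verify the ActiveRecord connection right before the final save. Just a sketch of the idea, not the current job code (the update shown assumes the state and completion_time fields from the record above):

# Sketch only: after the long-running Mediaflux export, make sure the
# ActiveRecord connection is still usable before updating the request record.
ActiveRecord::Base.connection.verify!   # reconnects if the connection went stale

request = FileInventoryRequest.find(13) # id from the record above
request.update!(state: "complete", completion_time: Time.current)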

hectorcorrea commented Jan 30, 2025

I manually killed the two jobs in Sidekiq QA for this file inventory so that Sidekiq stops retrying them.
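
For the record, something along these lines should also work from a Rails console using the public Sidekiq API (the jid shown is the job_id from the FileInventoryRequest record above):

require "sidekiq/api"

# jid of a job to stop retrying; add the second job's jid to the list as well
jids = ["ecde7bff-e68d-440e-a44b-afcc2e4f4f86"]

Sidekiq::RetrySet.new.each do |entry|
  entry.delete if jids.include?(entry.jid)
end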
