Stuck Harvest - NOAA causing delays in the pipeline #917

Closed
nickumia-reisys opened this issue Apr 8, 2023 · 3 comments
Assignees
Labels
bug Something isn't working

Comments

@nickumia-reisys
Contributor

nickumia-reisys commented Apr 8, 2023

Workflow with Issue: 4 - Automated CKAN Jobs
Job Failed: ckan-auto-command
CKAN Command (in question): ckan geodatagov check-stuck-jobs
CKAN Command Schedule: 30 6 * * *
Cloud.gov Environment: prod

Last Commit: 25b56a2
Number of times run: 1
Last run by: btylerburton
Github Action Run: https://github.com/GSA/catalog.data.gov/actions/runs/4649204023

@nickumia-reisys
Contributor Author

NOAA caused delays in the harvesting pipeline on 4/8:

source_id: c084a438-6f6b-470d-93e0-16aeddb9f513 | created_time: 2023-04-06 14:31:27.072752 | current_time: 2023-04-08 06:38:49.505180+00:00 | gather_started: 2023-04-06 14:31:27.420864 | gather_finished: 2023-04-06 14:33:50.501971 | running_length: 1 day, 16:07:22.432428 | source_title: NOAA/NESDIS/ncei/accessions | organization: National Oceanic and Atmospheric Administration, Department of Commerce
source_id: 8f77b6d5-f630-4995-bdf3-0aee7158a7f3 | created_time: 2023-04-07 05:54:47.763399 | current_time: 2023-04-08 06:38:49.505180+00:00 | gather_started: 2023-04-07 07:14:21.686083 | gather_finished: 2023-04-07 07:14:23.195498 | running_length: 1 day, 0:44:01.741781 | source_title: Alaska Division of Geological and Geophysical Surveys | organization: State of Alaska
source_id: 8507fa43-f429-4095-b732-2177330ce485 | created_time: 2023-04-07 05:54:46.680944 | current_time: 2023-04-08 06:38:49.505180+00:00 | gather_started: 2023-04-07 07:14:06.720925 | gather_finished: 2023-04-07 07:14:19.763617 | running_length: 1 day, 0:44:02.824236 | source_title: SFO JSON | organization: City of San Francisco
source_id: f35df04a-a619-4f92-bf5c-b9915b083bb1 | created_time: 2023-04-07 05:54:47.977264 | current_time: 2023-04-08 06:38:49.505180+00:00 | gather_started: 2023-04-07 07:14:23.220864 | gather_finished: 2023-04-07 07:14:27.803802 | running_length: 1 day, 0:44:01.527916 | source_title: Alaska Department of Natural Resources, IRM | organization: State of Alaska
source_id: 7590e386-229e-453a-8e53-6f18e200e421 | created_time: 2023-04-07 05:54:46.104598 | current_time: 2023-04-08 06:38:49.505180+00:00 | gather_started: 2023-04-07 07:13:52.071500 | gather_finished: 2023-04-07 07:14:06.253432 | running_length: 1 day, 0:44:03.400582 | source_title: Chicago JSON | organization: City of Chicago
source_id: ee428166-33c7-4eef-aee8-66156e0e9e08 | created_time: 2023-04-07 05:54:18.179824 | current_time: 2023-04-08 06:38:49.505180+00:00 | gather_started: 2023-04-07 06:03:29.676371 | gather_finished: 2023-04-07 06:06:19.979303 | running_length: 1 day, 0:44:31.325356 | source_title: NGDC Paleo | organization: National Oceanic and Atmospheric Administration, Department of Commerce

As of yesterday, NOAA was the only one still "stuck":

source_id: c084a438-6f6b-470d-93e0-16aeddb9f513 | created_time: 2023-04-06 14:31:27.072752 | current_time: 2023-04-09 06:38:39.540218+00:00 | gather_started: 2023-04-06 14:31:27.420864 | gather_finished: 2023-04-06 14:33:50.501971 | running_length: 2 days, 16:07:12.467466 | source_title: NOAA/NESDIS/ncei/accessions | organization: National Oceanic and Atmospheric Administration, Department of Commerce

@FuhuXia I really don't like that harvesting takes 72 hours to force completion. Can we make it 48 hours?
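For reference, here is a rough sketch of how the report lines above could be checked against a 48-hour versus a 72-hour limit. It only compares the reported `running_length` with a threshold; the real timeout logic may take more into account. The function names (`parse_report_line`, `parse_running_length`, `exceeds_limit`) are hypothetical and not part of CKAN or `geodatagov`, and the sample line is shortened from the report above.

```python
import re
from datetime import timedelta


def parse_report_line(line: str) -> dict:
    """Split a 'key: value | key: value' report line into a dict of strings."""
    fields = {}
    for part in line.split(" | "):
        key, _, value = part.partition(": ")
        fields[key.strip()] = value.strip()
    return fields


def parse_running_length(text: str) -> timedelta:
    """Parse str(timedelta)-style text such as '1 day, 16:07:22.432428'."""
    match = re.fullmatch(
        r"(?:(\d+) days?, )?(\d+):(\d{2}):(\d{2})(?:\.(\d+))?", text
    )
    if match is None:
        raise ValueError(f"unrecognized running_length: {text!r}")
    days, hours, minutes, seconds, fraction = match.groups()
    return timedelta(
        days=int(days or 0),
        hours=int(hours),
        minutes=int(minutes),
        seconds=int(seconds),
        # Pad the fractional part to six digits so it reads as microseconds.
        microseconds=int((fraction or "0").ljust(6, "0")),
    )


def exceeds_limit(running: timedelta, limit_hours: int) -> bool:
    """True when the job has been running longer than the given limit."""
    return running > timedelta(hours=limit_hours)


# Shortened copy of the NOAA line from the later report above.
sample = (
    "source_id: c084a438-6f6b-470d-93e0-16aeddb9f513 | "
    "running_length: 2 days, 16:07:12.467466 | "
    "source_title: NOAA/NESDIS/ncei/accessions"
)
job = parse_report_line(sample)
running = parse_running_length(job["running_length"])
for limit in (48, 72):
    print(job["source_title"], f"over {limit}h limit:", exceeds_limit(running, limit))
```

With these numbers the NOAA job is past a 48-hour limit but not yet past the 72-hour one, which is why it is still listed as stuck.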

@jbrown-xentity
Contributor

Note for future self: the logic for the system timing out is here, along with these functions here. Those last functions are key. The check first asks "has there been any movement on any of the harvest objects within the timeout limit?" (in our current case, within the last 72 hours); if so, it doesn't force a timeout. If no objects have been processed, it falls back to the job info (when the gather finished). So our timeout should be longer than the following:

  • Our longest expected gather (< 1 hour, to my knowledge)
  • Our longest expected fetch & import (< 12 hours, to my knowledge) (this is because a job may be stuck waiting in the queue behind another job for however long that job takes to complete)

I think 48 hours is probably overly safe, but please correct me if I'm wrong or missing something, @FuhuXia. I would advise 24 hours, but I know @FuhuXia noted some problems with that implementation in practice.
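A minimal sketch of the decision described in this comment, to make the two-step check concrete. This is an interpretation, not the actual harvester code: `should_force_timeout` and its parameters are hypothetical names, and it assumes that when objects have been processed but none recently enough, the job is timed out.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def should_force_timeout(
    now: datetime,
    timeout: timedelta,
    object_activity_times: Iterable[datetime],
    gather_finished: Optional[datetime],
) -> bool:
    """Decide whether a running harvest job should be forced to finish.

    Step 1: if any harvest object has been processed, use the most recent
    activity; movement within the timeout window means "leave it alone".
    Step 2: if no object has been processed, fall back to the job info,
    i.e. how long ago the gather stage finished.
    """
    latest_activity = max(object_activity_times, default=None)
    if latest_activity is not None:
        return now - latest_activity > timeout
    if gather_finished is not None:
        return now - gather_finished > timeout
    # Nothing to measure against yet (assumption: do not force a timeout).
    return False


# Timestamps rounded from the NOAA job in the report; no objects processed.
now = datetime(2023, 4, 9, 6, 38, 39)
gather_finished = datetime(2023, 4, 6, 14, 33, 50)
for hours in (24, 48, 72):
    forced = should_force_timeout(now, timedelta(hours=hours), [], gather_finished)
    print(f"timeout={hours}h -> force completion: {forced}")
```

Under this reading, the timeout only needs to exceed the longest quiet period a healthy job can have, which is why the gather and the fetch & import estimates above set the lower bound.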

@nickumia-reisys
Contributor Author

Related to

@nickumia-reisys nickumia-reisys changed the title Automated CKAN Job Error Condition Stuck Harvest - NOAA causing delays in the pipeline Apr 19, 2023