Another 500 in "dandi-api" (now dandi-archive) (only) #152
what is the difference between the default
Note that the PR in con/tinuous says to be
re UNK: but we do get to https://github.com/dandi/dandi-archive/actions/runs/2198420871/ (where we fail), or is that for some other commit? What do you think about the original 500 issue?
I'm not clear what you're asking.
Seeing as the run in question is six months old and thus the logs would have expired anyway even if GitHub wasn't erroring out, I'm wondering why you're trying to fetch logs that old. Maybe try limiting how far back you're fetching logs with the "since" config key?
is that run associated with the PR we label
Indeed. The question then is not why I do it, but why con/tinuous does -- we have had this cron job running regularly for months already, and the most recent commit is just a week old,
and then it started to error out on Oct 6 with
(a different run, but it seems to be just as old)
Yes, that's the run.
If there's a way to get the PR from the GitHub API, I don't know how. As I showed in my previous comment, the obvious ways which tinuous uses don't return any results.
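For reference, a minimal sketch of the kind of commit-to-PR lookup being discussed here, using GitHub's endpoint for listing pull requests associated with a commit. The repo, commit SHA, and token handling are placeholders, and this is not necessarily the lookup tinuous itself performs:

```python
# Hypothetical illustration of a commit -> PR lookup via the GitHub REST API.
# Repo, commit SHA, and GITHUB_TOKEN handling are placeholders.
import os
import requests

def pulls_for_commit(owner: str, repo: str, sha: str) -> list[dict]:
    """Return the pull requests GitHub associates with a given commit."""
    r = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/commits/{sha}/pulls",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
        },
    )
    r.raise_for_status()
    return r.json()

# Example (placeholder SHA); an empty list would match the "no results" seen above.
# print(pulls_for_commit("dandi", "dandi-archive", "<commit-sha>"))
```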
Can you post a complete log for a failed run? And remind me where the tinuous job for dandi-archive is set up?
complete email log from the most recent run
they are all run as a cron job on
Re 'UNK' - asked (again?) on https://github.com/orgs/community/discussions/36055 . In the comment above I thought I posted the location for the cron runs, but I do not see it -- it runs as dandi on drogon under /mnt/backup/dandi/tinuous-logs/dandi-api. edit: ah - formatting was bad - fixing
@yarikoptic At the top of the logs:
The most likely explanation for why it's going back so far is that some workflow run around that time was never marked as "completed" on GitHub. ... and if I use the API to go through all the workflow runs in that repository back to the given timestamp, I find one that's not "completed": https://github.com/dandi/dandi-archive/actions/runs/1994129192 (currently marked "queued" despite being seven months old). I recommend simply updating the "since" key in
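A rough sketch of the kind of API sweep described above: page through the repository's workflow runs back to a given timestamp and report any that are still not "completed". The cutoff date and token handling are placeholders:

```python
# Sketch: list GitHub Actions workflow runs that are still not "completed".
# The cutoff timestamp and GITHUB_TOKEN handling are placeholders; this just
# mirrors the manual check described in the comment above.
import os
from datetime import datetime, timezone
import requests

API = "https://api.github.com/repos/dandi/dandi-archive/actions/runs"
HEADERS = {
    "Accept": "application/vnd.github+json",
    "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
}
cutoff = datetime(2022, 3, 1, tzinfo=timezone.utc)  # e.g. the configured "since"

done = False
page = 1
while not done:
    resp = requests.get(API, headers=HEADERS, params={"per_page": 100, "page": page})
    resp.raise_for_status()
    runs = resp.json()["workflow_runs"]
    if not runs:
        break
    for run in runs:
        created = datetime.fromisoformat(run["created_at"].replace("Z", "+00:00"))
        if created < cutoff:
            done = True  # runs are returned newest-first, so we can stop here
            break
        if run["status"] != "completed":
            print(run["html_url"], run["status"], run["created_at"])
    page += 1
```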
but why is the "since" not automatically set for each run to the dates we record in
@yarikoptic Do you mean the "since" in
good - that aligns with my understanding. So if there is a later timestamp in the statefile, it is used.
I see... so that is why we have
so we have had something incomplete to fetch since March of this year, but we were never alerted to that and keep going back that far in history to consider workflows every time we run? Do we know which one was incomplete, and why is it still incomplete? Since it is unlikely it could ever become complete (logs time out on those services), I think we should not allow this timestamp to go back further than the log-retention duration of each service, and pragmatically probably no more than a week (given that tinuous ran often during that week to try to complete fetching those logs). WDYT?
I linked it above.
Note that GitHub's maximum log retention period is 400 days, so that wouldn't have helped us here, and I'm not sure if Travis even has a maximum retention period. An alternative approach I was considering would be to emit a warning if a CI job is encountered that hasn't completed in, say, 72 hours.
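A minimal sketch of that warning idea, assuming a run is represented as the dict returned by the GitHub workflow-runs API (tinuous's internal representation may differ, and the 72-hour threshold is just the figure floated above):

```python
# Sketch of the proposed check: warn about any run that has stayed
# non-"completed" for more than 72 hours. `run` is assumed to be a dict as
# returned by the GitHub workflow-runs API.
import logging
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=72)

def warn_if_stale(run: dict) -> None:
    if run["status"] == "completed":
        return
    created = datetime.fromisoformat(run["created_at"].replace("Z", "+00:00"))
    age = datetime.now(timezone.utc) - created
    if age > STALE_AFTER:
        logging.warning(
            "Workflow run %s has been %r for %s without completing",
            run["html_url"], run["status"], age,
        )
```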
I do not think we should be implementing arbitrary limits like "no more than a week" just because it fits with what we do in our own setup; recall that tinuous has at least one other user.
I agree that we should not implement "arbitrary limits", but we can and should introduce some that match the limits of the CI platforms. E.g., GitHub workflows do have a time limit of a few hours. Thus can't we declare that workflows which ran 6 months ago and were never marked "completed" are to be considered "completed" for all intents and purposes of tinuous fetching/archiving them? I think so. For that, IMHO, we can establish a "non-arbitrary", informed limit, e.g. 1 week across all currently supported CIs, for declaring any given workflow "completed", since we know that pragmatically speaking it could not take even that long.
@yarikoptic How about adding a config option for the maximum amount of time back to look for CI runs (with a default of a week, if you insist)? It could even be given instead of the current "since" config option if the user doesn't want to give a specific timestamp.
Sounds good to me. Let's maybe even make it a month, and it sounds great to have that as the default mode of operation if no "since" is given.
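A hypothetical sketch of how such a maximum look-back could combine with the existing "since"/statefile timestamp. The option name `max_days_back` is made up here for illustration (the name actually chosen is not visible in this thread), and the 30-day default reflects the month suggested above:

```python
# Hypothetical combination of a maximum look-back with the "since"/statefile
# timestamp; `max_days_back` is an illustrative name, not necessarily the real
# option, and this is not tinuous's actual code.
from datetime import datetime, timedelta, timezone
from typing import Optional

def effective_since(
    state_since: Optional[datetime],
    max_days_back: int = 30,
    now: Optional[datetime] = None,
) -> datetime:
    """Never look further back than `max_days_back` days, even if the
    statefile (or configured "since") points earlier."""
    now = now or datetime.now(timezone.utc)
    floor = now - timedelta(days=max_days_back)
    if state_since is None:
        return floor
    return max(state_since, floor)
```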
@yarikoptic Let's call this new option "
sounds good to me. The tune-up (feel free to ignore) which comes to mind is
would it be hard to implement it so that it is
if you decide to go this way - OK with me. Just issue a warning so there might be a record left about that.
I would not bother with differentiating unspecified vs specified for this one. I would just make
@yarikoptic If we went with just implementing
I don't quite get why such anomalies (I might not be grasping entirely what a "failure" is) should keep the state file without an update, but that is ok -- we can go with
@yarikoptic They're not "failures" because no HTTP requests fail; the only thing suspicious about them from tinuous' point of view is how long they've gone without being completed. When tinuous sees a CI run that isn't complete, it just skips it, and the state timestamp is set to no later than the timestamp of the incomplete run, so that the run can (hopefully) be picked up by a later tinuous invocation once it's completed.
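A minimal sketch of the behaviour just described, not tinuous's actual code: completed runs are fetched, incomplete ones are skipped, and the timestamp written back to the statefile is clamped to no later than the earliest incomplete run so that a later invocation will see it again:

```python
# Sketch of the skip-and-retry behaviour described above; the run dicts,
# fetch callback, and statefile handling are simplified placeholders.
from datetime import datetime
from typing import Callable

def process_runs(
    runs: list[dict],
    fetch: Callable[[dict], None],
    prev_state: datetime,
) -> datetime:
    """Fetch completed runs; return the timestamp to record in the statefile."""
    new_state = prev_state
    incomplete: list[datetime] = []
    for run in runs:
        if run["status"] != "completed":
            incomplete.append(run["created_at"])
            continue  # skipped now, to be retried by a later invocation
        fetch(run)
        new_state = max(new_state, run["created_at"])
    if incomplete:
        # Hold the state back to no later than the earliest incomplete run.
        new_state = min(new_state, min(incomplete))
    return new_state
```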
right. And I was not talking in terms of an
#151 (released in 0.5.4) seems to have fixed #150 at large, but there are other 500s left:
needs investigation for the reason and a workaround...
so far I didn't spot it for any other repo. may be due to the rename? we still refer to `dandi/dandi-api` in its configuration.