[LintDiff][PR Workflow] Handle cases where check result is too large to be processed by our infrastructure #7457
@rkmanda asked in this post to increase timeout from 60 to 120 minutes. Relevant docs:
PR:
At least one case of timing out after 3h reported by Vanessa Arndorfer on Teams here:
And another one timed out after 3h:
More successful-but-long-running occurrences reported by Suhas Rao:
Pull Request 536820: Updated LintDiff.yml: increase timeout from 180 min to 300 min
A case where LintDiff timed out after 5 hours:
Note this is similar to the API spec as explained here:
The logs say:
(...)
So from the mandatory 2 LintDiff runs, the first one, PR:
Update 3/21/2024: After the increased timeout, LintDiff finished after 10h 54min:
Pull Request 541101: Updated LintRPaaS.yml: set timeout to 1440 minutes (24h)
TLDR:
Dumping some of my correspondence here to provide context:
From email thread with subject
LINTDIFF big diff issue: Asking for help & info about issues with LintDiff runs that produce a large diff
Email 1 from me:
Email 2:
Reply by Roopesh:
Email 3 from me:
Re:
The changes have been approved by Mike K. because (info from private Teams group chat):
Technical details
The affected PR:
Modified files in 12 API versions. As a result, the LintDiff check launched AutoRest with the https://github.com/Azure/azure-openapi-validator extension 24 times: twice (before and after changes) for each API version, resulting in a gigantic diff. The tool ran for over 27 minutes and produced a log 402 MB in size. The same problem happened for staging LintDiff.
This is the produced 402 MB log:
https://dev.azure.com/azure-sdk/590cfd2a-581c-4dcb-a12e-6568ce786175/_apis/build/builds/3336876/logs/20
Hotfix applied
We put a try/catch block around the code that writes results to the database, so that processing continues on failure.
The failure happened because the task result data was so big that it crossed the 17 MB threshold, as explained in this Stack Overflow answer.
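The hotfix described above can be sketched roughly as follows. This is only an illustration, not the actual pipeline-bot code: `TaskResult`, `ResultStore`, `exceedsLimit`, and `saveTaskResult` are hypothetical names, the up-front size check is an addition that the hotfix itself may not include, and the limit is assumed to be 17 * 1024 * 1024 bytes based on the 17 MB threshold mentioned above.

```typescript
// Hypothetical shape of a check result; the real pipeline-bot types are not shown in this issue.
interface TaskResult {
  checkName: string;
  output: string;
}

// Hypothetical storage interface standing in for the real database client.
interface ResultStore {
  put(doc: TaskResult): Promise<void>;
}

// Assumption: the 17 MB threshold from the issue, expressed in bytes.
const MAX_DOC_BYTES = 17 * 1024 * 1024;

// Returns true when the JSON-serialized result is larger than the limit.
function exceedsLimit(result: TaskResult): boolean {
  const bytes = new TextEncoder().encode(JSON.stringify(result)).length;
  return bytes > MAX_DOC_BYTES;
}

// Returns true if the result was stored, false if it was skipped or the write failed.
// It never throws, so one oversized result cannot crash the caller.
async function saveTaskResult(store: ResultStore, result: TaskResult): Promise<boolean> {
  try {
    if (exceedsLimit(result)) {
      console.warn(`Skipping '${result.checkName}': serialized result exceeds ${MAX_DOC_BYTES} bytes`);
      return false;
    }
    await store.put(result);
    return true;
  } catch (err) {
    // Continue on failure: log the error and report it to the caller instead of rethrowing.
    console.error(`Failed to store result for '${result.checkName}':`, err);
    return false;
  }
}
```

Checking the size before the write makes the oversized case visible in the logs under the check name, while the catch still covers any other storage failure.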
How we found the root-cause
I queried the pipeline-bot logs and observed RangeError [ERR_OUT_OF_RANGE] in the column holding the console.out logs. Then I looked for relevant logs and found that we were trying to store too big a document, which was crashing our pipeline-bot instances before we added the hotfix try/catch. This log pointed to the name of the LintDiff check.
Example occurrence of the ERR_OUT_OF_RANGE in logs:
Chart showing when the problem started, due to large size of document to be put to db:
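The kind of log analysis described above (finding ERR_OUT_OF_RANGE occurrences and seeing when they started) could be approximated with a small helper like the one below. It is a sketch under assumptions: the `LogRecord` shape and the `countErrorsPerDay` function are hypothetical, the timestamps are assumed to be ISO 8601 strings in UTC, and the real query was run against the pipeline-bot log store, whose schema is not shown here.

```typescript
// Hypothetical log record; the real pipeline-bot log schema is not shown in this issue.
interface LogRecord {
  timestamp: string; // assumed ISO 8601 in UTC, e.g. "2024-03-21T10:15:00Z"
  consoleOut: string; // text of the console output column
}

// Counts, per UTC day, the records whose console output mentions the given error code.
function countErrorsPerDay(
  records: LogRecord[],
  needle: string = "ERR_OUT_OF_RANGE"
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const record of records) {
    if (!record.consoleOut.includes(needle)) {
      continue;
    }
    const day = record.timestamp.slice(0, 10); // "YYYY-MM-DD" prefix of the ISO timestamp
    counts.set(day, (counts.get(day) ?? 0) + 1);
  }
  return counts;
}
```

The earliest day in the resulting map gives a rough indication of when the problem first appeared.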