[LintDiff][PR Workflow] Handle cases where check result is too large to be processed by our infrastructure #7457

konrad-jamrozik · 2023-12-14T21:45:05Z

TLDR:

Tool is slow and doesn't run in parallel, so a lot of files with a lot of diffs is just slow
Small change causes entire API version to be re-evaluated twice. See here

Dumping here some of my correspondences to provide context:

From email thread with subject LINTDIFF big diff issue: Asking for help & info about issues with LintDiff runs that produce a large diff

Email 1 from me:

Hey Roopesh!
Some LintDiff runs produce large amount of data (and run for 25+ minutes). For example see this log which is for this check for PR 26992
Or this one, for Staging LintDiff, also for PR 26992.

Unfortunately, due to our infrastructure drawbacks we cannot process such runs fully, and we have to discard such results, without ever showing them on GitHub. This is because we use MongoDB database to communicate from ADO to GitHub, and the massive amount of data is above the 17 MB limit.

Ask to you from us:
While we will be working on figuring out how to make such be fully supported on GitHub, can you please have a look if this is expected behavior? Maybe some of the LintDiff rules are triggering too often? Would possibly truncating the results be an option here?

Email 2:

Hey, some additional info on what caused LintDiff to run for so long and produce so much data.

Looking at the affected PR of 26992, it modified files in 12 different API versions. As a result, LintDiff ran 24 times: for each of the API version, once “before changes” and once “after changes”. LintDiff runs as a plugin to AutoRest, and AutoRest decides which files are in scope by looking at the API version tag in README. This resulted in large amount of files being ingested as input to these AutoRest/LintDiff runs, resulting in the huge log. The tooling doesn’t care about which files from given API versions were modified exactly: it runs twice on all files in given API version, even if only 1 file changed in it.

The question is how we should handle such cases. There doesn’t seem to be much utility in producing this amount of errors, they won’t be too actionable to the PR author. LintDiff appears to be designed to work with PRs that modify no more than few API versions at a time.

Reply by Roopesh:

We need to inspect what sort of changes are being made across multiple API versions given any contract change to previous API versions would result in a breaking change. This is not representative of a typical scenario for which PRs are raised.

How frequently are you seeing this behavior? Also is there an impact beyond just the checks taking longer for that one PR? Or is it slowing checks for other PRs down as well - noisy neighbor?

Email 3 from me:

Re frequency:

I think we saw it, like, once 😃 in just one PR.

With the hotfix from today now the impact on the rest of the system will be primarily delay: the “automated merging requirements met” check won’t conclude until all dependencies conclude, including LintDiff, which may run for 25+ minutes in such cases. Also, the GitHub LintDiff check view won’t reflect the information: if the log is too large, more than 17 MB, now we just discard it and never show the update on GitHub side (can be still viewed in ADO though).

Re:

We need to inspect what sort of changes are being made across multiple API versions given any contract change to previous API versions would result in a breaking change. This is not representative of a typical scenario for which PRs are raised.

The changes have been approved by Mike K. because (info from private Teams group chat):

If the REST API was incorrect -- it does not match the behavior of the service -- then we allow an existing API version (preview or GA) to be "corrected". And that might involve correcting many GA versions. Notes from the approval:

The property does not exist in the OpenAPI

Just correcting the REST API definition.

Approvals like these happen once every few months.

Technical details

The affected PR:

https://github.com/Azure/azure-rest-api-specs/pull/26992/files

Modified files in 12 API versions. As a result, the LintDiff check launched AutoRest with https://github.com/Azure/azure-openapi-validator extension 24 times: twice (before and after changes) for each API version, resulting in gigantic diff. The tool ran for over 27 minutes and produced log 402 MB in size. The same problem happened for staging LintDiff.

This is the produced 402 MB log:
https://dev.azure.com/azure-sdk/590cfd2a-581c-4dcb-a12e-6568ce786175/_apis/build/builds/3336876/logs/20

Hotfix applied

We put a try/catch block over putting things in database that continues on failure:

https://devdiv.visualstudio.com/DevDiv/_git/openapi-alps/pullrequest/516527

The failure happened because the task result data was so big, it crossed the 17 MB threshold, as explained in this Stack Overflow answer.

How we found the root-cause

I queried pipeline-bot logs and observed the RangeError [ERR_OUT_OF_RANGE] in column having the console.out logs. Then I looked for relevant logs and found out we put too big of a document, which was crashing our pipeline-bot instances before we added the hotfix try/catch. This log pointed out to the name of the LintDiff check.

Example occurrence of the ERR_OUT_OF_RANGE in logs,

Chart showing when the problem started, due to large size of document to be put to db:

The text was updated successfully, but these errors were encountered:

konrad-jamrozik · 2024-03-04T22:36:41Z

@rkmanda asked in this post to increase timeout from 60 to 120 minutes.

Relevant docs:

PR:

Pull Request 533655: Increase LintDiff timeout to 120 minutes

konrad-jamrozik · 2024-03-12T18:13:52Z

More cases:

Pull Request 535599: Updated LintDiff.yml: increase LintDiff timeout to 180 min
Teams post on API spec review: Swagger Lint Diff check running for long time | by Vaibhav Jain (VJ)
Teams post on API spec review: Swagger Lint Diff check running for long time | by Young Park

konrad-jamrozik · 2024-03-18T06:28:37Z

More occurrences reported by Zhenuha Hu:

konrad-jamrozik · 2024-03-18T06:36:23Z

At least one case of timing out after 3h reported by Vanessa Arndorfer on Teams here:

And another one timed out after 3h:

https://github.com/Azure/azure-rest-api-specs-pr/pull/17224

More successful-but-long-running occurrences reported by Suhas Rao:

[MFE] Adding OneLake DatastoreType to 2024-04-01 stable azure-rest-api-specs#28092
- See here: [LintDiff][PR Workflow] Handle cases where check result is too large to be processed by our infrastructure #7457 (comment)
Add pipeline deployments to mfe.json azure-rest-api-specs#28115
Mark MOE API Contract MaxQueueWait as deprecated azure-rest-api-specs#28205

konrad-jamrozik · 2024-03-18T06:43:43Z

Pull Request 536820: Updated LintDiff.yml: increase timeout from 180 min to 300 min

konrad-jamrozik · 2024-03-21T03:50:02Z

A case where LintDiff timed out after 5 hours:

Note this is similar to the API spec as explained here:

[MFE] Adding OneLake DatastoreType to 2024-04-01 stable azure-rest-api-specs#28092 (comment)
/azure-rest-api-specs/specification/machinelearningservices/resource-manager/readme.md

Here we have --tag=package-preview-2024-04 versus the previous --tag=package-2024-04

- input-file:
  - Microsoft.MachineLearningServices/preview/2024-04-01-preview/machineLearningServices.json
  - Microsoft.MachineLearningServices/preview/2024-04-01-preview/mfe.json
  - Microsoft.MachineLearningServices/preview/2024-04-01-preview/registries.json
  - Microsoft.MachineLearningServices/preview/2024-04-01-preview/workspaceFeatures.json
  - Microsoft.MachineLearningServices/preview/2024-04-01-preview/workspaceRP.json

The 2024-04-01-preview dir contents
- Per the logs below, LintDiff runs on the input-files listed above located in this directory
- Notably machineLearningServices.json has 30'000+ lines

The logs say:

2024-03-20T02:08:33.1176420Z Processing AutoRest config README: specification/machinelearningservices/resource-manager/readme.md. beforeOrAfter: after.
2024-03-20T02:08:33.1179602Z ENTER definition momentOfTruth.executeAutoRestWithLintDiff. autoRestConfigReadmePath: /mnt/vss/_work/1/azure-rest-api-specs/specification/machinelearningservices/resource-manager/readme.md, tag: package-preview-2024-04, beforeOrAfter: after, openapiType: undefined, rpaasLint: false
2024-03-20T02:08:33.1248713Z Executing AutoRest with LintDiff: node /mnt/vss/_work/_tasks/AzureApiValidation_5654d05d-82c1-48da-ad8f-161b817f6d41/0.0.90/private/azure-swagger-validation/azureSwaggerValidation/node_modules/autorest/dist/app.js --v3 --spectral --azure-validator --semantic-validator=false --model-validator=false --message-format=json --openapi-type=arm --openapi-subtype=arm  --use=@microsoft.azure/openapi-validator@2.2.0 --tag=package-preview-2024-04 /mnt/vss/_work/1/azure-rest-api-specs/specification/machinelearningservices/resource-manager/readme.md
2024-03-20T07:03:12.5332687Z Execution of AutoRest with LintDiff done. Error is not null: false, stdout contains AutoRest 'error': false, stdout contains AutoRest 'fatal': false, stderr contains AutoRest 'error': false, stderr contains AutoRest 'fatal': false
2024-03-20T07:03:12.5337424Z RETURN definition momentOfTruth.executeAutoRestWithLintDiff.

(...)

2024-03-20T07:03:13.2762900Z Processing AutoRest config README: specification/machinelearningservices/resource-manager/readme.md. beforeOrAfter: before.
2024-03-20T07:03:13.2763769Z ENTER definition momentOfTruth.executeAutoRestWithLintDiff. autoRestConfigReadmePath: /mnt/vss/_work/1/lint-c93b354fd9c14905bb574a8834c4d69b/specification/machinelearningservices/resource-manager/readme.md, tag: package-preview-2024-04, beforeOrAfter: before, openapiType: undefined, rpaasLint: false
2024-03-20T07:03:13.2765199Z Executing AutoRest with LintDiff: node /mnt/vss/_work/_tasks/AzureApiValidation_5654d05d-82c1-48da-ad8f-161b817f6d41/0.0.90/private/azure-swagger-validation/azureSwaggerValidation/node_modules/autorest/dist/app.js --v3 --spectral --azure-validator --semantic-validator=false --model-validator=false --message-format=json --openapi-type=arm --openapi-subtype=arm  --use=@microsoft.azure/openapi-validator@2.2.0 --tag=package-preview-2024-04 /mnt/vss/_work/1/lint-c93b354fd9c14905bb574a8834c4d69b/specification/machinelearningservices/resource-manager/readme.md
2024-03-20T07:06:54.4645508Z ##[error]The Operation will be canceled. The next steps may not contain expected logs.
2024-03-20T07:06:59.4796537Z ##[error]The operation was canceled.
2024-03-20T07:06:59.4800092Z ##[section]Finishing: LintDiff

So from the mandatory 2 LintDiff runs, the first one, after, ran for about 4 hours and 56 minutes.

PR:

Pull Request 537787: Updated LintDiff.yml: increase timeout from 300 min (5h) to 1440 min (24h)

Update 3/21/2024:

After the increased timeout LintDiff finished after 10h 54min:

https://dev.azure.com/azure-sdk/internal/_build/results?buildId=3621689&view=logs&j=0574a2a6-2d0a-5ec6-40e4-4c6e2f70bea2&t=80c3e782-49f0-5d1c-70dd-cbee57bdd0c7

konrad-jamrozik · 2024-04-02T19:54:48Z

Pull Request 541101: Updated LintRPaaS.yml: set timeout to 1440 minutes (24h)

konrad-jamrozik added Central-EngSys This issue is owned by the Engineering System team. Spec PR Tools Tooling that runs in azure-rest-api-specs repo. labels Dec 14, 2023

konrad-jamrozik self-assigned this Dec 14, 2023

github-project-automation bot added this to Spec PR Tools and Azure SDK EngSys 🚢🎉 Dec 14, 2023

github-project-automation bot moved this to 🤔 Triage in Azure SDK EngSys 🚢🎉 Dec 14, 2023

konrad-jamrozik moved this from 🤔 Triage to 📋 Backlog in Azure SDK EngSys 🚢🎉 Dec 14, 2023

konrad-jamrozik moved this to 📋 Backlog in Spec PR Tools Dec 14, 2023

This was referenced Feb 22, 2024

Aggregate issue for LintDiff reliability improvements #7731

Open

LintDiff is apparently hanging and timing out after 1 hour #7780

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[LintDiff][PR Workflow] Handle cases where check result is too large to be processed by our infrastructure #7457

[LintDiff][PR Workflow] Handle cases where check result is too large to be processed by our infrastructure #7457

konrad-jamrozik commented Dec 14, 2023 •

edited

Loading

konrad-jamrozik commented Mar 4, 2024 •

edited

Loading

konrad-jamrozik commented Mar 12, 2024 •

edited

Loading

konrad-jamrozik commented Mar 18, 2024 •

edited

Loading

konrad-jamrozik commented Mar 18, 2024 •

edited

Loading

konrad-jamrozik commented Mar 18, 2024

konrad-jamrozik commented Mar 21, 2024 •

edited

Loading

konrad-jamrozik commented Apr 2, 2024

[LintDiff][PR Workflow] Handle cases where check result is too large to be processed by our infrastructure #7457

[LintDiff][PR Workflow] Handle cases where check result is too large to be processed by our infrastructure #7457

Comments

konrad-jamrozik commented Dec 14, 2023 • edited Loading

Technical details

Hotfix applied

How we found the root-cause

konrad-jamrozik commented Mar 4, 2024 • edited Loading

konrad-jamrozik commented Mar 12, 2024 • edited Loading

konrad-jamrozik commented Mar 18, 2024 • edited Loading

konrad-jamrozik commented Mar 18, 2024 • edited Loading

konrad-jamrozik commented Mar 18, 2024

konrad-jamrozik commented Mar 21, 2024 • edited Loading

konrad-jamrozik commented Apr 2, 2024

konrad-jamrozik commented Dec 14, 2023 •

edited

Loading

konrad-jamrozik commented Mar 4, 2024 •

edited

Loading

konrad-jamrozik commented Mar 12, 2024 •

edited

Loading

konrad-jamrozik commented Mar 18, 2024 •

edited

Loading

konrad-jamrozik commented Mar 18, 2024 •

edited

Loading

konrad-jamrozik commented Mar 21, 2024 •

edited

Loading