From 1f72b3e78ef34b83492980f637004ccf29852012 Mon Sep 17 00:00:00 2001 From: Stephen Toub Date: Tue, 10 Feb 2026 11:10:25 -0500 Subject: [PATCH 1/2] Further improvements to code-review skill - Try to remove bias from PR/issue comments by prompting it to first form its own opinions based only on the code and only then factor in and evaluate the PR/issue comments - Urge it to have a more aggressive / skeptical mindset - Explicitly give it the option of saying it needs human review to reduce desire to just say LGTM - Delete the CCR code-review-instructions that seem to be confusing it and just have those refer back to the skill --- .github/code-review-instructions.md | 127 +--------------------------- .github/skills/code-review/SKILL.md | 91 +++++++++++++------- 2 files changed, 60 insertions(+), 158 deletions(-) diff --git a/.github/code-review-instructions.md b/.github/code-review-instructions.md index 305ad8b7bed9df..b51d4025c9f45c 100644 --- a/.github/code-review-instructions.md +++ b/.github/code-review-instructions.md @@ -4,129 +4,4 @@ excludeAgent: coding-agent # Code Review Instructions for dotnet/runtime -These instructions guide code reviews for the dotnet/runtime repository. A compiler runs on every PR (as do a wealth of static analyzers for C# changes), so focus on higher-level concerns that require expert judgment rather than stylistic or syntactic issues. - -## Review Priorities - -### 1. Security - -**Critical Security Concerns:** -- **Input Validation & Sanitization**: Ensure all external inputs (user data, file I/O, network data) are properly validated and sanitized before use -- **Injection Vulnerabilities**: Check for potential SQL injection, command injection, path traversal, or code injection risks -- **Cryptographic Operations**: Verify proper use of cryptographic APIs, secure random number generation, and correct handling of keys/certificates -- **Buffer Overflows**: In native code, check for proper bounds checking and safe memory operations -- **Authentication & Authorization**: Ensure proper access control checks are in place and cannot be bypassed -- **Information Disclosure**: Watch for accidental logging or exposure of sensitive data (credentials, keys, PII) -- **Denial of Service**: Check for potential infinite loops, resource exhaustion, or algorithmic complexity attacks -- **Deserialization**: Ensure safe deserialization practices, especially with untrusted data -- **Race Conditions**: Identify potential TOCTOU (time-of-check-time-of-use) vulnerabilities in security-sensitive operations - -### 2. Performance - -**Performance Considerations:** -- **Algorithmic Complexity**: Identify inefficient algorithms (O(n²) where O(n) is possible, unnecessary allocations) -- **Memory Allocations**: Watch for excessive allocations in hot paths, consider stack allocation (stackalloc) or object pooling where appropriate -- **Boxing**: Identify unnecessary boxing of value types -- **String Operations**: Check for string concatenation in loops (use StringBuilder), excessive string allocations -- **LINQ Performance**: Evaluate LINQ usage in performance-critical paths; consider more direct alternatives -- **Async/Await Overhead**: Ensure async/await is used appropriately (not for trivial synchronous operations) -- **Collection Choices**: Verify appropriate collection types for access patterns (List vs. HashSet vs. 
Dictionary)
-- **Lazy Initialization**: Check for opportunities to defer expensive operations
-- **Caching**: Identify repeated expensive computations that could be cached
-- **Span<T> and Memory<T>**: Encourage use of modern memory-efficient types for buffer manipulation
-- **Native Interop**: Ensure P/Invoke calls are efficient and properly marshaled
-
-### 3. Backwards Compatibility
-
-**Compatibility Requirements:**
-- **Public API Changes**: Any change to public APIs requires careful scrutiny
-  - Breaking changes are generally not acceptable
-  - New optional parameters, overloads, and interface implementations need careful consideration
-  - Verify that API additions follow existing patterns and naming conventions
-- **Serialization Compatibility**: Ensure changes don't break serialization/deserialization of persisted data
-- **Configuration Changes**: Changes to configuration format or behavior must maintain backwards compatibility
-- **Binary Compatibility**: IL-level changes must preserve binary compatibility for existing assemblies
-- **Behavioral Changes**: Even non-breaking API changes can break consumers if behavior changes unexpectedly
-  - Document behavioral changes clearly
-  - Consider feature flags or opt-in mechanisms for significant behavior changes
-- **Obsolete APIs**: Check that proper obsolescence process is followed (attributes, documentation, migration path)
-
-### 4. Cross-Component Interactions
-
-**Integration Points:**
-- **Runtime/Library Boundaries**: Changes affecting CoreCLR, Mono, or NativeAOT must work across all runtimes
-- **Platform Differences**: Ensure changes work correctly across Windows, Linux, macOS, and other supported platforms
-- **Architecture Considerations**: Verify behavior is correct on x86, x64, ARM32, and ARM64
-- **Dependencies**: Changes to core libraries may impact higher-level libraries; consider cascading effects
-- **Threading Models**: Ensure thread-safety is maintained and synchronization primitives are used correctly
-- **Lifecycle Management**: Verify proper initialization, disposal patterns, and cleanup across component boundaries
-- **Shared State**: Carefully review any shared mutable state for thread-safety and consistency
-- **Error Handling**: Ensure exceptions and error codes are properly propagated across component boundaries
-
-### 5. Correctness and Edge Cases
-
-**Code Correctness:**
-- **Null Handling**: While the compiler enforces nullable reference types, verify runtime null checks are appropriate
-- **Boundary Conditions**: Test for off-by-one errors, empty collections, null inputs, maximum values
-- **Error Paths**: Ensure error handling is correct and complete; resources are properly cleaned up
-- **Concurrency**: Identify race conditions, deadlocks, or improper synchronization
-- **Exception Safety**: Verify operations maintain invariants even when exceptions occur
-- **Resource Management**: Ensure IDisposable is implemented correctly and resources are not leaked
-- **Numeric Overflow**: Check for potential integer overflow/underflow in calculations
-- **Culture/Locale Issues**: Verify culture-invariant operations where appropriate, proper localization otherwise
-- **Time Handling**: Check for timezone, daylight saving, and leap second handling issues
-
-### 6. 
Design and Architecture - -**Design Quality:** -- **API Design**: Ensure new APIs are intuitive, follow framework design guidelines, and are hard to misuse -- **Abstraction Level**: Verify abstractions are at the appropriate level and don't leak implementation details -- **Separation of Concerns**: Check that responsibilities are properly separated -- **Extensibility**: Consider whether the design allows for future extension without breaking changes -- **SOLID Principles**: Evaluate adherence to single responsibility, open/closed, and other design principles -- **Code Duplication**: Identify opportunities to reduce duplication while maintaining clarity -- **Testability**: Ensure the code is designed to be testable (proper dependency injection, separation of concerns) - -### 7. Testing - -**Test Quality:** -- **Coverage**: Ensure new functionality has appropriate test coverage -- **Test Scenarios**: Verify tests cover happy paths, error paths, and edge cases -- **Test Reliability**: Watch for flaky tests (timing dependencies, environmental assumptions) -- **Test Performance**: Ensure tests run efficiently and don't unnecessarily slow down CI -- **Platform-Specific Tests**: Verify platform-specific tests are properly conditioned -- **Regression Tests**: Check that bugs being fixed have corresponding regression tests - -### 8. Documentation and Code Clarity - -**Documentation:** -- **XML Documentation**: New public APIs must have clear XML documentation explaining purpose, parameters, return values, and exceptions. Do not comment on existing APIs that lack documentation. -- **Complex Logic**: Comments should explain the "why" behind non-obvious decisions, not restate what the code does -- **TODOs and FIXMEs**: Ensure they are tracked with issues and are appropriate for the change -- **Breaking Changes**: Must be clearly documented with migration guidance - -## What NOT to Focus On - -The following are handled by automated tooling and don't need review comments: - -- Code formatting and style (handled by `.editorconfig` and analyzers) -- Naming convention violations (handled by analyzers) -- Missing using directives (handled by compiler) -- Most syntax errors (handled by compiler) -- Simple code style preferences without technical merit - -## Review Approach - -1. **Understand the Context**: Read the PR description and linked issues to understand the goal. Consider as much relevant code from the containing project as possible. For public APIs, review any code in the repo that consumes the method. -2. **Assess the Scope**: Verify the change is focused and not mixing unrelated concerns -3. **Evaluate Risk**: Consider the risk level based on what components are affected and how widely used they are -4. **Think Like an Attacker**: For security-sensitive code, consider how it might be exploited -5. **Think Like a Consumer**: Consider how the API will be used and potentially misused -6. **Consider Maintenance**: Think about long-term maintenance burden and technical debt - -## Severity Guidelines - -- **Critical**: Security vulnerabilities, data corruption, crashes, breaking changes -- **High**: Performance regressions, resource leaks, incorrect behavior in common scenarios -- **Medium**: Edge case bugs, suboptimal design, missing documentation -- **Low**: Code clarity issues, minor inefficiencies, nice-to-have improvements +Follow the review process, priorities, conventions, and output format defined in the [code-review skill](/.github/skills/code-review/SKILL.md). 
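For a concrete instance of the performance guidance deleted above, here is a minimal hypothetical C# sketch (invented for illustration; nothing below appears in either file) of string concatenation in a loop and the `StringBuilder` rewrite a reviewer would suggest:

```csharp
using System;
using System.Text;

class ConcatReviewExample
{
    // The pattern the deleted guidance flags: each '+=' copies the entire
    // accumulated string, so building a result from n parts does O(n^2) work.
    static string JoinSlow(string[] parts)
    {
        string result = "";
        foreach (string part in parts)
        {
            result += part;
        }
        return result;
    }

    // The fix a reviewer would suggest: one growable buffer, amortized O(n).
    static string JoinFast(string[] parts)
    {
        var sb = new StringBuilder();
        foreach (string part in parts)
        {
            sb.Append(part);
        }
        return sb.ToString();
    }

    static void Main()
    {
        string[] parts = { "alpha", "beta", "gamma" };
        Console.WriteLine(JoinSlow(parts)); // alphabetagamma
        Console.WriteLine(JoinFast(parts)); // alphabetagamma
    }
}
```

The same shape of reasoning applies to the boxing, LINQ, and `Span<T>` bullets: the reviewer is looking for the allocation pattern, not for a syntax issue a compiler or analyzer would already catch.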
diff --git a/.github/skills/code-review/SKILL.md b/.github/skills/code-review/SKILL.md
index f0b38ad2db68a8..9b5c5e4b237633 100644
--- a/.github/skills/code-review/SKILL.md
+++ b/.github/skills/code-review/SKILL.md
@@ -7,6 +7,8 @@ description: Review code changes in dotnet/runtime for correctness, performance,
 
 Review code changes against conventions and patterns established by dotnet/runtime maintainers. These rules were extracted from 43,000+ maintainer review comments across 6,600+ PRs and represent the actual standards enforced in practice.
 
+**Reviewer mindset:** Be polite but very skeptical. Your job is to help speed the review process for maintainers, which includes not only finding problems the PR author may have missed but also questioning the value of the PR in its entirety. Treat the PR description and linked issues as claims to verify, not facts to accept. Question the stated direction, probe edge cases, and don't hesitate to flag concerns even when unsure.
+
 ## When to Use This Skill
 
 Use this skill when:
@@ -17,41 +19,58 @@ Use this skill when:
 
 ## Review Process
 
-### Step 0: Gather Context
+### Step 0: Gather Code Context (No PR Narrative Yet)
+
+Before analyzing anything, collect as much relevant **code** context as you can. **Critically, do NOT read the PR description, linked issues, or existing review comments yet.** You must form your own independent assessment of what the code does, why it might be needed, what problems it has, and whether the approach is sound — before being exposed to the author's framing. Reading the author's narrative first anchors your judgment and makes you less likely to find real problems.
+
+1. **Diff and file list**: Fetch the full diff and the list of changed files.
+2. **Full source files**: For every changed file, read the **entire source file** (not just the diff hunks). You need the surrounding code to understand invariants, locking protocols, call patterns, and data flow. Diff-only review is the #1 cause of false positives and missed issues.
+3. **Consumers and callers**: If the change modifies a public/internal API, a type that others depend on, or a virtual/interface method, search for how consumers use the functionality. Grep for callers, usages, and test sites. Understanding how the code is consumed reveals whether the change could break existing behavior or violate caller assumptions.
+4. **Sibling types and related code**: If the change fixes a bug or adds a pattern in one type, check whether sibling types (e.g., other abstraction implementations, other collection types, platform-specific variants) have the same issue or need the same fix. Fetch and read those files too.
+5. **Key utility/helper files**: If the diff calls into shared utilities, read those to understand the contracts (thread-safety, idempotency, etc.).
+6. **Git history**: Check recent commits to the changed files (`git log --oneline -20 -- <file>`). Look for related recent changes, reverts, or prior attempts to fix the same problem. This reveals whether the area is actively churning, whether a similar fix was tried and reverted, or whether the current change conflicts with recent work.
+
+### Step 1: Form an Independent Assessment
+
+Based **only** on the code context gathered above (without the PR description or issue), answer these questions:
+
+1. **What does this change actually do?** Describe the behavioral change in your own words by reading the diff and surrounding code. What was the old behavior? What is the new behavior?
+2. 
**Why might this change be needed?** Infer the motivation from the code itself. What bug, gap, or improvement does it appear to address?
+3. **Is this the right approach?** Would a simpler alternative be more consistent with the codebase? Could the goal be achieved with existing functionality? Are there correctness, performance, or safety concerns?
+4. **What problems do you see?** Identify bugs, edge cases, missing validation, thread-safety issues, performance regressions, API design problems, test gaps, and anything else that concerns you.
 
-Before analyzing anything, collect as much relevant context as you can. **Do not review from the diff alone.** The items below are the minimum — go beyond them whenever additional context would improve the review. More context leads to better reviews; err on the side of gathering too much rather than too little.
+Write down your independent assessment before proceeding. You must produce a holistic assessment (see [Holistic PR Assessment](#holistic-pr-assessment)) at this stage.
+
+### Step 2: Incorporate PR Narrative and Reconcile
+
+Now read the PR description, labels, linked issues (in full), author information, existing review comments, and any related open issues in the same area. Treat all of this as **claims to verify**, not facts to accept.
 
 1. **PR metadata**: Fetch the PR description, labels, linked issues, and author. Read linked issues in full — they often contain the repro, root cause analysis, and constraints the fix must satisfy.
-2. **Diff and file list**: Fetch the full diff and the list of changed files.
-3. **Full source files**: For every changed file, read the **entire source file** (not just the diff hunks). You need the surrounding code to understand invariants, locking protocols, call patterns, and data flow. Diff-only review is the #1 cause of false positives and missed issues.
-4. **Consumers and callers**: If the change modifies a public/internal API, a type that others depend on, or a virtual/interface method, search for how consumers use the functionality. Grep for callers, usages, and test sites. Understanding how the code is consumed reveals whether the change could break existing behavior or violate caller assumptions.
-5. **Sibling types and related code**: If the change fixes a bug or adds a pattern in one type, check whether sibling types (e.g., other abstraction implementations, other collection types, platform-specific variants) have the same issue or need the same fix. Fetch and read those files too.
-6. **Key utility/helper files**: If the diff calls into shared utilities, read those to understand the contracts (thread-safety, idempotency, etc.).
-7. **Git history**: Check recent commits to the changed files (`git log --oneline -20 -- <file>`). Look for related recent changes, reverts, or prior attempts to fix the same problem. This reveals whether the area is actively churning, whether a similar fix was tried and reverted, or whether the current change conflicts with recent work.
-8. **Related issues**: Search for other open issues in the same area (same labels, same component). This can reveal known problems the PR should also address, or constraints the author may not be aware of.
-9. **Existing review comments**: Check if there are already review comments on the PR to avoid duplicating feedback.
-
-### Step 1: Analyze
-
-1. **Assess the PR holistically first.** Before reviewing individual lines, evaluate whether the PR is justified, takes the right approach, and will be a net positive. 
See the [Holistic PR Assessment](#holistic-pr-assessment) section below. -2. **Focus on what matters.** Prioritize bugs, performance regressions, safety issues, race conditions, resource management problems, incorrect assumptions about data or state, and API design problems. Do not comment on trivial style issues unless they violate an explicit rule below. -3. **Consider collateral damage.** For every changed code path, actively brainstorm: what other scenarios, callers, or inputs flow through this code? Could any of them break or behave differently after this change? If you identify any plausible risk — even one you can't fully confirm — surface it so the author can evaluate. Do not dismiss behavioral changes because you believe the fix justifies them. The tradeoff is the author's decision — your job is to make it visible. -4. **Be specific and actionable.** Every comment should tell the author exactly what to change and why. Reference the relevant convention. Include evidence of how you verified the issue is real, e.g., "looked at all callers and none of them validate this parameter". -5. **Flag severity clearly:** +2. **Related issues**: Search for other open issues in the same area (same labels, same component). This can reveal known problems the PR should also address, or constraints the author may not be aware of. +3. **Existing review comments**: Check if there are already review comments on the PR to avoid duplicating feedback. +4. **Reconcile your assessment with the author's claims.** Where your independent reading of the code disagrees with the PR description or issue, investigate further — but do not simply defer to the author's framing. If the PR claims a bug fix, a performance improvement, or a behavioral correction, verify those claims against the code and any provided evidence. If your independent assessment found problems the PR narrative doesn't acknowledge, those problems are more likely to be real, not less. +5. **Update your holistic assessment** if the additional context reveals information that genuinely changes your evaluation (e.g., a linked issue proves the bug is real, or an existing review comment already identified the same concern). But do not soften findings just because the PR description sounds reasonable. + +### Step 3: Detailed Analysis + +1. **Focus on what matters.** Prioritize bugs, performance regressions, safety issues, race conditions, resource management problems, incorrect assumptions about data or state, and API design problems. Do not comment on trivial style issues unless they violate an explicit rule below. +2. **Consider collateral damage.** For every changed code path, actively brainstorm: what other scenarios, callers, or inputs flow through this code? Could any of them break or behave differently after this change? If you identify any plausible risk — even one you can't fully confirm — surface it so the author can evaluate. Do not dismiss behavioral changes because you believe the fix justifies them. The tradeoff is the author's decision — your job is to make it visible. +3. **Be specific and actionable.** Every comment should tell the author exactly what to change and why. Reference the relevant convention. Include evidence of how you verified the issue is real, e.g., "looked at all callers and none of them validate this parameter". +4. **Flag severity clearly:** - ❌ **error** — Must fix before merge. Bugs, security issues, API violations, test gaps for behavior changes. - ⚠️ **warning** — Should fix. 
Performance issues, missing validation, inconsistency with established patterns. - 💡 **suggestion** — Consider changing. Style improvements, minor readability wins, optional optimizations. -6. **Don't pile on.** If the same issue appears many times, flag it once on the primary file with a note listing all affected files. Do not leave separate comments for each occurrence. -7. **Respect existing style.** When modifying existing files, the file's current style takes precedence over general guidelines. -8. **Don't flag what CI catches.** Do not flag issues that a linter, typechecker, compiler, analyzer, or CI build step would catch, e.g., missing usings, unsupported syntax, formatting. Assume CI will run separately. -9. **Avoid false positives.** Before flagging any issue: +5. **Don't pile on.** If the same issue appears many times, flag it once on the primary file with a note listing all affected files. Do not leave separate comments for each occurrence. +6. **Respect existing style.** When modifying existing files, the file's current style takes precedence over general guidelines. +7. **Don't flag what CI catches.** Do not flag issues that a linter, typechecker, compiler, analyzer, or CI build step would catch, e.g., missing usings, unsupported syntax, formatting. Assume CI will run separately. +8. **Avoid false positives.** Before flagging any issue: - **Verify the concern actually applies** given the full context, not just the diff. Open the surrounding code to check. Confirm the issue isn't already handled by a caller, callee, or wrapper layer before claiming something is missing. - **Skip theoretical concerns with negligible real-world probability.** "Could happen" is not the same as "will happen." - **If you're unsure, either investigate further until you're confident, or surface it explicitly as a low-confidence question rather than a firm claim.** Do not speculate about issues you have no concrete basis for. Every comment should be worth the reader's time. - **Trust the author's context.** The author knows their codebase. If a pattern seems odd but is consistent with the repo, assume it's intentional. - **Never assert that something "does not exist," "is deprecated," or "is unavailable" based on training data alone.** Your knowledge has a cutoff date. When uncertain, ask rather than assert. -10. **Ensure code suggestions are valid.** Any code you suggest must be syntactically correct and complete. Ensure any suggestion would result in working code. -11. **Label in-scope vs. follow-up.** Distinguish between issues the PR should fix and out-of-scope improvements. Be explicit when a suggestion is a follow-up rather than a blocker. +9. **Ensure code suggestions are valid.** Any code you suggest must be syntactically correct and complete. Ensure any suggestion would result in working code. +10. **Label in-scope vs. follow-up.** Distinguish between issues the PR should fix and out-of-scope improvements. Be explicit when a suggestion is a follow-up rather than a blocker. ## Multi-Model Review @@ -85,7 +104,7 @@ When presenting the final review (whether as a PR comment or as output to the us **Approach**: <1-2 sentences on whether the fix/change takes the right approach> -**Net positive/negative**: <1 sentence verdict on whether this is a net positive or negative for the codebase> +**Summary**: <✅ LGTM / ⚠️ Needs Human Review / ⚠️ Needs Changes / ❌ Reject>. <2-3 sentence summary of the overall verdict and key points. 
If "Needs Human Review," explicitly state which findings you are uncertain about and what a human reviewer should focus on.>
 
 ---
 
@@ -96,12 +115,6 @@ When presenting the final review (whether as a PR comment or as output to the us
 
 (Repeat for each finding category. Group related findings under a single heading.)
 
----
-
-### Summary
-
-**<LGTM / Needs Changes / Reject>.** <2-3 sentence summary of the overall verdict and key points.>
 ```
 
 ### Guidelines
@@ -114,9 +127,23 @@ When presenting the final review (whether as a PR comment or as output to the us
   - 💡 for minor suggestions or observations (nice-to-have)
 - **Cross-cutting analysis** should be included when relevant: check whether related code (sibling types, callers, other platforms) is affected by the same issue or needs a similar fix.
 - **Test quality** should be assessed as its own finding when tests are part of the PR.
-- **Summary** gives a clear verdict: LGTM (no blocking issues), Needs Changes (with blocking issues listed), or Reject (explaining why this should be closed outright).
+- **Summary** gives a clear verdict: LGTM (no blocking issues — use only when confident), Needs Human Review (code may be correct but you have unresolved concerns or uncertainty that require human judgment), Needs Changes (with blocking issues listed), or Reject (explaining why this should be closed outright). **Never give a blanket LGTM when you are unsure.** When in doubt, use "Needs Human Review" and explain what a human should focus on.
 - Keep the review concise but thorough. Every claim should be backed by evidence from the code.
 
+### Verdict Consistency Rules
+
+The summary verdict **must** be consistent with the findings in the body. Follow these rules:
+
+1. **The verdict must reflect your most severe finding.** If you have any ⚠️ findings, the verdict cannot be "LGTM." Use "Needs Human Review" or "Needs Changes" instead. Only use "LGTM" when all findings are ✅ or 💡 and you are confident the change is correct and complete.
+
+2. **When uncertain, always escalate to human review.** If you are unsure whether a concern is valid, whether the approach is sufficient, or whether you have enough context to judge, the verdict must be "Needs Human Review" — not LGTM. Your job is to surface concerns for human judgment, not to give approval when uncertain. A false LGTM is far worse than an unnecessary escalation.
+
+3. **Separate code correctness from approach completeness.** A change can be correct code that is an incomplete approach. If you believe the code is right for what it does but the approach is insufficient (e.g., treats symptoms without investigating root cause, silently masks errors that should be diagnosed, fixes one instance but not others), the verdict must reflect the gap — do not let "the code itself looks fine" collapse into LGTM.
+
+4. **Classify each ⚠️ and ❌ finding as merge-blocking or advisory.** Before writing your summary, decide for each finding: "Would I be comfortable if this merged as-is?" If any answer is "no," the verdict must be "Needs Changes." If any answer is "I'm not sure," the verdict must be "Needs Human Review."
+
+5. **Devil's advocate check before finalizing.** Re-read all your ⚠️ findings. For each one, ask: does this represent an unresolved concern about the approach, scope, or risk of masking deeper issues? If so, the verdict must reflect that tension. Do not default to optimism because the diff is small or the code is obviously correct at a syntactic level.
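Taken together, these rules amount to a small decision procedure. A hypothetical C# sketch of that escalation logic (the `Finding` and `Verdict` types are invented for illustration; the skill itself defines no code):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

enum Severity { Good, Suggestion, Warning, Error }

enum Verdict { Lgtm, NeedsHumanReview, NeedsChanges, Reject }

// MergeBlocking: true = blocks merge, false = advisory, null = reviewer unsure.
record Finding(Severity Severity, bool? MergeBlocking);

static class VerdictSelector
{
    // Encodes rules 1, 2, and 4: a known merge-blocking finding forces
    // "Needs Changes"; any warning or unresolved uncertainty escalates to
    // "Needs Human Review"; LGTM is reserved for confident reviews whose
    // findings are all ✅ or 💡. Reject is absent on purpose: it comes from
    // the holistic assessment of the PR's value, not from per-finding severity.
    public static Verdict Select(IReadOnlyList<Finding> findings, bool reviewerConfident)
    {
        if (findings.Any(f => f.MergeBlocking == true))
            return Verdict.NeedsChanges;

        if (!reviewerConfident ||
            findings.Any(f => f.Severity >= Severity.Warning || f.MergeBlocking is null))
            return Verdict.NeedsHumanReview;

        return Verdict.Lgtm;
    }
}

class Program
{
    static void Main()
    {
        var findings = new List<Finding>
        {
            new(Severity.Suggestion, MergeBlocking: false),
            new(Severity.Warning, MergeBlocking: null), // unsure, so escalate
        };

        // Prints NeedsHumanReview: the unresolved warning wins over confidence.
        Console.WriteLine(VerdictSelector.Select(findings, reviewerConfident: true));
    }
}
```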
+ --- ## Holistic PR Assessment From 006a1462a410aceaeaf5676dddd580aed3e5ac27 Mon Sep 17 00:00:00 2001 From: Stephen Toub Date: Tue, 10 Feb 2026 11:28:33 -0500 Subject: [PATCH 2/2] Update summary in Holistic Assessment section --- .github/skills/code-review/SKILL.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/skills/code-review/SKILL.md b/.github/skills/code-review/SKILL.md index 9b5c5e4b237633..7b3de1150d5a2c 100644 --- a/.github/skills/code-review/SKILL.md +++ b/.github/skills/code-review/SKILL.md @@ -119,7 +119,7 @@ When presenting the final review (whether as a PR comment or as output to the us ### Guidelines -- **Holistic Assessment** comes first and covers Motivation, Approach, and Net impact. +- **Holistic Assessment** comes first and covers Motivation, Approach, and Summary. - **Detailed Findings** uses emoji-prefixed category headers: - ✅ for things that are correct / look good (use to confirm important aspects were verified) - ⚠️ for warnings or impactful suggestions (should fix, or follow-up)