initialize-semantic-issue-similarity-analysis-and-duplicate-detection-automation-workflow #1175

aniket866 wants to merge 4 commits into AOSSIE-Org:main
Conversation
📝 Walkthrough

Adds a new GitHub Actions workflow that runs when issues are opened, encodes issue text with all-MiniLM-L6-v2, computes cosine similarities against existing issues, writes the top matches to JSON, comments potential duplicates on the issue, and applies a "duplicate" label (with permission-safe operations).
Sequence Diagram

```mermaid
sequenceDiagram
    actor User as User (opens issue)
    participant GHA as GitHub Actions
    participant API as GitHub API
    participant Script as Python script
    participant ML as all-MiniLM-L6-v2
    User->>GHA: Issue opened event
    GHA->>API: Fetch all issues (open & closed)
    API-->>GHA: Issues list
    GHA->>Script: Save issues.json + current issue
    Script->>ML: Encode issue texts
    ML-->>Script: Embeddings
    Script->>Script: Compute cosine similarities (threshold 0.82, top 3)
    Script->>GHA: Write matches.json
    GHA->>API: Post comment with matches (if any)
    GHA->>API: Add "duplicate" label (guarded)
    API-->>GHA: Responses (or permission notices)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 6 checks passed.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In @.github/workflows/duplicate_issue_detector.yaml:
- Around lines 146-153: The workflow auto-applies a strong `duplicate` label via `github.rest.issues.addLabels` inside the `safe(() => ...)` call, which conflicts with the "suggest-only" requirement and also fails with a 404 if the label doesn't exist. Either skip auto-labeling entirely or use a softer label like `possible-duplicate`, and ensure the label exists before adding it (use `github.rest.issues.getLabel` or `issues.createLabel`) or handle the 404 explicitly so the operation isn't silently ignored. Update the call site where `addLabels(...)` is invoked and adjust the surrounding `safe(...)` logic accordingly (a github-script sketch follows after this list).
- Around lines 86-105: The matches list contains numpy scalars from `cosine_similarity` (`sims`), which cause `json.dump` to fail. When building each match in the loop in this block (variables: `current_vec`, `other_vecs`, `sims`, `matches`, `THRESHOLD`, `MAX_RESULTS`), convert the `numpy.float64` score to a native Python float (e.g., via `float(score)` or `score.item()`) before rounding and storing it in the dict, so `json.dump` can serialize `matches` to matches.json without error (see the Python sketch after this list).
- Around lines 31-39: The current `github.paginate` call stores mixed issues and PRs in the `issues` variable via `github.rest.issues.listForRepo`. Filter out pull requests before the similarity comparison by removing any item that has a `pull_request` field (e.g., keep only items where `!issue.pull_request`), or check `issue.pull_request !== undefined` during the mapping step that builds the list for duplicate detection, so only true issues are compared (a filtering sketch follows after this list).
- Line 25: Replace the outdated action reference `uses: actions/github-script@v6` with `uses: actions/github-script@v7` wherever it appears (the occurrence shown and the second occurrence referenced in the comment), ensuring both instances are updated to v7 to satisfy actionlint and current runners; no other changes are needed.
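
For the labeling fix, a minimal github-script sketch is below. It assumes the workflow's existing `safe()` wrapper and the standard `github`/`context` objects; the `possible-duplicate` name, color, and description are illustrative, not taken from this PR.

```js
// Use a softer label and create it if missing, so addLabels cannot 404.
const LABEL = 'possible-duplicate'; // illustrative label name

await safe(async () => {
  try {
    await github.rest.issues.getLabel({
      owner: context.repo.owner,
      repo: context.repo.repo,
      name: LABEL,
    });
  } catch (e) {
    if (e.status !== 404) throw e; // only swallow "label not found"
    // Label is missing: create it once so the addLabels call below succeeds.
    await github.rest.issues.createLabel({
      owner: context.repo.owner,
      repo: context.repo.repo,
      name: LABEL,
      color: 'fbca04', // illustrative
      description: 'Flagged as a likely duplicate by the similarity workflow',
    });
  }
  await github.rest.issues.addLabels({
    owner: context.repo.owner,
    repo: context.repo.repo,
    issue_number: context.issue.number,
    labels: [LABEL],
  });
});
```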
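For the serialization fix, a sketch of the match-building loop is below. `sims` and `other_issues` are assumed to come from the earlier encoding step; the threshold and result cap are taken from the walkthrough, and the per-issue field names are assumptions.

```python
import json

THRESHOLD = 0.82   # similarity cutoff used by the workflow
MAX_RESULTS = 3    # top matches to report

matches = []
for issue, score in sorted(zip(other_issues, sims), key=lambda p: p[1], reverse=True):
    if score < THRESHOLD or len(matches) >= MAX_RESULTS:
        break  # scores are sorted descending, so nothing later can qualify
    matches.append({
        "number": issue["number"],        # field names are assumptions
        "title": issue["title"],
        "score": round(float(score), 3),  # native float, JSON-serializable
    })

with open("matches.json", "w") as f:
    json.dump(matches, f, indent=2)
```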
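For the PR-filtering fix, a github-script sketch is below. Excluding the just-opened issue itself is an added assumption about how the comparison list should be built.

```js
// Items returned by the issues API that are really PRs carry a
// `pull_request` key; drop them (and the current issue) before comparing.
const allItems = await github.paginate(github.rest.issues.listForRepo, {
  owner: context.repo.owner,
  repo: context.repo.repo,
  state: 'all',
  per_page: 100,
});

const issues = allItems.filter(
  (item) => !item.pull_request && item.number !== context.issue.number
);
```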
🧹 Nitpick comments (2)
.github/workflows/duplicate_issue_detector.yaml (2)
20-22: Performance and rate-limit concern for large repositories.

This workflow fetches all issues (open and closed) on every new issue, then downloads a ~90 MB sentence-transformer model, encodes every issue, and computes pairwise similarity. For repositories with thousands of issues, this will be slow and may hit GitHub API rate limits or Actions time limits.

Consider:

- Capping the number of issues fetched (e.g., the last 500 via `sort: 'created'`, `direction: 'desc'`).
- Caching the Python dependencies and model using `actions/cache` to avoid re-downloading ~90 MB+ on every run.

Also applies to: 31-39
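
If caching is adopted, a minimal `actions/cache` step might look like the sketch below. The paths and key are assumptions rather than anything in this PR; sentence-transformers stores downloaded model weights under `~/.cache` by default.

```yaml
- name: Cache pip packages and model weights
  uses: actions/cache@v4
  with:
    path: |
      ~/.cache/pip
      ~/.cache/huggingface
    key: ${{ runner.os }}-duplicate-detector-v1
    restore-keys: |
      ${{ runner.os }}-duplicate-detector-
```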
127-131: `safe()` wrapper silently swallows all errors, not just permission errors.

The catch block discards every error type (network failures, malformed responses, bugs) and logs only "Skipped write action due to permissions". This makes debugging difficult. Consider logging `e.message` or at least distinguishing permission errors from unexpected failures.

Proposed fix:

```diff
-const safe = async (fn) => {
-  try { await fn(); } catch {
-    core.notice('Skipped write action due to permissions');
+const safe = async (fn) => {
+  try { await fn(); } catch (e) {
+    core.warning(`Write action failed: ${e.message}`);
   }
 };
```
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @.github/workflows/duplicate_issue_detector.yaml:
- Around lines 120-125: Other-issue titles are interpolated unsanitized into the generated comment (in the `matches.map` block building `list`), which can break or inject markdown. Escape or sanitize `m.title` before interpolation (create and call an `escapeMarkdown` helper or reuse an existing sanitizer) and use the escaped title in the template string that constructs `list`, so characters like backticks, brackets, angle-brackets, asterisks, underscores, and HTML tags are neutralized (see the sketch after this list).
- Around lines 86-89: Guard against an empty `other_vecs` before calling `cosine_similarity`: if `other_vecs` (from `embeddings[1:]`) is empty, skip the `cosine_similarity` call (e.g., set `sims` to an empty list or handle the no-comparisons case); otherwise compute `sims = cosine_similarity([current_vec], other_vecs)[0]` (a Python sketch follows after this list).
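
For the title-escaping fix, a minimal sketch of such a helper is below, assuming the comment body is built from `matches` entries with `number`, `title`, and `score` fields; the escaped character set follows the ones named above and is not exhaustive.

```js
// Escape markdown-significant characters so titles render as plain text.
const escapeMarkdown = (s) => s.replace(/[\\`*_[\]<>|#]/g, (c) => `\\${c}`);

const list = matches
  .map((m) => `- #${m.number}: ${escapeMarkdown(m.title)} (score: ${m.score})`)
  .join('\n');
```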
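For the empty-input guard, a minimal sketch is below, assuming `embeddings` is the array returned by the model with the current issue's vector first:

```python
from sklearn.metrics.pairwise import cosine_similarity

current_vec = embeddings[0]
other_vecs = embeddings[1:]

if len(other_vecs) == 0:
    sims = []  # no existing issues to compare against
else:
    sims = cosine_similarity([current_vec], other_vecs)[0]
```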
🧹 Nitpick comments (3)
.github/workflows/duplicate_issue_detector.yaml (3)
20-22: Pin dependency versions and consider caching pip packages.

`sentence-transformers` and `scikit-learn` are installed without version constraints. A breaking release or supply-chain compromise could silently break or subvert this workflow. Pin to known-good versions (e.g., `sentence-transformers==2.x.x scikit-learn==1.x.x`). Also consider adding `actions/cache` for the pip cache directory to avoid re-downloading ~400 MB of packages + model weights on every issue open.
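
A pinned install step might look like this sketch; the exact versions are illustrative, so pin to whatever has actually been tested:

```yaml
- name: Install Python dependencies
  run: pip install "sentence-transformers==2.7.0" "scikit-learn==1.4.2"
```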
31-39: Scalability concern: fetching and encoding all issues is expensive for active repos.

`state: 'all'` paginates every issue ever filed. For repos with thousands of issues, this means many API pages fetched and thousands of texts encoded by the ML model on every single issue-open event. This could easily take 5–10+ minutes and consume significant runner resources. Consider limiting to the most recent N issues (e.g., the last 500) by using `sort: 'created'`, `direction: 'desc'` and capping the pagination, or caching embeddings of existing issues between runs.
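
A capped fetch could look like the sketch below, using `github.paginate.iterator` to stop after the most recent issues; the 500 cap is illustrative.

```js
const MAX_ISSUES = 500; // illustrative cap

const recent = [];
for await (const response of github.paginate.iterator(
  github.rest.issues.listForRepo,
  {
    owner: context.repo.owner,
    repo: context.repo.repo,
    state: 'all',
    sort: 'created',
    direction: 'desc',
    per_page: 100,
  }
)) {
  recent.push(...response.data);
  if (recent.length >= MAX_ISSUES) break; // stop paginating early
}
```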
10-12: Consider adding concurrency control and a job timeout.

If several issues are opened in quick succession, each triggers an independent workflow run that downloads ~400 MB of model weights and performs expensive encoding. Adding a `concurrency` group and a `timeout-minutes` limit would prevent resource waste and runaway jobs.

Suggested addition:

```diff
 jobs:
   detect-duplicates:
     runs-on: ubuntu-latest
+    timeout-minutes: 10
+    concurrency:
+      group: duplicate-detection
+      cancel-in-progress: true
```
Hi @rahulharpal1603, I have done all the testing and it is working well. Testing: in my fork I created an issue, and this issue was compared against your original repo (PictoPy) and marked as a duplicate. It takes 2-3 minutes to analyze since it uses a GitHub bot; CodeRabbit takes about the same time. Here is the video; it is long, so keep forwarding and check the result at the end. Thank you!
Addressed Issues:
Closes #1110
See issue #1110 for more details.
Checklist
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.
@rahulharpal1603 Please check this out.