
initialize-semantic-issue-similarity-analysis-and-duplicate-detection-automation-workflow#1175

Open
aniket866 wants to merge 4 commits into AOSSIE-Org:main from aniket866:patch-1

Conversation

@aniket866 commented Feb 15, 2026

Addressed Issues:

Closes #1110

See issue #1110 for more details.

Checklist

  • My PR addresses a single issue, fixes a single bug or makes a single improvement.
  • My code follows the project's code style and conventions.
  • If applicable, I have made corresponding changes or additions to the documentation.
  • If applicable, I have made corresponding changes or additions to tests.
  • My changes generate no new warnings or errors.
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there.
  • I have read the Contribution Guidelines.
  • Once I submit my PR, CodeRabbit AI will automatically review it and I will address CodeRabbit's comments.

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

@rahulharpal1603 Please check this out.

Summary by CodeRabbit

  • New Features
    • Automated duplicate issue detection: when a new issue is opened, the system compares it to existing issues, posts a non-blocking comment listing top potential duplicates with confidence scores and links, and automatically applies a "duplicate" label. Operations degrade gracefully if comment or labeling permissions are restricted.

github-actions bot added the CI/CD label Feb 15, 2026
coderabbitai bot (Contributor) commented Feb 15, 2026

Warning

Rate limit exceeded

@aniket866 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 3 minutes and 21 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📝 Walkthrough

Walkthrough

Adds a new GitHub Actions workflow that runs when issues are opened, encodes issue text with all-MiniLM-L6-v2, computes cosine similarities against existing issues, writes top matches to JSON, comments potential duplicates on the issue, and applies a "duplicate" label (with permission-safe operations).

Changes

Cohort / File(s) Summary
GitHub Actions Workflow
.github/workflows/duplicate_issue_detector.yaml
New workflow that triggers on issue creation, installs Python deps (sentence-transformers, scikit-learn), collects repo issues into issues.json, runs a Python script to compute embeddings and cosine similarities (threshold 0.82, max 3), outputs matches.json, posts a non-blocking comment listing matches, and attempts to add a duplicate label with graceful failure handling.
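
For readers skimming the summary, the matching step the workflow's Python script performs can be sketched in plain Python. This is a minimal stand-in, not the actual script: it assumes toy numeric vectors in place of real all-MiniLM-L6-v2 embeddings and hand-rolls cosine similarity rather than calling scikit-learn; the names `top_matches` and `others` are illustrative.

```python
import json
import math

THRESHOLD = 0.82   # minimum cosine similarity to report (from the workflow summary)
MAX_RESULTS = 3    # at most three potential duplicates are listed

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_matches(current_vec, others):
    """Return up to MAX_RESULTS existing issues whose similarity to the
    new issue exceeds THRESHOLD. `others` is a list of
    (issue_number, vector) pairs for previously filed issues."""
    scored = [
        {"number": num, "score": round(float(cosine_similarity(current_vec, vec)), 3)}
        for num, vec in others
    ]
    scored = [m for m in scored if m["score"] >= THRESHOLD]
    scored.sort(key=lambda m: m["score"], reverse=True)
    return scored[:MAX_RESULTS]

if __name__ == "__main__":
    # Toy embeddings: issue 101 is identical to the new issue, 102 is unrelated.
    current = [1.0, 0.0, 0.5]
    existing = [(101, [1.0, 0.0, 0.5]), (102, [0.0, 1.0, 0.0])]
    print(json.dumps(top_matches(current, existing)))
```

The real workflow writes this result to matches.json and builds the comment from it; here the same shape is just printed.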

Sequence Diagram

```mermaid
sequenceDiagram
    actor User as User (opens issue)
    participant GHA as GitHub Actions
    participant API as GitHub API
    participant Script as Python script
    participant ML as all-MiniLM-L6-v2

    User->>GHA: Issue opened event
    GHA->>API: Fetch all issues (open & closed)
    API-->>GHA: Issues list
    GHA->>Script: Save issues.json + current issue
    Script->>ML: Encode issue texts
    ML-->>Script: Embeddings
    Script->>Script: Compute cosine similarities (threshold 0.82, top 3)
    Script->>GHA: Write matches.json
    GHA->>API: Post comment with matches (if any)
    GHA->>API: Add "duplicate" label (guarded)
    API-->>GHA: Responses (or permission notices)
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

enhancement

Poem

🐰
I sniff through threads both new and old,
embeddings hum, similarities told.
Three close kin I gently show —
no closing yet, just a friendly glow.
Hops and fixes, duplicates low.

🚥 Pre-merge checks | ✅ 6 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly describes the main change: adding a semantic issue similarity analysis and duplicate detection automation workflow, which is the primary objective of this PR. |
| Linked Issues check | ✅ Passed | The PR implements all coding requirements from issue #1110: automatically detects similar issues using semantic similarity, suggests duplicates without auto-closing, and integrates as a GitHub Actions workflow. |
| Out of Scope Changes check | ✅ Passed | All changes are within scope: the single file modified (.github/workflows/duplicate_issue_detector.yaml) directly implements the automated duplicate detection workflow specified in issue #1110. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage; check skipped. |
| Merge Conflict Detection | ✅ Passed | No merge conflicts detected when merging into main. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In @.github/workflows/duplicate_issue_detector.yaml:
- Around line 146-153: The workflow is auto-applying a strong `duplicate` label
via github.rest.issues.addLabels inside the safe(() => ...) call which conflicts
with the "suggest-only" requirement and also fails with 404 if the label doesn't
exist; change this to either skip auto-labeling entirely or use a softer label
like `possible-duplicate` and ensure the label exists before adding by
checking/creating it (use github.rest.issues.getLabel or issues.createLabel) or
handle the 404 explicitly so the operation isn't silently ignored; update the
call site where addLabels(...) is invoked and adjust the surrounding safe(...)
logic accordingly.
- Around line 86-105: The matches list contains numpy scalars from
cosine_similarity (sims) which cause json.dump to fail; when building each match
in the loop in this block (variables: current_vec, other_vecs, sims, matches,
THRESHOLD, MAX_RESULTS, cosine_similarity) convert the numpy.float64 score to a
native Python float (e.g., use float(score) or score.item()) before rounding and
storing it in the dict so json.dump can serialize matches to matches.json
without error.
- Around line 31-39: The current github.paginate call stores mixed issues and
PRs in the issues variable using github.rest.issues.listForRepo; filter out pull
requests before similarity comparison by removing any item that has a
pull_request field (e.g., filter issues where !issue.pull_request) or otherwise
check issue.pull_request !== undefined during the mapping step that builds the
list for duplicate detection; update the mapping/processing that consumes issues
to only operate on true issues so PRs are excluded from comparisons.
- Line 25: Replace the outdated action reference "uses:
actions/github-script@v6" with "uses: actions/github-script@v7" wherever it
appears (the occurrence shown and the second occurrence referenced in the
comment), ensuring both instances are updated to v7 to satisfy actionlint and
current runners; no other changes are needed.
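
Two of the fixes above, dropping pull requests from the fetched list and casting numpy scores before JSON serialization, can be illustrated with a short Python sketch. The `items` payload and all variable names here are hypothetical stand-ins; the real filtering happens in the github-script step. GitHub's issues list endpoint does mark pull requests with a `pull_request` field, which is the property the fix keys on.

```python
import json

# Hypothetical payload resembling what the issues list endpoint returns:
# pull requests appear in the same list but carry a "pull_request" key.
items = [
    {"number": 1, "title": "Crash on startup"},
    {"number": 2, "title": "Fix crash", "pull_request": {"url": "..."}},
    {"number": 3, "title": "App crashes when opening"},
]

# Fix 1: keep only true issues before any similarity comparison.
issues = [it for it in items if "pull_request" not in it]

# Fix 2: cast each score to a native float before serialization.
# With scikit-learn, each score would be a numpy.float64, and json.dump
# raises TypeError on numpy scalars; float(score) or score.item() fixes it.
scores = [0.97, 0.85]  # stand-ins for cosine_similarity output
matches = [
    {"number": issue["number"], "title": issue["title"], "score": round(float(score), 3)}
    for issue, score in zip(issues, scores)
]

# Serializes cleanly now that every score is a plain Python float.
serialized = json.dumps(matches)
```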
🧹 Nitpick comments (2)
.github/workflows/duplicate_issue_detector.yaml (2)

20-22: Performance and rate-limit concern for large repositories.

This workflow fetches all issues (open and closed) on every new issue, then downloads a ~90 MB sentence-transformer model, encodes every issue, and computes pairwise similarity. For repositories with thousands of issues, this will be slow and may hit GitHub API rate limits or Action time limits.

Consider:

  • Capping the number of issues fetched (e.g., last 500 by sort: 'created', direction: 'desc').
  • Caching the Python dependencies and model using actions/cache to avoid re-downloading ~90 MB+ on every run.

Also applies to: 31-39


127-131: safe() wrapper silently swallows all errors, not just permission errors.

The catch block discards every error type (network failures, malformed responses, bugs) and logs only "Skipped write action due to permissions." This makes debugging difficult. Consider logging e.message or at least distinguishing permission errors from unexpected failures.

Proposed fix

```diff
-            const safe = async (fn) => {
-              try { await fn(); } catch {
-                core.notice('Skipped write action due to permissions');
+            const safe = async (fn) => {
+              try { await fn(); } catch (e) {
+                core.warning(`Write action failed: ${e.message}`);
               }
             };
```

aniket866 and others added 2 commits February 15, 2026 15:58
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @.github/workflows/duplicate_issue_detector.yaml:
- Around line 120-125: The issue is unsanitized interpolation of other-issue
titles into the generated comment (in the matches.map block building list) which
can break or inject markdown; modify the code to escape or sanitize m.title
before interpolation (create and call an escapeMarkdown/escapeGithubMarkdown
helper or reuse an existing sanitizer) and use that escapedTitle in the template
string that constructs list so characters like backticks, brackets,
angle-brackets, asterisks, underscores and HTML tags are neutralized.
- Around line 86-89: Guard against an empty other_vecs before calling
cosine_similarity: check if other_vecs (from embeddings[1:]) is empty and if so
skip the cosine_similarity call (e.g., set sims to an empty list or handle no
comparisons), otherwise compute sims = cosine_similarity([current_vec],
other_vecs)[0]; reference variables current_vec, other_vecs and sims and the
cosine_similarity invocation when applying the guard.
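
The two guards above can be sketched in Python. This is an illustrative translation only: the title escaping actually belongs in the workflow's github-script step, and `escape_markdown` and `similarities` are hypothetical helper names, not code from the PR.

```python
import re

def escape_markdown(title):
    """Neutralize characters that could break or inject markdown in the
    generated comment: backslashes, backticks, emphasis markers, brackets,
    angle brackets, and table pipes each get a leading backslash."""
    return re.sub(r'([\\`*_\[\]<>|])', r'\\\1', title)

def similarities(current_vec, other_vecs, cosine_fn):
    """Guard against an empty comparison set: when the repository has no
    other issues, calling the similarity function on an empty matrix would
    fail, so return no scores instead."""
    if not other_vecs:
        return []
    return cosine_fn([current_vec], other_vecs)[0]
```

With the guard in place, the first issue ever opened in a repository simply produces zero matches instead of crashing the script.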
🧹 Nitpick comments (3)
.github/workflows/duplicate_issue_detector.yaml (3)

20-22: Pin dependency versions and consider caching pip packages.

sentence-transformers and scikit-learn are installed without version constraints. A breaking release or supply-chain compromise could silently break or subvert this workflow. Pin to known-good versions (e.g., sentence-transformers==2.x.x scikit-learn==1.x.x). Also consider adding actions/cache for the pip cache directory to avoid re-downloading ~400 MB of packages + model weights on every issue open.


31-39: Scalability concern: fetching and encoding all issues is expensive for active repos.

state: 'all' paginates every issue ever filed. For repos with thousands of issues, this means many API pages fetched and thousands of texts encoded by the ML model on every single issue open event. This could easily take 5–10+ minutes and consume significant runner resources.

Consider limiting to the most recent N issues (e.g., last 500) by using sort: 'created', direction: 'desc', and capping the pagination, or caching embeddings of existing issues between runs.


10-12: Consider adding concurrency control and a job timeout.

If several issues are opened in quick succession, each triggers an independent workflow run that downloads ~400 MB of model weights and performs expensive encoding. Adding a concurrency group and a timeout-minutes limit would prevent resource waste and runaway jobs.

Suggested addition
```diff
 jobs:
   detect-duplicates:
     runs-on: ubuntu-latest
+    timeout-minutes: 10
+    concurrency:
+      group: duplicate-detection
+      cancel-in-progress: true
```

@rahulharpal1603 (Contributor) left a comment

Can you test this by opening an issue on your fork? But take the list of existing issues from our actual repo.

Please share a screen recording of the same.

@aniket866 (Author) replied, quoting:

Can you test this by opening an issue on your fork? But take the list of existing issues from our actual repo.

Please share a screen recording of the same.

Hi @rahulharpal1603, I have done all the testing and it is working well.
See aniket866#5 for more details.

Testing:

In my fork I created an issue, and this issue was compared against your original repo (pictopy) and marked as a duplicate.

It takes 2-3 minutes to analyze since it runs through the GitHub bot; CodeRabbit takes about the same time.

Here is the video. It is long, so keep forwarding and check the result at the end:
https://drive.google.com/file/d/1U3W-LWR_GCP4m1vgpbMDCZFlGSh_WgTt/view?usp=sharing

Thank you



Development

Successfully merging this pull request may close these issues.

CI/CD: Duplicate issue detection and marking
