
initialize-semantic-issue-similarity-analysis-and-duplicate-detection-automation-workflow#1175

Open
aniket866 wants to merge 4 commits into AOSSIE-Org:main from aniket866:patch-1

Conversation

@aniket866 commented Feb 15, 2026

Addressed Issues:

Closes #1110

See issue #1110 for more details.

Checklist

  • My PR addresses a single issue, fixes a single bug or makes a single improvement.
  • My code follows the project's code style and conventions.
  • If applicable, I have made corresponding changes or additions to the documentation.
  • If applicable, I have made corresponding changes or additions to tests.
  • My changes generate no new warnings or errors.
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there.
  • I have read the Contribution Guidelines.
  • Once I submit my PR, CodeRabbit AI will automatically review it and I will address CodeRabbit's comments.

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

@rahulharpal1603 Please check this out.

Summary by CodeRabbit

  • New Features
    • Automated duplicate issue detection: when a new issue is opened, the system compares it to existing issues, posts a non-blocking comment listing top potential duplicates with confidence scores and links, and automatically applies a "duplicate" label. Operations degrade gracefully if comment or labeling permissions are restricted.

github-actions bot added the CI/CD label Feb 15, 2026
coderabbitai bot (Contributor) commented Feb 15, 2026

Warning

Rate limit exceeded

@aniket866 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 3 minutes and 21 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📝 Walkthrough

Walkthrough

Adds a new GitHub Actions workflow that runs when issues are opened, encodes issue text with all-MiniLM-L6-v2, computes cosine similarities against existing issues, writes top matches to JSON, comments potential duplicates on the issue, and applies a "duplicate" label (with permission-safe operations).

Changes

Cohort / File(s) Summary
GitHub Actions Workflow
.github/workflows/duplicate_issue_detector.yaml
New workflow that triggers on issue creation, installs Python deps (sentence-transformers, scikit-learn), collects repo issues into issues.json, runs a Python script to compute embeddings and cosine similarities (threshold 0.82, max 3), outputs matches.json, posts a non-blocking comment listing matches, and attempts to add a duplicate label with graceful failure handling.
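
For readers skimming the summary, the matching step the workflow's Python script performs can be sketched in plain Python. This is a minimal stand-in, not the actual script: it assumes toy numeric vectors in place of real all-MiniLM-L6-v2 embeddings and hand-rolls cosine similarity rather than calling scikit-learn; the names `top_matches` and `others` are illustrative.

```python
import json
import math

THRESHOLD = 0.82   # minimum cosine similarity to report (from the workflow summary)
MAX_RESULTS = 3    # at most three potential duplicates are listed

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_matches(current_vec, others):
    """Return up to MAX_RESULTS existing issues whose similarity to the
    new issue exceeds THRESHOLD. `others` is a list of
    (issue_number, vector) pairs for previously filed issues."""
    scored = [
        {"number": num, "score": round(float(cosine_similarity(current_vec, vec)), 3)}
        for num, vec in others
    ]
    scored = [m for m in scored if m["score"] >= THRESHOLD]
    scored.sort(key=lambda m: m["score"], reverse=True)
    return scored[:MAX_RESULTS]

if __name__ == "__main__":
    # Toy embeddings: issue 101 is identical to the new issue, 102 is unrelated.
    current = [1.0, 0.0, 0.5]
    existing = [(101, [1.0, 0.0, 0.5]), (102, [0.0, 1.0, 0.0])]
    print(json.dumps(top_matches(current, existing)))
```

The real workflow writes this result to matches.json and builds the comment from it; here the same shape is just printed.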

Sequence Diagram

```mermaid
sequenceDiagram
    actor User as User (opens issue)
    participant GHA as GitHub Actions
    participant API as GitHub API
    participant Script as Python script
    participant ML as all-MiniLM-L6-v2

    User->>GHA: Issue opened event
    GHA->>API: Fetch all issues (open & closed)
    API-->>GHA: Issues list
    GHA->>Script: Save issues.json + current issue
    Script->>ML: Encode issue texts
    ML-->>Script: Embeddings
    Script->>Script: Compute cosine similarities (threshold 0.82, top 3)
    Script->>GHA: Write matches.json
    GHA->>API: Post comment with matches (if any)
    GHA->>API: Add "duplicate" label (guarded)
    API-->>GHA: Responses (or permission notices)
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

enhancement

Poem

🐰
I sniff through threads both new and old,
embeddings hum, similarities told.
Three close kin I gently show —
no closing yet, just a friendly glow.
Hops and fixes, duplicates low.

🚥 Pre-merge checks | ✅ 6 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly describes the main change: adding a semantic issue similarity analysis and duplicate detection automation workflow, which is the primary objective of this PR. |
| Linked Issues check | ✅ Passed | The PR implements all coding requirements from issue #1110: automatically detects similar issues using semantic similarity, suggests duplicates without auto-closing, and integrates as a GitHub Actions workflow. |
| Out of Scope Changes check | ✅ Passed | All changes are within scope: the single file modified (.github/workflows/duplicate_issue_detector.yaml) directly implements the automated duplicate detection workflow specified in issue #1110. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage; check skipped. |
| Merge Conflict Detection | ✅ Passed | No merge conflicts detected when merging into main. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In @.github/workflows/duplicate_issue_detector.yaml:
- Around line 146-153: The workflow is auto-applying a strong `duplicate` label
via github.rest.issues.addLabels inside the safe(() => ...) call which conflicts
with the "suggest-only" requirement and also fails with 404 if the label doesn't
exist; change this to either skip auto-labeling entirely or use a softer label
like `possible-duplicate` and ensure the label exists before adding by
checking/creating it (use github.rest.issues.getLabel or issues.createLabel) or
handle the 404 explicitly so the operation isn't silently ignored; update the
call site where addLabels(...) is invoked and adjust the surrounding safe(...)
logic accordingly.
- Around line 86-105: The matches list contains numpy scalars from
cosine_similarity (sims) which cause json.dump to fail; when building each match
in the loop in this block (variables: current_vec, other_vecs, sims, matches,
THRESHOLD, MAX_RESULTS, cosine_similarity) convert the numpy.float64 score to a
native Python float (e.g., use float(score) or score.item()) before rounding and
storing it in the dict so json.dump can serialize matches to matches.json
without error.
- Around line 31-39: The current github.paginate call stores mixed issues and
PRs in the issues variable using github.rest.issues.listForRepo; filter out pull
requests before similarity comparison by removing any item that has a
pull_request field (e.g., filter issues where !issue.pull_request) or otherwise
check issue.pull_request !== undefined during the mapping step that builds the
list for duplicate detection; update the mapping/processing that consumes issues
to only operate on true issues so PRs are excluded from comparisons.
- Line 25: Replace the outdated action reference "uses:
actions/github-script@v6" with "uses: actions/github-script@v7" wherever it
appears (the occurrence shown and the second occurrence referenced in the
comment), ensuring both instances are updated to v7 to satisfy actionlint and
current runners; no other changes are needed.
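
Two of the fixes above, dropping pull requests from the fetched list and casting numpy scores before JSON serialization, can be illustrated with a short Python sketch. The `items` payload and all variable names here are hypothetical stand-ins; the real filtering happens in the github-script step. GitHub's issues list endpoint does mark pull requests with a `pull_request` field, which is the property the fix keys on.

```python
import json

# Hypothetical payload resembling what the issues list endpoint returns:
# pull requests appear in the same list but carry a "pull_request" key.
items = [
    {"number": 1, "title": "Crash on startup"},
    {"number": 2, "title": "Fix crash", "pull_request": {"url": "..."}},
    {"number": 3, "title": "App crashes when opening"},
]

# Fix 1: keep only true issues before any similarity comparison.
issues = [it for it in items if "pull_request" not in it]

# Fix 2: cast each score to a native float before serialization.
# With scikit-learn, each score would be a numpy.float64, and json.dump
# raises TypeError on numpy scalars; float(score) or score.item() fixes it.
scores = [0.97, 0.85]  # stand-ins for cosine_similarity output
matches = [
    {"number": issue["number"], "title": issue["title"], "score": round(float(score), 3)}
    for issue, score in zip(issues, scores)
]

# Serializes cleanly now that every score is a plain Python float.
serialized = json.dumps(matches)
```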
🧹 Nitpick comments (2)
.github/workflows/duplicate_issue_detector.yaml (2)

20-22: Performance and rate-limit concern for large repositories.

This workflow fetches all issues (open and closed) on every new issue, then downloads a ~90 MB sentence-transformer model, encodes every issue, and computes pairwise similarity. For repositories with thousands of issues, this will be slow and may hit GitHub API rate limits or Action time limits.

Consider:

  • Capping the number of issues fetched (e.g., last 500 by sort: 'created', direction: 'desc').
  • Caching the Python dependencies and model using actions/cache to avoid re-downloading ~90 MB+ on every run.

Also applies to: 31-39


127-131: safe() wrapper silently swallows all errors, not just permission errors.

The catch block discards every error type (network failures, malformed responses, bugs) and logs only "Skipped write action due to permissions." This makes debugging difficult. Consider logging e.message or at least distinguishing permission errors from unexpected failures.

Proposed fix

```diff
-            const safe = async (fn) => {
-              try { await fn(); } catch {
-                core.notice('Skipped write action due to permissions');
+            const safe = async (fn) => {
+              try { await fn(); } catch (e) {
+                core.warning(`Write action failed: ${e.message}`);
               }
             };
```

aniket866 and others added 2 commits February 15, 2026 15:58
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @.github/workflows/duplicate_issue_detector.yaml:
- Around line 120-125: The issue is unsanitized interpolation of other-issue
titles into the generated comment (in the matches.map block building list) which
can break or inject markdown; modify the code to escape or sanitize m.title
before interpolation (create and call an escapeMarkdown/escapeGithubMarkdown
helper or reuse an existing sanitizer) and use that escapedTitle in the template
string that constructs list so characters like backticks, brackets,
angle-brackets, asterisks, underscores and HTML tags are neutralized.
- Around line 86-89: Guard against an empty other_vecs before calling
cosine_similarity: check if other_vecs (from embeddings[1:]) is empty and if so
skip the cosine_similarity call (e.g., set sims to an empty list or handle no
comparisons), otherwise compute sims = cosine_similarity([current_vec],
other_vecs)[0]; reference variables current_vec, other_vecs and sims and the
cosine_similarity invocation when applying the guard.
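
The two guards above can be sketched in Python. This is an illustrative translation only: the title escaping actually belongs in the workflow's github-script step, and `escape_markdown` and `similarities` are hypothetical helper names, not code from the PR.

```python
import re

def escape_markdown(title):
    """Neutralize characters that could break or inject markdown in the
    generated comment: backslashes, backticks, emphasis markers, brackets,
    angle brackets, and table pipes each get a leading backslash."""
    return re.sub(r'([\\`*_\[\]<>|])', r'\\\1', title)

def similarities(current_vec, other_vecs, cosine_fn):
    """Guard against an empty comparison set: when the repository has no
    other issues, calling the similarity function on an empty matrix would
    fail, so return no scores instead."""
    if not other_vecs:
        return []
    return cosine_fn([current_vec], other_vecs)[0]
```

With the guard in place, the first issue ever opened in a repository simply produces zero matches instead of crashing the script.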
🧹 Nitpick comments (3)
.github/workflows/duplicate_issue_detector.yaml (3)

20-22: Pin dependency versions and consider caching pip packages.

sentence-transformers and scikit-learn are installed without version constraints. A breaking release or supply-chain compromise could silently break or subvert this workflow. Pin to known-good versions (e.g., sentence-transformers==2.x.x scikit-learn==1.x.x). Also consider adding actions/cache for the pip cache directory to avoid re-downloading ~400 MB of packages + model weights on every issue open.


31-39: Scalability concern: fetching and encoding all issues is expensive for active repos.

state: 'all' paginates every issue ever filed. For repos with thousands of issues, this means many API pages fetched and thousands of texts encoded by the ML model on every single issue open event. This could easily take 5–10+ minutes and consume significant runner resources.

Consider limiting to the most recent N issues (e.g., last 500) by using sort: 'created', direction: 'desc', and capping the pagination, or caching embeddings of existing issues between runs.


10-12: Consider adding concurrency control and a job timeout.

If several issues are opened in quick succession, each triggers an independent workflow run that downloads ~400 MB of model weights and performs expensive encoding. Adding a concurrency group and a timeout-minutes limit would prevent resource waste and runaway jobs.

Suggested addition
```diff
 jobs:
   detect-duplicates:
     runs-on: ubuntu-latest
+    timeout-minutes: 10
+    concurrency:
+      group: duplicate-detection
+      cancel-in-progress: true
```

@rahulharpal1603 (Contributor) left a comment

Can you test this by opening an issue on your fork? But take the list of existing issues from our actual repo.

Please share a screen recording of the same.

@aniket866 (Author) replied, quoting:

Can you test this by opening an issue on your fork? But take the list of existing issues from our actual repo.

Please share a screen recording of the same.

Hi @rahulharpal1603, I have done all the testing and it is working well.
See aniket866#5 for more details.

Testing:

In my fork I created an issue, and this issue was compared against your original repo (pictopy) and marked as a duplicate.

It takes 2-3 minutes to analyze since it runs through the GitHub bot; CodeRabbit takes about the same time.

Here is the video. It is long, so keep forwarding and check the result at the end:
https://drive.google.com/file/d/1U3W-LWR_GCP4m1vgpbMDCZFlGSh_WgTt/view?usp=sharing

Thank you



Development

Successfully merging this pull request may close these issues.

CI/CD: Duplicate issue detection and marking
