[feat]: add user profile summarizing and generation of embeddings #91

smokeyScraper · 2025-07-08T17:54:44Z

closes #67

closes #77

Summary by CodeRabbit

New Features
- Introduced AI-generated, keyword-rich profile summaries for GitHub users to enhance semantic search and contributor recommendations.
- Added advanced search capabilities, including vector similarity and keyword-based searches for finding relevant contributors.
- User profiles now support embedding vectors for improved search accuracy.
Enhancements
- Improved error handling and logging during profile creation, embedding generation, and storage.
- Enhanced profile summarization workflow with integration of a language model for technical, concise summaries.
- Disabled automatic vectorization in schema creation to better control embedding behavior.
Removals
- Removed Supabase vector database support and related services, consolidating vector operations within the new architecture.

…ng as embedding

coderabbitai · 2025-07-08T17:54:51Z

Walkthrough

The changes introduce advanced user profile summarization and semantic search capabilities using vector embeddings and LLM-generated summaries. The Weaviate database integration is enhanced with new methods for vector-based and keyword-based contributor search, profile retrieval, and explicit vectorization configuration. The Supabase vector DB service is removed, and the embedding service now orchestrates profile summarization, embedding, and similarity search.

Changes

File(s)	Change Summary
backend/app/database/weaviate/__init__.py	Added package initializer to expose Weaviate operations and client functions via `__all__`.
backend/app/database/weaviate/operations.py	Updated user profile methods to support embedding vectors; added async search and retrieval methods; introduced top-level convenience functions; improved error handling.
backend/app/database/weaviate/scripts/create_schemas.py	Explicitly disables built-in vectorization when creating Weaviate schema collections.
backend/app/services/embedding_service/profile_summarization/prompts/summarization_prompt.py	Added a prompt template constant for generating concise, keyword-rich developer profile summaries optimized for semantic search.
backend/app/services/embedding_service/service.py	Replaced generic embedding item processing with profile summarization, embedding, and similarity search; added LLM integration; introduced result model for summaries; removed Supabase-related methods.
backend/app/services/user/profiling.py	Integrated embedding service into user profiling process; updated flow to generate and store profile embeddings; improved error handling and logging.
backend/app/services/vector_db/service.py	Removed the Supabase vector DB service and its associated Pydantic model and all CRUD/search methods.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant ProfilingService as User Profiling
    participant EmbeddingService
    participant WeaviateDB

    User->>ProfilingService: profile_user_from_github(user_id, github_username)
    ProfilingService->>ProfilingService: build_user_profile(...)
    ProfilingService->>EmbeddingService: process_user_profile(profile)
    EmbeddingService->>EmbeddingService: summarize_user_profile(profile)
    EmbeddingService->>EmbeddingService: get_embedding(summary)
    EmbeddingService-->>ProfilingService: (profile, embedding_vector)
    ProfilingService->>WeaviateDB: store_user_profile(profile, embedding_vector)
    WeaviateDB-->>ProfilingService: success/failure
    ProfilingService-->>User: True/False
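
For illustration, the profiling flow in the diagram above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the PR's code: the function bodies, the in-memory `STORE`, and the toy two-element vector are hypothetical stand-ins for the GitHub fetch, the LLM summary plus embedding, and the Weaviate write.

```python
import asyncio
from typing import Optional

# Hypothetical in-memory stand-in for the Weaviate collection.
STORE: dict[str, tuple[dict, list[float]]] = {}


async def build_user_profile(user_id: str, github_username: str) -> Optional[dict]:
    """Stand-in for the GitHub fetch; returns None when no profile can be built."""
    if not github_username:
        return None
    return {"user_id": user_id, "github_username": github_username, "languages": ["Python"]}


async def process_user_profile(profile: dict) -> tuple[dict, list[float]]:
    """Stand-in for summarize + embed; the 'embedding' is a toy vector."""
    summary = f"{profile['github_username']} works with {', '.join(profile['languages'])}"
    enriched = {**profile, "summary": summary}
    vector = [float(len(summary)), float(len(profile["languages"]))]
    return enriched, vector


async def store_user_profile(profile: dict, vector: list[float]) -> bool:
    """Stand-in for the database write; reports success as a boolean."""
    STORE[profile["user_id"]] = (profile, vector)
    return True


async def profile_user_from_github(user_id: str, github_username: str) -> bool:
    """Build the profile, embed it, store it, and return True/False as in the diagram."""
    profile = await build_user_profile(user_id, github_username)
    if profile is None:
        return False
    profile, vector = await process_user_profile(profile)
    return await store_user_profile(profile, vector)
```

A caller would run it with `asyncio.run(profile_user_from_github("u1", "octocat"))` and check the boolean result.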

sequenceDiagram
    participant User
    participant EmbeddingService
    participant WeaviateDB

    User->>EmbeddingService: search_similar_profiles(query_text)
    EmbeddingService->>EmbeddingService: get_embedding(query_text)
    EmbeddingService->>WeaviateDB: search_similar_contributors(query_embedding)
    WeaviateDB-->>EmbeddingService: List of similar profiles
    EmbeddingService-->>User: List of similar profiles
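
The search flow above boils down to a nearest-neighbour lookup over stored vectors. The sketch below shows that idea in plain Python, assuming cosine distance as the metric. The `profiles` dictionary and the function signatures are hypothetical; in the PR the lookup is delegated to Weaviate rather than computed in process.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity; treat a zero-length vector as maximally unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 1.0
    return 1.0 - dot / norm


def search_similar_contributors(
    query_embedding: list[float],
    profiles: dict[str, list[float]],
    limit: int = 10,
) -> list[dict]:
    """Rank stored profile vectors by distance to the query vector (closest first)."""
    scored = sorted(
        (cosine_distance(query_embedding, vector), name)
        for name, vector in profiles.items()
    )
    return [{"github_username": name, "distance": dist} for dist, name in scored[:limit]]
```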

Assessment against linked issues

Objective	Addressed	Explanation
Extract multi-dimensional data from GitHub profiles, repositories, and contributions (#67)	✅	Profile building integrates GitHub data and processes it for embedding and storage.
Analyze expertise levels based on actual code contributions (#67)	✅	Summarization prompt and profile processing emphasize recent PRs and technical skills.
Create semantic chunks for searchable user expertise (#67)	✅	Embedding vectors and LLM-generated summaries enable semantic chunking and search.
Enable smart matching via semantic search (#67, #77)	✅	Added vector similarity and keyword search methods in Weaviate integration and embedding service.
Integrate embedding generation and hybrid database architecture in agent workflows (#77)	✅	Embedding service is integrated into profiling; Supabase vector DB service removed; Weaviate used.

Assessment against linked issues: Out-of-scope changes

No out-of-scope changes detected.

Suggested labels

enhancement

Poem

In the warren of code, a new path appears,
With vectors and summaries—let’s give three cheers!
Profiles now sparkle with embeddings so bright,
Weaviate searches bring matches to light.
Supabase hops off, as LLMs take the stage—
Oh, what a leap for our developer page!
🐇✨

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4ff9678 and 897238b.

📒 Files selected for processing (2)

backend/app/database/weaviate/operations.py (9 hunks)
backend/app/services/embedding_service/service.py (4 hunks)

🧰 Additional context used

🧠 Learnings (1)

📓 Common learnings

Learnt from: smokeyScraper
PR: AOSSIE-Org/Devr.AI#87
File: tests/test_supabase.py:1-3
Timestamp: 2025-06-28T23:15:13.374Z
Learning: In the Devr.AI project, smokeyScraper prefers to defer test updates and fixes (like missing imports after module reorganization) to separate PRs rather than expanding the scope of module update/chore PRs to include comprehensive test refactoring.

Learnt from: smokeyScraper
PR: AOSSIE-Org/Devr.AI#90
File: backend/app/agents/devrel/nodes/react_supervisor.py:97-101
Timestamp: 2025-07-05T04:33:39.840Z
Learning: In the Devr.AI project, smokeyScraper prefers to defer code deduplication refactoring (like extracting duplicate functions to shared utilities) until there are more common functionalities present among tools/workflow. With only two files using the same function, they consider it not a problem currently and prefer to "align later in a more better way" once more patterns emerge.

Learnt from: smokeyScraper
PR: AOSSIE-Org/Devr.AI#85
File: tests/test_supabase.py:1-3
Timestamp: 2025-06-28T14:45:55.244Z
Learning: In the Devr.AI project, smokeyScraper prefers to defer comprehensive test refactoring to separate PRs/efforts when doing major backend restructuring, rather than expanding the scope of the current refactoring PR to include test updates.

Learnt from: smokeyScraper
PR: AOSSIE-Org/Devr.AI#85
File: backend/app/services/auth/management.py:32-33
Timestamp: 2025-06-28T14:44:36.819Z
Learning: In the Devr.AI project, smokeyScraper prefers using machine timezone (IST) for datetime operations during development and testing for easier debugging, with plans to switch to UTC for deployment later.

🔇 Additional comments (16)

backend/app/services/embedding_service/service.py (7)

5-12: LGTM! Clean imports organization.

The new imports are well-organized and necessary for the enhanced functionality. The addition of Pydantic BaseModel, Langchain components, and Weaviate models aligns perfectly with the new profile summarization and semantic search capabilities.

23-27: LGTM! Well-structured data model.

The ProfileSummaryResult class provides a clean structure for encapsulating summarization results with proper typing and optional embedding field.

29-38: LGTM! Updated class documentation and initialization.

The class docstring correctly reflects the new Weaviate integration focus, and the initialization properly handles the new LLM instance variable.

54-68: LGTM! Proper lazy loading implementation for LLM.

The LLM property follows the same lazy loading pattern as the embedding model, with appropriate error handling and logging. The use of settings for configuration values is good practice.
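
As a point of reference, the lazy loading pattern described here usually looks something like the sketch below. The class name and the injected `llm_factory` callable are hypothetical; the real service presumably builds its model from settings instead.

```python
import logging

logger = logging.getLogger(__name__)


class LazyLLMHolder:
    """Creates the LLM on first access and reuses it afterwards."""

    def __init__(self, llm_factory):
        # llm_factory is a hypothetical zero-argument callable that builds the model.
        self._llm_factory = llm_factory
        self._llm = None

    @property
    def llm(self):
        if self._llm is None:
            try:
                self._llm = self._llm_factory()
                logger.info("LLM initialized")
            except Exception:
                # Log with traceback, then let the caller decide how to recover.
                logger.exception("Failed to initialize LLM")
                raise
        return self._llm
```

The benefit is that constructing the service stays cheap, and the cost of loading the model is paid only by code paths that actually need it.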

111-157: LGTM! Comprehensive profile summarization implementation.

The method properly extracts and formats profile data for LLM processing, handles empty/null values gracefully, and includes appropriate logging throughout. The token estimation approach is reasonable for monitoring purposes.
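
A rough sketch of what "extract, format, skip empty values, estimate tokens" can look like is shown below. The field names and the four-characters-per-token heuristic are assumptions for illustration only; the comment above does not state which fields or which estimation rule the PR uses.

```python
def format_profile_for_prompt(profile: dict) -> str:
    """Render the non-empty profile fields as 'Label: value' lines for the LLM prompt."""
    # Hypothetical field list; the real profile model may expose different attributes.
    fields = (
        ("Username", "github_username"),
        ("Bio", "bio"),
        ("Languages", "languages"),
        ("Topics", "topics"),
    )
    lines = []
    for label, key in fields:
        value = profile.get(key)
        if not value:
            continue  # skips None, empty strings, and empty lists
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(v) for v in value)
        lines.append(f"{label}: {value}")
    return "\n".join(lines)


def estimate_tokens(text: str) -> int:
    """Very rough size estimate (about 4 characters per token); for monitoring only."""
    return len(text) // 4
```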

178-203: LGTM! Well-implemented semantic search method.

The method properly generates query embeddings, integrates with Weaviate operations, and includes comprehensive logging. The import placement inside the method avoids circular dependency issues.

218-226: LGTM! Enhanced cache clearing with LLM support.

The cache clearing method now properly handles both the embedding model and LLM instances, with appropriate garbage collection and CUDA cache clearing for memory management.
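
For readers unfamiliar with the pattern, a cache-clearing method of this kind might be sketched as follows. The class and attribute names are hypothetical, and the `torch` import is treated as optional so the sketch also runs on machines without PyTorch installed.

```python
import gc


class ModelCache:
    """Holds lazily created model instances and can release them on demand."""

    def __init__(self):
        self._model = None  # embedding model instance, if loaded
        self._llm = None    # LLM instance, if loaded

    def clear_cache(self) -> None:
        # Drop the references first so the garbage collector can reclaim them.
        self._model = None
        self._llm = None
        gc.collect()
        try:
            import torch  # optional dependency in this sketch
        except ImportError:
            return
        if torch.cuda.is_available():
            # Release cached GPU memory held by PyTorch's allocator.
            torch.cuda.empty_cache()
```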

backend/app/database/weaviate/operations.py (9)

3-9: LGTM! Proper imports for new functionality.

The additional imports for Filter and enhanced typing support the new search capabilities effectively.

30-30: LGTM! Fixed query filter parameter.

The correction from where to filters resolves the parameter naming issue and aligns with the Weaviate client API.

47-70: LGTM! Proper embedding vector integration.

The create_user_profile method now correctly accepts and passes the embedding vector to Weaviate's data operations, maintaining the same error handling pattern.

72-95: LGTM! Consistent embedding vector support in updates.

The update method follows the same pattern as create, properly handling the embedding vector parameter.

97-113: LGTM! Enhanced upsert with embedding vector support.

The upsert method correctly passes the embedding vector to both create and update operations, maintaining consistency across the API.
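
The create/update/upsert relationship can be illustrated with a small in-memory sketch. Everything here is hypothetical (the class, the dictionary store, the string return values); it only shows how an upsert can forward the same embedding vector to whichever branch applies.

```python
from typing import Optional


class InMemoryProfiles:
    """Toy store keyed by user_id; each row keeps properties plus an optional vector."""

    def __init__(self):
        self._rows: dict[str, dict] = {}

    def create(self, user_id: str, properties: dict, vector: Optional[list[float]] = None) -> None:
        self._rows[user_id] = {"properties": dict(properties), "vector": vector}

    def update(self, user_id: str, properties: dict, vector: Optional[list[float]] = None) -> None:
        row = self._rows[user_id]
        row["properties"].update(properties)
        if vector is not None:
            row["vector"] = vector  # keep the old vector when none is supplied

    def upsert(self, user_id: str, properties: dict, vector: Optional[list[float]] = None) -> str:
        """Update when the user exists, otherwise create; the vector goes to both paths."""
        if user_id in self._rows:
            self.update(user_id, properties, vector)
            return "updated"
        self.create(user_id, properties, vector)
        return "created"
```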

115-164: LGTM! Comprehensive vector similarity search implementation.

The method implements proper vector similarity search with:

Appropriate logging and error handling

Proper distance-to-similarity conversion

Comprehensive result formatting

Graceful handling of malformed results
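
To make the last two points concrete, here is a minimal sketch of result formatting. It assumes cosine distance, so similarity is taken as `1 - distance`, and it assumes a hypothetical raw result shape of `{"distance": ..., "properties": {...}}`; neither detail is confirmed by the comment above.

```python
import logging

logger = logging.getLogger(__name__)


def format_search_results(raw_results: list[dict]) -> list[dict]:
    """Convert raw hits into a uniform shape, skipping entries that cannot be parsed."""
    formatted = []
    for item in raw_results:
        try:
            distance = float(item["distance"])
            username = item["properties"]["github_username"]
        except (KeyError, TypeError, ValueError):
            # A single bad record should not fail the whole search.
            logger.warning("Skipping malformed search result: %r", item)
            continue
        formatted.append(
            {
                "github_username": username,
                "distance": distance,
                "similarity_score": 1.0 - distance,  # assumption: cosine distance
            }
        )
    return formatted
```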

166-214: LGTM! Well-implemented keyword search functionality.

The BM25 keyword search method provides a complementary search capability with proper query construction and result processing.
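
For context on what BM25 ranking does, the sketch below scores documents with the standard Okapi BM25 formula in plain Python. It is illustrative only: the PR relies on Weaviate's built-in BM25 rather than computing scores itself, and the whitespace tokenizer and the `k1`/`b` defaults here are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query: str, documents: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Return one Okapi BM25 score per document for a whitespace-tokenized query."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs if n_docs else 0.0

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: long documents need more matches to score the same.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```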

216-216: Good documentation for future enhancement.

The TODO comment appropriately notes the limitation of Weaviate's built-in hybrid search with custom vectors, providing context for future development.
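
One possible shape for the future enhancement is to run the vector search and the keyword search separately and merge the two ranked lists. The sketch below uses reciprocal rank fusion for the merge; this is a suggested technique, not something the PR implements, and the constant `k=60` is just a commonly used default.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists; items ranked highly in many lists come first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending); break ties alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))
```

For example, `reciprocal_rank_fusion([vector_hits, keyword_hits])` would take the usernames returned by each search, in rank order, and return a single combined ordering.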

281-306: min_distance default values now appear consistent.

A past review comment noted an inconsistency in the min_distance defaults. The convenience function search_similar_contributors and the class method both default to min_distance=0.7, so this appears to be resolved.


coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (3)

backend/app/services/embedding_service/service.py (2)
190-191: Consider moving import to module level.

While importing inside the method avoids circular imports, it's better practice to handle this at the module level with proper import organization.

Move the import to the top of the file:
+from app.database.weaviate.operations import search_similar_contributors
 from app.models.database.weaviate import WeaviateUserProfile
Then remove the import from lines 190-191.

222-223: Move gc import to module level.

Standard practice is to import modules at the top of the file.

Move import gc to the top of the file with other imports.

backend/app/database/weaviate/operations.py (1)

216-217: Valid TODO: Document the limitation clearly.

The comment correctly identifies that Weaviate's built-in hybrid search doesn't support custom vectors. Consider creating an issue to track this enhancement.

Would you like me to create an issue to track the implementation of a custom hybrid search solution?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8700e48 and 4ff9678.

📒 Files selected for processing (7)

backend/app/database/weaviate/__init__.py (1 hunks)
backend/app/database/weaviate/operations.py (9 hunks)
backend/app/database/weaviate/scripts/create_schemas.py (1 hunks)
backend/app/services/embedding_service/profile_summarization/prompts/summarization_prompt.py (1 hunks)
backend/app/services/embedding_service/service.py (4 hunks)
backend/app/services/user/profiling.py (3 hunks)
backend/app/services/vector_db/service.py (0 hunks)

💤 Files with no reviewable changes (1)

backend/app/services/vector_db/service.py

🧰 Additional context used

🧬 Code Graph Analysis (3)

backend/app/database/weaviate/__init__.py (2)

backend/app/database/weaviate/operations.py (8)

store_user_profile (281-286)

search_similar_contributors (115-164)

search_similar_contributors (288-293)

search_contributors_by_keywords (166-214)

search_contributors_by_keywords (295-300)

get_contributor_profile (218-260)

get_contributor_profile (302-305)

WeaviateUserOperations (13-278)

backend/app/database/weaviate/client.py (1)

get_weaviate_client (19-32)

backend/app/services/user/profiling.py (2)

backend/app/services/embedding_service/service.py (3)

EmbeddingService (29-226)

process_user_profile (159-176)

clear_cache (213-226)

backend/app/database/weaviate/operations.py (1)

store_user_profile (281-286)

backend/app/services/embedding_service/service.py (1)

backend/app/models/database/weaviate.py (1)

WeaviateUserProfile (32-129)

🔇 Additional comments (9)

backend/app/database/weaviate/scripts/create_schemas.py (1)

9-9: LGTM!

Disabling automatic vectorization is the correct approach since embeddings are generated externally by the embedding service.

backend/app/database/weaviate/__init__.py (1)

1-18: Well-structured package initialization.

The exports provide a clean public API for the Weaviate operations module.

backend/app/services/embedding_service/profile_summarization/prompts/summarization_prompt.py (1)

1-24: Well-crafted prompt template for profile summarization.

The prompt provides clear instructions for generating keyword-rich summaries optimized for semantic search and contributor matching. The structure and guidelines are comprehensive.

backend/app/services/user/profiling.py (2)

9-9: Correct import for embedding service integration.

303-326: Excellent error handling and resource management.

The implementation correctly:

Checks for None profile before processing

Handles embedding service exceptions

Clears the embedding service cache in the finally block to prevent memory leaks
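
The three points above correspond to a familiar try/except/finally shape, sketched below. `FakeEmbeddingService`, `embed_and_store`, and the list used as a store are hypothetical test doubles, not the project's classes.

```python
import logging

logger = logging.getLogger(__name__)


class FakeEmbeddingService:
    """Test double that can be told to fail and records whether its cache was cleared."""

    def __init__(self, fail: bool = False):
        self.fail = fail
        self.cache_cleared = False

    def process_user_profile(self, profile: dict):
        if self.fail:
            raise RuntimeError("embedding backend unavailable")
        return profile, [0.1, 0.2]

    def clear_cache(self) -> None:
        self.cache_cleared = True


def embed_and_store(profile, service: FakeEmbeddingService, store: list) -> bool:
    """Return True on success; always release model memory once processing was attempted."""
    if profile is None:
        logger.error("No profile to process")
        return False
    try:
        profile, vector = service.process_user_profile(profile)
        store.append((profile, vector))
        return True
    except Exception:
        logger.exception("Failed to embed or store profile")
        return False
    finally:
        # Runs on both the success and the failure path.
        service.clear_cache()
```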

backend/app/services/embedding_service/service.py (1)

54-68: Good choice of temperature for consistent summarization.

Using temperature=0.3 for the LLM ensures relatively consistent and deterministic profile summaries, which is appropriate for this use case.

backend/app/database/weaviate/operations.py (3)

30-30: Good fix: Corrected parameter name to match Weaviate API.

The change from where to filters aligns with Weaviate's query API requirements.

166-215: Well-implemented keyword search functionality.

The BM25 search implementation is correct with proper error handling and result formatting.

47-47: All embedding_vector callers are up to date—no further changes needed.

I’ve verified that:

There are no external calls to create_user_profile, update_user_profile, or upsert_user_profile outside of operations.py itself.

The only caller of the convenience entry point store_user_profile, in backend/app/services/user/profiling.py, already passes the new embedding_vector argument.

The test helper update_user_profile in tests/test_weaviate.py is a locally defined function and does not reference the updated method signature.

No action is required on existing callers.

backend/app/services/embedding_service/service.py

backend/app/database/weaviate/operations.py

smokeyScraper added 5 commits July 8, 2025 21:56

[feat]: add vectorizer field to none to support custom vectors

b918178

[feat]: add weaviate search operations

3266b92

[feat]: add user profile summarization logic for further indexing

45d872d

[refactor]: align workflow to support profile summarization and storing as embedding

c5dbfd1

[chore]: remove unused vector_db service and its init file

4ff9678

coderabbitai bot reviewed Jul 8, 2025

[chore]: coderrabbit fixes

897238b

smokeyScraper requested a review from chandansgowda July 8, 2025 21:22

chandansgowda approved these changes Jul 12, 2025

chandansgowda merged commit 6c7dc98 into AOSSIE-Org:main Jul 12, 2025
1 check passed

coderabbitai bot mentioned this pull request Jul 22, 2025

[feat]: implement github contributor recommendation tool #110

Merged
