@smokeyScraper smokeyScraper commented Jul 8, 2025

closes #67

closes #77

Summary by CodeRabbit

  • New Features

    • Introduced AI-generated, keyword-rich profile summaries for GitHub users to enhance semantic search and contributor recommendations.
    • Added advanced search capabilities, including vector similarity and keyword-based searches for finding relevant contributors.
    • User profiles now support embedding vectors for improved search accuracy.
  • Enhancements

    • Improved error handling and logging during profile creation, embedding generation, and storage.
    • Enhanced profile summarization workflow with integration of a language model for technical, concise summaries.
    • Disabled automatic vectorization in schema creation to better control embedding behavior.
  • Removals

    • Removed Supabase vector database support and related services, consolidating vector operations within the new architecture.

coderabbitai bot commented Jul 8, 2025

Walkthrough

The changes introduce advanced user profile summarization and semantic search capabilities using vector embeddings and LLM-generated summaries. The Weaviate database integration is enhanced with new methods for vector-based and keyword-based contributor search, profile retrieval, and explicit vectorization configuration. The Supabase vector DB service is removed, and the embedding service now orchestrates profile summarization, embedding, and similarity search.

Changes

| File(s) | Change Summary |
|---|---|
| backend/app/database/weaviate/__init__.py | Added package initializer to expose Weaviate operations and client functions via __all__. |
| backend/app/database/weaviate/operations.py | Updated user profile methods to support embedding vectors; added async search and retrieval methods; introduced top-level convenience functions; improved error handling. |
| backend/app/database/weaviate/scripts/create_schemas.py | Explicitly disables built-in vectorization when creating Weaviate schema collections. |
| backend/app/services/embedding_service/profile_summarization/prompts/summarization_prompt.py | Added a prompt template constant for generating concise, keyword-rich developer profile summaries optimized for semantic search. |
| backend/app/services/embedding_service/service.py | Replaced generic embedding item processing with profile summarization, embedding, and similarity search; added LLM integration; introduced result model for summaries; removed Supabase-related methods. |
| backend/app/services/user/profiling.py | Integrated embedding service into user profiling process; updated flow to generate and store profile embeddings; improved error handling and logging. |
| backend/app/services/vector_db/service.py | Removed the Supabase vector DB service, its associated Pydantic model, and all CRUD/search methods. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant ProfilingService as User Profiling
    participant EmbeddingService
    participant WeaviateDB

    User->>ProfilingService: profile_user_from_github(user_id, github_username)
    ProfilingService->>ProfilingService: build_user_profile(...)
    ProfilingService->>EmbeddingService: process_user_profile(profile)
    EmbeddingService->>EmbeddingService: summarize_user_profile(profile)
    EmbeddingService->>EmbeddingService: get_embedding(summary)
    EmbeddingService-->>ProfilingService: (profile, embedding_vector)
    ProfilingService->>WeaviateDB: store_user_profile(profile, embedding_vector)
    WeaviateDB-->>ProfilingService: success/failure
    ProfilingService-->>User: True/False
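
For readers following the first diagram, here is a minimal sketch of the same call order (build profile, summarize, embed, store). Everything in it is a hypothetical stand-in: `FakeEmbeddingService`, the toy embedding, the `summary` field, and the in-memory `STORE` are assumptions for illustration, not the PR's actual LLM, embedding model, or Weaviate calls.

```python
import asyncio

# Hypothetical in-memory store standing in for Weaviate.
STORE: dict[str, tuple[dict, list[float]]] = {}


class FakeEmbeddingService:
    """Stand-in for the embedding service; no LLM or model is loaded."""

    async def summarize_user_profile(self, profile: dict) -> str:
        langs = ", ".join(profile.get("languages", []))
        return f"{profile['github_username']}: {langs}"

    async def get_embedding(self, text: str) -> list[float]:
        # Toy embedding: deterministic numbers, not a real model output.
        return [float(len(text)), float(sum(map(ord, text)) % 97)]

    async def process_user_profile(self, profile: dict) -> tuple[dict, list[float]]:
        summary = await self.summarize_user_profile(profile)
        vector = await self.get_embedding(summary)
        # "summary" is an assumed field name for this sketch.
        return {**profile, "summary": summary}, vector


async def store_user_profile(profile: dict, embedding_vector: list[float]) -> bool:
    STORE[profile["user_id"]] = (profile, embedding_vector)
    return True


async def profile_user_from_github(user_id: str, github_username: str) -> bool:
    service = FakeEmbeddingService()
    # A real implementation would build this from the GitHub API.
    profile = {"user_id": user_id, "github_username": github_username,
               "languages": ["Python", "Go"]}
    try:
        profile, vector = await service.process_user_profile(profile)
        return await store_user_profile(profile, vector)
    except Exception:
        return False


ok = asyncio.run(profile_user_from_github("u1", "octocat"))
```

The caller only sees a boolean, which matches the `True/False` return in the diagram; the profile and its vector travel together so they are stored in one step.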
sequenceDiagram
    participant User
    participant EmbeddingService
    participant WeaviateDB

    User->>EmbeddingService: search_similar_profiles(query_text)
    EmbeddingService->>EmbeddingService: get_embedding(query_text)
    EmbeddingService->>WeaviateDB: search_similar_contributors(query_embedding)
    WeaviateDB-->>EmbeddingService: List of similar profiles
    EmbeddingService-->>User: List of similar profiles
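
The search flow in the second diagram can be sketched the same way: embed the query, then rank stored profiles by vector similarity. This is a brute-force illustration under stated assumptions; `toy_embedding`, the four-word vocabulary, and the `PROFILES` dictionary are hypothetical, and a real deployment would delegate the ranking to the vector database.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_embedding(text: str, vocab=("python", "rust", "react", "docker")) -> list[float]:
    # Hypothetical bag-of-words embedding over a tiny fixed vocabulary.
    words = text.lower().split()
    return [float(words.count(term)) for term in vocab]


# Hypothetical stored profile summaries.
PROFILES = {
    "alice": "python docker python",
    "bob": "react",
    "carol": "rust docker",
}


def search_similar_profiles(query_text: str, limit: int = 2) -> list[tuple[str, float]]:
    query_vector = toy_embedding(query_text)
    scored = [
        (name, cosine_similarity(query_vector, toy_embedding(text)))
        for name, text in PROFILES.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


results = search_similar_profiles("python docker")
```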

Assessment against linked issues

| Objective | Addressed | Explanation |
|---|---|---|
| Extract multi-dimensional data from GitHub profiles, repositories, and contributions (#67) | | Profile building integrates GitHub data and processes it for embedding and storage. |
| Analyze expertise levels based on actual code contributions (#67) | | Summarization prompt and profile processing emphasize recent PRs and technical skills. |
| Create semantic chunks for searchable user expertise (#67) | | Embedding vectors and LLM-generated summaries enable semantic chunking and search. |
| Enable smart matching via semantic search (#67, #77) | | Added vector similarity and keyword search methods in Weaviate integration and embedding service. |
| Integrate embedding generation and hybrid database architecture in agent workflows (#77) | | Embedding service is integrated into profiling; Supabase vector DB service removed; Weaviate used. |

Assessment against linked issues: Out-of-scope changes

No out-of-scope changes detected.

Suggested labels

enhancement

Poem

In the warren of code, a new path appears,
With vectors and summaries—let’s give three cheers!
Profiles now sparkle with embeddings so bright,
Weaviate searches bring matches to light.
Supabase hops off, as LLMs take the stage—
Oh, what a leap for our developer page!
🐇✨


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4ff9678 and 897238b.

📒 Files selected for processing (2)
  • backend/app/database/weaviate/operations.py (9 hunks)
  • backend/app/services/embedding_service/service.py (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: smokeyScraper
PR: AOSSIE-Org/Devr.AI#87
File: tests/test_supabase.py:1-3
Timestamp: 2025-06-28T23:15:13.374Z
Learning: In the Devr.AI project, smokeyScraper prefers to defer test updates and fixes (like missing imports after module reorganization) to separate PRs rather than expanding the scope of module update/chore PRs to include comprehensive test refactoring.
Learnt from: smokeyScraper
PR: AOSSIE-Org/Devr.AI#90
File: backend/app/agents/devrel/nodes/react_supervisor.py:97-101
Timestamp: 2025-07-05T04:33:39.840Z
Learning: In the Devr.AI project, smokeyScraper prefers to defer code deduplication refactoring (like extracting duplicate functions to shared utilities) until there are more common functionalities present among tools/workflow. With only two files using the same function, they consider it not a problem currently and prefer to "align later in a more better way" once more patterns emerge.
Learnt from: smokeyScraper
PR: AOSSIE-Org/Devr.AI#85
File: tests/test_supabase.py:1-3
Timestamp: 2025-06-28T14:45:55.244Z
Learning: In the Devr.AI project, smokeyScraper prefers to defer comprehensive test refactoring to separate PRs/efforts when doing major backend restructuring, rather than expanding the scope of the current refactoring PR to include test updates.
Learnt from: smokeyScraper
PR: AOSSIE-Org/Devr.AI#85
File: backend/app/services/auth/management.py:32-33
Timestamp: 2025-06-28T14:44:36.819Z
Learning: In the Devr.AI project, smokeyScraper prefers using machine timezone (IST) for datetime operations during development and testing for easier debugging, with plans to switch to UTC for deployment later.
🔇 Additional comments (16)
backend/app/services/embedding_service/service.py (7)

5-12: LGTM! Clean imports organization.

The new imports are well-organized and necessary for the enhanced functionality. The addition of Pydantic BaseModel, Langchain components, and Weaviate models aligns perfectly with the new profile summarization and semantic search capabilities.


23-27: LGTM! Well-structured data model.

The ProfileSummaryResult class provides a clean structure for encapsulating summarization results with proper typing and optional embedding field.


29-38: LGTM! Updated class documentation and initialization.

The class docstring correctly reflects the new Weaviate integration focus, and the initialization properly handles the new LLM instance variable.


54-68: LGTM! Proper lazy loading implementation for LLM.

The LLM property follows the same lazy loading pattern as the embedding model, with appropriate error handling and logging. The use of settings for configuration values is good practice.
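
For reference, the lazy loading pattern described here generally looks like the sketch below. The class name `LazyService` and the injected `loader` factory are hypothetical; the actual service builds its LLM from settings rather than from a passed-in callable.

```python
import logging

logger = logging.getLogger(__name__)


class LazyService:
    def __init__(self, loader):
        self._loader = loader  # hypothetical factory that builds the LLM client
        self._llm = None       # nothing is created until first access

    @property
    def llm(self):
        if self._llm is None:
            try:
                logger.info("Initializing LLM")
                self._llm = self._loader()
            except Exception:
                logger.exception("Failed to initialize LLM")
                raise
        return self._llm


calls = []


def fake_loader():
    calls.append(1)
    return object()


service = LazyService(fake_loader)
first = service.llm   # triggers the loader
second = service.llm  # reuses the cached instance
```

The expensive construction runs at most once per instance, and a failed load leaves `_llm` as `None` so a later access can retry.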


111-157: LGTM! Comprehensive profile summarization implementation.

The method properly extracts and formats profile data for LLM processing, handles empty/null values gracefully, and includes appropriate logging throughout. The token estimation approach is reasonable for monitoring purposes.
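
A minimal sketch of what graceful handling of empty values and a rough token estimate might look like. The field names (`bio`, `languages`, `topics`), the "Not specified" placeholder, and the four-characters-per-token heuristic are assumptions for illustration, not taken from the PR.

```python
def format_profile_for_prompt(profile: dict) -> str:
    def join(values) -> str:
        # Drop None/empty entries; fall back to a placeholder if nothing is left.
        return ", ".join(v for v in (values or []) if v) or "Not specified"

    lines = [
        f"Username: {profile.get('github_username') or 'Unknown'}",
        f"Bio: {profile.get('bio') or 'Not specified'}",
        f"Languages: {join(profile.get('languages'))}",
        f"Topics: {join(profile.get('topics'))}",
    ]
    return "\n".join(lines)


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumed): about 4 characters per token for English text.
    return max(1, len(text) // 4)


prompt_input = format_profile_for_prompt(
    {"github_username": "octocat", "bio": None, "languages": ["Python", None], "topics": []}
)
token_estimate = estimate_tokens(prompt_input)
```

An estimate like this is only suitable for logging and monitoring; exact counts would need the tokenizer of the model in use.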


178-203: LGTM! Well-implemented semantic search method.

The method properly generates query embeddings, integrates with Weaviate operations, and includes comprehensive logging. The import placement inside the method avoids circular dependency issues.


218-226: LGTM! Enhanced cache clearing with LLM support.

The cache clearing method now properly handles both the embedding model and LLM instances, with appropriate garbage collection and CUDA cache clearing for memory management.

backend/app/database/weaviate/operations.py (9)

3-9: LGTM! Proper imports for new functionality.

The additional imports for Filter and enhanced typing support the new search capabilities effectively.


30-30: LGTM! Fixed query filter parameter.

The correction from where to filters resolves the parameter naming issue and aligns with the Weaviate client API.


47-70: LGTM! Proper embedding vector integration.

The create_user_profile method now correctly accepts and passes the embedding vector to Weaviate's data operations, maintaining the same error handling pattern.


72-95: LGTM! Consistent embedding vector support in updates.

The update method follows the same pattern as create, properly handling the embedding vector parameter.


97-113: LGTM! Enhanced upsert with embedding vector support.

The upsert method correctly passes the embedding vector to both create and update operations, maintaining consistency across the API.
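
The create/update/upsert relationship can be sketched with an in-memory stand-in. `InMemoryProfiles` is hypothetical and keyed on an assumed `user_id` field; the real methods are async and write to Weaviate.

```python
class InMemoryProfiles:
    """Hypothetical stand-in for the Weaviate-backed operations class."""

    def __init__(self):
        self.rows: dict[str, dict] = {}

    def create_user_profile(self, profile: dict, embedding_vector: list[float]) -> bool:
        self.rows[profile["user_id"]] = {
            "properties": dict(profile),
            "vector": list(embedding_vector),
        }
        return True

    def update_user_profile(self, profile: dict, embedding_vector: list[float]) -> bool:
        if profile["user_id"] not in self.rows:
            return False
        self.rows[profile["user_id"]] = {
            "properties": dict(profile),
            "vector": list(embedding_vector),
        }
        return True

    def upsert_user_profile(self, profile: dict, embedding_vector: list[float]) -> bool:
        # The vector is forwarded on both branches so it never goes stale.
        if profile["user_id"] in self.rows:
            return self.update_user_profile(profile, embedding_vector)
        return self.create_user_profile(profile, embedding_vector)


db = InMemoryProfiles()
created = db.upsert_user_profile({"user_id": "u1", "bio": "v1"}, [0.1, 0.2])
updated = db.upsert_user_profile({"user_id": "u1", "bio": "v2"}, [0.3, 0.4])
```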


115-164: LGTM! Comprehensive vector similarity search implementation.

The method implements proper vector similarity search with:

  • Appropriate logging and error handling
  • Proper distance-to-similarity conversion
  • Comprehensive result formatting
  • Graceful handling of malformed results
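
As an illustration of the distance-to-similarity conversion and malformed-result handling listed above, here is a sketch that assumes cosine distance (where similarity is `1 - distance`). The raw result shape and the `similarity_score` key are hypothetical, not the client's actual return type.

```python
def to_similarity(distance: float) -> float:
    # Assumes cosine distance, which ranges from 0 (identical) to 2 (opposite).
    return 1.0 - distance


def format_results(raw_results: list[dict]) -> list[dict]:
    formatted = []
    for item in raw_results:
        try:
            distance = float(item["distance"])
            formatted.append({
                "github_username": item["properties"]["github_username"],
                "similarity_score": round(to_similarity(distance), 4),
            })
        except (KeyError, TypeError, ValueError):
            # Skip malformed entries rather than failing the whole search.
            continue
    return formatted


raw = [
    {"properties": {"github_username": "alice"}, "distance": 0.12},
    {"properties": {}, "distance": 0.30},                            # missing username
    {"properties": {"github_username": "bob"}, "distance": None},    # missing distance
]
formatted = format_results(raw)
```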

166-214: LGTM! Well-implemented keyword search functionality.

The BM25 keyword search method provides a complementary search capability with proper query construction and result processing.
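
For context on what BM25 ranking computes, here is a small pure-Python scoring sketch. It is not Weaviate's implementation; whitespace tokenization and the `k1=1.2`, `b=0.75` values are common textbook choices assumed for illustration, and it expects a non-empty document list.

```python
import math
from collections import Counter


def bm25_scores(query: str, documents: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "python fastapi backend",
    "react typescript frontend",
    "python data pipelines python",
]
scores = bm25_scores("python backend", docs)
```

The first document wins because it matches both query terms, including the rarer one; the second shares no terms and scores zero.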


216-216: Good documentation for future enhancement.

The TODO comment appropriately notes the limitation of Weaviate's built-in hybrid search with custom vectors, providing context for future development.
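
One possible direction for that future enhancement is to fuse the vector and keyword result lists on the client side. The sketch below uses reciprocal rank fusion, which is a swapped-in technique chosen for illustration and not something the PR implements; the hit lists and the constant `k=60` are assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, name in enumerate(ranking, start=1):
            # Items near the top of any list contribute more to the fused score.
            fused[name] = fused.get(name, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical outputs of the vector search and the keyword search.
vector_hits = ["alice", "carol", "bob"]
keyword_hits = ["carol", "dave", "alice"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Rank-based fusion avoids having to put cosine similarities and BM25 scores on a common scale, which is the main difficulty with weighting raw scores directly.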


281-306: min_distance default values now appear consistent.

A past review comment flagged inconsistent min_distance defaults. The convenience function search_similar_contributors uses min_distance=0.7, and the class method also defaults to min_distance=0.7, so the two now match.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (3)
backend/app/services/embedding_service/service.py (2)

190-191: Consider moving import to module level.

While importing inside the method avoids circular imports, it's better practice to handle this at the module level with proper import organization.

Move the import to the top of the file:

+from app.database.weaviate.operations import search_similar_contributors
 from app.models.database.weaviate import WeaviateUserProfile

Then remove the import from lines 190-191.


222-223: Move gc import to module level.

Standard practice is to import modules at the top of the file.

Move import gc to the top of the file with other imports.

backend/app/database/weaviate/operations.py (1)

216-217: Valid TODO: Document the limitation clearly.

The comment correctly identifies that Weaviate's built-in hybrid search doesn't support custom vectors. Consider creating an issue to track this enhancement.

Would you like me to create an issue to track the implementation of a custom hybrid search solution?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8700e48 and 4ff9678.

📒 Files selected for processing (7)
  • backend/app/database/weaviate/__init__.py (1 hunks)
  • backend/app/database/weaviate/operations.py (9 hunks)
  • backend/app/database/weaviate/scripts/create_schemas.py (1 hunks)
  • backend/app/services/embedding_service/profile_summarization/prompts/summarization_prompt.py (1 hunks)
  • backend/app/services/embedding_service/service.py (4 hunks)
  • backend/app/services/user/profiling.py (3 hunks)
  • backend/app/services/vector_db/service.py (0 hunks)
💤 Files with no reviewable changes (1)
  • backend/app/services/vector_db/service.py
🧬 Code Graph Analysis (3)
backend/app/database/weaviate/__init__.py (2)
backend/app/database/weaviate/operations.py (8)
  • store_user_profile (281-286)
  • search_similar_contributors (115-164)
  • search_similar_contributors (288-293)
  • search_contributors_by_keywords (166-214)
  • search_contributors_by_keywords (295-300)
  • get_contributor_profile (218-260)
  • get_contributor_profile (302-305)
  • WeaviateUserOperations (13-278)
backend/app/database/weaviate/client.py (1)
  • get_weaviate_client (19-32)
backend/app/services/user/profiling.py (2)
backend/app/services/embedding_service/service.py (3)
  • EmbeddingService (29-226)
  • process_user_profile (159-176)
  • clear_cache (213-226)
backend/app/database/weaviate/operations.py (1)
  • store_user_profile (281-286)
backend/app/services/embedding_service/service.py (1)
backend/app/models/database/weaviate.py (1)
  • WeaviateUserProfile (32-129)
🔇 Additional comments (9)
backend/app/database/weaviate/scripts/create_schemas.py (1)

9-9: LGTM!

Disabling automatic vectorization is the correct approach since embeddings are generated externally by the embedding service.

backend/app/database/weaviate/__init__.py (1)

1-18: Well-structured package initialization.

The exports provide a clean public API for the Weaviate operations module.

backend/app/services/embedding_service/profile_summarization/prompts/summarization_prompt.py (1)

1-24: Well-crafted prompt template for profile summarization.

The prompt provides clear instructions for generating keyword-rich summaries optimized for semantic search and contributor matching. The structure and guidelines are comprehensive.

backend/app/services/user/profiling.py (2)

9-9: Correct import for embedding service integration.


303-326: Excellent error handling and resource management.

The implementation correctly:

  • Checks for None profile before processing
  • Handles embedding service exceptions
  • Clears the embedding service cache in the finally block to prevent memory leaks
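
The three points above correspond to a guard clause plus try/except/finally. A minimal sketch follows; `TrackingService`, `store_profile_with_embedding`, and the dictionary store are hypothetical names used only to show the control flow.

```python
class TrackingService:
    """Hypothetical embedding service that records whether its cache was cleared."""

    def __init__(self):
        self.cache_cleared = False

    def process_user_profile(self, profile: dict):
        if not profile.get("github_username"):
            raise ValueError("missing username")
        return profile, [0.1, 0.2]

    def clear_cache(self):
        self.cache_cleared = True


def store_profile_with_embedding(profile, service, store: dict) -> bool:
    if profile is None:
        return False  # nothing to process, so no model was touched
    try:
        processed, vector = service.process_user_profile(profile)
        store[processed["github_username"]] = vector
        return True
    except Exception:
        return False
    finally:
        # Runs on success and on failure, so model memory is always released.
        service.clear_cache()


store: dict = {}
good_service = TrackingService()
bad_service = TrackingService()
good_result = store_profile_with_embedding({"github_username": "octocat"}, good_service, store)
bad_result = store_profile_with_embedding({"github_username": ""}, bad_service, store)
none_result = store_profile_with_embedding(None, TrackingService(), store)
```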
backend/app/services/embedding_service/service.py (1)

54-68: Good choice of temperature for consistent summarization.

Using temperature=0.3 for the LLM keeps profile summaries relatively consistent from run to run (lower temperature reduces sampling variance, though output is not strictly deterministic), which is appropriate for this use case.
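
To show why a lower temperature gives more consistent output, here is a small softmax sketch. The logits are made up, and real LLM sampling involves more than this single step; the point is only that dividing by a temperature below 1 concentrates probability on the top choice.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
cool = softmax_with_temperature(logits, 0.3)
neutral = softmax_with_temperature(logits, 1.0)
```

With these numbers the top candidate's probability is higher at temperature 0.3 than at 1.0, so repeated sampling picks it far more often.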

backend/app/database/weaviate/operations.py (3)

30-30: Good fix: Corrected parameter name to match Weaviate API.

The change from where to filters aligns with Weaviate's query API requirements.


166-215: Well-implemented keyword search functionality.

The BM25 search implementation is correct with proper error handling and result formatting.


47-47: All embedding_vector callers are up to date—no further changes needed.

I’ve verified that:

  • There are no external calls to create_user_profile, update_user_profile, or upsert_user_profile outside of operations.py itself.
  • The only convenience entry point, store_user_profile, is called from backend/app/services/user/profiling.py, and that call already passes the new embedding_vector argument.
  • The test helper update_user_profile in tests/test_weaviate.py is a locally defined function and does not reference the updated method signature.

No action is required on existing callers.

@chandansgowda chandansgowda merged commit 6c7dc98 into AOSSIE-Org:main Jul 12, 2025
1 check passed

Development

Successfully merging this pull request may close these issues.

  • FEATURE REQUEST: Integrate Hybrid Database Service
  • FEATURE REQUEST: Efficient chunking strategy to create User Profile

2 participants