bug fix issue #95 #215

Merged 2 commits on Nov 7, 2024

benx13
Contributor

@benx13 benx13 commented Nov 6, 2024

LightRAG Bug Fix Report

Issue

A TypeError occurred in hybrid query mode when accessing the content of text units whose data was None. The error was raised in the _find_most_related_text_unit_from_entities function while text units were being truncated by token size.

Root Cause

The issue stemmed from insufficient null checks when processing text units in the knowledge graph. Specifically:

  1. Text unit data could be None when retrieved from text_chunks_db
  2. The data dictionary could be missing the 'content' field
  3. Invalid entries were not filtered out before token size truncation

The key problematic area was:

591:597:LightRAG/lightrag/operate.py

    if any([v is None for v in all_text_units_lookup.values()]):
        logger.warning("Text chunks are missing, maybe the storage is damaged")
    all_text_units = [
        {"id": k, **v} for k, v in all_text_units_lookup.items() if v is not None
    ]
    all_text_units = sorted(
        all_text_units, key=lambda x: (x["order"], -x["relation_counts"])
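
The failure mode can be reproduced in isolation. The sketch below is not taken from the repository: it assumes each lookup entry is a wrapper dict with `data`, `order` and `relation_counts` keys, which is what the key functions in the excerpts suggest. Under that assumption the `v is not None` filter never removes anything, because the wrapper itself is always a dict even when its `data` is None.

```python
# Hypothetical sample data. The shape of each entry is an assumption
# inferred from the key functions used in the excerpts.
all_text_units_lookup = {
    "chunk-1": {"data": {"content": "some text"}, "order": 0, "relation_counts": 2},
    "chunk-2": {"data": None, "order": 1, "relation_counts": 0},  # chunk missing from storage
}

# Same filter as the excerpt: it inspects the wrapper dict, not wrapper["data"].
all_text_units = [
    {"id": k, **v} for k, v in all_text_units_lookup.items() if v is not None
]


def content_of(unit):
    """Mirror of the truncation key: x["data"]["content"]."""
    return unit["data"]["content"]


try:
    contents = [content_of(u) for u in all_text_units]
    error = None
except TypeError as exc:
    # Subscripting None raises TypeError, which matches the reported symptom.
    error = exc
```

Both entries survive the filter, and the key function then fails on the entry whose `data` is None.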

Solution

Added comprehensive null checks and data validation throughout the text unit processing pipeline:

  1. Added null check for node data and source_id field:

571:575:LightRAG/lightrag/operate.py

        for k, v in zip(all_one_hop_nodes, all_one_hop_nodes_data)
        if v is not None
    }
    all_text_units_lookup = {}
    for index, (this_text_units, this_edges) in enumerate(zip(text_units, edges)):
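
As a rough illustration of this step, the sketch below builds a node-to-chunk-ids lookup while skipping nodes that are None or lack a `source_id`. The function name `build_one_hop_lookup`, the separator `ASSUMED_SEP`, and the sample data are all assumptions made for the example and are not taken from the repository.

```python
ASSUMED_SEP = "<SEP>"  # assumption: the separator joining chunk ids in source_id


def build_one_hop_lookup(node_names, node_data):
    """Map each node name to its set of chunk ids, skipping unusable nodes.

    Hypothetical helper: a node is skipped when its data is None or when
    it has no non-empty "source_id" field.
    """
    return {
        name: set(data["source_id"].split(ASSUMED_SEP))
        for name, data in zip(node_names, node_data)
        if data is not None and data.get("source_id")
    }


lookup = build_one_hop_lookup(
    ["A", "B", "C"],
    [
        {"source_id": "chunk-1<SEP>chunk-2"},  # valid node
        None,                                  # node missing from the graph
        {"description": "no source"},          # node without source_id
    ],
)
```

Only node "A" ends up in the lookup; the other two are dropped instead of raising.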

  2. Added content validation when getting chunk data:

591:597:LightRAG/lightrag/operate.py

    if any([v is None for v in all_text_units_lookup.values()]):
        logger.warning("Text chunks are missing, maybe the storage is damaged")
    all_text_units = [
        {"id": k, **v} for k, v in all_text_units_lookup.items() if v is not None
    ]
    all_text_units = sorted(
        all_text_units, key=lambda x: (x["order"], -x["relation_counts"])

  3. Added comprehensive filtering for None values:

599:604:LightRAG/lightrag/operate.py

    all_text_units = truncate_list_by_token_size(
        all_text_units,
        key=lambda x: x["data"]["content"],
        max_token_size=query_param.max_token_for_text_unit,
    )
    all_text_units: list[TextChunkSchema] = [t["data"] for t in all_text_units]
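
A minimal sketch of the filtering described in this step, assuming the same wrapper-dict shape as above. `filter_valid_text_units` is a hypothetical helper, and `truncate_by_word_budget` is a simplified stand-in for `truncate_list_by_token_size` that counts whitespace-separated words rather than real tokens. Neither is the merged implementation.

```python
def filter_valid_text_units(lookup):
    """Keep only entries whose data is a dict with non-None content."""
    valid = []
    for chunk_id, entry in lookup.items():
        if entry is None:
            continue
        data = entry.get("data")
        if data is None or data.get("content") is None:
            continue
        valid.append({"id": chunk_id, **entry})
    return valid


def truncate_by_word_budget(items, key, max_words):
    """Stand-in truncation: keep leading items while the word total fits."""
    total = 0
    for i, item in enumerate(items):
        total += len(key(item).split())
        if total > max_words:
            return items[:i]
    return items


# Hypothetical sample data covering the three cases listed under Root Cause.
sample_lookup = {
    "c1": {"data": {"content": "alpha beta"}, "order": 1, "relation_counts": 1},
    "c2": {"data": None, "order": 0, "relation_counts": 5},          # data is None
    "c3": {"data": {"tokens": 3}, "order": 0, "relation_counts": 4},  # no content field
    "c4": {"data": {"content": "gamma delta epsilon"}, "order": 0, "relation_counts": 3},
}

units = sorted(
    filter_valid_text_units(sample_lookup),
    key=lambda x: (x["order"], -x["relation_counts"]),
)
units = truncate_by_word_budget(
    units, key=lambda x: x["data"]["content"], max_words=4
)
chunks = [u["data"] for u in units]
```

Filtering before sorting and truncation means the key function only ever sees entries with usable content, so the later steps need no extra guards.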

The changes are backward compatible and require no modifications to the existing API or data structures.

Contributor Author

benx13 commented Nov 6, 2024

Fixes #95

@LarFii LarFii merged commit 7c5080e into HKUDS:main Nov 7, 2024
Collaborator

LarFii commented Nov 7, 2024

Thanks for your contribution!
