A more robust approach for result to json. #302

Merged (1 commit) on Nov 19, 2024

Luobots (Contributor) commented Nov 19, 2024

When I use LightRAG, my model generates the text below for keyword extraction. The text contains two "{" characters, so "{" + result.split("{")[1].split("}")[0] + "}" fails: index 1 picks up the dict literal inside the generated code rather than the JSON output. Using "{" + result.split("{")[-1].split("}")[0] + "}" works, and the original intent is still achieved.

Keyword Extraction

To extract high-level and low-level keywords from the given query, we will use Natural Language Processing (NLP) techniques.

import json
import re

def extract_keywords(query):
    # Convert query to lowercase
    query = query.lower()

    # Tokenize the query
    tokens = re.findall(r'\b\w+\b', query)

    # Identify high-level keywords
    high_level_keywords = []
    low_level_keywords = []
    stop_words = ['the', 'and', 'a', 'an', 'in', 'on', 'at', 'by', 'with']

    for token in tokens:
        if token not in stop_words:
            if len(token.split()) > 1:
                high_level_keywords.append(token)
            else:
                low_level_keywords.append(token)

    # Remove duplicates from high-level and low-level keywords
    high_level_keywords = list(set(high_level_keywords))
    low_level_keywords = list(set(low_level_keywords))

    # Return the keywords in JSON format
    return {
        "high_level_keywords": high_level_keywords,
        "low_level_keywords": low_level_keywords
    }

query = "How did urbanization influence the average household size in Bhubaneswar?"
result = extract_keywords(query)

print(json.dumps(result, indent=4))

Output:

{
    "high_level_keywords": ["Urbanization", "Average household size"],
    "low_level_keywords": ["Influence", "Bhubaneswar"]
}

This script first tokenizes the query into individual words and then identifies high-level and low-level keywords. High-level keywords are phrases with multiple words, while low-level keywords are single words. The stop_words list is used to exclude common words like "the", "and", etc. that do not add much value to the query. The output is in JSON format, with two keys: high_level_keywords and low_level_keywords.
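
To make the difference between the two indexing choices concrete, here is a minimal sketch. The `reply` string is a shortened, hypothetical stand-in for the model output above, and the helper name `extract_braced` is an assumption for illustration, not LightRAG code; only the two split expressions come from this PR.

```python
import json

# Hypothetical, shortened stand-in for the model reply: a Python dict
# literal inside the generated code comes first, the JSON answer comes last.
reply = '''def extract_keywords(query):
    return {
        "high_level_keywords": high_level_keywords,
        "low_level_keywords": low_level_keywords
    }

Output:

{
    "high_level_keywords": ["Urbanization", "Average household size"],
    "low_level_keywords": ["Influence", "Bhubaneswar"]
}
'''


def extract_braced(text: str, index: int) -> str:
    """Rebuild a brace-delimited snippet from the chosen segment after splitting on "{"."""
    return "{" + text.split("{")[index].split("}")[0] + "}"


first = extract_braced(reply, 1)   # the dict literal from the generated code
last = extract_braced(reply, -1)   # the JSON block under "Output:"

# The first candidate contains bare identifiers, so it is not valid JSON.
try:
    json.loads(first)
    first_ok = True
except json.JSONDecodeError:
    first_ok = False

keywords = json.loads(last)
print(first_ok)
print(keywords["high_level_keywords"])
```

Note that indexing with -1 still assumes the JSON object is the last brace-delimited block in the reply and contains no nested braces; a reply that breaks either assumption would need a different strategy.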

LarFii merged commit 14cd7ba into HKUDS:main on Nov 19, 2024
1 check passed

LarFii (Collaborator) commented Nov 19, 2024

Thanks. However, this keyword extraction approach can only extract words that are present in the query. It becomes challenging to extract keywords that represent concepts or ideas needed to answer the query but are not explicitly mentioned in it.

Luobots (Contributor, Author) commented Nov 19, 2024

> Thanks. However, this keyword extraction approach can only extract words that are present in the query. It becomes challenging to extract keywords that represent concepts or ideas needed to answer the query but are not explicitly mentioned in it.

Oh, this "approach" is just text generated by my LLM when it was used to extract keywords (See Here in Your Code) 😂. I just want to emphasize that the case where the output contains two "{" (See Here in Your Code) can be solved by my PR.
