A more robust approach for result to json. #302

Luobots · 2024-11-19T06:05:40Z

When using LightRAG, my model generates the text below for keyword extraction. The text contains two "{" characters, so "{" + result.split("{")[1].split("}")[0] + "}" fails, whereas "{" + result.split("{")[-1].split("}")[0] + "}" works and still achieves the original expectation.

Keyword Extraction

To extract high-level and low-level keywords from the given query, we will use Natural Language Processing (NLP) techniques.

import json
import re

def extract_keywords(query):
    # Convert query to lowercase
    query = query.lower()

    # Tokenize the query
    tokens = re.findall(r'\b\w+\b', query)

    # Identify high-level keywords
    high_level_keywords = []
    low_level_keywords = []
    stop_words = ['the', 'and', 'a', 'an', 'in', 'on', 'at', 'by', 'with']

    for token in tokens:
        if token not in stop_words:
            if len(token.split()) > 1:
                high_level_keywords.append(token)
            else:
                low_level_keywords.append(token)

    # Remove duplicates from high-level and low-level keywords
    high_level_keywords = list(set(high_level_keywords))
    low_level_keywords = list(set(low_level_keywords))

    # Return the keywords in JSON format
    return {
        "high_level_keywords": high_level_keywords,
        "low_level_keywords": low_level_keywords
    }

query = "How did urbanization influence the average household size in Bhubaneswar?"
result = extract_keywords(query)

print(json.dumps(result, indent=4))

Output:

{
    "high_level_keywords": ["Urbanization", "Average household size"],
    "low_level_keywords": ["Influence", "Bhubaneswar"]
}

This script first tokenizes the query into individual words and then identifies high-level and low-level keywords. High-level keywords are phrases with multiple words, while low-level keywords are single words. The stop_words list is used to exclude common words like "the", "and", etc. that do not add much value to the query. The output is in JSON format, with two keys: high_level_keywords and low_level_keywords.
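To make the failure concrete, here is a minimal sketch run against a shortened stand-in for the generated text above. The `sample` string and the helper names `extract_first` and `extract_last` are hypothetical and exist only for illustration; the two split expressions are the ones quoted in this PR description.

```python
import json

# Shortened stand-in for the model output: a code snippet that contains "{"
# followed by the actual JSON answer.
sample = '''def extract_keywords(query):
    return {
        "high_level_keywords": high_level_keywords,
        "low_level_keywords": low_level_keywords
    }

Output:

{
    "high_level_keywords": ["Urbanization", "Average household size"],
    "low_level_keywords": ["Influence", "Bhubaneswar"]
}
'''

def extract_first(result):
    # Original expression: takes the text after the FIRST "{".
    return "{" + result.split("{")[1].split("}")[0] + "}"

def extract_last(result):
    # Expression from this PR: takes the text after the LAST "{".
    return "{" + result.split("{")[-1].split("}")[0] + "}"

# The first "{" belongs to the Python dict literal, whose values are bare
# identifiers, so the extracted string is not valid JSON.
try:
    json.loads(extract_first(sample))
    first_ok = True
except json.JSONDecodeError:
    first_ok = False

# The last "{" opens the real JSON answer, which parses cleanly.
parsed = json.loads(extract_last(sample))
```

Taking the last "{" helps when the answer is the final brace-delimited block, as it is here. It would presumably still fail if the JSON answer contained nested objects or were followed by more text containing "{".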

LarFii · 2024-11-19T08:27:37Z

Thanks. However, this keyword extraction approach can only extract words that are present in the query. It becomes challenging to extract keywords that represent concepts or ideas needed to answer the query but are not explicitly mentioned in it.

Luobots · 2024-11-19T08:31:21Z

> Thanks. However, this keyword extraction approach can only extract words that are present in the query. It becomes challenging to extract keywords that represent concepts or ideas needed to answer the query but are not explicitly mentioned in it.

Oh, this "approach" is just text generated by my LLM when I used it to extract keywords (See Here in Your Code) 😂. I just want to emphasize that the double "{" situation (See Here in Your Code) can be solved by my PR.
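For comparison, the double "{" situation could also be handled without string splitting. The sketch below is not what this PR implements; it is an alternative built on the standard library's `json.JSONDecoder.raw_decode`, and the function name `last_json_object` is hypothetical. It tries to decode at every "{" and keeps the last object that parses.

```python
import json

def last_json_object(text):
    """Return the last top-level JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    found = None
    pos = text.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at pos and returns
            # the value together with the index where parsing stopped.
            obj, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            # Not valid JSON at this brace; try the next one.
            pos = text.find("{", pos + 1)
            continue
        if isinstance(obj, dict):
            found = obj
        # Skip past the decoded object so nested braces are not revisited.
        pos = text.find("{", end)
    return found
```

Because decoding is delegated to the JSON parser, nested objects and trailing text are tolerated, at the cost of scanning the whole response.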

A more robust approach for result to json.

c9becdf

LarFii merged commit 14cd7ba into HKUDS:main Nov 19, 2024
1 check passed
