Filtering out results based on a threshold #30

is-ahmed · 2024-07-12T21:47:10Z

is-ahmed
Jul 12, 2024

Hi, I'm trying to take the results I get from a retriever and filter them by score, that is, keep only the results whose score is above a certain threshold. However, I'm not sure what the right threshold should be, since the scores seem to vary depending on the query and the corpus. Does anyone have experience doing this?

EDIT: Just to add on, I'm considering normalizing the scores, but there are a variety of normalization techniques, and I don't know which would be best in this case.
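
For reference, here is a minimal sketch of three common ways to normalize one query's scores before thresholding. The function names are made up for illustration and are not part of bm25s, and the example scores are the ones reported later in this thread. Which option suits a given corpus is an open question; the sketch only shows what each one does.

```python
import math


def min_max(scores):
    """Rescale scores to [0, 1] using the lowest and highest score of this query."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores are identical, so there is nothing to separate.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def ratio_to_top(scores):
    """Express each score as a fraction of the best score of this query."""
    top = max(scores)
    if top <= 0:
        return [0.0 for _ in scores]
    return [s / top for s in scores]


def softmax(scores):
    """Turn scores into a distribution that sums to 1 (numerically stable form)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Scores reported later in this thread for the query "what is my name?"
scores = [1.9009349, 0.9240056, 0.0, 0.0, 0.0]
print(ratio_to_top(scores))
print(min_max(scores))
print(softmax(scores))
```

Note that all three are computed per query, so a threshold on the normalized value is relative to that query's best hit. None of them says whether the best hit itself is relevant.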

Answered by is-ahmed

Jul 16, 2024

I managed to resolve this specific case by adding to the stopword list that is used. I'm not sure whether this will work for other test cases, but I'll close this for now.


xhluca · 2024-07-13T20:48:25Z

xhluca
Jul 13, 2024
Maintainer

Don't have personal experience, but curious to hear what others think!


is-ahmed · 2024-07-15T15:01:15Z

is-ahmed
Jul 15, 2024
Author

To add to what I said before: right now, I'm getting the following scores for one query and corpus:

[1.9009349, 0.9240056, 0. , 0. , 0. ]

However, only the document that corresponds to the first score is semantically relevant to my query, and I'd like to filter out the second document if possible.

Here is the code snippet that I'm referring to:


import bm25s


tokenized_corpus = [['ai', 'the', 'paper', 'titled', 'learning', 'optimal', 'representations', 'with', 'the', 'decodable', 'information', 'bottleneck', 'introduces', 'a', 'novel', 'approach', 'to', 'finding', 'optimal', 'representations', 'for', 'supervised', 'learning', 'traditionally', 'the', 'information', 'bottleneck', 'method', 'has', 'been', 'used', 'to', 'compress', 'inputs', 'while', 'retaining', 'relevant', 'information', 'about', 'targets', 'in', 'a', 'decoderagnostic', 'manner', 'however', 'the', 'goal', 'in', 'machine', 'learning', 'is', 'generalization', 'which', 'is', 'closely', 'tied', 'to', 'the', 'specific', 'predictive', 'family', 'or', 'decoder', 'such', 'as', 'a', 'neural', 'network\n\nthe', 'authors', 'propose', 'the', 'decodable', 'information', 'bottleneck', 'dib', 'which', 'considers', 'both', 'information', 'retention', 'and', 'compression', 'from', 'the', 'perspective', 'of', 'the', 'desired', 'predictive', 'family', 'this', 'method', 'aims', 'to', 'create', 'representations', 'that', 'are', 'optimal', 'in', 'terms', 'of', 'expected', 'test', 'performance', 'and', 'can', 'be', 'estimated', 'with', 'guarantees\n\nempirical', 'results', 'show', 'that', 'the', 'dib', 'framework', 'can', 'enforce', 'a', 'small', 'generalization', 'gap', 'on', 'downstream', 'classifiers', 'and', 'predict', 'the', 'generalization', 'ability', 'of', 'neural', 'networks', 'this', 'means', 'that', 'the', 'representations', 'learned', 'using', 'dib', 'are', 'effective', 'for', 'both', 'training', 'and', 'unseen', 'test', 'data', 'enhancing', 'the', 'models', 'overall', 'generalization', 'ability\n\nin', 'summary', 'the', 'dib', 'method', 'improves', 'upon', 'traditional', 'information', 'bottleneck', 'approaches', 'by', 'focusing', 'on', 'the', 'specific', 'requirements', 'of', 'the', 'predictive', 'family', 'leading', 'to', 'better', 'generalization', 'and', 'potentially', 'reducing', 'the', 'computational', 'effort', 'required', 'for', 'hyperparameter', 'tuning', 
'in', 'deep', 'learning', 'models'], 
                    ['user', 'can', 'you', 'find', 'me', 'a', 'paper', 'on', 'optimal', 'representations', 'and', 'give', 'me', 'a', 'summary', 'of', 'it'], 
                    ['ai', 'a', 'neural', 'net', 'or', 'neural', 'network', 'is', 'a', 'computational', 'model', 'inspired', 'by', 'the', 'way', 'biological', 'neural', 'networks', 'in', 'the', 'human', 'brain', 'process', 'information', 'it', 'consists', 'of', 'layers', 'of', 'interconnected', 'nodes', 'or', 'neurons', 'where', 'each', 'connection', 'has', 'an', 'associated', 'weight', 'these', 'networks', 'are', 'designed', 'to', 'recognize', 'patterns', 'and', 'learn', 'from', 'data', 'through', 'a', 'process', 'called', 'training\n\nin', 'the', 'context', 'of', 'supervised', 'learning', 'neural', 'networks', 'aim', 'to', 'find', 'optimal', 'representations', 'of', 'input', 'data', 'that', 'can', 'accurately', 'predict', 'target', 'outputs', 'for', 'example', 'the', 'paper', 'learning', 'optimal', 'representations', 'with', 'the', 'decodable', 'information', 'bottleneck', 'discusses', 'a', 'method', 'called', 'the', 'decodable', 'information', 'bottleneck', 'dib', 'which', 'focuses', 'on', 'creating', 'representations', 'that', 'are', 'optimal', 'for', 'a', 'specific', 'predictive', 'family', 'or', 'decoder', 'such', 'as', 'a', 'neural', 'network', 'this', 'method', 'aims', 'to', 'improve', 'the', 'generalization', 'ability', 'of', 'neural', 'networks', 'ensuring', 'that', 'they', 'perform', 'well', 'on', 'unseen', 'data', 'by', 'minimizing', 'the', 'expected', 'test', 'loss\n\nthus', 'neural', 'networks', 'are', 'powerful', 'tools', 'in', 'machine', 'learning', 'that', 'can', 'learn', 'complex', 'patterns', 'and', 'make', 'predictions', 'with', 'ongoing', 'research', 'like', 'the', 'dib', 'framework', 'enhancing', 'their', 'ability', 'to', 'generalize', 'from', 'training', 'data', 'to', 'realworld', 'applications'], 
                    ['user', 'can', 'you', 'explain', 'what', 'a', 'neural', 'net', 'is'],
                    ['user', 'hi', 'my', 'name', 'is', 'ismail']
                ]

# The corpus above is already tokenized (a list of token lists), so it is used as-is
corpus_tokens = tokenized_corpus

# Create the BM25 model and index the corpus
retriever = bm25s.BM25()
retriever.index(corpus_tokens)

# Query the corpus
query = "what is my name?"
query_tokens = bm25s.tokenize(query)

# Get top-k results as a tuple of (doc ids, scores). Both are arrays of shape (n_queries, k)
results, scores = retriever.retrieve(query_tokens, k=len(corpus_tokens))
# results are [4, 3, 2, 1, 0]
# scores are [1.9009349, 0.9240056, 0.       , 0.       , 0.       ]
for i in range(results.shape[1]):
    doc, score = results[0, i], scores[0, i]
    # No corpus was passed to the retriever, so doc is the document's index
    print(f"Rank {i+1} (score: {score:.2f}): doc {doc}")
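
One way to drop the second hit is a relative threshold: keep a document only if its score is at least some fraction of the top score. This is a minimal sketch, not something bm25s provides. The helper name and the `min_ratio` value of 0.5 are assumptions chosen for this example, and the inputs are the ids and scores reported in the comments above.

```python
def filter_by_relative_threshold(doc_ids, scores, min_ratio=0.5):
    """Keep (doc_id, score) pairs whose score is at least min_ratio * top score."""
    top = max(scores, default=0.0)
    if top <= 0:
        # No document matched any query term.
        return []
    return [(d, s) for d, s in zip(doc_ids, scores) if s / top >= min_ratio]


# Ids and scores as reported for the query "what is my name?"
doc_ids = [4, 3, 2, 1, 0]
scores = [1.9009349, 0.9240056, 0.0, 0.0, 0.0]

kept = filter_by_relative_threshold(doc_ids, scores, min_ratio=0.5)
print(kept)  # [(4, 1.9009349)]
```

The second score is about 0.49 of the first, so 0.5 happens to separate them here. That margin is thin, and a ratio tuned on one query may not carry over to other queries or corpora.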


is-ahmed · 2024-07-16T14:10:02Z

is-ahmed
Jul 16, 2024
Author

I managed to resolve this specific case by adding to the stopword list that is used. I'm not sure whether this will work for other test cases, but I'll close this for now.
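
As a rough illustration of why extra stopwords can help here: in the snippet above, the only non-zero-scoring document besides the relevant one shares the word "what" with the query. The sketch below is plain Python with no bm25s calls. `BASE_STOPWORDS` is an abbreviated stand-in rather than the library's real list, and `EXTRA_STOPWORDS` is a guess, since the post does not say which words were actually added.

```python
# Assumption: an abbreviated stand-in for a default English stopword list.
BASE_STOPWORDS = {"a", "an", "the", "is", "of", "to", "and"}
# Hypothetical additions; the words actually added are not given in the thread.
EXTRA_STOPWORDS = {"what", "can", "you"}


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stopword set, preserving order."""
    return [t for t in tokens if t not in stopwords]


stopwords = BASE_STOPWORDS | EXTRA_STOPWORDS

query_tokens = ["what", "is", "my", "name"]
doc_3 = ["user", "can", "you", "explain", "what", "a", "neural", "net", "is"]
doc_4 = ["user", "hi", "my", "name", "is", "ismail"]

filtered_query = remove_stopwords(query_tokens, stopwords)
print(filtered_query)  # ['my', 'name']

# With "what" removed, doc 3 no longer shares any term with the query.
print(set(filtered_query) & set(remove_stopwords(doc_3, stopwords)))  # set()
print(sorted(set(filtered_query) & set(remove_stopwords(doc_4, stopwords))))  # ['my', 'name']
```

In bm25s itself, `bm25s.tokenize` takes a `stopwords` argument. The corpus in the snippet above was tokenized by hand, so the same list would need to be applied to the corpus tokens as well as to the query.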
