There seems to be a bug in how query_idx is populated in the results returned by text_sim.search.
I would have expected one of the following two behaviors for the values of query_idx in the output of text_sim.search:
(Preferred) The output results are in the same order as the input queries, and each result also has query_idx correctly identifying the index of the query it is a result of (which will be equal to the index of the result in the output).
(Less preferred but still usable) The output results may be in a different order from the input queries, but each result has query_idx correctly identifying the index of the query it is a result of, so that results can be sorted by query_idx to match the order of the queries.
Correct values of query_idx are important for debugging if an index is created with store_data=False (desirable for large indexes).
However, in my use of the package, the results from text_sim.search seem to have fewer unique query_idx values than the number of input queries.
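A minimal sketch of how such a check could be automated, assuming only that each result object exposes a query_idx attribute as described above. FakeResult and query_idx_report are hypothetical names used for illustration and are not part of unisim; the toy data stands in for retrieval_results.results.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FakeResult:
    """Hypothetical stand-in for one entry of retrieval_results.results."""
    query_idx: int


def query_idx_report(results, num_queries):
    """Summarize how well query_idx values line up with the input queries."""
    counts = Counter(r.query_idx for r in results)
    return {
        "num_results": len(results),
        "num_unique_query_idx": len(counts),
        # Input positions that no result claims as its query.
        "missing": sorted(set(range(num_queries)) - set(counts)),
        # query_idx values claimed by more than one result.
        "duplicated": sorted(i for i, c in counts.items() if c > 1),
        # Output positions whose result carries a different query_idx.
        "misplaced": [pos for pos, r in enumerate(results) if r.query_idx != pos],
    }


# Toy data: four queries, but only indices 0 and 2 ever appear.
report = query_idx_report([FakeResult(i) for i in (0, 0, 2, 2)], num_queries=4)
print(report)
```

With real output, the same function could be called as query_idx_report(retrieval_results.results, len(queries)). Under either expected behavior, "missing" and "duplicated" would be empty when there is one result per query.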
Example to reproduce:
import nltk
# nltk.download('punkt') # Needs to be run once
# nltk.download('gutenberg') # Needs to be run once
import numpy as np
from tqdm import tqdm
from unisim import TextSim
hamlet = nltk.corpus.gutenberg.raw('shakespeare-hamlet.txt')
sents = nltk.sent_tokenize(hamlet)
text_sim = TextSim(store_data=True, index_type="approx", batch_size=1024)
text_sim.reset_index()
for sent in sents:
    text_sim.add([sent])
queries = sents
retrieval_results = text_sim.search(queries, similarity_threshold=0.9, k=1, drop_closest_match=False)
print("Num queries =", len(queries))
print("Num results =", len(retrieval_results.results))
print("Num results where query_idx in result != idx of result in results list =", len([idx for idx in range(len(retrieval_results.results)) if idx != retrieval_results.results[idx].query_idx]))
print("Num unique queries in input =", len(set(queries)))
print("Num unique queries in output =", len(set([result.query_data for result in retrieval_results.results])))
print("Num results where query data at result idx != input query at idx =", len([idx for idx in range(len(retrieval_results.results)) if retrieval_results.results[idx].query_data != queries[idx]]))
print("Num results where match 0 data at result idx != input query at idx =", len([idx for idx in range(len(retrieval_results.results)) if retrieval_results.results[idx].matches[0].data != queries[idx]]))
My output:
Num queries = 2355
Num results = 2355
Num results where query_idx in result != idx of result in results list = 1331
Num unique queries in input = 1991
Num unique queries in output = 1991
Num results where query data at result idx != input query at idx = 0
Num results where match 0 data at result idx != input query at idx = 9
In this case I created the index with store_data=True so that I could use the query_data field to verify that the results were in the same order as the queries. However, the lack of reliable query indexes makes more detailed debugging challenging for indexes where store_data=False, for example when we observe unexpected retrieval results.
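Until query_idx can be trusted, one possible stopgap is to ignore it and pair each result with its query by position. This is only a sketch: it relies on the ordering observed in the run above (query data matched the input query at every position), which is an observation and not a documented guarantee, and pair_by_position is a hypothetical helper, not a unisim function.

```python
def pair_by_position(queries, results):
    """Pair each query with the result at the same position.

    Assumes results come back in input order, as observed in the run above.
    """
    if len(queries) != len(results):
        raise ValueError(
            f"expected one result per query, got {len(results)} results "
            f"for {len(queries)} queries"
        )
    return [(pos, query, result)
            for pos, (query, result) in enumerate(zip(queries, results))]


# Toy data standing in for the queries and retrieval_results.results.
pairs = pair_by_position(["to be", "or not"], ["result-0", "result-1"])
print(pairs)
```

The positional index in each tuple can then be used in place of query_idx when inspecting unexpected matches in an index built with store_data=False.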
I am using unisim==1.0.1 with Python 3.10.12 on a workstation with 2 A6000 GPUs.
In case it is relevant, I also get the following warnings:
2024-10-21 11:05:35.186726: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-21 11:05:35.186754: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-21 11:05:35.187653: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-21 11:05:35.191876: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-21 11:05:35.852691: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-10-21 11:05:36.482330: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:36.486731: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:36.523215: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:36.527291: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:36.531250: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:36.535234: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:42.468180: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:42.469837: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:42.471392: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:42.472804: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:42.474295: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:42.475709: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:42.486683: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:42.488149: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:42.489640: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:42.491081: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:42.492567: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:42.493970: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13450 MB memory: -> device: 0, name: NVIDIA RTX A6000, pci bus id: 0000:2b:00.0, compute capability: 8.6
2024-10-21 11:05:42.494316: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:42.495743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 46146 MB memory: -> device: 1, name: NVIDIA RTX A6000, pci bus id: 0000:41:00.0, compute capability: 8.6
/home/aishwarya/Documents/venvs/copyright_env/lib/python3.10/site-packages/keras/src/initializers/initializers.py:120: UserWarning: The initializer RandomNormal is unseeded and being called multiple times, which will return identical values each time (even if the initializer is unseeded). Please update your code to provide a seed to the initializer, or avoid using the same initializer instance more than once.
warnings.warn(
WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.