-
-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Very slow for moderate number of embeddings #106
Comments
Do you happen to have the code you used to ingest embeddings into
Also depends if you're using a custom factory or now, so any example code would be great! |
i have lost the code sorry. If I remember right this was to create the index after all the data has been inserted |
I am seeing the same performance problem: populating the virtual table takes forever. Here is the Python code I'm using: note that the part where it becomes slow is at the very end, where all the vectors have already been inserted and we proceed to populating the virtual table ( If you see any obvious problems with my code, I'd be grateful to hear about them! import io, shutil, json, sqlite3, sqlite_vss
# create an empty database:
shutil.copyfile("empty.sqlite", "de.sqlite")
con = sqlite3.connect("de.sqlite")
# load sqlite_vss into the database connection:
con.enable_load_extension(True)
sqlite_vss.load(con)
con.enable_load_extension(False)
# insert the words and their vectors:
fname = "/media/michmech/Iomega_HDD/bigstuff/cc.de.300.vec"
line_count = 0
f = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
for line in f:
line_count += 1
if line_count > 10000: break
tokens = line.rstrip().split(' ')
word = tokens[0]
vector = list(map(float, tokens[1:]))
if len(vector)==300:
con.execute("insert into words(word, vector) values(?, ?)", (word, json.dumps(vector)))
print(line_count, word)
con.commit()
# NOTE: until now everything runs very quickly, after now it becomes dead slow
# create and populate an index of the vectors:
con.execute("create virtual table vss_words using vss0( vector(300) )")
con.execute("insert into vss_words(rowid, vector) select rowid, vector from words")
con.commit()
con.close() In case it matters, the CREATE TABLE "words" (
"word" TEXT,
"vector" TEXT
); and has an index on the CREATE INDEX "ix_words" ON "words" (
"word" ASC
); |
Here is a visual on how ingest time scales versus number of embeddings. If I log both axis' it looks approximately linear.
I also noticed that there only seems to be a single thread running for the entire duration of the ingest.
I am using embed dings with dimension 2560.
I am using python and have installed sqlite-vss via pip if that makes a difference
The text was updated successfully, but these errors were encountered: