Very slow for moderate number of embeddings #106

Open
Nintorac opened this issue Oct 24, 2023 · 3 comments

@Nintorac

Here is a visual showing how ingest time scales with the number of embeddings. If I log both axes, it looks approximately linear.

I also noticed that there only seems to be a single thread running for the entire duration of the ingest.

I am using embeddings with dimension 2560.

[Figure: ingest time versus number of embeddings]

I am using Python and have installed sqlite-vss via pip, if that makes a difference.

@asg017
Owner

asg017 commented Dec 8, 2023

Do you happen to have the code you used to ingest embeddings into sqlite-vss? It shouldn't take 30 minutes to insert 30k vectors. I suspect there are a number of fixes that could make it much faster, including:

  • Insert all vectors in one transaction (surround with BEGIN and COMMIT)
  • Avoid execute() and prefer executemany() if you're using Python
  • Insert vectors all in one go (depends on the source of your vectors)

It also depends on whether you're using a custom factory or not, so any example code would be great!
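
For reference, the first two suggestions might look roughly like this in Python. This is only a minimal sketch, not code from this issue: the table name vss_words, the 300-dimension size, and the rows source are assumptions for illustration.

import json, sqlite3, sqlite_vss

con = sqlite3.connect("de.sqlite")
con.enable_load_extension(True)
sqlite_vss.load(con)
con.enable_load_extension(False)

# Hypothetical vss0 table; dimension chosen for illustration only.
con.execute("create virtual table if not exists vss_words using vss0( vector(300) )")

# Hypothetical source of vectors: (rowid, JSON-encoded 300-dim float list) pairs.
rows = [(i, json.dumps([0.0] * 300)) for i in range(1, 10001)]

# One transaction around the whole batch (the connection context manager
# commits on success and rolls back on error), and executemany() instead
# of one execute() call per row.
with con:
    con.executemany("insert into vss_words(rowid, vector) values (?, ?)", rows)

con.close()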

@Nintorac
Author

I have lost the code, sorry. If I remember right, this was to create the index after all the data had been inserted.

@michmech

I am seeing the same performance problem: populating the virtual table takes forever.

Here is the Python code I'm using. Note that the part where it becomes slow is at the very end, where all the vectors have already been inserted and we proceed to populate the virtual table (insert into vss_... select...). The vectors I'm working with here have 300 dimensions, in case that matters.

If you see any obvious problems with my code, I'd be grateful to hear about them!

import io, shutil, json, sqlite3, sqlite_vss

# create an empty database:
shutil.copyfile("empty.sqlite", "de.sqlite")
con = sqlite3.connect("de.sqlite")

# load sqlite_vss into the database connection:
con.enable_load_extension(True)
sqlite_vss.load(con)
con.enable_load_extension(False)

# insert the words and their vectors: 
fname = "/media/michmech/Iomega_HDD/bigstuff/cc.de.300.vec"
line_count = 0
f = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
for line in f:
  line_count += 1
  if line_count > 10000: break
  tokens = line.rstrip().split(' ')
  word = tokens[0]
  vector = list(map(float, tokens[1:]))
  if len(vector)==300:
    con.execute("insert into words(word, vector) values(?, ?)", (word, json.dumps(vector)))
    print(line_count, word)
con.commit()

# NOTE: everything up to this point runs very quickly; from here on it becomes dead slow

# create and populate an index of the vectors:
con.execute("create virtual table vss_words using vss0( vector(300) )")
con.execute("insert into vss_words(rowid, vector) select rowid, vector from words")
con.commit()

con.close()

In case it matters, the words table is defined like this:

CREATE TABLE "words" (
  "word" TEXT,
  "vector" TEXT
);

and has an index on the word column:

CREATE INDEX "ix_words" ON "words" (
  "word" ASC
);
