Segfault when inserting embeddings into vss_table #18

Closed · MahmoudFawzyKhalil opened this issue Apr 23, 2023 · 9 comments

MahmoudFawzyKhalil commented Apr 23, 2023

Summary:
When inserting data into the virtual table "vss_chunks" using SQLite, a segmentation fault occurs in the C++ code of the "vssIndexUpdate" function.

Steps to reproduce:

  1. Create a virtual table "vss_chunks" using the vss0 module (the CREATE statements are listed in the Schema section below).
  2. Insert data into the "chunks" table with the following Python code:
def save_resource(resource: Resource):
    conn = create_connection()
    cursor = conn.cursor()
    cursor.execute('''
        INSERT INTO resources (url, title)
        VALUES (?, ?)
    ''', (resource.url, resource.title))

    resource_id = cursor.lastrowid
    for chunk, embedding in zip(resource.chunks, resource.embeddings):
        embedding_bytes = embedding.tobytes()
        cursor.execute('''
            INSERT INTO chunks (chunk, embedding, resource_id)
            VALUES (?, ?, ?)
        ''', (chunk, embedding_bytes, resource_id))

    conn.commit()
    conn.close()

r = salmon.create_article('https://news.ycombinator.com/item?id=34218724')
save_resource(r)
  3. Insert data from the "chunks" table into the "vss_chunks" virtual table with the following SQL statement:
connection = create_connection()
cursor = connection.cursor()
cursor.execute('''
            INSERT INTO vss_chunks (rowid, chunk_embedding)
                SELECT rowid, embedding
                FROM chunks
        ''')
connection.commit()
connection.close()

Logs from IntelliJ when attempting the same query using its SQL console:

# Problematic frame:
# C  [vss0.so+0xb7eb9]  vssIndexUpdate(sqlite3_vtab*, int, sqlite3_value**, long long*)+0x459

Environment:

Operating system: Ubuntu 22.04.2 LTS
Python version: 3.10.6
SQLite version: 3.40.0
sqlite_vss version: 0.0.4 (installed with pip)
sentence_transformers: multi-qa-mpnet-base-cos-v1 model, which generates 768-dimensional embeddings
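
For context, a rough sketch of how these embedding blobs are produced (assuming sentence-transformers' encode() returns float32 vectors, so each 768-dimensional embedding serializes to 768 * 4 = 3072 bytes):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("multi-qa-mpnet-base-cos-v1")
embedding = model.encode(["some chunk of text"])[0]  # numpy array, shape (768,)
embedding_bytes = np.asarray(embedding, dtype=np.float32).tobytes()
assert len(embedding_bytes) == 768 * 4  # 3072 bytes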

Schema:

def init_db():
    conn = create_connection()
    cursor = conn.cursor()
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS resources (
            id INTEGER PRIMARY KEY,
            url TEXT,
            title TEXT
        )
    ''')

    cursor.execute('''
        CREATE TABLE IF NOT EXISTS chunks (
            id INTEGER PRIMARY KEY,
            chunk TEXT,
            embedding BLOB,
            resource_id INTEGER,
            FOREIGN KEY (resource_id) REFERENCES resources (id)
        )
    ''')

    cursor.execute('''
        CREATE VIRTUAL TABLE vss_chunks USING vss0(
        chunk_embedding(768)
    );
    ''')
    conn.commit()
    conn.close()
asg017 (Owner) commented Apr 23, 2023

Thanks for filing, @MahmoudFawzyKhalil!

I'm having trouble reproducing. Can you try running this Python script and see if you find the same issue?

import sqlite3
import sqlite_vss
import numpy as np

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vss.load(db)

print(db.execute("select vss_version()").fetchone()[0])

db.executescript("""
CREATE TABLE IF NOT EXISTS resources (
    id INTEGER PRIMARY KEY,
    url TEXT,
    title TEXT
);

CREATE TABLE IF NOT EXISTS chunks (
    id INTEGER PRIMARY KEY,
    chunk TEXT,
    embedding BLOB,
    resource_id INTEGER,
    FOREIGN KEY (resource_id) REFERENCES resources (id)
);

CREATE VIRTUAL TABLE vss_chunks USING vss0(
  chunk_embedding(768)
);

INSERT INTO resources (url, title)
  VALUES ("foo", "bar");


""")


db.execute("""
INSERT INTO chunks (chunk, embedding, resource_id)
  VALUES ("foo", ?1, 1);
""", [np.zeros((1, 768), dtype=np.float32)])

db.execute("""
INSERT INTO vss_chunks (rowid, chunk_embedding)
  SELECT rowid, embedding
  FROM chunks;
""")
db.commit()

results = db.execute("select rowid, vector_debug(chunk_embedding), * from vss_chunks").fetchall()

print(results)


db.close()

If that works, then in your original database, can you run:

SELECT DISTINCT length(embedding)
FROM chunks

And see what it returns? I have a feeling that some of the embedding lengths are not 3072 (768 * 4) bytes, which might be causing the segfault.
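
As a sketch, the same check from Python (the database path below is just a placeholder, and it assumes the embeddings are stored as float32 blobs):

import sqlite3

EXPECTED_BYTES = 768 * 4  # 3072 bytes for a 768-dimensional float32 vector

db = sqlite3.connect("your.db")  # placeholder path
bad = db.execute(
    "SELECT id, length(embedding) FROM chunks WHERE length(embedding) != ?",
    (EXPECTED_BYTES,),
).fetchall()
print(bad if bad else "all embedding blobs are 3072 bytes")
db.close()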

MahmoudFawzyKhalil (Author) commented Apr 24, 2023

Thank you for the quick support!

  1. Testing the code: the script you sent works without issue.
  2. Sadly, it returns a single row with 3072 as the length.

Also: in the example you sent, if I set the vector length to something like 10 when creating the virtual table, it just truncates the rest of the embedding; it does not segfault.

MahmoudFawzyKhalil (Author) commented Apr 24, 2023

I managed to reproduce the issue in the script you sent:

  1. Create a file-backed database with one connection
  2. Execute some statements on that connection
  3. Close it
  4. Create another connection and do the insert into the vss table

Using just a single connection in my code solved the issue. Also, re-running the script on the same database is fine even though it opens a new connection, as long as it is the only one opened in that run.

import sqlite3
from typing import List, Any

import sqlite_vss
import numpy as np

db = sqlite3.connect("bla.db")
db.enable_load_extension(True)
sqlite_vss.load(db)

print(db.execute("select vss_version()").fetchone()[0])

db.executescript("""
CREATE TABLE IF NOT EXISTS resources (
    id INTEGER PRIMARY KEY,
    url TEXT,
    title TEXT
);

CREATE TABLE IF NOT EXISTS chunks (
    id INTEGER PRIMARY KEY,
    chunk TEXT,
    embedding BLOB,
    resource_id INTEGER,
    FOREIGN KEY (resource_id) REFERENCES resources (id)
);

CREATE VIRTUAL TABLE vss_chunks USING vss0(
  chunk_embedding(10)
);

INSERT INTO resources (url, title)
  VALUES ("foo", "bar");


""")
db.execute("""
INSERT INTO chunks (chunk, embedding, resource_id)
  VALUES ("foo", ?, 1);
""", [np.zeros((1, 768), dtype=np.float32)])
db.commit()

print(db.execute("SELECT * FROM chunks").fetchall())
# Close connection and create a new one
db.close()

db = sqlite3.connect("bla.db")
db.enable_load_extension(True)
sqlite_vss.load(db)

db.execute("""
INSERT INTO vss_chunks (rowid, chunk_embedding)
  SELECT rowid, embedding
  FROM chunks;
""")
db.commit()

results = db.execute("select rowid, vector_debug(chunk_embedding), * from vss_chunks").fetchall()

print(results)


db.close()
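
For reference, a rough sketch of the single-connection pattern that avoided the crash for me (the database path is a placeholder, and the 768-dimensional column matches my original schema): create the virtual table, do the first insert, and commit, all on one connection before closing it.

import sqlite3
import sqlite_vss
import numpy as np

db = sqlite3.connect("single_connection.db")  # placeholder path
db.enable_load_extension(True)
sqlite_vss.load(db)

# Create the vss0 table and write to it on the same connection.
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS vss_chunks USING vss0(chunk_embedding(768))")
db.execute(
    "INSERT INTO vss_chunks (rowid, chunk_embedding) VALUES (?, ?)",
    (1, np.zeros(768, dtype=np.float32).tobytes()),
)
db.commit()
db.close()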

asg017 (Owner) commented Apr 24, 2023

Thank you, @MahmoudFawzyKhalil! I can now reproduce; attempting a fix.

asg017 (Owner) commented Apr 24, 2023

Smallest possible repro:

.open tmp.db
.load dist/debug/vector0
.load dist/debug/vss0

CREATE VIRTUAL TABLE vss_chunks USING vss0(
  chunk_embedding(1)
);

.open tmp.db
.load dist/debug/vector0
.load dist/debug/vss0

INSERT INTO vss_chunks (rowid, chunk_embedding)
  SELECT 2, json_array(1);

If you create a vss0 table, don't insert any data, close the connection, open a new connection, and then try to insert into the table, it segfaults. This is because we only serialize the Faiss index after write transactions are committed. But if you don't insert into the table when it's first created, there's no write transaction commit, so the index never gets written.
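
The same sequence from Python, as a rough sketch (tmp.db is just a scratch path; on v0.0.4 the final INSERT is the statement that segfaults):

import sqlite3
import sqlite_vss

db = sqlite3.connect("tmp.db")
db.enable_load_extension(True)
sqlite_vss.load(db)
db.execute("CREATE VIRTUAL TABLE vss_chunks USING vss0(chunk_embedding(1))")
db.close()  # no insert happened, so no index was ever serialized

db = sqlite3.connect("tmp.db")
db.enable_load_extension(True)
sqlite_vss.load(db)
# Crashes on v0.0.4: the extension tries to read an index that was never written.
db.execute("INSERT INTO vss_chunks (rowid, chunk_embedding) SELECT 2, json_array(1)")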

Will add a new test case and a new version bump shortly.

MahmoudFawzyKhalil (Author) commented:

Thank you @asg017

bkono commented Apr 26, 2023

Following for the fix as well. I was meaning to submit a bug report this week about the same behavior, which I noticed when I included the table setup as part of the migrations in an app. The workaround (and what I did originally, which masked the issue) was to perform the table setup after seeding the database.

Looking forward to being able to remove the temporary hack I had in place :)

asg017 (Owner) commented Apr 26, 2023

This has now been fixed in v0.0.5. The loadable extensions/python package/npm package/deno module have all been updated as well.

Will close, but please file another issue if you find anything else! Thanks for the initial report.
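
If you installed through pip, a quick sketch to check that you've picked up the fix (assuming the package was installed as sqlite-vss from PyPI):

# pip install --upgrade sqlite-vss
import sqlite3
import sqlite_vss

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vss.load(db)
print(db.execute("select vss_version()").fetchone()[0])  # should report 0.0.5 or later
db.close()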

asg017 closed this as completed Apr 26, 2023

bkono commented Apr 27, 2023

Confirmed working for my use case. Thanks!
