OpenAI: Migrate to async client and enhance API support #219

Status: Open · wants to merge 68 commits into base: main

Commits (68):
411edc6
feat(openai): Migrate to async client and enhance API support
Tostino Nov 13, 2024
9484652
Remove underscore prefix from all arguments.
Tostino Nov 14, 2024
ce17af4
Revert accidental changes to ai--0.4.0.sql
Tostino Nov 14, 2024
9c7c824
Whitespace fixes...darn IDE.
Tostino Nov 14, 2024
c8aacd7
Remove client create/destroy functions from the public interface. Add…
Tostino Nov 15, 2024
0146781
chore(main): release pgai 0.3.0 (#277)
github-actions[bot] Dec 10, 2024
77f3863
chore: release extension 0.6.0 (#285)
JamesGuthrie Dec 10, 2024
35e2fc8
chore: increment extension version to 0.6.1-dev
jgpruitt Dec 10, 2024
027b3f4
feat: allow superusers to create vectorizers on any table
jgpruitt Dec 10, 2024
bd83165
fix: handle empty `PG_BIN` (#286)
JamesGuthrie Dec 10, 2024
9f3b646
chore: increment extension lib version
jgpruitt Dec 10, 2024
1e7f5a5
feat: allow superusers to call _vectorizer_create_dependencies
jgpruitt Dec 10, 2024
aac3d83
fix: host networking is not supported on macos
jgpruitt Dec 10, 2024
ee86d35
fix: schema qualify type definitions, casts, and operators
jgpruitt Dec 10, 2024
74db2ba
docs: voyage ai quickstart (#276)
Askir Dec 11, 2024
aabbc68
docs: fix psql command and pstgres data volume in quickstarts (#293)
Askir Dec 12, 2024
19119e5
docs: fix formatting of table for voyageai (#294)
Askir Dec 12, 2024
113cc48
chore: include latest tag on releases (#295)
adolsalamanca Dec 12, 2024
7b2995b
feat: allow vectorizers to be granted to public
jgpruitt Dec 11, 2024
b22a77a
chore: configure latest tag in docker metadata step (#296)
JamesGuthrie Dec 12, 2024
1e226e8
docs: guide users to clone the latest release for installation
jgpruitt Dec 12, 2024
c3a2221
docs: use named volumes in quickstart guides (#300)
JamesGuthrie Dec 12, 2024
38c7b50
docs: fix RAG SQL example in readme
jgpruitt Dec 13, 2024
20714ff
chore: reorganize readme
cevian Dec 12, 2024
4098316
chore: readme fixes from PR comments
cevian Dec 13, 2024
2b41c9a
chore: more changes to readme
cevian Dec 13, 2024
dbac246
feat: pull missing ollama models (#301)
JamesGuthrie Dec 16, 2024
5441f2d
chore: fix broken pgai build by pinning hatchling (#308)
JamesGuthrie Dec 16, 2024
3f9736a
chore: support uv in extension install, use for dev (#309)
JamesGuthrie Dec 16, 2024
7d4c8ee
feat: add a semantic catalog for db objs and example sql
jgpruitt Dec 2, 2024
df399f8
feat: add ai.find_relevant_sql semantic catalog function
jgpruitt Dec 11, 2024
18ee335
feat: add ai.find_relevant_obj() functions
jgpruitt Dec 11, 2024
9edacfb
ci: pgspot chokes on valid code. disabling for now
jgpruitt Dec 11, 2024
8984cf9
fix: ignore ollama.base_url in test
jgpruitt Dec 11, 2024
f17388e
feat: only find relevant db objs that user has privs to
jgpruitt Dec 12, 2024
cf71602
feat: allow find_relevant_obj to be restricted to a given obj type
jgpruitt Dec 12, 2024
4fd12a6
feat: return dist and add max_dist filter to semantic_catalog funcs
jgpruitt Dec 13, 2024
405a208
chore: clean up event triggers to only update columns strictly required
jgpruitt Dec 13, 2024
56ecf53
chore: add foreign key constraints to semantic catalog on vectorizer
jgpruitt Dec 13, 2024
32b0eb7
chore: reorder arguments for semantic catalog functions
jgpruitt Dec 13, 2024
77df293
chore: support multiple objtype filters in find_relevant_obj()
jgpruitt Dec 13, 2024
85c9244
feat: add vectorizer_embed convenience function
jgpruitt Dec 13, 2024
6578edc
chore: make an immutable version of vectorizer_embed
jgpruitt Dec 16, 2024
5c17d89
chore: rename semantic_catalog.name to semantic_catalog.catalog_name
jgpruitt Dec 16, 2024
1f6d1a8
fix: exclude python system packages for versioned extension (#310)
JamesGuthrie Dec 17, 2024
7b4a916
fix: deprecation warning on re.split
MasterOdin Dec 17, 2024
1d42906
chore: remove pip caching
JamesGuthrie Dec 17, 2024
3c2a6a5
docs: remove openai mention from quickstart, fix opclasses in hnsw in…
Askir Dec 17, 2024
295c0f0
feat: construct a prompt for text-to-sql using relevant desc
jgpruitt Dec 17, 2024
17e3d28
feat: add a text_to_sql function
jgpruitt Dec 17, 2024
77673ee
chore: split embedders in individual files (#315)
smoya Dec 18, 2024
b7573b7
feat: load api keys from db in self hosted vectorizer (#311)
Askir Dec 18, 2024
423e8fd
test: spawn the ollama container through testcontainers (#318)
smoya Dec 18, 2024
ad81577
docs: add simple embedding evaluation script for pgai Vectorizer (#312)
jackyliang Dec 18, 2024
1e44ba2
docs: add Voyage AI evaluation code (#321)
jackyliang Dec 18, 2024
4f530f2
fix: openai wrapping text-to-sql query in markdown (#324)
MasterOdin Dec 18, 2024
39fcf93
docs: clarify 'destination' argument (#262)
smoya Dec 19, 2024
0230509
feat: add sqlalchemy vectorizer_relationship (#265)
Askir Dec 19, 2024
a4298c3
Code for evaluating open source embedding models (#305)
ihis-11 Dec 19, 2024
8c5baf7
test: use psycopg over psycopg2 for sqlalchemy tests (#326)
MasterOdin Dec 20, 2024
89039b2
chore: register postgres_params custom pytest.mark (#327)
MasterOdin Dec 23, 2024
6d89b42
feat(openai): Migrate to async client and enhance API support
Tostino Nov 13, 2024
c368831
Remove underscore prefix from all arguments.
Tostino Nov 14, 2024
76fe0cd
Revert accidental changes to ai--0.4.0.sql
Tostino Nov 14, 2024
88f32b8
Whitespace fixes...darn IDE.
Tostino Nov 14, 2024
b8772f6
Remove client create/destroy functions from the public interface. Add…
Tostino Nov 15, 2024
ab3e9b3
Merge remote-tracking branch 'origin/fix_openai_redux' into fix_opena…
Tostino Dec 26, 2024
6236f03
Fix breakage with secrets.
Tostino Dec 27, 2024
11 changes: 9 additions & 2 deletions .github/workflows/ci.yml
@@ -27,7 +27,6 @@ jobs:
uses: actions/setup-python@v5
with:
python-version: "3.10"
cache: "pip" # caching pip dependencies

- name: Verify Docker installation
run: |
@@ -88,6 +87,11 @@
just ext docker-rm

build-and-test-pgai:
services:
ollama:
image: ollama/ollama:latest
ports:
- 11434:11434
runs-on: ubuntu-latest

steps:
@@ -109,7 +113,7 @@

- name: Install dependencies
working-directory: ./projects/pgai
run: uv sync
run: uv sync --all-extras

- name: Lint
run: just pgai lint
@@ -122,6 +126,9 @@

- name: Run Tests
run: just pgai test
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}


- name: Build the pgai distributable and check artifacts
run: just pgai build
4 changes: 3 additions & 1 deletion .github/workflows/release-please.yml
@@ -77,6 +77,7 @@ jobs:
org.opencontainers.image.title=pgai-vectorizer-worker
tags: |
type=raw,value=${{ env.RELEASE_TAG }}
type=raw,value=latest

- name: Login to Docker Hub
uses: docker/login-action@v3
@@ -89,6 +90,7 @@
with:
context: "{{defaultContext}}:projects/pgai"
push: true
tags: ${{ steps.meta.outputs.tags }}
tags: |
${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
platforms: linux/amd64,linux/arm64
2 changes: 1 addition & 1 deletion .release-please-manifest.json
@@ -1,3 +1,3 @@
{
"projects/pgai": "0.2.1"
"projects/pgai": "0.3.0"
}
311 changes: 179 additions & 132 deletions README.md

Large diffs are not rendered by default.

10 changes: 8 additions & 2 deletions docs/adding-embedding-integration.md
@@ -31,13 +31,17 @@
integration. Update the tests to account for the new function.
The vectorizer worker reads the database's vectorizer configuration at runtime
and turns it into a `pgai.vectorizer.Config`.

To add a new integration, add a new embedding class with fields corresponding
to the database's jsonb configuration to `pgai/vectorizer/embeddings.py`. See
To add a new integration, add a new file containing the embedding class,
with fields corresponding to the database's jsonb configuration, to the
[embedders directory]. See
the existing implementations for examples of how to do this. Implement the
`Embedder` class' abstract methods. Use first-party python libraries for the
integration, if available. If no first-party python libraries are available,
use direct HTTP requests.

Remember to add an import for your newly created class to the
[embedders \_\_init\_\_.py].
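
As a rough sketch, a new embedder file might look like the following. The base-class import path, field names, and `embed` signature are assumptions for illustration only; mirror one of the existing implementations for the real interface:

```python
# Hypothetical pgai/vectorizer/embedders/acme.py; all names are illustrative.
from pydantic import BaseModel

from ..embeddings import Embedder  # assumed location of the abstract base


class Acme(BaseModel, Embedder):
    """Fields mirror the jsonb configuration stored in the database."""

    implementation: str  # discriminator, e.g. "acme"
    model: str
    dimensions: int

    async def embed(self, documents: list[str]) -> list[list[float]]:
        # Use the provider's first-party client if one exists; otherwise
        # fall back to direct HTTP requests.
        raise NotImplementedError
```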

Add tests which perform end-to-end testing of the new integration. There are
two options for handling API calls to the integration API:

@@ -49,6 +53,8 @@
used conservatively. We will determine on a case-by-case basis what level of
testing we would like.

[vcr.py]:https://vcrpy.readthedocs.io/en/latest/
[embedders directory]:/projects/pgai/pgai/vectorizer/embedders
[embedders \_\_init\_\_.py]:/projects/pgai/pgai/vectorizer/embedders/__init__.py

## Documentation

40 changes: 40 additions & 0 deletions docs/install_docker.md
@@ -0,0 +1,40 @@
# Install pgai with Docker

To run pgai, you need to run two containers:

1. A PostgreSQL instance with the pgai extension installed.
2. A vectorizer worker that syncs your data to the database and creates embeddings (only needed if you use the pgai vectorizer).

We have example docker-compose files to get you started:

- [docker compose for pgai](/examples/docker_compose_pgai_ollama/docker-compose.yml) - for using pgai with OpenAI and Voyage AI
- [docker compose for pgai with Ollama](/examples/docker_compose_pgai_ollama/docker-compose.yml) - for using pgai with Ollama running locally
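
For reference, here is a minimal sketch of such a compose file. The service and volume names are illustrative; treat the linked examples as the source of truth:

```yaml
services:
  db:
    image: timescale/timescaledb-ha:pg17
    environment:
      POSTGRES_PASSWORD: password  # change for production
    ports:
      - "5432:5432"
    volumes:
      - pg-data:/home/postgres/pgdata/data
  vectorizer-worker:
    image: timescale/pgai-vectorizer-worker:latest
    environment:
      # The worker reaches the db service over the compose network.
      PGAI_VECTORIZER_WORKER_DB_URL: postgres://postgres:password@db:5432/postgres
    depends_on:
      - db
volumes:
  pg-data:
```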

If you want to run the containers by themselves, see the detailed instructions below.

## Detailed instructions for running the containers

### Run the PostgreSQL instance

1. Run the docker container. The suggested command is:
```
docker run -d --name pgai -p 5432:5432 \
-v pg-data:/home/postgres/pgdata/data \
-e POSTGRES_PASSWORD=password timescale/timescaledb-ha:pg17
```

This starts a PostgreSQL instance for development purposes, using a volume called `pg-data` for data storage. To run in production, you should change the password above. See the full [Docker image instructions](https://docs.timescale.com/self-hosted/latest/install/installation-docker/) for more information. You can connect to the database using the following connection string: `postgres://postgres:password@localhost/postgres`

2. Create the pgai extension in your database:

```
docker exec -it pgai psql -U postgres -c "CREATE EXTENSION IF NOT EXISTS ai CASCADE;"
```

### Run the vectorizer worker

1. Run the [vectorizer worker](https://hub.docker.com/r/timescale/pgai-vectorizer-worker) container:

```
docker run -d --name pgai-vectorizer-worker -e PGAI_VECTORIZER_WORKER_DB_URL=postgres://postgres:password@localhost/postgres timescale/pgai-vectorizer-worker:latest
```
72 changes: 72 additions & 0 deletions docs/install_from_source.md
@@ -0,0 +1,72 @@
# Install the pgai extension from source

To install pgai from source on a PostgreSQL server:

1. **Install the prerequisite software system-wide**

- **PostgreSQL**: Version 16 or newer is required.

- **Python3**: if running `python3 --version` in Terminal returns `command
not found`, download and install the latest version of [Python3][python3].

- **Pip**: if running `pip --version` in Terminal returns `command not found`, install it with one of the pip [supported methods][pip].

- **PL/Python**: follow [How to install Postgres 16 with plpython3u: Recipes for macOS, Ubuntu, Debian, CentOS, Docker][pgai-plpython].

_macOS_: the standard PostgreSQL formula in Homebrew does not include the `plpython3` extension. These instructions show
how to install from an alternate tap.

- **[Postgresql plugin][asdf-postgres] for the [asdf][asdf] version manager**: set the `--with-python` option
when installing PostgreSQL:

```bash
POSTGRES_EXTRA_CONFIGURE_OPTIONS=--with-python asdf install postgres 16.3
```

- **pgvector**: follow the [install instructions][pgvector-install] from the official repository. This extension is automatically added to your PostgreSQL database when you install the pgai extension.


1. Clone the pgai repo at the latest tagged release:

```bash
git clone https://github.com/timescale/pgai.git --branch extension-0.6.0
cd pgai
```

1. Install the `pgai` PostgreSQL extension:

```bash
just ext install
```
We use [just][just] to run project commands. If you don't have `just`, you can
install the extension with:

```bash
projects/extension/build.py install
```

1. Connect to your database with a postgres client like [psql v16](https://docs.timescale.com/use-timescale/latest/integrations/query-admin/psql/)
or [PopSQL](https://docs.timescale.com/use-timescale/latest/popsql/).
```bash
psql -d "postgres://<username>:<password>@<host>:<port>/<database-name>"
```

1. Create the pgai extension:

```sql
CREATE EXTENSION IF NOT EXISTS ai CASCADE;
```

The `CASCADE` option automatically installs the `pgvector` and `plpython3u` extensions.

[pgai-plpython]: https://github.com/postgres-ai/postgres-howtos/blob/main/0047_how_to_install_postgres_16_with_plpython3u.md
[asdf-postgres]: https://github.com/smashedtoatoms/asdf-postgres
[asdf]: https://github.com/asdf-vm/asdf
[python3]: https://www.python.org/downloads/
[pip]: https://pip.pypa.io/en/stable/installation/#supported-methods
[plpython3u]: https://www.postgresql.org/docs/current/plpython.html
[pgvector]: https://github.com/pgvector/pgvector
[pgvector-install]: https://github.com/pgvector/pgvector?tab=readme-ov-file#installation
[python-virtual-environment]: https://packaging.python.org/en/latest/tutorials/installing-packages/#creating-and-using-virtual-environments
[create-a-new-service]: https://console.cloud.timescale.com/dashboard/create_services
[just]: https://github.com/casey/just
184 changes: 184 additions & 0 deletions docs/python-integration.md
@@ -0,0 +1,184 @@
# SQLAlchemy Integration with pgai Vectorizer

The `vectorizer_relationship` is a SQLAlchemy helper that integrates pgai's vectorization capabilities directly into your SQLAlchemy models.
Think of it as a normal SQLAlchemy [relationship](https://docs.sqlalchemy.org/en/20/orm/basic_relationships.html), but with a preconfigured model instance under the hood.
This allows you to easily query vector embeddings created by pgai using familiar SQLAlchemy patterns.

## Installation

To use the SQLAlchemy integration, install pgai with the SQLAlchemy extras:

```bash
pip install "pgai[sqlalchemy]"
```

## Basic Usage

Here's a basic example of how to use the `vectorizer_relationship`:

```python
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column
from pgai.sqlalchemy import vectorizer_relationship

class Base(DeclarativeBase):
pass

class BlogPost(Base):
__tablename__ = "blog_posts"

id: Mapped[int] = mapped_column(primary_key=True)
title: Mapped[str]
content: Mapped[str]

# Add vector embeddings for the content field
content_embeddings = vectorizer_relationship(
dimensions=768
)
```
Note: if you use Alembic's autogenerate functionality for migrations, also check [Working with alembic](#working-with-alembic).

### Semantic Search

You can then perform semantic similarity search on the field using [pgvector-python's](https://github.com/pgvector/pgvector-python) distance functions:

```python
from sqlalchemy import func, text

similar_posts = (
session.query(BlogPost.content_embeddings)
.order_by(
BlogPost.content_embeddings.embedding.cosine_distance(
func.ai.openai_embed(
"text-embedding-3-small",
"search query",
text("dimensions => 768")
)
)
)
.limit(5)
.all()
)
```

Or if you already have the embeddings in your application:

```python
similar_posts = (
session.query(BlogPost.content_embeddings)
.order_by(
BlogPost.content_embeddings.embedding.cosine_distance(
[3, 1, 2]
)
)
.limit(5)
.all()
)
```

## Configuration

The `vectorizer_relationship` accepts the following parameters:

- `dimensions` (int): The size of the embedding vector (required)
- `target_schema` (str, optional): Override the schema for the embeddings table. If not provided, it inherits from the parent model's schema.
- `target_table` (str, optional): Override the table name for the embeddings table. Default is `{table_name}_embedding_store`.

Additional parameters are simply forwarded to the underlying [SQLAlchemy relationship](https://docs.sqlalchemy.org/en/20/orm/relationships.html) so you can configure it as you desire.

Think of the `vectorizer_relationship` as a normal SQLAlchemy relationship, but with a preconfigured model instance under the hood.
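
For example, a hypothetical override; the schema and table names here are purely illustrative, and the extra keyword argument is forwarded to the underlying relationship:

```python
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column
from pgai.sqlalchemy import vectorizer_relationship


class Base(DeclarativeBase):
    pass


class BlogPost(Base):
    __tablename__ = "blog_posts"

    id: Mapped[int] = mapped_column(primary_key=True)
    title: Mapped[str]
    content: Mapped[str]

    content_embeddings = vectorizer_relationship(
        dimensions=768,
        target_schema="embeddings",            # illustrative schema name
        target_table="blog_posts_embeddings",  # overrides the default name
        lazy="joined",  # forwarded to sqlalchemy.orm.relationship
    )
```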


## Setting up the Vectorizer

After defining your model, you need to create the vectorizer using pgai's SQL functions:

```sql
SELECT ai.create_vectorizer(
'blog_posts'::regclass,
embedding => ai.embedding_openai('text-embedding-3-small', 768),
chunking => ai.chunking_recursive_character_text_splitter(
'content',
50, -- chunk_size
10 -- chunk_overlap
)
);
```

We recommend adding this to a migration script and running it via Alembic, as sketched below.
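
A minimal sketch of such a migration, assuming a plain `op.execute` call; the revision identifiers are placeholders:

```python
"""Create the vectorizer for blog_posts."""
from alembic import op

revision = "0001_blog_posts_vectorizer"  # placeholder
down_revision = None


def upgrade() -> None:
    op.execute(
        """
        SELECT ai.create_vectorizer(
            'blog_posts'::regclass,
            embedding => ai.embedding_openai('text-embedding-3-small', 768),
            chunking => ai.chunking_recursive_character_text_splitter(
                'content', 50, 10
            )
        );
        """
    )


def downgrade() -> None:
    # Tearing the vectorizer down again is omitted here; consult the pgai
    # docs for the matching drop function.
    pass
```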


## Querying Embeddings

The `vectorizer_relationship` provides several ways to work with embeddings:

### 1. Direct Access to Embeddings

If you access the class property on your model, the `vectorizer_relationship` provides a SQLAlchemy model that you can query directly:

```python
# Get all embeddings
embeddings = session.query(BlogPost.content_embeddings).all()

# Access embedding properties
for embedding in embeddings:
print(embedding.embedding) # The vector embedding
print(embedding.chunk) # The text chunk
```
The model will have the primary key fields of the parent model as well as the following fields:
- `chunk` (str): The text chunk that was embedded
- `embedding` (Vector): The vector embedding
- `chunk_seq` (int): The sequence number of the chunk
- `embedding_uuid` (str): The UUID of the embedding
- `parent` (ParentModel): The parent model instance

### 2. Relationship Access

Each model instance exposes its embeddings through the relationship attribute:
```python
blog_post = session.query(BlogPost).first()
for embedding in blog_post.content_embeddings:
print(embedding.chunk)
```
Access the original posts through the `parent` relationship:
```python
for embedding in similar_posts:
print(embedding.parent.title)
```

### 3. Join Queries

You can combine embedding queries with regular SQL queries using the relationship:

```python
results = (
session.query(BlogPost, BlogPost.content_embeddings)
.join(BlogPost.content_embeddings)
.filter(BlogPost.title.ilike("%search term%"))
.all()
)

for post, embedding in results:
print(f"Title: {post.title}")
print(f"Chunk: {embedding.chunk}")
```

## Working with alembic


The `vectorizer_relationship` generates a new SQLAlchemy model that is available under the attribute you specify. If you use Alembic's autogenerate functionality to generate migrations, you need to exclude these models from the autogenerate process.
Their tables are tracked in your metadata's `info` dict under the key `pgai_managed_tables`, and you can exclude them by adding the following to your `env.py`:

```python
def include_object(object, name, type_, reflected, compare_to):
if type_ == "table" and name in target_metadata.info.get("pgai_managed_tables", set()):
return False
return True

context.configure(
connection=connection,
target_metadata=target_metadata,
include_object=include_object
)
```

This prevents Alembic from generating tables for these models when you run `alembic revision --autogenerate`.