feat: add alembic operations for vectorizer #266

Open
wants to merge 10 commits into
base: main
from

Conversation

Askir
Contributor

@Askir Askir commented Dec 2, 2024

This PR adds native Python operations to Alembic so you don't have to write SQL to create vectorizers.

@Askir Askir force-pushed the jascha/add-alembic-migration-ops branch from c899380 to fd9f1bc Compare December 2, 2024 10:08
@Askir Askir force-pushed the jascha/add-alembic-migration-ops branch from fd9f1bc to 6f5ff59 Compare December 3, 2024 13:37
@Askir Askir marked this pull request as ready for review December 3, 2024 23:16
@Askir Askir requested a review from a team as a code owner December 3, 2024 23:16
@Askir Askir force-pushed the jascha/add-vectorizer-field branch from 8742af8 to 36cf4d9 Compare December 4, 2024 13:13
@Askir Askir force-pushed the jascha/add-alembic-migration-ops branch from 6f5ff59 to 525ab5b Compare December 4, 2024 13:20
Collaborator

@cevian cevian left a comment

I gotta say I'm not convinced by the arguments for using separate models rather than the models already in pgai/vectorizer, or at least having both sets of models extend a base model. I think having two sets of models with similar params is really hard to maintain and means quite a bit of code duplication. I'd like some more eyes on this, though. Can James and/or Alejandro chime in here? In particular I'd like us to consider three designs:

  • Simply extending the pydantic model we already have with optional fields that are present in either the stored JSON or needed for the alembic stuff, plus having some kind of wrappers to create the config objects in alembic.
  • Factoring common data fields into base classes and using those as mixins (kinda like the ApiKeyMixin now).
  • Maybe I'm just being stubborn and we should have separate models, like Jascha has them now.

Leaving a few comments in, but I think this is the big issue we need to resolve.
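
For reference, the second design (shared fields in a base class, with a worker-side and a migration-side subclass) could look roughly like the sketch below. It uses stdlib dataclasses instead of pydantic so that it runs on its own, and every class and field name in it is hypothetical, not taken from pgai.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class SharedEmbeddingFields:
    # Hypothetical shared fields, declared exactly once for both sides.
    model: Optional[str] = None
    dimensions: Optional[int] = None
    api_key_name: Optional[str] = None


@dataclass
class WorkerEmbeddingConfig(SharedEmbeddingFields):
    """Worker-side model: built from stored JSON, so some values must be present."""

    def __post_init__(self) -> None:
        missing = [n for n in ("model", "dimensions") if getattr(self, n) is None]
        if missing:
            raise ValueError(f"missing required fields: {missing}")


@dataclass
class MigrationEmbeddingConfig(SharedEmbeddingFields):
    """Migration-side model: unset values are omitted so SQL defaults can apply."""

    def to_kwargs(self) -> dict[str, Any]:
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }
```

With this layout a new config field is added in one place, and each side only adds behaviour (validation on the worker side, argument rendering on the migration side).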

projects/pgai/pgai/configuration.py Outdated Show resolved Hide resolved
projects/pgai/pgai/alembic/operations.py Show resolved Hide resolved
projects/pgai/pgai/alembic/operations.py Outdated Show resolved Hide resolved
projects/pgai/pgai/configuration.py Outdated Show resolved Hide resolved
@Askir Askir force-pushed the jascha/add-vectorizer-field branch 10 times, most recently from 3b47afc to 8fe145e Compare December 12, 2024 13:46
@Askir Askir force-pushed the jascha/add-alembic-migration-ops branch from 525ab5b to 5e76cf9 Compare December 12, 2024 16:44
@Askir Askir force-pushed the jascha/add-vectorizer-field branch from 8fe145e to 882f91e Compare December 19, 2024 11:40
Base automatically changed from jascha/add-vectorizer-field to main December 19, 2024 12:32
@Askir Askir force-pushed the jascha/add-alembic-migration-ops branch 7 times, most recently from c90ae69 to 7b90575 Compare January 7, 2025 13:57
@Askir Askir force-pushed the jascha/add-alembic-migration-ops branch from 7b90575 to 447078f Compare January 7, 2025 14:07
Contributor Author

@Askir Askir left a comment

I have added base.py with shared pydantic classes and an optional @required decorator, so that classes don't have to be redefined just because of optional params.
This should mean that adding new config fields to create_vectorizer mainly requires editing the base.py classes instead of looking in two places.

I'm still not convinced that this is the better approach. But I don't feel strongly about it.

Comment on lines +55 to +58
RUN mkdir -p /docker-entrypoint-initdb.d && \
echo "#!/bin/bash" > /docker-entrypoint-initdb.d/configure-timescaledb.sh && \
echo "echo \"shared_preload_libraries = 'timescaledb'\" >> \${PGDATA}/postgresql.conf" >> /docker-entrypoint-initdb.d/configure-timescaledb.sh && \
chmod +x /docker-entrypoint-initdb.d/configure-timescaledb.sh
Contributor Author

I had to add this to be able to run create extension if not exists timescaledb. I'm not sure this is correct?

Collaborator

Why did you need timescaledb for this PR? This is a dev image, so this is fine; I'm just curious.

Contributor Author

Because I am actually running the migrations in tests, and creating a vectorizer with a scheduling config is not allowed if timescaledb is not installed.


    @cached_property
    def _chunker(self) -> CharacterTextSplitter:
        return CharacterTextSplitter(
-            separator=self.separator,
+            separator=self.separator,  # type: ignore
Contributor Author

This type: ignore is now necessary as pyright does not know that the decorator makes the field required and complains about possibly passing None.
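
One possible alternative to the type: ignore, sketched here under assumptions and not what this PR does, is a small narrowing helper that turns an Optional value into a definite one at the point of use. The names require and SplitterConfig are hypothetical.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def require(value: Optional[T], name: str) -> T:
    """Return the value narrowed to T, or fail loudly if it is None."""
    if value is None:
        raise ValueError(f"{name} is required but was not set")
    return value


class SplitterConfig:
    # Hypothetical stand-in for a config whose field is Optional in a shared base.
    def __init__(self, separator: Optional[str] = None) -> None:
        self.separator = separator

    def split(self, text: str) -> list[str]:
        # A type checker sees `str` here rather than `Optional[str]`.
        separator = require(self.separator, "separator")
        return text.split(separator)
```

The trade-off is a runtime check at each use site in exchange for keeping the type checker informed without suppressions.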

@Askir Askir force-pushed the jascha/add-alembic-migration-ops branch from 447078f to 828347f Compare January 7, 2025 14:14
@Askir Askir force-pushed the jascha/add-alembic-migration-ops branch from 828347f to e5e4614 Compare January 7, 2025 14:16
@Askir Askir force-pushed the jascha/add-alembic-migration-ops branch from e5e4614 to 7bfedb3 Compare January 7, 2025 14:19
@Askir Askir requested a review from cevian January 7, 2025 14:22
            new_fields[name] = new_field
        else:
            new_fields[name] = field
    _cls.model_fields = new_fields
Contributor Author

@Askir Askir Jan 7, 2025

Oh... I'm trying to build a sample application right now. In this repo we use pydantic 2.9, where this works, but in pydantic 2.10 it already breaks...

  File "/Users/jascha/repositories/pgai/examples/discord_bot/.venv/lib/python3.12/site-packages/pgai/vectorizer/chunking.py", line 41, in <module>
    @required
     ^^^^^^^^
  File "/Users/jascha/repositories/pgai/examples/discord_bot/.venv/lib/python3.12/site-packages/pgai/vectorizer/base.py", line 97, in required
    return dec(cls)
           ^^^^^^^^
  File "/Users/jascha/repositories/pgai/examples/discord_bot/.venv/lib/python3.12/site-packages/pgai/vectorizer/base.py", line 94, in dec
    _cls.model_fields = new_fields
    ^^^^^^^^^^^^^^^^^
AttributeError: property 'model_fields' of 'ModelMetaclass' object has no setter

I'll have to go back through this and find another way, but either way this seems brittle. Pydantic is not really designed to allow such overrides: their idea is to use the type system to infer the validation logic, and overriding the types breaks this declarative approach.
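
The error in the traceback is ordinary Python behaviour once an attribute is exposed as a read-only property on the metaclass. The following is a minimal stdlib reproduction of that mechanism; ModelMeta and Model are hypothetical stand-ins, not pydantic's actual classes.

```python
class ModelMeta(type):
    """Hypothetical metaclass exposing fields through a read-only property."""

    @property
    def model_fields(cls) -> dict:
        # Return a copy so callers cannot mutate the declared fields in place.
        return dict(cls.__dict__.get("_declared_fields", {}))


class Model(metaclass=ModelMeta):
    _declared_fields = {"separator": "optional"}


def try_override() -> str:
    """Attempt the same kind of assignment the decorator performs."""
    try:
        Model.model_fields = {"separator": "required"}  # property has no setter
    except AttributeError as exc:
        return type(exc).__name__
    return "assigned"
```

Assigning to a class attribute consults the metaclass first, finds the property, and fails because it defines no setter. A decorator that rewrites model_fields therefore depends on how the library happens to store that attribute.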

Collaborator

@cevian cevian left a comment

This is looking a lot better. Some questions remaining but I think this is the right track.



def downgrade() -> None:
    op.drop_vectorizer(vectorizer_id=1, drop_all=True)
Collaborator

I think it would be better for this to take the target_table name rather than the vectorizer_id (which would probably not be known when writing the migration). The target table should be unique, so we should be able to look up the id from that.
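
A rough sketch of the lookup this suggestion implies is shown below. To stay runnable without a Postgres instance it uses sqlite3 and a stand-in table named vectorizer with id and target_table columns; the real catalog table, its schema, and its column names are assumptions here, and lookup_vectorizer_id is a hypothetical helper.

```python
import sqlite3


def lookup_vectorizer_id(conn: sqlite3.Connection, target_table: str) -> int:
    """Resolve a vectorizer id from its (assumed unique) target table name."""
    rows = conn.execute(
        "SELECT id FROM vectorizer WHERE target_table = ?", (target_table,)
    ).fetchall()
    if len(rows) != 1:
        raise LookupError(
            f"expected exactly one vectorizer for {target_table!r}, found {len(rows)}"
        )
    return rows[0][0]


def demo() -> int:
    # Stand-in catalog; in a migration this would be the live database connection.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE vectorizer (id INTEGER PRIMARY KEY, target_table TEXT UNIQUE)"
    )
    conn.execute(
        "INSERT INTO vectorizer (id, target_table) VALUES (7, 'blog_embedding_store')"
    )
    return lookup_vectorizer_id(conn, "blog_embedding_store")
```

A drop operation keyed on the table name could resolve the id this way at migration time and then call the existing id-based drop, so the migration file never hard-codes an id.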

# Get all fields including from parent classes
params = {}
for field_name, _field in self.model_fields.items():  # type: ignore
    if field_name != "arg_type":
Collaborator

Is the function_name field included then? How does that work?
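
One way to make the answer to this question structural, sketched with stdlib dataclasses rather than the PR's pydantic models, is to keep arg_type and function_name as class-level attributes so they never show up in the field list. SQLArgumentSketch, OpenAISketch, and this simplified format_sql_params are hypothetical illustrations, not the PR's code, and the rendered SQL shape is assumed from the return statement in the diff.

```python
from dataclasses import dataclass, fields
from typing import Any, ClassVar, Optional


def format_sql_params(params: dict[str, Any]) -> str:
    """Render keyword arguments as `name => literal` pairs (simplified quoting)."""
    rendered = []
    for key, value in params.items():
        if isinstance(value, bool):
            literal = "true" if value else "false"
        elif isinstance(value, (int, float)):
            literal = str(value)
        else:
            literal = "'" + str(value).replace("'", "''") + "'"
        rendered.append(f"{key} => {literal}")
    return ", ".join(rendered)


class SQLArgumentSketch:
    # ClassVar attributes are metadata, so dataclasses.fields() skips them.
    arg_type: ClassVar[str]
    function_name: ClassVar[str]

    def to_sql_argument(self) -> str:
        params = {
            f.name: getattr(self, f.name)
            for f in fields(self)  # only real data fields, no metadata
            if getattr(self, f.name) is not None
        }
        return f", {self.arg_type} => ai.{self.function_name}({format_sql_params(params)})"


@dataclass
class OpenAISketch(SQLArgumentSketch):
    arg_type: ClassVar[str] = "embedding"
    function_name: ClassVar[str] = "embedding_openai"
    model: Optional[str] = None
    dimensions: Optional[int] = None
```

With this layout no name-based exclusion such as field_name != "arg_type" is needed, because the metadata is never part of the iterated fields.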

return f", {self.arg_type} => ai.{fn_name}({format_sql_params(params)})" # type: ignore


class OpenAIConfig(BaseOpenAIConfig, SQLArgumentMixin):
Collaborator

On naming (and I know naming discussions are always annoying): why not stick to the SQL convention we established and name this EmbeddingOpenAIConfig or EmbeddingConfigOpenAI (and similarly for the others)? The pro is that the name translation from SQL to Python is super easy, which I think would make it easier to understand. The con is that it's long.

Otherwise the translation seems a bit ad hoc, e.g. indexing configs have "indexing" in the name but in a different spot than the SQL. Let's think about this some more.
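
To make the "easy translation" point concrete, here is a small sketch of a mechanical mapping from a SQL function name to a Python class name. The helper sql_to_class_name, the acronym table, and the "Config" suffix are assumptions for illustration; the example SQL names follow the snake_case style of the existing SQL API.

```python
from typing import Mapping

# Hypothetical casing overrides for words that are not plain title case.
ACRONYMS: Mapping[str, str] = {"openai": "OpenAI", "hnsw": "HNSW"}


def sql_to_class_name(sql_function: str, acronyms: Mapping[str, str] = ACRONYMS) -> str:
    """Turn e.g. 'embedding_openai' into 'EmbeddingOpenAIConfig'."""
    words = sql_function.split("_")
    return "".join(acronyms.get(word, word.capitalize()) for word in words) + "Config"
```

Under this rule the word order of the Python name always matches the SQL name, which is the consistency the comment asks for; the cost is the longer class names it mentions.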

@@ -0,0 +1,227 @@
import re
Collaborator

To my eye this is a huge improvement from before

@@ -33,7 +38,8 @@ def into_chunks(self, item: dict[str, Any]) -> list[str]:
"""


class LangChainCharacterTextSplitter(BaseModel, Chunker):
@required
Collaborator

I only see this used in two places. Don't we need it on more models?

@@ -164,7 +164,7 @@ for post, embedding in results:

Collaborator

I think we also need to add docs to adding-embedding-integration.md

3 participants