-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add alembic operations for vectorizer #266
base: main
Are you sure you want to change the base?
Conversation
c899380
to
fd9f1bc
Compare
fd9f1bc
to
6f5ff59
Compare
8742af8
to
36cf4d9
Compare
6f5ff59
to
525ab5b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gotta say I'm not convinced about the arguments for using a separate model than the models already in pgai/vectorizer or at least having both sets of models extend a base model. I think having 2 sets of models with similar params is really hard to maintain and quite a bit of code duplication. I'd like some more eyes on this tho. Can
James and/or Alejandro chime in here. In particular I'd like us to consider three designs:
- simply extending the pydantic model we already have with optional fields that are present in either the stored json OR needed for the alembic stuff + having some kind of wrappers to create the config objects in alembic.
- Factoring common data fields into base classes and using those as mixins. (kinda like the ApiKeyMixin now).
- Maybe I'm just being stubborn and we should have separate models, like Jascha has them now.
leaving a few comments in but I think this is the big issue we need to resolve
3b47afc
to
8fe145e
Compare
525ab5b
to
5e76cf9
Compare
projects/pgai/tests/vectorizer/extensions/fixtures/migrations/002_create_vectorizer.py.template
Outdated
Show resolved
Hide resolved
8fe145e
to
882f91e
Compare
c90ae69
to
7b90575
Compare
7b90575
to
447078f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added the base.py
with shared pydantic classes and an optional @required
decorator to not have to redefine classes just for optional params.
This should allow to mainly have to edit the base.py
classes instead of having to look in two places when adding new config fields to create_vectorizer.
I'm still not convinced that this is the better approach. But I don't feel strongly about it.
RUN mkdir -p /docker-entrypoint-initdb.d && \ | ||
echo "#!/bin/bash" > /docker-entrypoint-initdb.d/configure-timescaledb.sh && \ | ||
echo "echo \"shared_preload_libraries = 'timescaledb'\" >> \${PGDATA}/postgresql.conf" >> /docker-entrypoint-initdb.d/configure-timescaledb.sh && \ | ||
chmod +x /docker-entrypoint-initdb.d/configure-timescaledb.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to add this to be able to run create extension if not exists timescaledb
I'm not sure this is correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did you need timescaledb for this pr? This is a dev image so this is fine I'm just curious
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I am actually running the migrations in tests and creating a vectorizer with scheduling config is not allowed if timescaledb is not installed.
|
||
@cached_property | ||
def _chunker(self) -> CharacterTextSplitter: | ||
return CharacterTextSplitter( | ||
separator=self.separator, | ||
separator=self.separator, # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This type: ignore
is now necessary as pyright does not know that the decorator makes the field required and complains about possibly passing None
.
447078f
to
828347f
Compare
828347f
to
e5e4614
Compare
e5e4614
to
7bfedb3
Compare
new_fields[name] = new_field | ||
else: | ||
new_fields[name] = field | ||
_cls.model_fields = new_fields |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh... I'm trying to build a sample application right now. And in this repo we use pydantic 2.9 where this works but in pydantic 2.10 this already breaks...
File "/Users/jascha/repositories/pgai/examples/discord_bot/.venv/lib/python3.12/site-packages/pgai/vectorizer/chunking.py", line 41, in <module>
@required
^^^^^^^^
File "/Users/jascha/repositories/pgai/examples/discord_bot/.venv/lib/python3.12/site-packages/pgai/vectorizer/base.py", line 97, in required
return dec(cls)
^^^^^^^^
File "/Users/jascha/repositories/pgai/examples/discord_bot/.venv/lib/python3.12/site-packages/pgai/vectorizer/base.py", line 94, in dec
_cls.model_fields = new_fields
^^^^^^^^^^^^^^^^^
AttributeError: property 'model_fields' of 'ModelMetaclass' object has no setter
I'll have to go back through this and find another way, but either way this seems brittle. Pydantic is not really designed to allow such overrides their idea is to use the typing system to infer the validation logic, overriding the types breaks with this declarative approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking a lot better. Some questions remaining but I think this is the right track.
|
||
|
||
def downgrade() -> None: | ||
op.drop_vectorizer(vectorizer_id=1, drop_all=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it would be better for this to take the target_table name and not the vectorizer_id (which would probably not be known when writing the migration). The target table should be unique and so we should be able to look up the id from that
RUN mkdir -p /docker-entrypoint-initdb.d && \ | ||
echo "#!/bin/bash" > /docker-entrypoint-initdb.d/configure-timescaledb.sh && \ | ||
echo "echo \"shared_preload_libraries = 'timescaledb'\" >> \${PGDATA}/postgresql.conf" >> /docker-entrypoint-initdb.d/configure-timescaledb.sh && \ | ||
chmod +x /docker-entrypoint-initdb.d/configure-timescaledb.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did you need timescaledb for this pr? This is a dev image so this is fine I'm just curious
# Get all fields including from parent classes | ||
params = {} | ||
for field_name, _field in self.model_fields.items(): # type: ignore | ||
if field_name != "arg_type": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the function_name field included then, how does that work?
return f", {self.arg_type} => ai.{fn_name}({format_sql_params(params)})" # type: ignore | ||
|
||
|
||
class OpenAIConfig(BaseOpenAIConfig, SQLArgumentMixin): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming (and I know naming discussions are always annoying) but why not stick to the sql convention we established and name this EmbeddingOpenAIConfig or EmbeddingConfigOpenAI? (and similar for others). The pro is that the name translation from sql->python is super easy and I think would be easier to understand. The con is that it's long.
Otherwise the translation seems a bit ad-hoc. e.g. Indexing configs have "indexing" in the name but in a different spot than the sql. Let's think about this some more
@@ -0,0 +1,227 @@ | |||
import re |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my eye this is a huge improvement from before
@@ -33,7 +38,8 @@ def into_chunks(self, item: dict[str, Any]) -> list[str]: | |||
""" | |||
|
|||
|
|||
class LangChainCharacterTextSplitter(BaseModel, Chunker): | |||
@required |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only see these used in 2 places? don't we need it on more models?
@@ -164,7 +164,7 @@ for post, embedding in results: | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we also need to add docs to adding-embedding-integration.md
This PR adds native python operations to alembic so you don't have to write SQL to create vectorizers.