Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQLAlchemy 2.0 #7583

Merged
merged 52 commits into from
Nov 21, 2023
Merged

SQLAlchemy 2.0 #7583

merged 52 commits into from
Nov 21, 2023

Conversation

smotornyuk
Copy link
Member

@smotornyuk smotornyuk commented May 11, 2023

Prepare code for SQLAlchemy v2.0.

We are using v1.4 right now, and the best part of it is the fact, that this version contains all the features that will be removed in v2.0(that's why we were able to start using it without big code changes) as well as almost all features that are available in v2.0. It means we can migrate the code staying on v1.4 and at some point in the future just change SQLAlchemy's version in the requirements.txt and everything will work like a charm(ideally).

Plan:

  • [before CKAN v2.11] Replace all deprecated features with the recommended for 2.0
  • [release CKAN v2.11] Now the last two CKAN versions are using SQLAlchemy v1.4, which means that extension maintainers, who support only the two latest CKAN versions can update their extensions to be completely compatible with SQLAlchemy v1.4 and v2.0 at the same time.
  • [after CKAN v2.11] Turn on exceptions for all deprecation warnings from SQLAlchemy. It will help us to identify anything we've missed during the first step
  • [after CKAN v2.11] Write a guide for the extension maintainers
  • [halfway between CKAN v2.11 and v2.12, ~3months after v2.11 release] Upgrade to SQLAlchemy v2.0

I expect that code changes won't be visible outside of CKAN and almost everything will be completely compatible with the existing extensions. At the moment I identified only one place, that requires breaking changes: the where key for a dictionary returned from the IDatastore.datastore_search method. SQLAlchemy v2.0 does not allow %-placeholders(WHERE x = %s). Instead it requires :-placeholders(WHERE x = :x). Here you can find details from changelog entry

@smotornyuk smotornyuk changed the title Sqlalchemy 2.0 SQLAlchemy 2.0 May 11, 2023
@smotornyuk smotornyuk marked this pull request as ready for review May 16, 2023 06:12
@wardi wardi self-assigned this May 16, 2023
@amercader amercader added this to the CKAN 2.11 milestone Oct 17, 2023
ckan/model/__init__.py Outdated Show resolved Hide resolved
ckan/model/__init__.py Outdated Show resolved Hide resolved
ckan/model/meta.py Outdated Show resolved Hide resolved
@pdelboca
Copy link
Member

pdelboca commented Nov 1, 2023

Executing this branch with harvest plugin will cause an error:

  File "/home/pdelboca/Repos/ckan/.venv/lib/python3.10/site-packages/ckanext/harvest/plugin.py", line 279, in configure
    model_setup()
  File "/home/pdelboca/Repos/ckan/.venv/lib/python3.10/site-packages/ckanext/harvest/model/__init__.py", line 50, in setup
    if not model.package_table.exists():
  File "<string>", line 2, in exists
  File "/home/pdelboca/Repos/ckan/.venv/lib/python3.10/site-packages/sqlalchemy/util/deprecations.py", line 468, in warned
    return fn(*args, **kwargs)
  File "/home/pdelboca/Repos/ckan/.venv/lib/python3.10/site-packages/sqlalchemy/sql/schema.py", line 941, in exists
    bind = _bind_or_error(self)
  File "/home/pdelboca/Repos/ckan/.venv/lib/python3.10/site-packages/sqlalchemy/sql/base.py", line 1659, in _bind_or_error
    raise exc.UnboundExecutionError(msg)
sqlalchemy.exc.UnboundExecutionError: Table object 'package' is not bound to an Engine or Connection.  Execution can not proceed without a database to execute against.

Are we expecting everything to work? For your comment I suspect yes:

I expect that code changes won't be visible outside of CKAN and almost everything will be completely compatible with the existing extensions.

So it would be nice to have some guidelines if we are gonna need to migrate or change extensions.

@smotornyuk
Copy link
Member Author

Some guidelines are already there - I added a changelog entry with all incompatible changes that I identified.
As for ckanext-harvest, it's a special case. It still defines and initializes tables everytime during the application startup instead of using database migrations. I'll see what can be done, but instead of introducing an exception for plugins that relying on undocumented logic, I'll switch ckanext-harvest to alembic migrations.

@smotornyuk
Copy link
Member Author

smotornyuk commented Nov 2, 2023

BTW, @pdelboca , It's good that you identified this problem. Can you trigger your github workflow that tests popular extensions against this branch somehow?
I'm sure I saw this initialize-table-on-startup strategy somewhere else

@@ -16,7 +16,8 @@ <h1 class="page-heading">
<ul class="user-list">
{% block users_list_inner %}
{% for user in page.items %}
<li>{{ h.linked_user(user['name'], maxlength=20) }}</li>
{# `user` is a tuple with a single item - username #}
<li>{{ h.linked_user(user[0], maxlength=20) }}</li>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be better if switching sqlalchemy versions didn't affect templates this way, can we update the view to generate the old dict format instead of the template?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sqlalchemy v2 disabled some magic that we are using here, so using old syntax is not an option. Just to be clear, page.items does not contain dictionaries. Even in master. The type is the same on master and sqlalchemy-2.0 - list[Row], where Row is some sort of tuple. In SQLAlchemy v1 this Row allowed inconsistent access to its content by named attributes. In SQLAlchemy v2 this implicit logic was removed.

In order to "generate the old dict format", the view has to be rewritten and we have to manually dictize every user that is passed to the template. While it allows us using user.name once again in templates, such change:

  • implies dictization inside view. We are usually doing it inside action, so it sounds like a wrong thing
  • breaks plugins, that expected list[Row] in template.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you said in your first bullet our usual pattern is to have view call actions that return lists/dicts to pass to templates. Passing sqlalchemy rows to templates (that now behave differently in sqlalchemy 2) is the oddity here that I think is worth fixing.

I don't imagine there are any plugin templates that depend on user being list[Row] because jinja2 treats everything as a dict-like object even when accessing attributes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot call an action that returns dictionaries in this case:)

A long time ago we had a list of dictionaries. But the user_list action does not have limit/ offset support, so it returned all the users, which caused timeouts on some portals.

Now user.index view gets a query object from the action, which is itself is pretty strange solution, but that's our reality. In order to fix it we have to add limit/offset to the user_list action. But it's not enough. This action returns a list of users, and we don't know how many users there are in order to build a pager.
we cannot use a total number of users in DB because user_list accepts q parameter that filters users by name. So we have to make two requests: one that returns a list of users inside limit/offset window and another one that returns a query, which we'll use to get the number of users matching the q for pagination widget.

And, again, the only alternative is to violate our rules and perform dictization over query results inside view.

So neither solution is perfect. That's why I am so reluctant it terms of making this change

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UPD: I added dictization to user.index endpoint. Even though it's not perfect, at least now we have comments that explain, why we are doing things in this way. And when somebody decides to update this view/action, it will save him a bit of time

Comment on lines 404 to 407
clause_str = ('"{0}" in ({1})'.format(
field,
','.join(f":{p}" for p in placeholders)
))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while making changes here let's make this sql generation from fieldnames properly safe. I would use my old identifier function but maybe there's a sa.column or something that is better now?

Copy link
Member Author

@smotornyuk smotornyuk Nov 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sa.column solves the issue. Is it worth backporting? every field itself is checked in the beginning of the method against a list of existing column names: field not in fields_types:, so it's impossible to pass an arbitrary string here. The person has to create a datastore with corresponding field first

The only edge case I can imagine - datastore field contains " in its name. I doubt that there is a possible vector for SQL ingestion, because all statements that rely on where-clauses are executed as a single query. So basically user who uploads a "dangerous" resource can only affect this exact resource's datastore table.

else:
clause: tuple[Any, ...] = (u'"{0}" = %s'.format(field), value)
placeholder = f"value_{next(idx_gen)}"
clause: tuple[Any, ...] = (f'"{field}" = :{placeholder}', {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same issue here

Copy link
Member Author

@smotornyuk smotornyuk Nov 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+

context['connection'].execute(
sql_drop_index.format(index[0]).replace('%', '%%'))
context['connection'].execute(sa.text(
sql_drop_index.format(index[0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

potentially unsafe index name formatting here, but only if a bad name was created in the datastore db by another application/db user

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed - sa.column should work here as well

@pdelboca
Copy link
Member

Let's go 🚀

@pdelboca pdelboca merged commit 8ddc2e8 into master Nov 21, 2023
7 checks passed
@pdelboca pdelboca deleted the sqlalchemy-2.0 branch November 21, 2023 13:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants