Reduce SQLAlchemy session scope to avoid stale data #2475

Merged
merged 2 commits into from
Jan 30, 2025

josh-feather
Contributor

@josh-feather josh-feather commented Jan 30, 2025

This PR changes two things:

  1. It transitions from thread-local to request-local SQLAlchemy sessions in CAPE-web to avoid stale data.
  2. It moves to use a single SQLAlchemy session per processing task inside the Pebble worker processes to avoid stale data.

The change to request-local sessions inside CAPE-web is accomplished by calling remove() on the scoped_session registry at the end of every web request. Whilst the underlying bug didn't appear to affect anything critical, it did manifest itself in the task status views (API and UI), which occasionally showed stale data after a task status had been updated on the backend.

Using request-local sessions is the recommended approach inside web apps as noted in the SQLAlchemy docs:

As discussed in the section When do I construct a Session, when do I commit it, and when do I close it?, a web application is architected around the concept of a web request, and integrating such an application with the Session usually implies that the Session will be associated with that request. As it turns out, most Python web frameworks, with notable exceptions such as the asynchronous frameworks Twisted and Tornado, use threads in a simple way, such that a particular web request is received, processed, and completed within the scope of a single worker thread. When the request ends, the worker thread is released to a pool of workers where it is available to handle another request.

This simple correspondence of web request and thread means that to associate a Session with a thread implies it is also associated with the web request running within that thread, and vice versa, provided that the Session is created only after the web request begins and torn down just before the web request ends. So it is a common practice to use scoped_session as a quick way to integrate the Session with a web application. The sequence diagram below illustrates this flow:

Web Server          Web Framework        SQLAlchemy ORM Code
--------------      --------------       ------------------------------
startup        ->   Web framework        # Session registry is established
                    initializes          Session = scoped_session(sessionmaker())

incoming
web request    ->   web request     ->   # The registry is *optionally*
                    starts               # called upon explicitly to create
                                         # a Session local to the thread and/or request
                                         Session()

                                         # the Session registry can otherwise
                                         # be used at any time, creating the
                                         # request-local Session() if not present,
                                         # or returning the existing one
                                         Session.execute(select(MyClass)) # ...

                                         Session.add(some_object) # ...

                                         # if data was modified, commit the
                                         # transaction
                                         Session.commit()

                    web request ends  -> # the registry is instructed to
                                         # remove the Session
                                         Session.remove()

                    sends output      <-
outgoing web    <-
response
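The effect of removing the session at the end of each request can be sketched with a toy model. This is a standard-library stand-in, not SQLAlchemy itself and not the code in this PR: `ToySession`, `ToyRegistry`, `handle_request`, `demo`, the `tasks` table, and the `running` status are all hypothetical names chosen for illustration. It assumes a session that caches rows it has already loaded and a registry that hands the same session back to a thread until `remove()` is called.

```python
import sqlite3
import threading


class ToySession:
    """Toy stand-in for an ORM session that caches rows by primary key."""

    def __init__(self, conn):
        self._conn = conn
        self._cache = {}  # plays the role of a per-session object cache

    def task_status(self, task_id):
        # Hit the database only the first time this session sees the row.
        if task_id not in self._cache:
            row = self._conn.execute(
                "SELECT status FROM tasks WHERE id = ?", (task_id,)
            ).fetchone()
            self._cache[task_id] = row[0]
        return self._cache[task_id]


class ToyRegistry:
    """Toy thread-local registry, loosely modelled on scoped_session."""

    def __init__(self, factory):
        self._factory = factory
        self._local = threading.local()

    def __call__(self):
        # Return this thread's session, creating it on first use.
        if not hasattr(self._local, "session"):
            self._local.session = self._factory()
        return self._local.session

    def remove(self):
        # Discard this thread's session so the next call builds a fresh one.
        if hasattr(self._local, "session"):
            del self._local.session


def handle_request(registry, task_id, request_local):
    """Simulate one web request served by the current worker thread."""
    try:
        return registry().task_status(task_id)
    finally:
        if request_local:
            registry.remove()


def demo(request_local):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT)")
    conn.execute("INSERT INTO tasks VALUES (1, 'running')")
    registry = ToyRegistry(lambda: ToySession(conn))

    first = handle_request(registry, 1, request_local)
    # The backend updates the task between the two requests.
    conn.execute("UPDATE tasks SET status = 'reported' WHERE id = 1")
    second = handle_request(registry, 1, request_local)
    conn.close()
    return first, second


print(demo(request_local=False))  # ('running', 'running'): second read is stale
print(demo(request_local=True))   # ('running', 'reported')
```

With a thread-local session that is never removed, the second request on the same thread is answered from the cache. Removing the session in a `finally` block ties its lifetime to the request, so the next request starts from a fresh read.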

The change to avoid stale data inside of the processor is achieved by removing the session after processing. The next task that runs in the same process will grab a new session and will be forced to pull new objects from the database, preventing any data inconsistencies.

The stale data manifested itself as a bug in which the task status failed to change to reported within the process function if the same worker process had previously processed the same task. Although the call that sets the status is executed inside the process function, SQLAlchemy does not issue an UPDATE statement to the database. This occurs because the task object is already present in the ORM cache from the previous job with a status of reported, so the ORM sees no change to write.
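A toy model can make the missing UPDATE concrete. It is a standard-library sketch, not SQLAlchemy and not the worker code from this PR: `ToyTask`, `ToySession`, `run_worker`, the `tasks` table, and the `pending` status are hypothetical, and the reset of the task between jobs is an assumption about how a task comes to be processed twice. The toy session only writes attributes whose value differs from what it last loaded or wrote.

```python
import sqlite3


class ToyTask:
    def __init__(self, task_id, status):
        self.id = task_id
        self.status = status
        self.loaded_status = status  # value last read from or written to the DB


class ToySession:
    """Toy session: caches objects and writes only attributes that changed."""

    def __init__(self, conn):
        self._conn = conn
        self._cache = {}

    def get(self, task_id):
        # Return the cached object if this session has already loaded the row.
        if task_id not in self._cache:
            row = self._conn.execute(
                "SELECT status FROM tasks WHERE id = ?", (task_id,)
            ).fetchone()
            self._cache[task_id] = ToyTask(task_id, row[0])
        return self._cache[task_id]

    def commit(self):
        for task in self._cache.values():
            # No difference from the cached value means no UPDATE is sent.
            if task.status != task.loaded_status:
                self._conn.execute(
                    "UPDATE tasks SET status = ? WHERE id = ?",
                    (task.status, task.id),
                )
                task.loaded_status = task.status


def run_worker(fresh_session_per_task):
    """Process task 1 twice in one worker and return its final DB status."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT)")
    conn.execute("INSERT INTO tasks VALUES (1, 'pending')")
    session = ToySession(conn)

    for job in range(2):
        task = session.get(1)
        task.status = "reported"
        session.commit()
        if fresh_session_per_task:
            # Analogous to removing the session after processing.
            session = ToySession(conn)
        if job == 0:
            # Assumption: something outside this worker resets the task.
            conn.execute("UPDATE tasks SET status = 'pending' WHERE id = 1")

    final = conn.execute("SELECT status FROM tasks WHERE id = 1").fetchone()[0]
    conn.close()
    return final


print(run_worker(fresh_session_per_task=False))  # pending: second UPDATE never sent
print(run_worker(fresh_session_per_task=True))   # reported
```

When the session is reused, the second job finds a cached object that already says `reported`, so assigning `reported` again is not a change. Discarding the session after each task forces a fresh load, and the assignment becomes a real change that is written.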

…avoid stale data

This is achieved by calling `remove()` on the `scoped_session` registry at the end of every web request. Whilst the underlying bug didn't appear to affect anything critical, it did manifest itself in the task status views (API and UI), which occasionally showed stale data after a task status had been updated on the backend.

Using request-local sessions is the recommended approach inside web apps, as noted in the SQLAlchemy [docs](https://docs.sqlalchemy.org/en/20/orm/contextual.html#using-thread-local-scope-with-web-applications).

This fixes a bug where the process function fails to update the database when setting the task status to `reported` if the Pebble worker process had previously processed the same task.

By removing the session after processing, the next task that runs in the same process will grab a new session and will be forced to pull new objects from the database, preventing data inconsistencies.
@doomedraven
Collaborator

\c @dsecuma @xiangchen96 FYI

@doomedraven doomedraven merged commit 0bc6dfd into kevoreilly:master Jan 30, 2025
3 checks passed