Conversation
@ashb ashb commented Dec 11, 2024

Since 2010(!) SQLite has had WAL, or Write-Ahead Log, journalling mode, which
allows multiple concurrent readers and one writer. That is more than good
enough for our "local" use.
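The reader/writer property described above can be seen with a minimal sketch in plain `sqlite3` (this is not Airflow code; the file path is illustrative):

```python
import os
import sqlite3
import tempfile

# isolation_level=None puts the connections in autocommit mode so we can
# manage transactions explicitly.
path = os.path.join(tempfile.mkdtemp(), "wal-demo.db")
writer = sqlite3.connect(path, isolation_level=None)
writer.execute("PRAGMA journal_mode=WAL")
writer.execute("CREATE TABLE t (v INTEGER)")
writer.execute("INSERT INTO t VALUES (1)")

reader = sqlite3.connect(path, isolation_level=None)

writer.execute("BEGIN IMMEDIATE")           # take the write lock
writer.execute("INSERT INTO t VALUES (2)")  # not yet committed

# The reader is not blocked by the open write transaction; it sees a
# consistent snapshot containing only the committed row.
count_during_write = reader.execute("SELECT COUNT(*) FROM t").fetchone()[0]
writer.execute("COMMIT")
count_after_commit = reader.execute("SELECT COUNT(*) FROM t").fetchone()[0]

print(count_during_write, count_after_commit)  # 1 2
writer.close()
reader.close()
```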

The primary driver for this change was the realisation that it is possible,
and that it would reduce the amount and complexity of code in
DagProcessorManager before reworking it for AIP-72 support: we have a lot of
code in the DagProcessorManager to support `if async_mode` that makes
understanding the flow complex.

Some useful docs and articles about this mode:

  • [The official docs](https://sqlite.org/wal.html)
  • [Simon Willison's TIL](https://til.simonwillison.net/sqlite/enabling-wal-mode)
  • [fly.io article about scaling read concurrency](https://fly.io/blog/sqlite-internals-wal/)

This still keeps the warning against using SQLite in production, but it
greatly reduces the restrictions on which combinations of settings can use
it. In short, when using an SQLite DB it is now possible to:

  • use LocalExecutor, including with more than 1 concurrent worker slot
  • have multiple DAG parsing processes (even before AIP-72/TaskSDK changes to
    that)

We execute the `PRAGMA journal_mode` statement every time we connect, which is
more often than strictly needed, as this is one of the few pragmas that is
persistent and a property of the DB file itself; we do it on every connection
for ease and to ensure the file is in the mode we want.
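A quick sketch of why re-issuing the pragma is belt-and-braces rather than required (plain `sqlite3`, not Airflow code): `journal_mode=WAL` is persisted in the DB file, so a fresh connection sees it without setting it.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "persist-demo.db")

conn = sqlite3.connect(path)
# Setting the mode returns the mode now in effect.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
conn.execute("CREATE TABLE t (v INTEGER)")
conn.commit()
conn.close()

# Fresh connection, no PRAGMA issued: the file is still in WAL mode.
conn2 = sqlite3.connect(path)
persisted = conn2.execute("PRAGMA journal_mode").fetchone()[0]
conn2.close()

print(mode, persisted)  # wal wal
```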

I have tested this with `breeze -b sqlite start_airflow` and by kicking off a
lot of tasks concurrently.

Will this be without problems? No, not entirely, but with the
scheduler + webserver + API server processes we've already got the case where
multiple processes are operating on the DB file. This change just makes the
best use of that, following the guidance of the SQLite project: ensuring that
only a single process accesses the DB concurrently is not a requirement
anymore!

```python
def set_sqlite_pragma(dbapi_connection, connection_record):
    cursor = dbapi_connection.cursor()
    cursor.execute("PRAGMA foreign_keys=ON")
    cursor.execute("PRAGMA journal_mode=WAL")
```
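For context, a hedged sketch of how such a pragma hook is typically registered via SQLAlchemy's connection event system (the engine URL and file path here are illustrative, not Airflow's actual configuration):

```python
import os
import tempfile

from sqlalchemy import create_engine, event, text

db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
engine = create_engine(f"sqlite:///{db_path}")

@event.listens_for(engine, "connect")
def set_sqlite_pragma(dbapi_connection, connection_record):
    # Runs for every new DBAPI connection the pool creates.
    cursor = dbapi_connection.cursor()
    cursor.execute("PRAGMA foreign_keys=ON")
    cursor.execute("PRAGMA journal_mode=WAL")
    cursor.close()

with engine.connect() as conn:
    fk = conn.execute(text("PRAGMA foreign_keys")).scalar()
    mode = conn.execute(text("PRAGMA journal_mode")).scalar()

print(fk, mode)  # 1 wal
```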
This is essentially the change, everything else is removing code that isn't needed anymore!

ashb commented Dec 11, 2024

Looks like I left a load of `validate_database_executor_compatibility` in the tests.

@ashb ashb merged commit cb74a41 into apache:main Dec 11, 2024
97 checks passed
@ashb ashb deleted the remove-sync-flag-to-dagprocessor branch December 11, 2024 13:36
potiuk commented Dec 11, 2024

This is a fantastic improvement. And it will finally make "airflow standalone" really useful for the "local experience".

ellisms pushed a commit to ellisms/airflow that referenced this pull request Dec 13, 2024
… mode (apache#44839)

jedcunningham added a commit to astronomer/airflow that referenced this pull request Dec 13, 2024
That import was removed in apache#44839, but apache#44710 wasn't up-to-date with main so
static checks there didn't fail. This simply adds it back.
jedcunningham added a commit that referenced this pull request Dec 14, 2024
ashb added a commit to astronomer/airflow that referenced this pull request Dec 16, 2024
As part of Airflow 3, DAG definition files will have to use the Task SDK for
all their classes, and anything involving running user code will need to be
de-coupled from the database in the user-code process.

This change moves all of the "serialization" work up to the
DagFileProcessorManager, using the new function introduced in apache#44898 and the
"subprocess" machinery introduced in apache#44874.

**Important Note**: this change does not remove the ability for dag processes
to access the DB for Variables etc. That will come in a future change.

Some key parts of this change:

- It builds upon the WatchedSubprocess from the TaskSDK. Right now this puts a
  nasty/unwanted dependency of the Dag Parsing code upon the TaskSDK.
  This will be addressed before release (we have talked about introducing a
  new "apache-airflow-base-executor" dist where this subprocess+supervisor
  could live, as the "execution_time" folder in the Task SDK is more a feature
  of the executor than of the TaskSDK itself).
- A number of classes that we need to send between processes have been
  converted to Pydantic for ease of serialization.
- In order to not have to serialize everything in the subprocess and
  deserialize everything in the parent Manager process, we have created a
  `LazyDeserializedDAG` class that provides lazy access to many of the
  properties needed to create/update the DAG-related DB objects, without
  needing to fully deserialize the entire DAG structure.
- Classes switched to being attrs-based for less boilerplate in constructors.
- Internal timers converted to `time.monotonic` where possible, and
  `time.time` where not; we only need the difference in seconds between two
  points, not datetime objects.
- With the earlier removal of "sync mode" for SQLite in apache#44839, the
  separate TERMINATE and END messages over the control socket can go.

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Co-authored-by: Daniel Imberman <daniel.imberman@gmail.com>
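The `LazyDeserializedDAG` idea above can be sketched roughly like this (an illustrative toy, not Airflow's actual implementation; the field names and serialized layout are hypothetical):

```python
import json
from functools import cached_property

class LazyDeserializedDAG:
    """Wrap the raw serialized dict and expose only the handful of fields
    the DB bookkeeping needs, instead of rebuilding the full DAG graph."""

    def __init__(self, data: dict):
        self._data = data  # raw serialized form; never fully deserialized

    @cached_property
    def dag_id(self) -> str:
        return self._data["dag"]["dag_id"]

    @cached_property
    def task_count(self) -> int:
        # Peek at the task list without constructing task objects.
        return len(self._data["dag"].get("tasks", []))

payload = json.loads(
    '{"dag": {"dag_id": "example_dag",'
    ' "tasks": [{"task_id": "a"}, {"task_id": "b"}]}}'
)
dag = LazyDeserializedDAG(payload)
print(dag.dag_id, dag.task_count)  # example_dag 2
```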
ashb added a commit to astronomer/airflow that referenced this pull request Dec 16, 2024
ashb added a commit to astronomer/airflow that referenced this pull request Dec 16, 2024
ashb added a commit to astronomer/airflow that referenced this pull request Dec 16, 2024
ashb added a commit to astronomer/airflow that referenced this pull request Dec 16, 2024
ashb added a commit to astronomer/airflow that referenced this pull request Dec 17, 2024
ashb added a commit to astronomer/airflow that referenced this pull request Dec 17, 2024
ashb added a commit to astronomer/airflow that referenced this pull request Dec 18, 2024
ashb added a commit to astronomer/airflow that referenced this pull request Dec 18, 2024
ashb added a commit to astronomer/airflow that referenced this pull request Dec 18, 2024
ashb added a commit to astronomer/airflow that referenced this pull request Dec 19, 2024
ashb added a commit that referenced this pull request Dec 19, 2024
As part of Airflow 3 DAG definition files will have to use the Task SDK for
all their classes, and anything involving running user code will need to be
de-coupled from the database in the user-code process.

This change moves all of the "serialization" change up to the
DagFileProcessorManager, using the new function introduced in #44898 and the
"subprocess" machinery introduced in #44874.

**Important Note**: this change does not remove the ability for dag processes
to access the DB for Variables etc. That will come in a future change.

Some key parts of this change:

- It builds upon the WatchedSubprocess from the TaskSDK. Right now this puts a
  nasty/unwanted depenednecy between the Dag Parsing code upon the TaskSDK.
  This will be addressed before release (we have talked about introducing a
  new "apache-airflow-base-executor" dist where this subprocess+supervisor
  could live, as the "execution_time" folder in the Task SDK is more a feature
  of the executor, not of the TaskSDK itself.)
- A number of classes that we need to send between processes have been
  converted to Pydantic for ease of serialization.
- In order not to have to serialize everything in the subprocess and deserialize
  everything in the parent Manager process, we have created a `LazyDeserializedDAG`
  class that provides lazy access to many of the properties needed to create or
  update the DAG-related DB objects, without needing to fully deserialize the
  entire DAG structure.
- Classes switched to being attrs-based for less boilerplate in constructors.
- Internal timers converted to `time.monotonic` where possible, and `time.time`
  where not; we only need the difference in seconds between two points, not
  datetime objects.
- With the earlier removal of "sync mode" for SQLite in #44839 the need for
  separate TERMINATE and END messages over the control socket can go.
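The `time.monotonic` point in the list above can be illustrated with a minimal sketch (not Airflow's actual timer code; the `elapsed_seconds` helper is hypothetical) of why a monotonic clock is the right tool when all you need is the number of seconds between two points:

```python
import time


def elapsed_seconds(start: float) -> float:
    """Return seconds since `start`, where `start` came from time.monotonic().

    Unlike time.time(), time.monotonic() never jumps backwards (e.g. on NTP
    adjustments), so the difference between two readings is always a
    non-negative, reliable duration.
    """
    return time.monotonic() - start


start = time.monotonic()
time.sleep(0.01)  # stand-in for real work
assert elapsed_seconds(start) > 0
```

The trade-off is that a monotonic reading is only meaningful relative to another reading from the same clock, which is fine for internal timers but is why `time.time` is still needed where a wall-clock timestamp must be stored or compared across processes.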

---------

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Co-authored-by: Daniel Imberman <daniel.imberman@gmail.com>
got686-yandex pushed a commit to got686-yandex/airflow that referenced this pull request Jan 30, 2025
… mode (apache#44839)

Since 2010(!) sqlite has had a WAL, or Write-Ahead Log, mode of journalling
which allows multiple concurrent readers and one writer. More than good enough
for us for "local" use.

The primary driver for this change was the realisation that it is possible, and
that it reduces the amount and complexity of code in DagProcessorManager before
reworking it for AIP-72 support: we have a lot of code in the
DagProcessorManager to support `if async_mode` that makes understanding the
flow complex.

Some useful docs and articles about this mode:

- [The official docs](https://sqlite.org/wal.html)
- [Simon Willison's TIL](https://til.simonwillison.net/sqlite/enabling-wal-mode)
- [fly.io article about scaling read concurrency](https://fly.io/blog/sqlite-internals-wal/)

This still keeps the warning against using SQLite in production, but it
greatly reduces the restrictions on which combinations of settings can use it.
In short, when using an SQLite db it is now possible to:

- use LocalExecutor, including with more than 1 concurrent worker slot
- have multiple DAG parsing processes (even before AIP-72/TaskSDK changes to
  that)

We execute the `PRAGMA journal_mode` every time we connect, which is more
often than strictly needed, as this is one of the few pragmas that is
persistent and a property of the DB file; we do it on every connection for
ease and to ensure the DB is in the mode we want.
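The persistence of the journal mode can be sketched with plain `sqlite3` (a generic illustration, not the actual Airflow connection hook; the file path is made up for the demo):

```python
import os
import sqlite3
import tempfile

# A throwaway on-disk database; WAL is a property of the DB *file*,
# so it does not apply to :memory: databases.
db_path = os.path.join(tempfile.mkdtemp(), "example.db")

# Setting the mode: the PRAGMA returns the mode now in effect.
conn = sqlite3.connect(db_path)
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
assert mode == "wal"

# A second connection sees the persisted mode without setting it again,
# which is why re-issuing the PRAGMA on every connect is belt-and-braces
# rather than strictly required.
conn2 = sqlite3.connect(db_path)
assert conn2.execute("PRAGMA journal_mode").fetchone()[0] == "wal"

conn.close()
conn2.close()
```

Re-running the PRAGMA on an already-WAL database is cheap, which is why doing it unconditionally on connect is a reasonable simplification.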

I have tested this with `breeze -b sqlite start_airflow` and kicking off a
lot of tasks concurrently.

Will this be without problems? No, not entirely, but due to the
scheduler+webserver+api server processes we've _already_ got the case where
multiple processes operate on the DB file. This change just makes the best
use of that, following the guidance of the SQLite project: ensuring that
only a single process accesses the DB concurrently is not a requirement
anymore!
got686-yandex pushed a commit to got686-yandex/airflow that referenced this pull request Jan 30, 2025
That import was removed in apache#44839, but apache#44710 wasn't up-to-date with main so
static checks there didn't fail. This simply adds it back.
got686-yandex pushed a commit to got686-yandex/airflow that referenced this pull request Jan 30, 2025
As part of Airflow 3, DAG definition files will have to use the Task SDK for
all their classes, and anything involving running user code will need to be
decoupled from the database in the user-code process.

This change moves all of the "serialization" work up to the
DagFileProcessorManager, using the new function introduced in apache#44898 and the
"subprocess" machinery introduced in apache#44874.

**Important Note**: this change does not remove the ability for dag processes
to access the DB for Variables etc. That will come in a future change.

Some key parts of this change:

- It builds upon the WatchedSubprocess from the TaskSDK. Right now this puts a
  nasty/unwanted dependency of the Dag Parsing code upon the TaskSDK.
  This will be addressed before release (we have talked about introducing a
  new "apache-airflow-base-executor" dist where this subprocess+supervisor
  could live, as the "execution_time" folder in the Task SDK is more a feature
  of the executor, not of the TaskSDK itself.)
- A number of classes that we need to send between processes have been
  converted to Pydantic for ease of serialization.
- In order not to have to serialize everything in the subprocess and deserialize
  everything in the parent Manager process, we have created a `LazyDeserializedDAG`
  class that provides lazy access to many of the properties needed to create or
  update the DAG-related DB objects, without needing to fully deserialize the
  entire DAG structure.
- Classes switched to being attrs-based for less boilerplate in constructors.
- Internal timers converted to `time.monotonic` where possible, and `time.time`
  where not; we only need the difference in seconds between two points, not
  datetime objects.
- With the earlier removal of "sync mode" for SQLite in apache#44839 the need for
  separate TERMINATE and END messages over the control socket can go.
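The lazy-deserialization idea behind `LazyDeserializedDAG` can be sketched generically (this is a hypothetical `LazySerializedDoc` class for illustration, not Airflow's real implementation or its field names):

```python
import json
from functools import cached_property


class LazySerializedDoc:
    """Hypothetical sketch of the lazy-access pattern: hold the raw
    serialized payload as-is, and only deserialize it the first time a
    caller actually asks for something, caching the result thereafter.
    """

    def __init__(self, raw: str):
        # Raw JSON string as received from the subprocess; no parsing yet.
        self._raw = raw

    @cached_property
    def _data(self) -> dict:
        # Full deserialization happens at most once, and only on demand.
        return json.loads(self._raw)

    @property
    def dag_id(self) -> str:
        return self._data["dag_id"]

    @property
    def task_count(self) -> int:
        return len(self._data.get("tasks", []))


doc = LazySerializedDoc('{"dag_id": "example", "tasks": [{"task_id": "t1"}]}')
assert doc.dag_id == "example"
assert doc.task_count == 1
```

The win in the parent Manager process is that documents whose properties are never consulted pay no parsing cost at all, and the raw payload can be handed straight to the DB layer.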

---------

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Co-authored-by: Daniel Imberman <daniel.imberman@gmail.com>

Labels

area:CLI area:dev-tools area:Executors-core LocalExecutor & SequentialExecutor area:providers area:Scheduler including HA (high availability) scheduler provider:celery provider:cncf-kubernetes Kubernetes (k8s) provider related issues
