Conversation

@vikramkoka (Contributor)

Here is a very early draft PR to introduce and socialize the concept of a "common message queue" abstraction similar to the "Common SQL" and "Common IO" abstractions in Airflow.

This will be a provider package similar to those and is intended to be an abstraction over Apache Kafka, Amazon SQS, and Google Pub/Sub to begin with. It can then be expanded to other messaging systems based on community adoption.

The initial goal is to provide a simple abstraction for integrating the Event Driven Scheduling coming with Airflow 3 with message notification systems such as Kafka, which are currently used to publish data availability.

At this stage, this is very much a WIP draft intended to solicit input from the community.

Updated the Common Message Queue Readme with an example of an Event Driven Dag
Updated the message queue Operator and Sensor to fix an issue in my sync
Changed the Message Queue Sensor Operator to be a Deferrable Trigger
Fixed typos and import errors in the MsgQueueHook
@vincbeck (Contributor)

Implementation-wise, here is my thinking. I am starting with MessageQueueTrigger.

Given msg_queue, MessageQueueTrigger needs to figure out which hook it will use to poll/pop a message from the queue. Example: if msg_queue.startswith("https://sqs."): hook = SqsHook(...). Then we can use the hook to retrieve the message. The hook will contain the logic for each provider (AWS, Google, Kafka, ...). This means this new provider will have a dependency on all these providers. Do you think this is an issue? Did you have something else in mind?
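The prefix-based dispatch described above could be sketched roughly like this. This is a minimal, self-contained illustration, not the merged implementation: only the SQS prefix comes from the discussion; the Kafka scheme, the mapping, and the function name are hypothetical.

```python
# Illustrative sketch of prefix-based hook dispatch. Only the SQS prefix
# is taken from the discussion above; the Kafka scheme and all names
# are hypothetical placeholders for the real provider hooks.

QUEUE_PREFIX_TO_HOOK = {
    "https://sqs.": "SqsHook",            # Amazon SQS queue URLs
    "kafka://": "KafkaMessageQueueHook",  # hypothetical Kafka scheme
}

def hook_name_for_queue(msg_queue: str) -> str:
    """Return the name of the provider hook matching the queue URI prefix."""
    for prefix, hook_name in QUEUE_PREFIX_TO_HOOK.items():
        if msg_queue.startswith(prefix):
            return hook_name
    raise ValueError(f"No message queue provider matches {msg_queue!r}")
```

The trade-off raised above shows up directly: whichever package holds this mapping ends up depending on every provider it can dispatch to.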

Updated invocation of MsqQueueSensorTrigger to MsgQueueTrigger in example invocation
@vikramkoka (Contributor, Author)

Implementation-wise, here is my thinking. I am starting with MessageQueueTrigger.

Given msg_queue, MessageQueueTrigger needs to figure out which hook it will use to poll/pop a message from the queue. Example: if msg_queue.startswith("https://sqs."): hook = SqsHook(...). Then we can use the hook to retrieve the message. The hook will contain the logic for each provider (AWS, Google, Kafka, ...). This means this new provider will have a dependency on all these providers. Do you think this is an issue? Did you have something else in mind?

You are right, Vincent. I did think about the "composition vs. inheritance" tradeoff.

The composition-style interface as defined here is easier for the DAG author, but more maintenance for us.
I talked about this with Ash and Jed as well, and because of the underlying plumbing already present in Airflow for finding connections and the like, this seemed like a reasonable approach to make the end-user experience better.

@jscheffl (Contributor) left a comment

In general, looks good. I have some more nits on the Python code/interface, but we can leave those until the real review.

It would be great to add an example DAG as well to showcase it.

@vincbeck (Contributor)

I am iterating on that PR, but the new provider is not recognized. I get:

ModuleNotFoundError: No module named 'airflow.providers.common.msgq'

With the new restructure, what is the process to add a new provider? Do I just need to create provider.yaml and pyproject.toml and it will be automatically detected/indexed? @potiuk

@vincbeck (Contributor)

vincbeck commented Feb 21, 2025

I updated the PR. I focused only on the trigger side. Please let me know if this is what you had in mind in terms of implementation for the trigger. I really see it as a proxy for the provider triggers. I could not test it because the new provider is not recognized, but once that is solved I should be able to test it.
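The "proxy" idea might look roughly like this. This is a self-contained sketch under stated assumptions: the class names, the stand-in provider trigger, and the serialize() shape are placeholders, not the actual provider code.

```python
# Self-contained sketch of the common trigger acting as a pure proxy for
# the provider trigger it matched. All names here are placeholders.

class FakeSqsTrigger:
    """Stands in for a real provider trigger, e.g. the SQS one."""

    def serialize(self):
        # Real Airflow triggers serialize to (classpath, kwargs);
        # this mimics that shape with dummy values.
        return ("SqsTrigger", {"queue": "https://sqs.example/queue"})

class MessageQueueTrigger:
    """Holds no queue logic itself; delegates to the matched provider trigger."""

    def __init__(self, provider_trigger):
        self._provider_trigger = provider_trigger

    def serialize(self):
        # Pure delegation: the common layer is only a thin dispatch shim.
        return self._provider_trigger.serialize()
```

The design benefit is that all polling logic stays in the provider packages; the common trigger only selects and forwards.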

@potiuk (Member)

potiuk commented Feb 25, 2025

I am iterating on that PR but the new provider is not recognized. I get:

ModuleNotFoundError: No module named 'airflow.providers.common.msgq'

With the new restructure, what is the process to add a new provider? Do I just need to create provider.yaml and pyproject.toml and it will be automatically detected/indexed? @potiuk

You need to look at the main pyproject.toml and add the provider the same way as the others (there are a few places). Then uv sync or an image build should work after that.

And yes, I updated https://github.com/apache/airflow/blob/main/providers/MANAGING_PROVIDERS_LIFECYCLE.rst#creating-a-new-community-provider with the new structure and how to add a new provider, but that part is likely missing, so after you figure it out, PRs there are most welcome.

BTW, it will likely change slightly in the future as we move airflow-core and others, but it would still be great to keep it updated.

@potiuk (Member)

potiuk commented Feb 25, 2025

Generally, @vincbeck: look at everything below [dependency-groups] in the root pyproject.toml; the provider should be added in all those places.

@vincbeck (Contributor)

Thank you :D

@vincbeck vincbeck marked this pull request as ready for review February 25, 2025 21:26
@vincbeck (Contributor)

I don't understand what Sphinx is complaining about: /opt/airflow/docs/apache-airflow-providers-common-messaging/_api/airflow/providers/common/messaging/providers/base_provider/index.rst: WARNING: document isn't included in any toctree

@potiuk (Member)

potiuk commented Feb 27, 2025

I don't understand what Sphinx is complaining about: /opt/airflow/docs/apache-airflow-providers-common-messaging/_api/airflow/providers/common/messaging/providers/base_provider/index.rst: WARNING: document isn't included in any toctree

Sphinx is, as usual, speaking in riddles :) It means that there is an index.rst file generated by autoapi (for the base_provider module), and that index is not mentioned anywhere. You have to add it to some "table of contents" file and refer to it; otherwise that file is not reachable from anywhere. It also means some documentation is missing to explain what it is, usually a reference doc (see other providers).

@potiuk (Member)

potiuk commented Feb 27, 2025

I think it would be great to somehow explain that "toctree" better :)

@potiuk (Member)

potiuk commented Feb 27, 2025

Likely some documentation about base_provider should be added here https://github.com/apache/airflow/blob/f4fd6fd5ae45cd20924149aa0201d2da08a63112/providers/common/messaging/docs/providers.rst

You need to extend base_provider ble ble ble....

@potiuk (Member)

potiuk commented Feb 27, 2025

It is already there: https://github.com/apache/airflow/pull/46694/files#diff-f54feaaca8fd8ecfad946ef2cc5b389e082660ba53d305843bab44f5a014d582R36

So if the index for the __init__.py is not linked (and does not need to be linked) from anywhere, it should be excluded explicitly in "docs/conf.py" for provider package builds.

@potiuk (Member)

potiuk commented Feb 27, 2025

And yes, I reverse-engineered it when I had similar issues. It should likely be done better, so we do not have to do it manually.

@vincbeck (Contributor)

docs/conf.py

Yeah, I did not want to do that, but I think I will; I really cannot find a solution. What I do not understand is that there are a lot of modules in other providers that are not documented (for good reasons, like the utils module in amazon), but Sphinx is not complaining about those. I do not know why it complains about this new provider.

But anyway, thanks for your help; I'll add these two paths to exclude_patterns.
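For reference, the exclusion mentioned above would look roughly like this in docs/conf.py. This is a hypothetical fragment: the first path is the one flagged in the warning earlier in the thread; the second is an assumed sibling index, not confirmed by the discussion.

```python
# Hypothetical docs/conf.py fragment: explicitly exclude autoapi-generated
# index pages that nothing links to, silencing Sphinx's
# "document isn't included in any toctree" warning.
# The second path is an assumption; only the first appears in the warning.
exclude_patterns = [
    "_api/airflow/providers/common/messaging/providers/base_provider/index.rst",
    "_api/airflow/providers/common/messaging/providers/index.rst",
]
```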

@potiuk (Member)

potiuk commented Feb 27, 2025

Yeah, I did not want to do that, but I think I will; I really cannot find a solution. What I do not understand is that there are a lot of modules in other providers that are not documented (for good reasons, like the utils module in amazon), but Sphinx is not complaining about those. I do not know why it complains about this new provider.

They are likely referred to in class docstrings or elsewhere. The thing is, if your class or module is not referred to ANYWHERE, the only way to reach it is by direct URL. And that is what Sphinx complains about.

@vincbeck (Contributor)

Yeah, I did not want to do that, but I think I will; I really cannot find a solution. What I do not understand is that there are a lot of modules in other providers that are not documented (for good reasons, like the utils module in amazon), but Sphinx is not complaining about those. I do not know why it complains about this new provider.

They are likely referred to in class docstrings or elsewhere. The thing is, if your class or module is not referred to ANYWHERE, the only way to reach it is by direct URL. And that is what Sphinx complains about.

That is probably it! It makes a bit more sense now, in all that Sphinx dialect :) Thanks

@vincbeck (Contributor)

All green :) I also tested it manually and triggered a few DAGs using MessageQueueTrigger, and it works nicely. I really like the user experience: users do not have to worry about the actual implementation of the queue provider.

@potiuk potiuk merged commit ca4f094 into main Mar 1, 2025
148 checks passed
@potiuk (Member)

potiuk commented Mar 1, 2025

NICE!

@vincbeck vincbeck deleted the common-msgQ branch March 3, 2025 14:46
shahar1 pushed a commit to shahar1/airflow that referenced this pull request Mar 5, 2025
This is a provider package similar to those and is intended to be an abstraction over Apache Kafka, Amazon SQS, and Google Pub/Sub to begin with. It can then be expanded to other messaging systems based on community adoption.

The initial goal is to provide a simple abstraction for integrating the Event Driven Scheduling coming with Airflow 3 with message notification systems such as Kafka, which are currently used to publish data availability.

---------

Co-authored-by: vincbeck <vincbeck@amazon.com>
Co-authored-by: Vincent <97131062+vincbeck@users.noreply.github.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
nailo2c pushed a commit to nailo2c/airflow that referenced this pull request Apr 4, 2025