Support for Python 3.11 for Google Provider (upgrading all dependencies) #27292
Python 3.11 was released as scheduled on October 25, 2022, and this is the first attempt to see how far Airflow (mostly its dependencies) is from being ready to officially support 3.11. So far we had to exclude the following dependencies:

- [ ] Pyarrow dependency: apache/arrow#14499
- [ ] Google Provider: #27292 and googleapis/python-bigquery#1386
- [ ] Databricks Provider: databricks/databricks-sql-python#59
- [ ] Papermill Provider: nteract/papermill#700
- [ ] Azure Provider: Azure/azure-uamqp-python#334 and Azure/azure-sdk-for-python#27066
- [ ] Apache Beam Provider: apache/beam#23848
- [ ] Snowflake Provider: snowflakedb/snowflake-connector-python#1294
- [ ] JDBC Provider: jpype-project/jpype#1087
- [ ] Hive Provider: cloudera/python-sasl#30

We might eventually decide to release Airflow for 3.11 with those providers disabled if they are still lagging behind, but for the moment we want to work with all the projects in concert so we can release all providers (the Google provider requires quite a lot of work, and likely the Google team stepping up and the community helping with the migration to the latest Google Cloud libraries).
Upgrading dependencies for the Google provider package can be tested with the Airflow System Tests and the CI that is under construction at the moment. FYI @bhirsz
It seems that python-bigquery-sqlalchemy already supports Python 3.11.
It seems that google-api-python-client also supports 3.11.
I will make a round of rebase/check again :)
Cool. Time to try the 3.11 build again then.
I didn't find a working constraint of
I've opened googleapis/python-aiplatform#2006, as it appears 3.11 support had not been requested there yet. @potiuk A question for you that I've wondered about after chasing down a few of these updates: has there been any thought given to breaking the Google provider apart into extras?

Rationale: it would allow users of the Google provider to pick and choose which sub-features they want to use, pulling in fewer dependencies. It would also let us leave certain sub-features behind in case Google supports them less or deprecates them (as they've been known to do).
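A minimal sketch of what that extras idea might look like at the packaging level - the extra names, package lists, and version pins below are all hypothetical, not the actual provider metadata:

```python
# Hypothetical packaging sketch only: extra names, packages, and pins
# are illustrative, not the real apache-airflow-providers-google setup.
from setuptools import setup

setup(
    name="apache-airflow-providers-google",
    install_requires=[
        # core dependencies every sub-feature needs
        "google-api-core>=2.11.0",
        "google-auth>=2.0.0",
    ],
    extras_require={
        # users opt in only to the services they actually use
        "bigquery": ["google-cloud-bigquery>=3.4.0"],
        "ads": ["google-ads>=20.0.0"],
        "vertex-ai": ["google-cloud-aiplatform>=1.22.0"],
    },
)
```

Installation would then look like `pip install "apache-airflow-providers-google[bigquery,ads]"`, with users who skip an extra never pulling in its dependencies.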
Not only thoughts. There is an issue for actually splitting the provider: #15933 - but that one is complex because of the common parts, so such a split provider would be difficult to maintain (we learned a lot about this when we added common.sql). However, when it comes to extras, that could indeed be a better solution. I had not thought about it, but it might actually make things much easier for users and would let us pick and choose which extras in the Google provider we want enabled for which Python versions. We even already have AirflowOptionalProviderFeatureException, which would be nice in this case - we could throw an appropriate error explaining that this or that extra is needed for this or that module (see the sketch below). I think I like this idea better than splitting the provider. But I have to think a bit about it; at first glance it looks like an easy solution to this problem.
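A minimal sketch of the guard pattern described above, assuming a hypothetical `ads` extra (the `GoogleAdsClient` import path is the real google-ads one; the extra name and message are illustrative):

```python
# Sketch: fail with a helpful error when an optional extra is missing.
from airflow.exceptions import AirflowOptionalProviderFeatureException

try:
    from google.ads.googleads.client import GoogleAdsClient
except ImportError:
    raise AirflowOptionalProviderFeatureException(
        "The google-ads library is not installed. Install the provider "
        "with the (hypothetical) 'ads' extra: "
        "pip install 'apache-airflow-providers-google[ads]'"
    )
```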
The issue with splitting the provider is mostly that no one from Google has picked it up. Once someone picks it up and starts working on it, we will be able to overcome the technical difficulties. We don't know yet how the provider will be split, but we do know it must be done.
I am not so sure. Actually, using extras might be a far simpler approach that solves most of the pain of getting all the libraries in, I think, without introducing the huge hassle of extracting common code and using it from multiple "google" providers. If we do split the Google provider, the maintenance pain of common.sql will absolutely pale in comparison to the problems we are going to have - and there were at least 4 or 5 traps in the common-code extraction and maintenance that were really painful to protect against and fix.

If we find a way to solve most of the user problems around dependencies with extras, as suggested by @r-richmond (which I think is actually possible), then I see no reason to split the provider, to be honest. Splitting the Google provider would be a massive undertaking, and if we do it, it will take us more than a few iterations on multiple providers to solve the teething problems we will not foresee when splitting. Those problems will keep coming back for as long as the common part of the Google provider keeps evolving - we will keep breaking things in older versions of the "specific" providers whenever we release new common code. It is all but given that this will happen, and we have almost no way to protect against it. Look how small the common.sql "API" surface was and how many problems we had:
Not all of those, but most, were directly caused by the decision to extract common code for a number of SQL operators. And the main reason those errors affected users is that there is no way to test a new release of "common" code against all possible releases of all possible providers that use it. You can at most test, semi-thoroughly, the latest versions of the providers and the common code together. This is what we do.

That's why splitting the Google provider is SCARY: you would have an order of magnitude more of similar problems, and we would have no way to avoid them. And even more: the Google common code will keep evolving at a much faster rate than the common.sql code. Our problems with common.sql stopped the moment it stopped changing, but the Google common code will never stop changing. So the decision about splitting the Google provider is not as "light" as you think, and that's why I am very, very sceptical about splitting it (otherwise I would have done it myself a long time ago).

Of course, using extras does not solve "all" problems - but I think it solves most. It won't solve the case where you would like to use one provider version for one Google service and a different version for another. But, to be honest, if we get to the point where someone needs to do that, then we have a bigger issue, and this is one of those problems that creates more issues than it solves. I would very strongly prefer the situation where users have to modify their DAGs for Google if they want to (for example) use new features from another service. Yes, it's a bit of pain for them - but far, far less pain for everyone else (including them) in the future, when incompatibilities in the common code would cause even more problems.
I don't fully agree, and I don't think it's the same case. Back to the Google case: we are not adding anything new. This is more about re-organizing the existing code. To me it seems the main reason it's not split is the common folder, which is used by almost all of the Google space and would be hard to break apart into individual providers. However, this folder is not changing that much. Check the commit history: when it does change, most of the commits are about styling.
Those are all non-styling, potentially breaking changes to the common part of Google. It seems we have a substantial change in it almost every month.
I'm experiencing some issues upgrading the google-ads Python package. Version 18 has been deprecated since the beginning of this week, and higher versions require protobuf > 4.5.x. Is the google-cloud-secret-manager dependency still needed, or could it easily be upgraded to the newer 2.x versions?
I'm sure it is still needed. I'd recommend trying to upgrade that package first in a separate PR. FWIW, I've had several of these situations where I want package A upgraded but have to do packages B & C first.
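Not from the thread, but one stdlib-only way to find those "packages B & C" blockers - here, everything in the current environment that pins protobuf - is a quick scan of the installed distribution metadata:

```python
# List every installed distribution that declares a protobuf requirement,
# to see what must be upgraded before google-ads can move to newer protobuf.
from importlib.metadata import distributions

for dist in distributions():
    for req in dist.requires or []:
        # requirement strings look like "protobuf (>=3.19.5,<5.0.0); ..."
        if req.split(";")[0].strip().lower().startswith("protobuf"):
            print(f"{dist.metadata['Name']}: {req}")
```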
There is a WIP from the Google team to upgrade the SDKs: #30067
@potiuk Given #30067 (comment), I was curious whether there have been any additional conversations around extras vs. a provider breakout. (I have a small preference for extras since it seems easier/faster to implement, given the conversations above.)
No - no discussions. And I think they are not needed. I personally think that once we get it updated now and keep updating to new versions (which should happen pretty much automatically as soon as we remove pretty much all the upper-bound dependency pins), the problem will all but disappear. The vast majority of the problem came from the fact that we were half before and half after a huge backwards-incompatible change introduced across all Google Python libraries some 4 years ago. #30067 puts that dichotomy to an end.

I am actually going to actively chase down and remove all the upper-bound limitations that we have elsewhere, because IMHO this is the only way we can keep our sanity long term (see the sketch below for the shape of that change). We already have a system in place that checks whether there are breaking changes in deps released on main, and for a long time we have been faster to detect and fix them than anyone else - see for example this issue from today, where our canary builds detected, and we fixed, an alembic incompatibility before the first user reported it to us: #31313. With Google eventually implementing (discussions on this are in progress) a System Dashboard similar to the Amazon one, we will get even further than that, because we will start detecting errors that impact working with the actual Google Cloud services.
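To illustrate the kind of change meant by removing upper-bound limitations, here is a small sketch using the `packaging` library; the package names and version pins are examples, not the provider's real dependency list:

```python
# Illustrative sketch: flag upper-bounded requirements in a dependency
# list - the kind of pins proposed for removal above. An upper-bounded
# pin silently blocks new major releases; a lower-bound-only pin lets
# canary builds pick them up and surface breakage early.
from packaging.requirements import Requirement

deps = [
    "google-cloud-bigquery>=2.31.0,<3.0.0",  # upper-bounded: blocks new majors
    "google-api-core>=2.11.0",               # lower bound only: stays current
]

for dep in deps:
    req = Requirement(dep)
    capped = any(spec.operator in ("<", "<=", "==", "~=") for spec in req.specifier)
    print(f"{req.name}: {'upper-bounded' if capped else 'open-ended'}")
```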
Having all that in place, I really do not see the need to split Google at all - maybe the extras will save a bit of space when installing the provider, but there will be very little need to split it, IMHO.
Just to explain why: there was a bit of a story a few days back about Amazon going back to a smarter monolith from microservices (https://thenewstack.io/return-of-the-monolith-amazon-dumps-microservices-for-video-monitoring/), and this goes hand-in-hand with my observations (and the reason why we still have a monorepo for Airflow and providers). Splitting up into pieces looks cool, but in a number of cases it is not a "golden bullet": while it adds isolation and decouples stuff, when there are hidden couplings it might bring way more cost than benefit - maintaining and solving the problems that come up with such a split might easily cost more than the potential gains. So once we get rid of the root cause of the problem (which in fact was not very related to the internal Google package structure, but more to the fact that we had a "half-baked" cake), we should carefully weigh the needs and costs of any splitting approach and see whether any of it is needed. IMHO we should not discuss
Makes 💯 sense to me
Yes, my main interest stems from the desire to save space and, more importantly, to ignore the Google libraries I don't use - particularly the ones that lag behind on Python versions and other dependency updates.
I know it is early (Python 3.11 was just released yesterday), but in Apache Airflow we are hoping for a much faster cycle of adding new Python releases - especially since Python 3.11 brings huge performance improvements (25% is the average number claimed) thanks to a very focused effort to increase single-threaded Python performance (the specializing adaptive interpreter being the core of it, but also many other improvements) without users having to change any of their Python code.
The Google provider will be a huge drag on Airflow's compatibility with Python 3.11, and we might even decide to release Airflow without Google provider support for 3.11, though it would be great to avoid that.
The Google provider (as originally mentioned in #12116) still has a number of old Google Cloud libraries < 2.0.0 that will certainly not get proper 3.11 support. Support for Python 3.11 also has to be added to libraries such as bigquery (but this is external to the provider and tracked in googleapis/python-bigquery#1386).
Nice summary of Py3.11 support is here: https://pyreadiness.org/3.11/ - it's not very green obviously, but I hope it gets greener soon.
I just opened the PR to add 3.11 support yesterday and plan to keep it open until it gets green :)
#27264
I think it would be fantastic if we could work out all the problems and migrate all the old dependencies:
Looking forward to cooperation on that one :)