Additional properties should be allowed in provider schema #13440
I've found a potential problem with provider.yaml and its schema. The next wave of providers we are going to release will not be compatible with 2.0.0, because they already have 'logo' added as an additional property in the yaml, so the providers will not be able to register themselves (schema validation will fail). Changing additionalProperties to "true" solves the problem in a forward-compatible way; however, this does not help 2.0.0 users, who won't be able to discover future providers. Taking into account, however, that 2.0.0 has a number of problems and 2.0.1 is really going to be the first 'stable' release, I think it might make sense to add a >=2.0.1 limitation to future providers (I guess that might motivate users of 2.0.0 to upgrade to 2.0.1). We could also yank the 2.0.0 release (I think we should do it regardless, given the number of small but annoying problems we have). Let me know WDYT. |
As discussed with @mik-laj -> for the time being we might also remove the new fields and fix 2.0.1; when the time comes that we yank 2.0.0 (or are confident it is not used), we can stop removing the fields and add >=2.0.1. |
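The failure mode described above can be reproduced in a few lines. This is only a sketch: the `validate` helper is a simplified stand-in for real JSON Schema validation (it is not the jsonschema library), and the trimmed-down schema and its field list are assumptions made for illustration.

```python
# Hypothetical, heavily trimmed version of a strict provider schema.
SCHEMA_2_0_0 = {
    "properties": {"package-name": {}, "name": {}, "description": {}, "versions": {}},
    "required": ["package-name", "name", "versions"],
    "additionalProperties": False,
}


def validate(provider_info: dict, schema: dict) -> list:
    """Return a list of error strings; an empty list means the dict is valid."""
    errors = [
        f"missing required field: {key}"
        for key in schema.get("required", [])
        if key not in provider_info
    ]
    # Only reject unknown keys when the schema explicitly forbids them.
    if schema.get("additionalProperties", True) is False:
        extra = sorted(set(provider_info) - set(schema["properties"]))
        errors += [f"unexpected field: {key}" for key in extra]
    return errors


new_provider = {
    "package-name": "apache-airflow-providers-apache-cassandra",
    "name": "Apache Cassandra",
    "versions": ["1.0.0"],
    "logo": "X",  # field added after the strict schema was released
}

strict_errors = validate(new_provider, SCHEMA_2_0_0)
lenient_errors = validate(new_provider, {**SCHEMA_2_0_0, "additionalProperties": True})
print(strict_errors)   # the strict schema rejects the new field
print(lenient_errors)  # the lenient schema accepts it
```

With the strict schema the only complaint is the unknown 'logo' key, which is exactly what would stop a newer provider from registering on an older Airflow.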
@ashb @kaxil @mik-laj -> I've implemented the "removal" of extra properties:
|
We could also add a similar mechanism for the 'customized_form_field_behaviour' schema, though I do not expect any changes there, and we can always add it if we decide to. |
All seems to be working :) https://github.com/apache/airflow/pull/13440/checks?check_run_id=1639652121 |
Additional properties should be allowed in the provider schema; otherwise future versions of providers will not be compatible with older versions of Airflow. By specifying 'additionalProperties' as allowed, we open up the possibility of adding more properties to provider.yaml. For now, this change fixes the problem by removing the extra fields added since the Airflow 2.0.0 schema and verifying that the 2.0.0 schema correctly validates the modified dictionary. In the future we might deprecate 2.0.0 and add a >=2.0.1 limitation to the provider packages, in which case we will be able to remove this modification of the provider_info dict. Also added an additional test that checks whether provider packages install on Airflow 2.0.0. This test might remain even after the deprecation of 2.0.0 - we can just move it to 2.0.1. It will give us much greater confidence that the providers continue to work even with older versions of Airflow 2.0. We might have to modify that test to include only the providers that are backwards-compatible, in case we have some providers that depend on future Airflow versions. For now we assume all providers should be installable from master on 2.0.0.
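A minimal sketch of the "remove extra fields" idea from the description above, assuming the old schema is available as a plain dict. The function name `remove_extra_fields`, the schema contents, and the field names are hypothetical and not taken from the actual change.

```python
def remove_extra_fields(value, schema):
    """Return a copy of `value` keeping only the keys that `schema` declares.

    Recurses into nested objects ("properties") and arrays ("items"), so
    fields added in newer provider.yaml files are dropped at every level.
    """
    if isinstance(value, dict) and "properties" in schema:
        return {
            key: remove_extra_fields(item, schema["properties"][key])
            for key, item in value.items()
            if key in schema["properties"]
        }
    if isinstance(value, list) and "items" in schema:
        return [remove_extra_fields(item, schema["items"]) for item in value]
    return value


# Hypothetical fragment of an older schema that does not know about 'logo'.
OLD_SCHEMA = {
    "type": "object",
    "properties": {
        "package-name": {"type": "string"},
        "name": {"type": "string"},
        "integrations": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"integration-name": {"type": "string"}},
            },
        },
    },
}

provider_info = {
    "package-name": "apache-airflow-providers-apache-cassandra",
    "name": "Apache Cassandra",
    "logo": "X",
    "integrations": [{"integration-name": "Apache Cassandra", "logo": "X"}],
}

cleaned = remove_extra_fields(provider_info, OLD_SCHEMA)
print(cleaned)
```

The cleaned dictionary is what would then be checked against the old, strict schema; the original `provider_info` is left untouched.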
This can cause some problems. Although it is a devel dependency, these dependencies appear in the constraints.txt file, which may make installing a new version of this library difficult in the future. Google has a slightly different policy on adding dependencies to their libraries, which seems sensible to apply to our project as well, as in some cases this is how our project should be treated.
https://github.com/googleapis/python-bigquery/blob/master/CONTRIBUTING.rst Another example is the Line 451 in 4437137
As a result, the user may have to use a version that was released 1 year, 11 months ago - 4.7.1. Line 80 in 38fbcad
The latest version is 4.9.3 (3 months ago). However, we can address this issue when we get a report about it. I just wanted to warn you against adding development dependencies if it's not necessary. |
I'm not a fan of setting additionalProperties to true in general - it means you can make silly mistakes like mistyping a field name and not get any protection. I'd like to discuss reverting this change (or at least the schema deprecation part), as my understanding is that we are changing the provider yaml schema to support something "outside" of Airflow code. |
Revert is perhaps the wrong word; what I mean is separating the two purposes more clearly, and not changing the "runtime" JSON schema. One option might be to have two YAML documents: one for provider info, one for docs. This could be two files, or it could be two "YAML documents" in the same file.
---
package-name: apache-airflow-providers-apache-cassandra
name: Apache Cassandra
description: |
  `Apache Cassandra <http://cassandra.apache.org/>`__.
# ...
---
logo: X
versions:
- 1.0.0 (Because |
We could indeed separate those two. I have no problem with that, and I think two files might be enough. Happy to make that change, and we can rather easily do it even now (I can make the change in a couple of days). But I believe 'additionalProperties' SHOULD be true if we want to validate 3rd-party custom providers (and we do) and if we want to maintain forward-compatibility. Otherwise (as in the cases we had) we shut the door to adding new features. Rather than setting 'additionalProperties' to false, I think it is much better to use "required" for the important stuff and to make fields required when they become indispensable (in which case we make a breaking change). This is very much a principle in protocols like gRPC and others: new fields in the protocol should be pretty much ignored if the other end does not know them, precisely for forward-compatibility reasons. I think this information is static enough that typos in "new" fields are not very likely to happen, and anyone using an IDE will get auto-completion when preparing the files. I would not worry about it. |
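To make the "required fields plus ignore what you do not know" principle concrete, here is a rough sketch of a tolerant reader. The `ProviderInfo` class, `read_provider_info`, and the choice of required fields are assumptions for illustration only, not the actual Airflow code.

```python
from dataclasses import dataclass, field

# Assumed set of fields an older reader cannot work without.
REQUIRED_FIELDS = ("package-name", "name", "versions")


@dataclass
class ProviderInfo:
    package_name: str
    name: str
    versions: list
    # Names of fields this reader did not understand and skipped.
    ignored_fields: list = field(default_factory=list)


def read_provider_info(raw: dict) -> ProviderInfo:
    """Fail on missing required fields, silently skip unknown ones."""
    missing = [key for key in REQUIRED_FIELDS if key not in raw]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    ignored = sorted(set(raw) - set(REQUIRED_FIELDS))
    return ProviderInfo(raw["package-name"], raw["name"], list(raw["versions"]), ignored)


info = read_provider_info(
    {
        "package-name": "apache-airflow-providers-apache-cassandra",
        "name": "Apache Cassandra",
        "versions": ["1.0.0"],
        "logo": "X",  # unknown to this reader, so it is skipped
    }
)
print(info.ignored_fields)
```

Recording the skipped names (here in `ignored_fields`) is one way to keep some visibility into typos, since a mistyped optional field would otherwise pass unnoticed.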
There are many more fields that are not used by Airflow and are only used by the documentation. It seems to me that these are all in this code block: airflow/airflow/provider.yaml.schema.json Lines 17 to 180 in 3341d21
However, splitting this into several files seems to me to cause maintenance issues. Instead, we can try to define one schema file that contains all the information and has the additionalProperties field set to false, and a second specification that is a subset of the first. The second one can be used at runtime and can be forward-compatible.
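One way to picture the two-schema idea is to derive the runtime schema from the full one. This is a sketch under assumptions: the field list, the split between runtime and docs-only fields, and the helper name `build_runtime_schema` are all hypothetical.

```python
import copy

# Hypothetical full schema: strict, describes every field including docs-only ones.
FULL_SCHEMA = {
    "type": "object",
    "properties": {
        "package-name": {"type": "string"},
        "name": {"type": "string"},
        "description": {"type": "string"},
        "versions": {"type": "array", "items": {"type": "string"}},
        "logo": {"type": "string"},  # assumed to be used only when building docs
    },
    "required": ["package-name", "name", "description", "versions"],
    "additionalProperties": False,
}

# Assumed list of fields that Airflow itself needs at runtime.
RUNTIME_FIELDS = ["package-name", "name", "description", "versions"]


def build_runtime_schema(full_schema: dict, runtime_fields: list) -> dict:
    """Build a forward-compatible subset of the full schema."""
    unknown = [name for name in runtime_fields if name not in full_schema["properties"]]
    if unknown:
        raise KeyError(f"runtime fields missing from full schema: {unknown}")
    return {
        "type": "object",
        "properties": {
            name: copy.deepcopy(full_schema["properties"][name]) for name in runtime_fields
        },
        "required": [name for name in full_schema.get("required", []) if name in runtime_fields],
        # Unknown fields from newer providers must not break older Airflow versions.
        "additionalProperties": True,
    }


runtime_schema = build_runtime_schema(FULL_SCHEMA, RUNTIME_FIELDS)
print(sorted(runtime_schema["properties"]))
```

The strict full schema would still catch typos in the repository, while the derived runtime schema stays tolerant of fields added later.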
|
I very much like the idea of @mik-laj of splitting the schemas rather than files. This should be bullet-proof for both cases and is far less maintenance. |
This is poorly supported by additional tools like the IDE. They often validate the entire file against one specification. |
@mik-laj - answering your comment here:
I think adding jsonpath-ng is pretty much equivalent to adding jsonschema - they are both part of the same set of standards that complement each other (and they both implement established and popular standards). JSON Schema and JSONPath are two sides of the same coin (same as XML Schema and XPath), and they go hand-in-hand. If we installed one, installing the other is a no-brainer.
As for old requirements - first of all, as a developer you are not supposed to install 'devel' in a released version; you should only use it with Airflow code checked out from master. And the way our constraints work, when a package has no upper bound, the constraints get updated rather quickly after a new version of that package is released and our tests in master pass, without anyone's involvement. So for such dependencies you can expect to always have the latest 'good' version in the master constraints. 'jsonpath-ng' has no upper bound, so in this case the constraints mechanism will automatically upgrade 'master' whenever a new version of 'jsonpath-ng' is released (and all tests pass). This means that in master constraints we will not get "11 months old" requirements.
This solution is really great. What our system provides is better because it figures out the "latest" set of constraints that includes all our requirements (both core and providers), tests them, and only pushes them automatically when all tests have passed. There is no other solution I looked at (and believe me, I looked at many) that can provide that, especially since dependabot does not cope well with the situation where we do not want to have fixed requirements but generate the expected set of constraints via pip automatically. None of the Google libraries has this problem. What's more, even if we want to add a patch to a released version, we will use (via constraints) the version of the devel dependency which was OK for that version.
We can still upgrade it (and this BTW also happens in the v1-10-test/stable branch), so that if we make a patch to an old version, the constraints will be updated with a possibly newer version of the dependency (but again - only after all tests pass). And if other packages/providers of ours have another limitation here, this is also fine, because we should not have any dependency that community-managed providers conflict with. It's the community's responsibility to keep both core and community-managed providers working together without conflicts.
I think our solution is state of the art, and we are solving a lot of problems with dependencies in a better way than many other applications and libraries. The fact that we are both an application and a library leads to a custom, complex solution. But I have not seen any other tool (including pip-tools or poetry) that would solve both approaches at the same time. Our solution does, and rather reliably now. |
Thinking about this more, I'm not sure we even need the "doc build only" information in these files -- we have Or it could live in the airflow-site repo. My thinking there is that it is only used for building docs, so having to make extra commits in the apache/airflow monorepo seems unnecessary given that airflow-site is what needs that information. Happy to make these changes myself if you both agree with this approach @mik-laj @potiuk. In summary:
|
The problem with this information is that it is version dependent and it MUST be in the providers folder. This information will change over time, as we add new folders, packages etc. We actually even run pre-commit checks that verify that the code is consistent with this information. |
Which information must live in the providers folder? I'm not talking about removing the provider.yaml, just removing the fields from it that are only used for building the site (i.e. at least logo, versions.) |
The checks performed start here: https://github.com/apache/airflow/blob/master/scripts/ci/pre_commit/pre_commit_check_provider_yaml_files.py#L151 They are validating a number of things to make sure that when provider.yaml file will be used, the generated documentation is consistent with the code. |
The logo has been added to this repository to make it easier to update the website. Now, when a new integration is added, it is easy to add a logo as well. In the next step, when the documentation is published, the website is also updated automatically. I hope that soon each integration will have a logo and will therefore be promoted on the website as soon as it is published. The previous website could be out of date for a very long time because updating it was a very tedious and long process, and now each contributor can only do one integration. As for the versions, deleting them will also be problematic, because they are used to generate the package list |
We can look at the folders under docs-archive -- i.e. do a listdir on https://github.com/apache/airflow-site/tree/master/docs-archive/apache-airflow-providers-apache-hive. Anyway, I think this is probably easier to look at in code than in the abstract (and yes, I may just not know enough of what is used where). I'll PR or shut up :) |
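The listdir suggestion could look roughly like the sketch below. It assumes a local checkout where each package directory under docs-archive contains one sub-directory per released version; that layout, the function name `list_published_versions`, and the version-name pattern are assumptions rather than verified details of the airflow-site repository.

```python
import os
import re

# Accept only purely numeric dotted names such as "1.0.0"; anything else is skipped.
VERSION_DIR = re.compile(r"^\d+(\.\d+)*$")


def list_published_versions(docs_archive_dir: str, package_name: str) -> list:
    """Return version directory names for a package, newest first."""
    package_dir = os.path.join(docs_archive_dir, package_name)
    versions = [
        entry
        for entry in os.listdir(package_dir)
        if VERSION_DIR.match(entry) and os.path.isdir(os.path.join(package_dir, entry))
    ]
    # Compare numerically so that 1.0.10 sorts after 1.0.2.
    return sorted(versions, key=lambda v: tuple(int(part) for part in v.split(".")), reverse=True)
```

The trade-off raised in the replies still applies: this only works when a checkout of the docs archive is available wherever the package list is generated.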
As discussed before - see #13488 with a separate runtime schema. I think it is the cleanest and best approach. I very much like the idea that all provider info is in one place and that we split out only the information that is needed at runtime. This is way better than splitting the files and a much more logical approach IMHO, especially since we can do the validation and since some information (like versions, package name, description) is shared between runtime and docs. I am not even sure why we would want to split them in this case. |
Yes and no. That would require that every time we want to build documentation, we have both repositories available. This is quite problematic because it would require some extra work from the user and the repository is quite large. Additionally, it would make the process of publishing a new package more difficult, as you would have to follow 4 steps instead of 2.
|
The same commit message appears again on a later commit (cherry picked from commit 523e2f4).