Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Additional properties should be allowed in provider schema #13440

Conversation

potiuk
Copy link
Member

@potiuk potiuk commented Jan 2, 2021

The additional properties should be allowed in provider schema,
otherwise future version of providers will not be compatible with
older versions of Airflow.

Specifying 'additionalProperties' as allowed we are opening up to
adding more properties to provider.yaml.

This change fixes this is for now by removing extra fields
added since the Airlow 2.0.0 schema and verifying that the 2.0.0
schema correctly validates such modified dictionary.

In the future we might deprecate 2.0.0 and add >=2.0.1 limitation
to the provider packages in which case we will be able to remove
this modification of the provider_info dict.

Also added additional test for provider packages whether they
install on Airflow 2.0.0. This tests might remain even after the
deprecation of 2.0.0 - we can just move it to 2.0.1. However this
will give us much bigger confidence that the providers will
continue work even for older versions of Airflow 2.0.

We might have to modify that test and only include the providers
that are backwards-compatible, in case we have some providers
that depend on future Airflow versions. For now we assume
all providers should be installable from master on 2.0.0.


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

@potiuk potiuk added this to the Airflow 2.0.1 milestone Jan 2, 2021
@potiuk potiuk requested review from mik-laj, ashb and kaxil January 2, 2021 16:21
@potiuk
Copy link
Member Author

potiuk commented Jan 2, 2021

I've found a potential problem with providers.yaml and schema. Seems that providers that next wave of providers we are going to release will not be compatible with 2.0.0 because they already have 'logo' added as additional property in yaml and providers will not be able to register themselves (schema will fail validation)

Changing additionalProeperties to "true" solves the problem in 'forward-compatible' way, however this does not help for 2.0.0 users who won't be able to discover future providers.

Taking into account, however that 2.0.0 has a number of problems and 2.0.1 is going to be really the first 'stable` release I think, it might make sense to add >=2.0.1 limitation in the future providers (I guess that might motivate users of 2.0.0 to upgrade to 2.0.1). We could also yank 2.0.0 release (I think we should do it regardless with the number of small, but annoying problems we have).

Let me know WDYT.

@potiuk
Copy link
Member Author

potiuk commented Jan 2, 2021

As discussed with @mik-laj -> we might also for the time being remove the new fields + fix 2.0.1 and when the time comes we yank (or are confident 2.0.0 is not used) we can stop removing the fields and add >= 2.0.1

@potiuk potiuk force-pushed the additional-properties-in-provider-schemas-allowed branch 3 times, most recently from 003af46 to b805e14 Compare January 3, 2021 13:06
@potiuk
Copy link
Member Author

potiuk commented Jan 3, 2021

@ashb @kaxil @mik-laj -> I've implemented the "removal" of extra properties:

  • i remove the extra fields added since 2.0.0 (I am using jsonpath-ng for that - very neat library that implements jsonpath and we can easily specify the path to filter out from Dictionary object (in case of logo simply integrations..logo does the trick).
  • when preparing provider packages I check if provider info modified vlidates with the old schema 2.0 (stored in deprecated_schemas)
  • I added test in CI to install and import all classes from master providers using 2.0.0 airflow rather than master version. This will help us to validate if providers are still installable with 'old' Airflow 2.0.0. That's actually pretty nice as we will be pretty sure that at least at the import and dependencies level we got things right in master. If we decide to deprecate 2.0.0 we can move it to 2.0.1

@potiuk potiuk force-pushed the additional-properties-in-provider-schemas-allowed branch from b805e14 to f581f72 Compare January 3, 2021 13:15
@potiuk
Copy link
Member Author

potiuk commented Jan 3, 2021

We could also add similar mechansm for 'customized_form_field_behaviour' schema., though I do not expect any changes there, and we can always add it if we decide to.

@potiuk
Copy link
Member Author

potiuk commented Jan 3, 2021

@potiuk potiuk force-pushed the additional-properties-in-provider-schemas-allowed branch from f581f72 to 15ce958 Compare January 3, 2021 20:51
The additional properties should be allowed in provider schema,
otherwise future version of providers will not be compatible with
older versions of Airflow.

Specifying 'additionalProperties' as allowed we are opening up to
adding more properties to provider.yaml.

This change fixes this is for now by removing extra fields
added since the Airlow 2.0.0 schema and verifying that the 2.0.0
schema correctly validates such modified dictionary.

In the future we might deprecate 2.0.0 and add >=2.0.1 limitation
to the provider packages in which case we will be able to remove
this modification of the provider_info dict.

Also added additional test for provider packages whether they
install on Airflow 2.0.0. This tests might remain even after the
deprecation of 2.0.0 - we can just move it to 2.0.1. However this
will give us much bigger confidence that the providers will
continue work even for older versions of Airflow 2.0.

We might have to modify that test and only include the providers
that are backwards-compatible, in case we have some providers
that depend on future Airflow versions. For now we assume
all providers should be installable from master on 2.0.0.
@potiuk potiuk force-pushed the additional-properties-in-provider-schemas-allowed branch from 15ce958 to 007bab7 Compare January 3, 2021 22:43
@mik-laj
Copy link
Member

mik-laj commented Jan 4, 2021

I am using jsonpath-ng for that - very neat library that implements jsonpath and we can easily specify the path to filter out from Dictionary object (in case of logo simply integrations..logo does the trick

This can cause some problems. Although it is a devel dependency, these dependencies appear in the constraints.txt file, which may make installing a new version of this library difficult in the future. Google has a slightly different policy on adding dependencies to their libraries, which seems sensible to apply to our project as well, as in some cases this is how our project should be treated.

The feature must not add unnecessary dependencies (where "unnecessary" is of course subjective, but new dependencies should be discussed).

https://github.com/googleapis/python-bigquery/blob/master/CONTRIBUTING.rst

Another example is the beautifulsoup4 library, which even has version requirements in setup.py

'beautifulsoup4~=4.7.1',

As a result, the user may have to use a version that was released 1 year, 11 months ago - 4.7.1.
beautifulsoup4==4.7.1

Latest version is 4.9.3 (3 months ago)

However, we can address this issue when we get a report about it. I just wanted to warn you against adding development dependencies if it's not necessary.

@potiuk potiuk merged commit 523e2f4 into apache:master Jan 4, 2021
@potiuk potiuk deleted the additional-properties-in-provider-schemas-allowed branch January 4, 2021 10:15
@ashb
Copy link
Member

ashb commented Jan 4, 2021

Seems that providers that next wave of providers we are going to release will not be compatible with 2.0.0 because they already have 'logo' added as additional property in yaml and providers will not be able to register themselves (schema will fail validation)

logo property is not relevant to runtime code at all -- that is only used for building our docs/website. This change (which i have no problem with in principle) is caused by using one file for two purposes.

I'm not fan of setting additionalProperties to true in general -- it means you can make silly mistakes like typoing a field name and not get any protection.

I'd like to discus reverting this change (or at least the schema deprecation part), as my understanding of this is we are changing the provider yaml schema to support something "outside" of Airflow code.

@ashb
Copy link
Member

ashb commented Jan 4, 2021

Revert is perhaps the wrong word -- but separating the two purposes more clearly -- and not changing the "runtime" json schema.

One option might be to have two YAML documents - one for provider info, one for docs. This could be two files, or it could be two "YAML documents" in the same file.

---
package-name: apache-airflow-providers-apache-cassandra
name: Apache Cassandra
description: |
    `Apache Cassandra <http://cassandra.apache.org/>`__.

# ...
---
logo: X
versions:
  - 1.0.0

(Because versions is also not a "runtime" property, but docs only

@potiuk
Copy link
Member Author

potiuk commented Jan 4, 2021

We could indeed separate those two. I have no problem with that, and I think two files might be enough. Happy to make that change. And we can rather easily change that even now (I can make the change in couple of days).

But I believe 'additionalProperties' SHOULD be true. If we want to validate 3rd-party customer providers (we want) if we want to maintain forward-compatibility. Otherwise (as in the cases we had) we shut the door to adding new features.

Rather than 'additionalProperties' set to false, I think much better are "required" for the important stuff and making things require when they become indespensible (in which case we make a breaking change and make other fields required). This is very much principle in any kind of protocols like GRPC and others that new fields in the protocol should be pretty much ignored if the other end does not know them, precisely for forward-compatibility reason.

I think this information is static enough that typos in "new" fields are not very likely to happen, and for anyone using IDEs, they will get auto-completion when preparing the files. I would not worry about it.

@mik-laj
Copy link
Member

mik-laj commented Jan 4, 2021

There are many more columns that are not used by Airflow but are only used by documentation. It seems to me that these will be all on this code block.

"versions": {
"description": "List of available versions in Pypi. Sorted descending according to release date.",
"type": "array",
"items": {
"type": "string"
}
},
"integrations": {
"description": "List of integrations supported by the provider.",
"type": "array",
"items": {
"type": "object",
"properties": {
"integration-name": {
"type": "string",
"description": "Name of the integration."
},
"external-doc-url": {
"type": "string",
"description": "URL to external documentation for the integration."
},
"how-to-guide": {
"description": "List of paths to how-to-guide for the integration. The path must start with '/docs/'",
"type": "array",
"items": {
"type": "string"
}
},
"logo": {
"description": "Path to the logo for the integration. The path must start with '/integration-logos/'",
"type": "string"
},
"tags": {
"description": "List of tags describing the integration. While we're using RST, only one tag is supported per integration.",
"type": "array",
"items": {
"type": "string",
"enum": [
"apache",
"aws",
"azure",
"gcp",
"gmp",
"google",
"protocol",
"service",
"software",
"yandex"
]
},
"minItems": 1,
"maxItems": 1
}
},
"additionalProperties": true,
"required": [
"integration-name",
"external-doc-url",
"tags"
]
}
},
"operators": {
"type": "array",
"items": {
"type": "object",
"properties": {
"integration-name": {
"type": "string",
"description": "Integration name. It must have a matching item in the 'integration' section of any provider."
},
"python-modules": {
"description": "List of python modules containing the operators.",
"type": "array",
"items": {
"type": "string"
}
}
},
"additionalProperties": true,
"required": [
"integration-name",
"python-modules"
]
}
},
"sensors": {
"type": "array",
"items": {
"type": "object",
"properties": {
"integration-name": {
"type": "string",
"description": "Integration name. It must have a matching item in the 'integration' section of any provider."
},
"python-modules": {
"description": "List of python modules containing the sensors.",
"type": "array",
"items": {
"type": "string"
}
}
},
"required": [
"integration-name",
"python-modules"
],
"additionalProperties": true
}
},
"hooks": {
"type": "array",
"items": {
"type": "object",
"properties": {
"integration-name": {
"type": "string",
"description": "Integration name. It must have a matching item in the 'integration' section of any provider."
},
"python-modules": {
"description": "List of python modules containing the hooks.",
"type": "array",
"items": {
"type": "string"
}
}
},
"additionalProperties": true,
"required": [
"integration-name",
"python-modules"
]
}
},
"transfers": {
"type": "array",
"items": {
"type": "object",
"properties": {
"how-to-guide": {
"description": "Path to how-to-guide for the transfer. The path must start with '/docs/'",
"type": "string"
},
"source-integration-name": {
"type": "string",
"description": "Integration name. It must have a matching item in the 'integration' section of any provider."
},
"target-integration-name": {
"type": "string",
"description": "Target integration name. It must have a matching item in the 'integration' section of any provider."
},
"python-module": {
"type": "string",
"description": "List of python modules containing the transfers."
}
},
"additionalProperties": true,
"required": [
"source-integration-name",
"target-integration-name",
"python-module"
]
}
},

However, splitting this into several files seems to me to cause maintenance issues, but we can try to define one schema file that will contain all the information and have additionalProperties field set to false and a second specification that will be a subset of the first. This one can be used in runtime and can be forward-compatible.

@potiuk
Copy link
Member Author

potiuk commented Jan 4, 2021

I very much like the idea of @mik-laj of splitting the schemas rather than files. This should be bullet-proof for both cases and is far less maintenance.

@mik-laj
Copy link
Member

mik-laj commented Jan 4, 2021

it could be two "YAML documents" in the same file.

This is poorly supported by additional tools like the IDE. They often validate the entire file against one specification.

@potiuk
Copy link
Member Author

potiuk commented Jan 4, 2021

@mik-laj - answering your comment here:

This can cause some problems. Although it is a devel dependency, these dependencies appear in the constraints.txt file, which may make installing a new version of this library difficult in the future. Google has a slightly different policy on adding dependencies to their libraries, which seems sensible to apply to our project as well, as in some cases this is how our project should be treated.

I think adding jsonpath-ng is pretty much equivalent of adding jsonschema - they are both part of the same set of standards that complement each other (and they both implement established and popular standards). JsonSchema and JSonPath are different side of the same coin (same as XMLSchema and XMLPath). And they go hand-in-hand. If we installed one, installing the other is no brainer.

As far as old requirements - first of all, as a developer you are not supposed to install 'devel` in a released version, you only should use in airflow code checked out from master. And the way our constraints work that when package gets no upper bound, the constraints will get updated rather quickly after new version of that package gets released and our tests in master pass. Without anyone's involvement. So what you can expect for such dependencies you will have always latest 'good' version in the master constraints.

The 'jsonpath-ng' has no upper bound, so in this cases constraints mechanism will automatically upgrade "master' whenever the 'jsonpath-ng' new version will be released (and all tests pass). Which means that in master constraints we will not get "11 months old" requirements.

This is solution is really great - we have not only latest but also tested version of the software (And we have all dependencies tested in sync together). Usually you can get only latest when you use poetry or pip-tools, or single latest version PR when you use dependabot.

What our system provides is better because it figures the "latest" set of constraints that includes all our requirements (both core and providers) and figures out the latest "set of those", Tests them, and only pushes them automatically when all tests passed. There is no other solution I looked at (and believe me I looked at many) that can provide that. Especially that dependabot does not cope well with the situation that we do not want to have fixed requirements but we generate the expected set of constraints via PIP automatically. None of the google libraries has this probl,

What's even more, even if we want to add a patch to a released version, we will use (by using constraints) the version of devel dependency which was OK for that version. We can still upgrade it (and this BTW also happens in v1-10-test/stable branch) so that if we make a patch to an old version, new constraints (with possibly updated version of the dependency) will be updated (but again - only after all tests pass). And if other packages/providers of our have another limitation here, this is also fine, because we should not have any dependency that community-managed providers should conflict with. It's community responsibility to keep both core and community managed providers working together without conflicts.

Another example is the beautifulsoup4 library, which even has version requirements in setup.py

The beautifulsoup4==4.7.1 is only that old in constraints because somebody put it like that in the devel requirements and it should be fixed. There is no need why devel requirements should have any limitations unless there is a very good reason for (it and there is a condition in the comment on when we will be able to remove the limitation). I have no idea who and how added it this way, and I hope it can be removed (but I have no idea who and how added it - maybe you can try to fix that separately)?

I think our solution is state of the art and we are solving a lot of problems with dependencies in a better way than many of other applications and librarieres. And the fact that we are both - applications and library, leads to custom, complex solution. But I have not seen any other (including pip-tools, or poetry) that would solve both approaches at the same time. Our solution does and rather reliably now.

@ashb
Copy link
Member

ashb commented Jan 5, 2021

Thinking about this more, I'm not sure we even need the "doc build only" information in these files -- we have docs/apache-airflow-providers-$x/ where we could store this information.

Or it could live in the airflow-site repo. My thinking there is that it is only used for building docs, so having to make extra commits in the apache/airflow monorepo seems un-necessary given airflow-site is what needs that information.

Happy to make these changes myself if you both agree with this approach @mik-laj @potiuk.

In summary:

  • Have providers.yaml file only be what is needed at runtime for Airflow
  • Move the information used to build the docs/site to somewhere in to the airflow-site repo (versions, logo, etc)

@potiuk
Copy link
Member Author

potiuk commented Jan 5, 2021

The problem with this information that it is version dependendent and ti MUST be in the providers folder. This information will change over time, when we add new folders, packages etc. We actually even run pre-commit checks that verify if the code is consistent with this information.

@ashb
Copy link
Member

ashb commented Jan 5, 2021

Which information must live in the providers folder? I'm not talking about removing the provider.yaml, just removing the fields from it that are only used for building the site (i.e. at least logo, versions.)

@potiuk
Copy link
Member Author

potiuk commented Jan 5, 2021

Which information must live in the providers folder? I'm not talking about removing the provider.yaml, just removing the fields from it that are only used for building the site (i.e. at least logo, versions.)

The checks performed start here: https://github.com/apache/airflow/blob/master/scripts/ci/pre_commit/pre_commit_check_provider_yaml_files.py#L151

They are validating a number of things to make sure that when provider.yaml file will be used, the generated documentation is consistent with the code.

@mik-laj
Copy link
Member

mik-laj commented Jan 5, 2021

Which information must live in the providers folder? I'm not talking about removing the provider.yaml, just removing the fields from it that are only used for building the site (i.e. at least logo, versions.)

The logo has been added to this repository to make it easier to update the website. Now as new integration is added it is easy to add a logo as well. In the next step, when the documentation is published, the website is also updated automatically. I hope that soon each integration will have a logo and will therefore be promoted on the website as soon as it is published The previous website could be out of date for a very long time because it was a very tedious and long process, and now each contributor can only do one integration.
See: #13110

As for the versions, deleting them will also be problematic, because it is used to generate the package list
http://apache-airflow-docs.s3-website.eu-central-1.amazonaws.com/docs/apache-airflow-providers/packages-ref.html
Deleting this field will mean that we need to figure out a way to get the version information from airflow-site in this repository. Currently, we only sync information from airflow to airflow-site.

@ashb
Copy link
Member

ashb commented Jan 5, 2021

As for the versions, deleting them will also be problematic, because it is used to generate the package list
http://apache-airflow-docs.s3-website.eu-central-1.amazonaws.com/docs/apache-airflow-providers/packages-ref.html
Deleting this field will mean that we need to figure out a way to get the version information from airflow-site in this repository. Currently, we only sync information from airflow to airflow-site.

We can look at the folders under docs-arcivie -- i.e. do a listdir on https://github.com/apache/airflow-site/tree/master/docs-archive/apache-airflow-providers-apache-hive.

Anyway, I think this is probably easier to look at in code than abstract (and yes, I may just not know enough of what is used where). I'll PR or shut up :)

@potiuk
Copy link
Member Author

potiuk commented Jan 5, 2021

As discussed before - the #13488 with separate runtime schema. I think it is cleanest and best approach. I very much like the idea that all provider info is in one place and then we can split out only the information that is need at runtime. This is way better than splitting off the files and much more logical approach IMHO, especially that we can do the validation and that some information (like versions, package name, description ) are shared between runtime and doc.

I am not sure even why we would want to split them in this case.

@mik-laj
Copy link
Member

mik-laj commented Jan 5, 2021

We can look at the folders under docs-arcivie -- i.e. do a listdir on https://github.com/apache/airflow-site/tree/master/docs-archive/apache-airflow-providers-apache-hive.

Yes and no. That would require that every time we want to build documentation, we have that two repositories available. This is quite problematic because it would require some extra work from the user and the repository is quite large. Additionally, it would make the process of publishing a new package more difficult, as you would have to follow 4 steps instead of 2.

  1. build documentation for all new packages.
  2. copy files to airflow-site repository.
  3. build documentation with the package list.
  4. Copy files for one documentation package only.
    This adds so much complexity that I already preferred to keep this information in provider.yaml. This is a much lower cost than maintaining a documentation build process that requires two repositories.

kaxil pushed a commit that referenced this pull request Jan 21, 2021
The additional properties should be allowed in provider schema,
otherwise future version of providers will not be compatible with
older versions of Airflow.

Specifying 'additionalProperties' as allowed we are opening up to
adding more properties to provider.yaml.

This change fixes this is for now by removing extra fields
added since the Airlow 2.0.0 schema and verifying that the 2.0.0
schema correctly validates such modified dictionary.

In the future we might deprecate 2.0.0 and add >=2.0.1 limitation
to the provider packages in which case we will be able to remove
this modification of the provider_info dict.

Also added additional test for provider packages whether they
install on Airflow 2.0.0. This tests might remain even after the
deprecation of 2.0.0 - we can just move it to 2.0.1. However this
will give us much bigger confidence that the providers will
continue work even for older versions of Airflow 2.0.

We might have to modify that test and only include the providers
that are backwards-compatible, in case we have some providers
that depend on future Airflow versions. For now we assume
all providers should be installable from master on 2.0.0.

(cherry picked from commit 523e2f4)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants