Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Package best practices doc #139

Merged
merged 8 commits into from
Aug 3, 2022
Merged

Conversation

joellabes
Copy link
Contributor

Based on the lessons learned from adding a bunch of packages to the Hub, here's a spec for would-be contributors to validate against before opening a hubcap PR.

package-best-practices.md Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
dbeatty10
dbeatty10 previously approved these changes Jul 28, 2022
Copy link
Contributor

@dbeatty10 dbeatty10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So excited about this! Looks awesome 🤩

This looks good enough to me to merge as-is -- we could always revisit with updates as-needed.

The review comments include responses and ideas.

The GitHub emojis don't have nearly a wide enough range to express how excited I am about this! 🏆

joellabes and others added 2 commits July 29, 2022 14:31
Co-authored-by: Doug Beatty <44704949+dbeatty10@users.noreply.github.com>
@joellabes
Copy link
Contributor Author

Shipping it! Thanks for the feedback @dbeatty10 - I've taken a very literal approach to defining version tags, and just said "it has to meet the regex". I think that's probably the best we can do if we have taken parts of two version tagging systems? If we have taken all of PEP-440 then this gets easier but I don't think we did.

Comment on lines 34 to 36
- Packages SHOULD NOT pin to a patch version of their imported package unless they are aware of an incompatibility.
- Packages importing another package which has reached major version one or later SHOULD allow all subsequent minor and patch releases of that major version.
- Packages importing another package which is major version zero MAY pin to the current minor version only.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the last 3 bullet points can all be replaced with simple guidance:

Suggested change
- Packages SHOULD NOT pin to a patch version of their imported package unless they are aware of an incompatibility.
- Packages importing another package which has reached major version one or later SHOULD allow all subsequent minor and patch releases of that major version.
- Packages importing another package which is major version zero MAY pin to the current minor version only.
- The widest version range possible SHOULD be specified to give the dependency resolver the appropriate latitude.

My guess is that after lots of discussion exploring many scenarios, ☝️ will still hold as an essential summary, a cataclysm sentence that "contains the most information in the fewest words".

The three bullet points make assumptions about the versioning scheme being used by the imported package which might be strict SemVer 2.0, but just as likely might be a variant or something else entirely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The three bullet points make assumptions about the versioning scheme being used by the imported package which might be strict SemVer 2.0, but just as likely might be a variant or something else entirely.

Could we change that SHOULD from Releases and Updates to a MUST?

Those bullets are written in the same spirit as when you go to a fancy hotel and there's a No Elephants Allowed In the Ballroom sign - you know they didn't put it up until they had to.

The first one (no patches) is the most common and catastrophic failure state I see; I'm not wed to the others but they seemed like the obvious next question: "If I'm not allowed to pin to a patch, what do you recommend?"

What do you think of something like this?

Packages SHOULD specify the widest possible range of supported versions, to minimise issues in dependency resolution. In particular, packages SHOULD NOT pin to a patch version of their imported package unless they are aware of an incompatibility.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Packages SHOULD specify the widest possible range of supported versions, to minimise issues in dependency resolution. In particular, packages SHOULD NOT pin to a patch version of their imported package unless they are aware of an incompatibility.

🎯 looks perfect

Could we change that SHOULD from Releases and Updates to a MUST?

I'm guessing that we can't say "Packages MUST follow the guidance of Semantic Versioning 2.0.0" because we don't have any auditing mechanism -- there is no easy way to validate to what degree a package is following it or not. We can validate the version string with a regex, but we can't reasonably determine if they are following the SemVer guidance or not. They are on the honor system for this one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair!


### Interoperability
- Packages MUST NOT override dbt Core behaviour in such a way as to impact other dbt resources (models, tests, etc) not provided by the package.
- Packages MUST use the cross-database macros built into dbt Core where available, such as `{{ datediff }}` and `{{ type_string() }}`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MUST or SHOULD? Could a BigQuery-specific spend package BigQuery-isms? Or what if the current cross-database implementation doesn't cover some edge case that the package needs or wants to accommodate?

Optional ideas for re-wording slightly:

Suggested change
- Packages MUST use the cross-database macros built into dbt Core where available, such as `{{ datediff }}` and `{{ type_string() }}`.
- Packages SHOULD use the [cross-database macros](https://docs.getdbt.com/reference/dbt-jinja-functions/cross-database-macros) built into dbt Core where available, such as `{{ except() }}` and `{{ type_string() }}`.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went back and forth on this and maybe came down too hard 🤔. How about packages intended for use with multiple warehouses MUST...

I like the link to the docs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's an edge case that the macros don't cover, then I think the "where available" works as an escape hatch

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love your suggestion to introduce the nuanced language of RFC 6919 where it is useful! 🤩

News flash: I'm not an expert on the interpretation and application of RFC 2119 or 6919. The following is all informed by my nascent understanding and interpretations.

After reading the "Guidance in the use of these Imperatives" section of RFC 2119, it feels like SHOULD is the applicable imperative here. The following stood out to me in particular:

they must not be used to try to impose a particular method on implementors where the method is not required for interoperability.

A package could ignore some of our "musts" and operate just fine. I didn't check, but I'm guessing there are examples currently in the Hub that would back this up.

But most (not all) of our current usage of MUST seems like it should be SHOULD instead. SHOULD appears to be a really strong word in the spec! It shouldn't be confused at all with OPTIONAL.

So I think we shouldn't feel like we are pandering or compromising on our expectations in any way when we use SHOULD. We should feel like we're laying down something strong, confident, and very assertive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tragically, 6919 was published on 1 April 2013 and was never ratified. (See also https://datatracker.ietf.org/doc/html/rfc2324)

Happy to roll with SHOULD here

Copy link
Contributor

@dbeatty10 dbeatty10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really excellent work -- looks fantastic 🤩

I left comments for a few nit-picky things. The "wide version ranges" comment deserves the most focus and consideration.

joellabes and others added 2 commits August 2, 2022 12:49
Co-authored-by: Doug Beatty <44704949+dbeatty10@users.noreply.github.com>
@joellabes joellabes requested a review from dbeatty10 August 2, 2022 01:12
- Packages SHOULD specify the widest possible range of supported versions, to minimise issues in dependency resolution.
- In particular, packages SHOULD NOT pin to a patch version of their imported package unless they are aware of an incompatibility.
### Interoperability
- Packages MUST NOT override dbt Core behaviour in such a way as to impact other dbt resources (models, tests, etc) not provided by the package.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtcohen6 I wrote this thinking about the aftermath of that Fivetran package that overrode freshness collection and had unfortunate side effects.

But, if someone like this community member turned their gist into a package, that would be an acceptable overriding of behaviour yeah? Does this rule need a bit of wordsmithing or do we just drop it down to REALLY SHOULD NOT per RFC 6919?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for REALLY SHOULD NOT

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

REALLY SHOULD NOT doesn't actually exist 😢 (discussed above) - should've made clearer that this was a joke

@dbeatty10
Copy link
Contributor

Re-reading the SemVer 2.0.0 spec and specifically the FAQ, we should strongly consider adding the following:

Package versions SHOULD be a minimum of 1.0.0 to align with the spirit of semantic versioning for production software. If a package isn't ready for production, then it isn't yet ready for the Hub.

If the Hub is intended to also host packages that should not be used by anyone in production, then I fully retract the suggestion above.

@joellabes
Copy link
Contributor Author

Re-reading the SemVer 2.0.0 spec and specifically the FAQ, we should strongly consider adding the following:

Package versions SHOULD be a minimum of 1.0.0 to align with the spirit of semantic versioning for production software. If a package isn't ready for production, then it isn't yet ready for the Hub.

If the Hub is intended to also host packages that should not be used by anyone in production, then I fully retract the suggestion above.

Yeahhh... but.... we are nowhere near following that rule ourselves, nor are any of the other packages in the Hub. The rest of this document is is codifying existing behaviour to save us from abstractly saying "you don't meet our norms". This would be a change of policy which I don't really want to spend time enforcing at the moment.

And FWIW, we already say that people SHOULD follow SemVer, so I don't think we need to call out one FAQ specifically. I'd be happy to suggest to people who are submitting future packages that they might want to tag them 1.0.0 instead of 0.1.0 though

@dbeatty10
Copy link
Contributor

I'm attracted to encoding this for a couple reasons:

  • Many projects that look like SemVer and claim to be SemVer are really just ZeroVer
  • No one will be harmed by diving in at 1.0.0 -- c'mon in! the water's fine! 🏄
  • Acceptable version numbers themselves are one of the very few things we can actually programmatically enforce for new packages.

I'm very aware of our own failure to follow that one 🤒 Probably even more reason for us to turn the tide in ways that we can.

@joellabes
Copy link
Contributor Author

What about things like dbt-project-evaluator, dbt-audit-helper or dbt-codegen?

They're primarily for interactive use and I wouldn't necessarily encourage people to use in an unmonitored production environment, especially not codegen.

I am particularly uncomfortable with the editorialising of "if it's not ready for prod it's not ready for the Hub". Can we meet in the middle and put an addendum on the "should follow SemVer" along the lines of "Note in particular the recommendation for production-ready packages to be version 1.0.0 or above"?

@dbeatty10
Copy link
Contributor

Production-ready packages

Note in particular the recommendation for production-ready packages to be version 1.0.0 or above

This would be movement in the right direction.

Would it end up looking something like this?

### Releases and updates
- Packages' git tags MUST validate against the regex defined in [version.py](/hubcap/version.py).
- Packages SHOULD follow the guidance of [Semantic Versioning 2.0.0](https://semver.org/spec/v2.0.0.html).
    - Note in particular the recommendation for production-ready packages to be version `1.0.0` or above

Right-shifted SemVer

What about things like dbt-project-evaluator, dbt-audit-helper or dbt-codegen?

What constitutes being "prod"? 🤷

To the best that I can tell, some or all of those projects follow "right-shifted SemVer" -- just shift everything one separator to the left to recover the MAJOR.MINOR.PATCH semantic meaning! (PATCH will always be 0 following this shift.)

If that is the case, then there's "major" releases hiding in there along with an implied public API -- they could have just started out at 1.0.0 and then it's less complicated to understand the semantic meaning of each of their versions.

Starting out using ZeroVer probably has measurable gravity that is hard to escape.

Interestingly, the SemVer FAQ explains scenarios when the bump to 1.0.0 is later than it should be. It doesn't explain scenarios when it is too early though.

@joellabes
Copy link
Contributor Author

yeah I think you’re probably right that it's mostly just right shifted semver. im having horrible flashbacks to high school calculus when we learned about integration and always had to remember the + c to reflect whatever constant was lost and unrecoverable.

I feel good about this now - we're encouraging good behaviour for new packages, we can gently nudge the existing ecosystem to make better choices over time, but we're not marking the entire package hub as non-compliant which feels a bit rough. I also suspect that this is another place where we were following in dbt Core's footsteps: if it had started at v1, would all the packages be zero-based? I doubt it.

package-best-practices.md Outdated Show resolved Hide resolved
- The documentation SHOULD indicate which data warehouses are expected to work with this package.

### Customisability
- Packages MUST NOT hard-code table references, and MUST use `ref` or `source` instead.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking commentary.

The current text definitely conveys the spirit of the guidance, but an overly lawyerly maintainer could hard-code a view reference and claim full compliance.

Two opportunities for further clarity:

  • table → relation
  • use applicable adapter methods rather than directly utilizing describe, show, or information_schema

I'm fuzzy on the detailed guidance to accomplish the latter, so that's why I don't have an explicit suggestion that covers both opportunities.

Our glossary doesn't include an entry "relation" currently, it only has the following disparate entries:

The closest documentation I could find quickly focuses on the definition and usage of the Relation Python object:

Copy link
Contributor

@dbeatty10 dbeatty10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 This will add great value.

Approving with the belief and trust that we'll make edits in the future as-needed.

@joellabes joellabes merged commit 38d53dd into main Aug 3, 2022
@joellabes joellabes deleted the joel/add-package-best-practices branch August 3, 2022 20:52
@dbeatty10 dbeatty10 linked an issue Oct 31, 2022 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature: Codify package config guidelines
2 participants