Package best practices doc #139

joellabes · 2022-07-28T02:08:37Z

Based on the lessons learned from adding a bunch of packages to the Hub, here's a spec for would-be contributors to validate against before opening a hubcap PR.

package-best-practices.md

README.md

package-best-practices.md

dbeatty10

So excited about this! Looks awesome 🤩

This looks good enough to me to merge as-is -- we could always revisit with updates as-needed.

The review comments include responses and ideas.

The GitHub emojis don't have nearly a wide enough range to express how excited I am about this! 🏆

Co-authored-by: Doug Beatty <44704949+dbeatty10@users.noreply.github.com>

joellabes · 2022-07-29T03:18:37Z

Shipping it! Thanks for the feedback @dbeatty10 - I've taken a very literal approach to defining version tags, and just said "it has to meet the regex". I think that's probably the best we can do if we have taken parts of two version tagging systems? If we have taken all of PEP-440 then this gets easier but I don't think we did.

package-best-practices.md

dbeatty10 · 2022-07-29T13:05:59Z

package-best-practices.md

+- Packages SHOULD NOT pin to a patch version of their imported package unless they are aware of an incompatibility.
+- Packages importing another package which has reached major version one or later SHOULD allow all subsequent minor and patch releases of that major version. 
+- Packages importing another package which is major version zero MAY pin to the current minor version only.


I think the last 3 bullet points can all be replaced with simple guidance:

Suggested change

- Packages SHOULD NOT pin to a patch version of their imported package unless they are aware of an incompatibility.

- Packages importing another package which has reached major version one or later SHOULD allow all subsequent minor and patch releases of that major version.

- Packages importing another package which is major version zero MAY pin to the current minor version only.

- The widest version range possible SHOULD be specified to give the dependency resolver the appropriate latitude.

My guess is that after lots of discussion exploring many scenarios, ☝️ will still hold as an essential summary, a cataclysm sentence that "contains the most information in the fewest words".

The three bullet points make assumptions about the versioning scheme being used by the imported package which might be strict SemVer 2.0, but just as likely might be a variant or something else entirely.

The three bullet points make assumptions about the versioning scheme being used by the imported package which might be strict SemVer 2.0, but just as likely might be a variant or something else entirely.

Could we change that SHOULD from Releases and Updates to a MUST?

Those bullets are written in the same spirit as when you go to a fancy hotel and there's a No Elephants Allowed In the Ballroom sign - you know they didn't put it up until they had to.

The first one (no patches) is the most common and catastrophic failure state I see; I'm not wed to the others but they seemed like the obvious next question: "If I'm not allowed to pin to a patch, what do you recommend?"

What do you think of something like this?

Packages SHOULD specify the widest possible range of supported versions, to minimise issues in dependency resolution. In particular, packages SHOULD NOT pin to a patch version of their imported package unless they are aware of an incompatibility.

Packages SHOULD specify the widest possible range of supported versions, to minimise issues in dependency resolution. In particular, packages SHOULD NOT pin to a patch version of their imported package unless they are aware of an incompatibility.

🎯 looks perfect

Could we change that SHOULD from Releases and Updates to a MUST?

I'm guessing that we can't say "Packages MUST follow the guidance of Semantic Versioning 2.0.0" because we don't have any auditing mechanism -- there is no easy way to validate to what degree a package is following it or not. We can validate the version string with a regex, but we can't reasonably determine if they are following the SemVer guidance or not. They are on the honor system for this one.

dbeatty10 · 2022-07-29T13:14:31Z

package-best-practices.md

+
+### Interoperability
+- Packages MUST NOT override dbt Core behaviour in such a way as to impact other dbt resources (models, tests, etc) not provided by the package.
+- Packages MUST use the cross-database macros built into dbt Core where available, such as `{{ datediff }}` and `{{ type_string() }}`.


MUST or SHOULD? Could a BigQuery-specific spend package BigQuery-isms? Or what if the current cross-database implementation doesn't cover some edge case that the package needs or wants to accommodate?

Optional ideas for re-wording slightly:

Suggested change

- Packages MUST use the cross-database macros built into dbt Core where available, such as `{{ datediff }}` and `{{ type_string() }}`.

- Packages SHOULD use the [cross-database macros](https://docs.getdbt.com/reference/dbt-jinja-functions/cross-database-macros) built into dbt Core where available, such as `{{ except() }}` and `{{ type_string() }}`.

Went back and forth on this and maybe came down too hard 🤔. How about packages intended for use with multiple warehouses MUST...

I like the link to the docs

If there's an edge case that the macros don't cover, then I think the "where available" works as an escape hatch

Love your suggestion to introduce the nuanced language of RFC 6919 where it is useful! 🤩

News flash: I'm not an expert on the interpretation and application of RFC 2119 or 6919. The following is all informed by my nascent understanding and interpretations.

After reading the "Guidance in the use of these Imperatives" section of RFC 2119, it feels like SHOULD is the applicable imperative here. The following stood out to me in particular:

they must not be used to try to impose a particular method on implementors where the method is not required for interoperability.

A package could ignore some of our "musts" and operate just fine. I didn't check, but I'm guessing there are examples currently in the Hub that would back this up.

But most (not all) of our current usage of MUST seems like it should be SHOULD instead. SHOULD appears to be a really strong word in the spec! It shouldn't be confused at all with OPTIONAL.

So I think we shouldn't feel like we are pandering or compromising on our expectations in any way when we use SHOULD. We should feel like we're laying down something strong, confident, and very assertive.

Tragically, 6919 was published on 1 April 2013 and was never ratified. (See also https://datatracker.ietf.org/doc/html/rfc2324)

Happy to roll with SHOULD here

dbeatty10

This is really excellent work -- looks fantastic 🤩

I left comments for a few nit-picky things. The "wide version ranges" comment deserves the most focus and consideration.

Co-authored-by: Doug Beatty <44704949+dbeatty10@users.noreply.github.com>

joellabes · 2022-08-02T09:12:43Z

package-best-practices.md

+- Packages SHOULD specify the widest possible range of supported versions, to minimise issues in dependency resolution. 
+    - In particular, packages SHOULD NOT pin to a patch version of their imported package unless they are aware of an incompatibility.
+### Interoperability
+- Packages MUST NOT override dbt Core behaviour in such a way as to impact other dbt resources (models, tests, etc) not provided by the package.


@jtcohen6 I wrote this thinking about the aftermath of that Fivetran package that overrode freshness collection and had unfortunate side effects.

But, if someone like this community member turned their gist into a package, that would be an acceptable overriding of behaviour yeah? Does this rule need a bit of wordsmithing or do we just drop it down to REALLY SHOULD NOT per RFC 6919?

+1 for REALLY SHOULD NOT

REALLY SHOULD NOT doesn't actually exist 😢 (discussed above) - should've made clearer that this was a joke

dbeatty10 · 2022-08-02T14:52:35Z

Re-reading the SemVer 2.0.0 spec and specifically the FAQ, we should strongly consider adding the following:

Package versions SHOULD be a minimum of 1.0.0 to align with the spirit of semantic versioning for production software. If a package isn't ready for production, then it isn't yet ready for the Hub.

If the Hub is intended to also host packages that should not be used by anyone in production, then I fully retract the suggestion above.

joellabes · 2022-08-02T22:08:02Z

Re-reading the SemVer 2.0.0 spec and specifically the FAQ, we should strongly consider adding the following:

Package versions SHOULD be a minimum of 1.0.0 to align with the spirit of semantic versioning for production software. If a package isn't ready for production, then it isn't yet ready for the Hub.

If the Hub is intended to also host packages that should not be used by anyone in production, then I fully retract the suggestion above.

Yeahhh... but.... we are nowhere near following that rule ourselves, nor are any of the other packages in the Hub. The rest of this document is is codifying existing behaviour to save us from abstractly saying "you don't meet our norms". This would be a change of policy which I don't really want to spend time enforcing at the moment.

And FWIW, we already say that people SHOULD follow SemVer, so I don't think we need to call out one FAQ specifically. I'd be happy to suggest to people who are submitting future packages that they might want to tag them 1.0.0 instead of 0.1.0 though

dbeatty10 · 2022-08-02T22:57:39Z

I'm attracted to encoding this for a couple reasons:

Many projects that look like SemVer and claim to be SemVer are really just ZeroVer
No one will be harmed by diving in at 1.0.0 -- c'mon in! the water's fine! 🏄
Acceptable version numbers themselves are one of the very few things we can actually programmatically enforce for new packages.

I'm very aware of our own failure to follow that one 🤒 Probably even more reason for us to turn the tide in ways that we can.

joellabes · 2022-08-02T23:23:48Z

What about things like dbt-project-evaluator, dbt-audit-helper or dbt-codegen?

They're primarily for interactive use and I wouldn't necessarily encourage people to use in an unmonitored production environment, especially not codegen.

I am particularly uncomfortable with the editorialising of "if it's not ready for prod it's not ready for the Hub". Can we meet in the middle and put an addendum on the "should follow SemVer" along the lines of "Note in particular the recommendation for production-ready packages to be version 1.0.0 or above"?

dbeatty10 · 2022-08-03T03:33:30Z

Production-ready packages

Note in particular the recommendation for production-ready packages to be version 1.0.0 or above

This would be movement in the right direction.

Would it end up looking something like this?

### Releases and updates
- Packages' git tags MUST validate against the regex defined in [version.py](/hubcap/version.py).
- Packages SHOULD follow the guidance of [Semantic Versioning 2.0.0](https://semver.org/spec/v2.0.0.html).
    - Note in particular the recommendation for production-ready packages to be version `1.0.0` or above

Right-shifted SemVer

What about things like dbt-project-evaluator, dbt-audit-helper or dbt-codegen?

What constitutes being "prod"? 🤷

To the best that I can tell, some or all of those projects follow "right-shifted SemVer" -- just shift everything one separator to the left to recover the MAJOR.MINOR.PATCH semantic meaning! (PATCH will always be 0 following this shift.)

If that is the case, then there's "major" releases hiding in there along with an implied public API -- they could have just started out at 1.0.0 and then it's less complicated to understand the semantic meaning of each of their versions.

Starting out using ZeroVer probably has measurable gravity that is hard to escape.

Interestingly, the SemVer FAQ explains scenarios when the bump to 1.0.0 is later than it should be. It doesn't explain scenarios when it is too early though.

joellabes · 2022-08-03T04:45:12Z

yeah I think you’re probably right that it's mostly just right shifted semver. im having horrible flashbacks to high school calculus when we learned about integration and always had to remember the + c to reflect whatever constant was lost and unrecoverable.

I feel good about this now - we're encouraging good behaviour for new packages, we can gently nudge the existing ecosystem to make better choices over time, but we're not marking the entire package hub as non-compliant which feels a bit rough. I also suspect that this is another place where we were following in dbt Core's footsteps: if it had started at v1, would all the packages be zero-based? I doubt it.

package-best-practices.md

dbeatty10 · 2022-08-03T14:56:24Z

package-best-practices.md

+- The documentation SHOULD indicate which data warehouses are expected to work with this package.
+
+### Customisability 
+- Packages MUST NOT hard-code table references, and MUST use `ref` or `source` instead.


Non-blocking commentary.

The current text definitely conveys the spirit of the guidance, but an overly lawyerly maintainer could hard-code a view reference and claim full compliance.

Two opportunities for further clarity:

table → relation

use applicable adapter methods rather than directly utilizing describe, show, or information_schema

I'm fuzzy on the detailed guidance to accomplish the latter, so that's why I don't have an explicit suggestion that covers both opportunities.

Our glossary doesn't include an entry "relation" currently, it only has the following disparate entries:

https://docs.getdbt.com/terms/table

https://docs.getdbt.com/terms/view

https://docs.getdbt.com/terms/cte

The closest documentation I could find quickly focuses on the definition and usage of the Relation Python object:

https://docs.getdbt.com/reference/dbt-classes#relation

dbeatty10

👍 This will add great value.

Approving with the belief and trust that we'll make edits in the future as-needed.

First draft of package best practices

e8f3e82

joellabes requested review from VersusFacit and dbeatty10 as code owners July 28, 2022 02:08

joellabes commented Jul 28, 2022

View reviewed changes

package-best-practices.md Outdated Show resolved Hide resolved