Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing performance #3034

Merged
merged 1 commit into from
Feb 5, 2021
Merged

Parsing performance #3034

merged 1 commit into from
Feb 5, 2021

Conversation

gshank
Copy link
Contributor

@gshank gshank commented Jan 26, 2021

Resolves #2862 (partially -- there is still more performance work to come)

This work implements a number of changes to improve performance, including supporting libyaml, caching mapped_fields in the classes for 'from_dict', removing deepcopy on fqn_search, moving validation out of the 'from_dict' methods, amd special handling for dbt internal not_null and unique tests. Uses a new TestMacroNamespace instead of original in order to limit the number of macros in the context.

In addition it replaces the Hologram 'from_dict' and 'to_dict ' calls with Mashumaro calls, which compiles those methods into each subclass.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is huge. I took a first quick pass and left a few comments before signing out for the night.

core/dbt/parser/schemas.py Show resolved Hide resolved
core/dbt/parser/schemas.py Outdated Show resolved Hide resolved
Comment on lines +8 to +12
from yaml import (
CLoader as Loader,
CSafeLoader as SafeLoader,
CDumper as Dumper
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's definitely document how folks can try installing the C version to take advantage of these speedups

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will write up some documentation. That would go in docs.getdbt.com?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes!

test/unit/test_manifest.py Outdated Show resolved Hide resolved
core/setup.py Outdated
'logbook>=1.5,<1.6',
'typing-extensions>=3.7.4,<3.8',
'jsonschema',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Woah... are we done with hologram entirely??
  • We'll need dbt-mashumaro here somewhere. I think dependency_links?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have removed the Hologram code in dbt.dataclass_schema and updated Hologram.

Copy link
Contributor

@drewbanin drewbanin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a FEAT! Really incredible work here @gshank. The numbers speak for themselves and I am so impressed at the performance perf gains we're making here.

I dropped a couple of questions inline, and some other flags for things I want to revisit with fresher eyes. Nice work so far!

core/dbt/clients/jinja.py Show resolved Hide resolved
core/dbt/context/manifest.py Show resolved Hide resolved
core/dbt/contracts/graph/manifest.py Outdated Show resolved Hide resolved
core/dbt/contracts/graph/manifest.py Outdated Show resolved Hide resolved
@@ -454,182 +440,46 @@ class TestConfig(NodeConfig):
severity: Severity = Severity('ERROR')


SnapshotVariants = Union[
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👋 👋 👋

core/dbt/contracts/relation.py Show resolved Hide resolved
core/dbt/contracts/util.py Show resolved Hide resolved
core/dbt/dataclass_schema.py Outdated Show resolved Hide resolved
core/dbt/helper_types.py Outdated Show resolved Hide resolved
core/dbt/parser/schemas.py Show resolved Hide resolved
Copy link
Contributor

@kwigley kwigley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I don't have much feedback. I like the seperation of data validation and serialization! There are a few places where we lose typing, but I think we can trade that for readability. Let's figure out how we should organize dbt-mashumaro in order package this up when we get there. I'm requesting changes because I know some things are still in flux (?) so I want to know when I can take another pass of reading through this, some pieces of this are a black box to me still :).

core/dbt/config/profile.py Outdated Show resolved Hide resolved
core/dbt/config/selectors.py Show resolved Hide resolved
core/dbt/context/base.py Show resolved Hide resolved
core/dbt/context/macros.py Show resolved Hide resolved
core/dbt/context/macro_resolver.py Show resolved Hide resolved
core/dbt/parser/schemas.py Show resolved Hide resolved
@@ -13,3 +13,5 @@ mypy==0.782
wheel
twine
pytest-logbook>=1.2.0,<1.3
git+https://github.com/fishtown-analytics/dbt-mashumaro.git@dbt-customizations
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will want to move this into setup.py before distrubuting obviously, I'm happy to follow up after the PR with how we should package this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We'll also have to do a new release of Hologram, which I added to dev_requirements.txt for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ This is the comment I'm still tracking. It's not possible to install from setup.py on this branch. We need an answer here before we can put this out as a beta version, which I'd like to do as soon as possible. Could one of you open a follow-on issue?

Unless I'm mistaken, it also means we can't tell people to test this out (after merging) via:

pip install git+https://github.com/fishtown-analytics/dbt.git

@gshank gshank force-pushed the parsing-performance branch 2 times, most recently from 76c17d5 to bb8fe42 Compare February 1, 2021 23:53
Copy link
Contributor

@drewbanin drewbanin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved on my end so as to not be a blocker to merge, but would definitely love to:

  1. Ensure that we're encoding timestamps in the ISO 8601 format (sounds like you're already on this one @gshank ?)
  2. Get to a good place w/ the mashurmaro fork and some other deps (i see that jsonschema is unpinned, for instance)

Given that we're early in the release cycle for Margaret Mead, I don't feel strongly about how much of this happens in this PR vs. a future one. Really excited about the perf gains here!

core/dbt/parser/schemas.py Show resolved Hide resolved
Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Incredible work here @gshank. Thanks both @drewbanin and @kwigley for the detailed reviews.

The big thing on my mind is (unsurprisingly) distribution. I'd love to get this into beta-testers' hands as quickly as possible.

@@ -13,3 +13,5 @@ mypy==0.782
wheel
twine
pytest-logbook>=1.2.0,<1.3
git+https://github.com/fishtown-analytics/dbt-mashumaro.git@dbt-customizations
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ This is the comment I'm still tracking. It's not possible to install from setup.py on this branch. We need an answer here before we can put this out as a beta version, which I'd like to do as soon as possible. Could one of you open a follow-on issue?

Unless I'm mistaken, it also means we can't tell people to test this out (after merging) via:

pip install git+https://github.com/fishtown-analytics/dbt.git

mapped_fields in the classes for 'from_dict', removing deepcopy
on fqn_search, separating validation from 'from_dict',
and special handling for dbt internal not_null and unique tests.
Use TestMacroNamespace instead of original in order to limit
the number of macros in the context.  Integrate mashumaro into
dbt to improve performance of 'from_dict' and 'to_dict'
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Investigate performance issues with schema parsing
4 participants