Performance issues with a nontrivial number of sources #2474

drewbanin · 2020-05-20T13:47:45Z

Describe the bug

In dbt v0.17.0-rc1, it appears that performance degrades greatly when a nontrivial number of sources are added to a project. This is a regression: I could not reproduce this performance failure mode in dbt v0.16.x.

I tested this out by repeatedly running dbt ls and recording runtimes. The data looks like:

| sources | runtime (s) | time per node (s) |
| --- | --- | --- |
| 25 | 4.049509287 | 0.1619803715 |
| 50 | 7.128911018 | 0.1425782204 |
| 75 | 12.29483604 | 0.1639311473 |
| 100 | 18.71185374 | 0.1871185374 |
| 125 | 29.37486005 | 0.2349988804 |
| 150 | 38.06821609 | 0.2537881072 |
| 175 | 50.92429304 | 0.2909959602 |
| 200 | 67.39295316 | 0.3369647658 |
| 225 | 79.97533703 | 0.3554459423 |

This data indicates that a dbt ls command with 225 sources takes 1m20s to run. A corresponding dbt ls on 0.16.1 runs in 4s!
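For anyone who wants to reproduce these timings, here is a rough sketch of a timing harness. The helper names `time_command` and `summarize` are made up for illustration and are not part of dbt; the only dbt-specific assumption is that `dbt ls` is run from inside a project directory.

```python
import statistics
import subprocess
import time
from typing import List, Sequence, Tuple


def time_command(cmd: Sequence[str], repeats: int = 3) -> List[float]:
    """Run `cmd` `repeats` times and return the wall-clock duration of each run in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        # Discard output so terminal I/O does not skew the timing.
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: Sequence[float], n_sources: int) -> Tuple[float, float]:
    """Return (mean runtime, mean runtime per source) for one project size."""
    mean = statistics.mean(durations)
    return mean, mean / n_sources
```

Calling `time_command(["dbt", "ls"])` in projects with 25, 50, 75, ... sources and passing each result to `summarize` should give rows comparable to the table above, though absolute numbers will vary by machine.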

The time per node roughly doubles between 25 and 225 sources (from about 0.16s to about 0.36s), so total runtime grows faster than linearly. There may be an algorithmic complexity issue to look into here. Additionally, the fixed cost for parsing a single source is super high. Most of this latency appears to come from the serialization and deserialization of data in the source patching part of the codebase.
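One way to put a number on the scaling is to fit a straight line to log(runtime) against log(sources): a slope near 1 means linear scaling, and a slope near 2 means quadratic. The sketch below uses only the measurements reported in this issue; `loglog_slope` is a name chosen for illustration, and nine data points only give a rough estimate.

```python
import math
from typing import List, Tuple

# (number of sources, runtime in seconds), copied from the measurements above.
MEASUREMENTS: List[Tuple[int, float]] = [
    (25, 4.049509287),
    (50, 7.128911018),
    (75, 12.29483604),
    (100, 18.71185374),
    (125, 29.37486005),
    (150, 38.06821609),
    (175, 50.92429304),
    (200, 67.39295316),
    (225, 79.97533703),
]


def loglog_slope(points: List[Tuple[float, float]]) -> float:
    """Least-squares slope of log(runtime) versus log(size)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(t) for _, t in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


if __name__ == "__main__":
    print(f"estimated scaling exponent: {loglog_slope(MEASUREMENTS):.2f}")
```

An exponent clearly above 1 would be consistent with some per-source work that itself grows with the total number of sources.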

The patch_source method accounts for the majority of the runtime of this dbt ls command, but notably, there are no sources to patch in my example project!
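For reference, this is the general kind of profiling that surfaces a hotspot like this, sketched with the standard library's `cProfile`. The workload `roundtrip_sources` is a hypothetical stand-in that serializes and deserializes dictionaries with `json`; it is not dbt's `patch_source` and does not use hologram.

```python
import cProfile
import io
import json
import pstats
from typing import Any, Callable, Dict, List


def roundtrip_sources(n: int) -> List[Dict[str, Any]]:
    """Hypothetical workload: serialize and deserialize n source-like dicts."""
    sources = [
        {"name": f"source_{i}", "tables": [{"name": f"table_{j}"} for j in range(5)]}
        for i in range(n)
    ]
    return [json.loads(json.dumps(source)) for source in sources]


def profile_report(func: Callable[..., Any], *args: Any, top: int = 10) -> str:
    """Profile func(*args) and return the `top` entries sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_report(roundtrip_sources, 225))
```

Sorting by cumulative time puts the outermost expensive call at the top of the report, which is how a single method can be seen to dominate the runtime of a command.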

The relevant part of the codebase is around here:

https://github.com/fishtown-analytics/dbt/blob/75dbb0bc19376b2905d5bbb66284b9be3bf3c93c/core/dbt/parser/sources.py#L44-L68

Possible resolutions

Can we skip the source patching code if the source is not patched?
The slowest parts of this execution are around serialization and deserialization in hologram (I think). Is there an easy way to make this serialization/deserialization significantly faster?
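To illustrate the first idea, here is a minimal sketch of a fast path that skips the serialize/deserialize round trip when there is nothing to patch. `SourceDefinition` and `apply_patch` are hypothetical names invented for this example; this is not dbt's actual parser code, and it uses plain dataclasses rather than hologram.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass
class SourceDefinition:
    """Hypothetical, heavily simplified stand-in for a parsed source."""
    name: str
    description: str = ""
    meta: Dict[str, Any] = field(default_factory=dict)


def apply_patch(
    source: SourceDefinition, patch: Optional[Dict[str, Any]]
) -> SourceDefinition:
    """Return the source with `patch` applied, doing no work when there is no patch."""
    if not patch:
        # Fast path: nothing to patch, so skip the dict round trip entirely.
        return source
    # Slow path: convert to a dict, overlay the patch, and rebuild the object.
    data = asdict(source)
    data.update(patch)
    return SourceDefinition(**data)
```

With this shape, an unpatched project pays only for a truthiness check per source, while patched sources still go through the full rebuild.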

The output of dbt --version:

dbt v0.17.0-rc1

The operating system you're using: macOS

The output of python --version: 3.7.7


beckjake · 2020-05-20T14:38:22Z

I assure you, there's no way dbt 0.17.0rc1 is running with python 2.7.7. 😄
Maybe we should have people run dbt debug instead, so we can capture homebrew/virtualenv installs?

drewbanin · 2020-05-20T14:52:54Z

oops - i meant 3.7.7 - that was a typo

drewbanin · 2020-05-21T13:46:45Z

fixed by #2478

drewbanin added the bug and performance labels May 20, 2020

drewbanin added this to the Octavius Catto milestone May 20, 2020

beckjake mentioned this issue May 20, 2020

fix source patching perf with no patches #2478

Merged

4 tasks

drewbanin closed this as completed May 21, 2020

beckjake mentioned this issue May 21, 2020

docs/schema.yml parsing can be very slow at large scales #2480

Closed
