Validate that Dagster Enterprise plan is in fact better for sensors #34368

Closed
evantahler opened this issue Jan 19, 2024 · 7 comments

evantahler commented Jan 19, 2024

More details in Slack: https://airbytehq-team.slack.com/archives/C059CTCUPDL/p1705615802664099

Our hypothesis is that the sensor startup was taking all of the runtime, not the sensor body itself (for example, because of the resources we were loading up). If that's right, the ways to solve this are:

  1. Reduce the (code-level) resources required by the sensor code (how we mitigated; see the sketch after this list)
  2. Add (infra) resources for the sensor itself so loading this stuff doesn't take so long, since it's fast locally (what we thought we'd be getting from the Enterprise plan, but we're suspicious that it isn't happening)
  3. Increase the timeout, which basically just masks the problem but still helps with outages
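To make option 1 concrete, here's a minimal sketch of what that mitigation can look like with Dagster's Pythonic resources. The resource, job, and all names below are hypothetical stand-ins, not our actual definitions:

```python
from dagster import (
    ConfigurableResource,
    Definitions,
    RunRequest,
    SensorEvaluationContext,
    job,
    op,
    sensor,
)


class RegistryResource(ConfigurableResource):
    """Hypothetical stand-in for a resource that is expensive to construct."""

    bucket: str


@op
def sync_op():
    pass


@job
def sync_job():
    sync_op()


# Option 1: declare only the single resource the tick actually touches,
# rather than the project's full resource set, so each evaluation pays
# only that one construction cost.
@sensor(job=sync_job)
def metadata_sensor(context: SensorEvaluationContext, registry: RegistryResource):
    yield RunRequest(run_key=None, tags={"bucket": registry.bucket})


defs = Definitions(
    jobs=[sync_job],
    sensors=[metadata_sensor],
    resources={"registry": RegistryResource(bucket="my-metadata-bucket")},
)
```

The idea is that the sensor declares only the one resource its evaluation actually uses instead of everything the project defines, so the tick should only pay for what it requests.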

My plan today was to:

  1. Create a new branch
  2. Revert our resource fix (i.e. add a bunch of resources back to one sensor)
  3. Deploy to production to see if the sensor falls over
  4. If it doesn't fall over, introduce a new sensor that calls a no-op job but does a more significant amount of processing in the evaluation itself (read all metadata files and log the contents of each); see the sketch after this list
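For step 4, a rough sketch of what that test sensor could look like. The no-op job pattern is standard Dagster; the `metadata/` path and all names are hypothetical:

```python
from pathlib import Path

from dagster import RunRequest, SensorEvaluationContext, job, op, sensor


@op
def noop():
    """Deliberately empty: the work we want to measure happens in the sensor tick."""


@job
def noop_job():
    noop()


@sensor(job=noop_job)
def heavy_tick_sensor(context: SensorEvaluationContext):
    # Do a significant amount of work inside the evaluation itself so the
    # tick duration reflects in-sensor processing, not the job it launches.
    for path in Path("metadata").glob("**/*.yaml"):  # hypothetical location
        context.log.info(path.read_text())
    yield RunRequest(run_key=None)
```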
@evantahler

Relevant files can be found in the "Connector Ops/Dagster" Google Drive folder: https://drive.google.com/drive/folders/1WE9dyKi7QIiUG3lRQn3TjbToxXBkld37?usp=drive_link

@erohmensing

@erohmensing test

erohmensing commented Jan 29, 2024

First test: adding back all resources as a dependency for one of the sensors. This configuration fell over at one point, but it was probably already on the cusp (it fell over between version bumps, not because of any significant amount of new information we added).

If we've actually gotten any new (infra) resources, we really hope this succeeds.

Adding these resources pushes the sensor time up to ~29 seconds instead of ~2. Previously these sensors fell over entirely, which makes me suspect we're now running about 2x as fast (~30 seconds versus the one-minute tipover point), though we have no confirmation of how long they would actually have taken had they run to completion.

erohmensing commented Jan 29, 2024

January 5, old metadata-service code, sensor with all resources: ~30s
January 5, with a metadata-service code update: pushes it over 1 min

[screenshot]

Removing the extra resources: down to ~8 seconds
[screenshot]

Evaluation time then quickly drops to an average of about ~2 seconds (maybe after all the backlog of missing items got processed? unsure)
[screenshot]

This particular sensor didn't seem to get any boost from the resource change on Dagster's side (but going from 2 seconds to 2 seconds is obviously too small a difference to tell us much)
[screenshot]

Bringing the sensor back to the state it was in on January 5th (first screenshot) brings it right back to ~29 seconds.
I did this two separate ways: manually reverting the workaround, and checking out the first fix branch that actually deployed on Jan 5th. That deploy would have been the one that caused it to start failing.
[screenshots]

If nothing were different, we'd expect this one to time out, so presumably something has changed. The question is whether the difference came from the changes on January 16th, or whether we lost something we already had by deploying on the 5th.

erohmensing commented Jan 29, 2024

Separate investigation:

Added back the original heavy job from the PR where we introduced auto-materialization: it still falls over.
[screenshot]

In fact, it looks like it falls over worse(?) than the original implementation did, since previously (judging from the Loom, at least) the issue looked like successful runs were making subsequent runs fail. Not sure where to go from here.

erohmensing commented Jan 29, 2024

Tried to deploy the state from before the Jan 5th deploy by checking it out, reverting the change, and also cherry-picking the pendulum change (otherwise it wouldn't deploy), but this didn't go anywhere, as airbyte-ci was pretty borked at the time.

TL;DR

There was a LOT going on. We:

  • didn't deploy for 3 months
  • deployed a lot of metadata-lib changes when deploying the new version
  • had to work through a pendulum issue afterward
  • had some sensor definitions that were already requesting way too many (asset) resources
  • deployed a new version of Dagster

The sensor that once took ~30 seconds, then failed after our deploy, now takes ~30 seconds again. So that's... good, although it's a bit unclear why it was working in the first place if the new CPU is really what fixed it. In any case, that sensor has been reworked so that it's down to ~2 seconds (both before and after the new CPU).
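(The thread doesn't record exactly what the rework involved; the comments above suggest it was mostly dropping the extra resource dependencies. For illustration only, another standard Dagster pattern for keeping a file-scanning tick in this range is cursor-based incremental evaluation; everything in this sketch, from the job to the `metadata/` layout, is hypothetical:)

```python
from pathlib import Path

from dagster import RunRequest, SensorEvaluationContext, job, op, sensor


@op
def sync_op():
    pass


@job
def sync_job():
    sync_op()


@sensor(job=sync_job)
def incremental_metadata_sensor(context: SensorEvaluationContext):
    # The cursor persists between ticks: record the newest mtime we've
    # processed so each evaluation only reads files added since then.
    last_seen = float(context.cursor) if context.cursor else 0.0
    newest = last_seen
    for path in Path("metadata").glob("**/*.yaml"):  # hypothetical layout
        mtime = path.stat().st_mtime
        if mtime > last_seen:
            yield RunRequest(run_key=f"{path}:{int(mtime)}")
            newest = max(newest, mtime)
    context.update_cursor(str(newest))
```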

As a last-ditch effort, I'll check out the other sensors to see if they've changed since Jan 16th.

erohmensing commented Jan 29, 2024

Reverting the metadata-lib change without resetting the sensor keeps it short:
[screenshot]

Adding all resources back to the sensor (which I expected to make it fall over) brings it back to ~30s, not any faster than it was initially.

This makes me feel like:

  1. We do not have more resources now than in the Oct 18th deploy, when the sensor was running for ~30s
  2. The metadata-lib change did not play a role in this sensor falling over
