Validate that Dagster Enterprise plan is in fact better for sensors #34368

Closed
evantahler opened this issue Jan 19, 2024 · 7 comments

evantahler commented Jan 19, 2024

More details in Slack: https://airbytehq-team.slack.com/archives/C059CTCUPDL/p1705615802664099

Our hypothesis is that the sensor startup was taking all of the runtime, not the sensor body itself (for example, because of the resources we were loading up). If that's right, the ways to solve this are:

  1. Reduce the (code-level) resources required by the sensor code (how we mitigated; see the sketch after this list)
  2. Add (infra) resources for the sensor itself so loading this stuff doesn't take so long, since it's fast locally (what we thought we'd be getting from the Enterprise plan, but we're suspicious that it isn't happening)
  3. Increase the timeout, which basically just masks the problem but still helps with outages
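To make option 1 concrete, here's a minimal sketch of what that mitigation can look like with Dagster's Pythonic resources. The resource, job, and all names below are hypothetical stand-ins, not our actual definitions:

```python
from dagster import (
    ConfigurableResource,
    Definitions,
    RunRequest,
    SensorEvaluationContext,
    job,
    op,
    sensor,
)


class RegistryResource(ConfigurableResource):
    """Hypothetical stand-in for a resource that is expensive to construct."""

    bucket: str


@op
def sync_op():
    pass


@job
def sync_job():
    sync_op()


# Option 1: declare only the single resource the tick actually touches,
# rather than the project's full resource set, so each evaluation pays
# only that one construction cost.
@sensor(job=sync_job)
def metadata_sensor(context: SensorEvaluationContext, registry: RegistryResource):
    yield RunRequest(run_key=None, tags={"bucket": registry.bucket})


defs = Definitions(
    jobs=[sync_job],
    sensors=[metadata_sensor],
    resources={"registry": RegistryResource(bucket="my-metadata-bucket")},
)
```

The idea is that the sensor declares only the one resource its evaluation actually uses instead of everything the project defines, so the tick should only pay for what it requests.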

My plan today was to:

  1. Create a new branch
  2. Revert our resource fix (i.e. add a bunch of resources back to one sensor)
  3. Deploy to production to see if the sensor falls over
  4. If it doesn't fall over, introduce a new sensor that calls a no-op job but does a more significant amount of processing in the evaluation itself (read all metadata files and log the contents of each); see the sketch after this list
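For step 4, a rough sketch of what that test sensor could look like. The no-op job pattern is standard Dagster; the `metadata/` path and all names are hypothetical:

```python
from pathlib import Path

from dagster import RunRequest, SensorEvaluationContext, job, op, sensor


@op
def noop():
    """Deliberately empty: the work we want to measure happens in the sensor tick."""


@job
def noop_job():
    noop()


@sensor(job=noop_job)
def heavy_tick_sensor(context: SensorEvaluationContext):
    # Do a significant amount of work inside the evaluation itself so the
    # tick duration reflects in-sensor processing, not the job it launches.
    for path in Path("metadata").glob("**/*.yaml"):  # hypothetical location
        context.log.info(path.read_text())
    yield RunRequest(run_key=None)
```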
@evantahler

Relevant files can be found in the "Connector Ops/Dagster" Google Drive folder: https://drive.google.com/drive/folders/1WE9dyKi7QIiUG3lRQn3TjbToxXBkld37?usp=drive_link

@erohmensing

@erohmensing test

erohmensing commented Jan 29, 2024

First test: adding back all resources as a dependency for one of the sensors. This configuration fell over at one point, but it was probably already on the cusp (it fell over between version bumps, not because of any significant amount of new information we added).

If we've actually gotten any new (infra) resources, we really hope this succeeds.

Adding these resources pushes the sensor time up to ~29 seconds instead of ~2. Previously these sensors fell over entirely, which makes me suspect we're now running about 2x as fast (~30 seconds versus the one-minute tipover point), though we have no confirmation of how long they would actually have taken had they run to completion.

erohmensing commented Jan 29, 2024

January 5, old metadata-service code, sensor with all resources: ~30s
January 5, with a metadata-service code update: pushes it over 1 min

[screenshot]

Removing the extra resources: down to ~8 seconds
[screenshot]

Evaluation time then quickly drops to an average of about ~2 seconds (maybe after all the backlog of missing items got processed? unsure)
[screenshot]

This particular sensor didn't seem to get any boost from the resource change on Dagster's side (but going from 2 seconds to 2 seconds is obviously too small a difference to tell us much)
[screenshot]

Bringing the sensor back to the state it was in on January 5th (first screenshot) brings it right back to ~29 seconds.
I did this two separate ways: manually reverting the workaround, and checking out the first fix branch that actually deployed on Jan 5th. That deploy would have been the one that caused it to start failing.
[screenshots]

If nothing were different, we'd expect this one to time out, so presumably something has changed. The question is whether the difference came from the changes on January 16th, or whether we lost something we already had by deploying on the 5th.

erohmensing commented Jan 29, 2024

Separate investigation:

Added back the original heavy job from the PR where we introduced auto-materialization: it still falls over.
[screenshot]

In fact, it looks like it falls over worse(?) than the original implementation did, since previously (judging from the Loom, at least) the issue looked like successful runs were making subsequent runs fail. Not sure where to go from here.

erohmensing commented Jan 29, 2024

Tried to deploy the state from before the Jan 5th deploy by checking it out, reverting the change, and also cherry-picking the pendulum change (otherwise it wouldn't deploy), but this didn't go anywhere, as airbyte-ci was pretty borked at the time.

TL;DR

There was a LOT going on. We:

  • didn't deploy for 3 months
  • deployed a lot of metadata-lib changes when deploying the new version
  • had to work through a pendulum issue afterward
  • had some sensor definitions that were already requesting way too many (asset) resources
  • deployed a new version of Dagster

The sensor that once took ~30 seconds, then failed after our deploy, now takes ~30 seconds again. So that's... good, although it's a bit unclear why it was working in the first place if the new CPU is really what fixed it. In any case, that sensor has been reworked so that it's down to ~2 seconds (both before and after the new CPU).
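(The thread doesn't record exactly what the rework involved; the comments above suggest it was mostly dropping the extra resource dependencies. For illustration only, another standard Dagster pattern for keeping a file-scanning tick in this range is cursor-based incremental evaluation; everything in this sketch, from the job to the `metadata/` layout, is hypothetical:)

```python
from pathlib import Path

from dagster import RunRequest, SensorEvaluationContext, job, op, sensor


@op
def sync_op():
    pass


@job
def sync_job():
    sync_op()


@sensor(job=sync_job)
def incremental_metadata_sensor(context: SensorEvaluationContext):
    # The cursor persists between ticks: record the newest mtime we've
    # processed so each evaluation only reads files added since then.
    last_seen = float(context.cursor) if context.cursor else 0.0
    newest = last_seen
    for path in Path("metadata").glob("**/*.yaml"):  # hypothetical layout
        mtime = path.stat().st_mtime
        if mtime > last_seen:
            yield RunRequest(run_key=f"{path}:{int(mtime)}")
            newest = max(newest, mtime)
    context.update_cursor(str(newest))
```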

As a last-ditch effort, I'll check out the other sensors to see if they've changed since Jan 16th.

erohmensing commented Jan 29, 2024

Reverting the metadata-lib change without resetting the sensor keeps it short:
[screenshot]

Adding all resources back to the sensor (which I expected to make it fall over) brings it back to ~30s, not any faster than it was initially.

This makes me feel like:

  1. We do not have more resources now than in the Oct 18th deploy, when the sensor was running for ~30s
  2. The metadata-lib change did not play a role in this sensor falling over
