Validate that Dagster Enterprise plan is in fact better for sensors #34368
Relevant files can be found in the "Connector Ops/Dagster" Google Drive: https://drive.google.com/drive/folders/1WE9dyKi7QIiUG3lRQn3TjbToxXBkld37?usp=drive_link
@erohmensing test
First test: adding back all resources as a dependency for one of the sensors. This fell over at one point, but it was probably also on the cusp (it fell over between version bumps, not because of any significant amount of new information we suspect we added). If we've gotten any new resources since then, we really hope this still succeeds. Adding these resources puts the sensor time up to ~29 seconds instead of ~2. Previously these sensors fell over, which gives me the suspicion that we're running about 2x as fast (~30 seconds versus the tip-over point of one minute, though we don't have confirmation of how long they would actually have taken if they could have run to completion).
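For context on what "all resources as a dependency" roughly means in Dagster terms, here is a minimal, self-contained sketch (not the actual Airbyte orchestrator code; the resource, op, job, and sensor names are made up). It assumes the expensive part is constructing the resource clients on every tick:

```python
import time

from dagster import RunRequest, build_resources, job, op, resource, sensor


# Hypothetical stand-in for an expensive resource (auth handshake, client
# construction, etc.); the sleep simulates that startup cost.
@resource
def slow_client_resource(_init_context):
    time.sleep(5)
    return object()


# "All resources as a dependency": every tick initializes every entry here.
ALL_RESOURCES = {f"client_{i}": slow_client_resource for i in range(5)}


@op
def noop_op():
    pass


@job
def example_job():
    noop_op()


@sensor(job=example_job, minimum_interval_seconds=60)
def heavy_sensor(_context):
    # The tick pays ~25s of resource startup before doing any real work,
    # which is the kind of pattern that pushes an evaluation toward the
    # one-minute tip-over point discussed above.
    with build_resources(ALL_RESOURCES) as resources:
        _ = resources.client_0  # resources are now live for the tick's work
        yield RunRequest()
```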
January 5, old metadata service code, sensor with all resources: 30s.
- Removing the extra resources brings it down to ~8 seconds, and it quickly drops to an average of ~2 (maybe after all the missing stuff got processed? unsure).
- This particular sensor didn't seem to get any boost from the resources change on Dagster's side (but 2 seconds to 2 seconds is obviously minuscule).
- Bringing the sensor back to the state it was in on January 5th (first picture) brings it right back to ~29 seconds.
- If nothing were different, we'd expect this one to time out, so presumably we are getting some differences here. The question is whether they came from the changes on January 16th or whether we lost something we already had by deploying on the 5th.
Separate investigation: added the original heavy job from the PR where we introduced auto-materialization; it still falls over. In fact it looks like it falls over worse(?) than in the original implementation, since previously (from the Loom, at least) the issue looked like successful runs making subsequent runs fail. Not sure where to go from here.
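For readers unfamiliar with the feature being referenced, a minimal sketch of what auto-materialization looks like in Dagster (asset names here are hypothetical; this is not the heavy job from that PR). The eager policy asks the daemon to re-materialize the downstream asset whenever its upstream changes, which is where run volume can pile up:

```python
from dagster import AutoMaterializePolicy, Definitions, asset


@asset
def raw_connector_metadata():
    # Hypothetical upstream asset; imagine it updates frequently.
    return {"connectors": []}


@asset(auto_materialize_policy=AutoMaterializePolicy.eager())
def connector_registry(raw_connector_metadata):
    # Eager policy: the auto-materialize daemon kicks off a run whenever
    # raw_connector_metadata is updated, so an expensive downstream step
    # can queue up many runs back to back.
    return raw_connector_metadata


defs = Definitions(assets=[raw_connector_metadata, connector_registry])
```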
Tried to deploy the state before the Jan 5th deploy by checking it out, reverting the change, and also cherry-picking the pendulum change (otherwise it wouldn't deploy), but this didn't go anywhere, as airbyte-ci was pretty borked at the time. TL;DR: there was a LOT going on.
The sensor that once took 30 seconds, then failed upon our deploy, now takes 30 seconds again. So that's... good, although it's a bit unclear why it was working in the first place if it really was the new CPU that bumped it. In any case, that sensor was reworked so that it's down to 2 seconds (before and after the new CPU). As a last-ditch effort I'll check out the other sensors to see if they changed since Jan 16th.
Reverting the metadata-lib change without resetting the sensor keeps it short. Adding all resources back to the sensor (which I expected to fall over) brings it back to 30s, not any faster than it was initially. This makes me feel like
More details in the Slack thread: https://airbytehq-team.slack.com/archives/C059CTCUPDL/p1705615802664099
Our hypothesis is that sensor startup (for example, loading up all of the resources), rather than the sensor evaluation itself, was taking all the runtime. The outcome being that the ways to solve this are
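As an illustrative sketch of that hypothesis (my own example with hypothetical names, not anything proposed in this thread): if resource construction is what dominates the tick, doing a cheap check first and only building the heavy clients when a run is actually needed keeps idle ticks fast.

```python
import time

from dagster import RunRequest, SkipReason, build_resources, job, op, resource, sensor


@resource
def expensive_client_resource(_init_context):
    time.sleep(5)  # stands in for slow client construction
    return object()


@op
def noop_op():
    pass


@job
def example_job():
    noop_op()


def something_changed() -> bool:
    # Hypothetical cheap check, e.g. comparing a stored cursor against a
    # bucket listing timestamp; no heavy clients needed.
    return False


@sensor(job=example_job, minimum_interval_seconds=60)
def lazy_sensor(_context):
    # Idle ticks return immediately and never pay the resource-startup cost.
    if not something_changed():
        yield SkipReason("nothing new since the last tick")
        return

    # Only build the expensive client once there is actual work to do.
    with build_resources({"client": expensive_client_resource}) as resources:
        _ = resources.client
        yield RunRequest()
```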
My plan today was to