Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Observability To Colocated Auction Runloop #1930

Merged
merged 6 commits into from
Oct 9, 2023
Merged

Conversation

nlordell
Copy link
Contributor

@nlordell nlordell commented Oct 8, 2023

Description

This PR refactors the autopilot runloop to add some additional observability to the solver competition. This will allow us to have a better idea of how individual solvers are behaving.

As a note on the code and how to record metrics - I created methods on the Metrics type in order to increment/observe things. I'm not 100% happy with this, but wanted to see what others think in terms of legibility.

Changes

  • Added gauge for last auction ID seen in the runloop.
  • Specific error types for all driver interactions
  • Added metrics counters for results of driver interactions (solving, revealing and settling), and specifically a histogram for solve timings.

How to test

The code should be covered by existing E2E tests in the services.

@nlordell nlordell requested a review from a team as a code owner October 8, 2023 19:29
Copy link
Contributor

@fleupold fleupold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created methods on the Metrics type in order to increment/observe things. I'm not 100% happy with this, but wanted to see what others think in terms of legibility.

Why are you not happy with this? Seems fine and readable to me...

"reveal auction id missmatch"
);
.map_err(RevealError::Failure)?;
if !response
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Are we putting the auction ID data also on-chain?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we started doing this.

Comment on lines +477 to +479
// Take extra care to not accidentally keep the borrow alive within
// the `while` body, which would block senders.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wow, just learned about this behavior? This seems like quite a foot gun on this abstraction (ideally we can just get the most recent block without causing a potential write lock)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, my understanding is that this is related to where the compiler places the Drop. For example, see this issue.

That being said, I don't think it was a problem in our usage here, I just adapted the existing unbounded loop to be a while loop and just reworded the comment that was already there (I assumed it was done this way for a reason, I didn't verify it as such):

loop {
// This could be a while loop. It isn't, because some care must be taken to not
// accidentally keep the borrow alive, which would block senders. Technically
// this is fine with while conditions but this is clearer.
if self.current_block.borrow().number > deadline {
break;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should review whether this is actually needed (I also think it's not in this case) just to avoid further confusion in the future.
I'm fine with leaving this as is since this PR is urgent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this quick test that seems to suggest that the intermediary value from the while condition expression get dropped.

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=271343a442033c63dfbc869ad62e064a

crates/autopilot/src/run_loop.rs Outdated Show resolved Hide resolved
@nlordell
Copy link
Contributor Author

nlordell commented Oct 9, 2023

Why are you not happy with this? Seems fine and readable to me...

I feel like the Metrics::get().observe(...) is peppered in the code and slightly inconsistent, and adds cognitive load when reading (instead of having a single place where we record all metrics for example).

@fleupold
Copy link
Contributor

fleupold commented Oct 9, 2023

I feel like the Metrics::get().observe(...) is peppered in the code and slightly inconsistent

In the new architecture we would define a separate observe module and do all logging/metrics there? I kind of like that as well but is probably a large refactor for the autopilot...

@nlordell
Copy link
Contributor Author

nlordell commented Oct 9, 2023

In the new architecture we would define a separate observe module and do all logging/metrics there?

My comment is unrelated to this, and more in the context of this individual module:

we update metrics in single_auction and in competition methods rather arbitrarily, instead of in a single function of the RunLoop type. The metrics adds additional control flow at those points, and it would be nice for it to be more consolidated.

And not about the metrics architecture in the driver. I don't think either @MartinquaXD nor myself are particularly fond of logging from a single module (there are some downsides) so I would advise against it (and generally change how we do things in the driver to be more consistent with the rest of the project; I personally see it as an unsuccessful experiment).

Copy link
Contributor

@MartinquaXD MartinquaXD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice observability improvement! Just seems to me that we would currently not get any meaningful timing information out of the driver runs due to join_all().
Approving assuming this gets addressed.

crates/autopilot/src/run_loop.rs Outdated Show resolved Hide resolved
crates/autopilot/src/run_loop.rs Outdated Show resolved Hide resolved
Comment on lines +477 to +479
// Take extra care to not accidentally keep the borrow alive within
// the `while` body, which would block senders.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should review whether this is actually needed (I also think it's not in this case) just to avoid further confusion in the future.
I'm fine with leaving this as is since this PR is urgent.

@MartinquaXD
Copy link
Contributor

I feel like the Metrics::get().observe(...) is peppered in the code and slightly inconsistent

IMO we could also move the Metrics::get() part into each .observe() function implementation. Then it's just Metrics::observe() which seems slightly nicer IMO. No strong opinion, though.

@nlordell nlordell merged commit fba44ba into main Oct 9, 2023
7 checks passed
@nlordell nlordell deleted the refactor-autopilot-runloop branch October 9, 2023 13:26
@github-actions github-actions bot locked and limited conversation to collaborators Oct 9, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants