-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix some timers. #141
Fix some timers. #141
Conversation
dfda64d
to
e8c1b8a
Compare
Reading this code makes me all sad inside. |
Writing this code a while back filled me with RAGE. |
👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing the timing around the request makes sense. But we need to figure out what's going on with the timer you're seeing not get stopped. I'm guessing there's a response being sent from a requestMapper
that's not playing nice with timers. If that's the case, we need to update docs somewhere to reflect that implicit requirement of request mappers.
@@ -365,6 +365,7 @@ public void getData( | |||
uriInfo, | |||
this | |||
); | |||
RequestLog.stopTiming("DataApiRequest"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was this needed? The call on line 374 should switch from the timer that was open to the next one. If anything, perhaps we should wrap a timing phase around the request mapping (which would also stop this one). I guess my question is: Under what conditions was this timer not being stopped, since it looks like this shouldn't be a problem based on the code in this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, that's right, I forgot that switching also stops timers. I was doing this in between another task that had a lot of "hurry up and wait" so I kept losing context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I'll need to do some more digging. I know that we're having a similar problem with the hydrating dimension rows timer in Lucene. Despite the code ostensibly being there, it doesn't seem like we are actually executing the line that starts the timer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know why switchTimer
isn't working. Here's the implementation:
stopMostRecentTimer();
startTiming(nextPhase);
However, the DataApiRequest
timer is no longer the most recent timer, because of all the timings we added to the DataApiRequest
. So, switchTimer
doesn't do what you think it should if you have timings happening inside any method calls between startTimer
and switchTimer
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
growls I always hated this behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm telling you, when someone says "X is convenient" you should hear "X will cause unexpected bugs a few refactors from now."
--We forgot to stop the `DataApiRequest` timer. --When timing the posting of the Druid Query we failed to account for the thread hopping during asynchronous programming. As a result, not only were we failing to stop the timer in the correct instance of the RequestLog (the one that is used by the response handling thread), but we also attempted to stop a timer in a RequestLog that has since been cleared. So we end up with two warnings: One for attempting to stop a timer that doesn't exist, and one for logging a timer that was never stopped. Since there isn't a good way to time just the posting of the druid query, and it probably isn't necessary anyway, the timing has been removed.
4347632
to
e82d0f6
Compare
--`RequestLog::switchTiming` works by stopping the most recent timer, and then starting the specified timer. As a result, we can fail to stop a timer A if we start and stop timer B inside a method invoked while timer A is running. As a result, in order to ensure that all of your timers are working correctly, you need to be cognizant of the internal workings of every method you invoke within the timed block. --Therefore, in an effort to make timed code more modular, we've removed the `RequestLog::switchTiming` method, and replaced all instances of it with try-finally blocks that explicitly stop and start each timer. It's ugly, but it's also much less brittle.
e82d0f6
to
cc328de
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few tweaks, and making sure we didn't lose a timer. Otherwise, this is looking really good, thanks!
* | ||
* @param nextPhase the name of the stopwatch to be started | ||
*/ | ||
public static void switchTiming(String nextPhase) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should deprecate and call out in changelog, not directly remove.
|
||
if (requestMapper != null) { | ||
apiRequest = (DataApiRequest) requestMapper.apply(apiRequest, containerRequestContext); | ||
} | ||
|
||
RequestLog.switchTiming("DruidQueryMerge"); | ||
RequestLog.startTiming("DruidQueryMerge"); | ||
TemplateDruidQuery templateQuery; | ||
// Build the query template |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're gonna hate me for this, but in these refactor spots, can we float the comment to the start of the "block" (ie. above the start of the timer)? Having it in the middle is slightly confusing.
RequestLog.switchTiming("logRequestMetrics"); | ||
logRequestMetrics(apiRequest, readCache, druidQuery); | ||
RequestLog.record(new DruidFilterInfo(apiRequest.getFilter())); | ||
RequestLog.switchTiming(REQUEST_WORKFLOW_TIMER); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've lost the start to the request workflow timer I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we did.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to have tests to verify that the request log is correctly stimulated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably do. Especially since I looked through the logs and it looks like there are a few places where we aren't properly stopping timers (the JobsServlet
and I don't think our TestDruidWebService
is stopping the request workflow timer correctly).
I tried poking at fixing those problems, but my quick fix in the TestDruidWebService
didn't seem to stop all the warnings about the request workflow timer, and the JobsServlet
desperately needs to be cleaned up. We start and stop timers all over the place. Since that was more than I wanted to take on right now, I dropped the effort.
The tricky part is that I'm not sure how to do that test. Most methods that run timers, we could probably check easily enough that all the appropriate timers have been stopped. Not sure how to verify that a timer has been started. Especially in the DataServlet
.
Ideally we'd use a mock, but RequestLog
is statically accessible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, we know a technique for fixing the static dependency, but it's a bigger effort than this PR.
Do we care about timers that have started but not stopped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, if the timers haven't been stopped, then when we go to log their durations, we'll get incorrect numbers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only the last little thing about moving the changelog comment to the right section.
@@ -49,10 +49,11 @@ Current | |||
|
|||
### Changed: | |||
|
|||
- [`RequestLog::switchTiming` has been removed]((https://github.com/yahoo/fili/pull/141) | |||
- [`RequestLog::switchTiming` has been deprecated]((https://github.com/yahoo/fili/pull/141) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to call out deprecation in deprecation section of change log
@@ -49,6 +49,16 @@ Current | |||
|
|||
### Changed: | |||
|
|||
- [`RequestLog::switchTiming` has been deprecated]((https://github.com/yahoo/fili/pull/141) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure where the other comment I left went, but this note should move into the deprecated
section
reaffirming my 👍 |
--We forgot to stop the
DataApiRequest
timer.--When timing the posting of the Druid Query we
failed to account for the thread hopping during
asynchronous programming. As a result, not only
were we failing to stop the timer in the correct
instance of the RequestLog (the one that is used
by the response handling thread), but we also
attempted to stop a timer in a RequestLog that
has since been cleared. So we end up with two
warnings: One for attempting to stop a timer that
doesn't exist, and one for logging a timer that
was never stopped.
Since there isn't a good way to time just the
posting of the druid query, and it probably
isn't necessary anyway, the timing has been
removed.