Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix some timers. #141

Merged
merged 5 commits into from
Jan 23, 2017
Merged

Fix some timers. #141

merged 5 commits into from
Jan 23, 2017

Conversation

archolewa
Copy link
Contributor

--We forgot to stop the DataApiRequest timer.

--When timing the posting of the Druid Query we
failed to account for the thread hopping during
asynchronous programming. As a result, not only
were we failing to stop the timer in the correct
instance of the RequestLog (the one that is used
by the response handling thread), but we also
attempted to stop a timer in a RequestLog that
has since been cleared. So we end up with two
warnings: One for attempting to stop a timer that
doesn't exist, and one for logging a timer that
was never stopped.

Since there isn't a good way to time just the
posting of the druid query, and it probably
isn't necessary anyway, the timing has been
removed.

@archolewa archolewa force-pushed the fix-timings branch 2 times, most recently from dfda64d to e8c1b8a Compare January 18, 2017 21:43
@michael-mclawhorn
Copy link
Contributor

Reading this code makes me all sad inside.

@archolewa
Copy link
Contributor Author

Writing this code a while back filled me with RAGE. :rage4:

@michael-mclawhorn
Copy link
Contributor

👍

Copy link
Collaborator

@cdeszaq cdeszaq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the timing around the request makes sense. But we need to figure out what's going on with the timer you're seeing not get stopped. I'm guessing there's a response being sent from a requestMapper that's not playing nice with timers. If that's the case, we need to update docs somewhere to reflect that implicit requirement of request mappers.

@@ -365,6 +365,7 @@ public void getData(
uriInfo,
this
);
RequestLog.stopTiming("DataApiRequest");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this needed? The call on line 374 should switch from the timer that was open to the next one. If anything, perhaps we should wrap a timing phase around the request mapping (which would also stop this one). I guess my question is: Under what conditions was this timer not being stopped, since it looks like this shouldn't be a problem based on the code in this file.

Copy link
Contributor Author

@archolewa archolewa Jan 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that's right, I forgot that switching also stops timers. I was doing this in between another task that had a lot of "hurry up and wait" so I kept losing context.

Copy link
Contributor Author

@archolewa archolewa Jan 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I'll need to do some more digging. I know that we're having a similar problem with the hydrating dimension rows timer in Lucene. Despite the code ostensibly being there, it doesn't seem like we are actually executing the line that starts the timer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know why switchTimer isn't working. Here's the implementation:

stopMostRecentTimer();
startTiming(nextPhase);

However, the DataApiRequest timer is no longer the most recent timer, because of all the timings we added to the DataApiRequest. So, switchTimer doesn't do what you think it should if you have timings happening inside any method calls between startTimer and switchTimer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

growls I always hated this behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm telling you, when someone says "X is convenient" you should hear "X will cause unexpected bugs a few refactors from now."

--We forgot to stop the `DataApiRequest` timer.

--When timing the posting of the Druid Query we
failed to account for the thread hopping during
asynchronous programming. As a result, not only
were we failing to stop the timer in the correct
instance of the RequestLog (the one that is used
by the response handling thread), but we also
attempted to stop a timer in a RequestLog that
has since been cleared. So we end up with two
warnings: One for attempting to stop a timer that
doesn't exist, and one for logging a timer that
was never stopped.

Since there isn't a good way to time just the
posting of the druid query, and it probably
isn't necessary anyway, the timing has been
removed.
@archolewa archolewa force-pushed the fix-timings branch 3 times, most recently from 4347632 to e82d0f6 Compare January 20, 2017 19:51
--`RequestLog::switchTiming` works by stopping the most recent timer,
and then starting the specified timer. As a result, we can fail to stop
a timer A if we start and stop timer B inside a method invoked while
timer A is running. As a result, in order to ensure that all of your
timers are working correctly, you need to be cognizant of the internal
workings of every method you invoke within the timed block.

--Therefore, in an effort to make timed code more modular, we've removed
the `RequestLog::switchTiming` method, and replaced all instances of it
with try-finally blocks that explicitly stop and start each timer. It's
ugly, but it's also much less brittle.
Copy link
Collaborator

@cdeszaq cdeszaq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few tweaks, and making sure we didn't lose a timer. Otherwise, this is looking really good, thanks!

*
* @param nextPhase the name of the stopwatch to be started
*/
public static void switchTiming(String nextPhase) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should deprecate and call out in changelog, not directly remove.


if (requestMapper != null) {
apiRequest = (DataApiRequest) requestMapper.apply(apiRequest, containerRequestContext);
}

RequestLog.switchTiming("DruidQueryMerge");
RequestLog.startTiming("DruidQueryMerge");
TemplateDruidQuery templateQuery;
// Build the query template
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're gonna hate me for this, but in these refactor spots, can we float the comment to the start of the "block" (ie. above the start of the timer)? Having it in the middle is slightly confusing.

RequestLog.switchTiming("logRequestMetrics");
logRequestMetrics(apiRequest, readCache, druidQuery);
RequestLog.record(new DruidFilterInfo(apiRequest.getFilter()));
RequestLog.switchTiming(REQUEST_WORKFLOW_TIMER);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've lost the start to the request workflow timer I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we did.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to have tests to verify that the request log is correctly stimulated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably do. Especially since I looked through the logs and it looks like there are a few places where we aren't properly stopping timers (the JobsServlet and I don't think our TestDruidWebService is stopping the request workflow timer correctly).

I tried poking at fixing those problems, but my quick fix in the TestDruidWebService didn't seem to stop all the warnings about the request workflow timer, and the JobsServlet desperately needs to be cleaned up. We start and stop timers all over the place. Since that was more than I wanted to take on right now, I dropped the effort.

The tricky part is that I'm not sure how to do that test. Most methods that run timers, we could probably check easily enough that all the appropriate timers have been stopped. Not sure how to verify that a timer has been started. Especially in the DataServlet.

Ideally we'd use a mock, but RequestLog is statically accessible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we know a technique for fixing the static dependency, but it's a bigger effort than this PR.
Do we care about timers that have started but not stopped?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if the timers haven't been stopped, then when we go to log their durations, we'll get incorrect numbers.

Copy link
Collaborator

@cdeszaq cdeszaq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the last little thing about moving the changelog comment to the right section.

@@ -49,10 +49,11 @@ Current

### Changed:

- [`RequestLog::switchTiming` has been removed]((https://github.com/yahoo/fili/pull/141)
- [`RequestLog::switchTiming` has been deprecated]((https://github.com/yahoo/fili/pull/141)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to call out deprecation in deprecation section of change log

@@ -49,6 +49,16 @@ Current

### Changed:

- [`RequestLog::switchTiming` has been deprecated]((https://github.com/yahoo/fili/pull/141)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure where the other comment I left went, but this note should move into the deprecated section

@michael-mclawhorn
Copy link
Contributor

reaffirming my 👍

@cdeszaq cdeszaq merged commit 351c49d into master Jan 23, 2017
@cdeszaq cdeszaq deleted the fix-timings branch January 23, 2017 18:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants