
Conversation

@igchor igchor commented Jul 24, 2024

The latency tracker is useful for measuring and optimizing specific parts of code that are hard to profile using external tracers.

Sample output:

<LEVEL_ZERO>[INFO]: [command_list_cache_t::getRegularCommandList] average latency: 0ns
<LEVEL_ZERO>[INFO]: [command_list_cache_t::getRegularCommandList] number of samples: 0
<LEVEL_ZERO>[INFO]: [command_list_cache_t::getImmediateCommandList] average latency: 15895ns
<LEVEL_ZERO>[INFO]: [command_list_cache_t::getImmediateCommandList] number of samples: 200
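
For context, here is a minimal sketch of how such a tracker can be wired up: an RAII helper times a scope and feeds the elapsed nanoseconds into a per-call-site tracker that keeps a running sum and count, printing the average on destruction. The names (latency_tracker, scoped_latency) and the exact reporting format are illustrative assumptions, not necessarily this PR's actual code.

#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>

// Accumulates a running sum and sample count; prints the average on destruction.
struct latency_tracker {
  explicit latency_tracker(const char *n) : name(n) {}
  ~latency_tracker() {
    uint64_t n = count.load();
    std::fprintf(stderr, "[%s] average latency: %lluns\n", name,
                 static_cast<unsigned long long>(n ? sum.load() / n : 0));
    std::fprintf(stderr, "[%s] number of samples: %llu\n", name,
                 static_cast<unsigned long long>(n));
  }
  void trackValue(uint64_t ns) {
    sum.fetch_add(ns, std::memory_order_relaxed);
    count.fetch_add(1, std::memory_order_relaxed);
  }

  const char *name;
  std::atomic<uint64_t> sum{0};
  std::atomic<uint64_t> count{0};
};

// RAII helper: measures the time spent in the enclosing scope and records it.
struct scoped_latency {
  explicit scoped_latency(latency_tracker &t)
      : tracker(t), start(std::chrono::steady_clock::now()) {}
  ~scoped_latency() {
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - start)
                  .count();
    tracker.trackValue(static_cast<uint64_t>(ns));
  }

  latency_tracker &tracker;
  std::chrono::steady_clock::time_point start;
};

// Usage: one static tracker per measured call site, one scoped timer per call.
void getImmediateCommandList() {
  static latency_tracker tracker("command_list_cache_t::getImmediateCommandList");
  scoped_latency measure(tracker);
  // ... the work being measured ...
}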

@igchor igchor requested a review from a team as a code owner July 24, 2024 18:47
@github-actions github-actions bot added the level-zero L0 adapter specific issues label Jul 24, 2024
@pbalcer pbalcer left a comment

I wrote something like this before, when I was investigating latency spikes in enqueue, but I took a different approach: I made a structure that formed a hierarchy of trackers and, on destruction of the top-most object (a per-thread global), printed a (hacky, one-off) histogram of the collected data.

Something like:

static per_thread FooTracker foo; // this also had some options to, e.g., only collect data if the top-most operation took more than N ns

void foo() {
    TRACKER(&foo); // this took the function name and line number
    {
        TRACKER_SCOPE(&foo, "op1");
        // some expensive operation
    }
    {
        TRACKER_SCOPE(&foo); // or an anonymous scope
    }
}

with the global tracker object usable across module boundaries, so it could track the latency of operations in a deep call stack.

I never cleaned up that code (I can try to dig it out from somewhere among my hundreds of UR branches :P), and the histogram implementation was awful (I had to tune the buckets by hand), but it gave me a better overall picture of the latency of an operation, rather than just an average.


private:
  const char *name;
  double avg_{0};
Contributor

why is this a double and not just an integer value?

Contributor Author

I would still need to cast it to double for the calculations in trackValue, I think. I'm not sure it really matters. But I have changed the estimate() return value to uint64_t.

Contributor

> I would still need to cast it to double for the calculations in trackValue, I think. I'm not sure it really matters.

Well, I just prefer not using floating-point math if we don't have to, especially for something as simple as sum += value; ++cnt; avg = sum / cnt;. But yeah, I don't think it matters all that much.
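
For illustration, the integer-only version of that bookkeeping could look like the sketch below; the member names are illustrative assumptions, not the PR's actual code.

#include <cstdint>

// A minimal sketch, assuming latencies are tracked in whole nanoseconds.
struct rolling_latency {
  uint64_t sum{0}; // total latency observed so far, in ns
  uint64_t cnt{0}; // number of samples

  void trackValue(uint64_t ns) {
    sum += ns;
    ++cnt;
  }

  // Integer division is enough here; sub-nanosecond precision is not meaningful.
  uint64_t estimate() const { return cnt ? sum / cnt : 0; }
};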

@igchor igchor force-pushed the latency_tracker_v2 branch from 0a537da to d2f8523 on July 25, 2024 16:32

igchor commented Jul 25, 2024

> I wrote something like this before, when I was investigating latency spikes in enqueue, but I took a different approach. [...] it gave me a better overall picture of the latency of an operation, rather than just an average.

That makes sense. I think it would be good to have a histogram as well in the future. In CacheLib we used one implemented on top of folly, but that's a huge dependency and I'm not sure what its overhead is. For this rolling average, the overhead is really not noticeable for the things we are measuring.
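
As a possible future direction only (not code from this PR or from #1912), a compact histogram with power-of-two buckets would avoid both an external dependency and hand-tuned bucket boundaries; a rough, single-threaded sketch:

#include <array>
#include <cstddef>
#include <cstdint>

// Rough sketch only. Bucket i counts samples in [2^i, 2^(i+1)) ns
// (zero falls into bucket 0), so 64 buckets cover the whole uint64_t
// range with no manual tuning. Thread safety is omitted for brevity.
struct latency_histogram {
  std::array<uint64_t, 64> buckets{};

  static std::size_t bucketIndex(uint64_t ns) {
    std::size_t idx = 0;
    while (ns >>= 1)
      ++idx;
    return idx;
  }

  void trackValue(uint64_t ns) { ++buckets[bucketIndex(ns)]; }
};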


pbalcer commented Jul 25, 2024

> That makes sense. I think it would be good to have a histogram as well in the future. In CacheLib we used one implemented on top of folly, but that's a huge dependency and I'm not sure what its overhead is.

Yeah, I'd rather we find a good, small, compact histogram library. Otherwise, we can always put this behind an ifdef in CMake.

> For this rolling average, the overhead is really not noticeable for the things we are measuring.

Yeah, let's go with this simple implementation for now.


igchor commented Jul 31, 2024

> Yeah, I'd rather we find a good, small, compact histogram library. Otherwise, we can always put this behind an ifdef in CMake.
>
> Yeah, let's go with this simple implementation for now.

Please take a look at: #1912

@igchor igchor closed this Aug 1, 2024