This repository was archived by the owner on Dec 13, 2018. It is now read-only.

Add support for reporting application metrics via logging #708

Closed
analogrelay opened this issue Sep 19, 2017 · 73 comments
@analogrelay
Contributor

analogrelay commented Sep 19, 2017

Overview

We want to enable applications to emit arbitrary named metrics and have them captured by telemetry/logging systems (App Insights, Elasticsearch/ELK, etc.) as well as other metrics systems (Statsd, InfluxDB, etc.). An abstraction is desired in order to allow applications to emit metrics without being coupled to the destination of the metrics, particularly with features like App Insights auto-lightup where the app doesn't directly depend upon App Insights.

Proposal

We will add support for metrics reporting to ILogger. It will be done via a new companion interface, IMetricLogger, which loggers may optionally implement. Applications can emit metrics through this interface (see API below), and logger providers which support metrics can receive and process those metrics. Metrics are (name, value) tuples, where the value is always a floating-point number (a double). Metrics are expected to be viewed in aggregate, so it doesn't make sense to use a generic data type like object for the value; supporting arbitrary types of metric values is a non-goal. Reporting metrics should be as efficient as possible, especially when no logger provider is registered to listen for them or when they are filtered out.

Why add to ILogger?

Initially, I considered developing a new abstraction for Metrics. However, I encountered a few concerns:

  1. We would need to develop a structure for naming/organizing metrics (likely using type/namespace names as a base)
  2. We would need to develop a structure for filtering metrics (by categories, names, granularities, etc.)
  3. Many telemetry systems provide support for both log and metric collection, and each would have to implement support for yet another new system
  4. Consumers would have to pull in two different diagnostics reporters from DI (ILogger<T>, IMetricReporter<T>, etc.)
  5. This would create Yet Another Diagnostics Abstraction (on top of EventSource, EventCounters, DiagnosticSource, Perf Counters, yada yada yada (see what I did there? ;P))

The fact that 1 and 2 above already exist in the Logging pipeline, and many metric sinks are also logging sinks, led to the idea of integrating metrics into Logging rather than creating a completely new abstraction.

The main reason not to add this to ILogger would be the risk of conflating two concerns that many consider distinct: logs and metrics often flow to entirely different backends. That tradeoff is discussed at length in the comments below.

The API

IMetricLogger

Optional companion interface for ILogger implementations to indicate their support for metrics. The ILogger instances returned by ILoggerProvider are expected to implement this interface if and only if they wish to receive metrics.

namespace Microsoft.Extensions.Logging
{
    public interface IMetricLogger
    {
        /// <summary>
        /// Define a new metric with the provided name and return an <see cref="IMetric"/> that can be used to report values for that metric.
        /// </summary>
        /// <param name="name">The name of the metric to define</param>
        /// <returns>An <see cref="IMetric"/> that can be used to report values for the metric</returns>
        IMetric DefineMetric(string name);
    }
}

IMetric

Interface that allows metric data to be reported. Metric values are always doubles (extensions below handle converting other common types).

namespace Microsoft.Extensions.Logging
{
    public interface IMetric
    {
        /// <summary>
        /// Record a new value for this metric.
        /// </summary>
        /// <param name="value">The value to record for this metric</param>
        void RecordValue(double value);
    }
}

LoggerMetricsExtensions

Provides an extension method on ILogger to enable consumers to emit metrics, even if the underlying logger provider doesn't support it (of course, the metrics will be ignored in that case).

namespace Microsoft.Extensions.Logging
{
    public static class LoggerMetricsExtensions
    {
        /// <summary>
        /// Define a new metric with the provided name and return an <see cref="IMetric"/> that can be used to report values for that metric.
        /// </summary>
        /// <remarks>
        /// If none of the registered logger providers support metrics, values recorded by this metric will be lost.
        /// </remarks>
        /// <param name="name">The name of the metric to define</param>
        /// <returns>An <see cref="IMetric"/> that can be used to report values for the metric</returns>
        public static IMetric DefineMetric(this ILogger logger, string name)
        {
            if (logger is IMetricLogger metricLogger)
            {
                return metricLogger.DefineMetric(name);
            }
            return NullMetric.Instance;
        }
    }
}
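
The NullMetric type referenced above isn't spelled out in the proposal. A minimal no-op implementation would presumably look something like the following sketch (only the singleton shape is implied by the extension method above):

namespace Microsoft.Extensions.Logging
{
    /// <summary>
    /// An <see cref="IMetric"/> that discards all recorded values. Returned when no
    /// registered logger provider supports metrics. (Sketch; not part of the proposal text.)
    /// </summary>
    internal sealed class NullMetric : IMetric
    {
        public static readonly NullMetric Instance = new NullMetric();

        private NullMetric() { }

        public void RecordValue(double value)
        {
            // Intentionally a no-op: no provider is listening for this metric.
        }
    }
}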

MetricValueExtensions

Extension methods for IMetric to support other common metric value types (as mentioned above, the underlying type is still double, so the value must be representable as a double).

Suggestion: we could also have a .Time() API that returns an IDisposable and uses a Stopwatch to do the timing for you (a sketch follows the code below). That would be easy to add later, though.

using System;

namespace Microsoft.Extensions.Logging
{
    public static class MetricValueExtensions
    {
        /// <summary>
        /// Record a new value for this metric.
        /// </summary>
        /// <remarks>
        /// This is a convenience method that will convert the <see cref="TimeSpan"/> to a <see cref="double"/> via
        /// the <see cref="TimeSpan.TotalMilliseconds"/> property.
        /// </remarks>
        /// <param name="metric">The metric to record the value on.</param>
        /// <param name="value">The value to record for this metric.</param>
        public static void RecordValue(this IMetric metric, TimeSpan value) => metric.RecordValue(value.TotalMilliseconds);
    }
}
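
The .Time() suggestion above could be layered on as another extension without touching the interface. A sketch, assuming a hypothetical MetricTimer struct (none of this is part of the proposed API):

using System;
using System.Diagnostics;

namespace Microsoft.Extensions.Logging
{
    public static class MetricTimingExtensions
    {
        /// <summary>
        /// Starts a timer; when the returned value is disposed, the elapsed time
        /// is recorded on the metric in milliseconds. (Sketch only.)
        /// </summary>
        public static MetricTimer Time(this IMetric metric) => new MetricTimer(metric);

        public struct MetricTimer : IDisposable
        {
            private readonly IMetric _metric;
            private readonly long _startTimestamp;

            public MetricTimer(IMetric metric)
            {
                _metric = metric;
                _startTimestamp = Stopwatch.GetTimestamp();
            }

            public void Dispose()
            {
                long elapsedTicks = Stopwatch.GetTimestamp() - _startTimestamp;
                // Convert Stopwatch ticks to milliseconds before recording.
                _metric.RecordValue(elapsedTicks * 1000.0 / Stopwatch.Frequency);
            }
        }
    }
}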

Logger updates

The aggregate Logger class we return from LoggerFactory.CreateLogger would be updated to implement IMetricLogger, even if none of the ILoggers it aggregates support it. It would only forward metrics to the loggers that are known to support the interface.

Define/Record pattern

In order to keep the performance cost of recording a metric value as low as possible, the API uses a Define/Record pattern. Consumers should expect to call DefineMetric("foo") once for a particular value of "foo" and store the result as globally as possible (in singleton types, etc.). DefineMetric does the bulk of the work to set up the metric recording. After calling DefineMetric, the RecordValue call on the IMetric interface can choose how to store the value based on the provider's needs. For example, a provider that plans to send pre-aggregated metrics could simply update rolling aggregate values in RecordValue and then drop the individual recording (thus having constant storage requirements per metric). If a provider needs to report individual values, it can use dynamically-allocated buffers or ring buffers, depending upon its configuration and needs. As an example of an IMetric implementation based on pre-aggregating data, see https://gist.github.com/anurse/03c6746204317bd3616c372b4df9dbba
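
To make the intended calling pattern concrete, here is a sketch of a typical consumer (the service and metric names are illustrative, not from the proposal):

using Microsoft.Extensions.Logging;

public class CheckoutService
{
    private readonly IMetric _checkoutDuration;

    public CheckoutService(ILogger<CheckoutService> logger)
    {
        // Define once, ideally in a singleton, and reuse the IMetric on every call.
        _checkoutDuration = logger.DefineMetric("CheckoutDuration");
    }

    public void Checkout()
    {
        var stopwatch = System.Diagnostics.Stopwatch.StartNew();
        // ... do the actual work ...
        // Hot path: a single double, no allocation required by the abstraction.
        _checkoutDuration.RecordValue(stopwatch.Elapsed); // TimeSpan overload from above
    }
}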

Filtering Metrics

Metrics should be filterable, and they should participate in the existing Logger filtering pipeline. My current proposal is that metrics are treated like LogLevel.Critical messages: if the category is enabled at all, the metrics are written. We could consider capturing a LogLevel in .DefineMetric and using that to provide different "levels" of metrics. Since the existing filters only let you filter by Provider, Category, and Level, it is very difficult to filter out specific metrics by name (that would require a new filter structure). This may make LogLevel support for metrics more important.

@pakrym and I talked and came up with a specific suggestion:

Metrics can be filtered by provider and category name, but there is no concept of a LogLevel for metrics. However, filters can disable all metrics for a particular provider/category combination. To do this, we'll add an AllowMetrics boolean to LoggerFilterRule, defaulting to true. When writing a metric, we'll check the provider and category name against the filter options, as well as checking this boolean.
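
In code, the check might look roughly like the sketch below. LoggerFilterOptions, LoggerFilterRule, ProviderName, and CategoryName are the existing filtering types/members; AllowMetrics is the proposed addition, and rule selection is reduced to "first match wins" purely for illustration:

using Microsoft.Extensions.Logging;

internal static class MetricFilteringSketch
{
    public static bool MetricsEnabled(LoggerFilterOptions options, string providerName, string categoryName)
    {
        foreach (LoggerFilterRule rule in options.Rules)
        {
            bool matches = (rule.ProviderName == null || rule.ProviderName == providerName)
                        && (rule.CategoryName == null || rule.CategoryName == categoryName);
            if (matches)
            {
                return rule.AllowMetrics; // proposed property, defaulting to true
            }
        }
        return true; // no rule disables metrics
    }
}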

Scopes and Property Bags

FEEDBACK REQUESTED

My initial thought is that metrics exist outside of scopes: they are generally aggregated, and scopes imply per-event data. Providers could, in theory, give their IMetric instances access to read active scope data and attach it to metrics if they choose (for example, systems that report unaggregated metrics may want to). For similar reasons, property bags (i.e. log values/structured logging) don't generally make sense for metrics, as metrics are usually aggregated. Having said that, many providers do support arbitrary properties (InfluxDB, App Insights), so some level of support may be useful. We could easily add an optional parameter to .RecordValue to allow attaching arbitrary properties to each recorded value, and providers could disregard it if it doesn't make sense for them.
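
If that optional parameter were added, a hypothetical shape (not part of the proposed API above) could mirror the generic TState pattern that ILogger.Log uses, so recording stays allocation-free when the properties are a struct:

namespace Microsoft.Extensions.Logging
{
    // Hypothetical variant of IMetric, for discussion only.
    public interface IMetric
    {
        void RecordValue(double value);

        // Hypothetical overload: TProperties carries per-value dimensions.
        // Providers that don't support properties simply ignore the argument.
        void RecordValue<TProperties>(double value, TProperties properties);
    }
}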

Impact on existing Providers

There is minimal impact on existing providers. Because the interface provides an opt-in model for a provider to receive metrics, existing providers are unaffected. I'd imagine providers like Serilog would not participate in this system, as they don't have an infrastructure for metrics (/cc @nblumhardt, I may be wrong!). Providers like App Insights may choose to either add metrics support to their existing providers, or add new co-registered providers that only handle metrics (and ignore log messages).

@analogrelay
Contributor Author

PR of initial implementation coming very soon!

@nblumhardt
Contributor

Thanks for the ping @anurse! Looks good from a Serilog perspective, AFAICT.

I've done some work with InfluxDB in this area; the single numeric value sounds fine, but the lack of support for key/value tags seems like it will limit the API's usefulness with Influx (not sure about other platforms).

In scenario terms -

  • Recording transaction processing times, I want to slice the data by the particular transactions, by customer, by machine name, by database host IP... but still roll up aggregates over any of those dimensions (i.e. by considering them to belong to the same metric)
  • Recording app startup time, I want to slice the data by app version, runtime version, OS and so-on... but still roll those up, and see the trend over time across multiple dimensions
  • ... etc. (happy to provide more if it would help)

@analogrelay
Contributor Author

I want to slice the data by the particular transactions, by customer, by machine name, by database host IP... but still roll up aggregates over any of those dimensions

Yeah, I came to the same conclusion in my explorations with Influx. I think the pattern I'm leaning towards is allowing key-value pairs in RecordValue and allowing Providers to ignore them if they aren't relevant (ditto Scopes).

@analogrelay
Contributor Author

analogrelay commented Sep 19, 2017

Properties/Scopes are a tricky one to solve well.

  1. We want to be able to record a metric value without requiring an allocation. Thus, we want to use the same generic pattern we use for logging, where we accept an arbitrary T (which could be a struct).
  2. However, metrics are usually stashed away somewhere and then flushed to the back-end by a background process. This means structs would have to be boxed (unless you do something like a ring buffer, but even that is a stretch).
  3. Scopes are even trickier, because there could be a lot of data stored in the scopes, and much of it may be irrelevant to metrics. For example, the Request ID is not very relevant to metrics (use a log message if you want info on a single request), but the Request Path is relevant (as multiple requests can share that attribute). It's impossible for a metric provider to know which fields matter for a given metric, and the more data associated with the metric, the more impact recording has on the performance of the code being measured.

I'd lean towards supporting arbitrary properties, but discouraging providers from reading Scope data (we can't stop them from reading it ;)). It does mean that if I want to attach some arbitrary property to an EF metric (for example), I can't do it unless EF exposes a way to do it. Perhaps that's OK for now.

@alhardy

alhardy commented Sep 19, 2017

Are metrics really a logging concern? If I were interested in tracking application metrics, would it make sense to have to take a dependency on Microsoft.Extensions.Logging?

Have you guys looked at App Metrics for example?

This reminds me of when I noticed Health Check functionality being added.

Once I noticed Health Checks being implemented by Microsoft, I went to the effort of abstracting that out of App Metrics. I don't see any progress on https://github.com/dotnet-architecture/HealthChecks anymore?

@muratg

muratg commented Sep 19, 2017

@glennc

@galvesribeiro

galvesribeiro commented Sep 20, 2017

@anurse a while ago, I added Telemetry APIs to Orleans (dotnet/orleans#992), which essentially added an abstraction covering both trace/logging and metrics tracking, mostly inspired by the AI and NewRelic APIs.

Recently, we removed the log/tracing part of those APIs in favor of the Microsoft.Extensions.Logging abstractions, so we could leverage the packages (like Serilog) that already ship plugins for those abstractions.

Now I'm going to revamp the telemetry APIs for v2, and someone from the Orleans core team pointed me to this issue. It would be good if we could use these metrics abstractions, but I have some concerns about the way it is being implemented:

  1. Why not make it IMetric<T>, where T can be int, float, decimal, or long? I think supporting only double is too restrictive.
  2. Why not put it in Microsoft.Extensions.Metrics? I agree with @alhardy that metrics are not a logging concern. From the APM perspective, both metrics and trace/logging are telemetry-related concerns, and they are correlated. So if you want to keep both together, I guess you would need a Microsoft.Extensions.Telemetry package; otherwise, keeping them separate would be better. I understand you want to leverage the categories infrastructure from MEL, but that could be abstracted into another package and referenced by both MEL and MEM.
  3. I'm not sure the Define/Record pattern is a good thing... Metrics carry much more metadata than that. They may or may not be aggregated, batched, auto-flushed, etc. There is much more configuration, and it must be defined at metric creation time.
  4. Property bags are definitely something you will want to support on all metric-tracking methods. Those bags carry context that makes sense for a given metric; e.g. if you are tracking an HttpRequest, you would include the headers, browser and server details, etc. While tracking an exception, you would need context as well. Essentially an IDictionary<string, object> where each object must be JSON-serializable.

I would be interested in using this in Orleans, but as it stands, it would limit us a lot.

@alhardy

alhardy commented Sep 20, 2017

And here I am, asked to remove my dependency on Microsoft.Extensions.Logging and Microsoft.Extensions.DependencyInjection: AppMetrics/AppMetrics#186

I'm not sure why Microsoft is interested in shipping its own implementation to report metrics to OSS TSDBs like InfluxDB when it already has AppInsights. There doesn't seem to be much support from Microsoft in terms of OSS; wasn't this the idea of the dotnet foundation, rather than including everything possible in the dotnet framework?

App Metrics is my first attempt at an OSS project (other than contributing to other projects); rather than supporting it, Microsoft has equivalent implementations underway for both Metrics and Health Checks. There are also other examples of this sort of thing that .NET developers have raised, recently and in the past.

Microsoft has done amazing things all-round recently, and I'm very impressed, but the above makes me lose motivation for contributing to the dotnet community :(

@Eilon
Member

Eilon commented Sep 20, 2017

@alhardy we haven't decided on anything just yet! That's why we've logged this issue: to invite community discussion!

@Drawaes

Drawaes commented Sep 20, 2017

Here are my thoughts on this, as someone who works in a very large corporate environment with both large logging and large telegraf/influxdb setups.

  1. Absolutely, the metrics need to be "baked" into the framework, so that all framework components will support/use them.
  2. There is plenty of scope for "OSS" in the space of writing the adapters and aggregators.
  3. We lost a bit when Windows perf counters didn't make the move xplat; this would help fill that gap.
  4. Having to register your metric upfront leads to the problem you described above around needing arbitrary extra key/value metadata added. Instead...

Make it so you can fire any metric, and effectively make the metric name a "namespace". So your metric might be something like

http:response:200:/api/blabla|10

Or something like that. Then it's really up to the adapters to be efficient and smart. In fact, if you make the "key" just opaque bytes, you can deal with it however you like and cache those values.

Anyway, just my 2p, but this is a very important topic, and one that I see as a weak point of C#/aspnetcore apps in a big mixed Java/Go/C# ecosystem, so I am happy this is being discussed!

@alhardy

alhardy commented Sep 20, 2017

@Drawaes Yeah OK good points, in particular 1 and 3 👍

@cwe1ss
Contributor

cwe1ss commented Sep 20, 2017

As always, I'm adding my rant that you should sit together with Vance and the corefx team to not come up with yet another solution for very similar things.

E.g. metrics for operation-based things (like incoming/outgoing HTTP requests) are already covered by Activity. People/metrics providers can just subscribe to activities and therefore re-use everything Activity offers, including what has been requested above: timing, count, object context, tags.
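
For context, a metrics consumer built purely on that existing instrumentation could look like the sketch below (the event name is the ASP.NET Core hosting convention; the Report method and the wiring via DiagnosticListener.AllListeners are placeholders):

using System;
using System.Collections.Generic;
using System.Diagnostics;

public class RequestDurationObserver : IObserver<KeyValuePair<string, object>>
{
    public void OnNext(KeyValuePair<string, object> evt)
    {
        if (evt.Key == "Microsoft.AspNetCore.Hosting.HttpRequestIn.Stop")
        {
            Activity activity = Activity.Current;
            if (activity != null)
            {
                // Duration, operation name, and tags all come from the Activity.
                Report(activity.OperationName, activity.Duration, activity.Tags);
            }
        }
    }

    public void OnCompleted() { }
    public void OnError(Exception error) { }

    private void Report(string name, TimeSpan duration,
                        IEnumerable<KeyValuePair<string, string>> tags)
    {
        // Push to Prometheus, Zipkin, InfluxDB, ...
    }
}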

Don't make us implement yet another abstraction in our frameworks/libraries - 283 lines of instrumentation code (as in the incoming ASP.NET Core request Diagnostics class) is enough. :)

See https://github.com/dotnet/corefx/issues/18114 for more context.

Give Activity some love and make sure it's easily usable and well documented.

People who need other kinds of metrics can always use e.g. AppMetrics which offers much more functionality anyway.

By introducing such a simple metrics abstraction in Microsoft.Extensions.*, you're just creating more Microsoft-internal competition and you're probably killing a great OSS project.

@davidfowl
Member

@cwe1ss how would you suggest we implement exposing the current connection count in SignalR and the number of websocket messages sent per second?

@davidfowl
Member

As always, I'm adding my rant that you should sit together with Vance and the corefx team to not come up with yet another solution for very similar things.

We're best buddies with @vancem 😄

@cwe1ss
Contributor

cwe1ss commented Sep 20, 2017 via email

@cwe1ss
Contributor

cwe1ss commented Sep 20, 2017

(This would only cover the Connection count though).
Aren't websocket messages per sec something for PerfCounter/EventCounter?

@davidfowl
Member

@cwe1ss an activity to count the number of concurrent connections? I understand you want a single abstraction, but that's just a hack, right? The reason metric systems take numbers is that they are counting things. Trying to shoehorn everything into the same API makes for a mediocre experience for producers and consumers, as we'll end up inventing hidden semantics that can't be expressed in the API.

@cwe1ss
Contributor

cwe1ss commented Sep 20, 2017

no, the Activity is used on the instrumentation side to indicate an operation: nothing more, nothing less. Why should I, as a framework/library developer, care whether you want to use something like Zipkin vs something like Prometheus? It's up to the consumers to decide what to do with it (in the DiagnosticSource events): e.g. Zipkin might create a trace/span, Prometheus might create multiple counters (total, per country, whatever, ...), ...

If creating an activity and running through the subscribers is too much overhead (maybe for messages in a connection, but surely not for the connection itself, right?), then why not use existing stuff like EventCounter/PerfCounter?

I'm not saying that I like Activity/EventCounter/... 100%, but it's what's there in corefx, and it's brand new, so why invent something else??

We've had this discussion before, so we probably just have to agree to disagree. I just wanted to place my rant again so that I can tell my grandkids one day that it's not my fault they have to use 20 different Diagnostics things in .NET 😀

@cwe1ss
Contributor

cwe1ss commented Sep 20, 2017

PS: I'm not familiar with web sockets. Is the connection duration also an interesting metric? By instrumenting it with Activity you would get that automatically as well.

@ThatRendle

It's not at all clear why IMetricLogger would be an optional, additional interface that ILogger implementations can add. Logs go to ELK or Splunk or Seq or wherever; metrics go to time-series databases (Influx/Prometheus/Graphite) or stats aggregators (statsd/collectd/Telegraf). They're not remotely the same thing.

I can see that it would make sense for the Microsoft Application Insights® ILogger implementation to also accept metrics, because that platform provides both of those things, and I know there are other telemetry systems that do likewise, but many of the solutions out there do not. If @alhardy wants to add an implementation of this IMetricLogger abstraction to App Metrics, where is the sense in him having to provide any kind of implementation of ILogger at all? It's just going to no-op on Log(...), return false from IsEnabled, and do I-don't-know-what on BeginScope().

Yes, there is a degree of commonality regarding points 1 and 2 in @anurse's list (type/namespace bits and so on), but there's a world of difference beyond that.

@ThatRendle

@anurse DefineMetric should have a key/value pairs parameter as well.

@ThatRendle

Would a Counter metric work by doing RecordValue(1)?

Related: is there a case to be made for having an on-interface RecordValue(long value) overload as well as the double one? In many of the line formats, integers and floats are denoted differently (e.g. 1i vs 1.0).

@Drawaes

Drawaes commented Sep 20, 2017

I don't like the Define step... What if I define it twice? What if I don't define it before I send a metric?

Also, I tend to agree that logging and metrics are separate... like @markrendle called it: we're Kafka/Splunk for logs and telegraf/influxdb for metrics.

@ThatRendle

ThatRendle commented Sep 20, 2017

I'm guessing that once the metric is defined the first time, it hangs around and gets re-used like Loggers do, although that appears to be up to the implementer of IMetricLogger. And I guess the baked-in meta-implementation of Logger would do that? Something like

private readonly ConcurrentDictionary<string, IMetric> _metrics =
    new ConcurrentDictionary<string, IMetric>();

public IMetric DefineMetric(string name, params (string key, string value)[] tags)
{
    return _metrics.GetOrAdd(name, n => CreateNewMetric(n, tags));
}

private IMetric CreateNewMetric(string name, params (string key, string value)[] tags)
{
    // Fan the new metric out to every aggregated logger that supports metrics.
    var metric = new MetaMetric();
    foreach (var loggerInfo in Loggers)
    {
        if (loggerInfo.Logger is IMetricLogger metricLogger)
        {
            metric.Add(metricLogger.DefineMetric(name, tags));
        }
    }
    return metric;
}

So if the meta-implementation does that, do the actual implementations even need to cache them internally?

Of course now there's a question whether the tags would need to be part of the key that uniquely identifies the metric, and the performance overhead of turning a name plus an arbitrary array of (string,string) tuples into a key...

@Drawaes

Drawaes commented Sep 20, 2017

Sure, I get what it will do. My question is: why leak the "create it and cache it" abstraction? Why can't I send any arbitrary key, and if it's the first time it's been seen, then create it?

@ThatRendle

Well, if you've got a type that's already designed to be a singleton (like an API client wrapper), it's better to define the metrics in the constructor into readonly IMetric fields and then just use them on every call.

@Drawaes

Drawaes commented Sep 20, 2017

What if I have a metric for every HTTP status code returned? Do I have to register, say, all 1000 numbers or whatever in case I send one? That is my problem with forcing pre-registration.

Now you might say I could track whether I've created it or not, and I certainly could, but why should the client app have to do that or care? One big difference between metrics and logging is that metrics should be much more fire-and-forget (hence UDP and arbitrary keys in statsd).

@galvesribeiro

This is what I plan to do for the Orleans telemetry APIs v2, in case we don't find the proper abstractions here:

Just as context: the facet system is a way to configure some of the services that are injected into other types.

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public abstract class MetricFacet : Attribute
{
    public string Name { get; set; }
    public bool IsShared { get; set; }
    public bool AutoFlush { get; set; }
}

// Each type of metric would have a facet attribute describing how it can be configured.
// Since the Facets system was designed to offer a common, standardized way to
// inject framework services, I thought it would be a good fit for telemetry/metrics.
public class CounterMetricFacet : MetricFacet
{
    // Should aggregate "BatchCount" events before flushing (even if AutoFlush == true)
    public bool Aggregate { get; set; }
    public int BatchCount { get; set; } = 10;
}

public class TimeSpanMetricFacet : MetricFacet
{
    // Similar to batching, but only reports data when a window has expired.
    // (Attribute arguments must be compile-time constants, so the window is expressed in minutes.)
    public double WindowMinutes { get; set; }
}

// Basic metric interface which all metrics should implement.
// Regardless of the type of metric, they all need to flush to the registered telemetry consumers.
public interface IMetric<T>
{
    Task Flush();
}

// The counter metric behaves similarly to a performance counter, counting an arbitrary number.
// Based on the facet configuration, it can aggregate a number of events and only flush once.
public interface ICounterMetric<T> : IMetric<T>
{
    Task Increment(Dictionary<string, object> meta = null);
    Task Decrement(Dictionary<string, object> meta = null);
    Task Increment(T amount, Dictionary<string, object> meta = null);
    Task Decrement(T amount, Dictionary<string, object> meta = null);
}

// Another example of a metric, used to track execution times.
public interface ITimeSpanMetric : IMetric<TimeSpan>
{
    Task BeginRecording(Dictionary<string, object> meta = null);
    Task EndRecording(Dictionary<string, object> meta = null);
}

// A regular grain which receives the metrics injected as facets.
public class MyGrainWithCounter : Grain, IMyGrainWithCounter
{
    private ICounterMetric<int> _myIntCounter;
    private ITimeSpanMetric _myTSMetric;

    public MyGrainWithCounter(
        [CounterMetricFacet(Name = "MyIntCounter", IsShared = true, Aggregate = true, BatchCount = 100)] ICounterMetric<int> myIntCounter,
        [TimeSpanMetricFacet(Name = "MyTSMetric", AutoFlush = true, WindowMinutes = 1)] ITimeSpanMetric myTSMetric)
    {
        _myIntCounter = myIntCounter;
        _myTSMetric = myTSMetric;
    }

    public async Task DoSomethingLong()
    {
        await _myTSMetric.BeginRecording();
        await Task.Delay(100000); // Do something really useful that takes time...
        await _myTSMetric.EndRecording();
    }

    public async Task AddSomethingForMe(int amount)
    {
        // Do some business logic for the grain method; once we're done, increment the counter by the value passed in
        await _myIntCounter.Increment(amount);

        // We need the APM tool to receive the metric data right away, so flush it!
        await _myIntCounter.Flush();
    }
}
// The flushing, aggregating, etc. are all configurations specified on the facet attribute,
// so the metric can be created and injected before the grain is created. That way we achieve
// per-metric, per-grain-type granularity of configuration.
// The telemetry consumers would be almost the same as they are today.
// Depending on the metric type, shared metrics could leverage a system target that does the
// counting for a particular metric and behaves as a backend for it, making use of the
// turn-based concurrency model (single-thread guarantee per grain), so no metric writes would be lost.

Not all aspects of this implementation would fit a general-purpose metric abstraction, but they give a sense of the freedom possible in public developer APIs. For example, the Facets system used in Orleans could easily be replaced by a MetricsFactory.Create<T>(params) injected from DI.

On the other hand, what we call a Telemetry Consumer is nothing more than another abstraction into which logging/metric framework developers plug their backend implementations (i.e. AI, NewRelic, App Metrics, etc.). That way we decouple metric publishing from metric consumption in the framework.

The IMetric<T> variations would have particular implementations with different semantics. For example, ICounterMetric would behave similarly to PerfCounters, but it wouldn't care how the metric is consumed. It just takes care of collecting/counting, pre-aggregating, and flushing to the consumer.

@galvesribeiro

@anurse

It looks like, in your suggestion, aggregation and consumption parameters are specified at the place where you write the metrics. From what I've seen, there's enough variation in how users want to aggregate that it feels cleaner to push the aggregation logic out to the consumption end. Is that not something you've seen in your scenarios? That would be valuable feedback for us :)

The aggregation doesn't happen on the publisher side; it is just a configuration that tells the IMetric<T> whether the publisher should flush that metric right away to the consumer. The consumer can also apply its own windowing/batching logic before pushing to the underlying/external telemetry system/db. The capture (even with aggregation) can happen very fast, while the consumer can afford to be slow.

Looking at the Orleans actor implementation (aka Grain), you have the following:

public class MyGrainWithCounter : Grain, IMyGrainWithCounter
{
    private ICounterMetric<int> _myIntCounter;
    private ITimeSpanMetric _myTSMetric;

    public MyGrainWithCounter(
        [CounterMetricFacet(Name = "MyIntCounter", IsShared = true, Aggregate = true, BatchCount = 100)] ICounterMetric<int> myIntCounter,
        [TimeSpanMetricFacet(Name = "MyTSMetric", AutoFlush = true, WindowMinutes = 1)] ITimeSpanMetric myTSMetric)
    {
        _myIntCounter = myIntCounter;
        _myTSMetric = myTSMetric;
    }

    public async Task DoSomethingLong()
    {
        await _myTSMetric.BeginRecording();
        await Task.Delay(100000); // Do something really useful that takes time...
        await _myTSMetric.EndRecording();
    }

    public async Task AddSomethingForMe(int amount)
    {
        // Do some business logic for the grain method; once we're done, increment the counter by the value passed in
        await _myIntCounter.Increment(amount);

        // We need the APM tool to receive the metric data right away, so flush it!
        await _myIntCounter.Flush();
    }
}

In this particular case, since the ICounterMetric<int> was configured with IsShared, it will share the counter's resources across multiple actors using that same counter name (while of course respecting concurrency, e.g. by using cheap locks). If you set IsShared = false, that particular metric belongs only to that given actor instance.

That flexibility allows multiple kinds of metrics to be collected, regardless of whether they are shared and regardless of whether they are batched.

So, to answer the first question: the aggregation logic can happen in two places, at the object level and at the consumer level. At the object level (non-shared metrics), you can aggregate data from a particular instance of the object. At the consumer level, there can be many different aggregation and batching implementations.

but we also realized that a pattern where the producer can emit key-value pairs and consumers can decide if they can support them or not

Yeah, that was my point. Whether those key-value pairs are used or not is up to the consumer implementation.

I'd hesitate to support IMetric<T>, as I think many providers would have a lot of trouble aggregating, say, a string ;). Providing the ability to use any numeric type and propagate that type down to the provider (so the provider can apply any necessary optimizations) does seem reasonable though. Just have to think of the way to do it (since you can't do where T: int | double | decimal | ... ... yet?)

People tend to think that metrics can be achieved with just numbers. I would say that if we compare with logs, logs are just strings; yet today there is a trend to mix logs with contextual data, and you get structured logs. As much as I don't like to rely on those structured-string parsers, I think the flexibility to declare whatever data structure you want to represent your metrics has its uses. Telemetry (both logs and metrics) can carry structured data. If someone wants to create their own consumer that handles a strongly-typed data structure holding all the payload content they want for their telemetry, why should we block it?

I know that each pair is a key-value, but sometimes it would be better to have strong typing to describe the events. If you look at the AI SDK, you have RequestTelemetry, ExceptionTelemetry, EventTelemetry, etc. They all follow a strongly-typed object hierarchy that gives much more meaning to telemetry than a simple PerfCounter increment.

IMHO, people don't give telemetry much attention and reduce it to just "PerformanceCounters" and "plain logs".

@analogrelay
Contributor Author

People tend to think that metrics can be achieved with just numbers. I would say that if we compare with logs, logs are just strings.

Do you have an example of a string that is a metric and wouldn't otherwise be represented as an event? All the consumers I've explored (Influx, Statsd, App Insights) only support numeric types for metrics. Metrics are values which are viewed almost entirely in aggregate (yes, as @Drawaes points out, you may want to view the current value of a metric, but in general you aggregate). We want to support offloading that aggregation to later processing, and thus want to avoid aggregating in-process, but it's still a value that is designed to be aggregated.

I know that there is a key-value on each pair, but sometimes, it would be better if we have strong typing to describe the events

I'm pretty sure @benaadams will hunt me down if I make him allocate objects to write a metric ;). Metrics are often written on the hot paths of an app. Being able to emit a single double on the stack and have it stuffed into a ring buffer or aggregated with other values, all without allocating, is pretty valuable to making sure metrics can be recorded down all the hot paths.
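
For illustration, here is a minimal pre-aggregating IMetric along those lines (a sketch, not the gist linked earlier): RecordValue never allocates, and storage per metric stays constant no matter how many values are recorded.

using Microsoft.Extensions.Logging;

internal sealed class AggregatingMetric : IMetric
{
    private readonly object _lock = new object();
    private long _count;
    private double _sum;
    private double _max = double.MinValue;

    public void RecordValue(double value)
    {
        lock (_lock)
        {
            _count++;
            _sum += value;
            if (value > _max) _max = value;
        }
    }

    // Called periodically by the provider's flush loop (hypothetical).
    public (long Count, double Sum, double Max) Snapshot()
    {
        lock (_lock)
        {
            var snapshot = (_count, _sum, _max);
            _count = 0;
            _sum = 0;
            _max = double.MinValue;
            return snapshot;
        }
    }
}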

@Drawaes

Drawaes commented Sep 20, 2017

Yeah I have only seen numbers. Anything else is a log or it's a number against a new key. IMHO ;)

@galvesribeiro

I'm pretty sure @benaadams will hunt me down if I make him allocate objects to write a metric ;).

Then we should kill all services like NewRelic, Application Insights, DataDog HQ, AppDynamics, and all the other pretty well-established APM tools/services, as they all allocate objects for complex (not-just-a-number) metrics. :)

Metrics are often written in the hot paths of an app.

Agreed.

@Drawaes

Drawaes commented Sep 20, 2017

You don't wanna open that can of worms... cause yeah, I wouldn't have most of those in my hot path... ;)

@benaadams
Contributor

Then we should kill all services like NewRelic, Application Insights, DataDog HQ, AppDynamics, and all the other pretty well-established APM tools/services, as they all allocate objects for complex (not-just-a-number) metrics.

And they become your bottleneck if you try to log metrics at very high speed; the documentation for Application Insights even says:

Aggregation. When working with metrics, every single measurement is rarely of interest. Instead a summary of what happened during a particular time period is important. Such a summary is called aggregation. In the above example, the aggregate metric sum for that time period is 1 and the count of the metric values is 2. When using the aggregation approach, you only invoke TrackMetric once per time period and send the aggregate values. This is the recommended approach since it can significantly reduce the cost and performance overhead by sending fewer data points to Application Insights, while still collecting all relevant information.

@Drawaes

Drawaes commented Sep 20, 2017

Again, I think the issue here is conflation... I want to measure stock ticks per second, latency through a system, or, say, bullets fired per second, and map it to CPU use, memory, or GC collections. I want to know when a system's latency is climbing and map it in realtime (or near-realtime, with small amounts of aggregation) to what's going on. If I want higher-level stats, I can use the methods discussed previously, say structured logging or App Insights or something.

@nblumhardt
Contributor

FWIW, the last fairly large system I implemented metrics in with InfluxDB/Grafana used a combination of those approaches - some important or high-cardinality metrics went over as raw points, and high-frequency counters were mostly pre-aggregated (point volumes were an ongoing problem for quite a while).

@macrogreg

Hi everybody,

I am excited to see such a lively thread on this. I will read it in detail and respond more thoroughly in a couple of days, when I return from an industry conference. But to inform the discussion, let me say that we are currently working on a Metrics Library that we hope to release as a public beta within the next few months. You can see the work in progress in our Application Insights SDK Labs repo, in the "metrics/master" branch under the "Metrics" directory:

https://github.com/Microsoft/ApplicationInsights-SDK-Labs/tree/metrics/master/Metrics

While many details are already set for the initial version, I am looking forward to a discussion to help us iterate and improve.

@galvesribeiro

Guys, again, I'm not saying you should allocate those objects for all metrics. Also, I'm not saying you should push metrics straight to AI (or whatever APM tool) from your hot path. You can collect simple metrics on the publisher side, which should be fast, while on the consumer side (which is where you create those objects) you can afford to be slower and capture more context, do aggregations, batching, etc.

For example, in an Online Travel Agency/Airline system, the speed of a trip-query response is all that matters. However, you still need to know: how much time was spent querying each source (Amadeus, Sabre, LiveTrip, etc.)? How many of those queries were converted to a booking? What is the average ticket? Those are all business metrics. If at a given point in time the booking rate drops and the time spent in queries rises, I need to correlate them, and having just a purchase count is not rich enough to give me that information (even worse if I rely on simple log strings!), so I'll only know that "we sell fewer flights" but will never know why.

My point is, APM is much more than simple counters. If there is a performance cost to using it, then it is up to the person designing the system to consider whether the information is relevant to them. IMHO, it is a tradeoff, like everything in CS.

@galvesribeiro

BTW, my feeling here is that we are conceptually changing subjects.

I guess people here want a new Windows Performance Counters v2. If that is the case, I'm out of the discussion, as I'm talking about metrics in the context of APM/telemetry.

@cwe1ss
Contributor

cwe1ss commented Sep 21, 2017 via email

@cwe1ss
Contributor

cwe1ss commented Sep 21, 2017

You've got to start from the requirements. This is missing so many aspects of monitoring/tracing, such as sampling, correlation, and tagging/context information. If you design and release this, you will come running back to change it soon. Google's Dapper paper, which laid the foundations of such efforts in the industry, is a good starting point.

@aliostad Do you know the new Activity type in System.Diagnostics.DiagnosticSource? It is based on OpenTracing's Span construct (though it lacks most of the other constructs).

In terms of requirements, I would also like to know what the actual usages will be in the asp.net core framework.

@aliostad

aliostad commented Sep 21, 2017

No, I have not seen it; @cwe1ss thanks for the ping.
So does this hook on to the current Activity?

It is great to see that .NET has adopted OpenTracing.

@codefromthecrypt

codefromthecrypt commented Sep 21, 2017 via email

@cwe1ss
Contributor

cwe1ss commented Sep 21, 2017

So does this hook on to the current activity?

This proposal does not, no.

It is great to see that .NET has adopted OpenTracing.

It only has partially. It doesn't, for example, have support for inject/extract, and it doesn't have a "tracer" component (the closest is its integration with DiagnosticListener).

@aliostad

@adriancole yes, it is great to see that we have finally started to have good things.

@alhardy

alhardy commented Sep 23, 2017

I agree with conflicting points raised here, and it's been an awesome insight reading through this.

For a top-notch, world-class-performing web framework, I agree on the points raised around avoiding allocations and not forcing in-process aggregation.

I would like to mention that for some applications a simple metrics-reporting approach is ideal, and having aggregated metrics available in-process opens up some interesting use cases. Things like statsd and telegraf are great; however, if the cost of reporting and aggregating in-process is pretty much negligible for a particular application, I don't see why one would bother maintaining and configuring an aggregator, especially when that cost is much less than the typical things an application has to do to be useful (HTTP/DB calls, for example). Similarly with logging: some prefer to batch and push directly to elasticsearch, others prefer to use logstash or beats.

The data just flows towards the lower-level logging mechanisms and then there is one choke point where the data becomes consumable (ideally something like EventListener or DiagnosticListener).

That ^ makes a lot of sense IMHO. I don't, however, see why I would need an IMetricLogger and IMetric exposed to define and report on my own metrics, unless all the reporters consuming Microsoft.Extensions.Logging already existed.

@Drawaes

Drawaes commented Sep 23, 2017

@alhardy I think both your scenarios are doable with the proposal. In fact, I think basically everyone's are.

If there is a simple lightweight metrics collection, then it's up to the "collector" how the metric behaves. If you need "low touch", you can have a collector that just forwards to statsd/telegraf etc.

If you don't want to go to those lengths, there is scope for the community to make a solid in-proc collector that does simple and fancy aggregation.

Finally, if the metrics are available, then in the richer tracing/telemetry scenario you can use existing methods with, say, the Activity type, and just use the metrics to enrich those Activities with extra information.

I don't think the addition of a common way (if designed correctly and available for low-level framework metrics, or at least those adaptable to it) precludes anyone's wishes stated so far.

As for how the "internal framework" things can be reported easily: maybe a list of all the instrumentation that is currently available could be compiled with community/cross-MS-team help. Then, both internally and from the community, we could attack making "adapters" for these events that you can wire up to the metrics concept. I could imagine something like

IMetrics metrics = loggerFactory.Metrics;
metrics.CollectGCEvents();
metrics.CollectSslStreamEvents();

Which could attach a low-alloc EventSource => IMetric adapter.
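
A hedged sketch of such an adapter (the source and payload names are invented for illustration; only the EventListener API itself is real):

using System.Diagnostics.Tracing;
using Microsoft.Extensions.Logging;

internal sealed class EventSourceMetricAdapter : EventListener
{
    private const string SourceName = "MyApp-SslStream"; // hypothetical source
    private const string PayloadName = "elapsedMs";      // hypothetical field

    public IMetric Metric { get; set; } // set before events start flowing

    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        // Called for every EventSource in the process, including ones created
        // before this listener existed.
        if (eventSource.Name == SourceName)
        {
            EnableEvents(eventSource, EventLevel.Informational);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        int index = eventData.PayloadNames.IndexOf(PayloadName);
        if (index >= 0 && eventData.Payload[index] is double value)
        {
            Metric?.RecordValue(value);
        }
    }
}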

@analogrelay
Contributor Author

This has been a really active thread and we've gotten a lot of feedback. I think the main thing we've discovered is that we just aren't ready to dive deep into a rich abstraction for metrics in 2.1. We still want to provide metrics within the ASP.NET platform (replacing the old ASP.NET 4.x performance counters), so we're going to close this item for now and work on providing our own platform metrics via the EventCounter type in .NET Standard. Metrics emitted by that type are available in-process via EventListener, and we can build components on top of that to forward metrics to data stores.
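
For reference, the producer side of the EventCounter pattern mentioned here looks roughly like this (the source and counter names are illustrative; EventCounter and WriteMetric are the real .NET Standard APIs):

using System.Diagnostics.Tracing;

[EventSource(Name = "MyApp-Metrics")]
public sealed class MyAppEventSource : EventSource
{
    public static readonly MyAppEventSource Log = new MyAppEventSource();

    private readonly EventCounter _requestDuration;

    private MyAppEventSource()
    {
        _requestDuration = new EventCounter("request-duration", this);
    }

    // Consumers observe aggregated statistics in-process via EventListener.
    public void RequestCompleted(float milliseconds)
    {
        _requestDuration.WriteMetric(milliseconds);
    }
}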

When it comes time to revisit a broader metrics abstraction, we'll revisit all this feedback.

@cwe1ss
Contributor

cwe1ss commented Oct 11, 2017

Awesome! Thank you for listening to our feedback!!

@markvincze

Hey @anurse,

What's the current state of getting metrics about .NET Core internals with Event Counters?
I'd particularly be interested in things like object allocation and garbage collection, and my end goal would be to publish those as a Prometheus metrics stream.
Is there anything like this already in a working state (on Linux)? (I tried to Google, but didn't find much info.)

Thanks!

@davidfowl
Member

What's the current state of getting metrics about .NET Core internals with Event Counters?

There are no event counters in .NET Core as yet. This is something we plan to tackle more broadly in 3.0.

I'd particularly be interested in things like object allocation and garbage collection, and my end goal would be to publish those as a Prometheus metrics stream.

We're doing work in 2.2 (dotnet/coreclr#18649) to expose the runtime events via an EventListener so that you can write code in the application to forward to something like Prometheus or influxdb (or anything else).

Is there anything like this already in a working state (on Linux)? (I tried to google, but didn't find much info.)

Today there are a couple of things you might be able to use, but the fidelity of the information may not be the greatest ATM. @brianrob can give more details here.

@brianrob

With dotnet/coreclr#18649, we now have some very basic support for consuming runtime events via EventListener. You can find an example of how to use it here: https://github.com/dotnet/coreclr/blob/master/tests/src/tracing/runtimeeventsource/RuntimeEventSourceTest.cs. Note that at this time, the fidelity is definitely not perfect and there is still more work to be done before I would consider the feature ready for use.

That said, feel free to give it a try and provide feedback. It should work on all supported platforms.

@markvincze

Hey @davidfowl @brianrob,
Thanks for the info! I'll look out for the 2.2 release then, and I might give the current implementation a try.
