Multiple service names per process and process-global metrics #65

felixbarny · 2019-03-13T10:57:00Z

When implementing support for multiple service_names (elastic/apm-server#1175) per JVM for the Java agent (elastic/apm-agent-java#514), I have noticed a problem in regards to metrics.

Background: a Server in Java can host multiple applications in a single JVM/process. This might be very Java specific. However, legacy .NET classic applications offer a similar concept with AppDomains. Currently, we don't support .NET Classic (only .NET Core).

The problem is that metrics, like process CPU utilization, are not specific to a single service within a JVM but are global for the JVM. How can we ensure that the metrics can be associated and correlated with all of the services it belongs to? To make matters worse, we will, in the future, likely have metrics which do belong to a specific service. For example response time metrics.

To be more specific about what functionality would be missing: say someone has deployed service foo and bar together in a JBoss application server on host A and host B. Now, they want to get the average process CPU utilisation of the foo service. Naturally, they apply a query bar filter like service.name: foo. The query yields no results because the metrics neither have the service.name set to foo, not to bar. Instead, it would be the detected default service name, in this case jboss-application. They however could find out the PID of all of the foo services, apply a filter for all hosts where foo is running on and filter on all the pids. Or, they could filter by the name jboss-application. However, that would also yield results for totally unrelated JVMs which also happen to be JBoss applications.

A couple of approaches come to mind:

Use a default service_name for the metrics, for example jboss-application (status quo)
- Pros
  - easy to implement
- Cons
  - we can't easily correlate the process global metrics with service-specific metrics, as the service.names don't match.
Use the PID or add an agent_id (status quo)
- Pros
  - process global metrics can be associated with service-specific metrics traces and spans.
- Cons
  - While we can correlate, it is very unintuitive. In the upcoming metrics UI, users can filter by the service name, for example, and expect to see all metrics for that specific service. But actually, they would have to apply an host and PID or agent ID filter. That also makes it impossible to aggregate metrics across multiple instances of a service.
Send the global metrics multiple times - once for each unique service.name within a process
- Pros
  - We can associate and correlate the global process metrics with service-specific metrics and traces
  - Filtering by the service.name works and behaves intuitively
- Cons
  - We send up more data than needed/duplicate data
  - Changes in Intake API required: we need the possibility to override the service_name per metricset
Enable overriding the service_name on a per-metricset-basis and allow the service.name to be an array.
Process global metrics would then have all the service.names which are active for the current process. If a metricset is service specific, it overrides the the service.name and sets a single value.
- Pros
  - No duplication of data
  - We can associate and correlate the global process metrics with service-specific metrics and traces
  - Filtering by the service.name works and behaves intuitively
- Cons
  - Changes in Intake API required: we need the possibility to override the service name per metricset. The service name can have multiple values.
  - Increased complexity

I would personally vote for the last option as it seems to resemble reality most closely. The question is if we should block elastic/apm-agent-java#514 until we have support for that in the intake API or if we should go ahead with a temporary solution which does not allow for an intuitively working service.name filter.

@elastic/apm-agent-devs Does the runtime on your respective agents support multiple applications within a single process? Do you think having the ability to override the service.name with possibly a list of service names is feasible?

@alvarolobato @roncohen @eyalkoren do you think we have to consider this issue a blocker for elastic/apm-agent-java#514?

@simitt Do you think it would be feasible to make it possible to override the service.name on a per-metricset basis? What about changing the type of service.name from string to string[]?

The text was updated successfully, but these errors were encountered:

simitt · 2019-03-13T17:04:10Z

pulling in @elastic/apm-server

axw · 2019-03-14T02:35:10Z

I would personally vote for the last option as it seems to resemble reality most closely.

I also think the last option is best. One other option would be to introduce some kind of "application server" metadata, but I think that would probably just make things more complicated.

Does the runtime on your respective agents support multiple applications within a single process? Do you think having the ability to override the service.name with possibly a list of service names is feasible?

Theoretically you could, but it would be unusual to do that in Go.

I'm considering adding the ability to override the service name in the Go agent, in order to implement service mesh adapters. For that, there isn't likely to be any common metrics, as the agent would not even be running on the same machine as the services.

jalvz · 2019-03-14T09:57:59Z

Personally I think options 3 and 4 are best, but prefer 3.

Changing the type of service.name is breaking if I understand correctly, and is going to be unintuitive for all the cases in which is not needed, which are majority:
eg, code-wise we need to treat the case where a transaction has 2 services names because the model allows it even if it doesn't make sense, what are we supposed to do there? etc

The main drawback of option 3 (sending dup data) is valid, but given that it will affect a minority of agents and a probably minority of deployments (several services on the same JVM), I think is acceptable for the benefits.

felixbarny · 2019-03-14T11:38:00Z

Changing the type of service.name is breaking if I understand correctly

Not necessarily. The JSON schema could allow strings and string arrays.

watson · 2019-03-14T13:01:46Z

Technically you could have the same use-case in Node.js, but I've never seen this in the wild, and would kind of consider it an anti-pattern.

Regarding the best way to solve this, for simplicity's sake, I'd prefer if our solution would end up duplicating data in Elasticsearch. I don't care if the agent sends it up twice or if the APM Server duplicates the data before storing it, but if the end result is two identical metric-sets - one for each service - then we keep the querying as simple as possible (as it doesn't change compared to how it is today).

One downside of the duplicated data is that if a user ever creates a dashboard showing the total system CPU aggregated across all services on that system, they will get the wrong result. But I'm not sure if that's a use-case we want to support.

felixbarny · 2019-03-19T16:51:24Z

but given that it will affect a minority of agents and a probably minority of deployments (several services on the same JVM), I think is acceptable for the benefits.

I'm not sure I agree. It's a perfectly valid use-case, especially in enterprise Java, to even have dozens of applications deployed in the same application server.

@eyalkoren had an idea for a temporary workaround which would not require server changes: the agent would send the same metrics in multiple requests to the APM Server, each with a different metadata event (containing a different service.name). That would, however, increase the number of concurrent TCP connections from the agent to the server from 1 to 2. @elastic/apm-server is that feasible from your perspective?

I still think allowing for multiple service names would be the cleanest solution ultimately.

code-wise we need to treat the case where a transaction has 2 services names because the model allows it even if it doesn't make sense, what are we supposed to do there?

Couldn't you just forward the service names to ES as they are sent by the agent?

graphaelli · 2019-03-19T17:36:59Z

That would, however, increase the number of concurrent TCP connections from the agent to the server from 1 to 2.

Would it actually be exactly 2 or just some N > 1? Either way, connection count isn't typically a problem for apm-server given ratios of app-server:apm-server in typical deployments.

I still think allowing for multiple service names would be the cleanest solution ultimately.

For intake and persistence that is certainly feasible and backwards compatible, and would be best for storage usage, however I am not sure if the UI would need some work to handle multiple service names (making this backwards incompatible) - perhaps @elastic/apm-ui could confirm. After running this query to add another service name to existing metrics for the opbeans-go the service showed up in the service list and the metrics there seemed right:

POST apm-7.0.0-*-metric-*/_update_by_query
{
  "script": {
    "source": "ctx._source.service.name = ['opbeans-go', 'another-go-app']",
    "lang": "painless"
  },
  "query": {
    "term": {
      "service.name": "opbeans-go"
    }
  }
}

Of course, a lot more validation would be needed.

roncohen · 2019-03-20T12:55:30Z

@felixbarny if we have people actively requesting the possibility to use multiple service names, i think it would be fine to release the PR you linked. Having that in will unblock all those folks. They might not be able to use metrics fully yet, but at least all the regular APM stuff will become available to them. We can fix metrics (whatever that ends up meaning) in a following release

sorenlouv · 2019-03-20T13:19:11Z

I am not sure if the UI would need some work to handle multiple service names (making this backwards incompatible) - perhaps @elastic/apm-ui could confirm

I expected this to cause issues with how we query, but after doing some preliminary tests, it appears to work seamlessly.

Indexing a few docs:

POST sqren-arrays/_doc
{"service_name" : ["serviceNameA", "serviceNameB"]}

POST sqren-arrays/_doc
{"service_name" : ["serviceNameA", "serviceNameC"]}

POST sqren-arrays/_doc
{"service_name" : ["serviceNameD", "serviceNameC"]}

POST sqren-arrays/_doc
{"service_name" : ["serviceNameA"]}

POST sqren-arrays/_doc
{"service_name" : "serviceNameA"}

Querying by service name returns the two first docs

GET sqren-arrays/_search 
{
  "query": {
    "term": {
      "service_name.keyword": "serviceNameA"
    }
  }
}

So afaict this doesn't require any changes to the UI side and is therefore backwards compatible from ui POV.

sorenlouv · 2019-03-20T13:25:50Z

Oh, one thing that could be problematic: the returned docs differ depending on if they were indexed as strings[] or strings:

{
  "_source": {
    "service_name": ["serviceNameA"]
  }
}

vs.

{
  "_source": {
    "service_name": "serviceNameA"
  }
}

I'll have to double check how we handle that. That could require changes to the UI side, and thereby be a breaking.

Is this only relevant for metric docs, or will it also affect transaction docs? If so it will be problematic.

felixbarny · 2019-03-20T14:27:31Z

Is this only relevant for metric docs, or will it also affect transaction docs? If so it will be problematic.

In practice, it will only be relevant for metrics but in theory, there could be a transaction with multiple service names (although that would not make sense). Maybe worth checking though how much it would screw up the UI. If it would display ["serviceNameA"] (stringified JSON) instead of serviceNameA, I would count that as a success.

graphaelli · 2019-03-20T14:55:51Z

Limiting this capability to just metrics is not a problem on the apm-server intake - metric currently doesn't even allow overriding metadata.

Have you given thought to what the intake would look like if we were to proceed with multiple service support for just metrics? Is it just name that needs to be a list? Would it be difficult to avoid using context like in other events / is that confusing for library authors?

if a current metricset payload looks like:

{
  "metricset": {
    "samples": {
      "heap.sys.bytes": {
        "value": 6.520832e+06
      }
    },
    "timestamp": 1496170422281000
  }
}

does this look reasonable for multiple services:

{
  "metricset": {
    "samples": {
      "heap.sys.bytes": {
        "value": 6.520832e+06
      }
    },
    "service": {
      "name": [
        "service1",
        "service2"
      ]
    },
    "timestamp": 1496170422281000
  }
}

felixbarny · 2019-03-20T15:34:33Z

That would work for me. Note that this is a bit inconsistent with how it's done for transactions and spans (see also elastic/apm-server#1175 (comment)).

Also, the advantage of adding support for multiple service names in service would be that I could send all the service names of the process in the metadata event and only override on metrics which are service-specific (we don't have any of those yet).

Example Intake API v2 body:

{"metadata":{ "service": {"name": ["service1", "service2"], ...}, ...}}\n
// process-global metrics inherit the service names from metadata
{
  "metricset": {
    "samples": {
      "heap.sys.bytes": {
        "value": 6.520832e+06
      }
    },
    "timestamp": 1496170422281000
  }
}\n
// service-specific metrics override the service name
{
  "metricset": {
    "samples": {
      "transaction_duration": {
        "avg": 6.520832e+06
      }
    },
    "labels": {
      "transaction_name": "SearchController#doSearch"
    },
    "service": { "name": "service1" },
    "timestamp": 1496170422281000
  }
}

Obviously, transactions and spans would also have to explicitly override the service.name if the process yields multiple services but they would have to do that anyway.

Having to manually add all service names to all process-global metrics would be a bit cumbersome and would increase the request size a bit but it's something I could live with if it can't be avoided.

sorenlouv · 2019-03-20T16:32:20Z

Maybe worth checking though how much it would screw up the UI. If it would display ["serviceNameA"] (stringified JSON) instead of serviceNameA, I would count that as a success.

For transactions this would cause problems like incorrect urls: /serviceA,serviceB/transactions/request.

If we can limit this to metric docs, it should not affect the UI in any way.

felixbarny · 2019-03-22T16:08:22Z

Where is this link generated? In discover?
I can make sure in the agent that there can never be a transaction with multiple service names.

No matter which approach we are taking, we should avoid that agents have to check the version of the APM server in order to send multiple service names. The beauty of @graphaelli's suggestion is that it does not conflict with anything because it's currently not possible to override any metadata for metrics. Also, we can ensure at the API level that transactions can't have multiple service names.

If we go with allowing a list of services on the metadata event, I propose leaving service.name as is and introduce service.names. Agents can then send meta data like this:
{"metadata":{ "service": {"name": "service1", "names": ["service1", "service2"], ...}, ...}}. Older APM Server versions would just ignore names and newer versions would ignore name if names is present.

ruflin · 2019-03-25T12:05:29Z

Note from an ECS perspective: We had quite a few discussion on naming files plural and singular and sticked to keep (almost) everything singular as any field could be a single value or an array. As mentioned before, from an Elasticsearch perspective there is not really a difference. My preference is to have it all in service.name. @felixbarny If I understand you correct, this would also be the case in your last comment, but the apm-server would move the field from names to name to have the same output to Elasticsearch?

felixbarny · 2019-03-25T12:20:29Z

@felixbarny If I understand you correct, this would also be the case in your last comment, but the apm-server would move the field from names to name to have the same output to Elasticsearch?

Correct. names would just be in the Intake API

sorenlouv · 2019-03-25T12:49:05Z

Where is this link generated? In discover?

Various places within APM ui:

It's also used when linking to Discover:

https://github.com/elastic/kibana/blob/657552d88853f750666a28f1945f21ce0acd33d4/x-pack/plugins/apm/public/components/shared/Links/DiscoverLinks/DiscoverErrorLink.tsx#L18

simitt · 2019-04-09T14:43:42Z

I'd like to avoid implicit rules on the json spec level as much as possible. Introducing metadata.service.names alongside metadata.service.name and allowing to override it on metric event level, as well as enforcing to override on transaction/error event level if multiple metadata.service.names are set, seems pretty error prone to me.

If multiple service.names can only be set for metrics then let's define this for metrics only and not on the metadata level, as it is event type specific. So basically +1 on @graphaelli s suggestion #65 (comment).

tobiasstadler · 2021-09-13T15:07:37Z

Is anyone working on this?

tobiasstadler · 2021-10-28T07:05:46Z

Should it also be possible to ingest multiple service versions along multiple service names?

felixbarny · 2021-10-28T10:23:39Z

Good question. It may have impact on the deployment annotations that are created. That code needs to be aware of how to associate the service.name array with the service.version array. Also, correlating JVM metrics across different versions of a given service may be more difficult when service.name and service.version both become an array.

As of elastic/apm-server#6407, we'd also have the option of just copying the process-global metrics and send it multiple times, overriding the service.name and service.version. It would be less space efficient but it would not have any dependency on the UI or the server. That's basically the third approach listed in the original description. It's an option @eyalkoren has recently advocated for.

Still unsure about the tradeoffs but multiple service names per service are rather the exception than the norm. It's not a thing in most other languages and the trend is to go towards more lightweight alternatives to app servers in Java and .NET (IIS also supports serving multiple apps). In that light, it may not be worth adding more complexity to the UI and Server for the marginal storage savings when looking at the overall picture.

tobiasstadler · 2021-10-29T06:37:36Z

Does service.name being a list work with elastic/elasticsearch#74660? I suppose service.name will be a dimension then. Most probably the metric values will be copied (when ingested by elasticsearch) for each service.name value, because each service.name value will be part of different time series? So it might not be that much of a problem for the agents to copying and send the metrics multiple times (once for each service.name in a process) anyway?

felixbarny · 2021-10-29T07:37:42Z

Good question, not sure if metric dimensions will support array values.

tobiasstadler · 2021-10-29T07:46:49Z

If even if arrays will be supported it might not be a good idea to use them. E.g. if on an application server an application is undeployed the metricset reported before the undeploy will be a different time series than the metricset after the undeploy (as I understand it) which might make aggregations based on the service.name more costly. Another example is the start of an application server when multiple service will be deployed one after the other.

eyalkoren · 2021-11-02T11:22:13Z

Documenting our decision on how we want to proceed on this one: we will go with duplicating metricsets for each service.name.

With @tobiasstadler's latest contribution to APM Server, each agent can now go ahead and implement that without any dependency.

felixbarny added the discussion label Mar 13, 2019

felixbarny mentioned this issue Mar 20, 2019

Use display-name or context path as the default service name elastic/apm-agent-java#514

Merged

4 tasks

sorenlouv mentioned this issue Sep 13, 2021

[APM] Show JVMs based on agent elastic/kibana#110042

Closed

felixbarny mentioned this issue Oct 7, 2021

[APM] Use the agent ids to display service metrics elastic/kibana#113605

Closed

9 tasks

AlexanderWert mentioned this issue Oct 19, 2021

Allow to overwrite service.name for metricset objects elastic/apm-server#6388

Closed

felixbarny added the 8.1-candidate label Oct 27, 2021

eyalkoren closed this as completed Nov 2, 2021

eyalkoren mentioned this issue Nov 2, 2021

Better support for metrics in multi-services environments elastic/apm-agent-java#2230

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multiple service names per process and process-global metrics #65

Multiple service names per process and process-global metrics #65

felixbarny commented Mar 13, 2019

simitt commented Mar 13, 2019

axw commented Mar 14, 2019

jalvz commented Mar 14, 2019

felixbarny commented Mar 14, 2019

watson commented Mar 14, 2019

felixbarny commented Mar 19, 2019

graphaelli commented Mar 19, 2019 •

edited

Loading

roncohen commented Mar 20, 2019

sorenlouv commented Mar 20, 2019 •

edited

Loading

sorenlouv commented Mar 20, 2019 •

edited

Loading

felixbarny commented Mar 20, 2019

graphaelli commented Mar 20, 2019

felixbarny commented Mar 20, 2019

sorenlouv commented Mar 20, 2019 •

edited

Loading

felixbarny commented Mar 22, 2019

ruflin commented Mar 25, 2019

felixbarny commented Mar 25, 2019

sorenlouv commented Mar 25, 2019 •

edited

Loading

simitt commented Apr 9, 2019

tobiasstadler commented Sep 13, 2021

tobiasstadler commented Oct 28, 2021

felixbarny commented Oct 28, 2021

tobiasstadler commented Oct 29, 2021

felixbarny commented Oct 29, 2021

tobiasstadler commented Oct 29, 2021

eyalkoren commented Nov 2, 2021

Multiple service names per process and process-global metrics #65

Multiple service names per process and process-global metrics #65

Comments

felixbarny commented Mar 13, 2019

simitt commented Mar 13, 2019

axw commented Mar 14, 2019

jalvz commented Mar 14, 2019

felixbarny commented Mar 14, 2019

watson commented Mar 14, 2019

felixbarny commented Mar 19, 2019

graphaelli commented Mar 19, 2019 • edited Loading

roncohen commented Mar 20, 2019

sorenlouv commented Mar 20, 2019 • edited Loading

sorenlouv commented Mar 20, 2019 • edited Loading

felixbarny commented Mar 20, 2019

graphaelli commented Mar 20, 2019

felixbarny commented Mar 20, 2019

sorenlouv commented Mar 20, 2019 • edited Loading

felixbarny commented Mar 22, 2019

ruflin commented Mar 25, 2019

felixbarny commented Mar 25, 2019

sorenlouv commented Mar 25, 2019 • edited Loading

simitt commented Apr 9, 2019

tobiasstadler commented Sep 13, 2021

tobiasstadler commented Oct 28, 2021

felixbarny commented Oct 28, 2021

tobiasstadler commented Oct 29, 2021

felixbarny commented Oct 29, 2021

tobiasstadler commented Oct 29, 2021

eyalkoren commented Nov 2, 2021

graphaelli commented Mar 19, 2019 •

edited

Loading

sorenlouv commented Mar 20, 2019 •

edited

Loading

sorenlouv commented Mar 20, 2019 •

edited

Loading

sorenlouv commented Mar 20, 2019 •

edited

Loading

sorenlouv commented Mar 25, 2019 •

edited

Loading