Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple service names per process and process-global metrics #65

Closed
felixbarny opened this issue Mar 13, 2019 · 26 comments
Closed

Multiple service names per process and process-global metrics #65

felixbarny opened this issue Mar 13, 2019 · 26 comments

Comments

@felixbarny
Copy link
Member

When implementing support for multiple service_names (elastic/apm-server#1175) per JVM for the Java agent (elastic/apm-agent-java#514), I have noticed a problem in regards to metrics.

Background: a Server in Java can host multiple applications in a single JVM/process. This might be very Java specific. However, legacy .NET classic applications offer a similar concept with AppDomains. Currently, we don't support .NET Classic (only .NET Core).

The problem is that metrics, like process CPU utilization, are not specific to a single service within a JVM but are global for the JVM. How can we ensure that the metrics can be associated and correlated with all of the services it belongs to? To make matters worse, we will, in the future, likely have metrics which do belong to a specific service. For example response time metrics.

To be more specific about what functionality would be missing: say someone has deployed service foo and bar together in a JBoss application server on host A and host B. Now, they want to get the average process CPU utilisation of the foo service. Naturally, they apply a query bar filter like service.name: foo. The query yields no results because the metrics neither have the service.name set to foo, not to bar. Instead, it would be the detected default service name, in this case jboss-application. They however could find out the PID of all of the foo services, apply a filter for all hosts where foo is running on and filter on all the pids. Or, they could filter by the name jboss-application. However, that would also yield results for totally unrelated JVMs which also happen to be JBoss applications.

A couple of approaches come to mind:

  • Use a default service_name for the metrics, for example jboss-application (status quo)
    • Pros
      • easy to implement
    • Cons
      • we can't easily correlate the process global metrics with service-specific metrics, as the service.names don't match.
  • Use the PID or add an agent_id (status quo)
    • Pros
      • process global metrics can be associated with service-specific metrics traces and spans.
    • Cons
      • While we can correlate, it is very unintuitive. In the upcoming metrics UI, users can filter by the service name, for example, and expect to see all metrics for that specific service. But actually, they would have to apply an host and PID or agent ID filter. That also makes it impossible to aggregate metrics across multiple instances of a service.
  • Send the global metrics multiple times - once for each unique service.name within a process
    • Pros
      • We can associate and correlate the global process metrics with service-specific metrics and traces
      • Filtering by the service.name works and behaves intuitively
    • Cons
      • We send up more data than needed/duplicate data
      • Changes in Intake API required: we need the possibility to override the service_name per metricset
  • Enable overriding the service_name on a per-metricset-basis and allow the service.name to be an array.
    Process global metrics would then have all the service.names which are active for the current process. If a metricset is service specific, it overrides the the service.name and sets a single value.
    • Pros
      • No duplication of data
      • We can associate and correlate the global process metrics with service-specific metrics and traces
      • Filtering by the service.name works and behaves intuitively
    • Cons
      • Changes in Intake API required: we need the possibility to override the service name per metricset. The service name can have multiple values.
      • Increased complexity

I would personally vote for the last option as it seems to resemble reality most closely. The question is if we should block elastic/apm-agent-java#514 until we have support for that in the intake API or if we should go ahead with a temporary solution which does not allow for an intuitively working service.name filter.

@elastic/apm-agent-devs Does the runtime on your respective agents support multiple applications within a single process? Do you think having the ability to override the service.name with possibly a list of service names is feasible?

@alvarolobato @roncohen @eyalkoren do you think we have to consider this issue a blocker for elastic/apm-agent-java#514?

@simitt Do you think it would be feasible to make it possible to override the service.name on a per-metricset basis? What about changing the type of service.name from string to string[]?

@simitt
Copy link
Contributor

simitt commented Mar 13, 2019

pulling in @elastic/apm-server

@axw
Copy link
Member

axw commented Mar 14, 2019

I would personally vote for the last option as it seems to resemble reality most closely.

I also think the last option is best. One other option would be to introduce some kind of "application server" metadata, but I think that would probably just make things more complicated.

Does the runtime on your respective agents support multiple applications within a single process? Do you think having the ability to override the service.name with possibly a list of service names is feasible?

Theoretically you could, but it would be unusual to do that in Go.

I'm considering adding the ability to override the service name in the Go agent, in order to implement service mesh adapters. For that, there isn't likely to be any common metrics, as the agent would not even be running on the same machine as the services.

@jalvz
Copy link
Contributor

jalvz commented Mar 14, 2019

Personally I think options 3 and 4 are best, but prefer 3.

Changing the type of service.name is breaking if I understand correctly, and is going to be unintuitive for all the cases in which is not needed, which are majority:
eg, code-wise we need to treat the case where a transaction has 2 services names because the model allows it even if it doesn't make sense, what are we supposed to do there? etc

The main drawback of option 3 (sending dup data) is valid, but given that it will affect a minority of agents and a probably minority of deployments (several services on the same JVM), I think is acceptable for the benefits.

@felixbarny
Copy link
Member Author

Changing the type of service.name is breaking if I understand correctly

Not necessarily. The JSON schema could allow strings and string arrays.

@watson
Copy link
Contributor

watson commented Mar 14, 2019

Technically you could have the same use-case in Node.js, but I've never seen this in the wild, and would kind of consider it an anti-pattern.

Regarding the best way to solve this, for simplicity's sake, I'd prefer if our solution would end up duplicating data in Elasticsearch. I don't care if the agent sends it up twice or if the APM Server duplicates the data before storing it, but if the end result is two identical metric-sets - one for each service - then we keep the querying as simple as possible (as it doesn't change compared to how it is today).

One downside of the duplicated data is that if a user ever creates a dashboard showing the total system CPU aggregated across all services on that system, they will get the wrong result. But I'm not sure if that's a use-case we want to support.

@felixbarny
Copy link
Member Author

but given that it will affect a minority of agents and a probably minority of deployments (several services on the same JVM), I think is acceptable for the benefits.

I'm not sure I agree. It's a perfectly valid use-case, especially in enterprise Java, to even have dozens of applications deployed in the same application server.

@eyalkoren had an idea for a temporary workaround which would not require server changes: the agent would send the same metrics in multiple requests to the APM Server, each with a different metadata event (containing a different service.name). That would, however, increase the number of concurrent TCP connections from the agent to the server from 1 to 2. @elastic/apm-server is that feasible from your perspective?

I still think allowing for multiple service names would be the cleanest solution ultimately.

code-wise we need to treat the case where a transaction has 2 services names because the model allows it even if it doesn't make sense, what are we supposed to do there?

Couldn't you just forward the service names to ES as they are sent by the agent?

@graphaelli
Copy link
Member

graphaelli commented Mar 19, 2019

That would, however, increase the number of concurrent TCP connections from the agent to the server from 1 to 2.

Would it actually be exactly 2 or just some N > 1? Either way, connection count isn't typically a problem for apm-server given ratios of app-server:apm-server in typical deployments.

I still think allowing for multiple service names would be the cleanest solution ultimately.

For intake and persistence that is certainly feasible and backwards compatible, and would be best for storage usage, however I am not sure if the UI would need some work to handle multiple service names (making this backwards incompatible) - perhaps @elastic/apm-ui could confirm. After running this query to add another service name to existing metrics for the opbeans-go the service showed up in the service list and the metrics there seemed right:

POST apm-7.0.0-*-metric-*/_update_by_query
{
  "script": {
    "source": "ctx._source.service.name = ['opbeans-go', 'another-go-app']",
    "lang": "painless"
  },
  "query": {
    "term": {
      "service.name": "opbeans-go"
    }
  }
}

Of course, a lot more validation would be needed.

@roncohen
Copy link

@felixbarny if we have people actively requesting the possibility to use multiple service names, i think it would be fine to release the PR you linked. Having that in will unblock all those folks. They might not be able to use metrics fully yet, but at least all the regular APM stuff will become available to them. We can fix metrics (whatever that ends up meaning) in a following release

@sorenlouv
Copy link
Member

sorenlouv commented Mar 20, 2019

I am not sure if the UI would need some work to handle multiple service names (making this backwards incompatible) - perhaps @elastic/apm-ui could confirm

I expected this to cause issues with how we query, but after doing some preliminary tests, it appears to work seamlessly.

Indexing a few docs:

POST sqren-arrays/_doc
{"service_name" : ["serviceNameA", "serviceNameB"]}

POST sqren-arrays/_doc
{"service_name" : ["serviceNameA", "serviceNameC"]}

POST sqren-arrays/_doc
{"service_name" : ["serviceNameD", "serviceNameC"]}

POST sqren-arrays/_doc
{"service_name" : ["serviceNameA"]}

POST sqren-arrays/_doc
{"service_name" : "serviceNameA"}

Querying by service name returns the two first docs

GET sqren-arrays/_search 
{
  "query": {
    "term": {
      "service_name.keyword": "serviceNameA"
    }
  }
}

So afaict this doesn't require any changes to the UI side and is therefore backwards compatible from ui POV.

@sorenlouv
Copy link
Member

sorenlouv commented Mar 20, 2019

Oh, one thing that could be problematic: the returned docs differ depending on if they were indexed as strings[] or strings:

{
  "_source": {
    "service_name": ["serviceNameA"]
  }
}

vs.

{
  "_source": {
    "service_name": "serviceNameA"
  }
}

I'll have to double check how we handle that. That could require changes to the UI side, and thereby be a breaking.

Is this only relevant for metric docs, or will it also affect transaction docs? If so it will be problematic.

@felixbarny
Copy link
Member Author

Is this only relevant for metric docs, or will it also affect transaction docs? If so it will be problematic.

In practice, it will only be relevant for metrics but in theory, there could be a transaction with multiple service names (although that would not make sense). Maybe worth checking though how much it would screw up the UI. If it would display ["serviceNameA"] (stringified JSON) instead of serviceNameA, I would count that as a success.

@graphaelli
Copy link
Member

Limiting this capability to just metrics is not a problem on the apm-server intake - metric currently doesn't even allow overriding metadata.

Have you given thought to what the intake would look like if we were to proceed with multiple service support for just metrics? Is it just name that needs to be a list? Would it be difficult to avoid using context like in other events / is that confusing for library authors?

if a current metricset payload looks like:

{
  "metricset": {
    "samples": {
      "heap.sys.bytes": {
        "value": 6.520832e+06
      }
    },
    "timestamp": 1496170422281000
  }
}

does this look reasonable for multiple services:

{
  "metricset": {
    "samples": {
      "heap.sys.bytes": {
        "value": 6.520832e+06
      }
    },
    "service": {
      "name": [
        "service1",
        "service2"
      ]
    },
    "timestamp": 1496170422281000
  }
}

@felixbarny
Copy link
Member Author

That would work for me. Note that this is a bit inconsistent with how it's done for transactions and spans (see also elastic/apm-server#1175 (comment)).

Also, the advantage of adding support for multiple service names in service would be that I could send all the service names of the process in the metadata event and only override on metrics which are service-specific (we don't have any of those yet).

Example Intake API v2 body:

{"metadata":{ "service": {"name": ["service1", "service2"], ...}, ...}}\n
// process-global metrics inherit the service names from metadata
{
  "metricset": {
    "samples": {
      "heap.sys.bytes": {
        "value": 6.520832e+06
      }
    },
    "timestamp": 1496170422281000
  }
}\n
// service-specific metrics override the service name
{
  "metricset": {
    "samples": {
      "transaction_duration": {
        "avg": 6.520832e+06
      }
    },
    "labels": {
      "transaction_name": "SearchController#doSearch"
    },
    "service": { "name": "service1" },
    "timestamp": 1496170422281000
  }
}

Obviously, transactions and spans would also have to explicitly override the service.name if the process yields multiple services but they would have to do that anyway.

Having to manually add all service names to all process-global metrics would be a bit cumbersome and would increase the request size a bit but it's something I could live with if it can't be avoided.

@sorenlouv
Copy link
Member

sorenlouv commented Mar 20, 2019

Maybe worth checking though how much it would screw up the UI. If it would display ["serviceNameA"] (stringified JSON) instead of serviceNameA, I would count that as a success.

For transactions this would cause problems like incorrect urls: /serviceA,serviceB/transactions/request.

If we can limit this to metric docs, it should not affect the UI in any way.

@felixbarny
Copy link
Member Author

Where is this link generated? In discover?
I can make sure in the agent that there can never be a transaction with multiple service names.

No matter which approach we are taking, we should avoid that agents have to check the version of the APM server in order to send multiple service names. The beauty of @graphaelli's suggestion is that it does not conflict with anything because it's currently not possible to override any metadata for metrics. Also, we can ensure at the API level that transactions can't have multiple service names.

If we go with allowing a list of services on the metadata event, I propose leaving service.name as is and introduce service.names. Agents can then send meta data like this:
{"metadata":{ "service": {"name": "service1", "names": ["service1", "service2"], ...}, ...}}. Older APM Server versions would just ignore names and newer versions would ignore name if names is present.

@ruflin
Copy link

ruflin commented Mar 25, 2019

Note from an ECS perspective: We had quite a few discussion on naming files plural and singular and sticked to keep (almost) everything singular as any field could be a single value or an array. As mentioned before, from an Elasticsearch perspective there is not really a difference. My preference is to have it all in service.name. @felixbarny If I understand you correct, this would also be the case in your last comment, but the apm-server would move the field from names to name to have the same output to Elasticsearch?

@felixbarny
Copy link
Member Author

@felixbarny If I understand you correct, this would also be the case in your last comment, but the apm-server would move the field from names to name to have the same output to Elasticsearch?

Correct. names would just be in the Intake API

@simitt
Copy link
Contributor

simitt commented Apr 9, 2019

I'd like to avoid implicit rules on the json spec level as much as possible. Introducing metadata.service.names alongside metadata.service.name and allowing to override it on metric event level, as well as enforcing to override on transaction/error event level if multiple metadata.service.names are set, seems pretty error prone to me.

If multiple service.names can only be set for metrics then let's define this for metrics only and not on the metadata level, as it is event type specific. So basically +1 on @graphaelli s suggestion #65 (comment).

@tobiasstadler
Copy link

Is anyone working on this?

@tobiasstadler
Copy link

Should it also be possible to ingest multiple service versions along multiple service names?

@felixbarny
Copy link
Member Author

Good question. It may have impact on the deployment annotations that are created. That code needs to be aware of how to associate the service.name array with the service.version array. Also, correlating JVM metrics across different versions of a given service may be more difficult when service.name and service.version both become an array.

As of elastic/apm-server#6407, we'd also have the option of just copying the process-global metrics and send it multiple times, overriding the service.name and service.version. It would be less space efficient but it would not have any dependency on the UI or the server. That's basically the third approach listed in the original description. It's an option @eyalkoren has recently advocated for.

Still unsure about the tradeoffs but multiple service names per service are rather the exception than the norm. It's not a thing in most other languages and the trend is to go towards more lightweight alternatives to app servers in Java and .NET (IIS also supports serving multiple apps). In that light, it may not be worth adding more complexity to the UI and Server for the marginal storage savings when looking at the overall picture.

@tobiasstadler
Copy link

Does service.name being a list work with elastic/elasticsearch#74660? I suppose service.name will be a dimension then. Most probably the metric values will be copied (when ingested by elasticsearch) for each service.name value, because each service.name value will be part of different time series? So it might not be that much of a problem for the agents to copying and send the metrics multiple times (once for each service.name in a process) anyway?

@felixbarny
Copy link
Member Author

Good question, not sure if metric dimensions will support array values.

@tobiasstadler
Copy link

If even if arrays will be supported it might not be a good idea to use them. E.g. if on an application server an application is undeployed the metricset reported before the undeploy will be a different time series than the metricset after the undeploy (as I understand it) which might make aggregations based on the service.name more costly. Another example is the start of an application server when multiple service will be deployed one after the other.

@eyalkoren
Copy link
Contributor

Documenting our decision on how we want to proceed on this one: we will go with duplicating metricsets for each service.name.

With @tobiasstadler's latest contribution to APM Server, each agent can now go ahead and implement that without any dependency.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests