Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate moving sourcemapping to an enrich processor #3606

Closed
axw opened this issue Apr 2, 2020 · 12 comments
Closed

Investigate moving sourcemapping to an enrich processor #3606

axw opened this issue Apr 2, 2020 · 12 comments
Assignees
Labels

Comments

@axw
Copy link
Member

axw commented Apr 2, 2020

I would like us to investigate moving sourcemapping logic out of apm-server, and into an ingest node pipeline. This would enable us to fix #2724, and would likely also speed things up by doing everything in Elasticsearch, where the sourcemaps are stored.

In order to do this, I think we could use the Enrich processor to enrich ingested documents with the sourcemap, followed by a script processor which adjusts/enriches stacktrace fields, and finally removes the sourcemap field.

To use the Enrich processor, we would need to store a field which concatenates the properties we use to match them into one field: service name, service version, and file/URL path.

@axw
Copy link
Member Author

axw commented Dec 3, 2020

There are some ramifications to this change that I had not previously considered, as there are several things that depend on the stacktrace:

@axw
Copy link
Member Author

axw commented Dec 3, 2020

Regarding apm-server.rum.exclude_from_grouping and apm-server.rum.library_frames: Painless regex support is enabled by default from 7.10 onwards in a "limited" mode: elastic/elasticsearch#63029. I'm not yet sure if this will be good enough for our needs, but I think so. I tend to think that as long as we allow the pipeline to be user-modifiable (just in the ingest pipeline editor?), then it should be reasonable to implement in ingest node.

Error culprit identification should be straightforward.

@axw axw self-assigned this Dec 8, 2020
@axw
Copy link
Member Author

axw commented Dec 8, 2020

Regarding apm-server.rum.exclude_from_grouping and apm-server.rum.library_frames: Painless regex support is enabled by default from 7.10 onwards in a "limited" mode: elastic/elasticsearch#63029. I'm not yet sure if this will be good enough for our needs, but I think so. I tend to think that as long as we allow the pipeline to be user-modifiable (just in the ingest pipeline editor?), then it should be reasonable to implement in ingest node.

In Painless, regex support is limited to constant patterns. At the top of https://www.elastic.co/guide/en/elasticsearch/painless/7.10/painless-regexes.html:

Regular expression constants are directly supported. To ensure fast performance, this is the only mechanism for creating patterns. Regular expressions are always constants and compiled efficiently a single time.

So, the only way we could match filenames against a configurable regular expression would be by recreating the pipeline each time the config changes. Alternatively, we could switch the config over to using wildcard patterns like we use in the agents, such as in sanitize_field_names. Then we could inject the config into events, pick that up in the pipeline, and apply them with wildcard-matching logic written in Painless.

I'm leaning towards the latter at the moment. It'll mean a more complicated pipeline script, but it will provide more consistency in configuration across APM.

@jalvz jalvz mentioned this issue Dec 8, 2020
15 tasks
@axw
Copy link
Member Author

axw commented Dec 9, 2020

Apart from error grouping key calculation (elastic/apm-data#146), I've got everything (I think?) working in master...axw:sourcemap-enrich. It's a little bit messy, but should demonstrate how things would work.

There's a substantial amount of Painless. This includes: sourcemapping, identifying library frames (using wildcard matching, see #3606 (comment)), and identifying the error culprit.

@axw
Copy link
Member Author

axw commented Mar 10, 2021

I've created a new branch rebased on master, moving all the ingest node stuff into the pipeline we install: master...axw:sourcemap-enrich-take2

In that branch the pipeline is defined in ingest/pipeline/definition.yml, and definition.json is generated from that. The reason for using YAML is to make it enable comments and multiline strings, both of which are important for these painless scripts.

The pipeline uses the fingerprint processor to compute error.grouping_key. The good news is that it works, but the bad news is that it can't work in the same way as the existing grouping key calculation without changes to the fingerprint processor:

  • there's always a delimiter byte (0) written to the digest, which of course changes the hash
  • the hash is base64-encoded, whereas we have been hex-encoding

The other issue that we'll have is that since the fingerprint processor was only added in 7.12, we can't just introduce it to the pipeline as that would break compatibility with older versions of Elasticsearch. So if we are doing this, I think we can only add it in the integration package.

Seeing as the hashes will change, we may as well:

@axw
Copy link
Member Author

axw commented Mar 16, 2021

I'm running some performance tests. Sending 1000x RUM errors and comparing methods (server vs. ingest).

  • In the server approach I have made some minor improvements to stop recording an error when a matching sourcemap can't be found (RUM: sourcemap.error on and sourcemap.updated are set overzealously #4958), meaning the performance will be slightly better than what is currently released.
  • In the ingest approach I have disabled source mapping in the server, and rely on ingest entirely.
  • In either case I've effectively disabled rate limiting, and increased the bulk max size to 2000.
  • Whichever way I run it, I can reliably index 2000+ errors/s; there's no externally discernible difference at this load. Increasing the load by 10x also led to similar relative ingestion rates.

I ran the test three times each and checked node stats each time, looking at the time spent in the "apm" ingest pipeline. Over each three runs, the pipeline averages ~0.1ms per event for in-server source mapping, and ~1ms per event for ingest source mapping. I also instrumented the time spent in the server in applying sourcemaps, and it works out to ~0.1ms per error each with 6 stack frames = 60000 frames.

So the ingest approach is considerably slower with worst case 10x slowdown for ingestion.

Now we need to answer:

  1. Is it worth continuing with the ingest approach?
  2. What are our alternatives?

For (1): I'm tending towards a no. The performance loss may not be apparent with a single APM Server given its current ingestion rate (in)capability, but with a cluster APM Servers handling heavy RUM traffic we could end up bottlenecked on ingest node. On top of all that, moving the process to ingest node carries some risk, and requires breaking changes.

For (2): knowing what we know now about Fleet hooks, we could do something like as follows

  1. Introduce config into APM Server for specifying source maps, along the lines of:
apm-server:
  rum:
    source_maps:
      - service.name: opbeans-rum
        service.version: 1.2.3
        bundle.filepath: /test/e2e/general-usecase/bundle.js.map
        sourcemap.url: http://somewhere.com/bundle.js.map

2. Add a sourcemap upload endpoint to Kibana which store source maps in an index similar to today, but with no Enrichment index. On upload, and on Fleet policy creation/update, Kibana will inject apm-server.rum.source_maps entries with the URL pointing back at Kibana.

3. Add a sourcemap download endpoint to Kibana which fetches source maps from the index.

I'm essentially suggesting something like Endpoint's artifacts: https://github.com/elastic/kibana/tree/master/x-pack/plugins/security_solution/server/endpoint/routes/artifacts.

Assuming we want to make the sourcemap download API require authorisation, we could do this by poking the Fleet API access key into APM Server, similar to Endpoint: https://github.com/elastic/beats/blob/b03a935282bec78b3160e74e66dfe29e2df8ee69/x-pack/elastic-agent/spec/endpoint.yml#L51-L53

  1. Add a sourcemap upload endpoint to Kibana which stores source maps using the Fleet Artifacts API. Source maps will be stored in .fleet-artifacts. On upload and on Fleet policy creation/update, Kibana will inject a reference to the artifact into APM policies.

@simitt
Copy link
Contributor

simitt commented Mar 16, 2021

This performance difference comes a bit unexpected; great that you measured it. The alternative with injecting a reference to the artifact sounds like a good approach.

@axw
Copy link
Member Author

axw commented Mar 16, 2021

One thing to note: if we don't go with ingest node, then we don't get to fix #2724. We shouldn't kill performance in pursuit of that goal though.

@axw
Copy link
Member Author

axw commented Mar 18, 2021

https://github.com/axw/kibana/tree/apm-sourcemap-routes is a hackish POC which adds source map upload, list, and delete routes to Kibana, storing sourcemaps as fleet artifacts. On upload/delete, references to the artifacts are injected into APM policies.

@axw
Copy link
Member Author

axw commented Mar 22, 2021

@vigneshshanmugam is going to put together some thoughts on how we can improve our source maps experience overall, so I'll wait until that's available before we make a final call and open implementation issues. At this stage it looks likely that we will move ahead with the artifacts approach in favour of ingest node.

@axw
Copy link
Member Author

axw commented Mar 25, 2021

I'm going to open implementation issues now covering the creation of a new Kibana endpoint with minimal differences compared to the existing APM Server source map upload endpoint, to simplify migration. We can always introduce another, simpler, endpoint later on.

@axw
Copy link
Member Author

axw commented Mar 25, 2021

Closing this in favour of #5002 and elastic/kibana#95393

@vigneshshanmugam when you have time to write up your proposal, please share we me and we can create new issues

@axw axw closed this as completed Mar 25, 2021
@axw axw removed the [zube]: Done label Mar 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants