
fix: Fix adblock memory leak #877

Merged
4 commits merged into newrelic:main on Feb 8, 2024

Conversation

Joe-Degler
Contributor

@Joe-Degler Joe-Degler commented Feb 6, 2024

Fixed memory leaks caused by request resolve failure (e.g. due to an adblocker). Thanks to community member @Joe-Degler for contributing this fix.

Overview

On our website ( playbook.com ), some users were noticing severe performance degradation when uploading big files. Memory usage grew, and cancelling the upload did not resolve the issue, eventually causing the upload or the website itself to crash.

After investigation, we were able to pin the issue down to the New Relic Browser Agent. However, the issue only occurred when an adblocker (in my case: uBlock Origin) was enabled.

We noticed that two URLs specifically were being blocked: https://js-agent.newrelic.com/nr-spa-1.245.0.min.js and https://bam.nr-data.net/.

(Note: the issue also occurred with the newest version of the Browser Agent.)

The issue itself is not actually related to uploading - uploading only made the symptoms much more obvious because of the size of the chunked ArrayBuffers. Even when using the website normally, heap memory slowly creeps upward.

The main problem is the buffer / backlog of the ee (event emitter?) holding on to every raw event: when loading additional assets or submitting data fails, the backlog never gets flushed. As a result it grows indefinitely, slowly consuming the user's memory. Uploads make this particularly noticeable because the buffer (in this case, I think it was the spa buffer specifically) holds on to every ArrayBuffer reference.
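For illustration, here's a minimal sketch of the buffering pattern as I understand it (the backlog and bufferGroup names mirror the agent's code, but the surrounding structure is simplified and not the actual implementation):

```js
// Simplified illustration of the leak (not the agent's actual code): every
// emitted event is pushed into a per-group array on a shared backlog object,
// and nothing ever removes it if the harvester or feature code never loads.
const backlog = {}

function buffer (group, event) {
  const bufferGroup = backlog[group] || (backlog[group] = [])
  bufferGroup.push(event) // keeps a reference to the raw event payload
}

// Large payloads (e.g. chunked ArrayBuffers from an upload) stay reachable
// through backlog forever, so the heap grows with every buffered event.
buffer('spa', { body: new ArrayBuffer(8 * 1024 * 1024) })
```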

We were able to work around the issue with some monkeypatching - it would be great to get these issues fixed upstream, instead.

We found 3 specific places that resulted in this problem:


Problem #1: send() method of harvest.js is not using the appropriate event listener for failing requests

Link to Change

The harvester attempts to submit data to https://bam.nr-data.net/.

The load event is not fired if the XHR is aborted client-side. As a result, the callback is never fired, cbCallback is never called (even with a 400+ status), and in turn the abort here is never called.

Replacing this event with the loadend event seems appropriate. It fires even on an error, but returns a "failed" XHR object with status code 0.

I've adjusted the sent boolean to return false if the request has not been sent (e.g. it never left the client). I'm not 100% sure about the implications here, and this change is not essential to the fix itself.

With this change, the callback is now successfully invoked for any successful, failed, or errored request.
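For reference, the change boils down to something like the following sketch (submitData and cbFinished are illustrative names, not the actual identifiers in harvest.js):

```js
// Sketch: swap the 'load' listener for 'loadend' so the callback also runs
// when the request is blocked or errors out client-side (status === 0).
function submitData (url, body, cbFinished) {
  const xhr = new XMLHttpRequest()
  // 'loadend' fires after load, error, and abort alike, so cbFinished is
  // guaranteed to run even when an adblocker kills the request.
  xhr.addEventListener('loadend', function () {
    // status 0 means the request never left the client (blocked/aborted)
    cbFinished({ status: this.status, sent: this.status !== 0 })
  })
  xhr.open('POST', url)
  xhr.send(body)
}
```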


Problem #2: The abort() function of the contextual-ee.js replaces the backlog object

Link to change

When abort() is called, it replaces globalInstance.backlog with a fresh {} object. However, all "sub-features" (or emitters, I guess?) hold a direct reference to the original ee.backlog object. Because they reference this object directly, two things happen:

  • Even when the main emitter is aborted, force was true in some circumstances, causing the emitter to keep trying to emit.
  • Because each sub-emitter uses a direct reference to backlog (stored at initialization), it keeps accessing the "old" object (already discarded by the main emitter) indefinitely. The feature / group still exists on that object, so it passes the if (bufferGroup) check and keeps pushing events to the buffer. Since references to it still exist, the old object obviously can never be garbage collected either.

Instead of replacing the main emitter's backlog with an empty object, iterate over its keys and delete all the existing groups. This clears the buffer, as expected.
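Conceptually, the abort change amounts to something like this (a sketch, assuming backlog is a plain object keyed by feature group):

```js
// Sketch: clear the shared backlog in place instead of doing
// `globalInstance.backlog = {}`, so every sub-emitter that captured a
// reference sees the emptied object and stops re-buffering into a zombie.
function clearBacklog (backlog) {
  Object.keys(backlog).forEach(function (group) {
    delete backlog[group] // drops the buffered events so they can be GC'd
  })
}
```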


Problem #3: Failure to import additional code does not drain buffers and does not unset handlers

Link to change

There are two distinct places I could find that initiate an additional import of code from https://js-agent.newrelic.com/ to load additional features. In both of them, if the import fails, the corresponding buffers are not drained adequately and events keep being emitted into them.
The first case is in api.js, which, if it fails, only raises a warning that Downloading runtime APIs failed. No effort is made to drain and de-register the api buffer/feature.

The second case is in instrument-base.js, which does start a drain in the exception handler for each feature that failed, but the drain immediately short-circuits because the feature is already registered in the agent's registry.

I went with a "crude" fix here, still aligning with what I think is the spirit of the architecture, in the form of a force parameter that forces the method to drain the group immediately.
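The idea behind the force parameter is roughly the following (a simplified sketch; drainGroup and registry are placeholder names, not the agent's actual drain implementation):

```js
// Sketch: drain normally short-circuits when the feature is already in the
// registry; `force` skips that check so the import-failure path can still
// flush the group and release the buffered events.
function drainGroup (agent, group, force = false) {
  if (!force && agent.registry[group]) return // normal path: drained elsewhere
  const events = (agent.ee.backlog && agent.ee.backlog[group]) || []
  events.length = 0                           // release buffered event references
  if (agent.ee.backlog) delete agent.ee.backlog[group]
}
```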


All three of these fixes were necessary to banish the memory leak and make things work properly. They're separated neatly in their own commits.

Related Issue(s)

I found this unresolved issue from a year ago.

Testing

Replicating this is surprisingly simple. Install uBlock Origin, or alternatively block both URLs locally in your environment.
If you're testing locally, uBlock Origin will only block the egress traffic to https://bam.nr-data.net/, but will still allow loading more data through localhost:63342/build/nr-spa.min.js. Depending on what you're testing, you may or may not have to block the import URL.

Create a simple index.html file similar to the GitHub instructions, importing nr-loader-spa.min.js.

Evaluate newrelic.ee.backlog - you'll observe that it's filled with data. Depending on the activities you do on the website, you'll also see that the backlog keeps increasing.
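To make the growth easy to see, you can dump per-group counts in the devtools console (this assumes each backlog group maps to an array of buffered events, as described above):

```js
// Run in the devtools console while using the page; on an unpatched agent
// with the agent's URLs blocked, the per-group counts keep climbing.
Object.entries(newrelic.ee.backlog || {}).forEach(([group, events]) => {
  console.log(group, Array.isArray(events) ? events.length : events)
})
```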

If you've only blocked the data egress sink:

When applying the commit that fixes the harvest callback:
You'll notice that newrelic.ee.aborted == true and that newrelic.ee.backlog == {}, but if you check out newrelic.initializedAgents[<foobar>].features[<barfoo>].ee.backlog, you'll see it's still referencing the old buffer.

Additionally, applying the commit that fixes abort:
You'll notice that the ee backlogs of the initializedAgents' features are now also empty, because the original newrelic.ee.backlog was emptied in place rather than replaced with a new object.

If you've also blocked importing additional code from nr-spa.min.js:
Load the page and observe how newrelic.ee.backlog fills with data that may keep growing and isn't flushed.

Apply the commit that fixes the drain on import error.

Observe how newrelic.ee.backlog is now empty, and does not grow anymore.

@CLAassistant

CLAassistant commented Feb 6, 2024

CLA assistant check
All committers have signed the CLA.

@patrickhousley
Contributor

Thank you so much @Joe-Degler! I am taking a look at this now.

@patrickhousley patrickhousley changed the title Fix adblock memory leak fix: Fix adblock memory leak Feb 7, 2024

codecov bot commented Feb 7, 2024

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (0a3a980) 78.67% compared to head (de5ed00) 78.65%.
Report is 3 commits behind head on main.

Files                                             Patch %   Lines
src/loaders/api/api.js                            0.00%     3 Missing ⚠️
src/features/page_view_event/aggregate/index.js   0.00%     0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #877      +/-   ##
==========================================
- Coverage   78.67%   78.65%   -0.03%     
==========================================
  Files         145      145              
  Lines        6519     6522       +3     
  Branches     1267     1267              
==========================================
+ Hits         5129     5130       +1     
- Misses       1181     1183       +2     
  Partials      209      209              
Flag         Coverage Δ
unit-tests   55.48% <63.63%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.


@patrickhousley
Contributor

All unit and jil tests for all browsers are passing locally. One test failed in wdio but I think it's just flakiness.

[Safari iOS #16-28] Running: Safari on iOS
[Safari iOS #16-28] Session ID: 303df308a0a74ba59381d108bd8d0a83
[Safari iOS #16-28]
[Safari iOS #16-28] » /tests/specs/session-replay/ingest.e2e.js
[Safari iOS #16-28] Session Replay Ingest Behavior
[Safari iOS #16-28]    ✖ Should empty event buffer when sending (3 retries)
[Safari iOS #16-28]    ✓ Should stop recording if 429 response (1 retries)
[Safari iOS #16-28]
[Safari iOS #16-28] 1 passing (3m 22.9s)
[Safari iOS #16-28] 1 failing
[Safari iOS #16-28]
[Safari iOS #16-28] 1) Session Replay Ingest Behavior Should empty event buffer when sending
[Safari iOS #16-28] expect(received).toBeGreaterThan(expected)

Expected: > 0
Received:   0
[Safari iOS #16-28] Error: expect(received).toBeGreaterThan(expected)
[Safari iOS #16-28]
[Safari iOS #16-28] Expected: > 0
[Safari iOS #16-28] Received:   0
[Safari iOS #16-28]     at Context.<anonymous> (/Users/phousley/code/newrelic-browser-agent/tests/specs/session-replay/ingest.e2e.js:18:43)
[Safari iOS #16-28]     at processTicksAndRejections (node:internal/process/task_queues:95:5)
[Safari iOS #16-28]

Spec Files:	 1075 passed, 3 retries, 1 failed, 24 skipped, 1100 total (100% completed) in 01:01:47

I will run through the manual tests and work on building out e2e tests for the three specific issues tomorrow.

@Joe-Degler
Contributor Author

No problem! Do you need me to do anything else?

@cwli24 cwli24 merged commit 695415b into newrelic:main Feb 8, 2024
11 of 14 checks passed