Race condition with Image Loader makes it unusable in a production environment #23436
I assume this is related to the same thing I'm experiencing with image optimization on anything higher than Next.js v10.0.7. I've tried both v10.0.8 and v10.0.9, but any images that are not in the cache seem to cause a memory leak during conversion: memory usage rapidly climbs until my server runs out of memory and crashes. Conversion also seems much slower overall than before. Obviously this isn't suitable for production. I've downgraded back to v10.0.7 for now, and there I can see memory usage go up by about 10-15MB during a request that triggers several conversions, after which the memory gets released like it should. The regression seems to be somewhere after v10.0.7.
Hey @goguda, I think the issue you mention could exaggerate the one I've raised, but the race condition is a problem in itself, unfortunately. Maybe you'll also notice the same thing as me: when running 10.0.7 locally, do you have multiple copies of the exact same image in the cache, just with different expiry timestamps (the timestamp is the first part of the file name)? To replicate the traffic locally I used a tool called k6, simulating 15 users over 10 seconds, and I could see my cache being populated with duplicates. As a side effect of all this over-processing, there was definitely a hit on hardware for me.
In version 10.1.1, the memory consumption is still too high
This PR fixes the problem that the image optimization API uses a large amount of memory which is not correctly freed afterwards. There are multiple causes of this problem:

### 1. Too many WebAssembly instances are created

We used to do all the image processing operations (decode, resize, rotate, encodeJpeg, encodePng, encodeWebp) inside each worker thread, where each operation creates at least one WASM instance, and we create `os.cpus().length - 1` workers by default. That means in the worst case there will be `N*6` WASM instances created (N is the number of CPU cores minus one).

This PR changes it to a pipeline-like architecture: there will be at most 6 workers, and the same type of operation will always be assigned to the same worker. With this change, at most 6 WASM instances will be created.

### 2. WebAssembly memory can't be deallocated

It's known that [WebAssembly can't simply deallocate its memory as of today](https://stackoverflow.com/a/51544868/2424786). And due to the implementation/design of the WASM modules we are using, they're not well suited to long-running use; they're closer to one-off tools. Each operation, such as resize, allocates **new memory** to store its data, so memory usage increases quickly as more images are processed.

The fix is to get rid of `execOnce` for WASM module initializations, so a new WASM module is created each time and the old one is GC'd entirely because nothing references it anymore. That's the only (and easiest) way I know of to free the memory used by a WASM module.

### 3. WebAssembly memory isn't correctly freed after finishing the operation

`wasm-bindgen` generates code with global variables such as `cachegetUint8Memory0` and `wasm` that always hold a reference to the WASM memory. We need to clean them up manually after finishing each operation. This PR ensures that these variables are deleted so the memory overhead can go back to 0 when an operation is finished.

### 4. Memory leak inside event listeners

`emscripten` generates code that registers global error listeners (without ever cleaning them up): https://github.com/vercel/next.js/blob/99a4ea6/packages/next/next-server/server/lib/squoosh/webp/webp_node_dec.js#L39-L43

And the listeners hold references to the WASM instance, directly or indirectly: https://github.com/vercel/next.js/blob/99a4ea6/packages/next/next-server/server/lib/squoosh/webp/webp_node_dec.js#L183-L192 (`e`, `y`, `r`).

That means whenever an emscripten WASM module is created, its memory is retained by the global scope, and when we replace the module with a new one, the new listener is added while the old one is still referenced, which causes a leak. Since we're running these modules inside worker threads (which retry on failure), this PR simply removes the listeners.

### Test

Here are some statistics showing that these changes improve memory usage a lot (the app I'm using to test has one page of 20 high-res PNGs):

Before this PR (`next@10.1.0`):

<img src="https://user-images.githubusercontent.com/3676859/113058480-c3496100-91e0-11eb-9e5a-b325e484adac.png" width="500">

Memory went from ~250MB to 3.2GB (peak: 3.5GB) and never decreased again.

With fix 1 applied:

<img src="https://user-images.githubusercontent.com/3676859/113059060-921d6080-91e1-11eb-8ac6-83c70c1f2f75.png" width="500">

Memory went from ~280MB to 1.5GB (peak: 2GB).

With fixes 1+2 applied:

<img src="https://user-images.githubusercontent.com/3676859/113059207-bf6a0e80-91e1-11eb-845a-870944f9e116.png" width="500">

Memory went from ~280MB to 1.1GB (peak: 1.6GB).

With fixes 1+2+3+4 applied:

<img src="https://user-images.githubusercontent.com/3676859/113059362-ec1e2600-91e1-11eb-8d9a-8fbce8808802.png" width="500">

It's back to normal: memory went from ~300MB to ~480MB, peaking at 1.2GB. You can clearly see that GC is working correctly here.
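The `execOnce` removal described in fix 2 can be sketched as follows. This is a minimal illustration, not the actual squoosh glue code: `execOnce` is a simplified reimplementation, and `createModule`/`runOperation` are hypothetical stand-ins for the real module factory and operation dispatch.

```javascript
// Before: the module factory was wrapped in execOnce, so a single WASM
// instance lived (and kept its memory) for the entire process lifetime.
const execOnce = (fn) => {
  let used = false;
  let result;
  return (...args) => {
    if (!used) {
      used = true;
      result = fn(...args);
    }
    return result;
  };
};

// After: instantiate a fresh module per operation. Once the call returns
// and the local reference goes out of scope, the whole instance becomes
// eligible for GC, which is the only practical way to reclaim WASM memory.
async function runOperation(createModule, op, ...args) {
  const mod = await createModule(); // new instance on every call
  return mod[op](...args);
}
```

The trade-off is paying module instantiation cost per operation in exchange for bounded memory, which is acceptable here because the modules were effectively one-off anyway.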
---

## Bug

- [x] Related issues #23189, #23436
- [ ] Integration tests added

## Feature

- [ ] Implements an existing feature request or RFC. Make sure the feature request has been accepted for implementation before opening a PR.
- [ ] Related issues linked using `fixes #number`
- [ ] Integration tests added
- [ ] Documentation added
- [ ] Telemetry added. In case of a feature if it's used or not.

## Documentation / Examples

- [ ] Make sure the linting passes
Hey @shuding, I have just seen your comment on the other issue about memory usage. I think the issues I raised above are related. I believe this still exists in the canary version (unless I'm missing something 😄). The more concurrent traffic I see, the more memory is used, pretty consistently. Are you able to replicate?
Hi @kmdev1, yes, I think these two issues are possibly related, and I'm now trying to reproduce this one. Just to confirm: you are still able to see the high memory usage in the latest canary with concurrent traffic, right? Asking because in the same situation memory usage should be lower with the fix in #23565.
For what it is worth, our organization is on Next.js 10.0.5 and seeing spikes up to 100% memory utilization, followed by instance cycling due to OOM on Cloud Run. This started happening once we enabled long-running instances (previously we cycled continuously). One other option seems to be moving to the new canary release with @shuding's fix: https://github.com/vercel/next.js/releases/tag/v10.1.3-canary.0 UPDATE: Turns out that after we upgraded to 10.1.0, realized this issue and reverted, we had our package.json as
Hey @shuding, great piece of work there 😄. The canary release does indeed fix the memory overload, and memory now seems to stay similar to 10.0.7, with no real noticeable difference between the two. Of course, I know your work doesn't cater for this ticket, so I'll also drop a line confirming what I've seen on the original
Hey @shuding, I was just wondering if you'd started any work on this, as I probably have a bit of free time this weekend where I could help out if you haven't already? 😄
Sorry @kmdev1 that I missed your comment, and yes, I'm already working on a PR for this issue 😄
No worries at all @shuding! Awesome, thank you very much my friend! If there's anything I can do to help, just let me know 👍🏻
This PR adds deduplication logic to the image optimizer so the cache is reused as much as possible when there are multiple concurrent requests. Fixes #23436.

## Bug

- [x] Related issues linked using `fixes #number`
- [ ] Integration tests added

## Feature

- [ ] Implements an existing feature request or RFC. Make sure the feature request has been accepted for implementation before opening a PR.
- [ ] Related issues linked using `fixes #number`
- [ ] Integration tests added
- [ ] Documentation added
- [ ] Telemetry added. In case of a feature if it's used or not.

## Documentation / Examples

- [ ] Make sure the linting passes
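The deduplication idea can be sketched in a few lines. This is a conceptual sketch, not the PR's actual code: `optimizeOnce` and `doOptimize` are hypothetical names, and the real implementation keys on the full optimization request (URL, width, quality) inside the image optimizer.

```javascript
// All callers that arrive while the first optimization for a given key is
// still in flight share that one promise instead of starting new work.
const inFlightRequests = new Map();

function optimizeOnce(key, doOptimize) {
  if (inFlightRequests.has(key)) {
    return inFlightRequests.get(key); // join the in-flight work
  }
  const promise = Promise.resolve()
    .then(() => doOptimize(key))
    .finally(() => inFlightRequests.delete(key)); // allow later re-optimization after expiry
  inFlightRequests.set(key, promise);
  return promise;
}
```

With this in place, N concurrent requests for the same uncached image trigger the expensive fetch/transform pipeline once, and all N responses are mapped from the same result.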
This issue has been automatically locked due to no recent activity. If you are running into a similar issue, please create a new issue with the steps to reproduce. Thank you.
What version of Next.js are you using?
10.0.9
What version of Node.js are you using?
v12.19.0, v14.16.0
What browser are you using?
Any browser
What operating system are you using?
Any OS
How are you deploying your application?
AWS
Describe the Bug
The default image loader currently suffers from a race condition. Unfortunately, this means that for us this otherwise awesome functionality cannot be used at all.
Assuming the image cache does not contain the given image, when a request is made the loader will go off and do its thing, then return the image and add it to its cache. The problem comes when you have real traffic.
Example scenario:
Let's assume you have a simple page with 3 images utilising next/image. The page gets average traffic of 10 req/s for a given screen size. The page is brand new and the images have never been served before (i.e. they do not yet exist in the cache).
Currently, when this scenario happens, around 30 calls to the original image URLs are made. Each of these images then resolves, gets transformed, is placed into the cache, and is returned to the user. The important point is that we now have a cache folder containing 10 cached copies of each of the three images.
When we replicate the scenario in more realistic terms (12 images on a new page with average traffic of 25 req/s), we suddenly have a cache folder containing 25 copies of each of the 12 images (300 files total). This is because when the first request comes in, the code says "hey, you're not in the cache, let's go get you". The second request comes in and says "hey, you're not in the cache" (because the first one is still processing), "let's go get you", and so on.
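The check-then-act pattern described above can be demonstrated in a few lines. This is a sketch of the failure mode, not the actual loader code; `getImageNaive` and `fetchAndOptimize` are hypothetical stand-ins for the cache lookup and the expensive fetch/transform work.

```javascript
// Naive cache: every request that arrives before the first result has
// been written sees a miss, so concurrent requests all repeat the work.
async function getImageNaive(cache, key, fetchAndOptimize) {
  if (cache.has(key)) return cache.get(key); // check...
  const image = await fetchAndOptimize(key); // ...slow: fetch, resize, encode
  cache.set(key, image); // only now do later requests see a hit
  return image;
}
```

Because the gap between the cache check and the cache write spans an `await`, every concurrent request for the same key falls into it, which is exactly the duplicated work and duplicated cache files described above.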
So why does this cause the container to die? Writing to and reading from disk isn't the problem on its own; it's not the only thing the loader is doing. HTTP requests, filesystem access, and image manipulation are all involved, and these are heavy on not only memory but CPU too.
In the above example we're asking a container not only to fetch 300 images per second, but also to manipulate them, write them to file storage, and serve them back. By the time these have finished, the next second comes around with another 300 requests, and chances are the images still aren't in the cache because CPU and memory usage have rocketed, so things back up and get worse and worse until the container falls over completely.
TL;DR:
Requests for non-cached images are not aware of each other, which means there's a race condition whenever the same image is requested multiple times during the window in which it is not yet in the cache. With enough traffic, this spirals into an inescapable loop that leads to the containers falling over.
Expected Behavior
I'd expect that when parallel requests are made, the image loader is only called once, sitting behind some sort of batching mechanism (similar to DataLoader) that removes duplicates, is aware of parallel requests, and maps the responses accordingly. The site could then receive 500 requests per second for the same image but only do the hardware-intensive work once.
To Reproduce