Race condition with Image Loader makes it unusable in a production environment #23436
I assume this is related to the same thing I'm experiencing with image optimization on anything higher than Next.js v10.0.7. I've tried both v10.0.8 and v10.0.9, but any images that are not in the cache seem to cause a memory leak during conversion: memory usage rapidly climbs until my server runs out of memory and crashes. Conversion also seems much slower overall than before. Obviously this isn't suitable for production. I've downgraded back to v10.0.7 for now, and there I can see memory usage go up by about 10-15MB during a request that triggers several conversions, after which the memory gets released like it should. The regression seems to be somewhere after v10.0.7.
Hey @goguda, I think the issue you mention could exaggerate the one I've raised, but the race condition is a problem in itself, unfortunately. Maybe you'll also notice the same thing as me: when running 10.0.7 locally, do you have multiple copies of the exact same image in the cache, just with different expiry timestamps (the timestamp is the first part of the file name)? To replicate the traffic locally I used a tool called k6, simulating 15 users over 10 seconds, and I could see my cache being populated with duplicates. As a side effect of all this over-processing, there was definitely a hit on hardware for me.
In version 10.1.1, the memory consumption is still too high
This PR fixes the problem that the image optimization API uses a large amount of memory which is not correctly freed afterwards. There are multiple causes of this problem:

### 1. Too many WebAssembly instances are created

We used to do all the image processing operations (decode, resize, rotate, encodeJpeg, encodePng, encodeWebp) inside each worker thread, where each operation creates at least one WASM instance, and we create `os.cpus().length - 1` workers by default. That means in the worst case there will be `N*6` WASM instances created (N is the number of CPU cores minus one).

This PR changes it to a pipeline-like architecture: there will be at most 6 workers, and the same type of operation will always be assigned to the same worker. With this change, at most 6 WASM instances will be created.

### 2. WebAssembly memory can't be deallocated

It's known that [WebAssembly can't simply deallocate its memory as of today](https://stackoverflow.com/a/51544868/2424786). And due to the implementation/design of the WASM modules we are using, they're not well suited to long-running use; they're closer to one-off tools. Each operation, such as resize, allocates **new memory** to store its data, so memory usage increases quickly as more images are processed.

The fix is to get rid of `execOnce` for WASM module initializations, so a new WASM module is created each time and the old one is GC'd entirely because nothing references it anymore. That's the only (and easiest) way I know of to free the memory used by a WASM module.

### 3. WebAssembly memory isn't correctly freed after finishing the operation

`wasm-bindgen` generates code with global variables such as `cachegetUint8Memory0` and `wasm` that always hold a reference to the WASM memory. We need to clean them up manually after finishing each operation. This PR ensures that these variables are deleted so the memory overhead can go back to 0 when an operation is finished.

### 4. Memory leak inside event listeners

`emscripten` generates code that registers global error listeners (without ever cleaning them up): https://github.com/vercel/next.js/blob/99a4ea6/packages/next/next-server/server/lib/squoosh/webp/webp_node_dec.js#L39-L43

And the listeners hold references to the WASM instance, directly or indirectly: https://github.com/vercel/next.js/blob/99a4ea6/packages/next/next-server/server/lib/squoosh/webp/webp_node_dec.js#L183-L192 (`e`, `y`, `r`).

That means whenever an emscripten WASM module is created, its memory is retained by the global scope, and when we replace the module with a new one, the new listener is added while the old one is still referenced, which causes a leak. Since we're running these modules inside worker threads (which retry on failure), this PR simply removes the listeners.

### Test

Here are some statistics showing that these changes improve memory usage a lot (the app I'm using to test has one page of 20 high-res PNGs):

Before this PR (`next@10.1.0`):

<img src="https://user-images.githubusercontent.com/3676859/113058480-c3496100-91e0-11eb-9e5a-b325e484adac.png" width="500">

Memory went from ~250MB to 3.2GB (peak: 3.5GB) and never decreased again.

With fix 1 applied:

<img src="https://user-images.githubusercontent.com/3676859/113059060-921d6080-91e1-11eb-8ac6-83c70c1f2f75.png" width="500">

Memory went from ~280MB to 1.5GB (peak: 2GB).

With fixes 1+2 applied:

<img src="https://user-images.githubusercontent.com/3676859/113059207-bf6a0e80-91e1-11eb-845a-870944f9e116.png" width="500">

Memory went from ~280MB to 1.1GB (peak: 1.6GB).

With fixes 1+2+3+4 applied:

<img src="https://user-images.githubusercontent.com/3676859/113059362-ec1e2600-91e1-11eb-8d9a-8fbce8808802.png" width="500">

It's back to normal: memory went from ~300MB to ~480MB, peaking at 1.2GB. You can clearly see that GC is working correctly here.
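The `execOnce` removal described in fix 2 can be sketched as follows. This is a minimal illustration, not the actual squoosh glue code: `execOnce` is a simplified reimplementation, and `createModule`/`runOperation` are hypothetical stand-ins for the real module factory and operation dispatch.

```javascript
// Before: the module factory was wrapped in execOnce, so a single WASM
// instance lived (and kept its memory) for the entire process lifetime.
const execOnce = (fn) => {
  let used = false;
  let result;
  return (...args) => {
    if (!used) {
      used = true;
      result = fn(...args);
    }
    return result;
  };
};

// After: instantiate a fresh module per operation. Once the call returns
// and the local reference goes out of scope, the whole instance becomes
// eligible for GC, which is the only practical way to reclaim WASM memory.
async function runOperation(createModule, op, ...args) {
  const mod = await createModule(); // new instance on every call
  return mod[op](...args);
}
```

The trade-off is paying module instantiation cost per operation in exchange for bounded memory, which is acceptable here because the modules were effectively one-off anyway.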
---

## Bug

- [x] Related issues #23189, #23436
- [ ] Integration tests added

## Feature

- [ ] Implements an existing feature request or RFC. Make sure the feature request has been accepted for implementation before opening a PR.
- [ ] Related issues linked using `fixes #number`
- [ ] Integration tests added
- [ ] Documentation added
- [ ] Telemetry added. In case of a feature if it's used or not.

## Documentation / Examples

- [ ] Make sure the linting passes
Hey @shuding, I have just seen your comment on the other issue about memory usage. I think the issues I raised above are related. I believe this still exists in the canary version (unless I'm missing something 😄). The more concurrent traffic I see, the more memory is used, pretty consistently. Are you able to replicate?
Hi @kmdev1, yes, I think these two issues are possibly related, and I'm now trying to reproduce this one. Just to confirm: you are still able to see the high memory usage in the latest canary with concurrent traffic, right? Asking because in the same situation memory usage should be lower with the fix in #23565.
For what it is worth, our organization is on Next.js 10.0.5 and seeing spikes up to 100% memory utilization, followed by instance cycling due to OOM on Cloud Run. This started happening once we enabled long-running instances (previously we cycled continuously). One other option seems to be moving to the new canary release with @shuding's fix: https://github.com/vercel/next.js/releases/tag/v10.1.3-canary.0 UPDATE: Turns out that after we upgraded to 10.1.0, realized this issue and reverted, we had our package.json as
Hey @shuding, great piece of work there 😄. The canary release does indeed fix the memory overload, and memory now seems to stay similar to 10.0.7, with no real noticeable difference between the two. Of course, I know your work doesn't cater for this ticket, so I'll also drop a line confirming what I've seen on the original
Hey @shuding, I was just wondering if you'd started any work on this, as I probably have a bit of free time this weekend where I could help out if you haven't already? 😄
Sorry @kmdev1 that I missed your comment, and yes, I'm already working on a PR for this issue 😄
No worries at all @shuding! Awesome, thank you very much my friend! If there's anything I can do to help, just let me know 👍🏻
This PR adds deduplication logic to the image optimizer so the cache is reused as much as possible when there are multiple concurrent requests. Fixes #23436.

## Bug

- [x] Related issues linked using `fixes #number`
- [ ] Integration tests added

## Feature

- [ ] Implements an existing feature request or RFC. Make sure the feature request has been accepted for implementation before opening a PR.
- [ ] Related issues linked using `fixes #number`
- [ ] Integration tests added
- [ ] Documentation added
- [ ] Telemetry added. In case of a feature if it's used or not.

## Documentation / Examples

- [ ] Make sure the linting passes
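The deduplication idea can be sketched in a few lines. This is a conceptual sketch, not the PR's actual code: `optimizeOnce` and `doOptimize` are hypothetical names, and the real implementation keys on the full optimization request (URL, width, quality) inside the image optimizer.

```javascript
// All callers that arrive while the first optimization for a given key is
// still in flight share that one promise instead of starting new work.
const inFlightRequests = new Map();

function optimizeOnce(key, doOptimize) {
  if (inFlightRequests.has(key)) {
    return inFlightRequests.get(key); // join the in-flight work
  }
  const promise = Promise.resolve()
    .then(() => doOptimize(key))
    .finally(() => inFlightRequests.delete(key)); // allow later re-optimization after expiry
  inFlightRequests.set(key, promise);
  return promise;
}
```

With this in place, N concurrent requests for the same uncached image trigger the expensive fetch/transform pipeline once, and all N responses are mapped from the same result.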
This issue has been automatically locked due to no recent activity. If you are running into a similar issue, please create a new issue with the steps to reproduce. Thank you.
What version of Next.js are you using?
10.0.9
What version of Node.js are you using?
v12.19.0, v14.16.0
What browser are you using?
Any browser
What operating system are you using?
Any OS
How are you deploying your application?
AWS
Describe the Bug
The default image loader currently suffers from a race condition. Unfortunately, this means that for us this otherwise awesome functionality cannot be used at all.
Assuming the image cache does not contain the given image, when a request is made the loader will go off and do its thing, then return the image and add it to its cache. The problem comes when you have real traffic.
Example scenario:
Let's assume you have a simple page with 3 images utilising next/image. The page gets average traffic of 10 req/s for a given screen size. The page is brand new and the images have never been served before (i.e. they do not yet exist in the cache).
Currently, when this scenario happens, around 30 calls to the original image URLs are made. Each of these images then resolves, gets transformed, is placed into the cache, and is returned to the user. The important point is that we now have a cache folder containing 10 cached copies of each of the three images.
When we replicate the scenario in more realistic terms (12 images on a new page with average traffic of 25 req/s), we suddenly have a cache folder containing 25 copies of each of the 12 images (300 files total). This is because when the first request comes in, the code says "hey, you're not in the cache, let's go get you". The second request comes in and says "hey, you're not in the cache" (because the first one is still processing), "let's go get you", and so on.
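The check-then-act pattern described above can be demonstrated in a few lines. This is a sketch of the failure mode, not the actual loader code; `getImageNaive` and `fetchAndOptimize` are hypothetical stand-ins for the cache lookup and the expensive fetch/transform work.

```javascript
// Naive cache: every request that arrives before the first result has
// been written sees a miss, so concurrent requests all repeat the work.
async function getImageNaive(cache, key, fetchAndOptimize) {
  if (cache.has(key)) return cache.get(key); // check...
  const image = await fetchAndOptimize(key); // ...slow: fetch, resize, encode
  cache.set(key, image); // only now do later requests see a hit
  return image;
}
```

Because the gap between the cache check and the cache write spans an `await`, every concurrent request for the same key falls into it, which is exactly the duplicated work and duplicated cache files described above.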
So why does this cause the container to die? Writing to and reading from disk isn't the problem on its own; it's not the only thing the loader is doing. HTTP requests, filesystem access, and image manipulation are all involved, and these are heavy on not only memory but CPU too.
In the above example we're asking a container not only to fetch 300 images per second, but also to manipulate them, write them to file storage, and serve them back. By the time these have finished, the next second comes around with another 300 requests, and chances are the images still aren't in the cache because CPU and memory usage have rocketed, so things back up and get worse and worse until the container falls over completely.
TL;DR:
Requests for non-cached images are not aware of each other, which means there's a race condition whenever the same image is requested multiple times during the window in which it is not yet in the cache. With enough traffic, this spirals into an inescapable loop that leads to the containers falling over.
Expected Behavior
I'd expect that when parallel requests are made, the image loader is only called once, sitting behind some sort of batching mechanism (similar to DataLoader) that removes duplicates, is aware of parallel requests, and maps the responses accordingly. The site could then receive 500 requests per second for the same image but only do the hardware-intensive work once.
To Reproduce