
WebFlux MultipartParser Blocks When Uploading Large Files, Causing Uploads to Occasionally Hang #34178

Closed
guqing opened this issue Dec 30, 2024 · 11 comments
Labels
in: web Issues in web modules (web, webmvc, webflux, websocket) status: duplicate A duplicate of another issue type: bug A general bug

Comments

@guqing

guqing commented Dec 30, 2024

Description

When using WebFlux to upload large files, the upload process sometimes hangs before reaching the route's business logic. Specifically, the hang occurs while org.springframework.http.codec.multipart.MultipartParser is receiving the client's upload and writing it to a temporary file, leaving the upload stuck. The problem is intermittent, but it occurs with some probability when uploading files of around 200 MB, 2-3 GB, or 4-5 GB.

Steps to Reproduce

  1. Set up a Spring Boot application using WebFlux to handle file upload requests (a minimal handler sketch follows this list).
  2. Upload a large file (e.g., 200MB, 2-3GB, or 4-5GB).
  3. Observe the upload process. Occasionally, the upload hangs before the file is fully transmitted to the server. At this point, the generated temporary file xxx.multipart is incomplete, and the request remains in a pending state without completing the upload.
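
For illustration, here is a minimal sketch of the kind of upload handler involved. This is an assumption for readability, not code taken from the attached reproduction project; the /upload path and the UploadController name are made up.

import java.nio.file.Path;

import org.springframework.http.MediaType;
import org.springframework.http.codec.multipart.FilePart;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestPart;
import org.springframework.web.bind.annotation.RestController;

import reactor.core.publisher.Mono;

@RestController
public class UploadController {

	@PostMapping(value = "/upload", consumes = MediaType.MULTIPART_FORM_DATA_VALUE)
	public Mono<String> upload(@RequestPart("file") Mono<FilePart> file) {
		// The hang reported here happens before this handler body runs, while
		// MultipartParser is still writing the request body to a temporary
		// xxx.multipart file.
		return file.flatMap(part -> part
				.transferTo(Path.of(System.getProperty("java.io.tmpdir"), part.filename()))
				.thenReturn("uploaded: " + part.filename()));
	}
}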

Additional Resources

  • A minimal reproduction project that can replicate the issue is attached: large-file-upload.zip. However, the hang is not guaranteed to happen on every attempt; I had to try a few times.
  • Screen Recording Demonstration:
Kapture.2024-12-30.at.16.32.36.mp4

In the screen recording, I used an ISO file that I downloaded from the Manjaro official website: https://download.manjaro.org/gnome/24.2.1/manjaro-gnome-24.2.1-241216-linux612.iso

The file upload progress in the video gets stuck at 68% and never completes, no matter how long you wait, and no error message is displayed.

Expected Behavior

Large files should upload and be processed smoothly without hanging or blocking at any stage.

Actual Behavior

When uploading large files, the upload sometimes hangs while MultipartParser is writing to the temporary file, preventing the upload from completing.

Environment Information

  • Spring Boot Version: 3.4.1
  • Java Versions:
    • OpenJDK 17
    • Eclipse Temurin JDK 21.0.5
  • Operating Systems:
    • macOS
    • Linux
  • Other Relevant Dependencies: spring-boot-starter-webflux

Note: The issue has been tested and reproduced across all the above-mentioned systems and JDK versions.

Additional Information

This issue is not caused by business logic but by the framework’s handling of large file uploads and writing to temporary files, which leads to blocking. We hope the development team can investigate and resolve this issue to enhance the stability and reliability of large file uploads.

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged or decided on label Dec 30, 2024
@guqing
Author

guqing commented Dec 30, 2024

Perhaps it would be more appropriate to raise this issue in the spring-framework project. I apologize for overlooking this. Could the developers please transfer this issue to the spring-framework repository?🤪

@snicoll snicoll transferred this issue from spring-projects/spring-boot Dec 30, 2024
@JohnNiang

Several upload attempts of the same file (2.7G Nov 26 15:38 Fedora-Server-dvd-x86_64-41-1.4.iso) stalled at different points:

    1.7 GiB [####################]  6138071211851868018.multipart
    1.5 GiB [#################   ]  14413026015246421036.multipart
    1.4 GiB [################    ]  15576621334079039706.multipart
    1.1 GiB [############        ]  14246059143399172415.multipart

@bclozel bclozel added the in: web Issues in web modules (web, webmvc, webflux, websocket) label Feb 4, 2025
@sdeleuze sdeleuze self-assigned this Feb 11, 2025
@sdeleuze
Contributor

Thanks for the very high-quality reproducer, that's much appreciated. So far I am unable to reproduce it after 10 tries with manjaro-gnome-24.2.1-241216-linux612.iso. Could you share some indication of how frequently the issue occurs on your side?

@sdeleuze sdeleuze added the status: waiting-for-feedback We need additional information before we can continue label Feb 11, 2025
@chenggangpro
Contributor

Through my debugging: whenever the condition upstream() != null && !this.sink.isCancelled() && this.sink.requestedFromDownstream() == 0 && !this.requestOutstanding.get() is hit, the upload hangs 100% of the time in my local tests (several dozen runs). Whenever the upload succeeds, that debug point is never reached.

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Feb 11, 2025
@bclozel
Member

bclozel commented Feb 11, 2025

@chenggangpro it would be interesting to know which condition doesn't match. Is it this.sink.requestedFromDownstream() == 0 or this.requestOutstanding == false? This is probably a concurrency issue in the parser and we need to pinpoint it exactly in order to fix it.

I think the proposed fix in #34388 accidentally fixes things by over-requesting data, but queuing onNext signals might cause other issues: too much memory consumption or even parsing errors?

@bclozel bclozel added status: waiting-for-feedback We need additional information before we can continue and removed status: feedback-provided Feedback has been provided labels Feb 11, 2025
@chenggangpro
Contributor

chenggangpro commented Feb 12, 2025

@bclozel I think it's the condition this.sink.requestedFromDownstream() == 0, but I don't know what the cause is.

My local debugging is as follows.

I added some debug logging points to MultipartParser:

  • MultipartParser#parse method

Original Source Code Location

public static Flux<Token> parse(Flux<DataBuffer> buffers, byte[] boundary, int maxHeadersSize, Charset headersCharset) {
	return Flux.create(sink -> {
		MultipartParser parser = new MultipartParser(sink, boundary, maxHeadersSize, headersCharset);
		sink.onCancel(parser::onSinkCancel);
		sink.onRequest(l -> logger.warn("===== Sink On request : " + l)); // here is the debug logging point
		buffers.subscribe(parser);
	});
}

I didn't add parser.requestBuffer() into the sink.onRequest(...), just the logging point.

  • MultipartParser#requestBuffer method

Original Source Code Location

private void requestBuffer() {
	if (upstream() != null &&
			!this.sink.isCancelled() &&
			this.sink.requestedFromDownstream() > 0 &&
			this.requestOutstanding.compareAndSet(false, true)) {
		request(1);
	}
	else if (!this.requestOutstanding.get()) {
		// here is the debug logging point
		logger.warn("===== Request buffer called =================");
		logger.warn("===== Sink is cancelled :" + sink.isCancelled());
		logger.warn("===== Sink requested from down stream :" + sink.requestedFromDownstream());
		logger.warn("===== Request buffer called =================");
	}
}

Then I uploaded manjaro-gnome-24.2.1-241216-linux612.iso, and here is the log: large-file-upload-original-debug-logging.zip.

You can see in lines L#483532 through L#483539 of the log file that sink.requestedFromDownstream() is ZERO. Furthermore, the requestBuffer() method is never called from within sink.onRequest(...). So in my opinion this is the bug point, but I don't know what causes sink.requestedFromDownstream() == 0. When I first dove into the parser, I added a debug point at MultipartParser.java#L192 with the condition sink.requestedFromDownstream() == 0; that is why I added the logging debug points mentioned above.

I kept PR #34388 running for 6 hours yesterday, and there were no hangs or parsing errors. However, I am not sure whether my fix is actually correct or whether there are potential errors I haven't noticed. I hope my debugging process is useful for you all in solving this issue.

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Feb 12, 2025
@guqing
Author

guqing commented Feb 12, 2025

Thanks for the very high-quality reproducer, that's much appreciated. So far I am unable to reproduce it after 10 tries with manjaro-gnome-24.2.1-241216-linux612.iso. Could you share some indication of how frequently the issue occurs on your side?

Thanks for looking into this! There isn’t a fixed frequency for reproduction—it might take just 3 attempts sometimes, while other times it doesn’t reproduce even after dozens of tries. It seems to require a bit of luck. However, in applications with a large user base, users do encounter this issue. Fortunately, they can usually resolve it by retrying, as seen in cases like halo-dev/halo#7170

@sdeleuze
Contributor

I just reproduced it.

[screenshot attached]

@sdeleuze sdeleuze added this to the 6.2.x milestone Feb 12, 2025
@sdeleuze sdeleuze added type: bug A general bug and removed status: waiting-for-triage An issue we've not yet triaged or decided on status: feedback-provided Feedback has been provided labels Feb 12, 2025
@sdeleuze
Contributor

sdeleuze commented Feb 19, 2025

@chemicL We suspect that here, when this.sink.requestedFromDownstream() == 0, request(1) is skipped (which is intended), but then the related requestBuffer() method is never invoked again, and the transfer of big files hangs forever.

#34388 proposed adding sink.onRequest(l -> parser.requestBuffer()); in the Flux#create lambda. Adding this additional sink.onRequest invocation kind of makes sense, but we were concerned about potentially over-requesting. I am wondering if we should introduce a refined version of sink.onRequest(l -> parser.requestBuffer()); that would conditionally invoke parser.requestBuffer() only after this.sink.requestedFromDownstream() == 0 has been detected. Does that make sense from your POV? Do you have any guidelines or a link to a similar use case that could help?
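
For illustration, a rough sketch of what such a refined hook might look like inside MultipartParser. This is an assumption for discussion, not the actual change in #34388; the starved field and the onSinkRequest method are made-up names, and the race between checking the downstream demand and setting the flag would still need careful handling.

// Hypothetical field inside MultipartParser, alongside the existing requestOutstanding.
private final AtomicBoolean starved = new AtomicBoolean();

private void requestBuffer() {
	if (upstream() != null &&
			!this.sink.isCancelled() &&
			this.sink.requestedFromDownstream() > 0 &&
			this.requestOutstanding.compareAndSet(false, true)) {
		request(1);
	}
	else if (this.sink.requestedFromDownstream() == 0) {
		// Remember that an upstream request was skipped for lack of downstream demand.
		this.starved.set(true);
	}
}

// Registered in parse() via sink.onRequest(parser::onSinkRequest).
private void onSinkRequest(long n) {
	// Only resume when a request was previously skipped, to avoid over-requesting.
	if (this.starved.compareAndSet(true, false)) {
		requestBuffer();
	}
}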

@sdeleuze sdeleuze added the status: waiting-for-internal-feedback An issue that needs input from a member or another Spring Team label Feb 19, 2025
@chemicL
Member

chemicL commented Feb 20, 2025

@sdeleuze I had just a very brief look. My observations follow. Let me know if you'd need more assistance and I can dedicate more time to help.

My understanding is that there will be a situation where the downstream (the subscriber to FluxCreate, represented by the FluxSink) is not able to keep up, so it issues no demand for some time.

At the same time, the upstream (Flux<DataBuffer>) is done delivering what was requested and would continue, but the downstream is not interested. So that's typical backpressure kicking in.

Eventually, the downstream does issue a request() to the FluxCreate, but the MultipartParser is not able to observe that; the implementation doesn't seem to be wired up to observe such a late event. Since no signal is triggered by the upstream, the means to see the downstream demand and issue another upstream request are never triggered.

Consider registering a connection between MultipartParser and FluxSink via FluxSink#onRequest(LongConsumer consumer) and use that hint to drive the continuation of processing.

Now, having seen the above PR (#34388, which aims to introduce exactly that link) and the concerns you raise, I think they are valid; that's similar to a case I was facing in the FluxBuffer implementation, where the resumption has to be detected by tracking downstream demand.

I don't know the specifics of the buffer chunking and buffering, but requesting more than one at a time is not bad per se, since reactive signals still happen sequentially; the source can just buffer up more independently. There would need to be some built-in understanding of how a downstream demand of N translates to an upstream demand of M, though, and it might be a safer bet to just request 1 at a time as is done now. I understand the stalls are the current concern, not the throughput.
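
To make that pattern concrete, here is a small standalone Reactor sketch (a toy example under assumed names such as DemandAwareBridge, not MultipartParser code): a Flux.create bridge that only pulls one element from its upstream while the sink reports outstanding demand, and that uses FluxSink#onRequest so that late downstream demand restarts the upstream requests instead of stalling.

import java.util.concurrent.atomic.AtomicBoolean;

import org.reactivestreams.Subscription;

import reactor.core.publisher.BaseSubscriber;
import reactor.core.publisher.Flux;
import reactor.core.publisher.FluxSink;

// Toy bridge that re-emits an upstream Flux through a FluxSink one element at a time.
// Without the sink.onRequest(...) hook below, a downstream that temporarily stops
// requesting would stall the bridge forever, which is the shape of this bug.
public class DemandAwareBridge<T> extends BaseSubscriber<T> {

	private final FluxSink<T> sink;

	private final AtomicBoolean requestOutstanding = new AtomicBoolean();

	private DemandAwareBridge(FluxSink<T> sink) {
		this.sink = sink;
		// Late downstream demand must be able to restart the upstream requests.
		sink.onRequest(n -> requestUpstream());
		sink.onCancel(this::cancel);
	}

	public static <T> Flux<T> bridge(Flux<T> upstream) {
		return Flux.create(sink -> upstream.subscribe(new DemandAwareBridge<>(sink)));
	}

	@Override
	protected void hookOnSubscribe(Subscription subscription) {
		requestUpstream();
	}

	@Override
	protected void hookOnNext(T value) {
		this.requestOutstanding.set(false);
		this.sink.next(value);
		requestUpstream();
	}

	@Override
	protected void hookOnComplete() {
		this.sink.complete();
	}

	@Override
	protected void hookOnError(Throwable ex) {
		this.sink.error(ex);
	}

	private void requestUpstream() {
		// Only pull while the downstream has outstanding demand and no request is in flight,
		// mirroring the checks in MultipartParser#requestBuffer.
		if (upstream() != null &&
				!this.sink.isCancelled() &&
				this.sink.requestedFromDownstream() > 0 &&
				this.requestOutstanding.compareAndSet(false, true)) {
			request(1);
		}
	}
}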

@sdeleuze sdeleuze removed this from the 6.2.x milestone Feb 20, 2025
@sdeleuze sdeleuze removed the status: waiting-for-internal-feedback An issue that needs input from a member or another Spring Team label Feb 20, 2025
@sdeleuze sdeleuze added the status: duplicate A duplicate of another issue label Feb 20, 2025
@sdeleuze
Contributor

Thanks for the detailed feedback. Based on those insights, I am going to reopen #34388 and close this issue. I will do more tests locally before merging it.
