
WebFlux MultipartParser Blocks When Uploading Large Files, Causing Uploads to Occasionally Hang #34178

Closed
guqing opened this issue Dec 30, 2024 · 11 comments
Labels
in: web Issues in web modules (web, webmvc, webflux, websocket) status: duplicate A duplicate of another issue type: bug A general bug

Comments

@guqing

guqing commented Dec 30, 2024

Description

When using WebFlux to upload large files, the upload process sometimes hangs before reaching the route's business logic. Specifically, the hang occurs while org.springframework.http.codec.multipart.MultipartParser is receiving the client's upload and writing it to a temporary file, leaving the upload stuck. The problem is intermittent, but it occurs with some probability when uploading files of around 200 MB, 2-3 GB, or 4-5 GB.

Steps to Reproduce

  1. Set up a Spring Boot application using WebFlux to handle file upload requests (a minimal handler sketch follows this list).
  2. Upload a large file (e.g., 200MB, 2-3GB, or 4-5GB).
  3. Observe the upload process. Occasionally, the upload hangs before the file is fully transmitted to the server. At this point, the generated temporary file xxx.multipart is incomplete, and the request remains in a pending state without completing the upload.
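
For illustration, here is a minimal sketch of the kind of upload handler involved. This is an assumption for readability, not code taken from the attached reproduction project; the /upload path and the UploadController name are made up.

import java.nio.file.Path;

import org.springframework.http.MediaType;
import org.springframework.http.codec.multipart.FilePart;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestPart;
import org.springframework.web.bind.annotation.RestController;

import reactor.core.publisher.Mono;

@RestController
public class UploadController {

	@PostMapping(value = "/upload", consumes = MediaType.MULTIPART_FORM_DATA_VALUE)
	public Mono<String> upload(@RequestPart("file") Mono<FilePart> file) {
		// The hang reported here happens before this handler body runs, while
		// MultipartParser is still writing the request body to a temporary
		// xxx.multipart file.
		return file.flatMap(part -> part
				.transferTo(Path.of(System.getProperty("java.io.tmpdir"), part.filename()))
				.thenReturn("uploaded: " + part.filename()));
	}
}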

Additional Resources

  • A minimal reproduction project that can replicate the issue is attached: large-file-upload.zip. However, the hang is not guaranteed to happen on every attempt; I had to try a few times.
  • Screen Recording Demonstration:
Kapture.2024-12-30.at.16.32.36.mp4

In the screen recording, I used an ISO file that I downloaded from the Manjaro official website: https://download.manjaro.org/gnome/24.2.1/manjaro-gnome-24.2.1-241216-linux612.iso

The file upload progress in the video gets stuck at 68% and never completes, no matter how long you wait, and no error message is displayed.

Expected Behavior

Large files should upload and be processed smoothly without hanging or blocking at any stage.

Actual Behavior

When uploading large files, the upload sometimes hangs while MultipartParser is writing to the temporary file, preventing the upload from completing.

Environment Information

  • Spring Boot Version: 3.4.1
  • Java Versions:
    • OpenJDK 17
    • Eclipse Temurin JDK 21.0.5
  • Operating Systems:
    • macOS
    • Linux
  • Other Relevant Dependencies: spring-boot-starter-webflux

Note: The issue has been tested and reproduced across all the above-mentioned systems and JDK versions.

Additional Information

This issue is not caused by business logic but by the framework’s handling of large file uploads and writing to temporary files, which leads to blocking. We hope the development team can investigate and resolve this issue to enhance the stability and reliability of large file uploads.

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged or decided on label Dec 30, 2024
@guqing
Author

guqing commented Dec 30, 2024

Perhaps it would be more appropriate to raise this issue in the spring-framework project. I apologize for overlooking this. Could the developers please transfer this issue to the spring-framework repository?🤪

@snicoll snicoll transferred this issue from spring-projects/spring-boot Dec 30, 2024
@JohnNiang

Several upload attempts of the same file (2.7G Nov 26 15:38 Fedora-Server-dvd-x86_64-41-1.4.iso) stalled at different points:

    1.7 GiB [####################]  6138071211851868018.multipart
    1.5 GiB [#################   ]  14413026015246421036.multipart
    1.4 GiB [################    ]  15576621334079039706.multipart
    1.1 GiB [############        ]  14246059143399172415.multipart

@bclozel bclozel added the in: web Issues in web modules (web, webmvc, webflux, websocket) label Feb 4, 2025
@sdeleuze sdeleuze self-assigned this Feb 11, 2025
@sdeleuze
Contributor

Thanks for the very high-quality reproducer, that's much appreciated. So far I am unable to reproduce it after 10 tries with manjaro-gnome-24.2.1-241216-linux612.iso. Could you share some indication of how frequently the issue occurs on your side?

@sdeleuze sdeleuze added the status: waiting-for-feedback We need additional information before we can continue label Feb 11, 2025
@chenggangpro
Contributor

Through my debugging: whenever the condition upstream() != null && !this.sink.isCancelled() && this.sink.requestedFromDownstream() == 0 && !this.requestOutstanding.get() is hit, the upload hangs 100% of the time in my local tests (several dozen runs). Whenever the upload succeeds, that debug point is never reached.

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Feb 11, 2025
@bclozel
Member

bclozel commented Feb 11, 2025

@chenggangpro it would be interesting to know which condition doesn't match. Is it this.sink.requestedFromDownstream() == 0 or this.requestOutstanding == false? This is probably a concurrency issue in the parser and we need to pinpoint it exactly in order to fix it.

I think the proposed fix in #34388 accidentally fixes things by over-requesting data, but queuing onNext signals might cause other issues: too much memory consumption or even parsing errors?

@bclozel bclozel added status: waiting-for-feedback We need additional information before we can continue and removed status: feedback-provided Feedback has been provided labels Feb 11, 2025
@chenggangpro
Contributor

chenggangpro commented Feb 12, 2025

@bclozel I think it's the condition this.sink.requestedFromDownstream() == 0, but I don't know what the cause is.

My local debugging is as follows.

I added some debug logging points to MultipartParser:

  • MultipartParser#parse method

Original Source Code Location

public static Flux<Token> parse(Flux<DataBuffer> buffers, byte[] boundary, int maxHeadersSize, Charset headersCharset) {
	return Flux.create(sink -> {
		MultipartParser parser = new MultipartParser(sink, boundary, maxHeadersSize, headersCharset);
		sink.onCancel(parser::onSinkCancel);
		sink.onRequest(l -> logger.warn("===== Sink On request : " + l)); // here is the debug logging point
		buffers.subscribe(parser);
	});
}

I didn't add parser.requestBuffer() into the sink.onRequest(...), just the logging point.

  • MultipartParser#requestBuffer method

Original Source Code Location

private void requestBuffer() {
	if (upstream() != null &&
			!this.sink.isCancelled() &&
			this.sink.requestedFromDownstream() > 0 &&
			this.requestOutstanding.compareAndSet(false, true)) {
		request(1);
	}
	else if (!this.requestOutstanding.get()) {
		// here is the debug logging point
		logger.warn("===== Request buffer called =================");
		logger.warn("===== Sink is cancelled :" + sink.isCancelled());
		logger.warn("===== Sink requested from down stream :" + sink.requestedFromDownstream());
		logger.warn("===== Request buffer called =================");
	}
}

Then I uploaded manjaro-gnome-24.2.1-241216-linux612.iso, and here is the log: large-file-upload-original-debug-logging.zip.

You can see in lines L#483532 through L#483539 of the log file that sink.requestedFromDownstream() is ZERO. Furthermore, the requestBuffer() method is never called from within sink.onRequest(...). So in my opinion this is the bug point, but I don't know what causes sink.requestedFromDownstream() == 0. When I first dove into the parser, I added a debug point at MultipartParser.java#L192 with the condition sink.requestedFromDownstream() == 0; that is why I added the logging debug points mentioned above.

I kept PR #34388 running for 6 hours yesterday, and there were no hangs or parsing errors. However, I am not sure whether my fix is actually correct or whether there are potential errors I haven't noticed. I hope my debugging process is useful for you all in solving this issue.

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Feb 12, 2025
@guqing
Author

guqing commented Feb 12, 2025

Thanks for the very high-quality reproducer, that's much appreciated. So far I am unable to reproduce it after 10 tries with manjaro-gnome-24.2.1-241216-linux612.iso. Could you share some indication of how frequently the issue occurs on your side?

Thanks for looking into this! There isn’t a fixed frequency for reproduction—it might take just 3 attempts sometimes, while other times it doesn’t reproduce even after dozens of tries. It seems to require a bit of luck. However, in applications with a large user base, users do encounter this issue. Fortunately, they can usually resolve it by retrying, as seen in cases like halo-dev/halo#7170

@sdeleuze
Contributor

I just reproduced it.

[screenshot attached]

@sdeleuze sdeleuze added this to the 6.2.x milestone Feb 12, 2025
@sdeleuze sdeleuze added type: bug A general bug and removed status: waiting-for-triage An issue we've not yet triaged or decided on status: feedback-provided Feedback has been provided labels Feb 12, 2025
@sdeleuze
Contributor

sdeleuze commented Feb 19, 2025

@chemicL We suspect that here, when this.sink.requestedFromDownstream() == 0, request(1) is skipped (which is intended), but then the related requestBuffer() method is never invoked again, and the transfer of big files hangs forever.

#34388 proposed adding sink.onRequest(l -> parser.requestBuffer()); in the Flux#create lambda. Adding this additional sink.onRequest invocation kind of makes sense, but we were concerned about potentially over-requesting. I am wondering if we should introduce a refined version of sink.onRequest(l -> parser.requestBuffer()); that would conditionally invoke parser.requestBuffer() only after this.sink.requestedFromDownstream() == 0 has been detected. Does that make sense from your POV? Do you have any guidelines or a link to a similar use case that could help?
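
For illustration, a rough sketch of what such a refined hook might look like inside MultipartParser. This is an assumption for discussion, not the actual change in #34388; the starved field and the onSinkRequest method are made-up names, and the race between checking the downstream demand and setting the flag would still need careful handling.

// Hypothetical field inside MultipartParser, alongside the existing requestOutstanding.
private final AtomicBoolean starved = new AtomicBoolean();

private void requestBuffer() {
	if (upstream() != null &&
			!this.sink.isCancelled() &&
			this.sink.requestedFromDownstream() > 0 &&
			this.requestOutstanding.compareAndSet(false, true)) {
		request(1);
	}
	else if (this.sink.requestedFromDownstream() == 0) {
		// Remember that an upstream request was skipped for lack of downstream demand.
		this.starved.set(true);
	}
}

// Registered in parse() via sink.onRequest(parser::onSinkRequest).
private void onSinkRequest(long n) {
	// Only resume when a request was previously skipped, to avoid over-requesting.
	if (this.starved.compareAndSet(true, false)) {
		requestBuffer();
	}
}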

@sdeleuze sdeleuze added the status: waiting-for-internal-feedback An issue that needs input from a member or another Spring Team label Feb 19, 2025
@chemicL
Member

chemicL commented Feb 20, 2025

@sdeleuze I had just a very brief look. My observations follow. Let me know if you'd need more assistance and I can dedicate more time to help.

My understanding is that there will be a situation where the downstream (the subscriber to FluxCreate, represented by the FluxSink) is not able to keep up, so it issues no demand for some time.

At the same time, the upstream (Flux<DataBuffer>) is done delivering what was requested and would continue, but the downstream is not interested. So that's typical backpressure kicking in.

Eventually, the downstream does issue a request() to the FluxCreate, but the MultipartParser is not able to observe that; the implementation doesn't seem to be wired up to observe such a late event. Since no signal is triggered by the upstream, the means to see the downstream demand and issue another upstream request are never triggered.

Consider registering a connection between MultipartParser and FluxSink via FluxSink#onRequest(LongConsumer consumer) and use that hint to drive the continuation of processing.

Now, having seen the above PR (#34388, which aims to introduce exactly that link) and the concerns you raise, I think they are valid; that's similar to a case I was facing in the FluxBuffer implementation, where the resumption has to be detected by tracking downstream demand.

I don't know the specifics of the buffer chunking and buffering, but requesting more than one at a time is not bad per se, since reactive signals still happen sequentially; the source can just buffer up more independently. There would need to be some built-in understanding of how a downstream demand of N translates to an upstream demand of M, though, and it might be a safer bet to just request 1 at a time as is done now. I understand the stalls are the current concern, not the throughput.
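
To make that pattern concrete, here is a small standalone Reactor sketch (a toy example under assumed names such as DemandAwareBridge, not MultipartParser code): a Flux.create bridge that only pulls one element from its upstream while the sink reports outstanding demand, and that uses FluxSink#onRequest so that late downstream demand restarts the upstream requests instead of stalling.

import java.util.concurrent.atomic.AtomicBoolean;

import org.reactivestreams.Subscription;

import reactor.core.publisher.BaseSubscriber;
import reactor.core.publisher.Flux;
import reactor.core.publisher.FluxSink;

// Toy bridge that re-emits an upstream Flux through a FluxSink one element at a time.
// Without the sink.onRequest(...) hook below, a downstream that temporarily stops
// requesting would stall the bridge forever, which is the shape of this bug.
public class DemandAwareBridge<T> extends BaseSubscriber<T> {

	private final FluxSink<T> sink;

	private final AtomicBoolean requestOutstanding = new AtomicBoolean();

	private DemandAwareBridge(FluxSink<T> sink) {
		this.sink = sink;
		// Late downstream demand must be able to restart the upstream requests.
		sink.onRequest(n -> requestUpstream());
		sink.onCancel(this::cancel);
	}

	public static <T> Flux<T> bridge(Flux<T> upstream) {
		return Flux.create(sink -> upstream.subscribe(new DemandAwareBridge<>(sink)));
	}

	@Override
	protected void hookOnSubscribe(Subscription subscription) {
		requestUpstream();
	}

	@Override
	protected void hookOnNext(T value) {
		this.requestOutstanding.set(false);
		this.sink.next(value);
		requestUpstream();
	}

	@Override
	protected void hookOnComplete() {
		this.sink.complete();
	}

	@Override
	protected void hookOnError(Throwable ex) {
		this.sink.error(ex);
	}

	private void requestUpstream() {
		// Only pull while the downstream has outstanding demand and no request is in flight,
		// mirroring the checks in MultipartParser#requestBuffer.
		if (upstream() != null &&
				!this.sink.isCancelled() &&
				this.sink.requestedFromDownstream() > 0 &&
				this.requestOutstanding.compareAndSet(false, true)) {
			request(1);
		}
	}
}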

@sdeleuze sdeleuze removed this from the 6.2.x milestone Feb 20, 2025
@sdeleuze sdeleuze removed the status: waiting-for-internal-feedback An issue that needs input from a member or another Spring Team label Feb 20, 2025
@sdeleuze sdeleuze added the status: duplicate A duplicate of another issue label Feb 20, 2025
@sdeleuze
Contributor

Thanks for the detailed feedback. Based on those insights, I am going to reopen #34388 and close this issue. I will do more tests locally before merging it.
