Bug: Last 2 Chunks In Streaming Mode Come Together In Firefox #9502

Closed
CentricStorm opened this issue Sep 16, 2024 · 3 comments
Labels
bug-unconfirmed, medium severity (Used to report medium severity bugs in llama.cpp, e.g. Malfunctioning Features but still useable)

Comments

@CentricStorm (Contributor)

What happened?

When using /completion with stream: true, the last two JSON chunks arrive together in Firefox, while Chrome seems to handle it fine, so it might be a Firefox bug.

Looking further into this, it seems that HTTP Transfer-Encoding: chunked requires each chunk to be terminated with \r\n, but here \n\n is used instead:

const std::string str =
    std::string(event) + ": " +
    data.dump(-1, ' ', false, json::error_handler_t::replace) +
    "\n\n"; // note: these newlines are important (not sure why though, if you know, add a comment to explain)

This doesn't seem to be just a Windows requirement; it is listed as part of the HTTP specification:
HTTP Chunked Transfer Coding

More information, including an example chunked response:
Transfer-Encoding Directives
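
For reference, here is a minimal sketch (illustration only, not llama.cpp code) of how one SSE-style message might sit inside chunked transfer coding: the \r\n terminators required by the specification belong to the chunked framing around each payload, which the HTTP library would normally add, while the \n\n is part of the payload itself.

const ssePayload = 'data: {"content":"Hi"}\n\n'         // one SSE message, terminated by a blank line
const chunkSize = ssePayload.length.toString(16)        // chunk size in hex (ASCII payload, so .length equals byte length)
const chunk = chunkSize + "\r\n" + ssePayload + "\r\n"  // size, CRLF, chunk data, CRLF
const lastChunk = "0\r\n\r\n"                           // zero-length chunk terminates the body
console.log(JSON.stringify(chunk + lastChunk))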

Name and Version

llama-server.exe
version: 3761 (6262d13)
built with MSVC 19.29.30154.0 for x64

What operating system are you seeing the problem on?

Windows

Relevant log output

No response

@CentricStorm (Contributor, Author)

This was roughly how the HTTP API was working before (which still works in Chrome):

const prompt = "Hello" // placeholder prompt, not part of the original snippet

const response = await fetch("http://localhost/completion", {
	method: "POST",
	body: JSON.stringify({
		prompt,
		n_predict: 32,
		stream: true
	})
})
// Assumes each decoded chunk is exactly one SSE message ("data: {...}\n\n"),
// which is what breaks when two messages arrive in the same chunk.
for await (const chunk of response.body.pipeThrough(new TextDecoderStream("utf-8"))) {
	if (chunk.startsWith("error")) {
		break // stop on an error message from the server
	}
	const data = JSON.parse(chunk.substring(6)) // strip the leading "data: "
}

The documentation doesn't mention whether this is the intended way to use streaming mode.

@ggerganov (Owner)

Btw, we now also add [DONE]\n\n at the end of the response: #9459

(Not sure if this is relevant, as I have little knowledge about how the HTTP stuff should work.)

@CentricStorm (Contributor, Author)

Btw, we now also add [DONE]\n\n at the end of the response: #9459

I think that's only for the OpenAI-compatible API /chat/completions, not for llama-server's own API /completion.
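
For clients of that OpenAI-compatible endpoint, the sentinel just needs to be skipped before JSON parsing. A minimal sketch (not llama.cpp code), assuming the sentinel arrives OpenAI-style as its own data: [DONE] message:

function parseMessage(message) {
	// Skip the end-of-stream sentinel before parsing (assumed "data: [DONE]" format)
	const payload = message.replace(/^data: /, "")
	if (payload.trim() === "[DONE]") {
		return null // end of stream, nothing to parse
	}
	return JSON.parse(payload)
}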

More research shows that llama-server is currently responding to stream requests using a format closely resembling server-sent events (one difference is that llama-server can send messages with an error field, even though that is non-standard).

This seems strange at first because server-sent events are intended to be consumed client-side with the EventSource interface, but that interface doesn't support HTTP POST requests (which llama-server requires).

Using fetch instead to access these server-sent events is probably non-standard, and is most likely the reason why the behavior is different in Firefox and Chrome. In other words, it may not be a bug at all.

Regardless, more information can be added to the documentation in #9519, including an example script that manually splits the chunks and works in Firefox as well as in Node.
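
A minimal sketch of such a script (not the exact one from #9519), run as an ES module, assuming the /completion endpoint streams data: {...} messages separated by blank lines and that each streamed JSON object carries a content field:

const prompt = "Hello" // placeholder prompt

const response = await fetch("http://localhost/completion", {
	method: "POST",
	body: JSON.stringify({ prompt, n_predict: 32, stream: true })
})

let buffer = ""
for await (const chunk of response.body.pipeThrough(new TextDecoderStream())) {
	buffer += chunk
	// Messages are delimited by a blank line; the last piece may be incomplete,
	// so keep it in the buffer for the next iteration.
	const messages = buffer.split("\n\n")
	buffer = messages.pop()
	for (const message of messages) {
		if (message.startsWith("error")) {
			throw new Error(message)
		}
		console.log(JSON.parse(message.substring(6)).content) // strip the leading "data: "
	}
}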
