Fix mirostat state when using multiple sequences #3543

KerfuffleV2 · 2023-10-08T10:02:33Z

As mentioned in #3537, mirostat currently isn't compatible with using multiple sequences.

The main selling point of the way this pull is implemented is that it's pretty simple and uninvasive.

However, I really don't like storing mutable sampler state in gpt_params (even though it's only in common and not the main llama.cpp API). This also required removing the const from the params argument to llama_sample_token. As far as I can see, the existing examples don't care about that.

I feel like the right way to do this is probably to move the sampler state out of gpt_params and have it passed separately. In that case, this is probably also where grammar should be since it's a type of sampler state. So we wouldn't add a new argument to llama_sample_token, we'd replace the current grammar one with sampler state. This of course would require changing a lot more stuff, including all the examples that use llama_sample_token (I don't think it would be too bad though).

Thoughts?

Closes #3537

ggerganov

Yes, a dedicated sampling state with grammar and mirostat would be better. Maybe implemented in common/sampling.h/.cpp. It should probably inherit all sampling-related parameters from gpt_params, such as temperature, top_p, top_k, etc., so that llama_sample_token accepts struct llama_sampling_state instead of struct gpt_params.

For now we can have this workaround
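
As a rough illustration of the dedicated state suggested here, the sketch below bundles the sampling parameters with the mutable Mirostat and grammar state. Everything in it is a stand-in: sampling_params_sketch, sampling_state_sketch and sampling_state_init_sketch are hypothetical names, the default values are illustrative, and seeding mu with 2 * tau is an assumption rather than something stated in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the sampling-related fields that currently live in gpt_params.
// Field names and defaults are illustrative only.
struct sampling_params_sketch {
    float   temp         = 0.8f;
    int32_t top_k        = 40;
    float   top_p        = 0.95f;
    int32_t mirostat     = 0;    // 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0
    float   mirostat_tau = 5.0f; // target entropy
    float   mirostat_eta = 0.1f; // learning rate
};

struct grammar_sketch; // opaque placeholder for llama_grammar

// Hypothetical dedicated sampling state: immutable parameters plus the
// mutable pieces that should not live in gpt_params.
struct sampling_state_sketch {
    sampling_params_sketch params;      // copy taken at init time
    float                  mirostat_mu; // mutable Mirostat state
    grammar_sketch *       grammar;     // may be nullptr
};

// Assumption: mu starts at 2 * tau.
inline sampling_state_sketch sampling_state_init_sketch(
        const sampling_params_sketch & params,
        grammar_sketch *               grammar) {
    assert(params.mirostat_tau >= 0.0f);
    return { params, 2.0f * params.mirostat_tau, grammar };
}
```

With something along these lines, a sampling function could take the state by non-const reference while gpt_params stays const.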

KerfuffleV2 · 2023-10-08T10:22:08Z

@ggerganov

For now we can have this workaround

Do you actually prefer doing it this way for now?

I don't mind changing this to do it the other way I suggested as long as you agree that approach is okay.

Maybe implemented in common/sampling.h/.cpp.

This raises another question that I'm dealing with in my seqrep sampler. Right now it's really awkward to have multiple source files in common. This is how I dealt with it:

COMMON_DEPS = common/common.cpp common/common.h build-info.h common/log.h
COMMON_OBJS = common.o
ifndef LLAMA_DISABLE_SEQREP_SAMPLER
COMMON_DEPS += common/seqrep-sampler.cpp common/seqrep-sampler.h
COMMON_OBJS += seqrep-sampler.o
endif
common.o: $(COMMON_DEPS)
	$(CXX) $(CXXFLAGS) -c $< -o $@

simple: examples/simple/simple.cpp                            build-info.h ggml.o llama.o $(COMMON_OBJS) $(OBJS)
	$(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)

and so on. It's probably less of a pain in cmake.

ggerganov · 2023-10-08T10:30:24Z

We should try separating the sampling state - it would be better than the current fix, so let's give it a try if you are up to it.

The proposed Makefile looks OK to me.

FSSRepo · 2023-10-08T14:20:09Z

When this commit is merged into the master branch, the llama_sample_token call in server-parallel (#3490) will need the extra slot.id parameter: llama_sample_token(ctx, NULL, NULL, params, slot.tokens_prev, candidates, slot.i_batch, slot.id). It will also need a params.sampler_state.erase(slot.id); call.

KerfuffleV2 · 2023-10-08T15:32:05Z

Well, this went from small and self-contained to huge and complicated. I sure hope I'm on the right track after all this.

Pretty much all the sampling stuff got moved into common/sampling.{cpp,h}. Creating sampling state takes gpt_params and llama_grammar * (can be NULL). The sampling-related params in gpt_params are also in a llama_sampling_params struct now.

The sampling state (llama_sampling_state) holds a copy of the params that were created at init time. It also holds per-sequence state for samplers that use it (Mirostat 1/2, and grammar).

If the per-sequence state doesn't exist when llama_sample_token is called, it will be created with default values. In the case of grammar, this does llama_grammar_copy on the llama_grammar * that was supplied at init time.

If you want to use a separate grammar, or separate grammar states per sequence, then you'd probably have to manually manage the grammar part yourself. (Not sure this is so easy right now, so I might need to add an interface.)

In the case of parallel generation, when a sequence ends and you want to reuse its id (or just free up memory), you need to call llama_sampling_state_reset with the sequence id. This will reset stuff like Mirostat mu to the initial value so it won't mess with future generation with that sequence id.
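
The create-on-first-use and reset behaviour described above could be sketched roughly as follows. This is not the code from the pull: seq_id_sketch, seq_state_sketch and per_seq_state_sketch are hypothetical stand-ins, the 2 * tau starting value for mu is an assumption, and resetting by erasing the entry is just one possible way to get back to default values.

```cpp
#include <cassert>
#include <unordered_map>

using seq_id_sketch = int; // stand-in for llama_seq_id

// Per-sequence mutable sampler state (only Mirostat mu in this sketch).
struct seq_state_sketch {
    float mirostat_mu;
};

struct per_seq_state_sketch {
    float mirostat_tau = 5.0f; // illustrative default
    std::unordered_map<seq_id_sketch, seq_state_sketch> per_seq;

    // Fetch the state for a sequence, creating it with default values
    // the first time the sequence id is seen.
    seq_state_sketch & get(seq_id_sketch id) {
        assert(id >= 0);
        auto it = per_seq.find(id);
        if (it == per_seq.end()) {
            // Assumption: mu starts at 2 * tau.
            it = per_seq.emplace(id, seq_state_sketch{ 2.0f * mirostat_tau }).first;
        }
        return it->second;
    }

    // Forget a sequence; the next get() recreates it with defaults,
    // so stale mu values cannot leak into a new generation.
    void reset(seq_id_sketch id) {
        per_seq.erase(id);
    }
};
```

Erasing on reset also frees the memory held for that sequence id, which matches the "or just free up memory" case.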

I also randomly threw in support for LLAMA_SANITIZE_{THREAD,ADDRESS,UNDEFINED} in the Makefile.

I think this currently doesn't break stuff, but there were some tricky parts like server and speculative.

KerfuffleV2 · 2023-10-08T15:52:18Z

@FSSRepo You're going to hate me when you see the next step in this pull.

Calling sampling is going to look like:

llama_sample_token(ctx, NULL, sampling_state, slot.tokens_prev, candidates, slot.i_batch, slot.id);

It looks like #3490 doesn't support grammar currently? So that's going to make your life easier. Pretty much the only other thing to worry about is calling llama_sampling_state_reset when you're done with a sequence id (but may want to reuse it for a different generation). Pretty much when you hit the EOS token or reach the quota of tokens to generate, you can just reset that sequence id.
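
A small sketch of the "reset when you're done with a sequence id" rule, assuming a slot structure with a token quota. slot_sketch and finish_if_done_sketch are hypothetical names, and the map of mu values merely stands in for whatever per-sequence state the sampler keeps.

```cpp
#include <cassert>
#include <unordered_map>

// Hypothetical generation slot: one sequence id plus a token quota.
struct slot_sketch {
    int id;          // sequence id used by this slot
    int n_generated; // tokens produced so far
    int n_limit;     // quota of tokens to generate
};

// Call after each sampled token. Returns true when the slot has finished,
// in which case the per-sequence sampler state for slot.id is dropped so
// the id can be reused for a different generation.
bool finish_if_done_sketch(slot_sketch & slot,
                           int token,
                           int eos_token,
                           std::unordered_map<int, float> & mu_by_seq) {
    assert(slot.n_limit > 0);
    slot.n_generated++;
    const bool done = token == eos_token || slot.n_generated >= slot.n_limit;
    if (done) {
        mu_by_seq.erase(slot.id); // stands in for the reset call
        slot.n_generated = 0;
    }
    return done;
}
```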

KerfuffleV2 · 2023-10-08T18:03:58Z

I exported the function to fetch/create default instances of sampler state. This should fix the problem I mentioned earlier about how it would be hard to do something like parallel generation where each sequence used its own grammar.

By the way, since the ggml-alloc changes I get these sanitizer reports:

ggml-alloc.c:212:32: runtime error: pointer index expression with base 0x00000100a020 overflowed to 0xffffffffffffffff
llama_new_context_with_model: compute buffer total size = 552.88 MB
llama_new_context_with_model: VRAM scratch buffer: 546.75 MB
llama_new_context_with_model: total VRAM used: 6515.44 MB (model: 3368.69 MB, context: 3146.75 MB)
ggml-cuda.cu:6787:51: runtime error: applying non-zero offset 1152 to null pointer

Not sure if that's anything to worry about. The ggml-alloc one is new, the other one isn't. (I was assuming it was just because GCC's address sanitizing stuff doesn't know about CUDA/ROCM.) edit: Might not be a problem since it seems like it's triggering just on doing pointer math that results in something out of bounds rather than actually accessing it.

KerfuffleV2 · 2023-10-11T06:35:38Z

This isn't really approved, right? Even without a full review, any changes I can/should start working on?

ggerganov · 2023-10-11T06:57:56Z

common/sampling.cpp

+        const std::vector<llama_token> & last_tokens,
+         std::vector<llama_token_data> & candidates,


Should we absorb last_tokens and candidates into llama_sampling_state?

KerfuffleV2 · 2023-10-11

Last tokens can be specific to the sequence, right? So this would kind of mean stuff using last tokens would have to be aware of sequences. I kind of feel like this also might limit how people can manipulate last tokens and if there are currently examples that do that kind of thing it might be difficult to adapt them (for me anyway, since I'm not really deeply familiar with most of them).

candidates I'm less sure about, it's basically just a scratch area for the logits in a form samplers can work with (right?) so I think moving it in there is less of a big deal. It's a pretty large structure though, I don't know if that's a consideration. Right now the current stuff in those structs is pretty lightweight.

I don't have very strong feelings about this. I'd like to say "These changes are already complicated enough, let's come back to that" but... I probably never would. :)

ggerganov · 2023-10-11T07:00:14Z

common/sampling.cpp

+    if (seq_state.grammar != NULL) {
+        llama_grammar_accept_token(ctx, seq_state.grammar, id);
+    }


Probably we should add llama_sampling_accept_token() and move this call in there, together with the update of the last_tokens member (if we decide it should become part of llama_sampling_state).

KerfuffleV2 · 2023-10-11

This might be a bit of a pain since it depends on a number of other static functions like decode_utf8, llama_grammar_accept, llama_token_to_str. It looks like the grammar stuff is the only thing that uses them, so maybe they could be moved too. I'm not sure what other parts of the grammar code depends on them though, so it might not be that simple.
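
To make the llama_sampling_accept_token() idea concrete, here is a minimal sketch of an accept step that feeds the grammar and maintains a bounded last_tokens history in one place. All types are stand-ins: grammar_sketch just records tokens instead of advancing real grammar stacks, and sampling_ctx_sketch, n_prev and sampling_accept_sketch are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using token_sketch = int; // stand-in for llama_token

// Placeholder grammar that only records what it was fed.
struct grammar_sketch {
    std::vector<token_sketch> accepted;
};

struct sampling_ctx_sketch {
    grammar_sketch *          grammar = nullptr; // optional
    std::vector<token_sketch> last_tokens;       // most recent tokens, oldest first
    std::size_t               n_prev = 64;       // history length to keep
};

// Called once per sampled token: updates the grammar (if any) and the
// token history, so callers no longer manage either by hand.
void sampling_accept_sketch(sampling_ctx_sketch & ctx, token_sketch id) {
    assert(ctx.n_prev > 0);
    if (ctx.grammar != nullptr) {
        ctx.grammar->accepted.push_back(id); // stands in for llama_grammar_accept_token
    }
    if (ctx.last_tokens.size() >= ctx.n_prev) {
        ctx.last_tokens.erase(ctx.last_tokens.begin()); // drop the oldest token
    }
    ctx.last_tokens.push_back(id);
}
```

A ring buffer would avoid the O(n) erase; the vector is used here only to keep the sketch short.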

ggerganov

This is a great change. I actually think that we should merge llama_sampling directly into llama.cpp. But let's do this after this PR is merged and tested for some time.

Not sure if that's anything to worry about.

These errors look benign, but we will look into ways to fix them anyway.

KerfuffleV2 · 2023-10-11T10:18:45Z

Current status: I took the suggestions to rename the functions/types but I didn't do stuff like moving last_tokens into the sampling context (yet). edit: Just to be clear, the "(yet)" doesn't mean I'm actually planning to unless someone insists on it.

ggerganov · 2023-10-11T17:27:08Z

common/sampling.cpp

+llama_token llama_sampling_sample(
+                  struct llama_context * ctx,
+                  struct llama_context * ctx_guidance,
+                  struct llama_sampling_context & sampling_ctx,


Suggested change

-                  struct llama_sampling_context & sampling_ctx,
+                  struct llama_sampling_context & ctx_sampling,

There are a few other places that need a similar change for consistency's sake.

KerfuffleV2 · 2023-10-11

By "a few" you mean 20 or so? :) Hopefully I caught them all. Everything seems to compile/work still.

ggerganov · 2023-10-11T19:36:52Z

Thanks for this - I accidentally merged this too quickly with the old title. Should have updated to the more relevant change of introducing the llama_sampling_context.

ggerganov · 2023-10-12T17:21:19Z

common/sampling.h

+
+    // map of sequence ids to sampler contexts
+    std::unordered_map<llama_seq_id, llama_sampler_sequence_context> sequence_contexts;
+


@KerfuffleV2

Any reason not to have single-sequence data in llama_sampling_context?
When we want to sample multiple sequences, we will create one llama_sampling_context for each.

This way, each sequence can also have a separate llama_grammar instance which seems to make sense.

I.e. have it like this:

// general sampler context
typedef struct llama_sampling_context {
    ~llama_sampling_context();

    // parameters that will be used for sampling
    llama_sampling_params params;

    float mirostat_mu; // mirostat sampler state

    llama_grammar * grammar;
} llama_sampling_context;
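
A sketch of how the one-context-per-sequence layout might be used by a caller that samples several sequences. The types are stand-ins for the real ones, make_contexts_sketch is a hypothetical helper, and the 2 * tau starting value for mu is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct grammar_sketch; // opaque placeholder for llama_grammar

struct params_sketch {
    float mirostat_tau = 5.0f; // illustrative default
};

// Single-sequence sampling context, mirroring the layout proposed above.
struct ctx_sampling_sketch {
    params_sketch    params;
    float            mirostat_mu; // mirostat sampler state
    grammar_sketch * grammar;     // each sequence may get its own instance
};

// Create one independent context per sequence; the caller indexes the
// vector by sequence id instead of the context holding a map internally.
std::vector<ctx_sampling_sketch> make_contexts_sketch(const params_sketch & params,
                                                      std::size_t n_seq) {
    assert(n_seq > 0);
    std::vector<ctx_sampling_sketch> ctxs;
    ctxs.reserve(n_seq);
    for (std::size_t i = 0; i < n_seq; ++i) {
        ctxs.push_back({ params, 2.0f * params.mirostat_tau, nullptr });
    }
    return ctxs;
}
```

Because each context owns its own mu and grammar pointer, updating one sequence cannot disturb another, which is the property the original Mirostat bug lacked.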

KerfuffleV2 added the bug label (Something isn't working) Oct 8, 2023

KerfuffleV2 mentioned this pull request Oct 8, 2023

[Bug] Mirostat samplers don't work properly with parallel generation #3537

Closed

ggerganov approved these changes Oct 8, 2023

KerfuffleV2 force-pushed the fix-parseq-mirostat branch from b598b76 to e46edaf Compare October 8, 2023 15:43

ggerganov self-requested a review October 8, 2023 17:29

ggerganov added the need feedback label (Testing and feedback with results are needed) Oct 8, 2023

KerfuffleV2 added 3 commits October 8, 2023 11:31

Fix mirostat state when using multiple sequences

0e6db6f

Fix mirostat by completely refactoring sampling!

fad923a

Try to fix zig build.

52def09

KerfuffleV2 force-pushed the fix-parseq-mirostat branch from e46edaf to 52def09 Compare October 8, 2023 17:35

Export function to fetch/create default sampler states

01bef02

Code formatting cleanups and add some comments Silence a warning about id not being used when logging is disabled

KerfuffleV2 mentioned this pull request Oct 10, 2023

Context sensitive help #3556

Closed

ggerganov reviewed Oct 11, 2023

ggerganov approved these changes Oct 11, 2023

ggerganov reviewed Oct 11, 2023

ggerganov reviewed Oct 11, 2023

Apply some renaming suggestions.

4a34e63

Fix comments that were out of sync with the pull.

KerfuffleV2 requested a review from ggerganov October 11, 2023 10:18

ggerganov reviewed Oct 11, 2023

Use more consistant naming convention for sampling contexts

fffa4c0

ggerganov approved these changes Oct 11, 2023

ggerganov merged commit 70c29da into ggerganov:master Oct 11, 2023

ggerganov mentioned this pull request Oct 12, 2023

server : parallel decoding and multimodal #3589

Closed

9 tasks

ggerganov reviewed Oct 12, 2023

ggerganov mentioned this pull request Oct 12, 2023

sampling : one sequence per sampling context #3601

Closed

KerfuffleV2 deleted the fix-parseq-mirostat branch November 17, 2023 03:11
