Remove broken final state loop #874

br3no · 2024-05-07T15:54:28Z

Fixes #856

The code this PR removes introduces an artificial and erroneous loop transition in every final state that is always traversed, regardless of the generation.

The comment on the removed code doesn't make sense in my opinion, since the if statement above it already handles exactly this case.

Removing this piece of code fixes the bug that surfaced in the upgrade of outlines in the vLLM integration.
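To make the failure mode concrete, here is a minimal sketch, not the actual Outlines code. It assumes a toy token-level transition table for a pattern like (12){1,2}; the token ids, the table, and both step functions are hypothetical and only mimic the behaviour described above.

```python
# Toy illustration only: names and token ids are hypothetical, not Outlines' API.
EOS = 0
FINAL_STATES = {2, 4}
# state -> {token_id: next_state} for a pattern like (12){1,2}
TRANSITIONS = {
    0: {1: 1},  # "1"
    1: {2: 2},  # "2", reaches a final state
    2: {1: 3},  # optional second repetition starts here
    3: {2: 4},  # "2", reaches the last final state
}


def next_state_with_loop(state, token_id):
    """Mimics a rule that makes every final state loop on any token."""
    if token_id == EOS:
        return -1
    if state in FINAL_STATES:
        return state  # artificial self-loop: the real transition is ignored
    return TRANSITIONS.get(state, {}).get(token_id, -1)


def next_state_plain(state, token_id):
    """Follows the transition table; unknown states or tokens end the walk."""
    if token_id == EOS or state not in TRANSITIONS:
        return -1
    return TRANSITIONS[state].get(token_id, -1)


def walk(step, tokens, start=0):
    """Apply a step function to a token sequence and return the last state."""
    state = start
    for tok in tokens:
        state = step(state, tok)
    return state


tokens = [1, 2, 1, 2]  # "1212"
looped = walk(next_state_with_loop, tokens)
plain = walk(next_state_plain, tokens)
print(looped, plain)
```

With the loop, the walk gets stuck in state 2 after the first "12" and never follows the real transition into state 3, so the tracked state no longer reflects what was generated. Without it, the same tokens end in state 4 as the table prescribes.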

br3no · 2024-05-07T16:04:39Z

A related matter: https://github.com/outlines-dev/outlines/blob/4f8433d8d6633b0780c3a6c27981f9adffbe49f5/outlines/generate/generator.py#L94

This code is fundamentally broken, in my opinion, because it always stops generation when a final state is reached, regardless of outgoing transitions it may have. Instead, the condition for stopping should be that a stop-token has been generated. Right?
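A rough sketch of the difference between the two stopping rules, under stated assumptions: the transition table, token ids, and the generate helper are all hypothetical, and a scripted token list stands in for model sampling. It is not the code in generator.py.

```python
# Hypothetical toy setup; a scripted token list replaces real sampling.
EOS = 0
FINAL_STATES = {2, 4}
# Final states 2 and 4 allow EOS; state 2 also allows another repetition.
TRANSITIONS = {
    0: {1: 1},
    1: {2: 2},
    2: {1: 3, EOS: -1},
    3: {2: 4},
    4: {EOS: -1},
}


def generate(scripted_tokens, stop_on_final):
    """Consume scripted tokens and return the emitted (non-EOS) tokens."""
    state, emitted = 0, []
    for tok in scripted_tokens:
        if tok not in TRANSITIONS[state]:
            raise ValueError(f"token {tok} not allowed in state {state}")
        if tok == EOS:
            break  # stop-token rule: the model chose to end here
        emitted.append(tok)
        state = TRANSITIONS[state][tok]
        if stop_on_final and state in FINAL_STATES:
            break  # final-state rule: cut off at the first final state
    return emitted


script = [1, 2, 1, 2, EOS]  # the "model" wants to produce "1212"
cut_short = generate(script, stop_on_final=True)
full = generate(script, stop_on_final=False)
print(cut_short, full)
```

Stopping at the first final state truncates the output to "12", even though the scripted model wanted the longer valid match. Stopping only on the stop token lets the outgoing transition from the final state be used.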

rlouf · 2024-05-07T16:05:13Z

Is there a minimal reproducing example we could add as a test?

br3no · 2024-05-07T16:06:59Z

The issue does not show in the transformers integration because of the line I posted in the last comment. So the most minimal example at the moment would be the code provided in #856.

br3no · 2024-05-07T16:24:04Z

There are some test errors. I believe one case still needs to be handled: the state not existing in the transitions table. I’ll invest some time later today.

@rlouf I didn’t run the tests, because I didn’t know how.

br3no · 2024-05-07T17:41:27Z

I believe this should do it now.

br3no · 2024-05-08T07:35:09Z

Okay, obviously not.

I will invest some time into this today and hopefully come to a solution.

br3no · 2024-05-08T10:11:43Z

I've looked into the breaking tests:

____________________________ test_regex_final_state ____________________________

    def test_regex_final_state():
        """Make sure that the FSM stays in the final state as we keep generating"""
    
        class MockTokenizer:
            vocabulary = {"`": 101, ".": 102, "\n": 103, "eos": 104}
            special_tokens = {"eos"}
            eos_token_id = 104
    
            def convert_token_to_string(self, token):
                return token
    
        regex_str = r"`\n(\.\n)?`\n"
        tokenizer = MockTokenizer()
    
        with pytest.warns(UserWarning):
            fsm = RegexFSM(regex_str, tokenizer)
    
        state = fsm.next_state(state=4, token_id=103)
        assert state == 5
        assert fsm.is_final_state(state)
    
        state = fsm.next_state(state=5, token_id=103)
>       assert state == 5
E       assert -1 == 5

tests/fsm/test_fsm.py:85: AssertionError
____________________________ test_regex_final_state ____________________________

    def test_regex_final_state():
        """Make sure that the FSM stays in the final state as we keep generating"""
    
        class MockTokenizer:
            vocabulary = {"`": 101, ".": 102, "\n": 103, "eos": 104}
            special_tokens = {"eos"}
            eos_token_id = 104
    
            def convert_token_to_string(self, token):
                return token
    
        regex_str = r"`\n(\.\n)?`\n"
        tokenizer = MockTokenizer()
        fsm = RegexGuide(regex_str, tokenizer)
    
        state = fsm.get_next_state(state=4, token_id=103)
        assert state == 5
        assert fsm.is_final_state(state)
    
        state = fsm.get_next_state(state=5, token_id=103)
>       assert state == 5
E       assert -1 == 5

tests/fsm/test_guide.py:183: AssertionError

I believe these tests are testing an assumption that is fundamentally wrong. Final states can have outbound transitions, including into non-terminal states.
I have attached an svg file with a rendering of the state machine described by the state to token map for the following very simple regular expression:

"(12){1,3}"

I'm wondering if the right thing to do would be to remove these tests, or what we would want to test instead.
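For readers without the SVG, the same point can be checked with a small hand-built automaton. This is only a character-level sketch with arbitrary state numbering; the real index is token-level and depends on the tokenizer, so its states will differ. The standard library re module is used purely as a cross-check.

```python
import re

PATTERN = r"(12){1,3}"

# Hand-built character-level DFA for the pattern; state numbers are arbitrary.
DFA = {
    0: {"1": 1},
    1: {"2": 2},
    2: {"1": 3},
    3: {"2": 4},
    4: {"1": 5},
    5: {"2": 6},
}
FINALS = {2, 4, 6}


def run(text):
    """Return the state reached after reading text, or None if it is rejected."""
    state = 0
    for ch in text:
        state = DFA.get(state, {}).get(ch)
        if state is None:
            return None
    return state


def accepts(text):
    return run(text) in FINALS


# Final states that still have somewhere to go.
finals_with_outbound = {s for s in FINALS if DFA.get(s)}
print(finals_with_outbound)

# Cross-check the hand-built DFA against the regex engine.
for candidate in ["", "1", "12", "121", "1212", "121212", "12121212"]:
    assert accepts(candidate) == bool(re.fullmatch(PATTERN, candidate))
```

States 2 and 4 are final and still have an outbound transition on "1", which leads into the non-final states 3 and 5. Only state 6 is a final state with nothing left to consume.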

br3no · 2024-05-10T09:34:26Z

I have changed the tests to verify that if we are in a final state with no outbound transitions, a new generation will lead to us staying in a final state.
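The shape of such a test might look like the sketch below. ToyGuide is a hypothetical stand-in, not the Outlines class, and it assumes that -1 is treated as a final sentinel state, in line with the -1 seen in the failing assertions above.

```python
class ToyGuide:
    """Hypothetical stand-in for a regex guide; not the Outlines implementation."""

    def __init__(self, transitions, final_states, eos_token_id):
        self.transitions = transitions
        self.final_states = set(final_states)
        self.eos_token_id = eos_token_id

    def get_next_state(self, state, token_id):
        # EOS, or a state with no entry in the table, ends in the sentinel -1.
        if token_id == self.eos_token_id or state not in self.transitions:
            return -1
        return self.transitions[state].get(token_id, -1)

    def is_final_state(self, state):
        # Assumption: the sentinel -1 counts as final.
        return state == -1 or state in self.final_states


def test_final_state_without_outbound_transitions():
    # State 5 is final and has no outbound transitions of its own.
    guide = ToyGuide({4: {103: 5}}, final_states={5}, eos_token_id=104)

    state = guide.get_next_state(state=4, token_id=103)
    assert state == 5
    assert guide.is_final_state(state)

    # Generating again must keep us in *a* final state, not necessarily state 5.
    state = guide.get_next_state(state=5, token_id=103)
    assert guide.is_final_state(state)


test_final_state_without_outbound_transitions()
```

The key change from the old assertion is that the second step checks is_final_state(state) instead of state == 5, so it no longer relies on a self-loop.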

lapp0

Basically the same as #884

Should we leave test_fsm.py as is? Otherwise looks good.

br3no · 2024-05-10T11:39:28Z

> Basically the same as #884
>
> Should we leave test_fsm.py as is? Otherwise looks good.

Do you mean reverting the changes in test_fsm.py? If so, this will break the build. In essence both test_fsm.py and test_guide.py are testing the same thing, since they share the underlying implementation.

lapp0 · 2024-05-10T11:53:00Z

Thanks, looks good to me!

ekagra-ranjan · 2024-05-13T22:21:49Z

Damn! I was facing this issue on Fri and spent a couple of days to finally figure out the solution only to find that this PR existed :)

ekagra-ranjan · 2024-05-13T22:40:25Z

@br3no could you please share how you generated the FSM plot in this comment? #874 (comment)

br3no · 2024-05-14T07:33:12Z

@ekagra-ranjan sure. I used graphviz for that. Here's an example:

import outlines
from transformers import AutoTokenizer
from graphviz import Digraph

def draw_state_machine(graph: dict, final_states: set, tokenizer):
    dot = Digraph()

    # Add nodes
    for state in graph:
        if state in final_states:
            dot.node(str(state), str(state), color='salmon', style='filled', fillcolor='salmon')
        else:
            dot.node(str(state), str(state), color='lightblue', style='filled', fillcolor='lightblue')

    # Prepare edge labels by aggregating transitions between the same nodes
    edge_labels = {}
    for state, transitions in graph.items():
        for transition, end_state in transitions.items():
            if end_state not in graph:
                # Add end states not in the state map
                dot.node(str(end_state), str(end_state), color='salmon', style='filled', fillcolor='salmon')
            label = tokenizer.decode(int(transition))
            edge_key = (str(state), str(end_state))
            if edge_key not in edge_labels:
                edge_labels[edge_key] = label
            else:
                # Append new label to existing label, separated by a comma
                edge_labels[edge_key] += ", " + label

    # Add edges with aggregated labels
    for (start_state, end_state), label in edge_labels.items():
        dot.edge(start_state, end_state, label=label)

    # Render and view the graph
    dot.render('state_machine', view=True, format='svg')

tokenizer_zephyr = AutoTokenizer.from_pretrained("stabilityai/stablelm-2-zephyr-1_6b")

regex = r"(12){1,2}"

model = outlines.models.transformers("stabilityai/stablelm-2-zephyr-1_6b")

generator_zephyr = outlines.generate.regex(
    model,
    regex,
)

draw_state_machine(generator_zephyr.fsm.states_to_token_maps, generator_zephyr.fsm.final_states, tokenizer_zephyr)

ekagra-ranjan · 2024-05-18T17:06:30Z

outlines/fsm/guide.py

@@ -193,12 +193,8 @@ def get_next_state(self, state: int, token_id: int) -> int:
        The new state of the guide.

        """
-        if token_id == self.eos_token_id:
+        if token_id == self.eos_token_id or state not in self.states_to_token_maps:


@br3no I was wondering whether we really need the second condition, state not in self.states_to_token_maps. The condition basically checks for states that have no outgoing edges. But such states would be among the final states of the FSM, and this block of code adds EOS as an edge to such states, which gives them at least one outgoing edge. Therefore, no state in the FSM would be absent from states_to_token_maps. Wdyt?

br3no · 2024-05-18

@ekagra-ranjan yes, we do need this second condition.

The block of code you linked to does not add an EOS outbound transition to these states. It only adds transitions to final states which are present in states_to_token_maps. But these states are not present there. states_to_token_subsets.get(state) will return None for these states.

I'm not really knowledgeable about the way Outlines (and interegular) build the state machines out of regexes. The fact of the matter is that states_to_token_maps does not contain all states that are reachable. I noticed this while debugging the code for some example regexes.

This is not a problem in principle, as these states are considered to be final and states_to_token_subsets.get(state) is None is used all over the code to handle this special case (as in the block you linked to).

I actually believe this could be improved and Outlines would profit from removing this special case that needs to be thought of all over the place and could lead to bugs. But this is, as I said, not a problem in principle.
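A small sketch of why the guard matters, using a made-up map rather than one produced by Outlines: state 5 is reachable from state 4 but has no entry of its own. The function names are hypothetical and the lookup logic is only an approximation of the diff above.

```python
EOS = 104

# Made-up map: state 5 is a transition target but has no entry of its own.
states_to_token_maps = {4: {103: 5}}

# States that are reachable as targets yet missing as keys.
targets = {t for row in states_to_token_maps.values() for t in row.values()}
missing = targets - states_to_token_maps.keys()


def get_next_state_unguarded(state, token_id):
    """Only checks for EOS; indexing a missing state raises KeyError."""
    if token_id == EOS:
        return -1
    return states_to_token_maps[state].get(token_id, -1)


def get_next_state_guarded(state, token_id):
    """Also treats a state that is missing from the map as terminal."""
    if token_id == EOS or state not in states_to_token_maps:
        return -1
    return states_to_token_maps[state].get(token_id, -1)


try:
    get_next_state_unguarded(5, 103)
    unguarded_result = "no error"
except KeyError:
    unguarded_result = "KeyError"

guarded_result = get_next_state_guarded(5, 103)
print(missing, unguarded_result, guarded_result)
```

Computing the missing set like this is also a quick way to check, for a given regex, whether the index really omits reachable states.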

br3no mentioned this pull request May 10, 2024

Endless generation bug popped up during migration to Guide in vLLM integration #856

Closed

lapp0 mentioned this pull request May 10, 2024

Stop using is_final_state and final_states #885

Open

lapp0 reviewed May 10, 2024

br3no and others added 3 commits May 10, 2024 07:49

Remove broken final state loop

995058c

Covering the case when the state is missing in the map

8e04ec8

Fix tests

21247f8

brandonwillard force-pushed the endless_generation_vllm branch from 32cc20d to 21247f8 Compare May 10, 2024 12:49

brandonwillard added bug structured generation Linked to structured generation labels May 10, 2024

This was referenced May 10, 2024

[CI/Build] Unpin outlines vllm-project/vllm#4558

Closed

[Frontend][Core] Update Outlines Integration from FSM to Guide vllm-project/vllm#4109

Merged

rlouf merged commit 78852b0 into dottxt-ai:main May 11, 2024
5 checks passed

ekagra-ranjan reviewed May 18, 2024
