Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: GBNF repetition rewrite results in unsupported left recursion #7572

Closed
hoehrmann opened this issue May 27, 2024 · 8 comments
Closed

Bug: GBNF repetition rewrite results in unsupported left recursion #7572

hoehrmann opened this issue May 27, 2024 · 8 comments
Labels
bug-unconfirmed low severity Used to report low severity bugs in llama.cpp (e.g. cosmetic issues, non critical UI glitches) stale

Comments

@hoehrmann
Copy link

What happened?

Replacing the "FIXME" strings in https://gist.github.com/hoehrmann/f234c1156ee5ef7b24cb589c14aaefda?permalink_comment_id=5070379#gistcomment-5070379 by the empty string makes llama.cpp see left recursion that is not in the original grammar, presumably due to how it rewrites repetition into recursion. The basic construct is

a ::= "a" ( "" | ... )*

In case of the grammar above, the empty alternative does not add anything to the grammar since it is under a zero-or-more repetition anyway. It would be nice if the grammar rewriter would lift the empty string out of the alternatives and adjust the quantifier if necessary before rewriting this into recursion.

Name and Version

version: 2891 (9a17ab9)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

...
terminate called after throwing an instance of 'std::runtime_error'
  what():  unsupported grammar, left recursion detected for nonterminal at index 30
Aborted (core dumped)
@hoehrmann hoehrmann added bug-unconfirmed low severity Used to report low severity bugs in llama.cpp (e.g. cosmetic issues, non critical UI glitches) labels May 27, 2024
@HanClinto
Copy link
Collaborator

Replacing the "FIXME" strings in https://gist.github.com/hoehrmann/f234c1156ee5ef7b24cb589c14aaefda?permalink_comment_id=5070379#gistcomment-5070379 by the empty string makes llama.cpp see left recursion that is not in the original grammar, presumably due to how it rewrites repetition into recursion.

That is a deliciously complex grammar in your example, btw. I love this as a stress-test, thank you!

The basic construct is

a ::= "a" ( "" | ... )*

In case of the grammar above, the empty alternative does not add anything to the grammar since it is under a zero-or-more repetition anyway.

I'm left wondering what in the world might be meant by ("")* as an option. How long is an infinite number of nothings? It feels like a divide-by-zero or other undefined behavior.

There are many ways to think about the dangers / problems of left-recursion, but one way I like to think about it is that left-recursion occurs when the grammar engine "ticks" forward without making any real progress. So long as there is a path that exists for this, then the grammar engine can process infinitely without making any headway, and thus the engine "hangs".

Before #7083, that's exactly what would happen -- the grammar engine would simply hang until it segfaulted from a lack of memory. So the fact that it alerts you to the problem of left-recursion is a nice improvement from where we were a few weeks ago -- that's encouraging! :)

It would be nice if the grammar rewriter would lift the empty string out of the alternatives and adjust the quantifier if necessary before rewriting this into recursion.

Perhaps, yes. It sounds like what you're asking for is that if there is a grammar segment that is zero-length, then it be auto-removed from the tree. I need to think through the implications of this, and see if there are other ways that this can get us into trouble. Are there ways that this can come back to bite us? I need to think through this...

In the meantime, instead of replacing ("FIXME" | ...) with ("" | ...), what about replacing it with (...)? Would that satisfy your needs, @hoehrmann ?

@hoehrmann
Copy link
Author

@HanClinto Further down in the gist https://gist.github.com/hoehrmann/f234c1156ee5ef7b24cb589c14aaefda?permalink_comment_id=5070397#gistcomment-5070397 is a variant where I removed the redundant empty string alternative. llamap.cpp then goes into SIGSEGV after a couple of lines (or pretty much immediately if you change the root to the <middle> section of the grammar) asking a model to write an Internet-Draft in xml2rfc format.

@HanClinto
Copy link
Collaborator

@HanClinto Further down in the gist https://gist.github.com/hoehrmann/f234c1156ee5ef7b24cb589c14aaefda?permalink_comment_id=5070397#gistcomment-5070397 is a variant where I removed the redundant empty string alternative. llamap.cpp then goes into SIGSEGV after a couple of lines (or pretty much immediately if you change the root to the <middle> section of the grammar) asking a model to write an Internet-Draft in xml2rfc format.

Confirmed, thank you! I ran the following command:

./main -mu https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/resolve/main/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf \
    --grammar-file ./grammars/issue7572.gbnf \
    -p "Please generate an Internet-Draft in xml2rfc format." \
    --seed 12345

And it crashed after:

 Please generate an Internet-Draft in xml2rfc format.<rfc>
<front><title>Test</title><author><organization>IETF</organization><address><email>jdoe@example.com</email></address></author><date>
</date>
</front><middle>
<section>
<./test_7572.sh: line 4: 72743 Segmentation fault: 11  ./main -mu https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/resolve/main/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf --grammar-file ./grammars/issue7572.gbnf -p "Please generate an Internet-Draft in xml2rfc format." --seed 12345

I will look into that issue -- not sure if we want to open that as a new issue or not.

In the meantime, do you have any other thoughts about grammars with ("")* in them triggering the left-recursion safety? I'm still not sure of the best course of action forward on that one.

@hoehrmann
Copy link
Author

(This issue is about turning repetition into left recursion and then complaining that the generated left recursion is not supported. That llama.cpp runs out of stack space for the other example is a separate issue, I think.)

(This is probably not what you are looking for, but since you asked:)

Well, I can add this from a document I am working on. This turns a CFG into a graph with nodes that are labeled with one of empty string, a terminal (or a character class), or a stack label (acting as a stack symbol, moving over final A is only allowed when start A is on the top of a stack), and moving over start A means putting that on the stack, and moving over a grammar graph node labeled with a terminal requires that terminal to be present in the input. You can then traverse a string alongside traversing the grammar graph (where you split the current parsing state whenever there are multiple outgoing edges). That probably amounts to a rewrite of the grammar feature, but would make ("")* not a special case, would not suffer from left recursion limitations (if you model the stack as a graph, a left recursive start A would be a loop on the stack graph) and would not suffer from stack space limits (no recursion). If I recall correctly, https://github.com/hoehrmann/demo-parselov?tab=readme-ov-file#higher-level-parser has some code that uses a graph for the stack in this sense.

Grammar graph construction

Rule

A = B ; named rule A is defined as B
graph LR
  start --> startRule;
  startRule["`start id(A)`"] --> B;
  B --> finalRule["`final id(A)`"];
  finalRule --> final;
Loading

Empty string

"" ; zero terminals
graph LR
  start --> e["` `"];
  e --> final;
Loading

Terminal

"a" ; single character "a" (or "A")
graph LR
  start --> term;
  term["`terminal a`"];
  term --> final;
Loading

Non-terminal reference

A ; reference to named rule A
graph TB
  start --> startRef;
  startRef["`start ref id(A)`"] --> startA["`start A`"];
  startA --> elision[ ... ];
  elision --> finalA["`final A`"];
  finalA --> finalRef["`final ref id(A)`"];
  finalRef --> final;
Loading

References to a non-terminal named rule connect to the start and final nodes for the corresponding rule definition. This adds edges to the nodes of the rule definition.

The start ref and final ref nodes store an identifier for the named rule A that is unique to the particular instance of the non-terminal.

Strings

"example" ; e followed by x followed by ...

Strings are sequences of terminals.

Sequence

A B ; A followed by B
graph LR
  start --> A;
  A --> B;
  B --> final;
Loading

One-or-more

1*<A> ; one or more A
graph LR
  start --> enter[ ];
  enter --> A;
  A --> exit[ ];
  exit --> enter;
  exit --> final;
Loading

Zero-or-more is a choice between the empty string and one or more of a pattern.

Other repetition patterns can be spelled out explicitly. For instance, optional patterns are a choice between the pattern and the empty string, and if a pattern needs to appear four times, the pattern can be repeated four times in the grammar.

Unordered Choice

A / B ; A or B or both
graph TD
  start --> enter[ ];
  enter --> A[A];
  enter --> B[B];
  A --> exit[ ];
  B --> exit[ ];
  exit --> final;
Loading

@hoehrmann
Copy link
Author

(The grammar graph does not use recursion to represent infinite repetition; for actual recursion, a pragmatic approach would be to let users specify a depth limit for the grammar. Replacing a recursive rule p by p10 ::= p9, p9 ::= p8, ... and replacing all non-recursive references by copies of their respective right hand sides, the grammar would become a regular expression and can be turned into a DFA which can trivially be used as a logit processor.)

@HanClinto
Copy link
Collaborator

Thank you for the writeup and response!

(This issue is about turning repetition into left recursion and then complaining that the generated left recursion is not supported. That llama.cpp runs out of stack space for the other example is a separate issue, I think.)

I think you're right -- this particular issue is about the way that we complain about not supporting left-recursion rather than handling it more gracefully. My vote would be that you open the running-out-of-stack-space issue as a separate one, and repost your grammar / reproduction steps in there. I suspect it may eventually get merged in with one of the other grammar improvement issues that's already ongoing, but it would be nice to have it broken out regardless.

(This is probably not what you are looking for, but since you asked:)

It's a good read, thank you!

Well, I can add this from a document I am working on. This turns a CFG into a graph with nodes that are labeled with one of empty string, a terminal (or a character class), or a stack label (acting as a stack symbol, moving over final A is only allowed when start A is on the top of a stack), and moving over start A means putting that on the stack, and moving over a grammar graph node labeled with a terminal requires that terminal to be present in the input.

This is really cool -- nice job on all of the Mermaid graphs! For a while I was starting down the road of trying to add Mermaid graph generation to the gbnf-validator program (in order to better help visualize / debug grammars), and you have inspired me to take another swing at that. :)

That probably amounts to a rewrite of the grammar feature, but would make ("")* not a special case, would not suffer from left recursion limitations (if you model the stack as a graph, a left recursive start A would be a loop on the stack graph) and would not suffer from stack space limits (no recursion).

Yes, I suspect that you're correct. I've looked at several alternative methods for handling left-recursion, but they often require such a fundamental shift in the way that one "hangs" the problem and approach to grammar. There are a number of particulars related to such piecemeal token generation that makes it very expensive to scan forward-and-backward in the text stream. The way that we recursively build our grammar trees and tick them all forward one small set of characters at a time works so nicely for the way that LLMs generate text. Because our grammar engine maintains separate trees for every possible pathway simultaneously, we don't require much lookahead (at most, a token's worth of characters into the future), and we never have to resample because we went down a bad pathway (the speculative example application notwithstanding). Overall, the particular needs of grammar as applied to an LLM is a special case that invalidate so many other advanced approaches.

@HanClinto
Copy link
Collaborator

HanClinto commented Jun 6, 2024

If I recall correctly, https://github.com/hoehrmann/demo-parselov?tab=readme-ov-file#higher-level-parser has some code that uses a graph for the stack in this sense.

BTW, I really love the grammar tree application visualization that's in this documentation. Thank you for linking it!

Using the RFC 3986 data file and the string example://0.0.0.0:23#x gives:

["URI", [
  ["scheme", [
    ["ALPHA", [], 0, 1],
    ["ALPHA", [], 1, 2],
    ["ALPHA", [], 2, 3],
    ["ALPHA", [], 3, 4],
    ["ALPHA", [], 4, 5],
    ["ALPHA", [], 5, 6],
    ["ALPHA", [], 6, 7]], 0, 7],
  ["hier-part", [
    ["authority", [
      ["host", [
        ["IPv4address", [
          ["dec-octet", [
            ["DIGIT", [], 10, 11]], 10, 11],
          ["dec-octet", [
            ["DIGIT", [], 12, 13]], 12, 13],
          ["dec-octet", [
            ["DIGIT", [], 14, 15]], 14, 15],
          ["dec-octet", [
            ["DIGIT", [], 16, 17]], 16, 17]], 10, 17]], 10, 17],
      ["port", [
        ["DIGIT", [], 18, 19],
        ["DIGIT", [], 19, 20]], 18, 20]], 10, 20],
    ["path-abempty", [], 20, 20]], 8, 20],
  ["fragment", [
    ["pchar", [
      ["unreserved", [
        ["ALPHA", [], 21, 22]], 21, 22]], 21, 22]], 21, 22]], 0, 22]

Yet another thing that I would love to emulate in our gbnf-validator program.

Copy link
Contributor

This issue was closed because it has been inactive for 14 days since being marked as stale.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug-unconfirmed low severity Used to report low severity bugs in llama.cpp (e.g. cosmetic issues, non critical UI glitches) stale
Projects
None yet
Development

No branches or pull requests

2 participants