Ported back new grammar changes from C++ to Python implementation #1637
Conversation
Not working yet. For example, from_string produces this grammar:
root ::= root_1 root_5
root_1 ::= [E] [Y] [E] [Y] [A] [H] [A]
root_2 ::= root_1 |
print_grammar: error printing grammar: unexpected end of rule: 2,2

I've been looking again and again and I don't see what I missed from the PR diff 😕
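For context, a minimal way to trigger this code path (the grammar text here is an arbitrary example, not the one from the report):

from llama_cpp import LlamaGrammar

# With verbose=True (the default), from_string prints the parsed grammar,
# which is where the print_grammar error above shows up.
grammar_text = 'root ::= "yes" | "no"'
grammar = LlamaGrammar.from_string(grammar_text, verbose=True)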
Help is welcome @abetlen. Right now it prints:
root ::= [A] root_5
root_1 ::= [A] root_4 |
root_2 ::= [A] root_4 |
root_3 ::= [A] root_4 |
root_4 ::= [A] root_4 |
root_5 ::= [A] root_4 |
Hey @ExtReMLapin thanks for starting on this fix, just getting back to everything now after vacation. I'll take a stab at this over the next couple days as well.
@ExtReMLapin got the new grammar features back-ported and ended up rewriting most of it.
Thank you abetlen. As a proud lazy man, I asked GPT-4 to optimize the parsing functions. The issue is that it seems not all UTF-8 characters are supported:

import timeit
import typing
# Original Functions
def original_parse_hex(src: str, size: int) -> typing.Tuple[int, str]:
    pos = 0
    value = 0
    for _ in range(size):
        value <<= 4
        c = src[pos]
        if "a" <= c <= "f":
            value += ord(c) - ord("a") + 10
        elif "A" <= c <= "F":
            value += ord(c) - ord("A") + 10
        elif "0" <= c <= "9":
            value += ord(c) - ord("0")
        else:
            break
        pos += 1
    if pos != size:
        raise ValueError(f"expecting {size} hex chars at {src}")
    return value, src[pos:]
def original_decode_utf8(src: str) -> typing.Tuple[int, str]:
    lookup: list[int] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 4]
    first_byte: int = ord(src[0])
    highbits: int = first_byte >> 4
    # debug output of the high bits of the first code point
    print(highbits)
    length: int = lookup[highbits]
    mask: int = (1 << (8 - length)) - 1
    value: int = first_byte & mask
    end: int = min(len(src), length)
    pos: int = 1
    for pos in range(1, end):
        if not src[pos]:
            break
        value = (value << 6) + (ord(src[pos]) & 0x3F)
    return value, src[pos:] if pos < len(src) else ""
def original_parse_char(src: str) -> typing.Tuple[int, str]:
    if src[0] == "\\":
        if src[1] == "x":
            return original_parse_hex(src[2:], 2)
        elif src[1] == "u":
            return original_parse_hex(src[2:], 4)
        elif src[1] == "U":
            return original_parse_hex(src[2:], 8)
        elif src[1] == "t":
            return ord("\t"), src[2:]
        elif src[1] == "r":
            return ord("\r"), src[2:]
        elif src[1] == "n":
            return ord("\n"), src[2:]
        elif src[1] in ('\\', '"', '[', ']'):
            return ord(src[1]), src[2:]
        else:
            raise ValueError(f"unknown escape at {src}")
    elif src:
        return original_decode_utf8(src)
    raise ValueError("unexpected end of input")
# Lookup table mapping hex digit characters to their integer values
hex_map = {
    **{f"{x}": x for x in range(10)},
    **{chr(x): x - ord('a') + 10 for x in range(ord('a'), ord('f') + 1)},
    **{chr(x): x - ord('A') + 10 for x in range(ord('A'), ord('F') + 1)},
}
# Optimized Functions
def optimized_parse_hex(src: str, size: int) -> typing.Tuple[int, str]:
    value = 0
    for i in range(size):
        c = src[i]
        if c in hex_map:
            value = (value << 4) + hex_map[c]
        else:
            raise ValueError(f"expecting {size} hex chars at {src}")
    return value, src[size:]
prealloc = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 4]
def optimized_decode_utf8(src: str) -> typing.Tuple[int, str]:
    first_byte = ord(src[0])
    highbits = first_byte >> 4
    length = prealloc[highbits]
    value = first_byte & ((1 << (8 - length)) - 1)
    for i in range(1, length):
        value = (value << 6) + (ord(src[i]) & 0x3F)
    return value, src[length:]
escape_sequences = {
    "x": 2, "u": 4, "U": 8,
    "t": ord("\t"), "r": ord("\r"), "n": ord("\n"),
    "\\": ord("\\"), '"': ord('"'), '[': ord('['), ']': ord(']')
}
def optimized_parse_char(src: str) -> typing.Tuple[int, str]:
    if src[0] == "\\":
        esc = src[1]
        if esc in escape_sequences:
            if esc in 'xuU':
                return optimized_parse_hex(src[2:], escape_sequences[esc])
            return escape_sequences[esc], src[2:]
        raise ValueError(f"unknown escape at {src}")
    elif src:
        return optimized_decode_utf8(src)
    raise ValueError("unexpected end of input")
import random
def generate_utf8_string(length: int) -> str:
    utf8_chars = [
        chr(random.randint(0x20, 0x7E)),        # ASCII characters
        chr(random.randint(0x80, 0x07FF)),      # Extended Latin and similar
        chr(random.randint(0x0800, 0xFFFF)),    # Multilingual Plane
        chr(random.randint(0x10000, 0x10FFFF))  # Supplementary Planes (Emoji, etc.)
    ]
    return ''.join(random.choice(utf8_chars) for _ in range(length))
def benchmark():
    # Generate a random UTF-8 string of 500 characters
    test_string = generate_utf8_string(500)
    print('Random string : ', test_string)
    # Ensure both functions return the same result
    original_result = original_parse_char(test_string)
    optimized_result = optimized_parse_char(test_string)
    assert original_result == optimized_result, "The results of original and optimized functions do not match!"
    original_time = timeit.timeit(lambda: original_parse_char(test_string), number=100000)
    optimized_time = timeit.timeit(lambda: optimized_parse_char(test_string), number=100000)
    print(f"Original parse_char time: {original_time:.6f} seconds")
    print(f"Optimized parse_char time: {optimized_time:.6f} seconds")
if __name__ == "__main__":
    benchmark()

It can easily be fixed by capping the length to 4 when the lookup index goes past the end of the byte-length array. Benchmark results anyway:
Original parse_char time: 0.075589 seconds
Optimized parse_char time: 0.051987 seconds
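To make that cap concrete, here is a minimal sketch (the name capped_decode_utf8 and the fallback constant are illustrative, not part of the PR); it mirrors the structure of the decode functions above and only adds the bounds check on the lookup index:

import typing

BYTE_LENGTHS = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 4]

def capped_decode_utf8(src: str) -> typing.Tuple[int, str]:
    first_byte = ord(src[0])
    highbits = first_byte >> 4
    # Code points above 0xFF yield highbits >= 16, which would index past the
    # 16-entry table; cap the decoded length at 4 in that case.
    length = BYTE_LENGTHS[highbits] if highbits < len(BYTE_LENGTHS) else 4
    value = first_byte & ((1 << (8 - length)) - 1)
    for i in range(1, min(length, len(src))):
        value = (value << 6) + (ord(src[i]) & 0x3F)
    return value, src[length:]

For a character such as 😀 (U+1F600), highbits is far above 15, so the guarded lookup returns a result instead of raising an IndexError.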
Alright, I gave it a try at the office. Rule parsing is broken, and a few functions are missing. Test code:

from llama_cpp import LlamaGrammar, Llama
gbnf_str = r"""# This is the same as json.gbnf but we restrict whitespaces at the end of the root array
# Useful for generating JSON arrays
root ::= arr
value ::= object | array | string | number | ("true" | "false" | "null") ws
arr ::=
  "[\n" ws (
    value
    (",\n" ws value)*
  )? "]"
object ::=
  "{" ws (
    string ":" ws value
    ("," ws string ":" ws value)*
  )? "}" ws
array ::=
  "[" ws (
    value
    ("," ws value)*
  )? "]" ws
string ::=
  "\"" (
    [^"\\\x7F\x00-\x1F] |
    "\\" (["\\bfnrt] | "u" [0-9a-fA-F]) # escapes
  )* "\"" ws
number ::= ("-"? ([0-9] | [1-9] [0-9])) ("." [0-9]+)? ([eE] [-+]? [1-9] [0-9])? ws
# Optional space: by convention, applied in this grammar after literal chars when allowed
ws ::= | " " | "\n" [ \t]
"""
gguf = "/opt/IdExtend/models/llm/mistral-7b-instruct-v0.2.Q5_K_M.gguf"
grammar = LlamaGrammar.from_string(gbnf_str, verbose=False)
model = Llama(gguf, n_ctx=8192, n_gpu_layers=-1, tensor_split=[1,0,0], verbose=False)
stream = model.create_completion("In a json format give me a list of known stars :", grammar=grammar, stream=True, max_tokens=1024)
for output in stream:
    print(output['choices'][0]['text'], end="")
@ExtReMLapin just fixed the last bug, which was caused by a re-assignment. Do you mind opening another PR for those changes? For now I just wanted to keep the implementation as close to the C++ as possible, but obviously there's room to optimize (it may be better to do some other kind of caching here, though).
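As one possible reading of the caching idea above (a sketch under assumed names, not the approach taken in the PR), the escape/hex decoding could be memoized so sequences that repeat across a grammar are only decoded once:

import functools
import typing

@functools.lru_cache(maxsize=None)
def cached_parse_hex(chunk: str, size: int) -> int:
    # Decode a fixed-size run of hex digits; results are cached because the
    # same escape sequences tend to repeat within a grammar.
    value = 0
    for c in chunk[:size]:
        value = (value << 4) + int(c, 16)
    return value

def parse_escape(src: str) -> typing.Tuple[int, str]:
    # Hypothetical helper: handles only \x, \u and \U escapes for brevity.
    sizes = {"x": 2, "u": 4, "U": 8}
    size = sizes[src[1]]
    return cached_parse_hex(src[2:2 + size], size), src[2 + size:]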
Thanks for the fix, will do!
ggerganov/llama.cpp#6640
ggerganov/llama.cpp#6467
ggerganov/llama.cpp#7194
Fixes #1547