DataError: string or blob too big #1180

Closed
fxwiegand opened this issue Sep 28, 2024 · 3 comments

@fxwiegand

Describe the issue as clearly as possible:

Hello everyone!
We are trying to build a helper that generates configs (with a fixed schema) for a tool we have developed over the last few years, and Outlines seems like an amazing solution. However, our JSON schema is not just 20 lines but around 750 (pretty-printed, though), and this causes the following error: DataError: string or blob too big.

Steps/code to reproduce the bug:

Run the JSON schema usage example from the documentation with this schema: https://raw.githubusercontent.com/koesterlab/datavzrd-chatbot/refs/heads/main/schema.json

Expected result:

No bug...

Error message:

File "/home/felixwiegand/PycharmProjects/datavzrd-chatbot/venv/lib/python3.12/site-packages/streamlit/runtime/scriptrunner/exec_code.py", line 88, in exec_func_with_error_handling
    result = func()
             ^^^^^^
File "/home/felixwiegand/PycharmProjects/datavzrd-chatbot/venv/lib/python3.12/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 590, in code_to_exec
    exec(code, module.__dict__)
File "/home/felixwiegand/PycharmProjects/datavzrd-chatbot/app.py", line 52, in <module>
    main()
File "/home/felixwiegand/PycharmProjects/datavzrd-chatbot/app.py", line 33, in main
    config_json = generate_datavzrd_config(user_input, schema)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/felixwiegand/PycharmProjects/datavzrd-chatbot/app.py", line 15, in generate_datavzrd_config
    generator = outlines.generate.json(model, json.dumps(schema))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/functools.py", line 907, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/felixwiegand/PycharmProjects/datavzrd-chatbot/venv/lib/python3.12/site-packages/outlines/generate/json.py", line 59, in json
    generator = regex(model, regex_str, sampler)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/functools.py", line 907, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/felixwiegand/PycharmProjects/datavzrd-chatbot/venv/lib/python3.12/site-packages/outlines/generate/regex.py", line 33, in regex
    fsm = RegexGuide(regex_str, model.tokenizer)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/felixwiegand/PycharmProjects/datavzrd-chatbot/venv/lib/python3.12/site-packages/outlines/fsm/guide.py", line 145, in __init__
    ) = create_states_mapping(regex_string, tokenizer)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/felixwiegand/PycharmProjects/datavzrd-chatbot/venv/lib/python3.12/site-packages/outlines/caching.py", line 119, in wrapper
    result = wrapper.__memory__.get(cache_key, default=ENOVAL, retry=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/felixwiegand/PycharmProjects/datavzrd-chatbot/venv/lib/python3.12/site-packages/diskcache/core.py", line 1165, in get
    rows = self._sql(select, (db_key, raw, time.time())).fetchall()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Outlines/Python version information:

Version information

``` Outlines v0.0.46 ```

Context for the issue:

No response

@fxwiegand fxwiegand added the bug label Sep 28, 2024
@fxwiegand fxwiegand changed the title <Please write a descriptive title> Outlines v0.0.46 Sep 28, 2024
@fxwiegand fxwiegand changed the title Outlines v0.0.46 DataError: string or blob too big Sep 28, 2024
@rlouf
Member

rlouf commented Sep 28, 2024

Thank you for reporting this error! It looks like a problem with the way caching works; a refactor of that part is long overdue! @lapp0

@fxwiegand
Author

> Thank you for reporting this error! It looks like a problem with the way caching works; a refactor of that part is long overdue! @lapp0

That is actually great to hear! I was worried I had messed up at some point or that the schema was the problem! I would be very thankful if this were fixed. Also, thanks a lot for the fast reply, much appreciated!

@lapp0
Contributor

lapp0 commented Oct 3, 2024

I can't reproduce your error. Perhaps your python3 / sqlite3 version is different from mine?

I did notice that the resulting regex is 1587257009 characters, which exceeds the default SQLite limit:

Maximum length of a string or BLOB

The maximum number of bytes in a string or BLOB in SQLite is defined by the preprocessor macro SQLITE_MAX_LENGTH. The default value of this macro is 1 billion (1 thousand million or 1,000,000,000). You can raise or lower this value at compile-time using a command-line option like this:

-DSQLITE_MAX_LENGTH=123456789

Mitigation should involve giving Outlines users the flexibility to cache to any database. (This would also solve many users' problems here: vllm-project/vllm#4193)
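
As a rough illustration of what "cache to any database" could look like, the sketch below separates the caching decorator from the storage backend. Every name here (`CacheBackend`, `InMemoryBackend`, `cached`, `build_index`) is hypothetical and does not reflect Outlines' actual caching code; it only shows the shape of a pluggable design.

```python
import functools
from typing import Any, Callable, Hashable, Protocol


class CacheBackend(Protocol):
    """Hypothetical interface any storage backend would implement."""

    def get(self, key: Hashable, default: Any = None) -> Any: ...

    def set(self, key: Hashable, value: Any) -> None: ...


class InMemoryBackend:
    """Simplest possible backend: a dict with no size limit on keys."""

    def __init__(self) -> None:
        self._store: dict = {}

    def get(self, key, default=None):
        return self._store.get(key, default)

    def set(self, key, value):
        self._store[key] = value


_MISSING = object()  # sentinel so that cached None values are still hits


def cached(backend: CacheBackend) -> Callable:
    """Cache a function's results in whichever backend is supplied."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args):
            result = backend.get(args, _MISSING)
            if result is _MISSING:
                result = func(*args)
                backend.set(args, result)
            return result

        return wrapper

    return decorator


calls = []


@cached(InMemoryBackend())
def build_index(regex_string):
    # Stand-in for an expensive computation keyed by a long regex.
    calls.append(regex_string)
    return len(regex_string)


first = build_index("a+" * 1000)
second = build_index("a+" * 1000)
```

With this split, a backend that cannot store very long keys could be swapped for one that can, without touching the decorated functions.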

We're actively working on upgrades to our JSON Schema -> regex tooling, but currently the "items" constraint isn't applied properly and results in a massive unconstrained array:

      "type": [
        "array",
        "null"
      ],
      "items": {
        "type": "string"
      }

Please instead use

{
  "anyOf": [
    {
      "type": "null"
    },
    {
      "type": "array",
      "items": {
        "type": "string"
      }
    }
  ]
}
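
For a schema of around 750 lines, applying this rewrite by hand is tedious. The following is a minimal sketch of a helper that does it recursively; `expand_type_unions` is a hypothetical name, not an Outlines function. It assumes that every dict carrying a list-valued `"type"` is a schema node, so literal data under keywords such as `"default"`, `"const"`, or `"enum"` that happens to look the same would also be rewritten.

```python
def expand_type_unions(node):
    """Rewrite {"type": [a, b], ...} into {"anyOf": [...]} recursively.

    Hypothetical helper; the remaining keywords (e.g. "items") are copied
    onto every non-null branch.
    """
    if isinstance(node, list):
        return [expand_type_unions(item) for item in node]
    if not isinstance(node, dict):
        return node

    # Rewrite children first so nested unions are handled too.
    node = {key: expand_type_unions(value) for key, value in node.items()}

    types = node.get("type")
    if not isinstance(types, list):
        return node

    rest = {key: value for key, value in node.items() if key != "type"}
    branches = []
    for type_name in types:
        if type_name == "null":
            branches.append({"type": "null"})
        else:
            branches.append({"type": type_name, **rest})
    return {"anyOf": branches}


original = {
    "type": "object",
    "properties": {
        "tags": {"type": ["array", "null"], "items": {"type": "string"}},
    },
}
rewritten = expand_type_unions(original)
print(rewritten)
```

The branches keep the order of the original `"type"` list, so the `"array"` branch comes before `"null"` here; the order of `anyOf` branches does not change what the schema accepts.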

@rlouf rlouf closed this as completed Oct 8, 2024