
Regression: PR #17786 breaks model loading on Windows/MSVC for models with complex tokenizer regex #17830

@dranger003


Name and Version

version: 7310 (db97837)
built with MSVC 19.50.35719.0 for x64

Description

PR #17786 introduced std::regex_constants::nosubs | std::regex_constants::optimize flags to prevent segfaults on highly repetitive input. This fix works on Linux/GCC but breaks Windows/MSVC builds when loading models with complex tokenizer patterns (e.g., gpt-4o).

Error

Failed to process regex: [^\r\n\p{L}\p{N}]?((?=[\p{L}])([^a-z]))*((?=[\p{L}])([^A-Z]))+(?:'[sS]|...
Regex error: regex_error(error_stack): There was insufficient memory to determine whether the regular expression could match the specified character sequence.
llama_model_load: error loading model: error loading model vocabulary: Failed to process regex

Cause

MSVC's std::regex implementation has severe stack limitations with complex patterns containing nested lookaheads. The nosubs | optimize flags appear to trigger a code path that exhausts the stack during pattern compilation.
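One way to check whether the failure happens while the pattern is being compiled or only later during matching is a small probe like the sketch below. `probe_regex` and `probe_result` are hypothetical names, not part of llama.cpp. The pattern passed in must already be in plain ECMAScript form, because `std::regex` does not understand `\p{L}` style classes. Only failures reported as `std::regex_error` are observable this way; a real stack overflow crash cannot be caught.

```cpp
#include <regex>
#include <string>

// Outcome of probing one pattern with one set of flags.
struct probe_result {
    bool ok;                 // construction and one search both completed
    bool failed_in_compile;  // std::regex construction itself threw
    std::string what;        // message from std::regex_error, empty on success
};

// Hypothetical helper: compile `pattern` with `flags`, then run one search over `input`.
inline probe_result probe_regex(const std::string & pattern,
                                std::regex_constants::syntax_option_type flags,
                                const std::string & input) {
    bool compiled = false;
    try {
        std::regex re(pattern, flags);
        compiled = true;
        (void) std::regex_search(input, re);
        return {true, false, ""};
    } catch (const std::regex_error & e) {
        // `compiled` tells us which of the two steps raised the error.
        return {false, !compiled, e.what()};
    }
}
```

Running the same pattern once with `std::regex_constants::ECMAScript` and once with `std::regex_constants::nosubs | std::regex_constants::optimize` on an MSVC build would show whether the flags alone change the outcome.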

Suggested Fix

Use platform-specific flags:

#ifdef _MSC_VER
    // MSVC's std::regex has stack limitations with complex patterns
    constexpr auto regex_flags = std::regex_constants::ECMAScript;
#else
    // Avoids segfaults on highly repetitive input (PR #17786)
    constexpr auto regex_flags = std::regex_constants::nosubs | std::regex_constants::optimize;
#endif

This preserves the fix for Linux while avoiding the regression on Windows. Note that _MSC_VER is also defined for clang-cl, which uses the same broken MSVC STL implementation.
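As an illustration of how the platform-specific flags could be wired in, here is a minimal sketch that also adds a runtime fallback. This is an assumption layered on top of the suggestion, not something taken from the llama.cpp source: `preferred_regex_flags` and `make_tokenizer_regex` are hypothetical names, and the fallback only helps if the failure surfaces as a `std::regex_error` during construction.

```cpp
#include <regex>
#include <string>

#ifdef _MSC_VER
// MSVC STL (also used by clang-cl): stay on the default ECMAScript flags.
constexpr auto preferred_regex_flags = std::regex_constants::ECMAScript;
#else
// Other standard libraries: keep the flags introduced by PR #17786.
constexpr auto preferred_regex_flags =
    std::regex_constants::ECMAScript |
    std::regex_constants::nosubs |
    std::regex_constants::optimize;
#endif

// Hypothetical helper: build a regex with the preferred flags, and retry with
// plain ECMAScript if the library reports a stack or complexity limit.
inline std::regex make_tokenizer_regex(const std::string & pattern) {
    try {
        return std::regex(pattern, preferred_regex_flags);
    } catch (const std::regex_error & e) {
        if (e.code() != std::regex_constants::error_stack &&
            e.code() != std::regex_constants::error_complexity) {
            throw;  // genuine syntax errors are not masked
        }
        return std::regex(pattern, std::regex_constants::ECMAScript);
    }
}
```

The retry keeps a single code path for all platforms while still letting a build recover if the optimized flags turn out to be the trigger. Whether that is preferable to the pure `#ifdef` approach depends on where the error is actually raised.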

First Bad Commit

1be9783


Labels: bug, windows
