Name and Version
version: 7310 (db97837)
built with MSVC 19.50.35719.0 for x64
Description
PR #17786 introduced std::regex_constants::nosubs | std::regex_constants::optimize flags to prevent segfaults on highly repetitive input. This fix works on Linux/GCC but breaks Windows/MSVC builds when loading models with complex tokenizer patterns (e.g., gpt-4o).
Error
Failed to process regex: [^\r\n\p{L}\p{N}]?((?=[\p{L}])([^a-z]))*((?=[\p{L}])([^A-Z]))+(?:'[sS]|...
Regex error: regex_error(error_stack): There was insufficient memory to determine whether the regular expression could match the specified character sequence.
llama_model_load: error loading model: error loading model vocabulary: Failed to process regex
Cause
MSVC's std::regex implementation has severe stack limitations with complex patterns containing nested lookaheads. The nosubs | optimize flags appear to trigger a code path that exhausts the stack during pattern compilation.
Suggested Fix
Use platform-specific flags:
#ifdef _MSC_VER
// MSVC's std::regex has stack limitations with complex patterns
constexpr auto regex_flags = std::regex_constants::ECMAScript;
#else
// Prevents catastrophic backtracking on repetitive input
constexpr auto regex_flags = std::regex_constants::nosubs | std::regex_constants::optimize;
#endif
This preserves the fix for Linux while avoiding the regression on Windows. Note that _MSC_VER is also defined for clang-cl, which uses the same broken MSVC STL implementation.
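As a rough illustration of how the platform-specific flags could be wired into a matching helper, here is a minimal sketch. The names `k_regex_flags` and `find_matches` are hypothetical and are not taken from the llama.cpp sources. The non-MSVC branch spells out `ECMAScript` explicitly, which should be equivalent to the snippet above because ECMAScript is the default grammar when no grammar flag is set. The `try`/`catch` only shows one way to surface a `std::regex_error` to the caller; it does not work around the MSVC stack limit.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical constant; the flag values follow the suggestion in this issue.
#ifdef _MSC_VER
// MSVC STL (also used by clang-cl): plain ECMAScript grammar, no extra flags.
static const std::regex_constants::syntax_option_type k_regex_flags =
    std::regex_constants::ECMAScript;
#else
// Other standard libraries: keep nosubs | optimize from the original fix.
static const std::regex_constants::syntax_option_type k_regex_flags =
    std::regex_constants::ECMAScript |
    std::regex_constants::nosubs |
    std::regex_constants::optimize;
#endif

// Hypothetical helper: collects every non-overlapping match of `pattern` in `text`.
// If compiling or matching throws std::regex_error, `ok` is set to false and
// an empty list is returned.
static std::vector<std::string> find_matches(const std::string & text,
                                             const std::string & pattern,
                                             bool & ok) {
    std::vector<std::string> out;
    ok = true;
    try {
        const std::regex re(pattern, k_regex_flags);
        for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
            out.push_back(it->str()); // full match; sub-matches are not needed
        }
    } catch (const std::regex_error &) {
        ok = false;
        out.clear();
    }
    return out;
}
```

Because only the whole match is read (never a capture group), the helper behaves the same with or without `nosubs`, so the two branches should differ only in performance and stack behavior, not in results.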