fix: handle UTF-8 multibyte sequences truncated at buffer boundary
Fix false positives where valid UTF-8 sequences were incorrectly
flagged as binary when truncated at the MAX_BYTES boundary.
Changes:
- Introduce a scanBytes variable to separate the scan range from the
  validation range
- Read extra bytes (MAX_BYTES + UTF8_BOUNDARY_RESERVE in total) to capture
  complete sequences at the boundary
- Keep the MAX_BYTES scan limit for the binary detection logic
- Allow UTF-8 validation to read up to MAX_BYTES + UTF8_BOUNDARY_RESERVE
  bytes
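The split between scan range and validation range can be illustrated with a minimal Python sketch. It is not the code from this commit: the values of MAX_BYTES and UTF8_BOUNDARY_RESERVE, the function names, and the simplified UTF-8 check (no overlong or surrogate handling) are all assumptions made for illustration. Passing reserve=0 imitates the behaviour before the fix.

```python
MAX_BYTES = 512            # assumed value, for illustration only
UTF8_BOUNDARY_RESERVE = 3  # assumed: a 4-byte sequence has at most 3 trailing bytes


def utf8_sequence_length(lead: int) -> int:
    """Expected length of a multibyte sequence, or 0 if `lead` cannot start one."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def suspicious_bytes(data: bytes, reserve: int = UTF8_BOUNDARY_RESERVE) -> int:
    """Count suspicious bytes among the first MAX_BYTES bytes of `data`."""
    buf = data[:MAX_BYTES + reserve]          # read a little past the scan limit
    total_bytes = len(buf)                    # bytes available to UTF-8 validation
    scan_bytes = min(total_bytes, MAX_BYTES)  # bytes the loop actually scans
    suspicious = 0
    i = 0
    while i < scan_bytes:
        b = buf[i]
        if b >= 0x80:
            n = utf8_sequence_length(b)
            tail = buf[i + 1:i + n]
            # Validation may look past scan_bytes, but never past total_bytes.
            if n and i + n <= total_bytes and all(0x80 <= c <= 0xBF for c in tail):
                i += n
                continue
            suspicious += 1
        elif b < 0x20 and b not in (0x09, 0x0A, 0x0D):
            suspicious += 1                   # control byte other than tab, LF, CR
        i += 1
    return suspicious


def suspicious_percent(data: bytes) -> float:
    """Percentage of suspicious bytes, computed over the scan range only."""
    scan_bytes = min(len(data), MAX_BYTES)
    return suspicious_bytes(data) * 100 / scan_bytes if scan_bytes else 0.0
```

With a 3-byte character whose lead byte is the last byte inside MAX_BYTES, this sketch counts one suspicious byte when reserve=0 and none with the default reserve.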
Tests:
- Add 7 boundary test cases, including a real-world Python file with
  Chinese characters (utf8-boundary-truncation_case.py)
- Cover 2/3/4-byte UTF-8 sequences at positions near the MAX_BYTES boundary
- All 40 tests pass
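As a rough illustration of what such boundary inputs look like, the Python sketch below builds payloads in which a 2-, 3-, or 4-byte character straddles the boundary. It is a hypothetical generator, not the test suite from this commit: the MAX_BYTES value, the sample characters, and the helper name are assumptions.

```python
MAX_BYTES = 512  # assumed value, for illustration only

# One sample character per encoded width (2, 3 and 4 bytes in UTF-8).
SAMPLES = {2: "\u00e9", 3: "\u4e2d", 4: "\U0001F600"}


def boundary_cases(max_bytes: int = MAX_BYTES) -> list[tuple[str, bytes]]:
    """Build (label, payload) pairs whose multibyte character straddles max_bytes."""
    cases = []
    for width, ch in SAMPLES.items():
        encoded = ch.encode("utf-8")
        # `cut` is how many bytes of the character land before the boundary.
        for cut in range(1, width):
            padding = b"a" * (max_bytes - cut)
            label = f"{width}-byte char split after {cut} byte(s)"
            cases.append((label, padding + encoded + b"\n"))
    return cases
```

Each payload is valid UTF-8 as a whole, while its first MAX_BYTES bytes end in an incomplete sequence, which is the situation the fix targets.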
Technical details:
- Minimal change preserving all existing UTF-8 detection logic
- scanBytes bounds the scan loop and the percentage calculations
- totalBytes lets validation follow sequences that cross the MAX_BYTES
  boundary
- Maintains backward compatibility and binary detection thresholds
This addresses the same issue as PR #90 but with a simpler, more
maintainable approach. If accepted, PR #90 will be closed.