Commit
fix(ocr_mkcontent): improve language detection and content formatting (#458)

Optimize the language detection logic to enhance content formatting. This change addresses issues with long word segmentation.

Language detection now uses a threshold to determine the language of a text based on the proportion of English characters.

Formatting rules for content have been updated to consider a list of languages (initially including Chinese, Japanese, and Korean) where no space is added between content segments for inline equations and text spans, improving the handling of Asian languages.

The impact of these changes includes improved accuracy in language detection, better segmentation of long words, and more appropriate spacing in content formatting for multiple languages.
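The two rules described above can be illustrated with a minimal sketch. This is not the code from the commit: the function names, the 0.5 threshold, and the language codes `zh`, `ja`, and `ko` are assumptions chosen for illustration, since the message does not state the actual values.

```python
# Illustrative sketch only; names and constants are hypothetical,
# not taken from the ocr_mkcontent module.
EN_RATIO_THRESHOLD = 0.5             # assumed value, not stated in the commit
NO_SPACE_LANGS = {"zh", "ja", "ko"}  # assumed codes for Chinese, Japanese, Korean


def is_english(text: str, threshold: float = EN_RATIO_THRESHOLD) -> bool:
    """Return True when the share of ASCII letters reaches the threshold."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    english = sum(1 for c in chars if c.isascii() and c.isalpha())
    return english / len(chars) >= threshold


def join_segments(segments: list[str], lang: str) -> str:
    """Join text spans and inline equations, omitting spaces for listed languages."""
    parts = [s.strip() for s in segments if s.strip()]
    separator = "" if lang in NO_SPACE_LANGS else " "
    return separator.join(parts)
```

With these assumptions, `join_segments(["质能方程", "$E=mc^2$", "成立"], "zh")` concatenates the spans directly, while the same call with `"en"` and English spans separates them with single spaces.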