Create a "code" mode or a "raw" mode, excluding all strange non-ASCII characters #41

quark67 · 2024-10-04T15:17:08Z

What I'm asking for is not really a new "language" detection, because source code can contain strings in English, French, German, etc.

But when I need to OCR a (long) code snippet from a YouTube video, the character } needs to be recognized as } (U+007D, RIGHT CURLY BRACKET), not as ｝ (U+FF5D, FULLWIDTH RIGHT CURLY BRACKET). The latter really is a single character, not a } followed by a space: try selecting it.

However, it can be useful for the OCR to recognize characters with diacritics (e.g. éàâèêôùûëïç, œ and æ, and their uppercase forms, if the code contains French strings), but not the no-break space (U+00A0) or curly quotes (‘’“”).
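To make the rule above concrete, here is a minimal post-processing sketch in Python. It is only an illustration of the mapping I have in mind, not how the tool works internally; the names `REPLACEMENTS` and `to_code_text` are hypothetical. It maps fullwidth ASCII variants, the no-break space and curly quotes back to plain ASCII, and leaves accented letters untouched.

```python
# Hypothetical mapping for the characters this issue wants excluded.
REPLACEMENTS = {
    "\u00a0": " ",   # NO-BREAK SPACE -> ordinary space
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
}


def to_code_text(text: str) -> str:
    """Return text with look-alike characters replaced by ASCII equivalents."""
    out = []
    for ch in text:
        if ch in REPLACEMENTS:
            out.append(REPLACEMENTS[ch])
        elif 0xFF01 <= ord(ch) <= 0xFF5E:
            # Fullwidth forms U+FF01..U+FF5E mirror ASCII U+0021..U+007E
            # at a fixed offset of 0xFEE0.
            out.append(chr(ord(ch) - 0xFEE0))
        else:
            # Everything else, including é, ç, œ, æ, is kept as-is.
            out.append(ch)
    return "".join(out)
```

For example, `to_code_text("\\end{document\uff5d")` gives back the line with an ordinary } at the end, while a French string such as "été" passes through unchanged.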

Perhaps this "code" or "raw" mode can also be more accurate about not confusing { with (, and about not adding spaces where there are none.

Example (code in the LaTeX language):

\documentclass[11pt,a4paper]{article}
\usepackage[french]{babel}
\usepackage[T1]{fontenc}
\usepackage{lmodern}

is recognized as:

(documentclass [11pt, a4paper] (article}
\usepackage[french] {babel}
\usepackage[T1] {fontenc}
\usepackage{lmodern}

(\ and { are replaced by (, and spaces are added. These errors do not always occur.)

In the same code snippet from a YouTube video, a . was recognized as a •. Clearly a • has no place in "code" or "raw" text, so a character like "•" needs to be excluded. But between a "." and a "•", we can easily spot the error and correct it manually. Between a "}" and a "｝" it's more difficult (especially when it ends a line), and the same goes for a space and a no-break space.
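Since some of these characters are nearly impossible to spot by eye, a small checker can at least list them. The following is a rough Python sketch under my own assumptions: the names `is_allowed` and `find_suspicious` are hypothetical, and the allow-list (printable ASCII, tab, plus Latin letters up to U+017F so that é, ç, œ and æ pass) is just one possible choice.

```python
import unicodedata


def is_allowed(ch: str) -> bool:
    """Assumed allow-list: printable ASCII, tab, and accented Latin letters."""
    if ch == "\t" or " " <= ch <= "~":
        return True
    # Latin-1 Supplement and Latin Extended-A letters (é, ç, œ, æ, ...).
    return ord(ch) <= 0x017F and unicodedata.category(ch).startswith("L")


def find_suspicious(text: str):
    """Return (line, column, code point, Unicode name) for each odd character."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if not is_allowed(ch):
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((line_no, col, f"U+{ord(ch):04X}", name))
    return hits
```

Running it on OCR output reports the position and the Unicode name of each offending character, so a trailing ｝ or a no-break space shows up explicitly instead of hiding at the end of a line.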

I hope it's not too difficult to add this feature. Thanks.
