Create a "code" mode or a "raw" mode, excluding all strange non-ASCII characters #41

Open
quark67 opened this issue Oct 4, 2024 · 0 comments

quark67 commented Oct 4, 2024

What I am asking for is not really a new "language" for detection, because source code can contain strings in English, French, German, etc.

But when I need to OCR a (long) code snippet from a YouTube video, the character } needs to be recognized as } (U+007D, RIGHT CURLY BRACKET), not as ｝ (U+FF5D, FULLWIDTH RIGHT CURLY BRACKET). The latter really is a single character, not a } with a space: try selecting it.

But it can be useful for the OCR to recognize characters with diacritics (e.g. éàâèêôùûëïç, œ and æ, and their uppercase forms, if the code contains French strings). It should not, however, output non-breaking spaces (U+00A0) or curly quotes (‘’“”).
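To illustrate the character rules above, here is a minimal Python sketch of a post-processing pass that could run on the OCR output. It is only an assumption about how a "code" mode might behave, not the project's actual implementation. The name `to_code_text` and the replacement table are hypothetical.

```python
# Hypothetical post-processing pass for a "code"/"raw" mode.
# Names and the replacement table are illustrative assumptions.

# Characters that should never appear in code, mapped to plain ASCII.
REPLACEMENTS = {
    "\u00a0": " ",   # NO-BREAK SPACE -> ordinary space
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
}


def to_code_text(text: str) -> str:
    """Map fullwidth ASCII variants, NBSP and curly quotes to ASCII.

    Letters with diacritics (é, à, œ, ...) are left untouched.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xFF01 <= cp <= 0xFF5E:
            # Fullwidth forms U+FF01..U+FF5E mirror ASCII U+0021..U+007E
            # at a fixed offset of 0xFEE0.
            out.append(chr(cp - 0xFEE0))
        else:
            out.append(REPLACEMENTS.get(ch, ch))
    return "".join(out)


# The fullwidth bracket at the end becomes a plain "}".
print(to_code_text("\\usepackage{lmodern\uff5d"))
```

An explicit table is used here rather than Unicode NFKC normalization, because NFKC does not touch curly quotes and would also rewrite other characters that may be wanted as-is.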

Perhaps this "code" or "raw" mode could also be more accurate about not confusing { with (, and less prone to adding spaces where there are none.

Example (code in the LaTeX language):

\documentclass[11pt,a4paper]{article}
\usepackage[french]{babel}
\usepackage[T1]{fontenc}
\usepackage{lmodern}

is recognized as:

(documentclass [11pt, a4paper] (article}
\usepackage[french] {babel}
\usepackage[T1] {fontenc}
\usepackage{lmodern}

(\ and { were replaced by (, and spaces were added. These errors do not always occur.)

In the same code snippet from the YouTube video, a . was recognized as a •. Clearly a • has no place in "code" or "raw" text, so a character like "•" needs to be excluded. But between a "." and a "•", we can easily spot the error and correct it manually. Between a "}" and a "｝" it is more difficult (especially when it ends a line), and likewise between a space and a non-breaking space.
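Because these look-alike characters are hard to spot by eye, a small checker can point them out. The following Python sketch is only an illustration. The function name `find_suspicious` and the allowed set of accented letters (taken from the list earlier in this issue) are assumptions, and it expects accented letters in precomposed form.

```python
import unicodedata

# Accented letters allowed in strings, lowercase and uppercase.
# This set is an assumption based on the French examples in this issue.
_LETTERS = "éàâèêôùûëïçœæ"
ALLOWED_EXTRA = set(_LETTERS + _LETTERS.upper())


def find_suspicious(text: str):
    """Return (line, column, code point, name) for each unexpected character.

    Lines and columns are 1-based. ASCII and the allowed accented
    letters are accepted; everything else is reported.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch.isascii() or ch in ALLOWED_EXTRA:
                continue
            name = unicodedata.name(ch, "UNKNOWN")
            hits.append((line_no, col, f"U+{ord(ch):04X}", name))
    return hits


sample = "\\usepackage{lmodern\uff5d\nx = 1\u00a0+ 2"
for hit in find_suspicious(sample):
    print(hit)
```

Reporting the position and the Unicode name makes a trailing "｝" or a non-breaking space visible even when it looks identical to the expected character on screen.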

I hope it's not too difficult to add this feature. Thanks.
