What I am asking for is not really a new "language" detection, because source code can contain strings in English, French, German, etc.
But when I need to OCR a (long) code snippet from a YouTube video, the character } needs to be recognized as } (Unicode U+007D, RIGHT CURLY BRACKET), not as ｝ (Unicode U+FF5D, FULLWIDTH RIGHT CURLY BRACKET). The latter really is a single character, not a } followed by a space: try selecting it.
It would still be useful for the OCR to recognize characters with diacritics (e.g. éàâèêôùûëïç, œ and æ, and their uppercase forms, in case the code contains French strings), but not the non-breaking space (Unicode U+00A0) or curly quotes (‘’“”).
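To make the requested character set concrete, here is a minimal Python sketch of a check that could be run on OCR output. It is only an illustration, not part of any existing tool: the function names `is_allowed_in_code` and `find_disallowed` are hypothetical, and the rule "printable ASCII plus Latin letters" is my assumption about what a "code" mode should accept.

```python
import unicodedata


def is_allowed_in_code(ch: str) -> bool:
    """Return True if ch belongs to the (assumed) "code" character set."""
    if ch == "\t":
        return True
    if " " <= ch <= "~":  # printable ASCII, U+0020..U+007E
        return True
    # Accented letters and ligatures (é, Ç, œ, Æ, ...): cased letters whose
    # Unicode name starts with "LATIN ". Fullwidth letters are named
    # "FULLWIDTH LATIN ...", so they stay excluded.
    if unicodedata.category(ch) in ("Ll", "Lu"):
        return unicodedata.name(ch, "").startswith("LATIN ")
    return False


def find_disallowed(text: str) -> list[tuple[int, int, str]]:
    """List (line, column, Unicode name) for every character outside the set."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if not is_allowed_in_code(ch):
                hits.append((line_no, col, unicodedata.name(ch, "UNKNOWN")))
    return hits


if __name__ == "__main__":
    sample = 'print("été")\uFF5D\nx\u00A0= 1'
    for hit in find_disallowed(sample):
        print(hit)
```

With this rule the French string "été" passes, while the fullwidth bracket and the non-breaking space are reported with their positions, which are exactly the cases that are hard to spot by eye.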
Perhaps this "code" or "raw" mode could also be more accurate about not confusing { with (, and about not adding spaces where there are none.
In one example (LaTeX code), \ and { were replaced by ( and spaces were added. These errors do not always occur.
In the same code snippet from the YouTube video, a . was recognized as a •. Clearly a • has no place in "code" or "raw" text, so a character like "•" should be excluded. Between a "." and a "•", the error is easy to spot and correct manually. Between a "}" and a "｝" it is much harder (especially at the end of a line), and the same goes for a space and a non-breaking space.
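Until such a mode exists, the substitutions described above could be undone in a post-processing step. The following Python sketch is only a workaround under assumptions: the table `CODE_MODE_MAP` and the function `normalize_code_text` are hypothetical names, the table covers only the characters mentioned in this issue, and mapping "•" to "." is a guess based on the single case I observed.

```python
# Hypothetical table: look-alike characters mapped to the plain character
# expected in source code. Only the cases named in this issue are listed.
CODE_MODE_MAP = str.maketrans({
    "\uFF5B": "{",   # FULLWIDTH LEFT CURLY BRACKET
    "\uFF5D": "}",   # FULLWIDTH RIGHT CURLY BRACKET
    "\u00A0": " ",   # NO-BREAK SPACE
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201C": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201D": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u2022": ".",   # BULLET (assumed to be a misread full stop)
})


def normalize_code_text(text: str) -> str:
    """Replace the look-alike characters above; leave everything else as is."""
    return text.translate(CODE_MODE_MAP)


if __name__ == "__main__":
    ocr_output = "s\u00A0= \u201Cété\u201D\uFF5D"
    print(normalize_code_text(ocr_output))
```

Accented letters are deliberately left untouched, so French strings survive. This does not fix the { versus ( confusion or the extra spaces, which can only be addressed in the recognition step itself.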
I hope it's not too difficult to add this feature. Thanks.