DPI not working? And possibility for spell checking, improved tesseract model, or alternative? #16
Comments
After compiling all the inaccuracies, it looks like 99% of them are this ell/eye mismatch, so I think a sed regex would work perfectly fine. I'm not closing this yet, though, because I'm not sure the DPI option is working correctly.
In retrospect I definitely overstated in the readme how accurate vobsubocr is, especially with the default settings. It is kind of surprising to me how tesseract itself doesn't seem to be able to differentiate I and L in the context of words; I would have expected that viewing words as a whole would be how most OCR worked nowadays.
AFAICT the dpi option is a somewhat magic option that Tesseract uses in some vaguely defined way, but doesn't otherwise affect the data that vobsubocr feeds it. I hadn't considered that scaling the subtitles up like that would be effective, but it makes sense, given that it can average the colors of the outlines of the subtitles which would make more "smooth" text. There are probably some pretty good AI models too that can help with that nowadays. It'll require some significant re-engineering, though, unfortunately, and I'm not sure I have the time in the near future to tackle it. I'll leave the issue open specifically to track such a feature.
Also, re: automatic spell-checking, besides sed, I wonder if hunspell or something could automatically process all the files after the fact? Though I imagine it would present a lot of trouble with something like Inuyasha which would have a lot of Japanese names.
Yeah, I was trying with aspell but it's just nonstop with anime, not worth it. I found that using a sed regex to check for capital eyes in the middle of words that aren't all caps, not right after an apostrophe, and not the first letter works perfectly. It shouldn't have any false positives.
The eye and ell confusion seems to be the only major problem I had, so I'm good with the results. Thank you for making this, it saved me so much trouble from getting subp2tiff on my system.
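The exact sed expression isn't posted in the thread, so here is only a rough Python sketch of the rule as described: inside a word that isn't all caps, a capital I that is neither the first letter nor directly after an apostrophe gets turned into a lowercase l. The function names `fix_word` and `fix_ell_eye` are made up for illustration, and the all-caps check is one interpretation of "words that aren't capitalized".

```python
import re

# A "word" here is a run of ASCII letters and apostrophes (an assumption;
# the original sed pattern is not shown in the thread).
WORD = re.compile(r"[A-Za-z']+")


def fix_word(match):
    """Replace misread capital I with lowercase l inside a single word."""
    word = match.group(0)
    # Leave all-caps words alone (e.g. "I", "III", shouted names).
    if word.replace("'", "").isupper():
        return word
    chars = list(word)
    # Start at index 1 so a leading capital I ("Inuyasha") is never touched.
    for i in range(1, len(chars)):
        if chars[i] == "I" and chars[i - 1] != "'":
            chars[i] = "l"
    return "".join(chars)


def fix_ell_eye(text):
    """Apply the ell/eye correction to every word in a block of text."""
    return WORD.sub(fix_word, text)


if __name__ == "__main__":
    print(fix_ell_eye("I wiIl teIl KIRARA"))  # I will tell KIRARA
```

One caveat with this sketch: names with a genuine mid-word capital I, such as "McIntyre", would be altered too, so it's worth skimming a diff of the changes before trusting it on a whole series.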
(Sent by email in reply to Eliza's comment of Mon, Jun 19, 2023, 18:45.)
This may not be an issue with vobsubocr, but I am getting a ton of issues with the output of this/tesseract.
For example, I am ripping Inuyasha, and tons of words that contain a lowercase l (ell) are coming out with a capital I (eye) in its place.
I was using subp2tiff previously, but after reinstalling Linux I realized it's a mess trying to get that installed again, and this program simplified everything for me, so thank you.
The problem now, though, is that I have roughly 950 episodes I need processed, and with this inaccurate tesseract model (even using the "best" one from GitHub) I'm getting very obvious errors.
Are there any plans to implement spell checking, or do you know of any other way I can get around this issue? I don't remember this being so prevalent when I was using subp2tiff, and I think it might be because I was resizing all the images to 200% with the Lanczos filter in ImageMagick. I would just continue doing that, but the --dump process doesn't include a way to put it all back together after the images have been externally processed.
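For reference, the resize step described above could be scripted roughly like this. This is a sketch only: it assumes ImageMagick's `convert` binary is on the PATH (newer installs may only ship `magick`), the helper names and file names are hypothetical, and it does nothing about feeding the resized images back into vobsubocr.

```python
import shutil
import subprocess


def build_upscale_command(src, dst, percent=200):
    """Build an ImageMagick command that resizes src with the Lanczos filter."""
    return ["convert", src, "-filter", "Lanczos", "-resize", f"{percent}%", dst]


def upscale(src, dst, percent=200):
    """Run the resize; raises if ImageMagick's convert is not installed."""
    if shutil.which("convert") is None:
        raise RuntimeError("ImageMagick 'convert' was not found on PATH")
    subprocess.run(build_upscale_command(src, dst, percent), check=True)


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the command shape.
    print(" ".join(build_upscale_command("sub_0001.png", "sub_0001_2x.png")))
```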
I tried changing the DPI of the output but it seems to do nothing with --dump, so I don't know if it's doing anything during processing. I tried -d 300 (tesseract recommendation) and -d 600. -d 600 was still giving images with a height of ~52px.
I am running Debian testing with the latest cargo/rustc and libtesseract-dev version 5.3.0-2.
This is the sub+idx and processed srt file for the 4th episode of Inuyasha as an example:
sub1.zip