DPI not working? And possibility for spell checking, improved tesseract model, or alternative? #16

Open
knuxyl opened this issue Jun 14, 2023 · 3 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

knuxyl commented Jun 14, 2023

This may not be an issue with vobsubocr, but I am getting a lot of errors in the output from vobsubocr/tesseract.
For example, I am ripping Inuyasha, and many words that contain a lowercase L (ell) come out with a capital I (eye) instead.

I was using subp2tiff previously, but after reinstalling Linux I realized it's a mess trying to get that installed again, and this program simplified everything for me, so thank you.

The problem now, though, is that I have roughly 950 episodes to process, and with this inaccurate tesseract model (even using the "best" one from GitHub) I'm getting very obvious errors.

Are there any plans to implement spell checking, or do you know of any other way I can get around this issue? I don't remember this being so prevalent when I was using subp2tiff, and I think it might be because I was resizing all the images to 200% with the Lanczos filter in ImageMagick. I would just continue doing that, but the --dump process doesn't include a way to put everything back together after the images have been processed externally.

I tried changing the DPI of the output but it seems to do nothing with --dump, so I don't know if it's doing anything during processing. I tried -d 300 (tesseract recommendation) and -d 600. -d 600 was still giving images with a height of ~52px.

I am running Debian testing with the latest cargo/rustc and libtesseract-dev version 5.3.0-2.

Here are the sub+idx and the processed srt file for the 4th episode of Inuyasha as an example:
sub1.zip

Author

knuxyl commented Jun 15, 2023

After compiling all the inaccuracies, it looks like 99% of them are this ell/eye mismatch, so I think a sed regex would work perfectly fine.

I'm not closing this yet though because I'm not sure if the DPI option is working correctly.
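
For anyone who lands here with the same problem, the ell/eye cleanup mentioned above could be sketched roughly like this in Python (the same pattern should carry over to sed). This is only a heuristic, not something vobsubocr does: it assumes that a capital "I" directly after a lowercase letter is really a misread "l". The function name `fix_ell_eye` is made up for illustration.

```python
import re

# Assumption: in subtitle text, a capital "I" that directly follows a
# lowercase letter is almost always a misread lowercase "l".
# Runs such as "II" in "wiII" are matched as a whole.
_MISREAD_L = re.compile(r"(?<=[a-z])I+")


def fix_ell_eye(text: str) -> str:
    """Replace each capital I that follows a lowercase letter with l."""
    return _MISREAD_L.sub(lambda m: "l" * len(m.group()), text)


if __name__ == "__main__":
    print(fix_ell_eye("I wiII heIp Inuyasha"))
```

It deliberately leaves a word-initial "I" alone (so "Inuyasha", "It", and "I'm" survive), which means a misread like "Iike" is not caught, and names with a genuine mid-word capital I (e.g. "McIntyre") would be mangled. Reviewing a diff of the changes before overwriting 950 files is probably wise.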

@elizagamedev elizagamedev added enhancement New feature or request good first issue Good for newcomers labels Jun 19, 2023
@elizagamedev
Owner

In retrospect I definitely overstated in the readme how accurate vobsubocr is, especially with the default settings. It is kind of surprising to me that tesseract itself doesn't seem to be able to differentiate I and L in the context of words; I would have expected that viewing words as a whole would be how most OCR works nowadays.

AFAICT the dpi option is a somewhat magic option that Tesseract uses in some vaguely defined way, but doesn't otherwise affect the data that vobsubocr feeds it. I hadn't considered that scaling the subtitles up like that would be effective, but it makes sense, given that it can average the colors of the outlines of the subtitles which would make more "smooth" text. There are probably some pretty good AI models too that can help with that nowadays. It'll require some significant re-engineering, though, unfortunately, and I'm not sure I have the time in the near future to tackle it. I'll leave the issue open specifically to track such a feature.

Also, re: automatic spell-checking, besides sed, I wonder if hunspell or something could automatically process all the files after the fact? Though I imagine it would present a lot of trouble with something like Inuyasha which would have a lot of Japanese names.
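
To illustrate the concern about names, here is a minimal stand-in for a dictionary check. It does not use hunspell; it is a plain word-set lookup, and `unknown_words`, the tiny dictionary, and the allowlist are all hypothetical. The point is only that character names would have to go on an allowlist, otherwise they get flagged alongside the real OCR errors.

```python
import re

# Words are runs of ASCII letters and apostrophes; a real subtitle pass
# would likely need a broader pattern.
_WORD = re.compile(r"[A-Za-z']+")


def unknown_words(text, dictionary, allowlist=frozenset()):
    """Return words not found in the dictionary or allowlist.

    Matching is case-insensitive; results are lowercased, deduplicated,
    and kept in order of first appearance.
    """
    known = {w.lower() for w in dictionary} | {w.lower() for w in allowlist}
    flagged = []
    for raw in _WORD.findall(text):
        word = raw.strip("'").lower()
        if word and word not in known and word not in flagged:
            flagged.append(word)
    return flagged


if __name__ == "__main__":
    words = {"the", "sacred", "jewel", "is", "gone"}
    line = "Inuyasha, the sacred jeweI is gone!"
    print(unknown_words(line, words))
    print(unknown_words(line, words, allowlist={"Inuyasha"}))
```

This only flags candidates. Choosing a replacement automatically is the hard part, and that is where a real spell checker's suggestions would come in.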

Author

knuxyl commented Jun 20, 2023 via email
