DPI not working? And possibility for spell checking, improved tesseract model, or alternative? #16

knuxyl · 2023-06-14T22:53:42Z

This may not be an issue with vobsubocr, but I am getting a ton of issues with the output of this/tesseract.
For example, I am ripping Inuyasha, and tons of words that contain a lowercase L (ell) are coming out with a capital I (eye) instead.

I was using subp2tiff previously, but after reinstalling Linux I realized it's a mess trying to get that installed again. This program simplified everything for me, so thank you.

The problem now, though, is that I have roughly 950 episodes I need processed, and with this inaccurate tesseract model (even using the "best" from GitHub) I'm getting very obvious errors.

Are there any plans to implement spell checking, or do you know of any other way I can get around this issue? I don't remember this being so prevalent when I was using subp2tiff, and I think that might be because I was resizing all the images to 200% with ImageMagick's Lanczos filter. I would just continue doing that, but the --dump process doesn't include a way to put everything back together after the images have been processed externally.
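For reference, the 200% Lanczos upscale described above can be sketched in Python. This is only an illustration: it swaps in the Pillow library for ImageMagick, the helper name `upscale` is made up here, and nothing in it is part of vobsubocr. As noted above, there is still no way to feed the resized images back into the tool.

```python
from PIL import Image  # third-party: pip install Pillow


def upscale(img, factor=2):
    """Return a copy of img enlarged by an integer factor using Lanczos resampling."""
    width, height = img.size
    # factor=2 corresponds to the 200% resize mentioned in the post.
    return img.resize((width * factor, height * factor), Image.LANCZOS)
```

A dumped subtitle image could then be processed with something like `upscale(Image.open("frame.png")).save("frame_2x.png")`, where the file names are placeholders.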

I tried changing the DPI of the output but it seems to do nothing with --dump, so I don't know if it's doing anything during processing. I tried -d 300 (tesseract recommendation) and -d 600. -d 600 was still giving images with a height of ~52px.

I am running Debian testing and latest cargo/rustc and libtesseract-dev version 5.3.0-2

This is the sub+idx and processed srt file for 4th episode of Inuyasha as an example
sub1.zip

knuxyl · 2023-06-15T01:58:28Z

After compiling all the inaccuracies, it looks like it's 99% this ell and eye mismatch, so I think a sed regex would work perfectly fine.

I'm not closing this yet though because I'm not sure if the DPI option is working correctly.

elizagamedev · 2023-06-19T23:45:35Z

In retrospect I definitely overstated in the readme how accurate vobsubocr is, especially with the default settings. It is kind of surprising to me how tesseract itself doesn't seem to be able to differentiate I and L in the context of words; I would have expected that viewing words as a whole would be how most OCR worked nowadays.

AFAICT the dpi option is a somewhat magic option that Tesseract uses in some vaguely defined way, but it doesn't otherwise affect the data that vobsubocr feeds it. I hadn't considered that scaling the subtitles up like that would be effective, but it makes sense, given that it can average the colors of the outlines of the subtitles, which would produce smoother text. There are probably some pretty good AI models too that can help with that nowadays. It'll require some significant re-engineering, though, unfortunately, and I'm not sure I have the time in the near future to tackle it. I'll leave the issue open specifically to track such a feature.

Also, re: automatic spell-checking, besides sed, I wonder if hunspell or something could automatically process all the files after the fact? Though I imagine it would present a lot of trouble with something like Inuyasha which would have a lot of Japanese names.

knuxyl · 2023-06-20T02:43:46Z

Yeah, I was trying with aspell, but it's just nonstop with anime, not worth it. I found that using a sed regex to check for capital eyes in the middle of words that aren't capitalized, not right after an apostrophe, and not the first letter, works perfectly. It shouldn't have any false positives. The eye and ell confusion seems to be the only major problem I had, so I'm good with the results. Thank you for making this; it saved me so much trouble getting subp2tiff on my system.
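The exact sed expression isn't given in the thread, so the following is only a rough Python sketch of the same rule, not the regex that was actually used. The names `fix_word` and `fix_line` are made up for illustration. It treats each run of letters as a word, leaves all-capital words and the first letter of every word alone, and turns any other capital I into a lowercase l. Because an apostrophe ends a run of letters, a capital I right after an apostrophe counts as a first letter and is not touched.

```python
import re

# A "word" is a run of ASCII letters; apostrophes and digits split words.
WORD = re.compile(r"[A-Za-z]+")


def fix_word(match):
    word = match.group(0)
    # Leave fully capitalized words (and a lone "I") untouched.
    if word.isupper():
        return word
    # Keep the first letter, replace every later capital I with a lowercase l.
    return word[0] + word[1:].replace("I", "l")


def fix_line(text):
    """Apply the capital-I to lowercase-l fix to every word in a line of text."""
    return WORD.sub(fix_word, text)
```

For example, `fix_line("I wiII heIp InuYasha")` gives `"I will help InuYasha"`. Two limitations of this sketch: mixed-case names with a genuine capital I in the middle (such as "McIntyre") would be altered, and all-capital runs are skipped, so a misread like "I'II" is left as-is.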


elizagamedev added the "enhancement" and "good first issue" labels · Jun 19, 2023
