-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extract OCR functions and add unit tests for them. #23
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the best PR I have seen in a while. Thanks a lot for this <3
expected_user = "@NASA" | ||
expected_text = "For 70 years, planes loudly flew supersonic & barriers were broken. Now we're making " \ | ||
"history again in a quiet way: go.nasa.gov/2kOO1cc" | ||
self.common_test_ocr_tweet("res/test_ocr_2.png", expected_user, expected_text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like pngs were ignored by .gitignore. Can you manually add them? :)
fakemenot/ocr.py
Outdated
from pytesseract import image_to_string | ||
|
||
|
||
def find_user_and_text_in_tweet_image(image_path: str) -> (str, str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't work in Python2 :(
fakemenot/ocr.py
Outdated
return extract_values_from_desktop_tweet(words) | ||
|
||
|
||
def extract_values_from_desktop_tweet(words: List[str]) -> (str, str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto
fakemenot/ocr.py
Outdated
return user, body | ||
|
||
|
||
def prepare_image_for_ocr(image_path: str) -> Optional[Image.Image]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python2 compatibility is broken
_init_ file was emptied and it's contents moved into main.py. That was needed for the new unit test structure.
Ok fixed the issues. :) Btw. if you are really worried about compatibility... My IDE is showing me some warning about the "import configparser" in the main.py. But I don't really know how to avoid that now. |
removed_elements = 0 | ||
ltweet, orig_len = tweet[0].split(' '), len(tweet[0].split(' ')) | ||
# Compare each element of body to element in body. TODO: Optimize | ||
for ele in body: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Body is being passed as a string here. So, every iteration is going to produce a single character. It's best to body.split(' ')
before this :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didnt touch that part. Just literally ripped out the OCR bits, e.g. the parts that set the variables potential_user and body.
That whole analysis part can probably be replaced by the SequenceMatcher.ratio() function that I also used in the unit tests. Seems to do the same thing?
expected_user = "@NASA" | ||
expected_text = "For 70 years, planes loudly flew supersonic & barriers were broken. Now we're making " \ | ||
"history again in a quiet way: go.nasa.gov/2kOO1cc" | ||
self.common_test_ocr_tweet("res/test_ocr_2.png", expected_user, expected_text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one seems to be broken for me. After OCR, I get this :
NASAO
@ @NASA m V
For 70 years, planes loudly flew supersonic 8L
barriers were broken. Now we're making
history again in a quiet way:
go.nasa.gov/2k001cc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm.. There are probably differences between the linux and windows ocr. But this is a real problem, which needs to be fixed...
It seems to sometimes recognize the "official account" badge as an @, which then messes up the user handle extraction.
I'll try to fix that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok changed it. Try it now :)
Potental handle has to be at least 2 chars long. Also the image was made a bit bigger for more stable OCR results.
Hmm, OCR seems to be behaving differently on platforms. Image 3 is broken on Linux now because it's blown up a bit too much and detects handle incorrectly. I'm starting to wonder why. There should be a cleaner solution :/ |
I don't know how to fix that quickly. The unittests themselves are correct, they check the right thing. I could revert the ocr extraction code to the original (where it checks for the "v" to select the tweet body), so it's sure that I didn't break anything. And then just let the test fail (justified) and you could open a seperate issue for that? |
init file was emptied and it's contents moved into main.py. That was needed for the new unit test structure.
A step towards solving #5 .