Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract OCR functions and add unit tests for them. #23

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

Double-A-92
Copy link

init file was emptied and it's contents moved into main.py. That was needed for the new unit test structure.
A step towards solving #5 .

Copy link
Owner

@pwnfoo pwnfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the best PR I have seen in a while. Thanks a lot for this <3

expected_user = "@NASA"
expected_text = "For 70 years, planes loudly flew supersonic & barriers were broken. Now we're making " \
"history again in a quiet way: go.nasa.gov/2kOO1cc"
self.common_test_ocr_tweet("res/test_ocr_2.png", expected_user, expected_text)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like pngs were ignored by .gitignore. Can you manually add them? :)

fakemenot/ocr.py Outdated
from pytesseract import image_to_string


def find_user_and_text_in_tweet_image(image_path: str) -> (str, str):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work in Python2 :(

fakemenot/ocr.py Outdated
return extract_values_from_desktop_tweet(words)


def extract_values_from_desktop_tweet(words: List[str]) -> (str, str):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

fakemenot/ocr.py Outdated
return user, body


def prepare_image_for_ocr(image_path: str) -> Optional[Image.Image]:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python2 compatibility is broken

_init_ file was emptied and it's contents moved into main.py. That was needed for the new unit test structure.
@Double-A-92
Copy link
Author

Ok fixed the issues. :)

Btw. if you are really worried about compatibility... My IDE is showing me some warning about the "import configparser" in the main.py. But I don't really know how to avoid that now.

removed_elements = 0
ltweet, orig_len = tweet[0].split(' '), len(tweet[0].split(' '))
# Compare each element of body to element in body. TODO: Optimize
for ele in body:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Body is being passed as a string here. So, every iteration is going to produce a single character. It's best to body.split(' ') before this :)

Copy link
Author

@Double-A-92 Double-A-92 Oct 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didnt touch that part. Just literally ripped out the OCR bits, e.g. the parts that set the variables potential_user and body.

That whole analysis part can probably be replaced by the SequenceMatcher.ratio() function that I also used in the unit tests. Seems to do the same thing?

expected_user = "@NASA"
expected_text = "For 70 years, planes loudly flew supersonic & barriers were broken. Now we're making " \
"history again in a quiet way: go.nasa.gov/2kOO1cc"
self.common_test_ocr_tweet("res/test_ocr_2.png", expected_user, expected_text)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one seems to be broken for me. After OCR, I get this :

NASAO
@ @NASA m V
For 70 years, planes loudly flew supersonic 8L
barriers were broken. Now we're making
history again in a quiet way:
go.nasa.gov/2k001cc

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm.. There are probably differences between the linux and windows ocr. But this is a real problem, which needs to be fixed...

It seems to sometimes recognize the "official account" badge as an @, which then messes up the user handle extraction.

I'll try to fix that.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok changed it. Try it now :)

Potental handle has to be at least 2 chars long. Also the image was made a bit bigger for more stable OCR results.
@pwnfoo
Copy link
Owner

pwnfoo commented Oct 18, 2017

Hmm, OCR seems to be behaving differently on platforms. Image 3 is broken on Linux now because it's blown up a bit too much and detects handle incorrectly. I'm starting to wonder why. There should be a cleaner solution :/

@Double-A-92
Copy link
Author

Double-A-92 commented Oct 18, 2017

I don't know how to fix that quickly. The unittests themselves are correct, they check the right thing.

I could revert the ocr extraction code to the original (where it checks for the "v" to select the tweet body), so it's sure that I didn't break anything.

And then just let the test fail (justified) and you could open a seperate issue for that?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants