-
Notifications
You must be signed in to change notification settings - Fork 525
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Canonical url intergration batch 3 #905
Canonical url intergration batch 3 #905
Conversation
…om/jknndy/recipe-scrapers into canonical_url_intergration_batch_3
@jknndy ah, I think I see what you mean about the GitHub-suggested recommended commands. I think what to do instead is to ensure that you have an up-to-date copy of the Then you'll find the conflicts locally. A shortcut for two of them -- the HTML files, where we don't want to merge two files but instead simply take the one from this branch -- is to reset the file to the version stored in a specific branch: |
Thanks very much! got it & i think we're in good shape for now. Once #915 is merged ill run through this again and then we should be good to go. edit: not sure whats causing this failure, the files listed in the linter error weren't changed in this PR. & the coverage failures are related to #909 generate.py:211:96: E231 missing whitespace after ','
generate.py:222:85: E231 missing whitespace after ','
recipe_scrapers/goustojson.py:20:20: E231 missing whitespace after ':'
recipe_scrapers/grandfrais.py:57:40: E231 missing whitespace after ':'
recipe_scrapers/kptncook.py:32:26: E231 missing whitespace after ':'
recipe_scrapers/kuchniadomowa.py:15:23: E231 missing whitespace after ':'
recipe_scrapers/mindmegette.py:26:41: E231 missing whitespace after ':'
recipe_scrapers/nihhealthyeating.py:62:37: E231 missing whitespace after ':'
recipe_scrapers/weightwatchers.py:118:55: E702 multiple statements on one line (semicolon)
recipe_scrapers/woolworths.py:14:27: E231 missing whitespace after ':' |
That is strange indeed about the lint failures. The list of files there looks similar to the differences between the |
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as off-topic.
This comment was marked as off-topic.
@jknndy it seems that the If you merge the latest changes from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great - thank you for your patience with this one @jknndy! Almost there with the canonical URL batches :)
"1 peça de filé mignon para rosbife (cerca de 750 g)", | ||
"1 colher (chá) de mostarda amarela em pó", | ||
"1 colher (chá) de páprica defumada", | ||
"750 g de filé mignon em peça para rosbife", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arg. I've noticed while doing a final readthrough that there's a problem with the text decoding here. I'm taking a bit of a look at that now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
charchef
does a good job of fixing these malencoded text entries, but is fairly slow the first time it runs:
>>> from charchef import aa_convert_utf8_to_ascii_,aa_repair_bad_conversion_to_utf8,aa_replace_non_printable_chars
>>> text = '1 colher (chá) de páprica defumada'
>>> aa_repair_bad_conversion_to_utf8(text)
['1 colher (chá) de páprica defumada']
(that took 10 seconds or so, maybe slightly longer)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LatinFixer
by the same author can achieve the same result here, and fast!
>>> from LatinFixer import LatinFix
>>> text = '1 colher (chá) de páprica defumada'
>>> LatinFix(text).apply_wrong_chars()
1 colher (chá) de páprica defumada
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've pushed a possible change to do this; it isn't perfect yet, though. Some ingredient lines begin with fractional amounts -- ⅔
, for example -- and those aren't being decoded clearly yet.
@jknndy do you remember what browser you downloaded the testhtml
files with? There's a possibility that something in the encoding headers/negotiation that it used at the time has resulted in unusual results in the downloaded HTML. When I view the original recipe page live in a web browser, I find that it's more readable than when I view the corresponding .testhtml
file locally in the same browser.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've resolved this problem by re-downloading a copy of the HTML; the text encoding received from the site and/or the encoding used to write the file seem to have resolved the problem. So despite that LatinFixer
adventure (it's useful to know about that library), it isn't added as a dependency/string handler here at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I grabbed all the updated test html using the header defined in _abstract so it's strange that would have caused an issue. Once I'm down with the last batch I have a few to revisit that I skipped over, I'll add this one to this list & try some other recipes to see what happens.
Batch 3, scrapers M - R, few reworks but nothing major in this one!
Non-scraper changes