Canonical url intergration batch 3 #905

jknndy · 2023-10-15T21:01:08Z

Batch 3, scrapers M - R. A few reworks, but nothing major in this one!

Non-scraper changes

Added \u00C2 (Â) to normalize_string util
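
For illustration, a minimal sketch of what a normalization step like this could look like. The name `normalize_text` and its exact rules are assumptions made for this example; it is not the project's actual `normalize_string` implementation. A stray `Â` usually appears when a UTF-8 non-breaking space (bytes C2 A0) has been decoded as Latin-1, so the sketch drops it and treats the non-breaking space as an ordinary space.

```python
import html
import re


def normalize_text(value: str) -> str:
    """Hypothetical helper: tidy up text scraped from a recipe page."""
    # Convert HTML character references such as &amp; or &#62; to characters.
    unescaped = html.unescape(value)
    # Drop stray U+00C2 and turn non-breaking spaces into regular spaces.
    cleaned = unescaped.replace("\u00c2", "").replace("\u00a0", " ")
    # Collapse any run of whitespace into a single space and trim the ends.
    return re.sub(r"\s+", " ", cleaned).strip()


print(normalize_text("1\u00c2\u00a0cup  sugar &amp; spice\n"))
```

A blanket rule like this would also strip a legitimate `Â` from text in languages that use it, so applying it only in the scrapers that need it may be the safer choice.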

recipe_scrapers/_utils.py

recipe_scrapers/onehundredonecookbooks.py

recipe_scrapers/panelinha.py

recipe_scrapers/primaledgehealth.py

recipe_scrapers/rezeptwelt.py

recipe_scrapers/maangchi.py

recipe_scrapers/momswithcrockpots.py

tests/test_mykitchen101.py

jayaddison · 2023-10-26T10:06:52Z

@jknndy ah, I think I see what you mean about the commands that GitHub recommends.

I think what to do instead is to ensure that you have an up-to-date copy of the main branch (git checkout main; git pull origin main) and then merge that into this branch (git checkout canonical_url_intergration_batch_3; git merge main).

Then you'll find the conflicts locally. A shortcut for two of them -- the HTML files, where we don't want to merge two files but instead simply take the one from this branch -- is to reset the file to the version stored in a specific branch: git reset canonical_url_intergration_batch_3 -- tests/test_data/maangchi.testhtml. The conflicts in tests/test_matprat_1.py will require manual resolution, though.

jknndy · 2023-10-26T20:23:02Z

Thanks very much! Got it, and I think we're in good shape for now. Once #915 is merged I'll run through this again and then we should be good to go.

Edit: not sure what's causing this failure; the files listed in the linter error weren't changed in this PR, and the coverage failures are related to #909.

generate.py:211:96: E231 missing whitespace after ','
generate.py:222:85: E231 missing whitespace after ','
recipe_scrapers/goustojson.py:20:20: E231 missing whitespace after ':'
recipe_scrapers/grandfrais.py:57:40: E231 missing whitespace after ':'
recipe_scrapers/kptncook.py:32:26: E231 missing whitespace after ':'
recipe_scrapers/kuchniadomowa.py:15:23: E231 missing whitespace after ':'
recipe_scrapers/mindmegette.py:26:41: E231 missing whitespace after ':'
recipe_scrapers/nihhealthyeating.py:62:37: E231 missing whitespace after ':'
recipe_scrapers/weightwatchers.py:118:55: E702 multiple statements on one line (semicolon)
recipe_scrapers/woolworths.py:14:27: E231 missing whitespace after ':'

jayaddison · 2023-10-26T21:34:41Z

That is strange indeed about the lint failures. The list of files there looks similar to the differences between the main and v15 branches - but nothing about this pull request should involve the v15 branch. Odd. I'll try to take more of a look soon.

jayaddison · 2023-10-26T22:26:10Z

@jknndy it seems that the coverage and lint workflows don't work yet with Python 3.12 - and that became the default for the setup-python action recently. I've pinned those workflows to use lower versions of Python short-term (#917, #918) until we can upgrade.

If you merge the latest changes from main into this branch, tests should pass again 🤞
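
One possible explanation for the unexpected E231 and E702 reports, offered as an assumption rather than something confirmed in this thread: Python 3.12 tokenizes an f-string into several tokens instead of one, so a style checker written before that change can mistake a `:`, `,` or `;` inside an f-string for ordinary punctuation. The sketch below uses only the standard-library `tokenize` module; `token_names` is a name chosen for this example.

```python
import io
import sys
import tokenize


def token_names(source: str) -> list[str]:
    """Return the token type names that Python's tokenizer yields for source."""
    reader = io.StringIO(source).readline
    return [tokenize.tok_name[tok.type] for tok in tokenize.generate_tokens(reader)]


# The same line tokenizes differently depending on the interpreter version.
names = token_names('label = f"{key}:{value}"\n')
print(sys.version_info[:2], names)
```

On Python 3.11 and earlier the f-string shows up as a single STRING token; on 3.12 and later it is split into FSTRING_START, FSTRING_MIDDLE and FSTRING_END, with ordinary operator and name tokens in between.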

jayaddison

Looks great - thank you for your patience with this one @jknndy! Almost there with the canonical URL batches :)

jayaddison · 2023-10-28T10:34:49Z

tests/test_panelinha_1.py

-                "1 peça de filé mignon para rosbife (cerca de 750 g)",
-                "1 colher (chá) de mostarda amarela em pó",
-                "1 colher (chá) de páprica defumada",
+                "750 g de filÃ© mignon em peÃ§a para rosbife",


Arg. I've noticed while doing a final readthrough that there's a problem with the text decoding here. I'm taking a bit of a look at that now.

charchef does a good job of fixing these malencoded text entries, but is fairly slow the first time it runs:

>>> from charchef import aa_convert_utf8_to_ascii_, aa_repair_bad_conversion_to_utf8, aa_replace_non_printable_chars
>>> text = '1 colher (chÃ¡) de pÃ¡prica defumada'
>>> aa_repair_bad_conversion_to_utf8(text)
['1 colher (chá) de páprica defumada']

(that took 10 seconds or so, maybe slightly longer)

LatinFixer by the same author can achieve the same result here, and fast!

>>> from LatinFixer import LatinFix
>>> text = '1 colher (chÃ¡) de pÃ¡prica defumada'
>>> LatinFix(text).apply_wrong_chars()
1 colher (chá) de páprica defumada

I've pushed a possible change to do this; it isn't perfect yet, though. Some ingredient lines begin with fractional amounts -- ⅔, for example -- and those aren't being decoded clearly yet.
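
For reference, the kind of corruption shown above can often be reversed with the standard library alone. The sketch below is not how charchef or LatinFixer work internally; it assumes the text was valid UTF-8 that got decoded with a single-byte codec (Windows-1252 or Latin-1), and `repair_mojibake` is a hypothetical name chosen for this example.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded with a single-byte codec."""
    for wrong_codec in ("cp1252", "latin-1"):
        try:
            # Recover the original bytes, then decode them as UTF-8.
            return text.encode(wrong_codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this codec cannot explain the text; try the next one
    return text  # nothing matched, so assume the text was fine already


# Build a broken string the same way the corruption is assumed to happen.
broken = "1 colher (chá) de páprica defumada".encode("utf-8").decode("cp1252")
print(broken)
print(repair_mojibake(broken))
```

Trying both codecs matters for characters such as ⅔: its UTF-8 bytes include values in the 0x80-0x9F range, which Windows-1252 and Latin-1 map to different characters. That difference may be related to the fractional amounts being harder to repair, though that is a guess.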

@jknndy do you remember what browser you downloaded the testhtml files with? There's a possibility that something in the encoding headers/negotiation that it used at the time has resulted in unusual results in the downloaded HTML. When I view the original recipe page live in a web browser, I find that it's more readable than when I view the corresponding .testhtml file locally in the same browser.

I've resolved this problem by re-downloading a copy of the HTML; the text encoding received from the site and/or the encoding used to write the file seems to have been right this time. So despite that LatinFixer adventure (it's useful to know about that library), it isn't added as a dependency/string handler here at the moment.
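
One way to rule out the file-writing step as a source of this kind of corruption is to store the downloaded bytes untouched and only decode them when reading. This is a sketch under the assumption that the page is served as UTF-8; `save_test_html` and `load_test_html` are hypothetical helpers, not part of the project.

```python
import tempfile
from pathlib import Path


def save_test_html(raw: bytes, destination: Path) -> None:
    """Write the response body byte-for-byte, with no re-encoding step."""
    destination.write_bytes(raw)


def load_test_html(source: Path, encoding: str = "utf-8") -> str:
    """Read a saved page back, stating the encoding explicitly."""
    return source.read_bytes().decode(encoding)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.testhtml"
    # Stand-in for the bytes of a downloaded page (assumed to be UTF-8).
    save_test_html("<li>filé mignon</li>".encode("utf-8"), path)
    print(load_test_html(path))
```

Keeping the file as raw bytes means the platform's default text encoding never gets a chance to alter the content.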

I grabbed all the updated test HTML using the header defined in _abstract, so it's strange that it would have caused an issue. Once I'm done with the last batch I have a few to revisit that I skipped over; I'll add this one to that list and try some other recipes to see what happens.

… fractional units

This reverts commit b5b7450.

…gin with fractional units" This reverts commit b8d74eb.

This reverts commit 506ae1e.

…er library" This reverts commit c3ae0d9.

Relates-to commit 74fc5e9.

…ee (hhursev#905)

jknndy added 16 commits October 11, 2023 17:40

Adds canonical_url for M - R

892fbe5

New test cases

1bd117c

Test case updates for M*

443cab9

marthastewart corrections

055c18c

Updates to maangchi

4d2a8a6

Finalizing M scrapers

106955f

Test case updates for N*

5da27a8

Onehundredonecookbooks rewrite

e2c36a8

Test case updates for P*

f9bc1d2

P*, primalledgehealth & normalize string

c7d7593

Update paninihappy.py

baf697c

Test case updates for R*

564b82b

Merge branch 'canonical_url_intergration_batch_3' of https://github.c…

5063272

…om/jknndy/recipe-scrapers into canonical_url_intergration_batch_3

rezeptwelt updates

2367f8b

Finalizing updates

57cdad7

Tox run

d836997

jknndy marked this pull request as ready for review October 16, 2023 02:31

jayaddison reviewed Oct 17, 2023

recipe_scrapers/_utils.py

moved \u00C2 to scrapers

e6f17a0